使用Pig对数据进行处理和分析

本文将针对一份数据提出多个数据处理的相关问题,并介绍如何使用Apache Pig来解决相关的问题,建议读者根据具体问题具体实践后再查看我分享的解题Pig代码。

数据

数据下载

本文用到的数据是1999年至2004年的The Daily Show嘉宾的历史数据。下载地址

数据描述

YEAR – The year the episode aired.

GoogleKnowlege_Occupation -Their occupation or office, according to Google’s Knowledge Graph. On the other hand, if they are not in there, how Stewart introduced them on the program.

Show – Air date of the episode. Not unique, as some shows had more than one guest

Group – A larger group designation for the occupation. For instance, U.S senators, U.S presidents, and former presidents are all under “politicians”

Raw_Guest_List – The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.

问题

问题1

在指定的时间范围内,根据GoogleKnowlege_Occupation的不同,找出最多的5种类型。

pig代码

1
2
3
4
5
6
7
8
9
10
11
12
A = load 'dialy_show_guests' using PigStorage(',') 
AS (year:chararray,occupation:chararray,date:chararray,
groupName:chararray,gusetlist:chararray);
C = foreach A generate occupation,ToDate(date,'MM/dd/yy') as date;
D = filter C by ((date> ToDate('1/11/99','MM/dd/yy'))
AND (date<ToDate('6/11/99','MM/dd/yy')));
E = group D by occupation;
F = foreach E generate group, COUNT(D) as cnt;
describe F;
G = order F by cnt desc;
H = limit G 5;
DUMP H;

输出结果

(actor,28)
(actress,20)
(comedian,4)
(television actress,3)
(singer,2)

问题2

根据Group,找出每年参加该show的politician的人数。

pig 代码

1
2
3
4
5
6
7
8
9
A = load 'dialy_show_guests' using PigStorage(',') 
AS (year:chararray,occupation:chararray,date:chararray,
groupName:chararray,gusetlist:chararray);
B = FILTER A by groupName=='Politician';
C = foreach B generate ToDate(date,'MM/dd/yy') as date;
D = FOREACH C GENERATE GetYear(date) as y;
E = GROUP D by y;
F = FOREACH E GENERATE group, COUNT(D) as cnt;
DUMP F;

在做这题的时候,没有发现数据已经提供了year数据域,于是我使用date数据域并提取了其中的年份,有些多此一举。

输出结果

(1999,2)
(2000,13)
(2001,3)
(2002,8)
(2003,14)
(2004,32)
(2005,22)
(2006,25)
(2007,21)
(2008,27)

问题3

在数据描述中,Group数据类型与GoogleKnowledge occupation数据类型,是包含关系,本题是要统计每个group中有多少种不同的GoogleKnowledge occupation。

pig 代码

1
2
3
4
5
6
7
8
A = load 'dialy_show_guests' using PigStorage(',') 
AS (year:chararray,occupation:chararray,date:chararray,
groupName:chararray,gusetlist:chararray);
B = FOREACH A GENERATE occupation, groupName;
C = GROUP B by groupName;
D = FOREACH C {o = DISTINCT B.occupation; GENERATE group, COUNT(o) as cnt;};
E = ORDER D BY cnt DESC ;
DUMP E;

输出结果

(Politician,105)
(Media,80)
(Athletics,29)
(Government,27)
(Musician,25)
(Academic,24)
(Misc,16)

问题4

找出各个Group与GoogleKnowledge occupation组合类后的参加show的人数。

pig 代码

1
2
3
4
5
6
7
A = load 'dialy_show_guests' using PigStorage(',') 
AS (year:chararray,occupation:chararray,date:chararray,
groupName:chararray,gusetlist:chararray);
B = GROUP A BY (groupName, occupation);
C = FOREACH B GENERATE group, COUNT(A) as cnt;
D = ORDER C BY cnt desc;
DUMP D;

输出结果

((Acting,actor),596)
((Acting,actress),271)
((Media,journalist),180)
((Media,author),102)
((Media,Journalist),72)
((Comedy,comedian),64)
((Politician,us senator),50)
((Media,Author),48)
((Media,television host),39)
((Comedy,Comedian),39)
((Media,writer),30)

Have a nice day!