Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop. Data Science Maryland, May 2013. Joey Echeverria | Director Federal FTS | [email protected] | @fwiffo. ©2013 Cloudera, Inc.

Transcript of Analyzing twitter data with hadoop

Page 1: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop
Data Science Maryland, May 2013
Joey Echeverria | Director Federal FTS | [email protected] | @fwiffo

Page 2: Analyzing twitter data with hadoop

About Joey

• Director Federal FTS
• Technical guy playing manager
• 2 years @ Cloudera
• 5 years of Hadoop
• Local

Page 3: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

BUILDING A BIG DATA SOLUTION

Page 4: Analyzing twitter data with hadoop

Big Data

• Big
  • Larger volume than you’ve handled before
  • No litmus test
  • High value, underutilized
• Data
  • Structured
  • Unstructured
  • Semi-structured
• Hadoop
  • Distributed file system
  • Distributed, batch computation

Page 5: Analyzing twitter data with hadoop

Data Management Systems

[Diagram: Data Source → Data Ingestion → Data Storage → Data Processing]

Page 6: Analyzing twitter data with hadoop

Relational Data Management Systems

[Diagram: Data Source → ETL → RDBMS → Reporting]

Page 7: Analyzing twitter data with hadoop

A Canonical Hadoop Architecture

[Diagram: Data Source → Flume → HDFS → Hive (Impala)]

Page 8: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

AN EXAMPLE USE CASE

Page 9: Analyzing twitter data with hadoop

Analyzing Twitter

• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Who is influential?
  • Tweets
  • Followers
  • Retweets
    • Similar to e-mail forwarding
• Which Twitter user gets the most retweets?
• Who is influential in our industry?

Page 10: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

HOW DO WE ANSWER THESE QUESTIONS?

Page 11: Analyzing twitter data with hadoop

Techniques

• SQL
  • Filtering
  • Aggregation
  • Sorting
• Complex data
  • Deeply nested
  • Variable schema

Page 12: Analyzing twitter data with hadoop

Architecture

[Diagram: Twitter → Flume → HDFS → Hive and Impala, with Oozie alongside. Labels: Custom Flume Source; Sink to HDFS; JSON SerDe Parses Data; Add Partitions Hourly (Oozie); Queries and ETL (Hive); Queries (Impala)]

Page 13: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

TWITTER SOURCE

Page 14: Analyzing twitter data with hadoop

Flume

• Streaming data flow
• Sources
  • Push or pull
• Sinks
• Event based

Page 15: Analyzing twitter data with hadoop

Pulling Data From Twitter

• Custom source, using twitter4j
• Sources process data as discrete events

Page 16: Analyzing twitter data with hadoop

Loading Data Into HDFS

• HDFS Sink comes stock with Flume
• Easily separate files by creation time
  • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
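
To make the hourly layout concrete, here is a minimal Java sketch of how a path template ending in %Y/%m/%d/%H could map an event timestamp to a directory. It is not Flume's implementation: the class name TweetBucketPath and the method bucketFor are hypothetical, and the sketch assumes UTC, whereas the time zone a real sink uses depends on its configuration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class TweetBucketPath {
    // Equivalent of the %Y/%m/%d/%H escapes: year/month/day/hour.
    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Hypothetical helper: appends the hourly bucket for a timestamp
    // (milliseconds since the epoch) to a base directory. Assumes UTC.
    static String bucketFor(String baseDir, long timestampMillis) {
        ZonedDateTime t =
            Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return baseDir + HOURLY.format(t) + "/";
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:8020/user/flume/tweets/";
        long ts = Instant.parse("2013-05-01T12:34:56Z").toEpochMilli();
        // Prints hdfs://hadoop1:8020/user/flume/tweets/2013/05/01/12/
        System.out.println(bucketFor(base, ts));
    }
}
```

Every tweet created in the same hour lands in the same directory, so each directory can later be treated as one hourly unit of data.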

Page 17: Analyzing twitter data with hadoop

Flume Source

public class TwitterSource extends AbstractSource
    implements EventDrivenSource, Configurable {
  ...
  // The initialization method for the Source. The context contains all
  // the Flume configuration info
  @Override
  public void configure(Context context) {
    ...
  }
  ...
  // Start processing events. Uses the Twitter Streaming API to sample
  // Twitter, and process tweets.
  @Override
  public void start() {
    ...
  }
  ...
  // Stops Source's event processing and shuts down the Twitter stream.
  @Override
  public void stop() {
    ...
  }
}

Page 18: Analyzing twitter data with hadoop

Twitter API

• Callback mechanism for catching new tweets

/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
    new ConfigurationBuilder().setJSONStoreEnabled(true).build())
    .getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
  // The onStatus method is executed every time a new tweet comes in.
  public void onStatus(Status status) {
    ...
  }
};
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);

Page 19: Analyzing twitter data with hadoop

JSON Data

• JSON data is processed as an event and written to HDFS

public void onStatus(Status status) {
  // The EventBuilder is used to build an event using the headers and
  // the raw JSON of a tweet
  headers.put("timestamp", String.valueOf(
      status.getCreatedAt().getTime()));
  Event event = EventBuilder.withBody(
      DataObjectFactory.getRawJSON(status).getBytes(), headers);

  channel.processEvent(event);
}

Page 20: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

FLUME DEMO

Page 21: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

HIVE

Page 22: Analyzing twitter data with hadoop

What is Hive?

• Created at Facebook
• HiveQL
  • SQL-like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client

Page 23: Analyzing twitter data with hadoop

Hive Details

• Schema on read
• Scalar types (int, float, double, boolean, string)
• Complex types (struct, map, array)
• Metastore contains table definitions
  • Stored in a relational database
  • Similar to catalog tables in other DBs

Page 24: Analyzing twitter data with hadoop

Complex Data

SELECT
  t.retweet_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweet_screen_name,
        retweeted_status.text,
        max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
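
The two levels of grouping in this query can be easier to follow on a small in-memory example. The Java sketch below mirrors the logic as I read it: the inner step keeps the largest retweet count seen for each (screen name, text) pair, since the same retweeted tweet can appear many times in the stream, and the outer step sums those maxima per user. The class RetweetRanking, the Observation type, and the sample data are hypothetical stand-ins, not part of the original code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetweetRanking {
    // Hypothetical stand-in for the retweeted_status fields of one row.
    static final class Observation {
        final String screenName;
        final String text;
        final int retweetCount;

        Observation(String screenName, String text, int retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    // Returns (screen name, total retweets) pairs, highest total first.
    static List<Map.Entry<String, Integer>> topRetweeted(
            List<Observation> rows, int limit) {
        // Inner query: max(retweet_count) per (screen_name, text).
        Map<List<String>, Integer> maxPerTweet = new HashMap<>();
        for (Observation o : rows) {
            maxPerTweet.merge(
                List.of(o.screenName, o.text), o.retweetCount, Integer::max);
        }
        // Outer query: sum the per-tweet maxima for each screen name.
        Map<String, Integer> totalPerUser = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : maxPerTweet.entrySet()) {
            totalPerUser.merge(e.getKey().get(0), e.getValue(), Integer::sum);
        }
        // ORDER BY total_retweets DESC LIMIT n.
        List<Map.Entry<String, Integer>> ranked =
            new ArrayList<>(totalPerUser.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }

    public static void main(String[] args) {
        List<Observation> rows = List.of(
            new Observation("alice", "first tweet", 5),
            new Observation("alice", "first tweet", 9),  // same tweet, seen later
            new Observation("alice", "second tweet", 2),
            new Observation("bob", "only tweet", 10));
        for (Map.Entry<String, Integer> e : topRetweeted(rows, 10)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Summing the raw counts instead of the per-tweet maxima would count the first tweet twice (5 + 9), which is the double counting the inner GROUP BY avoids.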

Page 25: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

JSON INTERLUDE

Page 26: Analyzing twitter data with hadoop

What is JSON?

• Complex, semi-structured data
• Based on JavaScript’s data syntax
• Rich, nested data types:
  • number
  • string
  • array
  • object
  • true, false
  • null

Page 27: Analyzing twitter data with hadoop

What is JSON?

{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {
          "text": "Crowdsourcing",
          "indices": [0, 14]
        },
        {
          "text": "bigdata",
          "indices": [129, 137]
        }
      ],
      "user_mentions": []
    }
  }
}

Page 28: Analyzing twitter data with hadoop

Hive Serializers and Deserializers

• Instruct Hive on how to interpret data
• JSONSerDe

Page 29: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

HIVE DEMO

Page 30: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

IT’S A TRAP!

Photo from http://www.flickr.com/photos/vanf/6798065626/ Some rights reserved

Page 31: Analyzing twitter data with hadoop

Not a Database

                      RDBMS                     Hive
Language              Generally >= SQL-92       Subset of SQL-92 plus Hive-specific extensions
Update capabilities   INSERT, UPDATE, DELETE    INSERT OVERWRITE; no UPDATE, DELETE
Transactions          Yes                       No
Latency               Sub-second                Minutes
Indexes               Yes                       Yes
Data size             Terabytes                 Petabytes

Page 32: Analyzing twitter data with hadoop

ETL

• Hive works great for SQL-based ETL
• JSON -> SequenceFiles

Page 33: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

IMPALA

Page 34: Analyzing twitter data with hadoop

Cloudera Impala

Real-Time Query for Data Stored in Hadoop.

Supports Hive SQL

4-30X faster than Hive over MapReduce

Uses existing drivers, integrates with existing metastore, works with leading BI tools

Flexible, cost-effective, no lock-in

Deploy & operate with Cloudera Enterprise RTQ

Supports multiple storage engines & file formats

Page 35: Analyzing twitter data with hadoop

Benefits of Cloudera Impala

Real-Time Query for Data Stored in Hadoop

• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos

• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas

• All data available for interactive queries
• No loss of fidelity from fixed data schemas

• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Page 36: Analyzing twitter data with hadoop

Cloudera Impala Details

[Diagram: a SQL App connects over ODBC. Each of three nodes runs a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DN and HBase. Shared services: State Store, HDFS NN, Hive Metastore, YARN.]

Diagram callouts:

• Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP Distributed
• Local Direct Reads
• Low-latency scheduler and cache (low-impact failures)

Page 37: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

IMPALA DEMO

Page 38: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

OOZIE AUTOMATION

Page 39: Analyzing twitter data with hadoop

Oozie: Everything in its Right Place

Page 40: Analyzing twitter data with hadoop

Oozie for Partition Management

• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
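
As an illustration of what an hourly job might submit, the Java sketch below builds a Hive ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement for a given hour. The table name tweets comes from the query shown earlier in the deck; the partition column datehour, the helper addPartitionStatement, and the directory layout are assumptions made for this example, not details confirmed by the slides.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class HourlyPartition {
    // Assumed partition key format, e.g. 2013050112.
    private static final DateTimeFormatter KEY =
        DateTimeFormatter.ofPattern("yyyyMMddHH");
    // Assumed directory layout: one directory per hour.
    private static final DateTimeFormatter DIR =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Hypothetical helper: builds the Hive DDL that registers one hour
    // of data as a partition. "datehour" is an assumed column name.
    static String addPartitionStatement(
            String table, String baseDir, LocalDateTime hour) {
        return String.format(
            "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (datehour = %s) "
                + "LOCATION '%s%s';",
            table, KEY.format(hour), baseDir, DIR.format(hour));
    }

    public static void main(String[] args) {
        LocalDateTime hour = LocalDateTime.of(2013, 5, 1, 12, 0);
        System.out.println(
            addPartitionStatement("tweets", "/user/flume/tweets/", hour));
    }
}
```

IF NOT EXISTS keeps the statement safe to rerun, which matters for a scheduled job that may be retried for the same hour.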

Page 41: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

OOZIE DEMO

Page 42: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

PUTTING IT ALL TOGETHER

Page 43: Analyzing twitter data with hadoop

Complete Architecture

[Diagram: Twitter → Flume → HDFS → Hive and Impala, with Oozie alongside. Labels: Custom Flume Source; Sink to HDFS; JSON SerDe Parses Data; Add Partitions Hourly (Oozie); Queries and ETL (Hive); Queries (Impala)]

Page 44: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

MORE DEMOS

Page 45: Analyzing twitter data with hadoop

What next?

• Cloudera University
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• Clone the source repo
  • https://github.com/cloudera/cdh-twitter-example

Page 46: Analyzing twitter data with hadoop

My personal preference

• Cloudera Manager
  • https://ccp.cloudera.com/display/SUPPORT/Downloads

• Free for unlimited nodes (previously limited to 50)!

Page 47: Analyzing twitter data with hadoop

Shout Out

• Jon Natkins
• @nattyice
• Blog posts
  • http://blog.cloudera.com/blog/2013/09/analyzing-twitter-data-with-hadoop/
  • http://blog.cloudera.com/blog/2013/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
  • http://blog.cloudera.com/blog/2013/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

Page 48: Analyzing twitter data with hadoop

Questions?

• Contact me!
  • Joey Echeverria
  • [email protected]
  • @fwiffo
• We’re hiring!

Page 49: Analyzing twitter data with hadoop
