Analyzing twitter data with hadoop

Post on 12-Jun-2015

929 views 2 download

Transcript of Analyzing twitter data with hadoop

1

Analyzing  Twi,er  Data  with  Hadoop  Data  Science  Maryland,  May  2013  Joey  Echeverria  |  Director  Federal  FTS  joey@cloudera.com  |  @fwiffo  

©2013 Cloudera, Inc.

About  Joey  

•  Director  Federal  FTS  •  Technical  guy  playing  manager  

•  2  years  @  Cloudera  •  5  years  of  Hadoop  •  Local  

2

Analyzing  Twi,er  Data  with  Hadoop  

BUILDING  A  BIG  DATA  SOLUTION  

3   ©2013 Cloudera, Inc.

Big  Data  

•  Big  •  Larger  volume  than  you’ve  handled  before  

•  No  litmus  test  •  High  value,  under  uRlized  

•  Data  •  Structured  •  Unstructured  •  Semi-­‐structured  

•  Hadoop  •  Distributed  file  system  •  Distributed,  batch  computaRon  

4 ©2013 Cloudera, Inc.

Data  Management  Systems  

5 ©2013 Cloudera, Inc.

Data  Source   Data  Storage  Data  

IngesRon  

Data  Processing  

RelaRonal  Data  Management  Systems  

6 ©2013 Cloudera, Inc.

Data  Source   RDBMS  ETL  

ReporRng  

A  Canonical  Hadoop  Architecture  

7 ©2013 Cloudera, Inc.

Data  Source   HDFS  Flume  

Hive  (Impala)  

Analyzing  Twi,er  Data  with  Hadoop  

AN  EXAMPLE  USE  CASE  

8   ©2013 Cloudera, Inc.

Analyzing  Twi,er  

•  Social  media  popular  with  markeRng  teams  •  Twi,er  is  an  effecRve  tool  for  promoRon  •  Who  is  influenRal?  

•  Tweets  •  Followers  •  Retweets  

•  Similar  to  e-­‐mail  forwarding  

•  Which  twi,er  user  gets  the  most  retweets?  •  Who  is  influenRal  in  our  industry?  

9 ©2013 Cloudera, Inc.

Analyzing  Twi,er  Data  with  Hadoop  

HOW  DO  WE  ANSWER  THESE  QUESTIONS?  

10   ©2013 Cloudera, Inc.

Techniques  

•  SQL  •  Filtering  •  AggregaRon  •  SorRng  

•  Complex  data  •  Deeply  nested  •  Variable  schema  

11

Architecture  

12 ©2013 Cloudera, Inc.

Twi,er  

HDFS  Flume   Hive  

Custom  Flume  Source  

Sink  to  HDFS  

JSON  SerDe  Parses  Data  

Oozie  

Add  ParRRons  Hourly  

Impala  

Queries  

Queries  and  ETL  

Analyzing  Twi,er  Data  with  Hadoop  

TWITTER  SOURCE  

13   ©2013 Cloudera, Inc.

Flume  

•  Streaming  data  flow  •  Sources  

•  Push  or  pull  

•  Sinks  •  Event  based  

14 ©2013 Cloudera, Inc.

Pulling  Data  From  Twi,er  

•  Custom  source,  using  twi,er4j  •  Sources  process  data  as  discrete  events  

Loading  Data  Into  HDFS  

•  HDFS  Sink  comes  stock  with  Flume  •  Easily  separate  files  by  creaRon  Rme  

•  hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/

Flume  Source  

17 ©2013 Cloudera, Inc.

public class TwitterSource extends AbstractSource! implements EventDrivenSource, Configurable { ! ... ! // The initialization method for the Source. The context contains all ! // the Flume configuration info ! @Override ! public void configure(Context context) { ! ... ! } ! ... ! // Start processing events. Uses the Twitter Streaming API to sample ! // Twitter, and process tweets. ! @Override ! public void start() { ! ... ! } ! ... ! // Stops Source's event processing and shuts down the Twitter stream. ! @Override ! public void stop() { ! ... ! } !} !!

Twi,er  API  

•  Callback  mechanism  for  catching  new  tweets  

18 ©2013 Cloudera, Inc.

/** The actual Twitter stream. It's set up to collect raw JSON data */ !private final TwitterStream twitterStream = new TwitterStreamFactory( ! new ConfigurationBuilder().setJSONStoreEnabled(true).build()) ! .getInstance(); !... !// The StatusListener is a twitter4j API that can be added to a stream, !// and will call a method every time a message is sent to the stream. !StatusListener listener = new StatusListener() { ! // The onStatus method is executed every time a new tweet comes in. ! public void onStatus(Status status) { ! ... ! } !} !... !// Set up the stream's listener (defined above), and set any necessary !// security information. !twitterStream.addListener(listener); !twitterStream.setOAuthConsumer(consumerKey, consumerSecret); !AccessToken token = new AccessToken(accessToken, accessTokenSecret); !twitterStream.setOAuthAccessToken(token); !!

JSON  Data  

•  JSON  data  is  processed  as  an  event  and  wri,en  to  HDFS  

19 ©2013 Cloudera, Inc.

public void onStatus(Status status) { ! // The EventBuilder is used to build an event using the headers and ! // the raw JSON of a tweet !! headers.put("timestamp", String.valueOf( ! status.getCreatedAt().getTime())); ! Event event = EventBuilder.withBody( ! DataObjectFactory.getRawJSON(status).getBytes(), headers); ! ! channel.processEvent(event); !} !

Analyzing  Twi,er  Data  with  Hadoop  

FLUME  DEMO  

20   ©2013 Cloudera, Inc.

Analyzing  Twi,er  Data  with  Hadoop  

HIVE  

21   ©2013 Cloudera, Inc.

What  is  Hive?  

•  Created  at  Facebook  •  HiveQL  

•  SQL  like  interface  

•  Hive  interpreter  converts  HiveQL  to  MapReduce  code  

•  Returns  results  to  the  client  

22   ©2013 Cloudera, Inc.

Hive  Details  

•  Schema  on  read  •  Scalar  types  (int,  float,  double,  boolean,  string)  •  Complex  types  (struct,  map,  array)  •  Metastore  contains  table  definiRons  

•  Stored  in  a  relaRonal  database  •  Similar  to  catalog  tables  in  other  DBs  

23

Complex  Data  

24 ©2013 Cloudera, Inc.

SELECT !  t.retweet_screen_name, !  sum(retweets) AS total_retweets, !  count(*) AS tweet_count!FROM (SELECT !   retweeted_status.user.screen_name AS retweet_screen_name, !     retweeted_status.text, !     max(retweeted_status.retweet_count) AS retweets! FROM tweets !   GROUP BY ! retweeted_status.user.screen_name, !       retweeted_status.text) t !GROUP BY t.retweet_screen_name!ORDER BY total_retweets DESC !LIMIT 10; !

Analyzing  Twi,er  Data  with  Hadoop  

JSON  INTERLUDE  

25   ©2013 Cloudera, Inc.

What  is  JSON?  

•  Complex,  semi-­‐structured  data  •  Based  on  JavaScript’s  data  syntax  •  Rich,  nested  data  types:  

•  number  •  string  •  Array  •  object  •  true,  false  •  null  

26 ©2013 Cloudera, Inc.

What  is  JSON?  

27 ©2013 Cloudera, Inc.

{ ! "retweeted_status": { ! "contributors": null, ! "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata", ! "retweeted": false, ! "entities": { ! "hashtags": [ ! { ! "text": "Crowdsourcing", ! "indices": [0, 14] ! }, ! { ! "text": "bigdata", ! "indices": [129,137] ! } ! ], ! "user_mentions": [] ! } ! } !} !

Hive  Serializers  and  Deserializers    

•  Instructs  Hive  on  how  to  interpret  data  •  JSONSerDe  

28 ©2013 Cloudera, Inc.

Analyzing  Twi,er  Data  with  Hadoop  

HIVE  DEMO  

29   ©2013 Cloudera, Inc.

Analyzing  Twi,er  Data  with  Hadoop  

IT’S  A  TRAP!  

30   ©2013 Cloudera, Inc.

Photo  from  h,p://www.flickr.com/photos/vanf/6798065626/  Some  rights  reserved  

Not  a  Database  

31 ©2013 Cloudera, Inc.

RDBMS Hive

Language Generally >= SQL-92

Subset of SQL-92 plus Hive specific extensions

Update Capabilities INSERT, UPDATE, DELETE

INSERT OVERWRITE no UPDATE, DELETE

Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes

ETL  

•  Hive  works  great  for  SQL-­‐based  ETL  •  JSON  -­‐>  SequenceFiles  

32 ©2013 Cloudera, Inc.

Analyzing  Twi,er  Data  with  Hadoop  

IMPALA  

33   ©2013 Cloudera, Inc.

Cloudera  Impala  

34

Real-­‐Time  Query  for  Data  Stored  in  Hadoop.  

Supports  Hive  SQL  

4-­‐30X  faster  than  Hive  over  MapReduce  

Uses  exisRng  drivers,  integrates  with  exisRng  metastore,  works  with  leading  BI  tools  

Flexible,  cost-­‐effecRve,  no  lock-­‐in  

Deploy  &  operate  with  Cloudera  Enterprise  RTQ  

Supports  mulRple  storage  engines  &    file  formats  

©2013 Cloudera, Inc.

Benefits  of  Cloudera  Impala  

35

Real-­‐Time  Query  for  Data  Stored  in  Hadoop  

• Real-­‐Rme  queries  run  directly  on  source  data  • No  ETL  delays  • No  jumping  between  data  silos  

•   No  double  storage  with  EDW/RDBMS  •   Unlock  analysis  on  more  data  •   No  need  to  create  and  maintain  complex  ETL  between  systems  •   No  need  to  preplan  schemas  

•   All  data  available  for  interacRve  queries  • No  loss  of  fidelity  from  fixed  data  schemas  

• Single  metadata  store  from  originaRon    through  analysis  • No  need  to  hunt  through  mulRple  data  silos  

©2013 Cloudera, Inc.

Cloudera  Impala  Details  

36   ©2013 Cloudera, Inc.

HDFS  DN  

Query  Exec  Engine  

Query  Coordinator  

Query  Planner  

HBase  

ODBC  

SQL  App  

HDFS  DN  

Query  Exec  Engine  

Query  Coordinator  

Query  Planner  

HBase  HDFS  DN  

Query  Exec  Engine  

Query  Coordinator  

Query  Planner  

HBase  

Fully  MPP  Distributed  

Local  Direct  Reads  

State  Store  

HDFS  NN  Hive  Metastore   YARN  

Common  Hive  SQL  and  interface  

Unified  metadata  and  scheduler  

Low-­‐latency  scheduler  and  cache  (low-­‐impact  failures)  

Analyzing  Twi,er  Data  with  Hadoop  

IMPALA  DEMO  

37   ©2013 Cloudera, Inc.

Analyzing  Twi,er  Data  with  Hadoop  

OOZIE  AUTOMATION  

38   ©2013 Cloudera, Inc.

Oozie:  Everything  in  its  Right  Place  

Oozie  for  ParRRon  Management  

•  Once  an  hour,  add  a  parRRon  •  Takes  advantage  of  advanced  Hive  funcRonality  

Analyzing  Twi,er  Data  with  Hadoop  

OOZIE  DEMO  

41   ©2013 Cloudera, Inc.

Analyzing  Twi,er  Data  with  Hadoop  

PUTTING  IT  ALL  TOGETHER  

42   ©2013 Cloudera, Inc.

Complete  Architecture  

43 ©2013 Cloudera, Inc.

Twi,er  

HDFS  Flume   Hive  

Custom  Flume  Source  

Sink  to  HDFS  

JSON  SerDe  Parses  Data  

Oozie  

Add  ParRRons  Hourly  

Impala  

Queries  

Queries  and  ETL  

Analyzing  Twi,er  Data  with  Hadoop  

MORE  DEMOS  

44   ©2013 Cloudera, Inc.

What  next?  

•  Cloudera  University  •  Download  Hadoop!  •  CDH  available  at  www.cloudera.com  •  Cloudera  provides  pre-­‐loaded  VMs  

•  h,ps://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+EdiRon+Demo+VM  

•  Clone  the  source  repo  •  h,ps://github.com/cloudera/cdh-­‐twi,er-­‐example  

My  personal  preference  

•  Cloudera  Manager  •  h,ps://ccp.cloudera.com/display/SUPPORT/Downloads  

•  Free  up  to  50    unlimited  nodes!  

Shout  Out  

•  Jon  Natkins  •  @na,yice  •  Blog  posts  

•  h,p://blog.cloudera.com/blog/2013/09/analyzing-­‐twi,er-­‐data-­‐with-­‐hadoop/  

•  h,p://blog.cloudera.com/blog/2013/10/analyzing-­‐twi,er-­‐data-­‐with-­‐hadoop-­‐part-­‐2-­‐gathering-­‐data-­‐with-­‐flume/  

•  h,p://blog.cloudera.com/blog/2013/11/analyzing-­‐twi,er-­‐data-­‐with-­‐hadoop-­‐part-­‐3-­‐querying-­‐semi-­‐structured-­‐data-­‐with-­‐hive/  

QuesRons?  

•  Contact  me!  •  Joey  Echeverria  •  joey@cloudera.com  •  @fwiffo  

•  We’re  hiring!  

49 ©2013 Cloudera, Inc.