Analyzing twitter data with hadoop

Analyzing Twi,er Data with Hadoop Data Science Maryland, May 2013 Joey Echeverria | Director Federal FTS joey@cloudera.com | @fwiffo

About Joey

•  Director Federal FTS •  Technical guy playing manager

•  2 years @ Cloudera •  5 years of Hadoop •  Local

Analyzing Twi,er Data with Hadoop

BUILDING A BIG DATA SOLUTION

Big Data

•  Big •  Larger volume than you’ve handled before

•  No litmus test •  High value, under uRlized

•  Data •  Structured •  Unstructured •  Semi-‐structured

•  Hadoop •  Distributed file system •  Distributed, batch computaRon

Data Management Systems

Data Source Data Storage Data

IngesRon

Data Processing

RelaRonal Data Management Systems

Data Source RDBMS ETL

ReporRng

A Canonical Hadoop Architecture

Data Source HDFS Flume

Hive (Impala)

AN EXAMPLE USE CASE

Analyzing Twi,er

•  Social media popular with markeRng teams •  Twi,er is an effecRve tool for promoRon •  Who is influenRal?

•  Tweets •  Followers •  Retweets

•  Similar to e-‐mail forwarding

•  Which twi,er user gets the most retweets? •  Who is influenRal in our industry?

HOW DO WE ANSWER THESE QUESTIONS?

Techniques

•  SQL •  Filtering •  AggregaRon •  SorRng

•  Complex data •  Deeply nested •  Variable schema

Architecture

Twi,er

HDFS Flume Hive

Custom Flume Source

Sink to HDFS

JSON SerDe Parses Data

Add ParRRons Hourly

Impala

Queries

Queries and ETL

TWITTER SOURCE

•  Streaming data flow •  Sources

•  Push or pull

•  Sinks •  Event based

Pulling Data From Twi,er

•  Custom source, using twi,er4j •  Sources process data as discrete events

Loading Data Into HDFS

•  HDFS Sink comes stock with Flume •  Easily separate files by creaRon Rme

•  hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/

Flume Source

public class TwitterSource extends AbstractSource! implements EventDrivenSource, Configurable { ! ... ! // The initialization method for the Source. The context contains all ! // the Flume configuration info ! @Override ! public void configure(Context context) { ! ... ! } ! ... ! // Start processing events. Uses the Twitter Streaming API to sample ! // Twitter, and process tweets. ! @Override ! public void start() { ! ... ! } ! ... ! // Stops Source's event processing and shuts down the Twitter stream. ! @Override ! public void stop() { ! ... ! } !} !!

Twi,er API

•  Callback mechanism for catching new tweets

/** The actual Twitter stream. It's set up to collect raw JSON data */ !private final TwitterStream twitterStream = new TwitterStreamFactory( ! new ConfigurationBuilder().setJSONStoreEnabled(true).build()) ! .getInstance(); !... !// The StatusListener is a twitter4j API that can be added to a stream, !// and will call a method every time a message is sent to the stream. !StatusListener listener = new StatusListener() { ! // The onStatus method is executed every time a new tweet comes in. ! public void onStatus(Status status) { ! ... ! } !} !... !// Set up the stream's listener (defined above), and set any necessary !// security information. !twitterStream.addListener(listener); !twitterStream.setOAuthConsumer(consumerKey, consumerSecret); !AccessToken token = new AccessToken(accessToken, accessTokenSecret); !twitterStream.setOAuthAccessToken(token); !!

JSON Data

•  JSON data is processed as an event and wri,en to HDFS

public void onStatus(Status status) { ! // The EventBuilder is used to build an event using the headers and ! // the raw JSON of a tweet !! headers.put("timestamp", String.valueOf( ! status.getCreatedAt().getTime())); ! Event event = EventBuilder.withBody( ! DataObjectFactory.getRawJSON(status).getBytes(), headers); ! ! channel.processEvent(event); !} !

FLUME DEMO

What is Hive?

•  Created at Facebook •  HiveQL

•  SQL like interface

•  Hive interpreter converts HiveQL to MapReduce code

•  Returns results to the client

Hive Details

•  Schema on read •  Scalar types (int, float, double, boolean, string) •  Complex types (struct, map, array) •  Metastore contains table definiRons

•  Stored in a relaRonal database •  Similar to catalog tables in other DBs

Complex Data

SELECT ! t.retweet_screen_name, ! sum(retweets) AS total_retweets, ! count(*) AS tweet_count!FROM (SELECT ! retweeted_status.user.screen_name AS retweet_screen_name, ! retweeted_status.text, ! max(retweeted_status.retweet_count) AS retweets! FROM tweets ! GROUP BY ! retweeted_status.user.screen_name, ! retweeted_status.text) t !GROUP BY t.retweet_screen_name!ORDER BY total_retweets DESC !LIMIT 10; !

JSON INTERLUDE

What is JSON?

•  Complex, semi-‐structured data •  Based on JavaScript’s data syntax •  Rich, nested data types:

•  number •  string •  Array •  object •  true, false •  null

What is JSON?

{ ! "retweeted_status": { ! "contributors": null, ! "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata", ! "retweeted": false, ! "entities": { ! "hashtags": [ ! { ! "text": "Crowdsourcing", ! "indices": [0, 14] ! }, ! { ! "text": "bigdata", ! "indices": [129,137] ! } ! ], ! "user_mentions": [] ! } ! } !} !

Hive Serializers and Deserializers

•  Instructs Hive on how to interpret data •  JSONSerDe

HIVE DEMO

IT’S A TRAP!

Photo from h,p://www.flickr.com/photos/vanf/6798065626/ Some rights reserved

Not a Database

RDBMS Hive

Language Generally >= SQL-92

Subset of SQL-92 plus Hive specific extensions

Update Capabilities INSERT, UPDATE, DELETE

INSERT OVERWRITE no UPDATE, DELETE

Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes

•  Hive works great for SQL-‐based ETL •  JSON -‐> SequenceFiles

IMPALA

Cloudera Impala

Real-‐Time Query for Data Stored in Hadoop.

Supports Hive SQL

4-‐30X faster than Hive over MapReduce

Uses exisRng drivers, integrates with exisRng metastore, works with leading BI tools

Flexible, cost-‐effecRve, no lock-‐in

Deploy & operate with Cloudera Enterprise RTQ

Supports mulRple storage engines & file formats

Benefits of Cloudera Impala

Real-‐Time Query for Data Stored in Hadoop

• Real-‐Rme queries run directly on source data • No ETL delays • No jumping between data silos

•  No double storage with EDW/RDBMS •  Unlock analysis on more data •  No need to create and maintain complex ETL between systems •  No need to preplan schemas

•  All data available for interacRve queries • No loss of fidelity from fixed data schemas

• Single metadata store from originaRon through analysis • No need to hunt through mulRple data silos

Cloudera Impala Details

HDFS DN

Query Exec Engine

Query Coordinator

Query Planner

SQL App

HDFS DN

Query Exec Engine

Query Coordinator

Query Planner

HBase HDFS DN

Query Exec Engine

Query Coordinator

Query Planner

Fully MPP Distributed

Local Direct Reads

State Store

HDFS NN Hive Metastore YARN

Common Hive SQL and interface

Unified metadata and scheduler

Low-‐latency scheduler and cache (low-‐impact failures)

IMPALA DEMO

OOZIE AUTOMATION

Oozie: Everything in its Right Place

Oozie for ParRRon Management

•  Once an hour, add a parRRon •  Takes advantage of advanced Hive funcRonality

OOZIE DEMO

PUTTING IT ALL TOGETHER

Complete Architecture

Twi,er

HDFS Flume Hive

Custom Flume Source

Sink to HDFS

JSON SerDe Parses Data

Add ParRRons Hourly

Impala

Queries

Queries and ETL

MORE DEMOS

What next?

•  Cloudera University •  Download Hadoop! •  CDH available at www.cloudera.com •  Cloudera provides pre-‐loaded VMs

•  h,ps://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+EdiRon+Demo+VM

•  Clone the source repo •  h,ps://github.com/cloudera/cdh-‐twi,er-‐example

My personal preference

•  Cloudera Manager •  h,ps://ccp.cloudera.com/display/SUPPORT/Downloads

•  Free up to 50 unlimited nodes!

Shout Out

•  Jon Natkins •  @na,yice •  Blog posts

•  h,p://blog.cloudera.com/blog/2013/09/analyzing-‐twi,er-‐data-‐with-‐hadoop/

•  h,p://blog.cloudera.com/blog/2013/10/analyzing-‐twi,er-‐data-‐with-‐hadoop-‐part-‐2-‐gathering-‐data-‐with-‐flume/

•  h,p://blog.cloudera.com/blog/2013/11/analyzing-‐twi,er-‐data-‐with-‐hadoop-‐part-‐3-‐querying-‐semi-‐structured-‐data-‐with-‐hive/

QuesRons?

•  Contact me! •  Joey Echeverria •  joey@cloudera.com •  @fwiffo

•  We’re hiring!

Analyzing twitter data with hadoop

Documents

Transcript of Analyzing twitter data with hadoop

Stuart Pérez A12729. Agenda Que es Hadoop Porque usarlo Componentes de Hadoop HDFS MapReduce Cluster Hadoop (HDFS + MR) Hadoop Scheduler Conclusiones.

Apache Hadoop - Conceitos teóricos e práticos, evolução e ... · Apache Hadoop Apache Hadoop Conceitosteóricosepráticos,evolução enovaspossibilidades DanielCordeiro Departamento

Analyzing and Reporting Test Results

Analyzing Running Records

Analyzing Headlines

Hadoopエンタープライズソリューションセミナー2012: Hadoop＆RabbitMQを利用したTwitter全量リアルタイム解析

ANALYZING REGULATION AND MARKETING ETHICS.pptx

Analyzing Human Solving Methods

Hadoop Trends & Hadoop on EC2

Kotler and Armstrong Chapter 3: Analyzing the Marketing ... · PDF fileKotler and Armstrong Chapter 3: Analyzing the Marketing ... Analyzing the Marketing Environment ... •International

Hadoop pig

Analyzing Tweets of Mexican Bicentenario

Twitter クライアント “Termtter” の紹介と収集したソーシャルデータを Fluentd + Hadoop で分析する話

Curso Hadoop. FcoJavierLahozSevilla v1.0.pdf · Introducción+a Hadoop. InstalaciónenAWS • Parte+1.+Introducción+a Hadoop+ – ¿Que+es+Hadoop?+ – Versionesde+Hadoop+ – Gesón

Analyzing emails from etailers

Big$Data$Processing$using$ Hadoop$ - prace.it4i.czprace.it4i.cz/sites/prace.it4i.cz/files/files/hadoop-10-2015... · Original Hadoop distributed grep Hadoop+BlobSeer sort Execution

Analyzing stakeholders sahil chawla

Hadoop Overview

ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch

Hadoop ecosystem - hadoop 生態系