Application architectures with hadoop – big data techcon 2014

Application Architectures with Hadoop
Mark Grover | Software Engineer
Jonathan Seidman | Solutions Architect, Partner Engineering
April 1, 2014

About Us

• Mark
  • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
  • Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
  • @mark_grover

• Jonathan
  • Solutions Architect, Partner Engineering Team.
  • Co-founder of Chicago Hadoop User Group and Chicago Big Data.
  • [email protected]
  • @jseidman

Co-authoring O’Reilly book

• Titled ‘Hadoop Application Architectures’
• How to build end-to-end solutions using Apache Hadoop and related tools
• Updates on Twitter: @hadooparchbook
• http://www.hadooparchitecturebook.com

Challenges of Hadoop Implementation

Click Stream Analysis

Case Study

Click Stream Analysis

[Figure: Log Files and DWH, with the DWH marked by an X]

Web Log Example

[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038&region=8&userType=1"

[2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488&region=4&userType=1"
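To make the structure of these log lines concrete, here is a minimal Python sketch that splits one line into fields. The regular expression and the field names (`ts`, `path`, `cookie`, and so on) are assumptions inferred from the two sample lines, not a format specification from the talk.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs

# Pattern inferred from the two sample lines; the group names are our own labels.
LOG_PATTERN = re.compile(
    r'\[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)" "(?P<cookie>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Timestamp looks like "2012/09/22 20:56:04.294 -0500".
    record["ts"] = datetime.strptime(record["ts"], "%Y/%m/%d %H:%M:%S.%f %z")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    # The last quoted field is a query-string style list of user attributes.
    record["cookie"] = {k: v[0] for k, v in parse_qs(record["cookie"]).items()}
    return record
```

A parsed record exposes the timestamp (useful later for partitioning) and the user attributes (useful later for bucketing) as ordinary dictionary entries.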

Hadoop Architectural Considerations

• Storage managers?
  • HDFS? HBase?
• Data storage and modeling
  • File formats? Compression? Schema design?
• Data movement
  • How do we actually get the data into Hadoop? How do we get it out?
• Metadata
  • How do we manage data about the data?
• Data access and processing
  • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• Orchestration
  • How do we manage the workflow for all of this?

Data Storage and Modeling

Data Storage – Storage Manager considerations

• Popular storage managers for Hadoop
  • Hadoop Distributed File System (HDFS)
  • HBase

Data Storage – HDFS vs HBase

HDFS
• Stores data directly as files
• Fast scans
• Poor random reads/writes

HBase
• Stores data as HFiles on HDFS
• Slow scans
• Fast random reads/writes

Data Storage – Storage Manager considerations

• We choose HDFS
  • Analytical needs in this case served better by fast scans.

Data Storage Format

Data Storage – Format Considerations

• Store as plain text?
  • Sure, well supported by Hadoop.
  • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
• But…
  • Will begin to consume lots of space in HDFS.
  • May not be optimal for processing by tools in the Hadoop ecosystem.

Data Storage – Format Considerations

• But, we can compress the text files…
  • Gzip – supported by Hadoop, but not splittable.
  • Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
  • LZO – splittable (with some work), good compress/decompress performance. Good choice for storing text files on Hadoop.
  • Snappy – provides a good tradeoff between size and speed.
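The size-versus-speed tradeoff can be measured directly. The sketch below compares only the two codecs that ship with the Python standard library (gzip and bzip2); LZO and Snappy need third-party bindings and are left out. The function name `compare_codecs` and the sample data are illustrative assumptions, and the numbers will vary with the data and the machine.

```python
import bz2
import gzip
import time


def compare_codecs(data):
    """Return {codec: (compressed_size, compress_secs, decompress_secs)}."""
    codecs = (
        ("gzip", gzip.compress, gzip.decompress),
        ("bzip2", bz2.compress, bz2.decompress),
    )
    results = {}
    for name, compress, decompress in codecs:
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        restored = decompress(packed)
        end = time.perf_counter()
        if restored != data:
            raise ValueError(f"{name} round trip changed the data")
        results[name] = (len(packed), middle - start, end - middle)
    return results


if __name__ == "__main__":
    # Repetitive, log-like sample text; real click logs will compress differently.
    sample = b'"GET /info/ HTTP/1.1" 200 701 "-"\n' * 50_000
    for codec, (size, c_secs, d_secs) in compare_codecs(sample).items():
        print(f"{codec}: {len(sample)} -> {size} bytes, "
              f"compress {c_secs:.4f}s, decompress {d_secs:.4f}s")
```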

Data Storage – More About Snappy

• Designed at Google to provide high compression speeds with reasonable compression.
• Not the highest compression, but provides very good performance for processing on Hadoop.
• Snappy is not splittable though, which brings us to…

SequenceFile

• Stores records as binary key/value pairs.

• SequenceFile “blocks” can be compressed.

• This enables splittability with non-splittable compression.
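Why block compression restores splittability can be shown with a toy container. This is a simplified sketch, not the actual SequenceFile layout: the fixed `SYNC` marker, the 4-byte length prefix, and the use of zlib as a stand-in for a non-splittable codec such as Snappy are all assumptions made for illustration. Each block is compressed independently, so a reader handed an arbitrary byte range can skip to the next marker and start decoding there.

```python
import struct
import zlib

# Hypothetical 16-byte marker; a real container would use a per-file random marker.
SYNC = b"\x00SYNC-MARKER-16\xff"


def write_blocks(records, records_per_block=2):
    """Pack records into independently compressed blocks, each preceded by SYNC."""
    out = bytearray()
    for i in range(0, len(records), records_per_block):
        chunk = "\n".join(records[i:i + records_per_block]).encode("utf-8")
        payload = zlib.compress(chunk)
        out += SYNC + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_split(data, start, end):
    """Decode every block whose SYNC marker begins inside [start, end)."""
    records = []
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        header_end = pos + len(SYNC) + 4
        (length,) = struct.unpack(">I", data[pos + len(SYNC):header_end])
        block = zlib.decompress(data[header_end:header_end + length])
        records += block.decode("utf-8").split("\n")
        pos = data.find(SYNC, header_end + length)
    return records
```

Two readers given the byte ranges `[0, mid)` and `[mid, len(data))` together recover every record exactly once, even though `mid` usually falls in the middle of a compressed block.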

Avro

• Kinda SequenceFile on Steroids.

• Self-documenting – stores schema in header.

• Provides very efficient storage.

• Supports splittable compression.

Our Format Choices…

• Avro with Snappy
  • Snappy provides optimized compression.
  • Avro provides compact storage, self-documenting files, and supports schema evolution.
  • Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
  • But SequenceFiles only support Java.

HDFS Schema Design

Recommended HDFS Schema Design

• How to lay out data on HDFS?

Recommended HDFS Schema Design

/user/<username> – User-specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – Temp data from tools or shared between users
/data – Shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows

Advanced HDFS Schema Design

What is Partitioning?

Un-partitioned HDFS directory structure:

dataset
  file1.txt
  file2.txt
  ...
  filen.txt

Partitioned HDFS directory structure:

dataset
  col=val1/file.txt
  col=val2/file.txt
  ...
  col=valn/file.txt

What is Partitioning?

Un-partitioned HDFS directory structure:

clicks
  clicks-2014-01-01.txt
  clicks-2014-01-02.txt
  ...
  clicks-2014-03-31.txt

Partitioned HDFS directory structure:

clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt
  ...
  dt=2014-03-31/clicks.txt

Partitioning

• Split the dataset into smaller consumable chunks
• Rudimentary form of “indexing”
• <data set name>/<partition_column_name=partition_column_value>/{files}
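The path convention above, and the sense in which it acts as a rudimentary index, can be sketched in a few lines of Python. The `/data/clicks` root, the `dt` column, and the helper names are assumptions for illustration; a query engine performs the pruning step itself based on the query's filter.

```python
import posixpath
from datetime import datetime


def partition_path(dataset_root, column, value, filename):
    """Build <data set name>/<column=value>/<file> using HDFS-style separators."""
    return posixpath.join(dataset_root, f"{column}={value}", filename)


def click_partition(ts, filename="clicks.txt"):
    """Place a click file under a daily dt= partition derived from its timestamp."""
    return partition_path("/data/clicks", "dt", ts.strftime("%Y-%m-%d"), filename)


def prune(partition_dirs, start, end):
    """Keep only dt= partitions inside [start, end]; the rest are never read.

    ISO dates sort lexicographically, so plain string comparison is enough.
    """
    return [d for d in partition_dirs if start <= d.split("=", 1)[1] <= end]
```

For example, `click_partition(datetime(2014, 1, 1, 20, 56))` yields a path under `dt=2014-01-01`, and a query restricted to January only needs the directories that `prune` returns for that date range.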

Partitioning considerations

• What column to partition by?
• HDFS is append only.
• Don’t have too many partitions (<10,000)
• Don’t have too many small files in the partitions (files should generally be larger than the block size)

• We decided to partition by timestamp

What is bucketing?

Un-bucketed HDFS directory structure:

clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt

Bucketed HDFS directory structure:

clicks
  dt=2014-01-01/file0.txt
  dt=2014-01-01/file1.txt
  dt=2014-01-01/file2.txt
  dt=2014-01-01/file3.txt
  dt=2014-01-02/file0.txt
  dt=2014-01-02/file1.txt
  dt=2014-01-02/file2.txt
  dt=2014-01-02/file3.txt

Bucketing

• Hash-bucketed files within each partition based on a particular column
• Useful when sampling
• Useful in some joins, with prerequisites:
  • Datasets bucketed on the same key as the join key
  • Number of buckets is the same, or one is a multiple of the other
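The join prerequisite on bucket counts follows from modular arithmetic, which the sketch below illustrates. CRC32 is used here only as a deterministic stand-in hash; the hash function a given engine actually applies is not stated in the slides, and the function names are hypothetical.

```python
import zlib


def bucket_for(key, num_buckets):
    """Deterministic hash bucket for a key (CRC32 is an illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_file(partition_dir, key, num_buckets):
    """File within a partition that holds all records for this key."""
    return f"{partition_dir}/file{bucket_for(key, num_buckets)}.txt"


def matching_bucket(large_bucket, small_count):
    """Bucket of the smaller-count dataset that pairs with a bucket of the larger.

    Valid only when the larger bucket count is a multiple of small_count,
    because then (h % large_count) % small_count == h % small_count.
    """
    return large_bucket % small_count
```

With 8 buckets on one side and 4 on the other, a key that lands in bucket `b` of the 8-bucket dataset always lands in bucket `b % 4` of the 4-bucket dataset, so a join only has to pair those files rather than compare everything against everything.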

Bucketing considerations?

• Which column to bucket on?
• How many buckets?
• We decided to bucket based on cookie

De-normalizing considerations

• In general, big data joins are expensive
• When to de-normalize?
• Decided to join the smaller dimension tables
• Big fact tables are still joined
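Reading the decision above as folding small dimension tables into the click data ahead of time, a minimal sketch looks like this. The `region` field comes from the sample log lines; the region names and the `denormalize` helper are hypothetical.

```python
def denormalize(clicks, regions):
    """Copy the matching dimension attribute into every fact record once, at load time.

    clicks  -- list of dicts, each with a "region" code (the large fact data)
    regions -- dict mapping region code to name (a small dimension table)
    """
    return [
        {**click, "region_name": regions.get(click["region"], "unknown")}
        for click in clicks
    ]


if __name__ == "__main__":
    regions = {"8": "Region Eight", "4": "Region Four"}  # made-up dimension data
    clicks = [{"user": "627735038", "region": "8"},
              {"user": "1364334488", "region": "4"}]
    for row in denormalize(clicks, regions):
        print(row)
```

Because the dimension table fits in memory, the lookup costs almost nothing per record, and later queries on region name no longer need a join.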

Data Ingestion

File Transfers

• “hadoop fs -put <file>”
• Reliable, but not resilient to failure.

• Other options include mountable HDFS, for example via NFSv3.

Streaming Ingestion

• Flume
  • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
• Kafka
  • Reliable and distributed publish-subscribe messaging system.

Flume vs. Kafka

Flume
• Purpose built for Hadoop data ingest.
• Pre-built sinks for HDFS, HBase, etc.
• Supports transformation of data in-flight.

Kafka
• General pub-sub messaging framework.
• Hadoop not supported, requires 3rd-party component (Camus).
• Just a message transport (a very fast one).

Flume vs. Kafka

• Bottom line:
  • Flume very well integrated with Hadoop ecosystem, well suited to ingestion of sources such as log files.
  • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.

A Quick Introduction to Flume

[Diagram: External Source → Flume Agent (Source → Channel → Sink) → Destination]

• External source: web server, Twitter, JMS, system logs, …
• Source: consumes events and forwards to channels
• Channel: stores events until consumed by sinks – file, memory, JDBC
• Sink: removes event from channel and puts into external destination
• Flume agent: JVM process hosting components

A Quick Introduction to Flume

• Reliable – events are stored in channel until delivered to next stage.
• Recoverable – events can be persisted to disk and recovered in the event of failure.

[Diagram: Flume Agent (Source → Channel → Sink) → Destination]
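The reliability guarantee, that an event stays in the channel until the next stage has it, can be modeled in a few lines. This is a simplified in-memory illustration and not Flume's API: the `Channel` class, `drain` function, and `deliver` callback are hypothetical names, and the model ignores batching and disk persistence.

```python
from collections import deque


class Channel:
    """In-memory stand-in for a channel: events wait here until a sink confirms delivery."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        """Called by a source to hand over an event."""
        self._events.append(event)

    def peek(self):
        """Look at the oldest event without removing it."""
        return self._events[0] if self._events else None

    def commit(self):
        """Remove the oldest event; call only after it was delivered."""
        self._events.popleft()

    def __len__(self):
        return len(self._events)


def drain(channel, deliver):
    """Sink loop: an event leaves the channel only after deliver() succeeds."""
    delivered = 0
    while len(channel):
        event = channel.peek()
        try:
            deliver(event)
        except OSError:
            break  # destination unavailable; the event stays queued for a retry
        channel.commit()
        delivered += 1
    return delivered
```

If `deliver` fails partway through, the undelivered events remain in the channel in their original order, and a later call to `drain` picks up where the previous one stopped.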

A Quick Introduction to Flume

• Declarative
  • No coding required.
  • Configuration specifies how components are wired together.

A Brief Discussion of Flume Patterns – Fan-in

• Flume agent runs on each of our servers.

• These agents send data to multiple agents to provide reliability.

• Flume provides support for load balancing.

A Brief Discussion of Flume Patterns – Splitting

• Common need is to split data on ingest.
• For example:
  • Sending data to multiple clusters for DR.
  • To multiple destinations.
• Flume also supports partitioning, which is key to our implementation.

Sqoop Overview

• Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.

• Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event-based data.

Ingestion Decisions

• Historical Data
  • Smaller files: file transfer
  • Larger files: Flume with spooling directory source.
• Incoming Data
  • Flume with the spooling directory source.

Data Processing and Access

Data flow

[Diagram: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]

Data processing tools

• Hive
• Impala
• Pig, etc.

Hive

• Open source data warehouse system for Hadoop
• Converts SQL-like queries to MapReduce jobs
  • Work is being done to move this away from MR
• Stores metadata in Hive metastore
• Can create tables over HDFS or HBase data
• Access available via JDBC/ODBC

Impala

• Real-time open source SQL query engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses Hive metastore for metadata
• Access available via JDBC/ODBC

Pig

• Higher level abstraction over MapReduce (like Hive)
• Write transformations in scripting language – Pig Latin
• Can access Hive metastore via HCatalog for metadata

Data Processing considerations

• We chose Hive for ETL and Impala for interactive BI.

Metadata Management

What is Metadata?

• Metadata is data about the data
  • Format in which data is stored
  • Compression codec
  • Location of the data
  • Is the data partitioned/bucketed/sorted?

Metadata in Hive

[Diagram: Hive Metastore]

Metadata

• Hive metastore has become the de-facto metadata repository
• HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)