Application architectures with hadoop – big data techcon 2014

1

Headline Goes Here Speaker Name or Subhead Goes Here

DO NOT USE PUBLICLY PRIOR TO 10/23/12

ApplicaAon Architectures with Hadoop Mark Grover | SoGware Engineer Jonathan Seidman | SoluAons Architect, Partner Engineering April 1, 2014

©2014 Cloudera, Inc. All Rights Reserved.

About Us • Mark

•  CommiOer on Apache Bigtop, commiOer and PPMC member on Apache Sentry (incubaAng).

•  Contributor to Hadoop, Hive, Spark, Sqoop, Flume. •  @mark_grover

•  Jonathan •  SoluAons Architect, Partner Engineering Team. •  Co-‐founder of Chicago Hadoop User Group and Chicago Big Data. •  [email protected] •  @jseidman

2 ©2014 Cloudera, Inc. All Rights Reserved.

Co-‐authoring O’Reilly book

•  Titled ‘Hadoop ApplicaAon Architectures’ • How to build end-‐to-‐end soluAons using Apache Hadoop and related tools • Updates on TwiOer: @hadooparchbook •  hOp://www.hadooparchitecturebook.com

©2014 Cloudera, Inc. All Rights Reserved. 3

Challenges of Hadoop ImplementaAon


6

Click Stream Analysis

Case Study


Click Stream Analysis

7

Log Files

DWH X


Web Log Example


[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038&region=8&userType=1” [2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488&region=4&userType=1"

Hadoop Architectural ConsideraAons

•  Storage managers? •  HDFS? HBase?

•  Data storage and modeling: •  File formats? Compression? Schema design?

•  Data movement •  How do we actually get the data into Hadoop? How do we get it out?

•  Metadata •  How do we manage data about the data?

•  Data access and processing •  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?

•  OrchestraAon •  How do we manage the workflow for all of this?


10

Data Storage and Modeling


Data Storage – Storage Manager consideraAons

•  Popular storage managers for Hadoop •  Hadoop Distributed File System (HDFS) •  HBase


Data Storage – HDFS vs HBase

HDFS

•  Stores data directly as files •  Fast scans •  Poor random reads/writes

HBase

•  Stores data as Hfiles on HDFS •  Slow scans •  Fast random reads/writes


Data Storage – Storage Manager consideraAons

• We choose HDFS •  AnalyAcal needs in this case served beOer by fast scans.


14

Data Storage Format


Data Storage – Format ConsideraAons

•  Store as plain text? •  Sure, well supported by Hadoop. •  Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.

•  But… • Will begin to consume lots of space in HDFS. • May not be opAmal for processing by tools in the Hadoop ecosystem.


Data Storage – Format ConsideraAons

•  But, we can compress the text files… •  Gzip – supported by Hadoop, but not spliOable. •  Bzip2 – hey, spliOable! Great compression! But decompression is slooowww.

•  LZO – spliOable (with some work), good compress/de-‐compress performance. Good choice for storing text files on Hadoop.

•  Snappy – provides a good tradeoff between size and speed.


Data Storage – More About Snappy

•  Designed at Google to provide high compression speeds with reasonable compression.

•  Not the highest compression, but provides very good performance for processing on Hadoop.

•  Snappy is not spliOable though, which brings us to…


SequenceFile

• Stores records as binary key/value pairs.

• SequenceFile “blocks” can be compressed.

• This enables spliOability with non-‐spliOable compression.


Avro

• Kinda SequenceFile on Steroids.

• Self-‐documenAng – stores schema in header.

• Provides very efficient storage.

• Supports spliOable compression.


Our Format Choices…

• Avro with Snappy •  Snappy provides opAmized compression. •  Avro provides compact storage, self-‐documenAng files, and supports schema evoluAon.

•  Avro also provides beOer failure handling than other choices. •  SequenceFiles would also be a good choice, and are directly supported by ingesAon tools in the ecosystem.

•  But only supports Java.


21

HDFS Schema Design


Recommended HDFS Schema Design

• How to lay out data on HDFS?


Recommended HDFS Schema Design

/user/<username> -‐ User specific data, jars, conf files /etl – Data in various stages of ETL workflow /tmp – temp data from tools or shared between users /data – shared data for the enAre organizaAon /app – Everything but data: UDF jars, HQL files, Oozie workflows


24

Advanced HDFS Schema Design


What is ParAAoning?

25

dataset col=val1/file.txt col=val2/file.txt . . . col=valn/file.txt

dataset file1.txt file2.txt . . . filen.txt

Un-‐parAAoned HDFS directory structure

ParAAoned HDFS directory structure


What is ParAAoning?

26

clicks dt=2014-‐01-‐01/clicks.txt dt=2014-‐01-‐02/clicks.txt . . . dt=2014-‐03-‐31/clicks.txt

clicks clicks-‐2014-‐01-‐01.txt clicks-‐2014-‐01-‐02.txt . . . clicks-‐2014-‐03-‐31.txt

Un-‐parAAoned HDFS directory structure

ParAAoned HDFS directory structure


ParAAoning

•  Split the dataset into smaller consumable chunks •  Rudimentary form of “indexing” •  <data set name>/<parAAon_column_name=parAAon_column_value>/{files}


ParAAoning consideraAons

• What column to bucket by? •  HDFS is append only. •  Don’t have too many parAAons (<10,000) •  Don’t have too many small files in the parAAons (more than block size generally)

• We decided to parAAon by 1mestamp


What is buckeAng?

29

clicks dt=2014-‐01-‐01/clicks.txt dt=2014-‐01-‐02/clicks.txt

Un-‐bucketed HDFS directory structure

clicks dt=2014-‐01-‐01/file0.txt dt=2014-‐01-‐01/file1.txt dt=2014-‐01-‐01/file2.txt dt=2014-‐01-‐01/file3.txt dt=2014-‐01-‐02/file0.txt dt=2014-‐01-‐02/file1.txt dt=2014-‐01-‐02/file2.txt dt=2014-‐01-‐02/file3.txt

Bucketed HDFS directory structure


BuckeAng

•  Hash-‐bucketed files within each parAAon based on a parAcular column

•  Useful when sampling •  In some joins, pre-‐reqs:

•  Datasets bucketed on the same key as the join key •  Number of buckets are the same or one is a mulAple of the other


BuckeAng consideraAons?

• Which column to bucket on? •  How many buckets? • We decided to bucket based on cookie


De-‐normalizing consideraAons

•  In general, big data joins are expensive • When to de-‐normalize?

•  Decided to join the smaller dimension tables •  Big fact tables are sAll joined


33

Data IngesAon


File Transfers

• “hadoop fs –put <file>” • Reliable, but not resilient to failure.

• Other opAons are mountable HDFS, for example NFSv3.


Streaming IngesAon

•  Flume •  Reliable, distributed, and available system for efficient collecAon, aggregaAon and movement of streaming data, e.g. logs.

•  Ka{a •  Reliable and distributed publish-‐subscribe messaging system.


Flume vs. Ka{a

• Purpose built for Hadoop data ingest.

• Pre-‐built sinks for HDFS, HBase, etc.

• Supports transformaAon of data in-‐flight.

• General pub-‐sub messaging framework.

• Hadoop not supported, requires 3rd-‐party component (Camus).

• Just a message transport (a very fast one).


Flume vs. Ka{a

•  BoOom line: •  Flume very well integrated with Hadoop ecosystem, well suited to ingesAon of sources such as log files.

•  Ka{a is a highly reliable and scalable enterprise messaging system, and great for scaling out to mulAple consumers.


A Quick IntroducAon to Flume

38

Flume Agent

Source Channel Sink DesAnaAon External Source

Web Server TwiOer JMS System logs …

Consumes events and forwards to channels

Stores events unAl consumed by sinks – file, memory, JDBC

Removes event from channel and puts into external desAnaAon

JVM process hosAng components



•  Reliable – events are stored in channel unAl delivered to next stage. •  Recoverable – events can be persisted to disk and recovered in the event of failure.

39

Flume Agent

Source Channel Sink DesAnaAon



• DeclaraAve • No coding required. •  ConfiguraAon specifies how components are wired together.


A Brief Discussion of Flume PaOerns – Fan-‐in

• Flume agent runs on each of our servers.

• These agents send data to mulAple agents to provide reliability.

• Flume provides support for load balancing.


A Brief Discussion of Flume PaOerns – Spli~ng

• Common need is to split data on ingest.

• For example: •  Sending data to mulAple clusters for DR.

•  To mulAple desAnaAons. • Flume also supports parAAoning, which is key to our implementaAon.


Sqoop Overview

• Apache project designed to ease import and export of data between Hadoop and external data stores such as relaAonal databases.

• Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesAng event based data.


IngesAon Decisions

• Historical Data •  Smaller files: file transfer •  Larger files: Flume with spooling directory source.

•  Incoming Data •  Flume with the spooling directory source.


45

Data Processing and Access


Data flow

46

Raw data

ParAAoned clickstream

data

Other data (Financial, CRM, etc.)

Aggregated dataset #2

Aggregated dataset #1


Data processing tools

47

• Hive •  Impala •  Pig, etc.


Hive

48

• Open source data warehouse system for Hadoop •  Converts SQL-‐like queries to MapReduce jobs

• Work is being done to move this away from MR •  Stores metadata in Hive metastore •  Can create tables over HDFS or HBase data • Access available via JDBC/ODBC


Impala

49

•  Real-‐Ame open source SQL query engine for Hadoop • Doesn’t build on MapReduce • WriOen in C++, uses LLVM for run-‐Ame code generaAon •  Can create tables over HDFS or HBase data • Accesses Hive metastore for metadata • Access available via JDBC/ODBC


Pig

50

• Higher level abstracAon over MapReduce (like Hive) • Write transformaAons in scripAng language – Pig LaAn •  Can access Hive metastore via HCatalog for metadata


Data Processing consideraAons

51

• We chose Hive for ETL and Impala for interac1ve BI.


52

Metadata Management


What is Metadata?

53

• Metadata is data about the data •  Format in which data is stored •  Compression codec •  LocaAon of the data •  Is the data parAAoned/bucketed/sorted?


Metadata in Hive

54

Hive Metastore


Metadata

55

• Hive metastore has become the de-‐facto metadata repository • HCatalog makes Hive metastore accessible to other applicaAons (Pig, MapReduce, custom apps, etc.)


Hive + HCatalog


57

OrchestraAon


OrchestraAon

• Once the data is in Hadoop, we need a way to manage workflows in our architecture.

•  Scheduling and tracking MapReduce jobs, Hive jobs, etc. •  Several opAons here:

•  Cron •  Oozie, Azkaban •  3rd-‐party tools, Talend, Pentaho, InformaAca, enterprise schedulers.


Oozie

• Supports defining and execuAng a sequence of jobs.

• Can trigger jobs based on external dependencies or schedules.


60

Final Architecture


Final Architecture – High Level Overview

61

Data Sources IngesAon

Data Storage/Processing

Data ReporAng/Analysis



62





Final Architecture – IngesAon

63

Web App Avro Agent Web App Avro Agent




Flume Agent

Flume Agent

Flume Agent

Flume Agent

Fan-‐in PaOern

MulA Agents for Failover and rolling restarts

HDFS



64





Final Architecture – Storage and Processing

65

/etl/weblogs/20140331/ /etl/weblogs/20140401/ …

Data Processing /data/markeAng/clickstream/bouncerate/ /data/markeAng/clickstream/aOribuAon/ …



66





Final Architecture – Data Access

67

Hive/Impala

BI/AnalyAcs Tools

DWH Sqoop

Local Disk

R, etc.

DB import tool

JDBC/ODBC


Contact info • Mark Grover

• @mark_grover •  www.linkedin.com/in/grovermark

•  Jonathan Seidman •  [email protected] • @jseidman •  hOps://www.linkedin.com/pub/jonathan-‐seidman/1/26a/959 •  hOp://www.slideshare.net/jseidman

•  Slides at slideshare.net/hadooparchbook


Application architectures with hadoop – big data techcon 2014

Technology

Transcript of Application architectures with hadoop – big data techcon 2014