Application Architectures with Hadoop
Mark Grover | Software Engineer
Jonathan Seidman | Solutions Architect, Partner Engineering
April 1, 2014
©2014 Cloudera, Inc. All Rights Reserved.
About Us
• Mark
  • Committer on Apache Bigtop; committer and PPMC member on Apache Sentry (incubating).
  • Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
  • @mark_grover
• Jonathan
  • Solutions Architect, Partner Engineering Team.
  • Co-founder of Chicago Hadoop User Group and Chicago Big Data.
  • [email protected]
  • @jseidman
Co-authoring an O’Reilly book
• Titled ‘Hadoop Application Architectures’
• How to build end-to-end solutions using Apache Hadoop and related tools
• Updates on Twitter: @hadooparchbook
• http://www.hadooparchitecturebook.com
Challenges of Hadoop Implementation
Click Stream Analysis
Case Study
Click Stream Analysis
[Diagram: log files and a data warehouse (DWH), marked with an X]
Web Log Example
[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038&region=8&userType=1"
[2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488&region=4&userType=1"
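A log line in this shape can be split into fields before it is stored. The sketch below is one possible parser, not the presenters' code. The field names (`size`, `referrer`, `cookie`, and so on) are assumptions, since the slide does not label the columns.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs

# Assumed layout: [timestamp] "request" status size "referrer" "user agent" "cookie"
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)" '
    r'"(?P<cookie>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y/%m/%d %H:%M:%S.%f %z"
    )
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    # The last quoted field is a query-string style list of key=value pairs.
    record["cookie"] = {k: v[0] for k, v in parse_qs(record["cookie"]).items()}
    return record
```

Returning `None` for lines that do not match lets a caller count or quarantine malformed records instead of failing the whole batch.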
Hadoop Architectural Considerations
• Storage managers?
  • HDFS? HBase?
• Data storage and modeling:
  • File formats? Compression? Schema design?
• Data movement
  • How do we actually get the data into Hadoop? How do we get it out?
• Metadata
  • How do we manage data about the data?
• Data access and processing
  • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• Orchestration
  • How do we manage the workflow for all of this?
Data Storage and Modeling
Data Storage – Storage Manager Considerations
• Popular storage managers for Hadoop
  • Hadoop Distributed File System (HDFS)
  • HBase
Data Storage – HDFS vs HBase
HDFS
• Stores data directly as files
• Fast scans
• Poor random reads/writes
HBase
• Stores data as HFiles on HDFS
• Slow scans
• Fast random reads/writes
Data Storage – Storage Manager Considerations
• We choose HDFS
  • Analytical needs in this case are served better by fast scans.
Data Storage Format
Data Storage – Format Considerations
• Store as plain text?
  • Sure, well supported by Hadoop.
  • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
• But…
  • Will begin to consume lots of space in HDFS.
  • May not be optimal for processing by tools in the Hadoop ecosystem.
Data Storage – Format Considerations
• But, we can compress the text files…
  • Gzip – supported by Hadoop, but not splittable.
  • Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
  • LZO – splittable (with some work), good compress/decompress performance. Good choice for storing text files on Hadoop.
  • Snappy – provides a good tradeoff between size and speed.
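The size-versus-speed tradeoff can be checked on a sample of your own data. The sketch below compares only gzip and bzip2, because those are the two codecs in the Python standard library. LZO and Snappy need third-party bindings and are left out. The sample data is made up for illustration.

```python
import bz2
import gzip
import time


def compare_codecs(data):
    """Compress and decompress data with each codec; return ratio and timings."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_seconds = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        decompress_seconds = time.perf_counter() - start

        if unpacked != data:
            raise ValueError(f"{name} round trip changed the data")
        results[name] = {
            "ratio": len(packed) / len(data),
            "compress_seconds": compress_seconds,
            "decompress_seconds": decompress_seconds,
        }
    return results


# Illustrative input: one repeated log-like line (real logs compress less well).
sample = b'[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701\n' * 20000
for codec, stats in compare_codecs(sample).items():
    print(codec, stats)
```

Timings vary by machine and input, so treat the numbers as a relative comparison only.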
Data Storage – More About Snappy
• Designed at Google to provide high compression speeds with reasonable compression.
• Not the highest compression, but provides very good performance for processing on Hadoop.
• Snappy is not splittable though, which brings us to…
SequenceFile
• Stores records as binary key/value pairs.
• SequenceFile “blocks” can be compressed.
• This enables splittability even with a non-splittable compression codec.
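The idea behind block compression can be sketched in a few lines. This is not the SequenceFile format: real SequenceFiles use sync markers so a reader can find a block boundary from any offset, while this sketch uses a simple length prefix, and `zlib` stands in for Snappy. Records are assumed not to contain newlines.

```python
import struct
import zlib


def write_blocks(records, records_per_block=2):
    """Compress records in independent blocks, each prefixed with its length."""
    out = bytearray()
    for i in range(0, len(records), records_per_block):
        block = "\n".join(records[i:i + records_per_block]).encode("utf-8")
        packed = zlib.compress(block)
        out += struct.pack(">I", len(packed))  # 4-byte big-endian block length
        out += packed
    return bytes(out)


def read_blocks(data):
    """Yield the records of each block; every block decompresses on its own."""
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        block = zlib.decompress(data[offset:offset + length])
        offset += length
        yield block.decode("utf-8").split("\n")
```

Because no block depends on the bytes before it, separate workers can each start at a block boundary, which is what makes the container splittable even though the codec is not.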
Avro
• Kind of a SequenceFile on steroids.
• Self-documenting – stores schema in header.
• Provides very efficient storage.
• Supports splittable compression.
Our Format Choices…
• Avro with Snappy
  • Snappy provides optimized compression.
  • Avro provides compact storage, self-documenting files, and supports schema evolution.
  • Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
  • But SequenceFiles only support Java.
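An Avro schema is a JSON document, so it can be drafted with the standard library alone. The record below is a hypothetical schema for one click event. The record name, namespace, and field names are assumptions based on the web log example, not a schema from the talk.

```python
import json

# Hypothetical Avro record schema for one click event.
CLICK_SCHEMA = {
    "type": "record",
    "name": "Click",
    "namespace": "example.clickstream",
    "fields": [
        {"name": "timestamp", "type": "string"},
        {"name": "path", "type": "string"},
        {"name": "status", "type": "int"},
        {"name": "user_agent", "type": "string"},
        {"name": "user", "type": "long"},
        {"name": "session", "type": "long"},
        # A field added later: the union with null plus a default is what lets
        # readers using this schema still read files written without it.
        {"name": "region", "type": ["null", "int"], "default": None},
    ],
}

schema_json = json.dumps(CLICK_SCHEMA, indent=2)
print(schema_json)
```

The optional `region` field shows the schema-evolution pattern: new fields carry a default so older data stays readable.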
HDFS Schema Design
Recommended HDFS Schema Design
• How to lay out data on HDFS?
Recommended HDFS Schema Design
• /user/<username> – User-specific data, jars, conf files
• /etl – Data in various stages of ETL workflow
• /tmp – Temp data from tools or shared between users
• /data – Shared data for the entire organization
• /app – Everything but data: UDF jars, HQL files, Oozie workflows
Advanced HDFS Schema Design
What is Partitioning?
Un-partitioned HDFS directory structure:
dataset
  file1.txt
  file2.txt
  . . .
  filen.txt
Partitioned HDFS directory structure:
dataset
  col=val1/file.txt
  col=val2/file.txt
  . . .
  col=valn/file.txt
What is Partitioning?
Un-partitioned HDFS directory structure:
clicks
  clicks-2014-01-01.txt
  clicks-2014-01-02.txt
  . . .
  clicks-2014-03-31.txt
Partitioned HDFS directory structure:
clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt
  . . .
  dt=2014-03-31/clicks.txt
Partitioning
• Split the dataset into smaller consumable chunks
• Rudimentary form of “indexing”
• <data set name>/<partition_column_name=partition_column_value>/{files}
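The path convention can be expressed as a small helper. This is a sketch only. The dataset name `clicks` and the partition column `dt` follow the directory examples in this deck, and the function names are made up for illustration.

```python
from datetime import datetime


def partition_path(dataset, column, value):
    """Build <data set name>/<partition_column_name=partition_column_value>."""
    return f"{dataset}/{column}={value}"


def click_partition(timestamp):
    """Map an event timestamp to its daily partition directory."""
    return partition_path("clicks", "dt", timestamp.strftime("%Y-%m-%d"))
```

For example, `click_partition(datetime(2014, 1, 1, 20, 56))` gives `clicks/dt=2014-01-01`, so a query restricted to one day only has to read that directory.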
Partitioning considerations
• What column to partition by?
• HDFS is append only.
• Don’t have too many partitions (<10,000)
• Don’t have too many small files in the partitions (files should generally be larger than the block size)
• We decided to partition by timestamp
What is bucketing?
Un-bucketed HDFS directory structure:
clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt
Bucketed HDFS directory structure:
clicks
  dt=2014-01-01/file0.txt
  dt=2014-01-01/file1.txt
  dt=2014-01-01/file2.txt
  dt=2014-01-01/file3.txt
  dt=2014-01-02/file0.txt
  dt=2014-01-02/file1.txt
  dt=2014-01-02/file2.txt
  dt=2014-01-02/file3.txt
Bucketing
• Hash-bucketed files within each partition based on a particular column
• Useful when sampling
• Useful in some joins; prerequisites:
  • Datasets bucketed on the same key as the join key
  • Number of buckets is the same, or one is a multiple of the other
Bucketing considerations
• Which column to bucket on?
• How many buckets?
• We decided to bucket based on cookie
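Hash bucketing can be sketched as a stable hash of the bucketing column taken modulo the bucket count. Hive uses its own hash function, so `zlib.crc32` here is only a stand-in, and the `fileN.txt` naming simply follows the directory example in this deck.

```python
import zlib


def bucket_for(key, num_buckets):
    """Assign a key to a bucket with a stable hash (not Hive's actual hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_file(cookie, num_buckets=4):
    """Name of the file within a partition that holds this cookie's records."""
    return f"file{bucket_for(cookie, num_buckets)}.txt"
```

With modulo bucketing, a key's bucket in a 4-bucket dataset equals its 8-bucket number modulo 4. That is why a bucketed join can still work when one bucket count is a multiple of the other.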
De-normalizing considerations
• In general, big data joins are expensive
• When to de-normalize?
• We decided to pre-join the smaller dimension tables
• Big fact tables are still joined at query time
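Pre-joining a small dimension table amounts to a lookup against an in-memory map while the fact records stream past. The sketch below assumes a hypothetical `regions` dimension keyed by the `region` code from the web log example. The region names are invented.

```python
def denormalize(clicks, regions):
    """Attach the region name from a small dimension table to each click."""
    for click in clicks:
        enriched = dict(click)  # copy so the input record is left unchanged
        enriched["region_name"] = regions.get(click["region"], "unknown")
        yield enriched
```

The result can be stored once, so later queries read the region name directly instead of repeating the join.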
Data Ingestion
File Transfers
• “hadoop fs -put <file>”
  • Reliable, but not resilient to failure.
• Another option is mountable HDFS, for example via NFSv3.
Streaming Ingestion
• Flume
  • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
• Kafka
  • Reliable and distributed publish-subscribe messaging system.
Flume vs. Kafka
Flume
• Purpose-built for Hadoop data ingest.
• Pre-built sinks for HDFS, HBase, etc.
• Supports transformation of data in-flight.
Kafka
• General pub-sub messaging framework.
• No built-in Hadoop support; requires a third-party component (Camus).
• Just a message transport (a very fast one).
Flume vs. Kafka
• Bottom line:
  • Flume is very well integrated with the Hadoop ecosystem and well suited to ingestion of sources such as log files.
  • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.
A Quick Introduction to Flume
[Diagram: External Source → Flume Agent (Source → Channel → Sink) → Destination]
• Flume Agent: JVM process hosting components
• External Source: web server, Twitter, JMS, system logs, …
• Source: consumes events and forwards to channels
• Channel: stores events until consumed by sinks (file, memory, JDBC)
• Sink: removes event from channel and puts it into external destination
A Quick Introduction to Flume
• Reliable – events are stored in the channel until delivered to the next stage.
• Recoverable – events can be persisted to disk and recovered in the event of failure.
[Diagram: Flume Agent (Source → Channel → Sink) → Destination]
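The reliability guarantee comes down to one rule: an event leaves the channel only after the next stage has accepted it. The sketch below is a simplified in-memory model of that rule, not Flume's API. The class and function names are made up, and Flume itself handles this with channel transactions.

```python
from collections import deque


class Channel:
    """Minimal in-memory stand-in for a Flume channel."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def peek(self):
        return self._events[0] if self._events else None

    def commit(self):
        """Drop the oldest event once it has been delivered."""
        self._events.popleft()

    def __len__(self):
        return len(self._events)


def drain(channel, deliver):
    """Sink loop: remove an event only after deliver(event) returns True."""
    delivered = 0
    while len(channel):
        event = channel.peek()
        if not deliver(event):
            break  # leave the event in the channel so it can be retried
        channel.commit()
        delivered += 1
    return delivered
```

If delivery fails partway through, the undelivered events are still in the channel on the next call to `drain`. A file-backed channel extends the same guarantee across process restarts.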
A Quick Introduction to Flume
• Declarative
  • No coding required.
  • Configuration specifies how components are wired together.
A Brief Discussion of Flume Patterns – Fan-in
• Flume agent runs on each of our servers.
• These agents send data to multiple agents to provide reliability.
• Flume provides support for load balancing.
A Brief Discussion of Flume Patterns – Splitting
• Common need is to split data on ingest.
• For example:
  • Sending data to multiple clusters for DR.
  • To multiple destinations.
• Flume also supports partitioning, which is key to our implementation.
Sqoop Overview
• Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
• Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event-based data.
Ingestion Decisions
• Historical Data
  • Smaller files: file transfer
  • Larger files: Flume with spooling directory source.
• Incoming Data
  • Flume with the spooling directory source.
Data Processing and Access
Data flow
[Diagram components: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]
Data processing tools
• Hive
• Impala
• Pig, etc.
Hive
• Open source data warehouse system for Hadoop
• Converts SQL-like queries to MapReduce jobs
  • Work is being done to move this away from MR
• Stores metadata in Hive metastore
• Can create tables over HDFS or HBase data
• Access available via JDBC/ODBC
Impala
• Real-time open source SQL query engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses Hive metastore for metadata
• Access available via JDBC/ODBC
Pig
• Higher level abstraction over MapReduce (like Hive)
• Write transformations in scripting language – Pig Latin
• Can access Hive metastore via HCatalog for metadata
Data Processing considerations
• We chose Hive for ETL and Impala for interactive BI.
Metadata Management
What is Metadata?
• Metadata is data about the data
  • Format in which data is stored
  • Compression codec
  • Location of the data
  • Is the data partitioned/bucketed/sorted?
Metadata in Hive
Hive Metastore
Metadata
• Hive metastore has become the de facto metadata repository
• HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)
Hive + HCatalog
Orchestration
Orchestration
• Once the data is in Hadoop, we need a way to manage workflows in our architecture.
  • Scheduling and tracking MapReduce jobs, Hive jobs, etc.
• Several options here:
  • Cron
  • Oozie, Azkaban
  • Third-party tools: Talend, Pentaho, Informatica, enterprise schedulers.
Oozie
• Supports defining and executing a sequence of jobs.
• Can trigger jobs based on external dependencies or schedules.
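Running a sequence of jobs in dependency order can be modelled as a topological sort. The sketch below is not Oozie's workflow definition format or API. It only illustrates the ordering problem an orchestrator solves, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each hypothetical action maps to the set of actions it depends on.
WORKFLOW = {
    "ingest_weblogs": set(),
    "partition_clickstream": {"ingest_weblogs"},
    "bounce_rate": {"partition_clickstream"},
    "attribution": {"partition_clickstream"},
    "export_to_dwh": {"bounce_rate", "attribution"},
}


def execution_order(workflow):
    """Return one valid order in which every action runs after its dependencies."""
    return list(TopologicalSorter(workflow).static_order())
```

Actions with no dependency between them, such as `bounce_rate` and `attribution` here, could also run in parallel.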
Final Architecture
Final Architecture – High Level Overview
[Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis]
Final Architecture – Ingestion
[Diagram: web apps, each with an Avro agent, send events to a tier of Flume agents (fan-in pattern; multiple agents for failover and rolling restarts), which write to HDFS]
Final Architecture – Storage and Processing
[Diagram: /etl/weblogs/20140331/, /etl/weblogs/20140401/, … → Data Processing → /data/marketing/clickstream/bouncerate/, /data/marketing/clickstream/attribution/, …]
Final Architecture – Data Access
[Diagram components: Hive/Impala; BI/Analytics Tools (via JDBC/ODBC); DWH (via Sqoop); Local Disk; R, etc.; DB import tool]
Contact info
• Mark Grover
  • @mark_grover
  • www.linkedin.com/in/grovermark
• Jonathan Seidman
  • [email protected]
  • @jseidman
  • https://www.linkedin.com/pub/jonathan-seidman/1/26a/959
  • http://www.slideshare.net/jseidman
• Slides at slideshare.net/hadooparchbook