Application Architectures with Hadoop – Big Data TechCon 2014


Description

Deck from presentation at Big Data TechCon Boston 2014 on building applications with Hadoop and tools from the Hadoop ecosystem.

Transcript of Application Architectures with Hadoop – Big Data TechCon 2014


Application Architectures with Hadoop

Mark Grover | Software Engineer
Jonathan Seidman | Solutions Architect, Partner Engineering

April 1, 2014

About Us

• Mark Grover
  • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
  • Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
  • @mark_grover
• Jonathan Seidman
  • Solutions Architect, Partner Engineering Team.
  • Co-founder of Chicago Hadoop User Group and Chicago Big Data.
  • jseidman@cloudera.com
  • @jseidman

Co-authoring O'Reilly Book

• Titled 'Hadoop Application Architectures'
• How to build end-to-end solutions using Apache Hadoop and related tools
• Updates on Twitter: @hadooparchbook
• http://www.hadooparchitecturebook.com

Challenges of Hadoop Implementation


Click Stream Analysis

Case Study

Click Stream Analysis

[Diagram: web server log files feeding a traditional DWH, marked with an X]

Web Log Example

[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038&region=8&userType=1"
[2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488&region=4&userType=1"

Hadoop Architectural Considerations

• Storage managers?
  • HDFS? HBase?
• Data storage and modeling:
  • File formats? Compression? Schema design?
• Data movement:
  • How do we actually get the data into Hadoop? How do we get it out?
• Metadata:
  • How do we manage data about the data?
• Data access and processing:
  • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• Orchestration:
  • How do we manage the workflow for all of this?

Data Storage and Modeling

Data Storage – Storage Manager Considerations

• Popular storage managers for Hadoop:
  • Hadoop Distributed File System (HDFS)
  • HBase

Data Storage – HDFS vs. HBase

HDFS
• Stores data directly as files
• Fast scans
• Poor random reads/writes

HBase
• Stores data as HFiles on HDFS
• Slow scans
• Fast random reads/writes

Data Storage – Storage Manager Considerations

• We chose HDFS
  • Analytical needs in this case are served better by fast scans.

Data Storage Format

Data Storage – Format Considerations

• Store as plain text?
  • Sure, well supported by Hadoop.
  • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
• But…
  • Will begin to consume lots of space in HDFS.
  • May not be optimal for processing by tools in the Hadoop ecosystem.

Data Storage – Format Considerations

• But we can compress the text files…
  • Gzip – supported by Hadoop, but not splittable.
  • Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
  • LZO – splittable (with some work), good compress/decompress performance. A good choice for storing text files on Hadoop.
  • Snappy – provides a good tradeoff between size and speed.

Data Storage – More About Snappy

• Designed at Google to provide high compression speeds with reasonable compression.
• Not the highest compression, but provides very good performance for processing on Hadoop.
• Snappy is not splittable though, which brings us to…

SequenceFile

• Stores records as binary key/value pairs.
• SequenceFile "blocks" can be compressed.
• This enables splittability with non-splittable compression.
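To make the block-compression point concrete, here is a minimal sketch (not from the deck) of writing a block-compressed SequenceFile with the Snappy codec via the Hadoop Java API. The output path and record contents are invented, and the Snappy native libraries are assumed to be available on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SequenceFileSnappyWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/etl/weblogs/example.seq");  // hypothetical output path

    // BLOCK compression compresses batches of records, so the file remains
    // splittable even though Snappy itself is not a splittable codec.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
      writer.append(new LongWritable(1L), new Text("one raw web log line"));
      writer.append(new LongWritable(2L), new Text("another raw web log line"));
    }
  }
}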

Avro

• Kinda SequenceFile on steroids.
• Self-documenting – stores the schema in the file header.
• Provides very efficient storage.
• Supports splittable compression.

Our Format Choices…

• Avro with Snappy
  • Snappy provides optimized compression.
  • Avro provides compact storage, self-documenting files, and supports schema evolution.
  • Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
  • But they only support Java.
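To make the Avro-plus-Snappy choice concrete, here is a minimal sketch (not part of the deck) that writes a Snappy-compressed Avro data file with the Avro Java API. The click schema, field names, and output file are illustrative assumptions, and snappy-java must be on the classpath.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSnappyWrite {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema for one clickstream record
    String schemaJson = "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
        + "{\"name\":\"ts\",\"type\":\"string\"},"
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"user\",\"type\":\"long\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());     // block-level Snappy compression
      writer.create(schema, new File("clicks.avro"));  // schema is embedded in the file header

      GenericRecord click = new GenericData.Record(schema);
      click.put("ts", "2012/09/22 20:56:04.294 -0500");
      click.put("url", "/info/");
      click.put("user", 627735038L);
      writer.append(click);
    }
  }
}

Because the schema travels with the file, downstream consumers (Hive, MapReduce, other teams) can read it without out-of-band coordination, which is a large part of why Avro is attractive here.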

HDFS Schema Design

Recommended HDFS Schema Design

• How to lay out data on HDFS?

Recommended HDFS Schema Design

/user/<username> – user-specific data, JARs, conf files
/etl – data in various stages of the ETL workflow
/tmp – temporary data from tools, or data shared between users
/data – shared data for the entire organization
/app – everything but data: UDF JARs, HQL files, Oozie workflows

Advanced HDFS Schema Design

What is Partitioning?

Un-partitioned HDFS directory structure:
dataset
    file1.txt
    file2.txt
    ...
    filen.txt

Partitioned HDFS directory structure:
dataset
    col=val1/file.txt
    col=val2/file.txt
    ...
    col=valn/file.txt

What is Partitioning?

Un-partitioned HDFS directory structure:
clicks
    clicks-2014-01-01.txt
    clicks-2014-01-02.txt
    ...
    clicks-2014-03-31.txt

Partitioned HDFS directory structure:
clicks
    dt=2014-01-01/clicks.txt
    dt=2014-01-02/clicks.txt
    ...
    dt=2014-03-31/clicks.txt

Partitioning

• Split the dataset into smaller consumable chunks
• A rudimentary form of "indexing"
• <data set name>/<partition_column_name=partition_column_value>/{files}

Partitioning Considerations

• What column to partition by?
  • HDFS is append-only.
  • Don't have too many partitions (< 10,000).
  • Don't have too many small files in the partitions (files should generally be larger than the block size).
• We decided to partition by timestamp.

What is Bucketing?

Un-bucketed HDFS directory structure:
clicks
    dt=2014-01-01/clicks.txt
    dt=2014-01-02/clicks.txt

Bucketed HDFS directory structure:
clicks
    dt=2014-01-01/file0.txt
    dt=2014-01-01/file1.txt
    dt=2014-01-01/file2.txt
    dt=2014-01-01/file3.txt
    dt=2014-01-02/file0.txt
    dt=2014-01-02/file1.txt
    dt=2014-01-02/file2.txt
    dt=2014-01-02/file3.txt

Bucketing

• Hash-bucketed files within each partition, based on a particular column.
• Useful when sampling.
• A prerequisite for some joins:
  • Datasets are bucketed on the same key as the join key.
  • The numbers of buckets are the same, or one is a multiple of the other.

Bucketing Considerations

• Which column to bucket on?
• How many buckets?
• We decided to bucket based on cookie.
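Putting the partitioning and bucketing decisions together, here is a hedged sketch of what a clickstream table definition could look like in Hive, submitted over JDBC (the deck notes later that Hive is accessible via JDBC/ODBC). The HiveServer2 host, database, column list, and bucket count are illustrative assumptions, not the presenters' actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateClicksTable {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hypothetical HiveServer2 endpoint
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host.example.com:10000/default", "etl_user", "");
         Statement stmt = conn.createStatement()) {

      // Partitioned by date (dt) and hash-bucketed by cookie within each partition.
      // (Storage format clause omitted for brevity; the deck's actual choice was Avro.)
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS clicks ("
        + " ts STRING, url STRING, referrer STRING, user_agent STRING, cookie STRING)"
        + " PARTITIONED BY (dt STRING)"
        + " CLUSTERED BY (cookie) INTO 16 BUCKETS"
        + " LOCATION '/etl/weblogs/clicks'");

      // Register one day's partition; in practice this step would be automated.
      stmt.execute("ALTER TABLE clicks ADD IF NOT EXISTS PARTITION (dt='2014-01-01')");
    }
  }
}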

De-normalizing Considerations

• In general, big data joins are expensive.
• When to de-normalize?
• We decided to pre-join the smaller dimension tables.
• Big fact tables are still joined at query time.

Data Ingestion

File Transfers

• "hadoop fs -put <file>"
• Reliable, but not resilient to failure.
• Other options include mountable HDFS, for example NFSv3.
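The same kind of one-shot transfer can also be done from Java with the HDFS FileSystem API; a minimal sketch, with made-up local and HDFS paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the core-site.xml on the classpath
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // Equivalent of: hadoop fs -put /var/log/httpd/access.log /etl/weblogs/incoming/
      fs.copyFromLocalFile(new Path("/var/log/httpd/access.log"),
                           new Path("/etl/weblogs/incoming/access.log"));
    }
  }
}

Like the shell command, this is fire-and-forget: if the process dies mid-copy, nothing retries it, which is why the deck treats plain file transfer as reliable but not resilient to failure.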

Streaming Ingestion

• Flume
  • A reliable, distributed, and available system for efficient collection, aggregation, and movement of streaming data, e.g. logs.
• Kafka
  • A reliable and distributed publish-subscribe messaging system.

Flume vs. Kafka

Flume
• Purpose-built for Hadoop data ingest.
• Pre-built sinks for HDFS, HBase, etc.
• Supports transformation of data in flight.

Kafka
• General pub-sub messaging framework.
• Hadoop not supported directly; requires a third-party component (Camus).
• Just a message transport (a very fast one).

Flume vs. Kafka

• Bottom line:
  • Flume is very well integrated with the Hadoop ecosystem and well suited to ingestion of sources such as log files.
  • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.

A Quick Introduction to Flume

[Diagram: a Flume agent is a JVM process hosting Source, Channel, and Sink components between an external source and a destination]
• External source – web server, Twitter, JMS, system logs, …
• Source – consumes events and forwards them to channels.
• Channel – stores events until consumed by sinks (file, memory, JDBC).
• Sink – removes events from the channel and puts them into the external destination.
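On the producing side, an application typically hands events to an agent's Avro source through Flume's RPC client API; a minimal sketch, assuming a hypothetical agent host and port (this is the generic Flume client API, not code from the deck):

import java.nio.charset.Charset;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class SendToFlume {
  public static void main(String[] args) throws EventDeliveryException {
    // Hypothetical Flume agent exposing an Avro source on port 41414
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-host.example.com", 41414);
    try {
      Event event = EventBuilder.withBody("one web log line", Charset.forName("UTF-8"));
      client.append(event);  // delivered to the agent's source, then stored in its channel
    } finally {
      client.close();
    }
  }
}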

A Quick Introduction to Flume

• Reliable – events are stored in the channel until delivered to the next stage.
• Recoverable – events can be persisted to disk and recovered in the event of failure.

[Diagram: Flume agent: Source → Channel → Sink → Destination]

A Quick Introduction to Flume

• Declarative
  • No coding required.
  • Configuration specifies how components are wired together.

A Brief Discussion of Flume Patterns – Fan-in

• A Flume agent runs on each of our servers.
• These agents send data to multiple agents to provide reliability.
• Flume provides support for load balancing.

A Brief Discussion of Flume Patterns – Splitting

• A common need is to split data on ingest.
• For example:
  • Sending data to multiple clusters for DR.
  • Sending data to multiple destinations.
• Flume also supports partitioning, which is key to our implementation.

Sqoop Overview

• Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
• Great for doing bulk imports and exports of data between HDFS, Hive, and HBase and an external data store. Not suited for ingesting event-based data.
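Sqoop is normally driven from the command line, but Sqoop 1 can also be invoked programmatically; a hedged sketch of exporting an aggregated clickstream dataset to a relational DWH, with a hypothetical JDBC URL, table, and paths:

import org.apache.sqoop.Sqoop;

public class ExportAggregatesToDwh {
  public static void main(String[] args) {
    // Roughly equivalent to:
    //   sqoop export --connect jdbc:postgresql://dwh-host.example.com/dwh \
    //     --table bouncerate --export-dir /data/marketing/clickstream/bouncerate
    int exitCode = Sqoop.runTool(new String[] {
        "export",
        "--connect", "jdbc:postgresql://dwh-host.example.com/dwh",  // hypothetical DWH
        "--username", "etl_user",
        "--password-file", "/user/etl_user/.dwh-password",
        "--table", "bouncerate",
        "--export-dir", "/data/marketing/clickstream/bouncerate"
    });
    System.exit(exitCode);
  }
}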

Ingestion Decisions

• Historical data
  • Smaller files: file transfer
  • Larger files: Flume with the spooling directory source
• Incoming data
  • Flume with the spooling directory source

Data Processing and Access

Data Flow

[Diagram: raw data is transformed into partitioned clickstream data, combined with other data (financial, CRM, etc.), and rolled up into aggregated dataset #1 and aggregated dataset #2]

Data Processing Tools

• Hive
• Impala
• Pig, etc.

Hive

• Open source data warehouse system for Hadoop
• Converts SQL-like queries to MapReduce jobs
  • Work is being done to move this away from MR
• Stores metadata in the Hive metastore
• Can create tables over HDFS or HBase data
• Access available via JDBC/ODBC

Impala

• Real-time, open source SQL query engine for Hadoop
• Doesn't build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses the Hive metastore for metadata
• Access available via JDBC/ODBC

Pig

• Higher-level abstraction over MapReduce (like Hive)
• Write transformations in a scripting language – Pig Latin
• Can access the Hive metastore via HCatalog for metadata

Data Processing Considerations

• We chose Hive for ETL and Impala for interactive BI.
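A hedged sketch of the interactive-BI side: Impala speaks the HiveServer2 protocol (by default on port 21050), so a JDBC client can query it with the Hive driver. The host, table, and query below are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InteractiveBiQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hypothetical Impala daemon; auth=noSasl for an unsecured cluster
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://impala-host.example.com:21050/default;auth=noSasl");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT dt, COUNT(*) AS clicks FROM clicks"
           + " WHERE dt BETWEEN '2014-01-01' AND '2014-01-07'"
           + " GROUP BY dt ORDER BY dt LIMIT 100")) {
      while (rs.next()) {
        System.out.println(rs.getString("dt") + "\t" + rs.getLong("clicks"));
      }
    }
  }
}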

Metadata Management

What is Metadata?

• Metadata is data about the data:
  • Format in which the data is stored
  • Compression codec
  • Location of the data
  • Is the data partitioned/bucketed/sorted?

Metadata in Hive

[Diagram: the Hive metastore]

Metadata

• The Hive metastore has become the de facto metadata repository.
• HCatalog makes the Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.).

Hive + HCatalog

Orchestration

Orchestration

• Once the data is in Hadoop, we need a way to manage workflows in our architecture.
  • Scheduling and tracking MapReduce jobs, Hive jobs, etc.
• Several options here:
  • Cron
  • Oozie, Azkaban
  • Third-party tools: Talend, Pentaho, Informatica, enterprise schedulers

Oozie

• Supports defining and executing a sequence of jobs.
• Can trigger jobs based on external dependencies or schedules.
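A minimal sketch of submitting a workflow from Java with the Oozie client API (the workflow itself is defined in a workflow.xml deployed to HDFS, which is not shown here); the Oozie URL, application path, and properties are hypothetical.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitClickstreamWorkflow {
  public static void main(String[] args) throws Exception {
    // Hypothetical Oozie server
    OozieClient oozie = new OozieClient("http://oozie-host.example.com:11000/oozie");

    Properties conf = oozie.createConfiguration();
    // Points at a workflow.xml previously deployed to HDFS
    conf.setProperty(OozieClient.APP_PATH, "hdfs://nameservice1/app/oozie/clickstream-etl");
    conf.setProperty("nameNode", "hdfs://nameservice1");
    conf.setProperty("jobTracker", "resourcemanager.example.com:8032");

    String jobId = oozie.run(conf);        // submit and start the workflow
    Thread.sleep(10 * 1000);               // give it a moment, then check on it
    WorkflowJob job = oozie.getJobInfo(jobId);
    System.out.println("Workflow " + jobId + " is " + job.getStatus());
  }
}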

Final Architecture

Final Architecture – High Level Overview

[Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis]


Final Architecture – Ingestion

[Diagram: web applications, each with a local Flume Avro agent, fan in to a tier of Flume agents (fan-in pattern, with multiple agents for failover and rolling restarts), which write the events to HDFS]


Final Architecture – Storage and Processing

[Diagram: raw web logs land in dated ETL directories (/etl/weblogs/20140331/, /etl/weblogs/20140401/, ...); data processing then produces derived datasets such as /data/marketing/clickstream/bouncerate/ and /data/marketing/clickstream/attribution/]


Final Architecture – Data Access

[Diagram: data in Hive/Impala is served to BI/analytics tools over JDBC/ODBC, exported to the DWH with Sqoop, and pulled to local disk with a DB import tool for analysis in R, etc.]

Contact Info

• Mark Grover
  • @mark_grover
  • www.linkedin.com/in/grovermark
• Jonathan Seidman
  • jseidman@cloudera.com
  • @jseidman
  • https://www.linkedin.com/pub/jonathan-seidman/1/26a/959
  • http://www.slideshare.net/jseidman
• Slides at slideshare.net/hadooparchbook

©2014 Cloudera, Inc. All Rights Reserved.