Workflow Hacks! #1
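Survey
• After the event, we will send a survey by email
• Questions
  • What kind of system are you currently using?
  • What problems do you want to solve with workflows?
• Respondents will be entered in a drawing for a Treasure Data hoodie!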
About Me: Taro L. Saito
2007 University of Tokyo. Ph.D. XML DBMS, Transaction Processing
Relational-Style XML Query [SIGMOD 2008]
~ 2014 Assistant Professor at University of Tokyo Genome Science Research
- Big Data Processing - Distributed Computing
2014.03~ Treasure Data, Inc. Tokyo
2015.07~ Treasure Data, Inc. Mountain View, CA
Cloud Platform for Data Analytics
• Importing 1,000,000~ records / sec.
• Presto (Distributed SQL engine)
  • 50,000~ queries / day
  • Processing 10 trillion records / day
  • http://qiita.com/xerial/items/a9093b60062f2c613fda
[Figure: platform diagram with labels Import, Store, Analyze with Presto/Hive (Distributed SQL Engine), Export, Enterprise Data, BI]
Workflow Fundamental Features
• Dependency management
  • task1 -> task2 -> task3 …
• Scheduling
• Execution monitoring
• State management
• Error handling
• Easy access to logs
• Notification
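Several of the features above (dependency management, state management, error handling) can be illustrated with a minimal sketch. It assumes a hypothetical in-memory runner: `run_workflow`, the task names, and the state labels are illustrative and not taken from any tool mentioned in this talk. Tasks run in dependency order, each task gets a recorded state, and downstream tasks are skipped when an upstream task fails.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_workflow(tasks, deps):
    """Run tasks in dependency order and return a state per task.

    tasks: dict of task name -> zero-argument callable (hypothetical tasks)
    deps:  dict of task name -> set of upstream task names
    """
    # Make sure tasks without dependencies still appear in the graph.
    graph = {name: set(deps.get(name, ())) for name in tasks}
    state = {}
    for name in TopologicalSorter(graph).static_order():
        # Skip a task if any upstream task did not succeed.
        if any(state[up] != "success" for up in graph[name]):
            state[name] = "skipped"
            continue
        try:
            tasks[name]()
            state[name] = "success"
        except Exception as e:  # error handling: record the failure and go on
            state[name] = f"failed: {e}"
    return state


def fail():
    raise RuntimeError("boom")


# task1 -> task2 -> task3, where task2 fails
tasks = {"task1": lambda: None, "task2": fail, "task3": lambda: None}
deps = {"task2": {"task1"}, "task3": {"task2"}}
states = run_workflow(tasks, deps)
```

A real workflow manager would also persist these states, schedule runs, and send notifications; none of that is modeled here.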
Workflow Tools
• Workflow Management Tools
  • Python: Luigi, Airflow, pinball
  • For Hadoop: Oozie (XML)
  • Script-based: Makefile, Azkaban
  • Biological Science: Galaxy (Web UI), nextflow
  • Domestic (Japan): JP1, Hinemos
• Dataflow DSL
  • Spark, Flink, DryadLINQ, TensorFlow
  • Cascading (Java -> MR), Scalding (Scala -> MR)
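Dataflow DSL
• Translate this data processing program
• into a cluster computing program
[Figure: logical dataflow A -> f -> B -> g -> C, translated into a map stage (f) over partitions A0, A1, A2 and a reduce stage (g) over partitions B0, B1, B2, producing C]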
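The translation from a logical dataflow (apply f, then g) to a partitioned map/reduce program can be sketched as follows. This is a single-process illustration, not how any particular engine does it: `run_dataflow` and its parameters are hypothetical names, and the per-partition reduce is only valid when `g` is associative and `initial` is its identity element.

```python
from functools import reduce


def run_dataflow(partitions, f, g, initial):
    """Apply f to every record of every partition (map), then fold with g (reduce)."""
    # Map stage: each partition A_i becomes B_i independently,
    # so a cluster could process the partitions in parallel.
    mapped = [[f(x) for x in part] for part in partitions]
    # Reduce stage: combine each partition locally, then merge the partial results.
    partials = [reduce(g, part, initial) for part in mapped]
    return reduce(g, partials, initial)


A = [[1, 2], [3, 4], [5]]  # partitions A0, A1, A2
C = run_dataflow(A, f=lambda x: x * x, g=lambda a, b: a + b, initial=0)
```

Here `C` is the sum of squares over all partitions. The user writes only `f` and `g`; how the data is partitioned is left to the runtime.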
Redbook: Dataflow Engines
• Chapter 5: Large-Scale Dataflow Engine, by Peter Bailis
  • http://www.redbook.io/ch5-dataflow.html
• DryadLINQ
  • Most influential interface for dataflow DSL
  • SQL-like operation
  • Functional style
• Spark
  • SparkSQL
    • 70% of Spark accesses
  • Dataset API
    • Shift to the dataframe based API
Dataflow -> Execution Plan
• Example - Hive: SQL to MapReduce
  • Mapping SQL stages into MapReduce program
  • SELECT page, count(*) FROM weblog GROUP BY page
[Figure: HDFS -> split -> map (A0, A1, A2) -> reduce (B0, B1, B2, B3) -> merge -> HDFS, annotated with the plan TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]
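To make the mapping from `SELECT page, count(*) FROM weblog GROUP BY page` to map and reduce stages concrete, here is a minimal single-process sketch. The function names (`map_stage`, `shuffle`, `reduce_stage`) and the sample records are assumptions for illustration; Hive's actual generated plan is more involved.

```python
from collections import defaultdict


def map_stage(split):
    # TableScan(weblog): emit (page, 1) for every record in this split.
    return [(record["page"], 1) for record in split]


def shuffle(pairs, num_reducers):
    # GroupBy(hash(page)): route each key to exactly one reducer by hash.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for page, one in pairs:
        buckets[hash(page) % num_reducers][page].append(one)
    return buckets


def reduce_stage(bucket):
    # count(*): sum the ones collected for each page.
    return {page: sum(ones) for page, ones in bucket.items()}


# Two input splits of a hypothetical weblog table
splits = [
    [{"page": "/a"}, {"page": "/b"}],
    [{"page": "/a"}],
]
pairs = [kv for split in splits for kv in map_stage(split)]
result = {}
for bucket in shuffle(pairs, num_reducers=2):
    result.update(reduce_stage(bucket))  # merge the reducer outputs
```

Because every page is routed to a single reducer, the per-reducer counts can be merged without further aggregation.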
Workflows
[Figure: workflow graph with nodes A to G and functions f, g]
Hadoop is not enough
• C. Olston et al. [SIGMOD 2011]
  • continuous processing
  • independent scheduling
• Incremental processing
  • Google Percolator [OSDI 2010]
  • Naiad - Differential Dataflow, Microsoft [SOSP 2013]
Continuous Processing
• The Dataflow Model
  • Akidau et al., Google [VLDB 2015]
• Unbounded data processing
  • late-coming data
• Integration of
  • batch processing
  • accumulation
Cluster Computing with Dryad M. Budiu, 2008
Workflow Hacks!
Airflow
Airflow
• Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015)
  • At Silicon Valley Data Engineering Meetup
  • https://youtu.be/dgaoqOZlvEA
Workflow Development
• Programmatic
  • Generate workflows by code
  • Configuration as Code
• Workflow reuse/overwrite
  • object oriented
• Parameterization
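As a rough illustration of "generate workflows by code" and parameterization, the sketch below builds a workflow definition from a date and a list of tables. The function `build_daily_workflow`, the task naming scheme, and the import -> aggregate -> report shape are all hypothetical; tools such as Airflow or Luigi have their own APIs for this.

```python
def build_daily_workflow(date, tables):
    """Return a workflow as {task_name: set_of_upstream_task_names}.

    The naming scheme and the import -> aggregate -> report shape are
    hypothetical, chosen only to illustrate generation by code.
    """
    workflow = {}
    for table in tables:
        imp = f"import_{table}_{date}"
        agg = f"aggregate_{table}_{date}"
        workflow[imp] = set()   # no upstream tasks
        workflow[agg] = {imp}   # aggregate after import
    # One final task depends on every aggregate task.
    workflow[f"report_{date}"] = {t for t in workflow if t.startswith("aggregate_")}
    return workflow


wf = build_daily_workflow("2015-12-01", ["weblog", "purchase"])
```

Adding a table or changing the date is a one-argument change, which is the main appeal over hand-written static configuration.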
Luigi
• Workflow management with Luigi (article in Japanese)
  • http://qiita.com/k24d/items/fb9bed08423e6249d376
Nextflow
• http://www.nextflow.io/
Dataflow DSL vs Workflow DSL
• Dataflow
  • A -> B -> C -> …
  • Data dependencies
• Workflow
  • Task A -> Task B -> Task C -> …
  • Task dependencies
  • Data transfer is optional (through file or DB)
  • + Scheduling
  • + Task names
    • For monitoring, redo, etc.
Weavelet (wvlet)
• Object-oriented workflow DSL for Scala
  • Workflow reuse, extension, override
  • Parameterization
  • Function := Task, Workflow := Class
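The idea "Function := Task, Workflow := Class" can be approximated in plain Python. This is not wvlet's Scala API, only an analogy under that assumption; the class and method names are hypothetical.

```python
class DailyReport:
    """A workflow as a class: each method is a task (names are hypothetical)."""

    def __init__(self, date):
        self.date = date  # parameterization

    def source(self):
        return [3, 1, 2]

    def transform(self):
        return sorted(self.source())  # depends on source()

    def run(self):
        return f"{self.date}: {self.transform()}"


class DescendingReport(DailyReport):
    """Reuse the whole workflow and override a single task."""

    def transform(self):
        return sorted(self.source(), reverse=True)


base = DailyReport("2015-12-01").run()
extended = DescendingReport("2015-12-01").run()
```

Ordinary inheritance gives reuse, extension, and override: the subclass replaces one task and inherits everything else.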
Isolating DAG generation and its execution
• Alternatives to MR
  • Tez
  • Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
  • Asakusa on Hadoop, Spark
[Figure: DSL generates DAG; the DAG runs on Local, Hadoop, or Spark to produce the Result]
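Separating plan generation from execution can be sketched with a tiny lazy DSL. Everything here is hypothetical (`Dataset`, `explain`, `run_local`), and the plan is a linear chain rather than a general DAG, but it shows the split: the DSL only records operations, and a separate backend decides how to run them.

```python
class Dataset:
    """A lazy dataset: operations are recorded as a plan, not executed."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        return Dataset(self.source, self.ops + (("map", fn),))

    def filter(self, fn):
        return Dataset(self.source, self.ops + (("filter", fn),))


def explain(ds):
    # One possible "backend": only describe the plan.
    return ["source"] + [kind for kind, _ in ds.ops]


def run_local(ds):
    # Another backend: execute the same plan eagerly in this process.
    data = list(ds.source)
    for kind, fn in ds.ops:
        if kind == "map":
            data = [fn(x) for x in data]
        else:  # "filter"
            data = [x for x in data if fn(x)]
    return data


plan = Dataset(range(6)).map(lambda x: x * 10).filter(lambda x: x >= 30)
steps = explain(plan)
output = run_local(plan)
```

A Hadoop or Spark backend would consume the same recorded plan and translate it into jobs for that engine instead of running list comprehensions.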
Stream DSL
• Add "moving stream" support to Dataflow DSL
  • "moving" streams and "resting" datasets
• Example
  • Spark Streaming
    • Spark DSL + Micro-batch for stream
  • Microsoft Azure Stream SQL
    • Windowing support for moving data
  • Norikra
    • Stream processing with SQL
• Reactive programming
  • ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG)
  • Back-pressure support
    • Controlling data transfer speed from receiver side
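The micro-batch idea (treating a "moving" stream as a sequence of small "resting" datasets) can be sketched with Python generators. The names `micro_batches` and `running_counts` are hypothetical, and batching by record count is a simplification: Spark Streaming cuts batches by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a possibly unbounded iterator into small "resting" batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_counts(stream, batch_size):
    # Apply an ordinary batch operation (count per key) to each micro-batch
    # and carry the totals across batches.
    totals = {}
    for batch in micro_batches(stream, batch_size):
        for key in batch:
            totals[key] = totals.get(key, 0) + 1
        yield dict(totals)  # snapshot after each batch


events = iter(["/a", "/b", "/a", "/a", "/b"])
snapshots = list(running_counts(events, batch_size=2))
```

Because the generator is pull-based, the consumer decides when the next batch is read, which is loosely related to the receiver-side rate control that back-pressure provides in reactive stream libraries.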
Task Execution Retry
• Design patterns for retries and idempotency (article in Japanese)
  • http://frsyuki.hatenablog.com/entry/2014/06/09/164559
• System failures
  • Process is not responding
    • network, hardware failures
  • Middleware failures
    • provisioning failures, missing components
• User failures
  • Wrong configuration
  • Programming error
Retry Example
• Example: Task calling a REST API /create/xxx
  • Client: First attempt
    • Server returns 200 Success
    • But the client fails to receive the status code
  • Client retries the task
    • Gets a 409 Conflict error (entry xxx is already created)
• Solution (Application side)
  • Handle the 409 error as success in the client (idempotent execution)
  • More strict approach
    • Making xxx unique for each request
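The retry scenario above can be reproduced with a small simulation. `FakeServer` is a hypothetical in-memory stand-in for the REST API, and `create_idempotent` is an illustrative client helper, not code from any real library. The server creates the entry on the first attempt but the response is lost, so the retry sees 409, which the client treats as success.

```python
class FakeServer:
    """In-memory stand-in for the REST API (hypothetical)."""

    def __init__(self):
        self.entries = set()
        self.drop_next_response = True  # simulate one lost response

    def create(self, name):
        if name in self.entries:
            return 409  # conflict: entry is already created
        self.entries.add(name)
        if self.drop_next_response:
            self.drop_next_response = False
            raise ConnectionError("response lost")  # created, but client never hears
        return 200


def create_idempotent(server, name, max_attempts=3):
    """Retry /create/<name>, treating 409 as success."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = server.create(name)
        except ConnectionError:
            continue  # outcome unknown: retry
        if status in (200, 409):
            return status, attempt
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("gave up after retries")


server = FakeServer()
result = create_idempotent(server, "xxx")
```

Treating 409 as success is only safe if no other client could have created the same entry, which is why the stricter approach makes `xxx` unique for each request.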
Fault Tolerance
• Presto: Distributed query engine developed by Facebook
  • Uses HTTP data transfer
  • No fault-tolerance
• 99.5% of queries finish without any failure
  • For queries processing 10 billion or more rows => Drops to 85%
[Figure: split -> map (A0, A1, A2) -> reduce (B0, B1, B2, B3) -> merge, annotated with the plan TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]
Summary
• Recent workflow tools
  • Driven by Python community
    • Because of this book! (=>)
  • Airflow, Luigi, etc.
• Workflow manager
  • Handle system failures, monitoring
• Workflow development
  • DAG based DSL (dataflow, workflow, stream processing) -> Execution
• Does not cover application logic errors
  • Idempotent execution
  • Requires splitting large tasks into smaller ones