Workflow Hacks #1 - dots. Tokyo

Workflow Hacks! #1

Taro L. Saito

Dec. 14, 2015

dots. Tokyo, Japan


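Survey

• A survey will be sent by email after the event
• Questions
    • What systems are you currently using?
    • What problems do you want to solve with workflows?
• Respondents will be entered in a drawing for a Treasure Data hoodie!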


About Me: Taro L. Saito


2007 Ph.D., University of Tokyo. XML DBMS, Transaction Processing
    • Relational-Style XML Query [SIGMOD 2008]
~2014 Assistant Professor at University of Tokyo, Genome Science Research
    • Big Data Processing
    • Distributed Computing
2014.03~ Treasure Data, Inc., Tokyo
2015.07~ Treasure Data, Inc., Mountain View, CA

Cloud Platform for Data Analytics

• Importing 1,000,000~ records / sec.
• Presto (Distributed SQL engine)
    • 50,000~ queries / day
    • Processing 10 trillion records / day
    • http://qiita.com/xerial/items/a9093b60062f2c613fda

[Figure: platform overview. Labels: Enterprise Data, Import, Store, Analyze with Presto/Hive (Distributed SQL Engine), Export, BI]

Workflow Fundamental Features

• Dependency management
    • task1 -> task2 -> task3 …
• Scheduling
• Execution monitoring
• State management
• Error handling
• Easy access to logs
• Notification
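Several of these features (dependency management, state management, error handling, notification) can be illustrated in a few lines. The following is a minimal sketch, not how any particular workflow tool is implemented. It assumes Python 3.9+ for the standard `graphlib` module; `run_workflow` and the `print`-based notification are hypothetical stand-ins, and the task names follow the slide.

```python
from graphlib import TopologicalSorter

def run_workflow(dependencies, actions):
    """Run each task after all of its upstream tasks; return per-task state.

    dependencies: maps a task name to the set of tasks it depends on.
    actions: maps a task name to a zero-argument callable (hypothetical).
    """
    state = {}
    # static_order() yields upstream tasks before the tasks that need them
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        if any(state[u] != "success" for u in upstream):
            state[name] = "skipped"  # an upstream task did not succeed
            continue
        try:
            actions[name]()
            state[name] = "success"
        except Exception as exc:
            state[name] = "failed"
            print(f"[notify] {name} failed: {exc}")  # stand-in for notification
    return state

# task1 -> task2 -> task3
log = []
deps = {"task2": {"task1"}, "task3": {"task2"}}
actions = {n: (lambda n=n: log.append(n)) for n in ("task1", "task2", "task3")}
result = run_workflow(deps, actions)
```

Scheduling, log access, and monitoring are left out; a real workflow manager layers them on top of this kind of dependency-ordered loop.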


Workflow Tools

• Workflow Management Tools
    • Python: Luigi, Airflow, Pinball
    • For Hadoop: Oozie (XML)
    • Script-based: Makefile, Azkaban
    • Biological Science: Galaxy (Web UI), Nextflow
    • Domestic (Japan): JP1, Hinemos
• Dataflow DSL
    • Spark, Flink, DryadLINQ, TensorFlow
    • Cascading (Java -> MR), Scalding (Scala -> MR)


Dataflow DSL

• Translate a data processing program
    [Figure: A -f-> B -g-> C]
• into a cluster computing program
    [Figure: partitions A0, A1, A2 -map (f)-> B0, B1, B2 -reduce (g)-> C]
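The translation can be sketched with a toy DSL. This is an illustrative sketch only: the `Dataset` class is hypothetical and runs in a single process, whereas a real engine would ship each partition to a different machine. It assumes the reduce function `g` is associative, so partial results can be combined in any grouping.

```python
from functools import reduce

class Dataset:
    """A lazily evaluated, partitioned dataset (hypothetical toy DSL)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # e.g. A0, A1, A2
        self.ops = ops                # recorded map functions, not yet run

    def map(self, f):
        # Only record the operation; nothing is computed here
        return Dataset(self.partitions, self.ops + (f,))

    def reduce(self, g):
        def run_partition(part):
            # Apply the recorded maps to one partition, then reduce it locally
            for f in self.ops:
                part = [f(x) for x in part]
            return reduce(g, part)

        # Empty partitions contribute nothing and are skipped
        partials = [run_partition(p) for p in self.partitions if p]
        return reduce(g, partials)  # combine the per-partition results

A = Dataset([[1, 2], [3, 4], [5]])
C = A.map(lambda x: x * x).reduce(lambda a, b: a + b)  # sum of squares
```

The user writes `A.map(f).reduce(g)` as if A were a single collection; the partitioning is a detail of the execution.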


Redbook: Dataflow Engines

• Chapter 5: Large-Scale Dataflow Engines, by Peter Bailis
    • http://www.redbook.io/ch5-dataflow.html
• DryadLINQ
    • Most influential interface for dataflow DSL
    • SQL-like operation
    • Functional style
• Spark
    • SparkSQL
        • 70% of Spark accesses
    • Dataset API
        • Shift to the dataframe based API


Dataflow -> Execution Plan

• Example - Hive: SQL to MapReduce
    • Mapping SQL stages into a MapReduce program
    • SELECT page, count(*) FROM weblog GROUP BY page

[Figure: HDFS -> split -> map -> reduce -> merge -> HDFS. Stages: TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]
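The mapping from the SQL stages to map and reduce phases can be sketched in plain Python. This is a single-process illustration under assumed names (`map_phase`, `shuffle`, `reduce_phase`, and `count_by_page` are hypothetical), not what Hive actually generates.

```python
from collections import defaultdict

def map_phase(split):
    # TableScan(weblog): emit (page, 1) for every row of one input split
    return [(row["page"], 1) for row in split]

def shuffle(pairs, num_reducers):
    # GroupBy(hash(page)): route each key to one reducer by its hash
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for page, one in pairs:
        buckets[hash(page) % num_reducers][page].append(one)
    return buckets

def reduce_phase(bucket):
    # count(weblog of a page): sum the ones collected for each page
    return {page: sum(ones) for page, ones in bucket.items()}

def count_by_page(splits, num_reducers=2):
    pairs = [kv for split in splits for kv in map_phase(split)]
    result = {}
    for bucket in shuffle(pairs, num_reducers):
        result.update(reduce_phase(bucket))  # merge the reducer outputs
    return result

weblog_splits = [
    [{"page": "/a"}, {"page": "/b"}],
    [{"page": "/a"}],
]
counts = count_by_page(weblog_splits)
```

Because every row of a given page hashes to the same reducer, each reducer can count its pages without talking to the others.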


Workflows

[Figure: workflow DAG over nodes A to G, with edges labeled f and g]

Hadoop is not enough

• C. Olston et al. [SIGMOD 2011]
    • continuous processing
    • independent scheduling
• Incremental processing
    • Google Percolator [OSDI 2010]
    • Naiad - Differential Dataflow, Microsoft [SOSP 2013]


Continuous Processing

• The Dataflow Model
    • Akidau et al., Google [VLDB 2015]
• Unbounded data processing
    • late-coming data
• Integration of
    • batch processing
    • accumulation


[Figure: Cluster Computing with Dryad, M. Budiu, 2008]

Workflow Hacks!


Airflow

• Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015)
    • At Silicon Valley Data Engineering Meetup
    • https://youtu.be/dgaoqOZlvEA


Workflow Development

• Programmatic
    • Generate workflows by code
    • Configuration as Code
• Workflow reuse/override
    • object oriented
• Parameterization
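"Generate workflows by code" and "Parameterization" can be sketched as follows. This is an illustrative sketch in plain Python rather than the API of Airflow or Luigi; the function names and the command strings (`extract`, `load`, `report`) are hypothetical.

```python
def make_import_workflow(table, date):
    """Return (tasks, dependencies) for importing one table on one date."""
    tasks = {
        f"extract_{table}": f"extract --table {table} --date {date}",
        f"load_{table}": f"load --table {table} --date {date}",
    }
    deps = {f"load_{table}": {f"extract_{table}"}}
    return tasks, deps

def make_daily_workflow(tables, date):
    """Reuse the per-table workflow for every table, then add a final report."""
    tasks = {"report": f"report --date {date}"}
    deps = {"report": set()}
    for table in tables:
        sub_tasks, sub_deps = make_import_workflow(table, date)
        tasks.update(sub_tasks)
        deps.update(sub_deps)
        deps["report"].add(f"load_{table}")  # report waits for every load
    return tasks, deps

tasks, deps = make_daily_workflow(["users", "events"], "2015-12-14")
```

Adding a table means adding one string to a list, not copying a block of configuration, which is the practical benefit of treating configuration as code.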


Luigi

• Workflow management with Luigi (Luigiによるワークフロー管理)
    • http://qiita.com/k24d/items/fb9bed08423e6249d376


Nextflow

• http://www.nextflow.io/


Dataflow DSL vs Workflow DSL

• Dataflow
    • A -> B -> C -> …
    • Data dependencies
• Workflow
    • Task A -> Task B -> Task C -> …
    • Task dependencies
    • Data transfer is optional (through file or DB)
    • + Scheduling
    • + Task names
        • For monitoring, redo, etc.


Weavelet (wvlet)

• Object-oriented workflow DSL for Scala
    • Workflow reuse, extension, override
    • Parameterization
    • Function := Task, Workflow := Class
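The idea "Function := Task, Workflow := Class" can be illustrated in Python. This is only an analogy: it is not Weavelet's syntax (Weavelet is a Scala DSL), and the class and method names below are hypothetical.

```python
class DailyReport:
    """A workflow is a class; each method is a task."""

    def __init__(self, date):
        self.date = date  # parameterization through the constructor

    def extract(self):
        return [3, 1, 2]  # placeholder data for the sketch

    def transform(self, rows):
        return sorted(rows)

    def run(self):
        # Task dependencies are expressed as ordinary function calls
        return self.transform(self.extract())

class DescendingReport(DailyReport):
    """Reuse the workflow, overriding a single task."""

    def transform(self, rows):
        return sorted(rows, reverse=True)

ascending = DailyReport("2015-12-14").run()
descending = DescendingReport("2015-12-14").run()
```

Inheritance gives workflow reuse and extension for free: the subclass changes one task and keeps the rest of the pipeline.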


Isolating DAG generation and its execution

• Alternatives of MR
    • Tez
    • Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
    • Asakusa on Hadoop, Spark

[Figure: DSL generates DAG -> executed on Local, Hadoop, or Spark -> Result]
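Separating plan generation from execution can be sketched as follows. This is a simplified illustration: the plan is a linear chain (the simplest DAG), and the two executors, a sequential one and a thread-pool one, are hypothetical stand-ins for backends such as local mode, Hadoop, or Spark.

```python
from concurrent.futures import ThreadPoolExecutor

def build_plan():
    # The DSL side: describe the computation without running anything
    return [("map", lambda x: x + 1), ("filter", lambda x: x % 2 == 0)]

def apply_plan(plan, rows):
    # Interpret the plan for one partition
    for kind, fn in plan:
        if kind == "map":
            rows = [fn(x) for x in rows]
        else:  # "filter"
            rows = [x for x in rows if fn(x)]
    return rows

def run_local(plan, partitions):
    # Backend 1: run the partitions one after another
    return [apply_plan(plan, p) for p in partitions]

def run_threaded(plan, partitions):
    # Backend 2: run the partitions concurrently in a thread pool
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: apply_plan(plan, p), partitions))

plan = build_plan()
partitions = [[1, 2, 3], [4, 5]]
local_result = run_local(plan, partitions)
threaded_result = run_threaded(plan, partitions)
```

Since the plan is plain data, the same program can be tested locally and then handed to a different backend without rewriting it.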


Stream DSL

• Add "moving stream" support to Dataflow DSL
    • "moving" streams and "resting" datasets
• Example
    • Spark Streaming
        • Spark DSL + Micro-batch for stream
    • Microsoft Azure Stream SQL
        • Windowing support for moving data
    • Norikra
        • Stream processing with SQL
• Reactive programming
    • ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG)
    • Back-pressure support
        • Controlling data transfer speed from receiver side
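Back-pressure can be illustrated with a bounded buffer: when the receiver falls behind, the buffer fills up and the sender is forced to wait. The sketch below uses Python's standard `queue` and `threading` modules; it is not how ReactiveX or Akka Streams implement back-pressure, and `run_pipeline` is a hypothetical name.

```python
import queue
import threading

def run_pipeline(items, capacity=2):
    """Send items to a consumer thread through a bounded buffer."""
    buf = queue.Queue(maxsize=capacity)  # the bound is what creates back-pressure
    received = []

    def consumer():
        while True:
            item = buf.get()
            if item is None:  # sentinel: no more data (items must not be None)
                break
            received.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        buf.put(item)  # blocks while the buffer is full, slowing the producer
    buf.put(None)
    worker.join()
    return received

output = run_pipeline(range(10))
```

The producer never holds more than `capacity` unconsumed items in the buffer, so the receiver's pace bounds memory use on the path between them.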


Task Execution Retry

• Design patterns for retries and idempotency (リトライと冪等性のデザインパターン)
    • http://frsyuki.hatenablog.com/entry/2014/06/09/164559
• System failures
    • Process is not responding
        • network, hardware failures
    • Middleware failures
        • provisioning failures, missing components
• User failures
    • Wrong configuration
    • Programming error
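The distinction matters for retry policy. The sketch below assumes that retrying helps with system failures (they are often transient) but not with user failures (the same configuration or bug fails again); that policy is an assumption of this sketch, and the exception classes and `run_with_retry` are hypothetical names.

```python
import time

class SystemFailure(Exception):
    """Transient problem: network, hardware, or middleware (hypothetical)."""

class UserFailure(Exception):
    """Wrong configuration or programming error (hypothetical)."""

def run_with_retry(task, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Retry a task on SystemFailure with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except SystemFailure:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1x, 2x, 4x, ...
        # UserFailure is not caught: it propagates immediately, without retry

calls = {"n": 0}

def flaky_task():
    # Fails twice with a transient error, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise SystemFailure("process is not responding")
    return "done"

outcome = run_with_retry(flaky_task)
```

Retrying is only safe when the task can run more than once without harm, which is why retry and idempotency are discussed together.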


Retry Example

• Example: Task calling a REST API /create/xxx
    • Client: First attempt
        • Server returns 200 Success
        • But the client failed to get the status code
    • Client retries the task
        • Gets 409 conflict error (entry xxx is already created)
• Solution (Application side)
    • Handle 409 error as success in the client (idempotent execution)
    • More strict approach
        • Making xxx unique for each request
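The scenario can be replayed with an in-memory stand-in for the server. This is a sketch under assumptions: `FakeServer` and `create_entry` are hypothetical, and the lost response is simulated with a flag instead of a real network failure.

```python
class FakeServer:
    """In-memory stand-in for the REST endpoint /create/xxx."""

    def __init__(self):
        self.entries = set()

    def create(self, name):
        if name in self.entries:
            return 409  # conflict: the entry is already created
        self.entries.add(name)
        return 200

def create_entry(server, name, attempts=2, drop_first_response=False):
    """Create an entry, treating 409 as success (idempotent execution)."""
    for attempt in range(attempts):
        status = server.create(name)
        if attempt == 0 and drop_first_response:
            continue  # the response was lost, so the client retries
        if status in (200, 409):
            return True  # a 409 on retry means an earlier attempt succeeded
    return False

server = FakeServer()
ok = create_entry(server, "xxx", drop_first_response=True)
```

Treating 409 as success is only correct if no other client could have created the same entry. That is the motivation for the stricter approach on the slide: if xxx is unique per request, a conflict can only come from the client's own earlier attempt.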


Fault Tolerance

• Presto: Distributed query engine developed by Facebook
    • Uses HTTP data transfer
    • No fault-tolerance
• 99.5% of queries finish without any failure
    • For queries processing 10 billion or more rows => Drops to 85%

[Figure: split -> map -> reduce -> merge. Stages: TableScan(weblog) -> GroupBy(hash(page)) -> count(weblog of a page) -> result]

Summary

• Recent workflow tools
    • Driven by Python community
        • Because of this book! [Figure: book cover]
        • Airflow, Luigi, etc.
• Workflow manager
    • Handle system failures, monitoring
• Workflow development
    • DAG based DSL (dataflow, workflow, stream processing) -> Execution
    • Does not cover application logic errors
        • Idempotent execution
        • Requires splitting large tasks into smaller ones
