Data-driven Workflow Planning in Cluster management Systems Srinath Shankar David J DeWitt...

Data-driven Workflow Planning in Cluster management Systems

Srinath Shankar

David J DeWitt

Department of Computer Sciences

University of Wisconsin-Madison, USA

Data explosion in science

Scientific applications have traditionally been considered compute-intensive.

Data volumes have exploded in recent years:

Astronomy (hundreds of TB): Sloan Digital Sky Survey; LIGO (Laser Interferometer Gravitational-Wave Observatory)

Bioinformatics: BIRN (Biomedical Informatics Research Network); SwissProt (protein database)

Scientific workflows and files

A, B, C and D are programs.

File1 and File2 are pipeline (intermediate) files.

FileInput is a batch input file, common to all DAGs.

Jobs with dependencies are organized in Directed Acyclic Graphs (DAGs).

A large number of similar DAGs make up a workflow.

Distributed scientific computing

Scientists have exploited distributed computing to run their programs and workflows

One popular distributed computing system is Condor

Condor harvests idle CPU cycles on machines in a network

Condor has been installed on roughly 113,000 machines across 1,600 clusters around the world

But …

Several advances have been made since the development of Condor in the 1980s.

Machines are getting cheaper:

Organizations no longer rely solely on idle desktop machines for computing cycles.

The proportion of machines dedicated to Condor computing in a cluster is increasing.

Disk capacities are increasing:

A single machine may have 500 GB of disk space, so desktop machines may also have a lot of free disk space.

Dedicated and desktop machines have unused disk space: half a petabyte of disk space spread over a modest cluster of 1000 machines.

Focus

The volume of data processed by scientific applications is increasing.

How can we leverage distributed disk space to improve data management in cluster computing systems (like Condor)?

Step 1: Store workflow data across the disks of machines in a cluster

Step 2: Schedule workflows based on data location – Exploit disk space to improve workflow execution times

Overview of Condor

[Figure: Condor architecture. Components: planner, submit machine (user data) and execute machine (user process). Control flow: job info and machine info. Data flow: user input data and output data.]

Job and workflow submission

To submit a job, the user provides a “submit” file containing:

A complete job description: the input, output and error files, when to transfer these files, etc.

Machine preferences such as OS, CPU speed and memory.

Workflows are managed in a separate layer:

The user specifies dependencies between jobs in a separate “DAG description” file.

A DAG manager process (DAGMan) on the submit machine continuously monitors job completion events.

This process submits a job only when all its parents have completed.
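The release rule described above (a job is submitted only once all of its parents have completed) can be sketched in a few lines. This is an illustrative sketch, not DAGMan's code: the function name `submission_order` and the dictionary layout are hypothetical, and each job is assumed to complete as soon as it is submitted.

```python
from collections import defaultdict


def submission_order(parents):
    """Return jobs in the order a DAGMan-like monitor would submit them.

    `parents` maps each job name to the set of jobs it depends on.
    A job is released only after every one of its parents has completed.
    """
    remaining = {job: set(deps) for job, deps in parents.items()}
    children = defaultdict(set)
    for job, deps in parents.items():
        for dep in deps:
            children[dep].add(job)

    # Jobs without parents can be submitted immediately.
    ready = sorted(job for job, deps in remaining.items() if not deps)
    order = []
    while ready:
        job = ready.pop(0)  # "submit" the job; assume it then completes
        order.append(job)
        for child in sorted(children[job]):
            remaining[child].discard(job)
            if not remaining[child]:  # last parent finished
                ready.append(child)
    if len(order) != len(parents):
        raise ValueError("unsatisfiable dependencies; input is not a DAG")
    return order


# Hypothetical three-job DAG: C needs the outputs of both A and B.
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
print(submission_order(dag))  # ['A', 'B', 'C']
```

Note that the monitor only ever hands the planner jobs whose parents are done, which is why the planner itself never sees the dependency structure.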

Limitations of Condor

The “source” of files in Condor is the submit machine, or perhaps a shared or third-party file system.

File handling during workflow execution is inefficient: files are always transferred to and from the submit machine.

The planner only handles single jobs:

It has no direct knowledge of job dependencies.

It only sees a job after DAGMan submits it.

Distributed file caching

Keep the files of a job on the disk of machines after execution

Utilize local disks on execute machines as sources of files

Schedule dependent jobs on same machine whenever feasible

Avoid network file transfer.

Reduce overall workflow execution time.

Disk aware planning

Goal – reduce workflow execution time by minimizing file transfers

Planner must be aware of the locations of cached files

Requires a planner that is also aware of workflow structure

Two phase planning algorithm

AssignDAGs: Each DAG in a workflow is tentatively assigned to the best machine based on disk cache contents.

But assigning whole DAGs ignores inter-job parallelism.

Parallelize: Exploit parallelism within a DAG to distribute load.

A cost-benefit analysis is used when scheduling dependent jobs on different machines.

Planning example

Suppose we have 4 machines available to run the sample workflow.

[Figure: Sample DAG with jobs A, B and C and pipeline files F1 and F2. Sample workflow of 6 such DAGs.]

Assignment of DAGs

For each DAG in the workflow, we determine the machine that will result in the earliest completion time for that DAG, and assign it to that machine.

DAG runtime = sum of job runtimes and file transfer times.

File transfer times depend on the cache contents of the machine.

Effectively, each DAG is treated like a single job in this phase.
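The runtime estimate above can be written out as a small function. This is a minimal sketch under stated assumptions: the name `dag_runtime`, the units (seconds, MB, MB/s) and the example numbers are illustrative and not taken from the planner. Because the whole DAG runs on one machine, only files that are not already in that machine's cache contribute transfer time.

```python
def dag_runtime(job_runtimes, file_sizes, cached, net_bw):
    """Estimate the completion time of one DAG on one machine, in seconds.

    job_runtimes: seconds per job in the DAG
    file_sizes:   {file name: size in MB} for files the DAG must read
    cached:       set of file names already in the machine's disk cache
    net_bw:       network bandwidth in MB/s
    """
    # Only files missing from the local cache have to cross the network.
    transfer_mb = sum(size for name, size in file_sizes.items()
                      if name not in cached)
    return sum(job_runtimes) + transfer_mb / net_bw


# Hypothetical DAG: three 10-minute jobs and a 4000 MB batch input,
# on a 100 Mbps link (12.5 MB/s).
jobs = [600, 600, 600]
inputs = {"FileInput": 4000}
print(dag_runtime(jobs, inputs, cached=set(), net_bw=12.5))          # 2120.0
print(dag_runtime(jobs, inputs, cached={"FileInput"}, net_bw=12.5))  # 1800.0
```

The 320-second gap between the two results is exactly what makes a machine with a warm cache the "best" machine for a DAG.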

Schedule after AssignDAGs

[Figure: Schedule after AssignDAGs. Jobs A, B and C of each DAG laid out on machines M1 to M4 against time. Jobs in the same DAG are of the same color.]

The schedule produced after AssignDAGs entails no transfer of intermediate files.

Assignment phase (contd.)

While DAGs are being assigned, a cumulative runtime is maintained for each machine

Once a DAG has been scheduled on a machine, we assume that machine caches the workflow batch input (common to all DAGs)

Thus, batch input transfer times are not included in calculations of the runtime of other DAGs on that machine
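The assignment phase, with its per-machine cumulative runtime and the batch-input caching assumption, can be sketched as a greedy loop. This is a simplified illustration, not the planner itself: the name `assign_dags` is hypothetical, every DAG is reduced to a single total runtime, and the batch input is the only cached file considered.

```python
def assign_dags(dags, machines, batch_input_mb, net_bw):
    """Greedy sketch of the assignment phase.

    dags:     {dag name: total job runtime in seconds}
    machines: list of machine names, all with initially empty caches
    Returns ({dag: machine}, {machine: cumulative runtime in seconds}).
    """
    load = {m: 0.0 for m in machines}
    has_batch_input = {m: False for m in machines}
    placement = {}
    for dag, runtime in dags.items():
        def finish_time(m):
            # Batch input is transferred only to machines that lack it.
            transfer = 0.0 if has_batch_input[m] else batch_input_mb / net_bw
            return load[m] + transfer + runtime

        best = min(machines, key=finish_time)  # earliest completion time
        load[best] = finish_time(best)
        has_batch_input[best] = True           # assumed cached from now on
        placement[dag] = best
    return placement, load


# Hypothetical workflow: 6 DAGs of 30 minutes each on 4 machines,
# 4000 MB of batch input, 12.5 MB/s of bandwidth.
dags = {f"DAG{i}": 1800 for i in range(1, 7)}
placement, load = assign_dags(dags, ["M1", "M2", "M3", "M4"], 4000, 12.5)
print(placement)
print(load)
```

With these numbers two machines end up with two DAGs each and two machines with one, which is the uneven load that the next phase tries to correct.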

Parallelization of DAGs

After the assignment phase, the load on machines is uneven:

There are “extra” DAGs on a few heavily loaded machines.

There are some machines with a much lighter load.

Exploit inter-job parallelism to distribute load:

The “extra” DAGs are examined in turn.

If two jobs in a DAG can be run in parallel, we try to move one of them to a lightly loaded machine.

Parallelization – Costs and benefits

Cost of parallelization: when you move a job to a different machine than its parents and children, its input and output files have to be transferred to and from that machine.

Cost = (input_size + output_size) / net_BW

input_size and output_size are the sizes of the input files and output files for the job.

net_BW is the network bandwidth.

Cost is the time taken to perform data transfers to and from the different machines.

Benefit = time saved due to parallel execution of jobs.
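A short worked example may help make the cost formula concrete. The sketch below assumes sizes in megabits and bandwidth in megabits per second; the decision rule "move the job only if the benefit exceeds the cost" and the function names are assumptions made for illustration, since the slides only state that a cost-benefit analysis is used.

```python
def parallelization_cost(input_size, output_size, net_bw):
    """Cost = (input_size + output_size) / net_BW, in seconds.

    Sizes are in megabits and net_bw is in megabits per second.
    """
    return (input_size + output_size) / net_bw


def worth_moving(input_size, output_size, net_bw, time_saved):
    """Assumed rule: move the job only if time saved exceeds transfer cost."""
    return time_saved > parallelization_cost(input_size, output_size, net_bw)


# 1 GB in and 1 GB out (8000 megabits each) over a 100 Mbps network.
print(parallelization_cost(8000, 8000, 100))          # 160.0 seconds
print(worth_moving(8000, 8000, 100, time_saved=600))  # True
print(worth_moving(8000, 8000, 100, time_saved=120))  # False
```

With 10-minute jobs and 1 GB files, moving a job costs under three minutes of transfer, so running it in parallel pays off; with a two-minute saving it does not.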

Final Schedule

[Figure: Final schedule on machines M1 to M4 against time, with network transfers of file F2 marked.]

In the final schedule, files are transferred from M2 to M1 and from M4 to M3.

Parallelization (contd.)

In the formula for the cost of parallelization, input_size and output_size are adjusted for files already cached on either machine

If a job being considered for parallelization has no children, output_size is taken as 0 since its output files do not need to be transferred back
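The two adjustments above can be folded into the cost formula as follows. This is one possible reading, labeled as an assumption: inputs already cached on the target machine are not counted, outputs already cached on the machine where the children run are not counted, and a job with no children contributes no output transfer. The name `adjusted_cost` and its parameters are hypothetical.

```python
def adjusted_cost(inputs, outputs, cached_on_target, cached_on_origin,
                  has_children, net_bw):
    """Transfer cost in seconds for moving one job to another machine.

    inputs, outputs:  {file name: size in MB}
    cached_on_target: files already on the machine the job would move to
    cached_on_origin: files already on the machine where its children run
    net_bw:           network bandwidth in MB/s
    """
    # Inputs the target machine already caches need no transfer.
    input_size = sum(s for f, s in inputs.items() if f not in cached_on_target)
    if has_children:
        output_size = sum(s for f, s in outputs.items()
                          if f not in cached_on_origin)
    else:
        output_size = 0  # nothing has to be shipped back
    return (input_size + output_size) / net_bw


inputs, outputs = {"F1": 1000}, {"F2": 1000}
print(adjusted_cost(inputs, outputs, set(), set(), True, 12.5))    # 160.0
print(adjusted_cost(inputs, outputs, set(), set(), False, 12.5))   # 80.0
print(adjusted_cost(inputs, outputs, {"F1"}, set(), True, 12.5))   # 80.0
```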

Implementation

The main feature is a database used to store:

File information: checksums, sizes, file type, file locations.

Job information: files used by jobs, job dependencies.

Workflow schedules, produced by the planner.

The Condor daemons were modified to connect directly to the database and perform inserts, updates and queries.
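To illustrate the kind of information listed above, here is a small schema sketch using SQLite. The table and column names are hypothetical and chosen only for this example; the slides do not describe the actual schema or which database system was used.

```python
import sqlite3

# Hypothetical schema covering file, job and schedule information.
SCHEMA = """
CREATE TABLE files (checksum TEXT PRIMARY KEY, name TEXT,
                    size_bytes INTEGER, file_type TEXT);
CREATE TABLE file_locations (checksum TEXT REFERENCES files(checksum),
                             machine TEXT);
CREATE TABLE jobs (job_id TEXT PRIMARY KEY, workflow TEXT, status TEXT);
CREATE TABLE job_files (job_id TEXT REFERENCES jobs(job_id),
                        checksum TEXT, role TEXT);
CREATE TABLE job_deps (parent TEXT, child TEXT);
CREATE TABLE schedule (job_id TEXT REFERENCES jobs(job_id),
                       machine TEXT, position INTEGER);
"""


def machines_caching(conn, job_id):
    """Return {machine: number of this job's input files cached there}."""
    rows = conn.execute(
        """SELECT l.machine, COUNT(*) FROM job_files jf
           JOIN file_locations l ON l.checksum = jf.checksum
           WHERE jf.job_id = ? AND jf.role = 'input'
           GROUP BY l.machine ORDER BY l.machine""",
        (job_id,),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO files VALUES ('abc', 'FileInput', 1000, 'batch')")
conn.execute("INSERT INTO file_locations VALUES ('abc', 'M1')")
conn.execute("INSERT INTO jobs VALUES ('A1', 'wf1', 'idle')")
conn.execute("INSERT INTO job_files VALUES ('A1', 'abc', 'input')")
print(machines_caching(conn, "A1"))  # {'M1': 1}
```

A query of this shape is what lets a planner ask where a job's inputs already live before deciding where to run it.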

Role of database

[Figure: Role of the database. Components: database, planner, submit machine (user data) and execute machine (file cache). Labeled flows: workflow and file info, cache info, schedule.]

Implementation – versioning

Versions of input and executables are determined by checksums computed at submission time

The versions of intermediate and output files are “derived” from the versions of the inputs and executables that produce them
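One way to picture derived versions is to hash the versions of everything that went into producing a file. The sketch below is an assumption-laden illustration: SHA-1, the function names, and the inclusion of the output file name in the hash are choices made here, not details given in the slides.

```python
import hashlib


def content_version(data: bytes) -> str:
    """Version of an input or executable: a checksum of its contents."""
    return hashlib.sha1(data).hexdigest()


def derived_version(executable_version, input_versions, output_name):
    """Version of an intermediate or output file, derived from its producers.

    The file itself is never read: its version follows from the versions
    of the executable and inputs that generate it.
    """
    h = hashlib.sha1()
    h.update(executable_version.encode())
    for version in sorted(input_versions):  # order-independent by assumption
        h.update(version.encode())
    h.update(output_name.encode())
    return h.hexdigest()


exe = content_version(b"program A, build 1")
seq = content_version(b"MKTAYIAKQR")
v1 = derived_version(exe, [seq], "File1")
v2 = derived_version(exe, [content_version(b"MKTAYIAKQX")], "File1")
print(v1 == derived_version(exe, [seq], "File1"))  # True: same producers
print(v1 == v2)                                    # False: an input changed
```

Under such a scheme a cached intermediate file can be recognized as reusable before the job that would regenerate it is ever run.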

Implementation – Distributed Storage

Before a job executes on a machine, its input files are retrieved:

Files available in the machine’s local cache are used directly.

Unavailable files are retrieved from other machines in the cluster. Any machine can serve as a file server.

After a job completes, its executable, input and output files are saved in the execute machine’s disk cache.

Once a job has completed, the database is updated with the new status and cache information.
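The retrieval and caching steps above can be sketched with plain dictionaries standing in for disk caches. The names `stage_inputs` and `record_completion` are hypothetical, the submit machine is modeled as just another cache holder, and picking the alphabetically first source is an arbitrary choice for this example.

```python
def stage_inputs(machine, needed, caches):
    """Decide where each input file of a job is fetched from.

    caches: {machine name: set of cached file names}
    Returns {file: source machine}; a source equal to `machine` is a
    local cache hit that needs no network transfer.
    """
    plan = {}
    for f in needed:
        if f in caches[machine]:
            plan[f] = machine  # use the local copy directly
            continue
        sources = sorted(m for m, files in caches.items() if f in files)
        if not sources:
            raise LookupError(f"{f} is not available anywhere in the cluster")
        plan[f] = sources[0]   # any machine can act as a file server
    return plan


def record_completion(machine, files_used, caches):
    """After completion, executable, inputs and outputs stay in the cache."""
    caches[machine].update(files_used)


caches = {"submit": {"A.exe", "FileInput"}, "M1": set(), "M2": {"F1"}}
print(stage_inputs("M1", ["A.exe", "F1"], caches))
record_completion("M1", {"A.exe", "F1", "F2"}, caches)
print(stage_inputs("M1", ["F2"], caches))  # F2 is now a local hit on M1
```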

Implementation – Workflow submission

An entire workflow is submitted at one time.

The workflow submission tools directly update the database with job and workflow information.

This information includes the files used by the workflow as well as the job dependencies in the workflow.

The planner directly uses the information in the database. Thus:

It has knowledge of job dependencies during planning.

It has knowledge of the locations of the relevant files during planning.

Performance testing

Comparison of three systems:

ORIG: the original Condor system.

DAG-C: our caching and DAG-oriented planning framework.

Job-C: the same caching mechanism as DAG-C, but no DAG-based planning. When a job is ready, it is matched to the machine that caches the most input.
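The Job-C matching rule can be illustrated with a short sketch. It assumes "most input" means the largest cached volume in MB (a count of files would be another reading), and the name `match_job` and the tie-breaking by machine name are choices made for this example.

```python
def match_job(inputs, caches):
    """Pick the machine that caches the largest volume of a job's inputs.

    inputs: {file name: size in MB}
    caches: {machine name: set of cached file names}
    Ties are broken by machine name so the result is deterministic.
    """
    def cached_volume(machine):
        return sum(size for f, size in inputs.items() if f in caches[machine])

    # max() returns the first maximal element, so sorting fixes tie order.
    return max(sorted(caches), key=cached_volume)


# Job C needs F1 and F2, which its parents left on different machines.
inputs = {"F1": 1000, "F2": 1000}
caches = {"M1": {"F1"}, "M2": {"F2"}, "M3": set()}
print(match_job(inputs, caches))  # M1
```

In this example the parents ran on different machines, so whichever machine is chosen, one 1000 MB file still has to cross the network. Planning per job cannot avoid that situation because it is decided before the child is ever seen.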

Description of setup

Tested on BLAST and synthetic workflows with varying branch-in factor and pipeline volume.

Cluster of 25 execute machines; all files were in the same network.

Two submit machines.

Network bandwidth was 100 Mbps.

No shared file system was used.

All experiments were run with initially clean disk caches.

The BLAST workflow

BLAST is a sequence alignment workflow.

Given a protein sequence “seq”, blastall checks a database of known proteins for any similarities. Proteins with similar sequences are expected to have similar properties.

Javawrap converts the results into CSV and binary format for later use.

[Figure: BLAST DAG. Jobs: blastall (3.1 MB), javawrap (1 KB). Files: seq, seq.blast, seq.csv, seq.bin, nr_db.psq (986 MB), nr_db.pin (23 MB). Batch input (~4 GB): nr_db.psq, nr_db.pin, nr_db, nr_db.phr, nr.gz. Pipeline volume: seq.blast (~2 MB).]

BLAST results

[Figure: Running time (min) versus number of DAGs (25, 50, 75, 100) for ORIG, Job-C and DAG-C.]

Sensitivity to pipeline volume

F1, F2, G1 and G2 are distinct files.

10 minutes per job.

Varying size per file: 100 MB, 1 GB, 1.5 GB, 2 GB.

50 DAGs per workflow.

Pipeline I/O results

[Figure: Running time (min) versus size per file (100 MB, 1 GB, 1.5 GB, 2 GB) for ORIG, Job-C and DAG-C.]

DAG breadth

Files Fi and Gi are distinct.

Varied the branching factor (n) from 3 to 6.

10 minutes per job.

Tested a 50-DAG workflow with 1 GB per file.

DAG breadth results (1GB)

[Figure: Running time (min) versus DAG breadth n (3, 4, 5, 6) for ORIG, Job-C and DAG-C.]

Varying computation time

Size of each file set to 1 GB.

Varied the time per job from 10 to 30 minutes (i.e. time per DAG from 80 to 240 min).

Tested a 50-DAG workflow.

Increasing computation

[Figure: Workflow running time (min) versus running time per job (10, 15, 20, 25, 30 min) for ORIG, Job-C and DAG-C.]

Results – Summary

Job-C and DAG-C are better than ORIG:

In ORIG, all file traffic goes through the submit machine.

In Job-C and DAG-C, files can be retrieved from multiple locations.

Thus, caching helps.

DAG-C is significantly better than Job-C when pipeline volume and branching factor are high:

In Job-C, parent jobs often run on different machines.

Output files have to be transferred to the machine where their child executes.

Thus, DAG-oriented planning helps.

Distributed file caching – other benefits

Scientists frequently reuse files (such as executables). These can be used directly at their stored locations.

Maintaining user data:

‘What were the programs run to obtain this output?’

‘When did I last use a particular version of a file?’

Ongoing work

Planning:

Evaluating planning overhead and its dependence on DB size.

Making the planning scheme more responsive to job failure and machine failure.

A cache replacement policy based on an LRFU scheme has been implemented, but not validated (see paper for details). Ongoing work includes:

Validating the cache replacement policy and determining the best policy for a workflow depending on the user’s submission pattern.

Including the time needed for generating a file in estimates of its “cache-worthiness”.
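For readers unfamiliar with LRFU, the general idea can be sketched as follows. This is the generic textbook scheme, not the policy implemented in this work: each entry carries a combined recency and frequency (CRF) score that decays by a factor of (1/2)^(lam * age), and the entry with the lowest decayed score is evicted. The class name, the entry-count capacity (rather than bytes on disk) and the `lam` values are assumptions for illustration.

```python
class LRFUCache:
    """Generic LRFU sketch: evict the entry with the lowest decayed CRF."""

    def __init__(self, capacity, lam=0.1):
        self.capacity = capacity  # maximum number of entries (assumption)
        self.lam = lam            # near 0 behaves like LFU, near 1 like LRU
        self.entries = {}         # name -> (crf, time of last access)

    def _value(self, name, now):
        # CRF decayed from the last access up to the current time.
        crf, last = self.entries[name]
        return (0.5 ** (self.lam * (now - last))) * crf

    def access(self, name, now):
        """Record a reference at time `now`; return the evicted name or None."""
        evicted = None
        if name in self.entries:
            # A new reference adds 1 to the decayed score.
            self.entries[name] = (1.0 + self._value(name, now), now)
        else:
            if len(self.entries) >= self.capacity:
                evicted = min(self.entries,
                              key=lambda n: self._value(n, now))
                del self.entries[evicted]
            self.entries[name] = (1.0, now)
        return evicted


# "a" is used twice early on, "b" once more recently, then "c" arrives.
for lam in (0.1, 1.0):
    cache = LRFUCache(capacity=2, lam=lam)
    cache.access("a", 0)
    cache.access("a", 1)
    cache.access("b", 2)
    print(lam, cache.access("c", 3))
```

With `lam=0.1` the frequently used "a" survives and "b" is evicted; with `lam=1.0` recency dominates and "a" is evicted instead.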

Related work

ZOO, GridDB – data centric workflow management systems

Thain et al. – Pipeline and batch sharing in Grid workloads – HPDC 2003

Romosan et al. – Coscheduling of computation and data on computer clusters – SSDBM 2005

Bright et al. – Efficient scheduling and execution of scientific workflow tasks – SSDBM 2005

Questions?
