Large scale analytical workflows

bioexcel.eu Partners Funding Large-scale analytical workflows on the cloud using Galaxy and Globus Presenters: Ravi Madduri Host: Adam Carter BioExcel Educational Webinar Series #8 16 November, 2016

Transcript of Large scale analytical workflows

Page 1: Large scale analytical workflows

bioexcel.eu

Partners Funding

Large-scale analytical workflows on the cloud using Galaxy and Globus

Presenters: Ravi Madduri
Host: Adam Carter

BioExcel Educational Webinar Series #8

16 November, 2016

Page 2: Large scale analytical workflows

This webinar is being recorded

Page 3: Large scale analytical workflows

BioExcel Overview

• Excellence in Biomolecular Software
- Improve the performance, efficiency and scalability of key codes
• Excellence in Usability
- Devise efficient workflow environments with associated data integration
• Excellence in Consultancy and Training
- Promote best practices and train end users

[Figure: architecture diagram. Components: Portal / Workbench, DMI Gateway, DMI Planner, DMI Optimiser, DMI Validator, DMI Enactor, DMI Executor, DMI Monitor, Repository, Registry, Data Source, Data Delivery Point. Roles: Domain Expert, DMI Expert, DADC Engineer. Connections: DMI Request, Service Invocation, Data flow, Monitoring flow]

Page 4: Large scale analytical workflows

Interest Groups

• Integrative Modeling IG
• Free Energy Calculations IG
• Best practices for performance tuning IG
• Hybrid methods for biomolecular systems IG
• Biomolecular simulations entry level users IG
• Practical applications for industry IG
• Training
• Workflows

Support platforms
http://bioexcel.eu/contact

Forums, Code Repositories, Chat channel, Video Channel

Page 5: Large scale analytical workflows

Today’s Presenter

Ravi Madduri is a Scientist at Argonne National Laboratory and Senior Research Fellow at University of Chicago. Ravi is actively involved in developing innovative software and networking technology. As lead architect of the Reliable File Transfer, he designed novel testing and profiling capabilities, ensuring that it met the needs of key communities such as TeraGrid. He implemented Grid file transfer patterns in the Java CoG Kit and developed a remote application virtualization infrastructure; the Grid-enabled extension was incorporated in the Grid Service Authoring Toolkit and is used by NCI Information Systems.

He is applying new technology in diverse science and engineering domains. For example, he is a key contributor to the Cancer Bioinformatics Grid. He played a lead role in the evolution of GridFTP and its adoption by researchers for the Laser Interferometer Gravitational Wave Observatory and the Large Hadron Collider. Moreover, as part of the NEESgrid project, he helped scientific teams incorporate Grid technology into their earthquake engineering research.

Page 6: Large scale analytical workflows

globus.org/genomics

Large Scale Analytical Workflows on the Cloud using Galaxy and Globus

Ravi K. Madduri
Argonne National Laboratory, University of Chicago

Page 7: Large scale analytical workflows

Who We Are

• Globus is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago / Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions

Page 8: Large scale analytical workflows

Challenges in Large Scale NGS Analysis

Data Movement and Access Challenges

[Figure labels: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab]

• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
– Local clusters, cloud, grid

Manual Data Analysis

Once we have the Sequence Data
How do we analyze this Sequence Data

[Figure labels: Fastq, Ref Genome, Alignment, Variant Calling, Picard, GATK, Install, Modify, (Re)Run Script]

• Manually move the data to the compute node
• Install all the tools required for the analysis
– BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
• Error prone, difficult to keep track, messy
• Difficult to maintain and transfer the knowledge

Page 9: Large scale analytical workflows

Additional Challenges in Big Data

• Rapidly validating a hypothesis
• Scaling up the analysis after validation
• Trivially applying the same techniques on other/all datasets of interest
• Reproducibility
– Unique Identifiers for inputs and outputs
– Publishable Results
– Discoverable Results

11/23/16 BIG DATA for DISCOVERY SCIENCE

Page 10: Large scale analytical workflows

Globus Genomics

Data management

[Figure labels: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab]

Globus provides for
• High-performance
• Fault-tolerant
• Secure
file transfer between all data-endpoints

Data analysis

[Figure labels: Fastq, Ref Genome, Alignment, Variant Calling, Picard, GATK]

Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible

Galaxy-based workflow management
• Globus integrated within Galaxy
• Web-based UI
• Drag-Drop workflow creation
• Easily modify workflows with new tools

Page 11: Large scale analytical workflows

Technologies/Services

• EBS/S3 for scratch and semi-permanent storage
• EC2 – on-demand, reserved, spot
• VPCs
• ELB
• HTCondor
• Globus transfer, identity management
• Chef
• CloudTrail, SNS, SES – monitoring, notifications, audit
• IAM – access management, key management
• RDS – state management

Page 12: Large scale analytical workflows

Additional Capabilities

• Professionally managed and supported platform
• Best practice pipelines
– Whole Genome, Exome, RNA-Seq, ChIP-Seq, …
• Enhanced workbench with breadth of analytic tools
• Technical support and bioinformatics consulting
• Access to pre-integrated end-points for reliable and high-performance data transfer (e.g. Broad Institute, PerkinElmer, university sequencing centers, etc.)
• Cost-effective solution with subscription-based pricing

Page 13: Large scale analytical workflows

A Cloud Tool Profiling Service

[Figure: a Profiler and several Workers, each Worker running a Worker web service, PCP, and HTCondor. Steps: 1. Submit profiling request; 2. Provision workers; 3. Start/monitor profiling; 4. Parse PCP log and store profiles; 5. Return profiles]

• Describe profile requests in JSON
• Provision resources and apply a profiling Web Service
– Use Performance Co-Pilot (PCP) to capture usage
• Capture and process PCP logs
• Return profiles as JSON (or logs via S3)
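The request and response ends of such a service might look roughly like the sketch below. The slides do not show the actual JSON schema, so every field name (`tool`, `instance_types`, `metrics`), the sample format, and the `summarize` helper are assumptions made for illustration; no real PCP API is called, and the usage samples are invented values standing in for a parsed PCP log.

```python
import json
from statistics import mean

# Hypothetical profiling request: all field names and values are assumptions.
request = {
    "tool": "bwa",
    "instance_types": ["c3.2xlarge", "r3.2xlarge"],
    "metrics": ["cpu_percent", "mem_mb"],
}
request_json = json.dumps(request, sort_keys=True)


def summarize(samples):
    """Reduce per-interval usage samples to a mean/peak profile per metric."""
    profile = {}
    for metric in samples[0]:
        values = [sample[metric] for sample in samples]
        profile[metric] = {"mean": mean(values), "peak": max(values)}
    return profile


# Invented samples standing in for values parsed out of a PCP log.
samples = [
    {"cpu_percent": 50.0, "mem_mb": 1000.0},
    {"cpu_percent": 90.0, "mem_mb": 3000.0},
    {"cpu_percent": 70.0, "mem_mb": 2000.0},
]

# The profile returned to the caller, again as JSON.
profile_json = json.dumps({"tool": request["tool"], "profile": summarize(samples)})
```

Keeping both the request and the returned profile as plain JSON means the profiles can be stored and later consulted when choosing instance types.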

Page 14: Large scale analytical workflows

Cost-aware Provisioning

1. Filter instance types with profiles

2. Determine price for each instance type across all AZs

3. Rank potential requests

4. Make requests and monitor

5. Cancel or repurpose excess active requests once one is fulfilled

R. Chard et al. Cost-aware cloud provisioning, Proceedings of the 11th IEEE International Conference on e-Science (e-Science), 2015.
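Steps 1 to 3 of the list above can be sketched in a few lines. This is an illustration only, not the algorithm from the cited paper: the `InstanceType` class, the `rank_requests` function, the profile fields, the instance and zone names, and the prices are all hypothetical, and ranking purely by hourly price is an assumption. Steps 4 and 5 (making, monitoring, and cancelling requests) need a cloud provider API and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    mem_gb: float


def rank_requests(profile, instance_types, prices):
    """Return (price, instance name, AZ) candidates, cheapest first."""
    # Step 1: keep only instance types that satisfy the tool's profile.
    suitable = [t for t in instance_types
                if t.vcpus >= profile["vcpus"] and t.mem_gb >= profile["mem_gb"]]
    # Step 2: look up the price of each suitable type in every availability zone.
    candidates = [(price, t.name, az)
                  for t in suitable
                  for az, price in prices.get(t.name, {}).items()]
    # Step 3: rank by price (ties broken by instance name, then AZ).
    return sorted(candidates)


# Made-up instance types, zones, and hourly prices.
instance_types = [
    InstanceType("small", 2, 4.0),
    InstanceType("medium", 4, 16.0),
    InstanceType("large", 8, 32.0),
]
prices = {
    "small": {"az-a": 0.02},
    "medium": {"az-a": 0.12, "az-b": 0.10},
    "large": {"az-a": 0.11, "az-b": 0.30},
}
profile = {"vcpus": 4, "mem_gb": 8.0}

ranked = rank_requests(profile, instance_types, prices)
```

With these invented numbers, the cheapest candidate is the `medium` type in `az-b`, and the undersized `small` type never appears in the ranking even though it is the cheapest overall.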

Page 15: Large scale analytical workflows

Globus Genomics

• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities

• Data movement is streamlined with integrated Globus file-transfer functionality

• Resources can be provisioned on-demand with Amazon Web Services cloud-based infrastructure

Page 16: Large scale analytical workflows

Packaging data for interchange

https://github.com/ini-bdds/bdbag

Page 17: Large scale analytical workflows

Packaging data for interchange

BDBag

A packaging format for encapsulating
– Payload: arbitrary content
– Tags: metadata describing the payload
– Checksums: supports verification of content

Bio_data_bag/
|-- data
|   \-- genomic
|       \-- 2a673.fastq
|       \-- 2a673.fastq
|-- manifest-md5.txt
|     afbfa231324812378123bfa data/genomic/2a673.fasta
|-- bagit.txt

Contact-Name: John Smith
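The three parts of the format (payload, tags, checksums) can be illustrated with a small sketch that uses only the Python standard library. It follows general BagIt conventions: a `data/` payload directory, a `manifest-md5.txt` of checksums, a `bagit.txt` declaration, and tag metadata in `bag-info.txt`. It is not the `bdbag` tool's API, and the function name `make_bag` and its arguments are hypothetical.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, payload, tags):
    """Write payload files, an MD5 manifest, and tag files under bag_dir.

    payload maps a relative path (inside data/) to the file's bytes;
    tags maps metadata names to values.
    """
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        # Payload: arbitrary content stored under data/.
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # Checksums: one "<digest>  <path>" line per payload file.
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
    # Bag declaration expected by BagIt readers.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    # Tags: metadata describing the payload.
    (bag / "bag-info.txt").write_text(
        "".join(f"{key}: {value}\n" for key, value in tags.items()))
    return bag
```

For example, `make_bag("Bio_data_bag", {"genomic/2a673.fastq": reads}, {"Contact-Name": "John Smith"})`, with `reads` holding the file's bytes, would produce a directory shaped like the one on the slide.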

Page 18: Large scale analytical workflows

Minimal viable identifiers (minid)

• Every data item that you create can be automatically assigned a digital ID
• You can reference it, share it, resolve it
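The idea of assigning an identifier to a data item and resolving it later can be illustrated with a toy in-memory registry. This is not the `minid` client or service API: the `MinimalRegistry` class, the `example:` identifier prefix, the stored fields, and the URLs are all assumptions made for illustration.

```python
import hashlib


class MinimalRegistry:
    """Toy in-memory identifier registry (hypothetical; not the minid API)."""

    def __init__(self):
        self._records = {}

    def register(self, content, location, title=""):
        """Assign an identifier derived from the content checksum and record a location."""
        checksum = hashlib.sha256(content).hexdigest()
        identifier = "example:" + checksum[:12]  # made-up scheme, for illustration only
        record = self._records.setdefault(
            identifier, {"checksum": checksum, "title": title, "locations": []})
        if location not in record["locations"]:
            record["locations"].append(location)
        return identifier

    def resolve(self, identifier):
        """Return the stored record, or None if the identifier is unknown."""
        return self._records.get(identifier)


registry = MinimalRegistry()
item_id = registry.register(
    b"ACGT\n", "https://example.org/data/2a673.fastq", title="Sample reads")
record = registry.resolve(item_id)
```

Because the identifier here is derived from the checksum, registering the same bytes from a second location returns the same identifier and simply adds another location to the record.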

Page 19: Large scale analytical workflows

Resolve a minid

Page 20: Large scale analytical workflows

Generating Data Identifiers

https://github.com/ini-bdds/minid

Page 21: Large scale analytical workflows

DNase Hypersensitivity Analysis

[Figure: workflow across BDDS services. Steps: 1. Create a Query; 2. Query; 3. Query; 4. BDBag MinID; 5. Execute Big Data Analysis pipelines; 6. Results; 7. Publish Results; 8. Index Results. Components: BDDS Data, Encode to BDBag Service, BDDS Analysis Services, BDDS ERMRest Service, BDDS Galaxy Service, BDDS Publication Service, TRN. Data labels: BDBAG, CEL, FASTQ]

Page 22: Large scale analytical workflows

Extending Globus Genomics

• BDBags
– Interchangeable data objects for collections of files
– Checksums
– Unique identifiers
• Batch Execution on BDBags generating results as “bags” of results
• Strong data provenance for reproducibility
• Elastic File System for scratch and S3 for results
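The checksums carried in a bag are what make it possible to confirm, before a batch run, that the inputs are exactly the files that were packaged. A minimal sketch of that check, assuming a BagIt-style `manifest-md5.txt` with one `<digest> <path>` entry per line; the function name `verify_bag` is hypothetical and this is not the `bdbag` tool's own validation routine.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Return the payload paths whose MD5 digest does not match the manifest."""
    bag = Path(bag_dir)
    mismatches = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, rel_path = line.split(maxsplit=1)
        path = bag / rel_path
        # A missing file counts as a mismatch.
        actual = hashlib.md5(path.read_bytes()).hexdigest() if path.is_file() else None
        if actual != expected:
            mismatches.append(rel_path)
    return mismatches
```

An empty list means every payload file is present and unmodified; any returned path points to a file that is missing or has changed since the bag was created.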

Page 23: Large scale analytical workflows

Extending Globus Genomics

• 500+ skinny Docker containers
• Instrumented with cAdvisor
• Extended the application profiling service to generate profiles for CPU, I/O, memory, disk
• Dashboards using Grafana
• RDS to store the profiles

Page 24: Large scale analytical workflows

Page 25: Large scale analytical workflows

Globus Genomics at a glance

• 30 institutions, groups, labs
• 10s of millions of core hours
• 2 PBs raw sequences analyzed
• >1500 analysis tools
• 1000s genomes processed
• >50 workflows
• 99% uptime over the past two years
• 43 steps: number of steps in a single workflow
• 5 days: longest running workflow
• 100s different species

Page 26: Large scale analytical workflows

Typical Use Cases

A profile of inherited predisposition to breast cancer among Nigerian women

Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade

A case study for high throughput analysis of NGS data for translational research using Globus Genomics

D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan

Page 27: Large scale analytical workflows

Globus Genomics users

Dobyns Lab
Cox Lab
Volchenboum Lab
Olopade Lab
Nagarajan Lab

Page 28: Large scale analytical workflows

• More information on Globus Genomics: www.globus.org/genomics

• More information on Globus: www.globus.org

Page 29: Large scale analytical workflows

Our work is supported by: U.S. DEPARTMENT OF ENERGY

Page 30: Large scale analytical workflows

Team

Page 31: Large scale analytical workflows

Audience Q&A session

Please use the Questions function in the GoToWebinar application

Any other questions or points to discuss after the live webinar? Join the discussion at http://ask.bioexcel.eu or jump straight to the topic at http://bit.ly/2fghe8B.