Large scale analytical workflows

Post on 19-Jan-2017

bioexcel.eu

Large-scale analytical workflows on the cloud using Galaxy and Globus

Presenter: Ravi Madduri
Host: Adam Carter

BioExcel Educational Webinar Series #8

16 November, 2016

This webinar is being recorded

BioExcel Overview

• Excellence in Biomolecular Software
  – Improve the performance, efficiency and scalability of key codes
• Excellence in Usability
  – Devise efficient workflow environments with associated data integration
• Excellence in Consultancy and Training
  – Promote best practices and train end users

[Diagram: DMI architecture. Components: DMI Gateway, DMI Enactor, DMI Executor, DMI Planner, DMI Optimiser, DMI Validator, DMI Monitor, Portal / Workbench, Repository, Registry, Data Source, Data Delivery Point. Roles: Domain Expert, DMI Expert, DADC Engineer. Arrows: DMI Request, service invocation, data flow, monitoring flow.]

Interest Groups

• Integrative Modeling IG
• Free Energy Calculations IG
• Best practices for performance tuning IG
• Hybrid methods for biomolecular systems IG
• Biomolecular simulations entry level users IG
• Practical applications for industry IG
• Training
• Workflows

Support platforms

http://bioexcel.eu/contact

• Forums
• Code Repositories
• Chat channel
• Video Channel

Today’s Presenter

Ravi Madduri is a Scientist at Argonne National Laboratory and Senior Research Fellow at the University of Chicago.

Ravi is actively involved in developing innovative software and networking technology. As lead architect of the Reliable File Transfer service, he designed novel testing and profiling capabilities, ensuring that it met the needs of key communities such as TeraGrid. He implemented Grid file transfer patterns in the Java CoG Kit and developed a remote application virtualization infrastructure; the Grid-enabled extension was incorporated in the Grid Service Authoring Toolkit and is used by NCI Information Systems.

He is applying new technology in diverse science and engineering domains. For example, he is a key contributor to the Cancer Bioinformatics Grid. He played a lead role in the evolution of GridFTP and its adoption by researchers for the Laser Interferometer Gravitational Wave Observatory and the Large Hadron Collider. Moreover, as part of the NEESgrid project, he helped scientific teams incorporate Grid technology into their earthquake engineering research.

globus.org/genomics

Large Scale Analytical Workflows on the Cloud using Galaxy and Globus

Ravi K. Madduri
Argonne National Laboratory, University of Chicago

madduri@anl.gov

Who We Are

• Globus is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago / Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions

Challenges in Large Scale NGS Analysis

Data Movement and Access Challenges

[Diagram: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab]

• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
  – Local clusters, cloud, grid

Manual Data Analysis

Once we have the sequence data, how do we analyze it?

[Diagram: Fastq, Ref Genome, Alignment, Picard, GATK, Variant Calling; cycle of Install, Modify, (Re)Run Script]

• Manually move the data to the compute node
• Install all the tools required for the analysis
  – BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
• Error prone, difficult to keep track, messy
• Difficult to maintain and transfer the knowledge

Additional Challenges in Big Data

• Rapidly validating a hypothesis
• Scaling up the analysis after validation
• Trivially applying the same techniques on other/all datasets of interest
• Reproducibility
  – Unique Identifiers for inputs and outputs
  – Publishable Results
  – Discoverable Results

11/23/16 BIG DATA for DISCOVERY SCIENCE

Globus Genomics

[Diagram: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab; Fastq, Ref Genome, Alignment, Picard, GATK, Variant Calling]

Data management

Globus provides for
• High-performance
• Fault-tolerant
• Secure
file transfer between all data endpoints

Data analysis: Globus Genomics on Amazon EC2

• Analytical tools are automatically run on the scalable compute resources when possible

Galaxy-based workflow management

• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools
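To make the contrast with hand-edited shell scripts concrete, the sketch below declares the pipeline steps and their dependencies as data and derives a run order from them. It is a minimal illustration in plain Python, not Galaxy's workflow format or API; the step names and the dependencies between them are assumptions based on the diagram labels.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow description: step -> set of steps it depends on.
# Names are taken from the slide labels; the edges are assumed.
WORKFLOW = {
    "alignment": {"fastq", "ref_genome"},
    "picard": {"alignment"},
    "variant_calling_gatk": {"picard", "ref_genome"},
}


def execution_order(workflow):
    """Return one order in which every step runs after its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


def add_step(workflow, name, depends_on):
    """Return a new workflow with one extra step; the original is unchanged."""
    extended = dict(workflow)
    extended[name] = set(depends_on)
    return extended


order = execution_order(WORKFLOW)
extended_order = execution_order(
    add_step(WORKFLOW, "filtering", {"variant_calling_gatk"})
)
print(order)
print(extended_order)
```

Adding a tool becomes a one-line change to the description rather than an edit to a script, which is the property the "easily modify workflows" bullet refers to.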

Technologies/Services

• EBS/S3 for scratch and semi-permanent storage
• EC2 – on-demand, reserved, spot
• VPCs
• ELB
• HTCondor
• Globus transfer, identity management
• Chef
• CloudTrail, SNS, SES – monitoring, notifications, audit
• IAM – access management, key management
• RDS – state management

Additional Capabilities

• Professionally managed and supported platform
• Best practice pipelines
  – Whole Genome, Exome, RNA-Seq, ChIP-Seq, …
• Enhanced workbench with breadth of analytic tools
• Technical support and bioinformatics consulting
• Access to pre-integrated end-points for reliable and high-performance data transfer (e.g. Broad Institute, PerkinElmer, university sequencing centers, etc.)
• Cost-effective solution with subscription-based pricing

A Cloud Tool Profiling Service

[Diagram: a Profiler and several Workers, each running a worker web service, PCP and HTCondor]

1. Submit profiling request
2. Provision workers
3. Start/monitor profiling
4. Parse PCP log and store profiles
5. Return profiles

• Describe profile requests in JSON
• Provision resources and apply a profiling Web Service
  – Use Performance Co-Pilot (PCP) to capture usage
• Capture and process PCP logs
• Return profiles as JSON (or logs via S3)
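As a rough illustration of the request/response shape described above, the sketch below builds a profiling request as JSON and reduces a list of usage samples to a small profile. The field names, the sample format and the peak/mean summary are all assumptions made for this example; the slides do not show the service's actual schema, and no PCP API is used here (the samples stand in for parsed PCP logs).

```python
import json
from statistics import mean


def build_profile_request(tool, instance_types, metrics):
    """Serialise a profiling request; every field name here is hypothetical."""
    return json.dumps(
        {
            "tool": tool,
            "instance_types": list(instance_types),
            "metrics": list(metrics),
        },
        indent=2,
    )


def summarise_usage(samples, metrics):
    """Reduce per-interval usage samples to a peak/mean value per metric."""
    profile = {}
    for metric in metrics:
        values = [sample[metric] for sample in samples]
        profile[metric] = {"peak": max(values), "mean": mean(values)}
    return profile


metrics = ["cpu_percent", "mem_mb"]
request_json = build_profile_request("bwa", ["c4.xlarge", "m4.large"], metrics)

# Stand-in for values parsed from a worker's usage log.
samples = [
    {"cpu_percent": 50.0, "mem_mb": 1000},
    {"cpu_percent": 100.0, "mem_mb": 3000},
]
profile = summarise_usage(samples, metrics)
profile_json = json.dumps(profile, indent=2)
print(request_json)
print(profile_json)
```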

Cost-aware Provisioning

1. Filter instance types with profiles

2. Determine price for each instance type across all AZs

3. Rank potential requests

4. Make requests and monitor

5. Cancel or repurpose excess active requests once one is fulfilled
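The five steps above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm from the cited paper: ranking by estimated cost (price per hour multiplied by profiled runtime), the `Offer` type, the prices and the `is_fulfilled` callback are all assumptions, and no cloud API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One instance type in one availability zone at a given price (hypothetical)."""
    instance_type: str
    zone: str
    price_per_hour: float


def rank_requests(offers, profiles):
    """Steps 1-3: keep profiled instance types, then rank by estimated cost.

    `profiles` maps instance type -> estimated runtime in hours for the tool.
    """
    candidates = [o for o in offers if o.instance_type in profiles]
    return sorted(
        candidates,
        key=lambda o: o.price_per_hour * profiles[o.instance_type],
    )


def fulfil(ranked, is_fulfilled, max_requests=3):
    """Steps 4-5: issue the top requests, keep the first fulfilled, cancel the rest."""
    active = ranked[:max_requests]
    winner = next((o for o in active if is_fulfilled(o)), None)
    cancelled = [o for o in active if o is not winner] if winner else []
    return winner, cancelled


offers = [
    Offer("m4.large", "us-east-1a", 0.10),
    Offer("c4.xlarge", "us-east-1b", 0.20),
    Offer("r3.large", "us-east-1a", 0.05),  # no profile, so it is filtered out
]
profiles = {"m4.large": 4.0, "c4.xlarge": 1.5}

ranked = rank_requests(offers, profiles)
# Pretend only requests in us-east-1a are fulfilled.
winner, cancelled = fulfil(ranked, lambda o: o.zone == "us-east-1a")
print(ranked, winner, cancelled)
```

The cheapest instance per hour is not necessarily the cheapest per job, which is why the profile (runtime per instance type) enters the ranking in this sketch.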

R. Chard et al. Cost-aware cloud provisioning, Proceedings of the 11th IEEE International Conference on e-Science (e-Science), 2015.

Globus Genomics

• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities

• Data movement is streamlined with integrated Globus file-transfer functionality

• Resources can be provisioned on-demand with Amazon Web Services cloud based infrastructure

Packaging data for interchange

https://github.com/ini-bdds/bdbag

BDBag

A packaging format for encapsulating
– Payload: arbitrary content
– Tags: metadata describing the payload
– Checksums: supports verification of content

Bio_data_bag/
|-- data
|   \-- genomic
|       \-- 2a673.fasta
|       \-- 2a673.fastq
|-- manifest-md5.txt
|     afbfa231324812378123bfa data/genomic/2a673.fasta
|-- bagit.txt
      Contact-Name: John Smith
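The three parts of the format (payload, tags, checksums) can be shown with a small standard-library sketch that writes a bag-like directory and then verifies it. This is an illustration of the layout only, not the `bdbag` tool from the repository above. The helper names `make_bag` and `verify_bag` are made up for this example, and placing `Contact-Name` in a `bag-info.txt` file follows the BagIt convention rather than the slide.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload, contact_name):
    """Write payload files under data/ plus the tag and manifest files."""
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    (bag / "bag-info.txt").write_text(
        f"Contact-Name: {contact_name}\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Recompute each payload checksum and compare it with the manifest."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.md5((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(
        Path(tmp) / "Bio_data_bag",
        {"genomic/2a673.fastq": b"@read1\nACGT\n+\nIIII\n"},
        "John Smith",
    )
    intact = verify_bag(bag)
    # Simulate corruption in transit: the checksum no longer matches.
    (bag / "data" / "genomic" / "2a673.fastq").write_bytes(b"corrupted")
    tampered_ok = verify_bag(bag)

print(intact, tampered_ok)
```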

Minimal viable identifiers (minid)

• Every data item that you create can be automatically assigned a digital id
• You can reference it, share it, resolve it
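The "assign, reference, resolve" idea can be sketched as an in-memory registry that binds an identifier to a checksum and a location. This is a toy stand-in, not the minid client or service: the class, its methods, the identifier format and the example URL are all hypothetical.

```python
import hashlib
import uuid


class IdentifierRegistry:
    """In-memory stand-in for an identifier service (hypothetical)."""

    def __init__(self):
        self._records = {}

    def register(self, content, location, title=""):
        """Mint an identifier and bind it to the content's checksum and location."""
        checksum = hashlib.sha256(content).hexdigest()
        identifier = "minid:" + uuid.uuid4().hex[:12]  # made-up identifier format
        self._records[identifier] = {
            "checksum": checksum,
            "locations": [location],
            "title": title,
        }
        return identifier

    def resolve(self, identifier):
        """Return the metadata recorded for an identifier."""
        return self._records[identifier]

    def matches(self, identifier, content):
        """Check that some content is the data item the identifier refers to."""
        expected = self._records[identifier]["checksum"]
        return hashlib.sha256(content).hexdigest() == expected


registry = IdentifierRegistry()
data = b"@read1\nACGT\n+\nIIII\n"
location = "https://example.org/bags/Bio_data_bag.zip"  # placeholder URL
minid = registry.register(data, location, title="Example bag")
record = registry.resolve(minid)
print(minid, record["locations"])
```

Because the record carries a checksum, anyone who resolves the identifier can confirm that the file they downloaded is the one that was registered.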

Resolve a minid

Generating Data Identifiers

https://github.com/ini-bdds/minid

DNase Hypersensitivity Analysis

[Diagram: BDDS Data, Encode to BDBag Service, BDDS Analysis Services, BDDS ERM Rest Service, BDDS Galaxy Service, BDDS Publication Service, TRN; data objects: BDBag, MinID, CEL, FASTQ]

1. Create a Query
2. Query
3. Query
4. BDBag MinID
5. Execute Big Data Analysis pipelines
6. Results
7. Publish Results
8. Index Results

Extending Globus Genomics

• BDBags – Interchangeable data objects for collections of files
  – Checksums
  – Unique identifiers
• Batch Execution on BDBags generating results as “bags” of results
• Strong data provenance for reproducibility
• Elastic File System for scratch and S3 for results

Extending Globus Genomics

• 500+ skinny Docker containers
• Instrumented with cAdvisor
• Extended the application profiling service to generate profiles for CPU, I/O, memory, disk
• Dashboards using Grafana
• RDS to store the profiles

Globus Genomics at a glance

• 30 institutions, groups, labs
• 10s of millions of core hours
• 2 PB of raw sequences analyzed
• >1500 analysis tools
• 1000s of genomes processed
• >50 workflows
• 99% uptime over the past two years
• 43 steps: number of steps in a single workflow
• 5 days: longest running workflow
• 100s of different species

Typical Use Cases

A profile of inherited predisposition to breast cancer among Nigerian women

Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade

A case study for high throughput analysis of NGS data for translational research using Globus Genomics

D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan

Globus Genomics users

• Dobyns Lab
• Cox Lab
• Volchenboum Lab
• Olopade Lab
• Nagarajan Lab

• More information on Globus Genomics: www.globus.org/genomics

• More information on Globus: www.globus.org

Our work is supported by: U.S. Department of Energy

Team

Audience Q&A session

Please use the Questions function in the GoToWebinar application.

Any other questions or points to discuss after the live webinar? Join the discussion at http://ask.bioexcel.eu or jump straight to the topic at http://bit.ly/2fghe8B.