Large scale analytical workflows
Transcript of Large scale analytical workflows
bioexcel.eu
Large-scale analytical workflows on the cloud using Galaxy and Globus
Presenter: Ravi Madduri
Host: Adam Carter
BioExcel Educational Webinar Series #8
16 November, 2016
This webinar is being recorded
BioExcel Overview
• Excellence in Biomolecular Software
  – Improve the performance, efficiency and scalability of key codes
• Excellence in Usability
  – Devise efficient workflow environments with associated data integration
• Excellence in Consultancy and Training
  – Promote best practices and train end users
[Figure: architecture diagram with a Portal / Workbench, DMI Gateways, DMI Planner, DMI Optimiser, DMI Validator, DMI Enactors, DMI Executor, DMI Monitor, Repository, and Registry between a Data Source and a Data Delivery Point; arrows show DMI requests, service invocation, data flow, and monitoring flow; roles shown are Domain Expert, DMI Expert, and DADC Engineer]
Interest Groups
• Integrative Modeling IG
• Free Energy Calculations IG
• Best practices for performance tuning IG
• Hybrid methods for biomolecular systems IG
• Biomolecular simulations entry level users IG
• Practical applications for industry IG
• Training
• Workflows
Support platforms
http://bioexcel.eu/contact
• Forums
• Code Repositories
• Chat channel
• Video Channel
Today’s Presenter
Ravi Madduri is a Scientist at Argonne National Laboratory and a Senior Research Fellow at the University of Chicago.
Ravi is actively involved in developing innovative software and networking technology. As lead architect of the Reliable File Transfer, he designed novel testing and profiling capabilities, ensuring that it met the needs of key communities such as TeraGrid.
He implemented Grid file transfer patterns in the Java CoG Kit and developed a remote application virtualization infrastructure; the Grid-enabled extension was incorporated in the Grid Service Authoring Toolkit and is used by NCI Information Systems.
He is applying new technology in diverse science and engineering domains. For example, he is a key contributor to the Cancer Bioinformatics Grid. He played a lead role in the evolution of GridFTP and its adoption by researchers for the Laser Interferometer Gravitational Wave Observatory and the Large Hadron Collider. Moreover, as part of the NEESgrid project, he helped scientific teams incorporate Grid technology into their earthquake engineering research.
globus.org/genomics
Large Scale Analytical Workflows on the Cloud using Galaxy and Globus
Ravi K. Madduri
Argonne National Laboratory, University of Chicago
Who We Are
• Globus is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions
Data Movement and Access Challenges
[Figure: sequencing centers, public data, storage, local cluster/cloud, and research lab, labeled "Manual Data Analysis"]
• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
  – Local clusters, cloud, grid
Challenges in Large Scale NGS Analysis
Once we have the sequence data, how do we analyze it?
[Figure: FASTQ and reference genome inputs, alignment, Picard, GATK, and variant calling, surrounded by an install / modify / (re)run script cycle]
• Manually move the data to the compute node
• Install all the tools required for the analysis
  – BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
• Error-prone, difficult to keep track of, messy
• Difficult to maintain and transfer the knowledge
Additional Challenges in Big Data
• Rapidly validating a hypothesis
• Scaling up the analysis after validation
• Trivially applying the same techniques on other/all datasets of interest
• Reproducibility
  – Unique identifiers for inputs and outputs
  – Publishable results
  – Discoverable results
11/23/16 BIG DATA for DISCOVERY SCIENCE
Globus Genomics
[Figure: sequencing centers, public data, storage, local cluster/cloud, and research lab on the data management side; FASTQ and reference genome inputs, alignment, Picard, GATK, and variant calling on the data analysis side]
Data management
• Globus provides high-performance, fault-tolerant, secure file transfer between all data endpoints
Data analysis: Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
Galaxy-based workflow management
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools
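A workflow that is declared as data, rather than written as a hand-edited shell script, can be reordered and extended without touching the tool invocations. The sketch below illustrates that idea only. It is not Galaxy's workflow format or API; the step names, the dependency order (alignment, then Picard, then GATK variant calling, then filtering), and the helpers `run_order` and `add_step` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each step maps to the set of steps it depends on.
WORKFLOW = {
    "bwa_alignment": set(),  # would consume FASTQ and a reference genome
    "picard": {"bwa_alignment"},
    "gatk_variant_calling": {"picard"},
    "filtering_scripts": {"gatk_variant_calling"},
}


def run_order(workflow):
    """Return one valid execution order for the declared steps."""
    return list(TopologicalSorter(workflow).static_order())


def add_step(workflow, name, depends_on):
    """Return a copy of the workflow with one extra step added."""
    updated = {step: set(deps) for step, deps in workflow.items()}
    updated[name] = set(depends_on)
    return updated


if __name__ == "__main__":
    print(run_order(WORKFLOW))
    # Adding a tool is a data change, not a script rewrite.
    extended = add_step(WORKFLOW, "extra_qc_step", ["bwa_alignment"])
    print(run_order(extended))
```

Because the order is derived from the declared dependencies, adding a tool never requires renumbering or rewriting the existing steps.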
Technologies/Services
• EBS/S3 for scratch and semi-permanent storage
• EC2 – on-demand, reserved, spot
• VPCs
• ELB
• HTCondor
• Globus transfer, identity management
• Chef
• CloudTrail, SNS, SES – monitoring, notifications, audit
• IAM – access management, key management
• RDS – state management
Additional Capabilities
• Professionally managed and supported platform
• Best practice pipelines
  – Whole Genome, Exome, RNA-Seq, ChIP-Seq, …
• Enhanced workbench with breadth of analytic tools
• Technical support and bioinformatics consulting
• Access to pre-integrated endpoints for reliable and high-performance data transfer (e.g. Broad Institute, Perkin Elmer, university sequencing centers, etc.)
• Cost-effective solution with subscription-based pricing
A Cloud Tool Profiling Service
[Figure: a profiler and a pool of workers, each running a worker web service, PCP, and HTCondor]
1. Submit profiling request
2. Provision workers
3. Start/monitor profiling
4. Parse PCP log and store profiles
5. Return profiles
• Describe profile requests in JSON
• Provision resources and apply a profiling web service
  – Use Performance Co-Pilot (PCP) to capture usage
• Capture and process PCP logs
• Return profiles as JSON (or logs via S3)
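To make the request/response shape concrete, here is a minimal sketch of a JSON profiling request and a profile summary. The slides do not give the service's schema, so every field name (`tool`, `command`, `instance_types`, `metrics`), the helper names, and the sample values are hypothetical. The samples stand in for usage data that the real service would parse from PCP logs.

```python
import json


def build_profile_request(tool, command, instance_types, metrics):
    """Build a hypothetical profiling request as a JSON-serializable dict."""
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "tool": tool,
        "command": command,
        "instance_types": list(instance_types),
        "metrics": list(metrics),
    }


def summarize_profile(samples):
    """Reduce per-interval usage samples to peak and mean values per metric."""
    summary = {}
    for metric in samples[0]:
        values = [sample[metric] for sample in samples]
        summary[metric] = {
            "peak": max(values),
            "mean": sum(values) / len(values),
        }
    return summary


request = build_profile_request(
    tool="bwa",
    command="bwa mem ref.fa reads.fastq",
    instance_types=["m4.xlarge", "c4.2xlarge"],
    metrics=["cpu_percent", "memory_mb"],
)
print(json.dumps(request, indent=2))

# Made-up samples standing in for values parsed from PCP logs.
samples = [
    {"cpu_percent": 50.0, "memory_mb": 1000.0},
    {"cpu_percent": 90.0, "memory_mb": 3000.0},
]
print(json.dumps(summarize_profile(samples)))
```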
Cost-aware Provisioning
1. Filter instance types with profiles
2. Determine price for each instance type across all AZs
3. Rank potential requests
4. Make requests and monitor
5. Cancel or repurpose excess active requests once one is fulfilled
R. Chard et al. Cost-aware cloud provisioning, Proceedings of the 11th IEEE International Conference on e-Science (e-Science), 2015.
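The filter, price, and rank steps can be sketched in a few lines. This is an illustration under stated assumptions, not the method from the cited paper: the ranking criterion (hourly price multiplied by a profiled runtime estimate), the prices, the runtimes, and the names `Candidate`, `rank_requests`, and `excess_requests` are all hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    instance_type: str
    zone: str
    hourly_price: float
    estimated_cost: float


def rank_requests(profiles, prices):
    """Rank (instance type, zone) pairs by estimated job cost.

    profiles: instance type -> estimated runtime in hours (from profiling)
    prices:   (instance type, zone) -> hourly price
    """
    candidates = []
    for (instance_type, zone), price in prices.items():
        runtime = profiles.get(instance_type)
        if runtime is None:
            continue  # step 1: skip instance types that have no profile
        # step 2: one price per instance type per availability zone
        candidates.append(Candidate(instance_type, zone, price, price * runtime))
    # step 3: cheapest estimated cost first (assumed ranking criterion)
    return sorted(candidates, key=lambda c: c.estimated_cost)


def excess_requests(open_requests, fulfilled):
    """Step 5: once one request is fulfilled, the others are excess."""
    return [request for request in open_requests if request != fulfilled]


# Made-up runtimes and prices, for illustration only.
profiles = {"m4.xlarge": 4.0, "c4.2xlarge": 2.0}
prices = {
    ("m4.xlarge", "us-east-1a"): 0.05,
    ("m4.xlarge", "us-east-1b"): 0.04,
    ("c4.2xlarge", "us-east-1a"): 0.12,
    ("r3.large", "us-east-1a"): 0.02,  # no profile, so it is filtered out
}
ranked = rank_requests(profiles, prices)
for candidate in ranked:
    print(candidate.instance_type, candidate.zone, round(candidate.estimated_cost, 2))
```

Step 4 (making and monitoring the requests) is omitted because it depends on the cloud provider's API.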
Globus Genomics
• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities
• Data movement is streamlined with integrated Globus file-transfer functionality
• Resources can be provisioned on-demand with Amazon Web Services cloud-based infrastructure
Packaging data for interchange
https://github.com/ini-bdds/bdbag
BDBag
A packaging format for encapsulating:
– Payload: arbitrary content
– Tags: metadata describing the payload
– Checksums: supports verification of content

Bio_data_bag/
|-- data
|   \-- genomic
|       \-- 2a673.fastq
|-- manifest-md5.txt
|     afbfa231324812378123bfa data/genomic/2a673.fastq
|-- bagit.txt
|     Contact-Name: John Smith
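The payload, tags, and checksums layout can be reproduced with the Python standard library alone. The sketch below does not use the bdbag tool or its API; `make_bag` is a hypothetical helper, the FASTQ content is made up, and the bag it writes is a minimal illustration rather than a complete, validated BagIt bag. It follows the BagIt convention of putting the version declaration in bagit.txt and metadata tags such as Contact-Name in bag-info.txt.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload, tags):
    """Write a minimal BagIt-style bag.

    payload: relative file path -> file content as bytes
    tags:    metadata key -> value
    """
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)  # payload lives under data/
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
    (bag / "bag-info.txt").write_text(
        "".join(f"{key}: {value}\n" for key, value in tags.items())
    )
    return bag


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(
        Path(tmp) / "Bio_data_bag",
        {"genomic/2a673.fastq": b"@read1\nACGT\n+\nIIII\n"},
        {"Contact-Name": "John Smith"},
    )
    print((bag / "manifest-md5.txt").read_text(), end="")
    print((bag / "bag-info.txt").read_text(), end="")
```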
Minimal viable identifiers (minid)
• Every data item that you create can be automatically assigned a digital id
• You can reference it, share it, resolve it
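The assign, reference, and resolve cycle can be illustrated with a small in-memory registry. This is a stand-in written for this transcript, not the minid client or service: the class `IdentifierRegistry`, the `demo-id:` prefix, the choice of SHA-256, and the storage locations are all hypothetical.

```python
import hashlib


class IdentifierRegistry:
    """In-memory stand-in for an identifier service (hypothetical)."""

    def __init__(self, prefix="demo-id:"):
        self._prefix = prefix
        self._records = {}      # identifier -> record
        self._by_checksum = {}  # checksum -> identifier

    def register(self, content, location):
        """Assign an identifier to content; identical content reuses its id."""
        checksum = hashlib.sha256(content).hexdigest()
        identifier = self._by_checksum.get(checksum)
        if identifier is None:
            identifier = f"{self._prefix}{len(self._records) + 1:06d}"
            self._records[identifier] = {"checksum": checksum, "locations": []}
            self._by_checksum[checksum] = identifier
        locations = self._records[identifier]["locations"]
        if location not in locations:
            locations.append(location)
        return identifier

    def resolve(self, identifier):
        """Return the checksum and known locations for an identifier."""
        return self._records[identifier]


registry = IdentifierRegistry()
first = registry.register(b"ACGT", "s3://example-bucket/2a673.fastq")
again = registry.register(b"ACGT", "file:///scratch/2a673.fastq")
print(first, again, registry.resolve(first)["locations"])
```

Keying the registry on the checksum means a copy of the same data in a second location resolves to the same identifier, which is what makes the identifier safe to cite.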
Resolve a minid
Generating Data Identifiers
https://github.com/ini-bdds/minid
DNase Hypersensitivity Analysis
[Figure: workflow diagram with components BDDS Data, Encode to BDBag Service, BDDS Analysis Services, BDDS ERMRest Service, TRN, BDDS Galaxy Service, and BDDS Publication Service, exchanging BDBAG, CEL, and FASTQ data]
1. Create a query
2. Query the Encode to BDBag Service
3. Query BDDS Analysis Services
4. BDBag, MinID
5. Execute big data analysis pipelines
6. Results
7. Publish results
8. Index results
Extending Globus Genomics
• BDBags – Interchangeable data objects for collections of files
  – Checksums
  – Unique identifiers
• Batch execution on BDBags generating results as “bags” of results
• Strong data provenance for reproducibility
• Elastic File System for scratch and S3 for results
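Checksums are what let a batch job confirm that an input bag arrived intact before running a pipeline on it. The sketch below checks payload files against manifest-md5.txt using only the standard library. It is not the bdbag validator; `verify_bag`, `demo`, and the VCF file name and content are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_bag(bag_dir):
    """Return payload paths that are missing or do not match manifest-md5.txt."""
    bag = Path(bag_dir)
    failures = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.strip().split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures


def demo():
    """Build a tiny bag, verify it, corrupt the payload, and verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "results_bag"
        payload = bag / "data" / "variants.vcf"
        payload.parent.mkdir(parents=True)
        payload.write_bytes(b"##fileformat=VCFv4.2\n")
        digest = hashlib.md5(payload.read_bytes()).hexdigest()
        (bag / "manifest-md5.txt").write_text(f"{digest}  data/variants.vcf\n")
        intact = verify_bag(bag)
        payload.write_bytes(b"corrupted\n")
        damaged = verify_bag(bag)
    return intact, damaged


print(demo())  # ([], ['data/variants.vcf'])
```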
Extending Globus Genomics
• 500+ skinny Docker containers
• Instrumented with cAdvisor
• Extended the application profiling service to generate profiles for CPU, I/O, memory, disk
• Dashboards using Grafana
• RDS to store the profiles
Globus Genomics at a glance
• 30 institutions, groups, labs
• 10s of millions of core hours
• 2 PBs of raw sequences analyzed
• >1500 analysis tools
• 1000s of genomes processed
• >50 workflows
• 99% uptime over the past two years
• 43: number of steps in a single workflow
• 5 days: longest running workflow
• 100s of different species
Typical Use Cases
A profile of inherited predisposition to breast cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
A case study for high throughput analysis of NGS data for translational research using Globus Genomics
D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan
Globus Genomics users
• Dobyns Lab
• Cox Lab
• Volchenboum Lab
• Olopade Lab
• Nagarajan Lab
• More information on Globus Genomics: www.globus.org/genomics
• More information on Globus: www.globus.org
Our work is supported by: U.S. DEPARTMENT OF ENERGY
Team
Audience Q&A session
Please use the Questions function in the GoToWebinar application.
Any other questions or points to discuss after the live webinar? Join the discussion at http://ask.bioexcel.eu or jump straight to the topic at http://bit.ly/2fghe8B.