Introduction to Bioinformatics Data Analysis with the Galaxy Application
Enis Afgan, Ruđer Bošković Institute
30 September 2014
All of us
• In 30 seconds or less, tell everyone:
  • Your name
  • Your department / affiliation
  • Something about your research
  • Why you are here / what you hope to learn
Workshop overview
9:30-10:00 Introductory lecture: the Galaxy and CloudMan applications
10:00-10:15 Q&A / break
10:15-10:30 Launching your own CloudMan cluster
10:30-11:30 Galaxy 101
11:30-11:45 Q&A / break
11:45-12:30 Configuring the Galaxy and CloudMan applications
12:30-12:45 Survey and AWS credits: 3x $100
Making sense of this data requires a sophisticated analysis environment with adequate computational infrastructure that is accessible to researchers while ensuring reproducibility of scientific results.
What is Galaxy?
A data analysis and integration tool
A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes it simple to integrate your own tools and data and to customize it for your own site
Running a tool
- Web UI automatically generated from a tool wrapper (any tool can be integrated)
- Integrated with other tools
Reproducibility in Genomics
18 Nat. Genetics experiments in microarray gene expression
<50% reproducible
Problems
• missing data (38%)
• missing software, hardware details (50%)
• missing methods, processing details (66%)
Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression analyses.” Nat Genet 41, 149-155 (2009)
14 re-sequencing experiments in Nat. Genetics, Nature, Science
0% reproducible?
Problems
• missing primary data (50%)
• tools unavailable (50%)
• missing parameter settings, tool versions (100%)
"Devil in the details," Nature, vol. 470, 305-306 (2011).
http://usegalaxy.org (a.k.a. Main)
• Public web site
• Anybody can use it
• Hundreds of tools
• Persistent
• +500 users/month
• ~200TB of user data
• ~140,000 analysis jobs / month
http://bit.ly/gxystats
Public Galaxy Servers https://wiki.galaxyproject.org/PublicGalaxyServers
Interested in:
ChIP-chip and ChIP-seq? ✓ Cistrome
Statistical Analysis? ✓ Genomic Hyperbrowser
Sequence and tiling arrays? ✓ Oqtans
Text Mining? ✓ DBCLS Galaxy
Reasoning with ontologies? ✓ GO Galaxy
Internally symmetric protein structures? ✓ SymD
Compute clusters
• A number of connected computers
• Typically built from commodity components
• Used to improve performance: throughput or speed (supercomputers)
Cloud Computing
• Dynamically scalable shared resources accessed over a network
• Control infrastructure via API
• Private, public, or hybrid
• Virtually unlimited resources: storage, computing, services
• Only pay for what you use
What is CloudMan?
CloudMan allows one to create a compute cluster in the cloud, use pre-configured applications, or add
one’s own. And then share it all.
Deploying a CloudMan Platform
1. An account on the supported cloud
2. Start a master instance via a launcher app or the cloud web dashboard
3. Use the CloudMan web interface on the master instance to manage the platform
Share Your Instance
• Share the entire (Galaxy) CloudMan platform
• Even the customized ones (including data and/or tools)
• Fully automated solution
• Publish a self-contained analysis
• In progress or otherwise
How much does the Cloud cost?
Amazon Web Services
• $0.14 per CPU hour (~$100 per CPU month)
• $0.05 per GB-month (~$50 per TB-month)
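The monthly figures follow directly from the hourly and per-GB rates. The sketch below redoes that arithmetic in Python, using only the two rates quoted on this slide; the 30-day month, the 1 TB = 1000 GB conversion, the function name and the example workloads are assumptions made for illustration, not current AWS pricing.

```python
# Illustrative cost arithmetic using only the two rates quoted on the slide.
CPU_HOUR_USD = 0.14         # per CPU hour (from the slide)
GB_MONTH_USD = 0.05         # per GB-month of storage (from the slide)
HOURS_PER_MONTH = 24 * 30   # assumption: a 30-day month


def monthly_cost(cpus: int, storage_gb: float, hours: float = HOURS_PER_MONTH) -> float:
    """Estimate the cost in USD of running `cpus` CPUs for `hours` hours
    while keeping `storage_gb` GB stored for one month."""
    compute = cpus * hours * CPU_HOUR_USD
    storage = storage_gb * GB_MONTH_USD
    return round(compute + storage, 2)


if __name__ == "__main__":
    print(monthly_cost(cpus=1, storage_gb=0))             # one CPU for a full month
    print(monthly_cost(cpus=0, storage_gb=1000))          # 1 TB, taken as 1000 GB
    print(monthly_cost(cpus=4, storage_gb=200, hours=3))  # hypothetical short run
```

One CPU for a 30-day month comes to about $100 and 1000 GB of storage to $50, matching the rounded figures above.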
Working with your own CloudMan cluster (interaction flow)
• Launch an instance
• Demonstrate the following CloudMan features and prepare for the data analysis part:
  • Manual & auto-scaling
  • Using an S3 bucket as a data source
  • Accessing an instance over ssh
  • Customizing an instance
  • Controlling Galaxy
  • Sharing-an-instance
• Perform data analysis in Galaxy
  • Find exons with most SNPs
Launch an instance
1. Slides @ bit.ly/irb-ws
2. Load biocloudcentral.org
3. Enter the access key and secret key provided at http://bit.ly/ws-creds
4. Provide your email address
5. Use your initials as the cluster name
6. Set any password (and remember it)
7. Use the Large instance type
8. Start your instance
Wait for the instance to start (~2-3 minutes)
9. Access the Galaxy application
For more details, see http://cloudman.irb.hr
Agenda details (interaction flow)
• Launch an instance
• Perform data analysis in Galaxy
  • Find exons with most SNPs
• Demonstrate the following CloudMan features and prepare for the data analysis part:
  • Manual & auto-scaling
  • Using an S3 bucket as a data source
  • Accessing an instance over ssh
  • Customizing an instance
  • Controlling Galaxy
  • Sharing-an-instance
A Rough Plan
• Get some data
  • Coding exons on chromosome 22
  • SNPs on chromosome 22
• Mess with it
  • Identify which exons have SNPs
  • Count SNPs per exon
• Visualize our results
[Figure: workflow diagram. Exons (from UCSC) and SNPs (from UCSC) → overlap pairings → exon overlap counts → join with the exons on exon name → rearrange columns w/ cut]
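The plan above (pair exons with overlapping SNPs, count per exon, join the counts back onto the exon list) can be sketched outside Galaxy in a few lines of Python. This is a minimal illustration under stated assumptions, not what the Galaxy tools do internally: the exon and SNP records are made-up toy data, the function names are hypothetical, and coordinates are treated as zero-based, half-open intervals.

```python
from collections import Counter

# Hypothetical toy data: (chrom, start, end, name), zero-based half-open.
EXONS = [
    ("chr22", 100, 200, "exon_A"),
    ("chr22", 300, 400, "exon_B"),
    ("chr22", 500, 600, "exon_C"),
]
SNPS = [
    ("chr22", 150, 151, "snp_1"),
    ("chr22", 199, 200, "snp_2"),
    ("chr22", 350, 351, "snp_3"),
    ("chr22", 450, 451, "snp_4"),  # falls between exons
]


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_snps_per_exon(exons, snps):
    """Return (chrom, start, end, name, snp_count) rows, most SNPs first."""
    counts = Counter()
    for exon in exons:                 # "overlap pairings" step
        for snp in snps:
            if overlaps(exon, snp):
                counts[exon[3]] += 1   # "exon overlap counts" step
    # "join on exon name" step: exons without SNPs get a count of 0
    joined = [(c, s, e, name, counts[name]) for c, s, e, name in exons]
    return sorted(joined, key=lambda row: row[4], reverse=True)


if __name__ == "__main__":
    for row in count_snps_per_exon(EXONS, SNPS):
        print("\t".join(str(field) for field in row))
```

The nested loop is quadratic and only suitable for toy inputs; real chromosome-scale data calls for sorted or indexed interval structures.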
Data types overview: BED
• Tab-delimited text file that defines a feature track
• Zero-based
• One line per feature
• Each line contains 3-12 columns
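To make the BED layout concrete, here is a minimal Python sketch that reads one BED line and keeps the first four columns. The sample line, the `BedFeature` type and the `parse_bed_line` name are hypothetical; the sketch assumes the usual BED convention of a zero-based start and an exclusive end.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # zero-based, inclusive
    end: int    # exclusive
    name: str   # empty string if the optional 4th column is absent


def parse_bed_line(line: str) -> BedFeature:
    """Parse one tab-delimited BED line with 3 to 12 columns."""
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= 12:
        raise ValueError(f"expected 3-12 columns, got {len(fields)}")
    name = fields[3] if len(fields) > 3 else ""
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


if __name__ == "__main__":
    feature = parse_bed_line("chr22\t1000\t1500\texon_A\n")  # hypothetical line
    print(feature, "length:", feature.end - feature.start)
```

With a zero-based start and an exclusive end, the feature length is simply `end - start`.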
Data types overview: Tabular / Interval
• Tabular
  • Tab-delimited text file
• Interval
  • Each line represents a genomic interval
  • Zero-based
  • One line per interval
  • Each line contains 3-5 columns
Manual scaling
• Explicitly add 1 worker node to your cluster
• Node type corresponds to node processing capacity
• Research use of Spot instances
Public / shared data
• Take a look at the 1000 Genomes data
• Take a look at AWS Public Datasets
• More examples exist
• How to use this freely available data and make new discoveries?
Accessing an instance over ssh
Use the terminal (or install Secure Shell for Chrome)
SSH using user ubuntu and the password you chose when launching an instance:
[local machine]$ ssh ubuntu@<instance IP address>
Once logged in
• You have full system access to your instance, including sudo; use it like any other system
• A galaxy user exists on the system and should be used when manipulating Galaxy (sudo su galaxy)
• You can submit jobs via the standard qsub command
Customizing an instance
• Edit Galaxy’s configuration
$ sudo su galaxy
$ cd /mnt/galaxy/galaxy-app
$ nano universe_wsgi.ini
• In universe_wsgi.ini, set:
allow_library_path_paste = True
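The same change can be scripted instead of made by hand in nano. The sketch below flips the option with Python's standard configparser on an in-memory sample; the sample text is a made-up excerpt, the function name is hypothetical, and the `[app:main]` section name is an assumption about how universe_wsgi.ini is laid out, so check your own file first.

```python
import configparser
import io

# Hypothetical excerpt; the [app:main] section name is an assumption.
SAMPLE_INI = """\
[app:main]
allow_library_path_paste = False
"""


def enable_path_paste(ini_text: str) -> str:
    """Return the INI text with allow_library_path_paste set to True."""
    # interpolation=None leaves any '%' characters in values untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    parser.set("app:main", "allow_library_path_paste", "True")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(enable_path_paste(SAMPLE_INI))
```

configparser discards comments when it writes the file back, so for a heavily commented configuration file the manual edit shown above may be the safer route.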
Controlling Galaxy
• Start/stop the Galaxy application
• Add an admin user
• Use the email you registered with
S3 bucket as a data library
• Within Galaxy, create a Data Library, using the S3 bucket path as the data source (/mnt/workshop-data)
• This will import all the datasets into the Data Library
• Import the datasets into a history
Expanding the tool palette
• Galaxy ToolShed = an App Store for Galaxy
• You need to be an admin to use it
• Browse the Main ToolShed and install the needed tool(s)
Sharing-an-Instance
• Share the entire CloudMan platform
• Includes all user data and even the customizations
• Publish a self-contained analysis
• Make a note of the share-string and send it to your neighbor
Want more tutorials?
genome.edu.au/wiki/Learn
galaxy-tut.genome.edu.au
• RNA-seq (basic and advanced)
• Variant detection (basic and advanced)
• Genome assembly
• Quality control for small RNA
• …