Ruby on bioinformatics
Ruby Conference Taiwan 2014
Ruby on Bioinformatics
Tse-Ching Ho (何澤清), @tsechingho, 2014/4/26
Horse + Stripe = Zebra
Biology + Informatics = Bioinformatics
Age of Big Data
Age of Data Science
High-Throughput Data
❖ Big Data
❖ file size is small but there are many files
❖ file size is large but there are just a few files
❖ Data size of bioinformatics
❖ 1,000,000,000 records for a subject (person) is normal
The Storage Demand is Increasing
from Dr. Yu-Tai Wang
Data Size of Sequencing After 5 Years
https://www.nanoporetech.com
70,000 newborn babies × 500 GB = 3.5 × 10^7 GB = 35 PB
30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB
from Dr. Yu-Tai Wang
1. counted from current NGS data
2. does not include civil medical institutes
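The storage estimates above can be re-derived with a few lines of Ruby. This is only a sketch: it assumes decimal (SI) units, takes the 500 GB per sample from the slide, and uses a hypothetical helper `humanize_gb` to pick the unit. In decimal units the newborn estimate comes to 35 PB.

```ruby
# Decimal (SI) storage units, expressed in gigabytes.
GB_PER_UNIT = { "GB" => 1, "TB" => 10**3, "PB" => 10**6, "EB" => 10**9 }.freeze

# Returns [value, unit] using the largest unit that keeps value >= 1.
# Assumes the input is at least 1 GB.
def humanize_gb(gigabytes)
  unit, factor = GB_PER_UNIT.select { |_, f| gigabytes >= f }.max_by { |_, f| f }
  [gigabytes.fdiv(factor), unit]
end

GB_PER_SAMPLE = 500 # per-sample size taken from the slide

newborns_gb = 70_000 * GB_PER_SAMPLE
patients_gb = 30_000 * 10_000 * GB_PER_SAMPLE

p humanize_gb(newborns_gb) # newborn screening estimate
p humanize_gb(patients_gb) # single-cell patient estimate
```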
Computing Power is Required
❖ HPC
❖ Infiniband cluster
❖ Amazon EC2 cluster
❖ Hadoop cluster
❖ Many CPU cores
❖ Large memory
❖ High I/O efficiency
http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/
http://arstechnica.com/business/2012/04/4829-per-hour-supercomputer-built-on-amazon-cloud-to-fuel-cancer-research/
$4,828.85 per hour
51,132 cores, 58.78 TB RAM
6,742 Amazon EC2 instances
2012
Protein simulation
Cycle Computing System
Ganglia HPC clusters
Deployed by Opscode Chef
http://www.hpcwire.com/2013/07/08/infiniband_snaps_up_strong_super_share/
Is a 10 Gb network enough for I/O?
Embarrassingly parallel: the calculations are independent of each other.
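A minimal sketch of an embarrassingly parallel job, assuming GC content per read as the calculation (the reads and the `gc_content` helper are hypothetical, chosen only for illustration). Each unit of work depends on nothing but its own input, so the parallel and serial results must agree.

```ruby
# Fraction of G/C bases in one sequence; uses only its own input.
def gc_content(sequence)
  sequence.count("GCgc").fdiv(sequence.length)
end

# Hypothetical input: each read is an independent unit of work.
reads = %w[GGCC ATAT GATC AAAC]

# One thread per read; no thread shares state with or waits on another.
# On MRI the global VM lock limits CPU-bound speedup; JRuby threads can
# run on separate cores.
threads = reads.map { |read| Thread.new { gc_content(read) } }
parallel_result = threads.map(&:value)

serial_result = reads.map { |read| gc_content(read) }
p parallel_result
```

Because no result depends on another, the same work could be split across processes or cluster nodes without changing the code that does the calculation.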
http://glennklockwood.blogspot.tw/2013/12/high-performance-virtualization-sr-iov_14.html
Infiniband is good at I/O efficiency
• Interconnect speed.
• I/O performance.
• Infiniband system is about 3.8 GB/s of bandwidth.
• 10 Gb network is about 400 MB/s of bandwidth.
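To put the two bandwidth figures in perspective, here is a rough calculation of how long one data set would take to move at each rate. The 500 GB size is an assumption borrowed from the per-sample figure earlier in the talk, and the calculation ignores protocol overhead and disk speed.

```ruby
# Seconds needed to move `size_gb` gigabytes at `rate_gb_per_s` GB per second.
def transfer_seconds(size_gb, rate_gb_per_s)
  size_gb / rate_gb_per_s.to_f
end

SAMPLE_GB           = 500.0 # assumed data set size
INFINIBAND_GB_PER_S = 3.8   # from the slide
TEN_GIG_GB_PER_S    = 0.4   # 400 MB/s, from the slide

infiniband = transfer_seconds(SAMPLE_GB, INFINIBAND_GB_PER_S)
ten_gig    = transfer_seconds(SAMPLE_GB, TEN_GIG_GB_PER_S)
puts format("Infiniband: %.0f s, 10 Gb network: %.0f s", infiniband, ten_gig)
```

Under these assumptions the transfer takes a little over two minutes on Infiniband and about 21 minutes on the slower network, a ratio of 9.5.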
Data science is about DATA!
Data Scientist Concerns
❖ Data quality
❖ Filtering factors
❖ Statistics
❖ Visualization
❖ Interpretation
Programmers' Concerns
❖ High-throughput data (Big Data) handling
❖ Data format / File format
❖ Data parsing
❖ Statistical tools
❖ Visualization
❖ Profit / Markets
Biology
http://businessintelligence.com/bi-insights/the-personalized-medicine-revolution-is-almost-here/
A Dream of Personalized Medicine
from Dr. Yen-Hua Huang
Genomic Disease
http://www1.imperial.ac.uk/computationalsystemsmedicine/biomolecularmedicine/personalised/
Cure by Medicines
http://scienceroll.com/2008/04/25/personalized-medicine-real-clinical-examples/
Personalized Medicine
http://www.genomicslawreport.com/index.php/tag/personalized-medicine/
Personal Genomic Analysis
http://www.thecureisnow.org/index.php/our-strategy/philosophy-of-tcin/personalized-medicine
http://www.genengnews.com/insight-and-intelligence/personalized-medicine-not-quite-there-yet/77899649/
DNA
http://cisncancer.org/research/what_we_know/omics/personalized_medicine_02.html
DNA Sequencing
http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/
http://www.broadinstitute.org/blog/beyond-genome-new-uses-dna-sequencers
http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php
ID Mapping of Databases
Each node is a database.
Each database has its own unique ID.
These IDs are connected as a network.
I think handling this complexity should be easy for the people sitting here.
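Treating the databases as nodes in a network, finding how to convert one kind of ID into another becomes a shortest-path search. The sketch below uses breadth-first search over a small, hypothetical link table; the database names are real, but the links in `ID_LINKS` and the helper `mapping_path` are invented for illustration and do not reflect the actual bioDBnet graph.

```ruby
# Hypothetical cross-reference graph: each key is a database whose IDs can be
# mapped directly to the IDs of the databases listed as its neighbours.
ID_LINKS = {
  "Ensembl Gene" => ["Entrez Gene", "UniProt"],
  "Entrez Gene"  => ["Ensembl Gene", "RefSeq"],
  "UniProt"      => ["Ensembl Gene", "PDB"],
  "RefSeq"       => ["Entrez Gene"],
  "PDB"          => ["UniProt"]
}.freeze

# Breadth-first search: returns the shortest chain of databases leading from
# `source` to `target`, or nil when the two are not connected.
def mapping_path(links, source, target)
  queue = [[source]]
  seen = { source => true }
  until queue.empty?
    path = queue.shift
    return path if path.last == target

    links.fetch(path.last, []).each do |neighbour|
      next if seen[neighbour]

      seen[neighbour] = true
      queue << path + [neighbour]
    end
  end
  nil
end

p mapping_path(ID_LINKS, "RefSeq", "PDB")
```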
Bioinformatics Sites for Rubyists
Ruby Sites for Bioinformaticians
What programming language is best for a bioinformatics beginner?
Mapping Sequence Data
from Jui-Tse Hsu
Simple Mapping Sequence Data
Illumina exome sequence reads (1. .seq -> fastq 2. Illumina score -> Phred score)
-> Aligner (soap2, bwa, bowtie, etc.) -> Aligned reads
-> Convert to SAM -> Aligned reads (sam file)
-> Compress to BAM -> Aligned reads (bam file)
-> Index, Sort, Remove PCR duplicates (Rmdup) -> Useful reads data (1. cleaned bam file 2. quality control, get statistics, mapped, unmapped, etc.)
-> Call variants (1. SNVs in VCFs 2. structural variants 3. copy number changes, etc.)
-> Visualization in browsers
from Jui-Tse Hsu
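The "Illumina score -> Phred score" step can be sketched as below. This assumes the reads use the Illumina 1.3 to 1.7 encoding (Phred+64) and need re-encoding as standard Phred+33; older Solexa scores use a different formula and are not handled. The helper names and the sample quality string are hypothetical.

```ruby
ILLUMINA_OFFSET = 64 # Illumina 1.3 to 1.7 encoding (Phred+64)
SANGER_OFFSET   = 33 # Sanger / standard FASTQ encoding (Phred+33)

# Decodes a quality string into an array of integer Phred scores.
def phred_scores(quality, offset: ILLUMINA_OFFSET)
  quality.each_byte.map { |byte| byte - offset }
end

# Re-encodes a Phred+64 quality string as Phred+33.
def illumina_to_sanger(quality)
  phred_scores(quality).map { |score| (score + SANGER_OFFSET).chr }.join
end

illumina_quality = "hhdB" # hypothetical quality line from a FASTQ record
p phred_scores(illumina_quality)
p illumina_to_sanger(illumina_quality)
```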
C/C++
❖ Key Algorithms
❖ Written in C/C++
❖ Foundation Tools
❖ BWA
❖ Bowtie / Bowtie2
❖ samtools / bamtools
❖ GMAP / GSNAP
❖ BLAT
❖ Tophat
http://genomebiology.com/2010/11/12/220
Analysis Pipeline
Overview of the RNA-seq analysis pipeline for detecting differential expression
Perl
❖ First language
❖ BioPerl
❖ Ensembl
http://millionchimpanzees.blogspot.tw/2011/09/book-review-learning-perl-sixth-edition.html
Java
❖ Good parts of Java
❖ GATK
❖ Taverna
❖ Hadoop
http://shop.oreilly.com/product/9780596803742.do
R
❖ Statistical tools
❖ Bioconductor
❖ edgeR
❖ Data Mining and Analysis Books
http://exploringdata.github.io/data-visualization-books/analysis/
Python
❖ Young people
❖ Galaxy
http://news.oreilly.com/2008/08/python-for-unix-and-linux-syst.html
The Ruby Way in Bioinformatics
What kinds of libraries do you think are important?
Foundation gems
❖ activerecord
❖ nokogiri
❖ ffi
❖ parallel
❖ bioruby
❖ sciruby
C binding & wrapper
❖ bio-samtools
❖ bio-bwa
❖ bio-affy
❖ bio-faster
❖ mpi-ruby
❖ bio-grid
❖ gsl
❖ rb-gsl
❖ nmatrix
❖ sambamba - D language
Data parser / analyser
❖ bio-genomic-interval
❖ bio-blastxmlparser
❖ bio-assembly
❖ bio-gff3
❖ bio-gff3-pltools
❖ bio-alignment
❖ bio-maf
❖ bio-table
❖ bio-rdf
❖ bio-vcf
❖ bio-velvet
❖ bio-gngm
❖ bio-gag
❖ bio-dbsnp
Data parser / analyser
❖ bio-phyloxml
❖ bio-jplace
❖ bio-gex
❖ bio-ipcress
❖ bio-stockholm
❖ bio-synreport
❖ bio-cigar
❖ bio-wolf_psort_wrapper
❖ bio-hmmer3_report
❖ bio-dbla-finder
❖ bio-newbler_outputs
❖ bio-sra_fastq_dumper
❖ bigbio
Data parser / analyser - protein
❖ protk
❖ mascot-dat
❖ bio-protparam
❖ bio-plasmoap
❖ bio-signalp
❖ bio-exportpred
❖ bio-hydropathy
❖ bio-epitope
❖ bio-bio-orthomcl
❖ bio-isoelectric_point
❖ bio-octopus
❖ bio-tm_hmm
❖ bio-aliphatic_index
Database / Web API
❖ ruby-ensembl-api
❖ bio-ucsc-api
❖ bio-liftover
❖ intermine
❖ bio-eupathdb
❖ bio-krona
❖ bio-sra
❖ bio-sradl
http://www.ensembl.org
Statistics
❖ statsample
❖ statsample-sem
❖ statsample-optimization
❖ statsample-timeseries
❖ distribution
❖ rinruby
http://www.ncss.com/software/ncss/survival-analysis-in-ncss
SVG & Graph
❖ rubyvis
❖ plotrb
❖ bio-svgenes
❖ bio-vis
❖ gnuplot
http://rubyvis.rubyforge.org
Tools
❖ minimization
❖ integration
❖ quorum - Rails engine
I Am Not an Analyst, I Am a Programmer.
What can I get involved in?
Pipeline / Workflow
Galaxy - Python
Taverna - Java
??? - Ruby
Web System
❖ Data warehouse
❖ Pipeline management
❖ Coordination center
❖ Visualisation
Cloud / Distributed / Parallel
http://www.mynamesnotmommy.com/yes-there-are-dumb-questions/question-mark/
What Are We Doing with Ruby?
Ensembl Virtual Machine
❖ Powered by Veewee, Vagrant and Chef
❖ Automatically builds a versioned Ensembl system (Perl)
❖ Includes database, queuing services and analysis tools
❖ Multiple sites, multiple species in one virtual machine
❖ Helps to build local and custom systems
from Tse-Ching Ho
Ensembl Virtual Machine
Use an existing Vagrant box
Prepare SOP for Chef recipes
Provision VM with Chef recipes
Write Chef recipes
Export VM with VirtualBox
Set up Vagrantfile
Create Vagrant box with Veewee
Write definition of Vagrant box with Veewee
Ensembl VM Automation
from Tse-Ching Ho
Ensembl Virtual Machine
Web view of Ensembl
from Tse-Ching Ho
DR. RAW
❖ Derived from DRAW and SneakPeek
❖ Composed of C/C++, Bash, Perl, Java and Ruby
❖ Covers both DNA and RNA resequencing analysis
❖ Enhanced quality control for DNA and RNA
❖ Distributed computing pipeline
❖ Supports PBS, LSF and SGE platforms (queuing systems)
from Hannah Lin
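Supporting several queuing systems mostly means emitting a different submission command for each one. The sketch below is not taken from DR. RAW; it is a hypothetical dispatcher that assumes the standard `qsub -N` (PBS, SGE) and `bsub -J` (LSF) job-name options, and it only builds the command string without running it. The job name and script path are made up.

```ruby
require "shellwords"

# Command templates per queuing system. `qsub -N` and `bsub -J` both set the
# job name; LSF reads the job script from standard input.
SUBMITTERS = {
  pbs: ->(name, script) { "qsub -N #{name.shellescape} #{script.shellescape}" },
  sge: ->(name, script) { "qsub -N #{name.shellescape} #{script.shellescape}" },
  lsf: ->(name, script) { "bsub -J #{name.shellescape} < #{script.shellescape}" }
}.freeze

# Builds (but does not run) the submission command for one pipeline step.
def submit_command(platform, job_name, script_path)
  builder = SUBMITTERS.fetch(platform) do
    raise ArgumentError, "unsupported queuing system: #{platform}"
  end
  builder.call(job_name, script_path)
end

puts submit_command(:sge, "dna_qc", "steps/dna_qc.sh")
puts submit_command(:lsf, "dna_qc", "steps/dna_qc.sh")
```

Keeping the platform differences in one table lets the rest of a pipeline stay unaware of which scheduler is installed.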
DR. RAW
Analysis Tools
Analysis Pipeline
Quality Control
Resource Manager System
DNA QC Forward : Reverse
RNA QC Forward : Reverse
BWA-0.7.7
Samtools-0.1.19
GATK-3.1
GSNAP-13-10-25
Cufflink-13-11
FusionGene …
DNA Sequencing data
RNA Sequencing data
SGE (Sun Grid Engine)
PBS (Portable Batch System)
LSF (Load Sharing Facility)
Green: new components
Red: updated components
from Hannah Lin
DR. RAW
Web view by Rails
from Hannah Lin
Neo4j - JRuby Data Parser
❖ Graph database for data integration of discrete clinical research documents
❖ Original data are Excel/CSV files collected at different times by different people
❖ Neo4j is good for cleaning up such a massive data set
❖ Cooperation between biologists and programmers
from Wei-Ming Wu, Chia-Hsuan Lee
Neo4j - JRuby Data Parser
Collision Rate of Input Data: 1.3 %
from Wei-Ming Wu, Chia-Hsuan Lee
API Server for Third Party Firm
❖ API server based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Import Excel files to third-party GUI client
❖ Third-party server sends XML requests to API server
from Wei-Ming Wu, Sean Wang
API Server for Third Party Firm
TCHC server
API server(rails, jruby)
CSIS (java, oracle)
Send data by XML
Write into database
Read data by client program
Upload data
Parse request
Third Party
Our Servers
Windows GUI
from Wei-Ming Wu, Sean Wang
Daily Checking Rule
❖ Based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Users can define rules for checking data, usually values in filled-in forms
❖ Checking rules run daily, not before forms are filled in
from Wei-Ming Wu, Sean Wang
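One way to picture user-defined checking rules is as named predicates applied to form records in a daily batch. The sketch below is an illustration only, not the system's actual design: `CheckRule`, `daily_check`, the rules and the records are all hypothetical, and plain hashes stand in for the ActiveRecord models.

```ruby
# A user-defined rule: a name, the form field it inspects, and a predicate.
CheckRule = Struct.new(:name, :field, :predicate) do
  def violated_by?(record)
    !predicate.call(record[field])
  end
end

# Hypothetical rules; in a real system these would be stored in the database.
RULES = [
  CheckRule.new("age in range", :age, ->(v) { v.is_a?(Integer) && (0..120).cover?(v) }),
  CheckRule.new("weight present", :weight_kg, ->(v) { !v.nil? })
].freeze

# Runs every rule against every record; returns [record id, rule name] pairs.
def daily_check(records, rules)
  records.flat_map do |record|
    rules.select { |rule| rule.violated_by?(record) }
         .map { |rule| [record[:id], rule.name] }
  end
end

records = [
  { id: 1, age: 34, weight_kg: 70.5 },
  { id: 2, age: 150, weight_kg: nil }
]
daily_check(records, RULES).each { |id, rule| puts "record #{id}: #{rule}" }
```

Running the rules as a batch job rather than as form validation means data entry is never blocked; violations are reported afterwards for review.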
Patient Randomization
❖ Based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Assign patients to different groups by a randomization method
❖ Cooperation between statisticians and programmers
from Wei-Ming Wu, Sean Wang
Assign patients to treatment groups
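The slides do not say which randomization method was used. As one common possibility, the sketch below shows permuted-block randomization, which keeps group sizes balanced as patients enrol. The function name, arm labels, block size and seed are all assumptions made for illustration.

```ruby
# Permuted-block randomization: within every full block each arm appears
# equally often, so group sizes stay balanced as patients enrol.
def block_randomize(patient_ids, arms:, block_size:, rng: Random.new)
  unless (block_size % arms.size).zero?
    raise ArgumentError, "block_size must be a multiple of the number of arms"
  end

  template = arms * (block_size / arms.size)
  assignments = {}
  patient_ids.each_slice(block_size) do |block|
    shuffled = template.shuffle(random: rng)
    block.each_with_index { |id, i| assignments[id] = shuffled[i] }
  end
  assignments
end

patients = (1..8).to_a # hypothetical patient IDs
groups = block_randomize(patients, arms: %w[treatment control],
                         block_size: 4, rng: Random.new(2014))
p groups.values.tally
```

Passing a seeded `Random` makes an allocation reproducible for auditing; a production system would also need to conceal upcoming assignments from the people enrolling patients.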
Database Statistics Dashboard
❖ Based on Rails, run by JRuby
❖ ActiveRecord models for Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ google_visualr gem for visualization
❖ Count number of projects, forms, fields, records and patients
from Wei-Ming Wu, Winnie Lui
Education
Learning Bioinformatics
❖ http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
❖ http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html
❖ http://www.mygoblet.org - Python, R
❖ http://www.biotnet.org
Python Book for Bioinformatics
http://shop.oreilly.com/product/9780596154516.do
Python is much more successful in teaching than Ruby
Do we lack a killer application in Ruby?
http://www.witardroadbaptist.org/im-new/im-not-sure-im-ready-for-church-yet/
We Need People!!
Are You Ready to Be a Data Scientist
or a Bioinformatician?
Markets
http://www.genengnews.com/gen-articles/personalized-medicine-health-economic-aspects/4824/
http://www.bccresearch.com/market-research/biotechnology/bioinformatics-market-technology-bio051b.html
Still developing. Does Asia have enough market share?
Topics to act on
❖ data generation and data management
❖ data analysis and software
❖ data processing and storage
❖ application of bioinformatics in pharma research and development
http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html
Health Care in Cloud
❖ Health promotion cloud
❖ Vaccination cloud
❖ Exercise cloud
❖ Workplace wellness
❖ Physical checkup cloud
❖ Welfare cloud
from Dr. Chi-Hung Lin
Code For Bioinformatics
Q & A