ASU


Page 1: ASU

Agenda:

• Research Computing @ Arizona State University
• Program, Vision and Mission
• Emphasis on Open Source
• Evolution in Genomic Analysis (HPC > MRv2 > Spark)

J.A. Etchings RC@ASU Innovation

Page 2: ASU


Arizona State University has become the foundational model for the “New American University”, a new paradigm for the public Research University that transforms higher education. ASU is committed to Excellence, Access and Impact in everything that it does.

Page 3: ASU
Page 4: ASU

Open-source Data Driven Infrastructure

Google | Open-source | Function
GFS | HDFS | Distributed file system
MapReduce | MapReduce | Batch distributed data processing
Bigtable | HBase | Distributed DB/key-value store
Protobuf/Stubby | Thrift & Avro | Data serialization/RPC
Pregel | Giraph | Distributed graph processing
Dremel/F1 | Impala | Scalable interactive SQL (MPP)
FlumeJava | Crunch | Abstracted data pipelines on Hadoop
In Memory | Spark | In Memory Computation

Data Intensive

Page 5: ASU

TransCORE Framework

Knowledge Engine: Context, Ontologies, Data Elements, Information Models

Middleware:
• Transact: Clinical Research, Life Science Research, Qualitative Research
• Analytic: In-Memory Analysis, Genomic Data, Machine Learning
• Meta-Data Management

Data Resources: Open Big Data File System, Relational, Key/Value, HPC Parallel, HPC SMA, Transactional

Data Reservoir: Big Data, Scratch Space

Internet 2 / SDN Connectivity

Page 6: ASU
Page 7: ASU

The entire human genome of a single man: 3 billion letters, 262,000 printed pages, 3.3 GB

@rikisabatini #TED2016

Page 8: ASU

Clarification & Limitations:

• Yes, we can sequence a genome for $1000
  – Unfortunately, this does not include analysis

• The haploid genome has about 3 billion base pairs, so a diploid genome carries about 6 billion
  – Half come from mom and half from dad, and assembling those haplotypes, especially SNPs that are on the same haplotype, is going to be instrumental in future medical advances

• Other limitations:
  – Batch effects in physical sequencing and in sequencing technology
  – Different software, different versions of software, and infrastructure (Standardization Gap)
  – Batch effects can significantly impede variant discovery (false positives are high)

Page 9: ASU

“NEED TO FOCUS NOT ON BIG DATA,

BUT BIG ANSWERS”

Harper Reed – CTO Obama for America 2012

Page 10: ASU

Tumors are not composed of identical cells: There is likely extreme intratumor heterogeneity

Macro heterogeneity: > 10% frequency in the tumor

Micro heterogeneity: < 10% frequency in the tumor

Page 11: ASU

Use simulations to ask:

• What are the population dynamics of cancer cell populations?

• What is the role of genetic drift in cancer initiation and progression?

• What is the extent of subclonal variation within a tumor at the time of diagnosis?

• Are resistant subclones present in a tumor before the start of therapy?

Page 12: ASU

Model parameters and their values

• Probability of division, bn, which depends on the fitness of each cell

• Mean selection coefficient, used to generate the exponential distribution of selection coefficients = [0.1; 0.01; 0.005]

• Average driver mutation rate per cell division, = [; ; ]

• Generation time: average division time = 4 days*

*S. Jones et al., Comparative lesion sequencing provides insights into tumor evolution. PNAS (2008)
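
The parameter list above can be turned into a small sampling sketch. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and a 365.25-day year is assumed when converting generations to years.

```python
import math
import random

# Values taken from the slide.
MEAN_SELECTION_COEFFICIENTS = [0.1, 0.01, 0.005]
DIVISION_TIME_DAYS = 4


def draw_selection_coefficient(mean_s, rng):
    """Draw one selection coefficient from an exponential distribution with mean mean_s."""
    # random.expovariate takes the rate (1 / mean), not the mean itself.
    return rng.expovariate(1.0 / mean_s)


def generations_to_years(generations, division_time_days=DIVISION_TIME_DAYS):
    """Convert cell generations to years (ASSUMPTION: 365.25 days per year)."""
    return generations * division_time_days / 365.25


# Lower bound on the generations needed to grow from 1 cell to 10^9 cells
# if every cell divided and none died (pure doubling).
MIN_GENERATIONS_TO_1E9 = math.ceil(math.log2(1e9))

if __name__ == "__main__":
    rng = random.Random(0)
    for mean_s in MEAN_SELECTION_COEFFICIENTS:
        draws = [draw_selection_coefficient(mean_s, rng) for _ in range(10_000)]
        print(mean_s, sum(draws) / len(draws))
    print(MIN_GENERATIONS_TO_1E9, generations_to_years(MIN_GENERATIONS_TO_1E9))
```

Pure doubling would need only 30 generations (about 120 days at 4 days per division), so detection times measured in years reflect how often cells die instead of dividing.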

Page 13: ASU

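The model: A branching evolutionary process

The process starts in a single cell with one driver mutation. At each step, each cell undergoes one of:

• Death, with probability 1 - bn
• Division, with probability (1 - u)bn
• Division + driver mutation, with probability u bn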
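
The three outcomes of the branching process can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. The slides do not say how bn depends on fitness, so `division_probability` is a hypothetical rule. The sketch also assumes that a single value s is shared by all drivers and that only one daughter cell gains the new driver.

```python
import random


def step_probabilities(b_n, u):
    """Probabilities of (death, division, division + driver mutation) for one cell."""
    return (1.0 - b_n, (1.0 - u) * b_n, u * b_n)


def division_probability(n_drivers, s, base=0.5):
    """ASSUMPTION (not from the slides): each driver multiplies a base
    division probability of 0.5 by (1 + s), capped at 1."""
    return min(1.0, base * (1.0 + s) ** n_drivers)


def simulate_tumor(s, u, max_cells, max_generations, rng):
    """Run one realization; returns ({n_drivers: cell_count}, generations)."""
    counts = {1: 1}  # the process starts in a single cell with one driver mutation
    generations = 0
    while generations < max_generations:
        total = sum(counts.values())
        if total == 0 or total >= max_cells:
            break  # the clone died out or reached the size limit
        next_counts = {}
        for n, cells in counts.items():
            death, divide, _ = step_probabilities(division_probability(n, s), u)
            for _ in range(cells):
                r = rng.random()
                if r < death:
                    continue  # the cell dies
                if r < death + divide:
                    # division: two daughters with the same drivers
                    next_counts[n] = next_counts.get(n, 0) + 2
                else:
                    # division + driver mutation: one daughter gains a driver
                    next_counts[n] = next_counts.get(n, 0) + 1
                    next_counts[n + 1] = next_counts.get(n + 1, 0) + 1
        counts = next_counts
        generations += 1
    return counts, generations


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to keep the run short.
    print(simulate_tumor(s=0.1, u=1e-3, max_cells=5_000,
                         max_generations=2_000, rng=random.Random(7)))
```

Looping over individual cells keeps the sketch readable but does not scale to 10^9 cells. A run at that size would need per-clone sampling or a distributed implementation.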

Page 14: ASU

The model: A branching evolutionary process

[Figure: timeline. Driver mutation arises; years later, a clone develops; years later, neoplastic progression starts.]

Page 15: ASU

≈ 98% of starting mutant clones die out early

Mean selection coefficient | Driver mutation rate per cell division | Number of realizations | Number of realizations that reached 10^9 cells | Percentage of realizations that reached 10^9 cells (%) | Average time to detection (years)
0.1 | 10^-8 | 10155 | 162 | 1.6% | 17.50
0.1 | 10^-7 | 1948 | 112 | 5.7% | 5.21
0.1 | 10^-6 | 748 | 134 | 17.9% | 1.74
0.1 | 10^-5 | 748 | 111 | 14.8% | 1.62
0.01 | 10^-8 | 6867 | 125 | 1.8% | 19.80
0.01 | 10^-7 | 6866 | 113 | 1.6% | 15.41
0.01 | 10^-6 | 6866 | 120 | 1.7% | 13.85
0.01 | 10^-5 | 6865 | 115 | 1.7% | 11.16
0.005 | 10^-8 | 11951 | 102 | 0.9% | 27.97
0.005 | 10^-7 | 11751 | 112 | 1.0% | 27.91
0.005 | 10^-6 | 11750 | 126 | 1.1% | 22.43
0.005 | 10^-5 | 11750 | 100 | 0.9% | 18.28
completed | | 88265 | 1432 | 1.6% |
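
The totals row and the headline figure can be checked directly from the per-row counts. The sketch below only re-derives numbers already in the table; the variable names are illustrative.

```python
# (mean selection coefficient, realizations, realizations that reached 10^9 cells)
# Rows are copied from the table above, in the order they appear.
ROWS = [
    (0.1, 10155, 162), (0.1, 1948, 112), (0.1, 748, 134), (0.1, 748, 111),
    (0.01, 6867, 125), (0.01, 6866, 113), (0.01, 6866, 120), (0.01, 6865, 115),
    (0.005, 11951, 102), (0.005, 11751, 112), (0.005, 11750, 126), (0.005, 11750, 100),
]


def percent_reached(realizations, reached):
    """Share of realizations that reached the detection size, in percent."""
    return 100.0 * reached / realizations


total_realizations = sum(realizations for _, realizations, _ in ROWS)
total_reached = sum(reached for _, _, reached in ROWS)
overall_percent = percent_reached(total_realizations, total_reached)
died_out_percent = 100.0 - overall_percent

if __name__ == "__main__":
    print(total_realizations, total_reached)
    print(round(overall_percent, 1), round(died_out_percent, 1))
```

The sums reproduce the "completed" row (88265 realizations, 1432 reaching 10^9 cells), and the complement of the overall percentage is the roughly 98% of clones that die out.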

Page 16: ASU

Some tumors develop very quickly (mimics childhood cancers)

(Same results table as on Page 15.)

Page 17: ASU

Some tumors take decades to develop (mimics many adult cancers, like melanoma)

(Same results table as on Page 15.)

Page 18: ASU

Computationally Intensive

• Running until 10^9 cells was not efficient on a laptop
• Most tumors die out before reaching a detectable limit
• Need to reduce run-time, track all mutations, and subclone sizes (Massively)

Page 19: ASU

eQTL Analysis: Generating trillions of hypothesis tests

• 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
• Tested below on 120 billion associations

Example queries:

• “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)”
  o Finishes in several seconds

• “Find all cis-eQTLs across the entire genome”
  o Finishes in a couple of minutes
  o Limited by disk throughput
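
The first example query amounts to a filter followed by a top-k selection on p-value. The sketch below shows that shape on in-memory records. The `Association` record and its field names are hypothetical; the slides do not describe the actual storage layout or query engine.

```python
import heapq
from typing import Iterable, List, NamedTuple

# Scale from the slide: 10^7 loci x 10^4 phenotypes x ~10 tissues.
TOTAL_TESTS = 10**7 * 10**4 * 10


class Association(NamedTuple):
    """Hypothetical record for one locus/gene/tissue hypothesis test."""
    gene: str
    locus: str
    tissue: str
    p_value: float


def top_eqtls(associations: Iterable[Association],
              genes_of_interest: Iterable[str],
              k: int = 20) -> List[Association]:
    """Return the k associations with the smallest p-values among the given genes."""
    wanted = set(genes_of_interest)
    candidates = (a for a in associations if a.gene in wanted)
    # nsmallest keeps only k items in memory, so the input can be streamed.
    return heapq.nsmallest(k, candidates, key=lambda a: a.p_value)


if __name__ == "__main__":
    records = [
        Association("G1", "rs1", "liver", 1e-8),
        Association("G2", "rs2", "liver", 1e-3),
        Association("G1", "rs3", "brain", 1e-12),
        Association("G3", "rs4", "brain", 1e-20),
    ]
    for hit in top_eqtls(records, ["G1", "G2"], k=2):
        print(hit)
```

Because the selection streams over its input, a full scan of this kind is bounded by how fast records can be read, which matches the slide's note that the genome-wide query is limited by disk throughput.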

Page 20: ASU

[Figure: bar chart of eQTL run time. Y-axis: time taken in minutes (0 to 1000). X-axis: number of cores (5, 10, 15), with eQTL-Cases and eQTL-Controls bars at each core count. Legend: Map Reduce, HPC, Apache Spark. Other labels: Cloudera, Hortonworks, MapR.]

Data labels, in extraction order:
862, 306, 473, 168, 404, 138
776, 308, 474, 166, 387, 136
700, 192, 332, 125, 240, 119

Page 21: ASU

• Took a day to get a tumor to 10^7 cells (still 2 orders of magnitude too small)
• Convert code from MATLAB to Scala (Spark)
• Takes seconds to simulate a single tumor
• Ability to generate tens of thousands of possible tumors, and thousands of measurable tumors, with observed dynamics
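
Generating many tumors is practical because each realization is independent of the others. The sketch below shows that structure in plain Python rather than Scala/Spark. It is a deliberately simplified process, assumed for illustration only: a fixed division probability `b`, no driver mutations, and hypothetical function names. Each run depends only on its seed, so the `map` over seeds is the step a cluster framework could distribute.

```python
import random


def tumor_survives(b, detect_cells, seed):
    """One realization: True if the clone reaches detect_cells before dying out."""
    rng = random.Random(seed)  # one independent generator per realization
    cells = 1
    while 0 < cells < detect_cells:
        # Each cell divides with probability b (two daughters); otherwise it dies.
        cells = sum(2 for _ in range(cells) if rng.random() < b)
    return cells >= detect_cells


def extinction_fraction(b, runs, detect_cells, base_seed=0):
    """Fraction of independent realizations that die out."""
    seeds = range(base_seed, base_seed + runs)
    outcomes = map(lambda seed: tumor_survives(b, detect_cells, seed), seeds)
    return 1.0 - sum(outcomes) / runs


def analytic_extinction(b):
    """Extinction probability of this simplified process:
    the smallest root of q = (1 - b) + b * q**2."""
    return 1.0 if b <= 0.5 else (1.0 - b) / b


if __name__ == "__main__":
    b = 0.55  # hypothetical division probability
    print(extinction_fraction(b, runs=2000, detect_cells=200))
    print(analytic_extinction(b))
```

Even with a division probability above one half, most clones in this toy process go extinct, which is qualitatively the behavior the slides report for starting mutant clones.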

Page 22: ASU

Standard Output

[Figure: density of subclone size (number of cells), one panel per combination of mean selection coefficient (0.1, 0.01, 0.005) and μd (10^-8, 10^-7, 10^-6, 10^-5).]

N per panel, for μd = 10^-8, 10^-7, 10^-6, 10^-5:
• Mean selection coefficient = 0.1: N = 162, 112, 134, 111
• Mean selection coefficient = 0.01: N = 125, 113, 120, 115
• Mean selection coefficient = 0.005: N = 102, 112, 126, 100

Page 23: ASU

Standard Output

[Figure: density of resistant subclone size (number of cells), one panel per combination of mean selection coefficient (0.1, 0.01, 0.005) and μd (10^-6, 10^-5).]

N per panel, for mean selection coefficient = 0.1, 0.01, 0.005:
• μd = 10^-6: N = 134, 120, 126
• μd = 10^-5: N = 111, 115, 100

Page 24: ASU

Output to Tableau

[Figure: chart with labels 41%: 1 driver mutation; 10%: 2 driver mutations; 19%: 2 driver mutations.]

Page 25: ASU

Minor subclones that harbor mutations resistant to treatment can result in relapse

[Figure: panels labeled "4 months on drug" and "6 months on drug". Response to vemurafenib (V600E BRAF inhibitor).]

N. Wagle et al., Journal of Clinical Oncology (2011)

Page 26: ASU

Subclonal variation of simulated tumor-1 at diagnosis

Mean selection coefficient = 0.005, u= per cell division, and mean division time = 4 days

[Figure: Population dynamics of cancer cells. Y-axis: number of cells. X-axis: time (years).]

[Figure: Subclonal composition. 17%: 1 driver mutation; 80%: 2 driver mutations.]

Subclone with a resistance mutation: N = 2,682 cells

Resistant mutation rate =

Page 27: ASU

Subclonal variation of simulated tumor-2 at diagnosis

Mean selection coefficient = 0.01, u= per cell division, and mean division time = 4 days

[Figure: Population dynamics of cancer cells. Y-axis: number of cells. X-axis: time (years).]

[Figure: Subclonal composition. 41%: 1 driver mutation; 10%: 2 driver mutations; 19%: 2 driver mutations.]

Subclone with a resistance mutation: N = 224,502 cells

Resistant mutation rate =

Page 28: ASU
Page 29: ASU

Conclusions:

• These results constitute an argument for the development and application of more sensitive technologies for the detection of rare pre-existing subclones that might plant the seeds for rapid clinical relapse.

• Based on the predicted extent of standing subclonal variation, drug-resistant subclones are almost certain to exist before the initiation of treatment.

• Greater subclonal diversity in a tumor may predict a higher likelihood of pre-existing resistance to any conceivable targeted therapy

• Subclonal diversity itself may be a marker of the potential to evolve drug resistance, and therefore may be an important prognostic indicator

• Reducing the time to research output with Apache Spark increases the success probability of targeted therapies
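
The claim that resistant subclones almost certainly pre-exist can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, for illustration only, that each cell division independently produces a resistance mutation with a fixed probability. The rates used are hypothetical, since the resistant mutation rate is not given in this transcript. Growing from 1 cell to 10^9 cells takes at least 10^9 - 1 divisions, because each division adds one cell.

```python
import math


def prob_preexisting_resistance(divisions, resistance_rate):
    """P(at least one resistance mutation) = 1 - (1 - r)**divisions,
    evaluated with log1p/expm1 to stay accurate for tiny rates."""
    return -math.expm1(divisions * math.log1p(-resistance_rate))


# Minimum number of divisions needed to grow from 1 cell to 10^9 cells;
# cell death only increases the true number.
DIVISIONS_TO_DETECTION = 10**9 - 1

if __name__ == "__main__":
    for rate in (1e-10, 1e-9, 1e-8, 1e-7):  # hypothetical rates
        print(rate, prob_preexisting_resistance(DIVISIONS_TO_DETECTION, rate))
```

Under these assumptions the probability rises steeply once the rate approaches one in 10^9 divisions. Counting only the minimum number of divisions makes this an underestimate for real tumors, where many cells die along the way.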

Page 30: ASU

The extent of subclonal variation is predicted by number of distinct dominant clones

Diego Chowell (a,b), James Napier (c), Rohan Gupta (c), Karen S. Anderson (b,d), Carlo C. Maley (b,d,f,1), and Melissa A. Wilson Sayres (b,d,e,1)

(a) Mathematical, Computational and Modeling Sciences Center, (b) Biodesign Institute, (c) Research Computing Center, (d) School of Life Sciences, (e) Center for Evolution and Medicine, Arizona State University, Tempe, Arizona 85281, USA; (f) Center for Evolution and Cancer, University of California San Francisco, San Francisco, California 94158, USA

(1) To whom correspondence may be addressed

E-mail: [email protected] or [email protected] (wilsonsayreslab.org | @mwilsonsayres)

Page 31: ASU