Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical...

103
Statistical analysis of RNA-seq data Etienne Delannoy 1 and Marie-Laure Martin-Magniette 1,2 1- IPS2 Institut des Sciences des Plantes de Paris-Saclay 2- UMR AgroParisTech/INRA Mathematique et Informatique Appliquees E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 1 / 94

Transcript of Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical...

Page 1: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Statistical analysis of RNA-seq data

Etienne Delannoy1 and Marie-Laure Martin-Magniette1,2

1- IPS2 Institut des Sciences des Plantes de Paris-Saclay

2- UMR AgroParisTech/INRA Mathematique et Informatique Appliquees

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 1 / 94

Page 2: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Table of contents

1 Introduction

2 Normalization

3 Differential analysis

4 Clustering of transcriptomic data

5 Experimental design

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 2 / 94

Page 3: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Aims of the lecture

Quantitative analysis of gene expressionOverview of the different steps of the analysisIt is not exhaustive

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 3 / 94

Page 4: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Design of a transcriptomic project

Biological question↓ ↑

Experimental design

choice of the technology and type of analysis↓

Data acquisition↓

Data analysis

normalization, differential analysis, clustering, ...↓ ↑

Validation

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 4 / 94

Page 5: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

HTS data characteristics

Some statistical challenges of HTS dataDiscrete, non-negative, and skewed data with very large dynamicrange (up to 5+ orders of magnitude)Sequencing depth (= “library size”) varies among experimentsTotal number of reads for a gene ∝ expression level × length

Gene 1 Gene 2

Gene 1 Gene 2

Sample 1

Sample 2

To date, most methodological developments are for experimentaldesign, normalization, and differential analysis...

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 5 / 94

Page 6: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Table of contents

1 Introduction

2 NormalizationDifferent types of normalizationComparaison of 7 normalization methodsConclusions on normalization

3 Differential analysis

4 Clustering of transcriptomic data

5 Experimental design

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 6 / 94

Page 7: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Normalization

DefinitionNormalization is a process designed to identify and correcttechnical biases.Two types of biascontrolable biases: the construction of cDNA librariesuncontrolable biases: sequencing process

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 7 / 94

Page 8: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Two types of normalization

Within-sample normalizationEnabling comparisons of genes from a same sampleNot required for a differential analysisNot really relevant for the data interpretationSources of variability: gene length and sequence composition(GC content)

Between-sample normalizationEnabling comparisons of genes from different samplesSources of variability: library size, presence of majority fragments,sequence composition due to PCR-amplification step in librarypreparation‘(Pickrell et al. 2010, Risso et al. 2011)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 8 / 94

Page 9: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Between-sample normalization: the scaling factor

DefinitionFor sample j , let Ygj be the raw count for gene g.The normalized count is defined by:

Ygj

sj,

where sj is the scaling factor for the sample j .

Three types of methods:Distribution adjustmentMethod taking length into accountThe Effective Library Size concept

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 9 / 94

Page 10: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Distribution adjustment

Let n be the number of samples in the project

Total read count normalization (Marioni et al. 2008)

sj =Nj

1n∑n

`=1 N`

, where Nj =∑

g

Ygj

Upper Quartile normalization (Bullard et al. 2010)

sj =Q3j

1n∑

`=1 Q3`

, where Q3j = Y( 34 [G+1])j

Q3j is computed after exclusion of transcripts with no read countMedian

sj =mediangYgj

1n∑n

`=1 mediangYg`

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 10 / 94

Page 11: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Method taking length into account

RPKM: Reads Per Kilobase per Million mapped readsMotivation greater lane sequencing depth and transcript length=> greater counts whatever the expression levelAssumption read counts are proportional to expression level,transcript length and sequencing depth (same RNAs in equalproportion)Method divide gene read count by total number of reads (inmillion) and transcript length (in kilobase)

Ygj

NjLg× 103 × 106 (1)

RPKM method is an adjustment for library size and transcriptlengthAllows to compare expression levels between genes of the samesampleUnbiased estimation of number of reads but affect the variability.(Oshlack et al. 2009)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 11 / 94

Page 12: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Method based on the Effective Library Size

Relative Log Expression (RLE)compute a pseudo-reference sample: geometric mean acrosssamples (less sensitive to extreme value than standard mean)

(n∏

`=1

Yg`)1/n

calculate normalization factor

sj = mediangYgj

(∏n

`=1 Yg`)1/n

normalize them such that their product equals 1

sj =sj

exp[ 1n∑

` log s`]

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 12 / 94

Page 13: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Method based on the Effective Library Size

Trimmed Mean of M-values (TMM)Assumption: the majority of the genes are not differentially expressed

●●

● ●●●

●●

●●

● ●

●●

●●

●●

●●

●●● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●● ●

● ●

●●

●●

● ●

●●

●●

●●

●●

● ●●

●● ●●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●● ●

●●●

●●

●●

● ●●

● ●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●●

●●

● ●

● ●●

● ●

●●

● ●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ● ●●●

● ●●●

●●

● ●

●●

●●

●●

● ●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●●

●●

●●●●

● ●

● ●

●●

●●

●●

● ●

●●

●●●

●●

●●

● ● ●

●●

●●

●●

●●

●●

● ●

● ●

●● ●

●●

●●

●●

●●

●●●

●●●

●●

●●

● ●

●●

●● ●

● ●

● ●

●● ●

● ●

● ●

●●

●●

● ●●

●●

●●

●●

●●

●●

●● ●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●●

●●

● ●

●●

●●

●●

●●● ●

●● ●●

● ●

●●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

● ●● ●

●●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●

● ●●

●●

●●

● ●●

●●

● ●

●● ●

●●

●●

●●

●●

● ●

● ●●

●●

●● ●

● ●

● ●

●● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●● ●●

●●

●●

● ●

●●

●●

●●

●●

● ●●

●●

●●

● ●

●●

●●

● ●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

● ● ●●

● ●

● ●

●●

●●●

●●

●●

●●

●●

●●●

●●●

● ●

●●

●●

●●

●●

●●● ●

●●

● ●

●●

●●

●●

●●

●● ● ●

●●

●●

●●

●●

●●

●●

● ●

●●●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●●●

●●

●●

●●

●● ●

●●

● ●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●●

●●

●●●

●●

● ●

●●

●●

●● ●

● ●●

●●

●●

●●

● ●

● ●

●●

● ●

●●

●●

●●●

● ●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●● ●

● ●●

● ●

●●

●●

● ●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

● ●

● ●

● ●●

● ●

●●●

●●

●●

● ●●

●●

●●

●●

●●●

● ●

●●

●●

●●

● ●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●● ●

●●

●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●● ●

●●

●●

● ●

●●

●●

●●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●●

●●

● ●

● ●

●●

●●●

● ●●

●●

●●

●●

●●

●●

●●

● ●●●

●●

● ●

●●

●●●

● ●

●●

●●●

● ●●

● ●

●●●

● ●

●●

●● ●

●●

●●●

●●

●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●● ●

● ●

●●

●●

●●

● ● ●

●●

●●

● ●

●● ●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ● ●

●● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

● ●

●●● ●

●●

● ●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

●●

●●

● ●

●●

●● ●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

● ●

●●● ●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

●●

● ●

● ●

●●●●

●● ●

●●

● ●●●

● ●

●●

●●

●●

●●

●●

●●

● ●●

●●

● ●●

●●●

●●

●●

● ●

● ●

● ●●

● ●

●● ●

●●

●●●

●●

● ● ●

● ●

●●

●●

●●

●●

● ●●

●●

●● ●

●●

●●

● ●

●●

●● ●

●●

●●●

●●

●●

●●●

● ●

●●

●●

●●

● ●

● ●

●●●

● ●

●●

●●

●●

● ●

●●

●●

● ●

●● ●

●●

●●

● ●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●●

●●●

●● ●

●●

●●

●●

●●

●●

●● ●

●●

●●

● ●

●●

● ●●

● ●

●●

●●

●●

● ●

●●

●●

●● ● ●

● ●

●●

● ●

● ●

●●●

●●

●●

●●

● ●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●● ●

●●

● ● ●

●●

●●●

●●

● ●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●● ●

●●

●●

●●

● ●● ●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●● ●

●●

●●

●●

●●●

●●

●●

●●

●●●

●● ●●

●●

●● ●●●●

●●

● ●

●●

● ●

●●

●●

● ●

●●

● ●

●●●

● ●

●●

●●

●●

●●

●●

●●

● ●● ●●

● ●

●●

● ●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●● ●

●●

●●

●●●

● ●●

● ●

●●

●●

●●

●●●

●●● ●●

●●

●●

●●

●●

●●

●●

●● ●

●●●

● ●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

●● ●●

●● ●

●●

●●

● ●

● ●

●●

●●

●● ●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

● ●

●●●

●●●

● ●●

●●

●●

● ●

● ● ●

●●

●● ●

●●

● ●

●●

●●

●●

● ●●

● ●

●●

● ●

●●●

●●

●●

● ●

● ●

● ●

●●●

● ●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●●

●● ●

●●

●●

● ●●

●●

●●

●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

●●

● ●

●●

●●

●● ●●

●●

●●●

●●

● ●

●●●

● ●

●●

●●

●●

● ●

● ●

●●

●●

●●

● ●

●●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●● ●●

●●

● ●

●●

●●●

●●

● ●

●●

● ●

●●

●●●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●●

●●

● ●

●●

●●

●●

●● ●

●●

●●

●● ●

●●

●●

●●

● ●

●●●

● ●

● ●

●●

●●●

● ●

●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●●●

●●

● ●

●●●

●●

● ●

●●

● ●

● ●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●●

●●

●●

●● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ● ●

●●

●●

●●

●●

●●●

●●

●●●

●●

● ●

●●

●●

●●●

● ●●

● ●

●●

●●

●●

●● ●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●● ●

●●●

●● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●●● ●

●●

●● ●

●●

● ●

● ●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●●

● ●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

● ●

●●●

●●

● ●

●●

● ●●

●●

●●

● ● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●● ●

● ●

● ●

● ●

●●

●●

●●

● ●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●● ● ●

● ●●

● ●

●● ●●

●●

●●

●●

● ● ●

● ●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●● ●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●● ●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

● ●

●●

●● ●

●●

● ●

●●●

●●● ●

●●●●●

●● ●

●●

●●

●●

●●

●●●

●● ●

● ●

●●

●●

● ●●

●●

●●

●●●

● ●

●●

●●

●●

● ●

● ●

●●

●●

●●

●● ●

●●●

●●

●●

● ●● ●

● ●

●●

●●

● ●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●●

●●

●●

●●

●●

● ●

● ●

●● ●

●●

●●

●●

●●

●●

●●

●● ●

● ●

●●

●●

●●

●●● ●

● ●

●●

● ●

●●

●●

●●

●●

●● ●

●●

●●●

● ●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

● ●

●●

●● ●

●●

●●

●●

●●

●● ●

●●

● ●●

●●●

●● ●

●●

● ●

● ●●

●●

● ●

● ●●

●●●

●●●

● ●

●●

●●

●●●

● ●

●●

●● ●

●●

●●

●●

●●

●●●

●●

● ●

● ●

●●●● ●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●●●

●●●●● ●

● ●

●●

●●

●●●

●●

●●

●●

● ●● ●

● ●

● ●●●

●●

●●

●●

● ●

●●

●●

●●● ●

●●

● ●

●●

● ●

●●●

●●

●●

●●●

●●

● ●

●●● ●

●●

●●

●●

● ●

●●●●

●●●

● ●

●●

●● ●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

● ●

●●

●●

● ●●●

●●

●●

● ●●

●●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●●

● ●●

●●

● ●

●●

●●

●●

●●

●●

●●●●

●●

● ●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

● ●

●●●

●●

●●

● ●

●●

●●

●●

●●●●

●●

●●

● ●

● ●

●●● ●●

●●

●●

●●

●●●

●● ●

●●

●●●

●●

●●

● ●

●●

●●

●●

● ●

● ●

● ●

● ●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●●

● ●

●●

● ●

●●

●●

●●

●●

● ●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●● ●

●●●

● ●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●●

●●

●●●

●●

● ●

●● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ● ●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

● ●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●●

●●

● ●

●●

●●●

●●

●● ●●

●●●

●●

● ●

●●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●● ●

●●

●●

●●

● ●

●●

●● ●●

●●

●●

●●

●●●

●●

●●

● ●

●● ●

●●

●●

●●

● ●

●●

● ●●

● ●

●●

●●

●●

●●

●●●

●●

● ●●

●● ●

●●

●●

●●

●●

●●

●●

● ●

● ●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

● ●

●●

●● ●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●● ●

●●

●●●

●●●

●●

●●

● ●●

●●

●●

●●

●●

●●●

●●

●●

●●

● ●

● ●●●

●●

●● ●

● ●

●●●

●●

●●

●●

●●●

●●●

●● ●

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

● ●●

●●

● ● ●

●●

●●

●● ●

● ●

●●

●●

●●

●●

●● ●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●● ●●

●●

● ●

●●

●●

●●

●●●

● ●

● ●

●●

●●

● ●

● ●

●●

●●

●●

● ●●

● ●

● ●

●●

●●

●●

●●●

● ●●●

● ●●

● ●

●●

●●

●●

● ●●

●●

●●

● ●

●●

● ●

●●

●●

●●●

●●●

●●

●●●

●●

● ●●

●●

●●

● ● ●

●● ●

●● ●●●●

●●

●●

●●

●●●

● ●●

●●

●●

●●

●●

●●●

●●

●●

●●

●● ●

●●

● ●

●●

●●

●●

● ●

● ●

●●●●

●●

●●

●●

●●

●● ●

●●●

●●

●●

●●

●● ●●

● ●

●●

● ●

●●

●● ●

●●

●●

●●●

●●

●●

● ●

●●

● ●

●●

●●

●●

● ●

●●

●● ●●

● ●

●● ●

● ●

●●

●● ● ●

● ●

●●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

● ●

●●

● ●

●● ●

●●●

●●

●●

●●

●●

● ●● ●

●● ●●

● ●●

●●

●●

●●●

●●

● ●

● ●

● ●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●● ● ●

●●

●●●

●●

●●

● ●●

●●

●● ●

●●

●●

● ●

●●

●●

● ●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●●

●● ●

● ●

●●

●● ●

●●

●●

● ●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

● ●

●●●

●●

●●

●●●

●●●

● ●

●●

●●

●●● ●

● ●●

●●

● ●

●●

● ●

●●

●●●

● ●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

● ●

●●

● ●

●● ●

●●●

●●

●●● ●

●●●

●●

●● ●

●●

● ●●

●●

●●●

●●

●●●

●●

●●

●●

●●● ●● ●

● ●

● ●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●● ● ●

●●

● ●

●●

●●

●●

● ●

● ●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●●

●●●

● ●

●●● ●●

● ●

●●

●●

●●

●●

●●

●●

● ●

● ●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●● ●

●●

●●

●●●

●●

● ●

●●

●●

●●

●●

●● ●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●●

●●

●● ●

●●

● ●

●●●

● ●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●●

●●

●●

●●

●●

●● ● ●

● ●

●●

●●

●●

● ●

●●●

● ●

●●●

●●

●●

●● ●

●●

● ●●

●●

●●

●●

●●●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

● ●

● ●

●●

●●

●●

●● ●

●●

●●

● ● ●●

●●

●●

● ●

●●

●●

●●

● ●●●

●●

●●●

●● ●

● ●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●●

●●●

● ●

●●

●●

●●●

●●

●●

● ●

●●

●●●

●●

● ●

●●

● ●

●●

●●

●●

●●

● ●

●●●

● ●●●

●●

●●

●●

●● ●

●●

●●

●●●

● ●●

●●

●●

●●

● ● ●

● ●

●●

●●

●●●

●●

●● ●

●●

●●●

●●

●●

● ●

●●

●●

●●

●● ●

●●

●●

●● ●

●● ●

●●

●●

●●● ●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●●●

●●

● ●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●● ●●

●●

●●

●●●

●●

●●●

● ●

● ●

●●●

●●

● ●●

● ●

● ●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●●

●●

●●

● ●● ●

●●

● ●

●●

● ●●●

●●

●●● ●

●●

●●

●●

●●

●●

●●

●● ●

● ●●

●●

●● ●

●●

●●

●●

●●

●●●● ●

● ●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●● ●

● ●

●●

●●

● ●

● ●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●●

●●

● ●

●●

●●

●● ● ●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●● ●

● ●●

●●

●● ●

●●

● ●

●●

● ●●

● ●

●●

●●

●●●

●●●

●●

●●

●●

●●

● ●

●●

●●●

● ● ●●

●●●

●●

● ●

●●

● ●●

●●

●● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

● ●

●●●

●●

●●●●●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●● ●● ●

● ●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●●

●●

● ●

●●

● ● ●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

● ● ●

●●●

●●

●●

●● ● ●●●

● ●

●●

●●

●●

●●

● ●●

●●

●●●

●●

●● ●

●●

●●

●●

●●

●● ●

●●

● ●

●●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

● ●

●● ●

●●●

●●

●●●

●●

●●

● ●

● ●

● ●

● ●●

●●

●●

● ●

●●

● ●

●●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●●

● ●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●● ●

● ●

●●

●●

●●

●● ●

●●●

● ●

●●

● ●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●● ●

●●

● ●

● ●

●●● ●●

●●

●●

● ●

●●

●●

●●

●●

● ●●

●●

●● ●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●●

● ●

●●●

●●

●●

●●

●● ●

●●

●●

● ●●

●●

● ●●

●●

● ●

●●

●● ●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

● ●●

●●

●●

●●

●● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

● ●

●●

●●

● ●

●●

●●

●●●●

● ●

●● ● ●

●●

●●

● ●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●● ●

●●

●●

●● ●

●●

●●

●●

●● ●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ● ●

●● ●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

● ●

●●

● ●●

● ●

●●

●● ●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●●

● ●

●●

●●

● ●

● ●

●●

● ●

●●

●●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●● ●

● ●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

●●

●●

●●

●● ●

●●

● ●

●●

●●●

●●

● ●

●●●

●●

●●

●●●

●●

●● ●

● ●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

● ●

●●

●●

●●

●●

● ●●

● ●

● ●

●●

●●

●●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●●

●●

● ●

●●

● ●

●●

●●

●● ●

●●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●●● ●

●●

●●

●●

●●

●●

● ●

●● ●

● ●

●●

●● ●●

●●

●●

●●

● ●●

●●

●●

●●

●●●

● ● ●

●●

●●●

●●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●● ●

−25 −20 −15 −10

−10

−5

05

10

averaged expression

expr

essi

on d

iffer

ence

Filter on genes with nul countsFilter on the resp. 30% and 5%more extreme values of M r

gj andAr

gj

where

M rgj = log2(

Ygj/Nj

Ygr/Nr)

Argj = [log2(

Ygj

Nj) + log2(

Ygr

Nr)]/2

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 13 / 94

Page 14: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

TMM normalization

AlgorithmSelect the reference r as the library whose upper quartile isclosest to the mean upper quartile.

Compute weights w rgj = (

Nj−YgjNj Ygj

+Nr−YgrNr Ygr

)

Compute TMM rj =

∑g∈G? w r

gj Mrgj∑

g∈G? w rgj

Definesj = 2TMM r

j

Normalize them such that their product equals 1

sj =sj

exp1n∑` s`

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 14 / 94

Page 15: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Which normalization method ?

At lot of different normalization methods...Some are part of models for DE, others are ’stand-alone’They do not rely on similar hypothesesBut all of them claim to remove technical bias associated withRNA-seq data

QuestionsWhich one is the best ?Which criteria are relevant for this choice ?

A comprehensive evaluation of normalization methods for Illuminahigh-throughput RNA sequencing data analysisFrench StatOmique Consortium (2012) doi : 10.1093./bib/bbs046

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 15 / 94

Page 16: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Comparison of 7 normalization methods

Differential analyses on 4 real datasets (RNA-seq or miRNA-seq) andone simulated datasetat least 2 conditions, at least 2 bio. rep., no tech. rep.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 16 / 94

Page 17: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Comparison procedures

Distribution and properties of normalized datasetsBoxplots, variability between biological replicates

Comparison of DE genes

Differential analysis by exact test: DESeq v1.6.1, default parameters

Number of common DE genes, similarity between list of genes(dendrogram - binary distance and Ward linkage)

Power and control of the Type-I error rate

simulated data

non equivalent library sizes

presence of majority genes

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 17 / 94

Page 18: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Normalized data distributionWhen large diff. in lib. size, TC and RPKM do not improve over the rawcounts.

Example: Mus musculus datasetE. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 18 / 94

Page 19: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Within-condition variability

Example: Mus musculus, condition D dataset

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 19 / 94

Page 20: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Lists of differentially expressed (DE) genes

For each dataset(gene x method) binarymatrice:

1: DE gene0: non DE gene

Jaccard distancebetween methodsdendrogramm, Wardlinkage algorithm

Consensus matriceMean of the distancematrices obtained from eachdataset

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 20 / 94

Page 21: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Type-I Error Rate and Power (Simulated data)Inflated FP rate for all the methods except TMM and DESeq

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 21 / 94

Page 22: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

So the Winner is ... ?

In most casesThe methods yield similar results

However ...Differences appear based on data characteristics

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 22 / 94

Page 23: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Conclusions on normalization

RNA-seq data are affected by technical biaises (total number ofmapped reads per lane, gene length, composition bias)

Csq1: non-uniformity of the distribution of reads along the genomeCsq2: technical variability within and between-sample

Normalization by gene length isn’t required for the differentialanalysis.

Normalisation is necessary and not trivial.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 23 / 94

Page 24: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Normalisation step is specific of the group ofsamples considered

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 24 / 94

Page 25: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Conclusions on normalization

Differences between normalisation methods when genes withlarge number of reads and very different library depths

TMM and DESeq : performant and robust methods in a differentialanalysi context

Risso et al (2014) proposed a new method (RUVSeq). It is a factoranalysis based on a suitably-chosen subset of negative controlgenes

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 25 / 94

Page 26: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

TP session

Directory Script, look at NormalizationMethods.RRead the codeRun the code and observe the normalized count distribution

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 26 / 94

Page 27: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Table of contents

1 Introduction

2 Normalization

3 Differential analysisIntroductionTest for RNA-seq dataMethod evaluation

4 Clustering of transcriptomic data

5 Experimental design

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 27 / 94

Page 28: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Statistical hypothesis test

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 28 / 94

Page 29: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Statistical hypothesis test

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 29 / 94

Page 30: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Statistical hypothesis test

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 30 / 94

Page 31: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Principle of a hypothesis test

Construction of a testFormulate the two hypothesesConstruct the test statisticDefine its distribution under the null hypothesisCalculate the p-valueDecide if the null hypothesis is rejected or not

Definition of a p-valueIt is the probability of seeing a result as extreme or more extreme thanthe observed data, when the null hypothesis is true

Decision ruleThe null hypothesis is rejected if the p-value is lower than a giventhreshold

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 31 / 94

Page 32: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Decision table

H0 true H0 falseno difference difference

Do not reject H0 Right decision Wrong decisiontype II error

Reject H0 Wrong decision Right decisiontype I error

Acceptable errorsIn a type I error, the null hypothesis is really true but the statisticaltest has led you to believe that it is false. It is a false positive.In a type II error, the null hypothesis is really false but the test hasnot picked up this difference. It is a false negative.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 32 / 94

Page 33: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Multiple testing

The result of a test can be viewed as a random variable:0 if the result is a true positive1 if the result is a false positive

By definition, P(to be a false positive)=α

QuestionPerform G=10000 tests at level αWhat is the expected number of false-positive ?

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 33 / 94

Page 34: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Contingency table for multiple hypotesis testing

True Falsenull hypotheses null hypotheses

Declared True Negatives False Negatives Negativesnon-significant

Declared False Positives True Positives Positivessignificant

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 34 / 94

Page 35: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

P-value adjustment

Adjust the raw p-values to controlFWER = P(FP > 0) (Bonferroni procedure)FDR = E(FP/P) if P > 0 or 1 otherwise (Benjamini-Hochbergprocedure)

Calculating adjusted p-valuesStart with (unadjusted) p-values for G hypotheses

1 Order the p-values p(1) < . . . < p(G)

2 Multiply each p(i) by its adjustment factorBonferroni: ai = GBenjamini-Hochberg : ai =

Gi

3 Let p(i) = aip(i)4 Set p(i) = min(p(i) , 1) for all i

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 35 / 94

Page 36: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Test for RNA-seq data

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 36 / 94

Page 37: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Distribution for count data

From D. Robinson and D. McCarthy

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 37 / 94

Page 38: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Notations and framework

Let Yg11, . . . ,Yg1n be independent counts from condition 1,

Yg1` ∼ NB(s`λg1, φg)

Let Yg21, . . . ,Yg2n be independent counts from condition 2,

Yg2r ∼ NB(srλg2, φg)

s` the library size of sample `λgj the proportion of the library for this particular gene g incondition j .We want to test

H0 = {λg1 = λg2} vs H1 = {λg1 6= λg2}

Need to estimate λg1, λg2 and φg

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 38 / 94

Page 39: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Dispersion estimation

Few replicates to accurately estimate the dispersion parameterDESeq: φg is a smooth function of λg = λg1 = λg2

edgeR: empirical Bayesian procedure to estimate φg

... and many, many more methods!

Soneson & Delorenzi (2013, BMC Bioinf) compared 11 methods using simulations:No single method is optimal under all circumstances ...

Nookaew et al. (2012, NAR) compared microarray & RNA-seq differential analyses:Importance of mapping to estimate gene expression level

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 39 / 94

Page 40: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Exact Negative Binomial TestRobinson & Smyth (2008) Biostatistics

If librairie sizes are equal (Anderson & Boullion, 1972)s` = s for ` = 1, . . . ,nYg11 . . .+ Yg1n ∼ NB(nsλg1, φg/n)

Yg21 . . .+ Yg2n ∼ NB(nsλg2, φg/n)

There exists a sufficient statistic for λg1, λg2

φg can be estimated independently from λg1, λg2

The normalisation is performed to get librairies with equal size

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 40 / 94

Page 41: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

GLM framework for RNAseq data

Let Ygjv be the counts of reads for gene g in the sample describedby the uplet (j , v)

Generalized Linear Model allows to decompose a function of themean of the observations

We assumeYgjv ∼ NB(µgjv , φg)

withE(log(Ygjv )) = log(sjv ) + log(λgjv )

wheresjv is the library size for sample described by (j , v)

log(λgjv ) = f (g, j , v) for example

log2(λgjv ) = Intercept + αgj + βgv + γgjv

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 41 / 94

Page 42: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Inference and test

Parameters are those describing the mean and the dispersion φg .

They are estimated by the quasi-likelihood estimators: φg areestimated and parameters describing the mean are thenestimated by maximizing a function of the data and the dispersion

Test based on the likelihood ratio test or the Wald test

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 42 / 94

Page 43: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Linear model framework for RNAseq data

Let Ygjv be the counts of reads for gene g in the sample describedby the uplet (j , v)

Data are transformed so that a linear model can be consideredLinear Model allows to decompose a function of the mean of theobservations

We assumeYgjv ∼ N (µgjv , σ

2g)

withE(Ygjv ) = Intercept + sjv + αgj + βgv + γgjv

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 43 / 94

Page 44: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Example

Consider a project where a wild-type plant and three mutants arestudiedAvailable data are three biological replicates of the geneexpression for each type of plantThe aim is to find genes affected by each mutationHow can you answer this question ?

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 44 / 94

Page 45: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Conclusions on the differential analysis

The discrete nature as well as the extreme precision of RNA-seqmeasures are a challenge for statistical analyses

Differential analysis is based on several assumptions: readdistribution and dispersion modeling

Some methods proposed to filter low counts

Exact test is not tailored for experiments with more than one factor

GLM is a more flexible framework

Another alternative is to transform the data and to use a linearmodel (limma-voom)

In the latter two cases, it is important to determine which factorsare important to include in the expectation decomposition ?

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 45 / 94

Page 46: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

In practice

To perform a differential analysis, we have to make a decision aboutthe filtering (yes or no)

the modeling (NB, GLM or LM and which factors are important toconsider ?)

the dispersion estimation (several methods)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 46 / 94

Page 47: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

How to evaluate methods for the differentialanalysis of gene expression?

Real data:More realistic... but no extensively validated data yet available

Simulated data:Truth is well-controlled... but what model should be used to simulate data? How realisticare the simulated data? How much do results depend on the modelused?

Another solution: synthetic data

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 47 / 94

Page 48: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Synthetic data simulations

H0 genes

Validated

Unknown status

H0 full dataset

H1 rich dataset

Leaves vs Leaves Buds vs Leaves

qRT-PCR

Creation of 10 synthetic datasets for each proportion of full H0 datasetconsidered

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 48 / 94

Page 49: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Synthetic data simulations

H0 genes

Validated

Unknown status

H0 full dataset

H1 rich dataset

random sub-selection

random sub-selection

Synthetic dataset

Leaves vs Leaves Buds vs Leaves

qRT-PCR

Creation of 10 synthetic datasets for each proportion of full H0 datasetconsidered

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 48 / 94

Page 50: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Fifteen evaluated methods

Information on the experimental designTwo conditions: buds and leavesTwo biological replicates are available for each tissuIt means that gene expression is measurement for each tissu fromplants grown in the same growth chamber but at two differentdates

Question about the differential analysisthe filtering (yes or no)Count modeling

NB modelGLM with or without batch effectData transformation and linear model with or without batch effect

dispersion estimation: edgeR, DESeq, DESeq2limma for linear model

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 49 / 94

Page 51: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Definition of a ROC curve

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 50 / 94

Page 52: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

The sets of truly DE genes and truly NDE genes

the set of truly DE genes251 DE genes identified by qRT-PCR among 332 randomly chosengenes

the set of truly NDE genesThe proper identification is not straightforwardNDE.union: genes declared at least once not differentiallyexpressed by DESeq2, glm edgeR and limma-voom taking intoaccount a batch effect.NDE.union may include some genes that are not truly NDENDE.inter: genes always declared not differentially expressed bythe three methods.NDE.inter may exclude some truly NDE genes.Consequently both should be considered for AUC and FDRevaluations.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 51 / 94

Page 53: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Discrimination of DE and NDE genes

Data filtering has a slight effectFor a proportion of full H0dataset above 0.6 (implying asmaller proportion of DE genes),linear modeling after datatransformation or glm modelingimproves the AUCThis increase is even greaterwhen a batch effect isconsideredThe variance-mean relationshipmodeling seems to have alimited impact

Similar results with NDE.union and NDE.inter

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 52 / 94

Page 54: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the TPR

Tests were performed at FDR level 5%

Two groups : methods takinginto account a batch effect andall the other methods.

For the first group, methodsshow a high TPRFor the second group, the TPRis a function of the full H0dataset proportion.The variance-mean relationshipmodeling and the data filteringseem to have only a limitedimpact.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 53 / 94

Page 55: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the TPR

Tests were performed at FDR level 5%

Two groups : methods takinginto account a batch effect andall the other methods.For the first group, methodsshow a high TPRFor the second group, the TPRis a function of the full H0dataset proportion.

The variance-mean relationshipmodeling and the data filteringseem to have only a limitedimpact.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 53 / 94

Page 56: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the TPR

Tests were performed at FDR level 5%

Two groups : methods takinginto account a batch effect andall the other methods.For the first group, methodsshow a high TPRFor the second group, the TPRis a function of the full H0dataset proportion.The variance-mean relationshipmodeling and the data filteringseem to have only a limitedimpact.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 53 / 94

Page 57: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the FDR

Requires to identify the set of genes non-differentially expressed(NDE)Not easyTwo sets are defined :

(i) intersection of the genes declared NDE by edgeR,DESeq2 and limma-voom that take a batch effect intoaccount (lower bound)

(ii) union of the genes declared NDE by edgeR, DESeq2and limma-voom that take a batch effect into account(upper bound)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 54 / 94

Page 58: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the lower bound

Tests were performed at FDR level 5%

For the methods not taking abach effect into account, thelower bound is close to 0For the methods taking abach effect into account, thelower bound is higher andincreases with the proportionof full H0 datasetThe filtering procedureseems to stabilize the lowerbound across the proportionof full H0 dataset.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 55 / 94

Page 59: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the upper bound

A strong effect of the datafilteringThe upper bound of allfiltered methods was lowerthan 0.05 except for glmedgeR with a batch effect at90% of full H0 datasetFor the methods not taking abatch effect into account, theupper bound was very closeto 0, suggesting that theywere very conservative

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 56 / 94

Page 60: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the p-values

RecallWhen no difference is expected, histogram of the p-values areexpected to be uniform histogramFor each synthetic dataset, 100 Kolmogorov-Smirnov tests on1000 genes randomly chosen in the full H0 dataset are performed

73% of tests are rejected after aBonferroni adjustementKS statistic values are smallerfor models taking a batch effectinto account than for modelwithout batch effectData filtering has a slight effect

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 57 / 94

Page 61: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Evaluation of the p-values

RecallWhen no difference is expected, histogram of the p-values areexpected to be uniform histogramFor each synthetic dataset, 100 Kolmogorov-Smirnov tests on1000 genes randomly chosen in the full H0 dataset are performed

73% of tests are rejected after aBonferroni adjustementKS statistic values are smallerfor models taking a batch effectinto account than for modelwithout batch effectData filtering has a slight effect

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 57 / 94

Page 62: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Conclusions

Synthetic data are a relevant frameworkOur analysis suggests that a well-modeling of the expectation ofthe counts is crucialTo date, biological replicates of plants are not considered as afactorImpossible to observe this fact on simulated data

the raw p-value distribution = an indicator of qualityIt should be used to evaluate the fit between the counts and themodelan histogram with a peak at the right side often indicates that themodeling is incorrect

modeling ≥ filtering ≥ dispersion

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 58 / 94

Page 63: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Table of contents

1 Introduction

2 Normalization

3 Differential analysis

4 Clustering of transcriptomic dataModel-based clusteringMixture models for transcriptomic data

5 Experimental design

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 59 / 94

Page 64: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Gene co-expression for gene function prediction

Transcriptome data: main source of ’omic information available forliving organisms

Microarrays (∼1995 - )High-throughput sequencing: RNA-seq (∼2008 - )

Comparison of two conditions (hypothesis tests)→ Differentialexpression analysis

Co-expression (clustering) analysisStudy gene expression behavior across several conditionsCo-expressed genes may be involved in similar biologicalprocess(es) (Eisen et al., 1998))⇒ Co-expression is a tool to study genes without known orpredicted function (orphan genes)⇒ It is also the first step to build a regulatory network

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 60 / 94

Page 65: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Model-based clustering

Probabilistic clustering models : data assumed to come fromdistinct subpopulations, each modeled separately

Rigourous framework for parameter estimation and modelselection

Output: each gene assigned a probability of cluster membership

What are the key ingredients to define a mixture model ?

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 61 / 94

Page 66: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Key ingredients of a mixture model

what we observe the model the expected results

Z = ? Z : 1 = •, 2 = •, 3 = •

Let y = (y1, . . . ,yn) denote n observations described by Q variablesLet Z = (Z1, . . . ,Zn) be the latent vector.

1) Distribution of Z: {Zi} are assumed to be independent and

P(Zi = k) = πk withK∑

k=1

πk = 1 → Z ∼M(n;π1, . . . , πK )

K is the number of components of the mixture

2) Distribution of (Yi |Zi = k) is a parametric distribution f (•;γk )

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 62 / 94

Page 67: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Key ingredients of a mixture model

what we observe the model the expected results

Z = ? Z : 1 = •, 2 = •, 3 = •

Let y = (y1, . . . ,yn) denote n observations described by Q variablesLet Z = (Z1, . . . ,Zn) be the latent vector.

1) Distribution of Z: {Zi} are assumed to be independent and

P(Zi = k) = πk withK∑

k=1

πk = 1 → Z ∼M(n;π1, . . . , πK )

K is the number of components of the mixture

2) Distribution of (Yi |Zi = k) is a parametric distribution f (•;γk )

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 62 / 94

Page 68: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Questions around the mixtures

Modeling: what distribution for each component ? it depends on observed data.

Inference: how to estimate the parameters ? it is usually done with an EM-like algorithm (Dempster et al., 77)

Model selection: how to choose the number of components ?

A collection of mixtures with a varying number of components isusually consideredA criterion is used to select the best model of the collection

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 63 / 94

Page 69: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Outputs of the model and data classification

Distribution: Conditional probabilities:

g(yi ) = π1f (yi ;γ1) + π2f (yi ;γ2) + π3f (yi ;γ3) τik = P(Zi = k |yi ) =πk f (yi ;γk )

g(yi )

τik i = 1 i = 2 i = 3k = 1 0.658 0.007 0.0k = 2 0.342 0.478 0.0k = 3 0.0 0.515 1.0

→ These probabilities enables the classification of the observationsinto the subpopulations

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 64 / 94

Page 70: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Outputs of the model and data classification

Distribution: Conditional probabilities:

g(yi ) = π1f (yi ;γ1) + π2f (yi ;γ2) + π3f (yi ;γ3) τik = P(Zi = k |yi ) =πk f (yi ;γk )

g(yi )

τik i = 1 i = 2 i = 3k = 1 0.658 0.007 0.0k = 2 0.342 0.478 0.0k = 3 0.0 0.515 1.0

Maximum A Posteriori rule: Classification in the component for whichthe conditional probability is the highest.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 64 / 94

Page 71: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Finite mixture models

Assume data y come from K distinct subpopulations, each modeledseparately:

g(y) =n∏

i=1

K∑k=1

πk f (yi ; γk )

(π1, . . . , πK )′ are the mixing proportions, where∑K

k=1 πk = 1f (·; γk ) is the density of the k th component

For microarray data, we often assume yi |Zi = k ∼ N (µk ,Σk )

For RNA-seq data, we need to choose the family andparameterization of f (·; γk )

QuestionLet yij` be the observed count for gene i in condition j and replicate `.Propose a distribution for f (·; γk )

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 65 / 94

Page 72: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Poisson mixture model

Let yij` be the observed count for gene i in condition j and replicate `Assume

yi |Zi = k ∼J∏

j=1

Lj∏`=1

P(yij`;µij`k )

for i = 1, . . . ,n independently, where variables are independentconditionally on the components.

Question: How to parameterize the mean µij`k to obtain meaningfulclusters of co-expressed genes?

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 66 / 94

Page 73: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Model parameterization: Which genes should beclustered?

2 4 6 8 10 12

050

000

1000

0015

0000

Sample

Cou

nt

2 4 6 8 10 12

02

46

810

12

Sample

Log(

coun

t + 1

)

2 4 6 8 10 12

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Sample

Cou

nt /

Tota

l cou

nt

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 67 / 94

Page 74: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Parameterization of the PMM

Consider the following parameterization:

µij`k = wiλjksj`

wi : overall expression level of observation i (yi··)λk = (λjk ) : clustering parameters that define the profiles of genesin cluster k (variation around wi )sj` : normalized library size for replicate ` of condition j

Constraint:∑

j∑

` sj` = 1 for all k = 1, . . . ,K

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 68 / 94

Page 75: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Interpretation

⇒ Genes are assigned to the same cluster if they share the sameprofile of variation around their mean count across all conditions

Buds

Leav

es

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 69 / 94

Page 76: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Interpretation

⇒ Genes are assigned to the same cluster if they share the sameprofile of variation around their mean count across all conditions

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 69 / 94

Page 77: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Inference: µij`k = wiλjksj`

Expression level

wi = yi··

Library size effectThe MLE estimator of sj` is the “total count” scaling factor:

sj` = y·jl/y···

Other estimators possible: Trimmed Mean of M-values, quantile,DESeq, ...After estimating sjl from the data, we consider this parameter to befixed.

Parameter estimationπk ’s and λ’s are estimated with an EM algorithm

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 70 / 94

Page 78: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Model selection

BIC aims at finding a good number of components to a global fit ofthe data distribution

BIC(m) = log P(Y|m, θ)− νm

2log(n).

whereνm is the number of free parameters of the model mP(Y|m, θ) is the maximum likelihood under this model.

ICL is dedicated to a classification purpose. The penality has anentropy term that penalizes stronger models for which theclassification is uncertain.

ICL(m) = BIC(m)−

{−

n∑i=1

K∑k=1

τik log τik

}

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 71 / 94

Page 79: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Behavior of BIC and ICL in practice for RNA-seqdata

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

logL

ike

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

BIC

X

0 50 100 150 200

−2.

0e+

08−

1.0e

+08

Number of clusters

ICL

X

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 72 / 94

Page 80: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Slope heuristics (Birge and Massart, 2006)

Non-asymptotic framework: construct a penalized criterion suchthat the selected model has a risk close to the oracle modelTheoretically validated in Gaussian framework, but encouragingapplications in other contexts (Baudry et al., 2012)

SH(m) = log P(Y|m, θ) + κpenshape(m)

In large dimensions:Stabilization of biasLinear behavior of D

n 7→ −γn(sD)

⇒ Estimation of slope to calibrate κ in a data-driven manner(Data-Driven Slope Estimation = DDSE), capushe R package

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 73 / 94

Page 81: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

HTSCluster: Embryonic fly development

Using slope heuristics, selected model is K = 48(selected model via BIC and ICL is K = 130)

●●●●●

●●●●

●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●● ● ● ● ● ● ● ● ●

−2.

0e+

08−

1.5e

+08

−1.

0e+

08−

5.0e

+07

Contrast representation

penshape(m) (labels : Model names)

−γ n

(sm)

* * *

*The regression line is computed with 17 pointsValidation points

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 65 70 75 80 85 90 95 100

110

120

130

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 74 / 94

Page 82: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Real data analysis: Embryonic fly development

modENCODE project to provide functional annotation ofDrosophila (Graveley et al., 2011)Expression dynamics over 27 distinct stages of developmentduring life cycle studied with RNA-seq12 embryonic samples (collected at 2-hr intervals over 24 hrs) for13,164 genes downloaded from ReCount database (Frazee et al.,2011)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 75 / 94

Page 83: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

HTSCluster: Embryonic fly development

Maximum conditional probability

Fre

quen

cy

0.0 0.2 0.4 0.6 0.8 1.0

020

0040

0060

0080

0010

000

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 76 / 94

Page 84: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

HTSCluster: Embryonic fly development

●●●

●●

●●

●●●

●●

●●

●●●●●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●●●●

●●

●●

●●●●●●●●●●

●●

●●●

●●

● ●●

●●

●●

●●●●●

●●

● ●

●●●

●●

●●●

●●

●●●

●●

●●

●●●●●●●●●

● ●

●●

●●●

●●●

●●

●●●

●●

●●●●●

●●●●●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●●●●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●

●●

●●●

●●●●●●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●●●●●

●●

●● ●●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●●

●●●●●●

●●

●●

●●

●●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●●●

●●●●

●●●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●

●●

●●●●●●●●●

●● ●●●

●●●

●●●●●●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

1 3 5 7 9 12 15 18 21 24 27 30 33 36 39 42 45 48

0.2

0.4

0.6

0.8

1.0

Cluster

Max

imum

con

ditio

nal p

roba

bilit

y

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 77 / 94

Page 85: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

HTSCluster: Embryonic fly development

18 19 30 44 20 48 46 10 35 16 25 22 14 3 9 6

τ ≤ 0.8τ > 0.8

Cluster

Fre

quen

cy

010

020

030

040

050

060

0

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 78 / 94

Page 86: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

HTSCluster: Embryonic fly development

Embryos 22−24hrEmbryos 20−22hrEmbryos 18−20hrEmbryos 16−18hrEmbryos 14−16hrEmbryos 12−14hrEmbryos 10−12hrEmbryos 8−10hrEmbryos 6−8hrEmbryos 4−6hrEmbryos 2−4hrEmbryos 0−2hr

0.0

0.2

0.4

0.6

0.8

1.0

λ jks

j.

Cluster

1 2 3 4 5 6 7 8 9 10 11 121314 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

Functional enrichment analysis: 33 of 48 clusters associated withat least one Gene Ontology Biological Process term (e.g., cluster6 associated with muscle attachment)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 79 / 94

Page 87: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Competing models

1 PoisL (Cai et al., 2004): K-means type algorithm using Poissonloglinear model

Equivalent to HTSCluster when equal library sizes, unreplicateddata, equiprobable Poisson mixtures, and parameter estimation viathe Classification EM (CEM) algorithm

2 Witten (2011): hierarchical clustering of dissimilarity measurebased on a Poisson loglinear model

Originally intended to cluster samples3 Si et al. (2014): model-based hierarchical algorithm using Poisson

and negative binomial models4 Classic K-means algorithm on expression profiles (yij`/yi··)

Number of clusters not addressed by any of the aboveFunctional enrichments of Si et al. (2014) seem to be weaker thanours

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 80 / 94

Page 88: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Conclusions: Model-based clustering for HTS data

HTSCluster for clustering count-based RNA-seq profiles:

Poisson mixture model to directly model counts (i.e., no datatransformations, etc.) from HTS experimentsInterpretable parameter constraints lead to straightforwardparameter estimation (EM algorithm), model selection (slopeheuristics)Similar or better performance than previously proposedapproaches on simulated dataAnalyses of real datasets are promisingR package on CRAN available HTSClusterRecently published in Bioinformatics

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 81 / 94

Page 89: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Table of contents

1 Introduction

2 Normalization

3 Differential analysis

4 Clustering of transcriptomic data

5 Experimental designTheoretical conceptsConclusion and discussion

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 82 / 94

Page 90: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

A statistical model: what for?

Aim of an experiment: answer to a biological question.Results of an experiment: (numerous, numerical) measurements.

Model: mathematical formula that relates the experimentalconditions and the observed measurements (response).

(Statistical) modelling: translating a biological question into amathematical model (6= PIPELINE!)

Statistical model: mathematical formula involvingthe experimental conditions,the biological response,the parameters that describe the influence of theconditions on the (mean, theoretical) response,and a description of the (technical, biological)variability.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 83 / 94

Page 91: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Experimental Design

DefinitionA good design is dedicated to the asked question and facilitates dataanalyses and interpretation of the results. It maximizes collectedinformation and proposes experiments with respect to the financial andmaterial constraints.

Ronald A. Fisher (1890-1962)

To call in the statistician afterthe experiment is done maybe no more than asking himto perform a post-mortem ex-amination: he may be able tosay what the experiment diedof

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 84 / 94

Page 92: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Basic principles - Fisher (1935)

(technical and biological) replicationsReplication (independent obs.) 6= Repeated measurementsRandomization : randomize as much as is practical, to protectagainst unanticipated biasesBlocking : dividing the observations into homogeneous groups.Isolating variation attributable to a nuisance variable (e.g. lane)

Correspondence Nature Biotechnology (July 2011)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 85 / 94

Page 93: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Basic principles - Fisher (1935)

(technical and biological) replicationsReplication (independent obs.) 6= Repeated measurementsRandomization : randomize as much as is practical, to protectagainst unanticipated biasesBlocking : dividing the observations into homogeneous groups.Isolating variation attributable to a nuisance variable (e.g. lane)

Correspondence Nature Biotechnology (July 2011)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 85 / 94

Page 94: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Steps of experiment designing

1 Formulate a broadly stated research problem in terms of explicit,addressable questions.

2 Considering the population under study, identifying appropriatesampling or experimental units, defining relevant variables, anddetermining how those variables will be measured.

3 Describe the data analysis strategy4 Anticipate eventual complications during the collection step and

propose a way to handle them

source : Northern Prairie Wildlife Research Center, Statistics for Wildlifers:How much and what kind?

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 86 / 94

Page 95: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

How to Design a good RNA-Seq experiment in aninterdisciplinary context?Some basic rules

Rule 1 Share a minimal common languageRule 2 Well define the biological questionRule 3 Anticipate difficulties with a well designed experimentMake good choices : Replicates vs Sequencing depth

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 87 / 94

Page 96: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Rule 2: Well define the biological question

Choose scientific problems on feasibility and interestOrder your objectives (primary and secondary)Ask yourself if RNA-seq is better than microarray regarding thebiological question

Recall that RNA-Seq technology is useful toStudy all the transcribed entitiesDetect and estimate isoformsConstruct and study a de novo transcriptome

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 88 / 94

Page 97: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Rule 3: Anticipate difficulties with a well designedexperiment

1 Prepare a checklist with all the needed elements to be collected,2 Collect data and determine all factors of variation,3 Choose bioinformatics and statistical models,4 Draw conclusions on results.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 89 / 94

Page 98: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Be aware of different types of bias

Identify controllable biases / technical specificities

Keep in mind the influence of effects on results:lane ≤ run ≤ RNA library preparation ≤ biological

(Marioni, 2008), (Bullard, 2010)

⇒ Increase biological replications !E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 90 / 94

Page 99: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Biological and technical replicates

Biological replicate : sampling of individuals from a population in orderto make inferences about that population

Technical replicate adresses the measurement error of the assay.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 91 / 94

Page 100: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Why increasing the number of biologicalreplicates?

To generalize to the population levelTo estimate to a higher degree of accuracy variation in individualtranscript (Hart, 2013)To improve detection of DE transcripts and control of false positiverate: TRUE with at least 3 (Sonenson 2013, Robles 2012)

McIntyre et al. (2011) BMC GenomicsTechnical variability => inconsistent detection of exons at low levels ofcoverage (<5reads per nucleotide)Doing technical replication may be important in studies where lowabundant mRNAs are the focus.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 92 / 94

Page 101: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

Why increasing the number of biologicalreplicates?

To generalize to the population levelTo estimate to a higher degree of accuracy variation in individualtranscript (Hart, 2013)To improve detection of DE transcripts and control of false positiverate: TRUE with at least 3 (Sonenson 2013, Robles 2012)

McIntyre et al. (2011) BMC GenomicsTechnical variability => inconsistent detection of exons at low levels ofcoverage (<5reads per nucleotide)Doing technical replication may be important in studies where lowabundant mRNAs are the focus.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 92 / 94

Page 102: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

More biological replicates or increasingsequencing depth?

It depends! (Haas, 2012), (Liu, 2014)DE transcript detection: (+) biological replicatesConstruction and annotation of transcriptome: (+) depth and (+)sampling conditionsTranscriptomic variants search: (+) biological replicates and (+)depth

A solution: multiplexing.

Tag or bar coded with specific sequences added during libraryconstruction and that allow multiple samples to be included in thesame sequencing reaction (lane)

Decision tools available: Scotty (Busby et al. 2013),Library RNAseqPower in Bioconductor (Hart et al., 2013)

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 93 / 94

Page 103: Statistical analysis of RNA-seq data › teaching › sps_summer_school › doc › ...Statistical analysis of RNA-seq data Etienne Delannoy1 and Marie-Laure Martin-Magniette1;2 1-

To summarize

The scientific question of interest drives the experimental choicesCollect informations before planningAll skills are needed to discussions right from project constructionOptimum compromise between replication number andsequencing depth depends on the questionBiological replicates are important in most RNA-seq experimentsWherever possible apply the three Fisher’s principles ofrandomization, replication and local control (blocking)

And do not forget: budget also includes cost of biological dataacquisition, sequencing data backup, bioinformatics and statisticalanalysis.

E. Delannoy & M.-L. Martin-Magniette Analysis of RNA-Seq data INRA 94 / 94