Page 1:

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Li and Dewey, BMC Bioinformatics 2011, 12:323

Kim Dong-in

Bo Li and Colin N Dewey

Page 2:

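RNA-Seq

- produces millions of reads, each sequenced from one end (single-end) or both ends (paired-end) of a cDNA fragment derived from RNA
- key challenge in transcript quantification: reads that map to multiple genes or isoforms
- experimental design: number of reads and read length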

Abstract + Background

Page 3:

Transcript quantification
- map reads to a genome or transcript set
- estimate gene and isoform abundances

Major complication
- reads do not always map uniquely to a single gene or isoform

Abstract + Background

Page 4:

RSEM (RNA-Seq by Expectation Maximization)

- works from transcript sequences, not a reference genome, so it can be combined with a de novo transcriptome assembler
- extends the methodology to model paired-end reads, variable-length reads, fragment length distributions, and quality scores
- reports a 95% credibility interval (CI) and a posterior mean estimate (PME), along with the maximum likelihood (ML) estimate, for the abundance of each gene and isoform

Abstract + Background

Page 5:

RSEM (RNA-Seq by Expectation Maximization)

- in experiments, for the same sequencing throughput, short SE reads give the best quantification accuracy at the gene level, better than PE reads
- quality scores do not significantly improve quantification accuracy: with Illumina-like errors, using only the read sequences is nearly as accurate

Abstract + Background

Page 6:

Simple approach: count the reads mapped uniquely to each gene, corrected for the number of reads and for length

Problems
- biased if mappability is not taken into account
- incorrect estimates for alternatively-spliced genes
- does not extend to isoform abundances

Methods developed to address these problems
- "rescuing" reads that map to multiple genes
- modeling at the isoform level
- EM (expectation-maximization) algorithm
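
To make the EM idea concrete, here is a minimal sketch of how reads that map to several transcripts can be shared out iteratively. It is a toy model, not RSEM's implementation: the function name `em_abundances`, the use of plain transcript length in place of an effective length, and the fixed iteration count are all assumptions of this example, and it ignores sequencing errors, fragment lengths, and start position bias.

```python
from collections import defaultdict


def em_abundances(read_hits, lengths, n_iter=200):
    """Toy EM for read-to-transcript assignment (hypothetical helper).

    read_hits: list of lists; each inner list holds the transcript IDs
               that one read aligns to (at least one read is required).
    lengths:   dict of transcript ID -> length.
    Returns a dict of transcript ID -> estimated fraction of reads.
    """
    transcripts = list(lengths)
    # Start from a uniform guess for the read fraction of each transcript.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read among its candidate transcripts in
        # proportion to theta[t] / lengths[t].
        for hits in read_hits:
            weights = [theta[t] / lengths[t] for t in hits]
            total = sum(weights)
            if total == 0.0:
                continue
            for t, w in zip(hits, weights):
                counts[t] += w / total
        # M-step: the new read fractions are the normalized expected counts.
        n = sum(counts.values())
        theta = {t: counts[t] / n for t in transcripts}
    return theta


# 30 reads unique to A, 10 unique to B, 20 ambiguous between A and B.
reads = [["A"]] * 30 + [["B"]] * 10 + [["A", "B"]] * 20
theta = em_abundances(reads, {"A": 1000, "B": 1000})
```

With equal lengths, the 20 ambiguous reads end up split 3:1 in favour of A, so `theta` settles at about 0.75 for A and 0.25 for B. Stopping after a single pass of the loop corresponds to a "rescue"-style allocation.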

Abstract + Background

Page 7:

Among tools based on similar statistical methods, only RSEM and IsoEM handle reads that map ambiguously between both isoforms and genes.

RSEM (RNA-Seq by Expectation Maximization)
- models RSPDs (read start position distributions)
- computes a posterior mean estimate (PME) and a 95% credibility interval (CI)
- designed to work without a whole genome sequence

IsoEM
- produces a maximum likelihood (ML) estimate
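
As a rough illustration of what a read start position distribution is, the sketch below bins the relative start position of each read along its transcript. The function name `estimate_rspd`, the default of 20 bins, and the choice to count every read with weight 1 are assumptions of this example; a real model would weight ambiguous reads by their assignment probabilities.

```python
def estimate_rspd(read_starts, n_bins=20):
    """Histogram of relative read start positions (hypothetical helper).

    read_starts: iterable of (start, transcript_length) pairs, with a
                 0-based start measured from the 5' end.
    Returns a list of n_bins probabilities, ordered from 5' to 3'.
    """
    counts = [0] * n_bins
    for start, length in read_starts:
        rel = start / length  # relative position in [0, 1)
        bin_index = min(int(rel * n_bins), n_bins - 1)
        counts[bin_index] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Two reads start in the 5' half and two in the 3' half of a 100-base transcript.
rspd = estimate_rspd([(0, 100), (10, 100), (90, 100), (99, 100)], n_bins=2)
```

A uniform result, as in this usage example, means no positional bias; a protocol with strong 5' or 3' bias would put most of the mass in the first or last bins.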

Abstract + Background - Related work

Page 8:

RSEM (RNA-Seq by Expectation Maximization)

1. Generate a set of reference transcript sequences (script: rsem-prepare-reference)
2. Align reads to the reference, then estimate abundances and credibility intervals (script: rsem-calculate-expression)

Implementation

Page 9:

Implementation

Page 10:

RSEM is designed to work with reads aligned to transcript sequences, not to a whole genome.

1. Aligning reads to a eukaryotic genome is complicated: splicing and polyadenylation make alignment challenging at the genome level
2. Transcript-level alignments are easier and faster

Implementation - Reference sequence preparation

Page 11:

rsem-prepare-reference

Sources of reference transcript sequences
- genome database
- de novo transcriptome assembler
- EST database
- UCSC or Ensembl genome browser database
- set of preprocessed transcript sequences

Appends poly(A) tail sequences to the reference transcripts (disabled with --no-polyA)
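
The poly(A) step can be pictured with a small sketch. This is only an illustration of the idea, not the tool's code: the function name `append_poly_a`, the `skip` parameter, and the default tail length of 125 bases are assumptions of this example.

```python
def append_poly_a(transcripts, tail_length=125, skip=()):
    """Append a poly(A) tail to each transcript sequence (hypothetical helper).

    transcripts: dict of transcript name -> nucleotide sequence.
    tail_length: number of A bases to append (assumed default).
    skip:        names of transcripts that should be left untouched.
    """
    tail = "A" * tail_length
    return {
        name: seq if name in skip else seq + tail
        for name, seq in transcripts.items()
    }


reference = append_poly_a({"t1": "ACGT", "t2": "GGCC"}, tail_length=3, skip=("t2",))
```

Extending the reference this way lets reads that overlap the tail of a polyadenylated mRNA still align to their transcript.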

Implementation - Reference sequence preparation

Page 12:

rsem-prepare-reference

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --bowtie-path /sw/bowtie \
    /mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
    /ref/mouse_125

- or /mm9 in place of the comma-separated list of FASTA files

Implementation - Reference sequence preparation

Page 13:

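rsem-calculate-expression

- maps (aligns) reads to the reference and calculates relative abundances
- mapping tools: Bowtie (default), or another aligner that produces SAM format
- mapping conditions:
  - do not report only a single best alignment per read
  - mismatches are limited within the first 25 bases
  - reads with more than 200 alignments are discarded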

Implementation - Read mapping, abundance estimation

Page 14:

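rsem-calculate-expression

Input
- FASTA: no quality scores, so a position-dependent error model is used
- FASTQ: paired-end or single-end reads with quality scores

Abundances are estimated with the EM (expectation-maximization) algorithm.

Options
- --strand-specific: for strand-specific data; otherwise reads are assumed to come from both the sense and antisense directions
- --fragment-length options: describe the fragment length distribution for SE data; for PE data the fragment length distribution is learned from the reads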

Implementation - Read mapping, abundance estimation

Page 15:

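rsem-calculate-expression

- --estimate-rspd: learn the read start position distribution; for protocols whose read position distributions are highly 5' or 3' biased
- --calc-ci: in addition to the maximum likelihood estimates, compute 95% credibility intervals, which capture the uncertainty of the estimates, and posterior mean estimates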
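
To show how a posterior mean estimate and a 95% credibility interval relate to each other, the sketch below summarizes a list of posterior samples of one abundance value. It assumes the samples already exist (for instance from a sampler) and uses a simple central interval; the function name `summarize_posterior` and the percentile rule are assumptions of this example, not a description of how the tool computes its intervals.

```python
import math


def summarize_posterior(samples, level=0.95):
    """Posterior mean and central credibility interval (hypothetical helper).

    samples: non-empty sequence of sampled abundance values.
    level:   probability mass the interval should cover.
    Returns (posterior_mean, lower_bound, upper_bound).
    """
    ordered = sorted(samples)
    n = len(ordered)
    mean = sum(ordered) / n
    tail = (1.0 - level) / 2.0
    # Conservative rank choice: round the lower rank down, the upper rank up.
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return mean, lower, upper


# 101 evenly spaced pseudo-samples: 0, 1, ..., 100.
pme, ci_low, ci_high = summarize_posterior(list(range(101)))
```

A wide interval signals that the estimate is uncertain, for example because most reads for that isoform also map elsewhere.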

Implementation - Read mapping, abundance estimation

Page 16:

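rsem-calculate-expression - output

Two abundance measures, at both the isoform level and the gene level
- estimated counts: can be used by differential expression tools such as edgeR and DESeq
- estimated fraction of transcripts, reported as TPM (transcripts per million)

TPM is preferred over RPKM and FPKM because it is independent of the mean expressed transcript length.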
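
The relation between estimated counts and TPM can be sketched as follows. The function name `tpm_from_counts` and the idea of passing a per-transcript effective length are assumptions of this example; the point is only that each count is divided by length first and the resulting rates are then scaled to sum to one million.

```python
def tpm_from_counts(expected_counts, effective_lengths):
    """Convert estimated counts to TPM (hypothetical helper).

    expected_counts:   dict of transcript ID -> estimated read count.
    effective_lengths: dict of transcript ID -> length used for normalization.
    """
    # Reads per base: proportional to the number of transcript copies.
    rates = {t: expected_counts[t] / effective_lengths[t] for t in expected_counts}
    total = sum(rates.values())
    return {t: 1e6 * rate / total for t, rate in rates.items()}


tpm = tpm_from_counts({"a": 100, "b": 100}, {"a": 1000, "b": 3000})
```

Both transcripts receive 100 reads here, but "a" is three times shorter, so it gets three quarters of the total (about 750,000 TPM against 250,000). Because TPM values always sum to one million, they do not depend on the mean length of the expressed transcripts in the sample.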

Implementation - Read mapping, abundance estimation

Page 17:

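rsem-calculate-expression - output

- --out-bam: writes a BAM file of the read alignments, which can be viewed in a genome browser (requires a GTF-formatted annotation)
- rsem-bam2wig: converts the BAM file to a wiggle (wig) file giving the expected number of reads overlapping each genomic position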
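
The "expected number of reads overlapping each position" can be illustrated with a short sketch in which every alignment carries a weight, taken here to be the probability that the alignment is the true origin of its read. The function name `expected_coverage`, the 0-based half-open coordinates, and the input tuple layout are assumptions of this example rather than the converter's actual interface.

```python
def expected_coverage(alignments, region_length):
    """Weighted per-position read coverage (hypothetical helper).

    alignments:    iterable of (start, end, weight) with 0-based,
                   half-open coordinates and weight in [0, 1].
    region_length: number of positions in the region.
    Returns a list with the expected read depth at each position.
    """
    # Difference array: add the weight where an alignment starts and
    # subtract it where the alignment ends.
    diff = [0.0] * (region_length + 1)
    for start, end, weight in alignments:
        start = max(start, 0)
        end = min(end, region_length)
        if start >= end:
            continue
        diff[start] += weight
        diff[end] -= weight
    coverage = []
    running = 0.0
    for i in range(region_length):
        running += diff[i]
        coverage.append(running)
    return coverage


# One uniquely mapped read and one alignment with probability 0.5.
depth = expected_coverage([(0, 3, 1.0), (2, 5, 0.5)], region_length=6)
```

In this usage example `depth` is `[1.0, 1.0, 1.5, 0.5, 0.5, 0.0]`: an ambiguous read contributes only a fraction of a read to each location it might come from.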

Implementation - Visualization

Page 18:

Implementation - Visualization

Page 19:

rsem-plot-model

- turns the output of rsem-calculate-expression into a PDF report
- plots the learned fragment and read length distributions and the sequencing error parameters

Implementation - Visualization

Page 20:

IsoEM - reads aligned to transcript sequences (Bowtie)

Cufflinks (quantification mode) - reads aligned to the genome sequence (TopHat)

rQuant - reads aligned to the genome sequence (TopHat)

RSEM (v0.6) - reads aligned to transcript sequences (Bowtie)

Results and Discussion - Comparison to related tools

Page 21:

Simulated data: 20 million RNA-Seq fragments (non-strand-specific, mouse transcriptome)

- Paired-end reads
- Single-end reads: obtained by throwing out the second read of each pair

Reference transcript sets
- RefSeq: 20,852 genes and 1.2 isoforms per gene on average
- Ensembl: 22,329 genes and 3.4 isoforms per gene on average

Results and Discussion - Comparison to related tools

Page 22:

Accuracy of the tested methods was measured with:

- median percent error (MPE)
- error fraction (EF), with a 10% threshold
- false positive (FP) statistics
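
One way to compute these three statistics from true and estimated abundances is sketched below. The exact definitions used in the paper may differ in detail; in particular, skipping transcripts with a true abundance of zero when computing percent errors, and the 1 TPM cutoff used for false positives, are assumptions of this example, as are all function names.

```python
import statistics


def percent_errors(true_vals, est_vals):
    """Percent error of each estimate; transcripts with true value 0 are skipped."""
    return [100.0 * abs(e - t) / t for t, e in zip(true_vals, est_vals) if t > 0]


def median_percent_error(true_vals, est_vals):
    """MPE: median of the per-transcript percent errors."""
    return statistics.median(percent_errors(true_vals, est_vals))


def error_fraction(true_vals, est_vals, threshold=10.0):
    """EF: fraction of transcripts whose percent error exceeds the threshold."""
    errors = percent_errors(true_vals, est_vals)
    return sum(e > threshold for e in errors) / len(errors)


def false_positive_rate(true_vals, est_vals, cutoff=1.0):
    """FP: fraction of transcripts below the cutoff in truth but at or above it in the estimate."""
    negatives = [e for t, e in zip(true_vals, est_vals) if t < cutoff]
    if not negatives:
        return 0.0
    return sum(e >= cutoff for e in negatives) / len(negatives)


true_tpm = [100.0, 200.0, 50.0, 0.0]
est_tpm = [105.0, 150.0, 50.0, 2.0]
mpe = median_percent_error(true_tpm, est_tpm)
ef = error_fraction(true_tpm, est_tpm)
fp = false_positive_rate(true_tpm, est_tpm)
```

For this toy input the percent errors are 5, 25, and 0, so the MPE is 5, one transcript in three exceeds the 10% threshold, and the single unexpressed transcript is a false positive.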

Results and Discussion - Comparison to related tools

Page 23:

RSEM and IsoEM outperform Cufflinks and rQuant.

1. Cufflinks and rQuant do not fully handle reads that map to multiple genes
- Cufflinks uses a "rescue"-like strategy, equivalent to one iteration of the EM algorithm
- it is not clear how rQuant handles gene multireads

Results and Discussion - Comparison to related tools

Page 24:

RSEM and IsoEM outperform Cufflinks and rQuant.

2. Possible reasons for the performance gap
- Cufflinks and rQuant work from reads aligned to the genome, whereas RSEM and IsoEM work from reads aligned to a transcript set
- Cufflinks does not properly handle short transcripts: it gives abnormally high abundance estimates for transcripts shorter than the mean fragment length (280 bases)

RSEM handles poly(A) tails, but IsoEM does not.

Results and Discussion - Comparison to related tools

Page 25:

Results and Discussion - Comparison to related tools

Page 26:

Results and Discussion - Comparison to related tools

MPE : median percent error

EF : error fraction

FP : false positive

Page 27:

Results and Discussion - Comparison to related tools

Microarray Quality Control (MAQC) samples
- HBR: human brain reference
- UHR: universal human reference

qRT-PCR: 1,000 genes (5%) out of a total of 19,005; 716 genes after filtering

The qRT-PCR gene set is biased towards single-isoform genes.

Page 28:

Results and Discussion - Comparison to related tools

Page 29:

Single-end
- at the gene level, the number of reads matters more than read length
- optimal read length is around 25 bases in mouse and maize

Paired-end
- helps isoform-level estimates for alternatively spliced genes
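
The trade-off behind this comparison is that, for a fixed sequencing throughput, shorter reads mean more reads. A small sketch of that arithmetic follows; the function name `reads_for_throughput` and the throughput figure are hypothetical and chosen only for illustration.

```python
def reads_for_throughput(total_bases, read_length, paired=False):
    """Number of fragments obtainable from a fixed base budget (hypothetical helper).

    total_bases: total number of bases sequenced.
    read_length: length of each read in bases.
    paired:      True if both ends of each fragment are sequenced.
    """
    bases_per_fragment = read_length * (2 if paired else 1)
    return total_bases // bases_per_fragment


budget = 1_000_000_000  # hypothetical throughput of one billion bases
se_25 = reads_for_throughput(budget, 25)
pe_75 = reads_for_throughput(budget, 75, paired=True)
```

With this budget, 25-base single-end reads yield 40 million fragments, while 75-base paired-end reads yield fewer than 7 million, which is why the comparison must be made at equal throughput rather than equal read count.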

Results and Discussion - Paired vs. single end reads

Page 30:

Results and Discussion - Comparison to related tools

Page 31:

Results and Discussion - Comparison to related tools

mouse RefSeq

- empirical: training data
- profile: base-dependent

These results hold only for the task of quantification ("We stress…").

Page 32:

Results and Discussion - Availability and requirements

Project name: RSEM

Project home page: http://deweylab.biostat.wisc.edu/rsem

Operating systems: Any POSIX-compatible platform (e.g., Linux, Mac OS X, Cygwin)

Programming languages: C++, Perl

Other requirements: Pthreads; Bowtie, R

Page 33:

Conclusions

RSEM (RNA-Seq by Expectation Maximization)

- performs quantification at the gene and isoform levels
- does not require a reference genome
- supports quantification with de novo transcriptome assemblies
- provides visualization outputs
- provides credibility interval (CI) estimates
- user-friendly: two commands, working from reference transcript files
- single-end reads suit gene-level quantification
- paired-end reads help within-gene isoform estimates for mouse and human