RSEM: accurate transcript quantification from RNA-Seq data
with or without a reference genome
Li and Dewey BMC Bioinformatics 2011, 12:323
Kim Dong-in
Bo Li and Colin N Dewey
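RNA-Seq
- millions of reads, sequenced from one end (single-end) or both ends (paired-end) of cDNA fragments derived from RNA
- transcript quantification: the key challenge is handling reads that map to multiple genes or isoforms
- experiment design: number of reads, read length, single-end vs. paired-end
Abstract + Background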
Transcript quantification
- map reads to a genome or a transcript set
- estimate gene and isoform abundances
Major complication
- some reads do not map uniquely to a single gene or isoform
Abstract + Background
RSEM (RNA-Seq by Expectation Maximization)
- works with transcript sequences, not a reference genome; can be combined with a de novo transcriptome assembler
- extends the earlier methodology: paired-end reads, variable-length reads, fragment length distributions, quality scores
- reports a 95% credibility interval (CI) and posterior mean estimate (PME) along with the maximum likelihood (ML) estimate
- for the abundance of each gene and isoform
Abstract + Background
RSEM (RNA-Seq by Expectation Maximization)
In experiments
- at the gene level, the best quantification accuracy comes from many short single-end (SE) reads rather than paired-end (PE) reads, for the same amount of sequencing
- quality scores do not significantly improve quantification accuracy over a model that uses only the read sequences (Illumina error profile)
Abstract + Background
Simple approach: count the reads mapped uniquely to each gene (number of reads, read length)
Problems
- mappability not taken into account: biased estimates
- alternatively-spliced genes: incorrect estimates
- does not give isoform abundances
Methods developed to address this
- "rescuing" reads that map to multiple genes
- modeling at the isoform level with EM (expectation-maximization algorithm)
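The isoform-level EM idea can be sketched in a few lines. This is a toy illustration only, not RSEM's actual model (which also covers start position distributions, fragment lengths, and sequencing errors). The function name `em_abundances`, the input layout, and the example numbers are all assumptions made for the sketch.

```python
from collections import defaultdict

def em_abundances(read_alignments, eff_lengths, n_iter=200):
    """Toy EM: estimate the fraction of reads that come from each isoform.

    read_alignments: one list of candidate isoform names per read
                     (hypothetical input format).
    eff_lengths:     isoform name -> effective length (hypothetical values).
    """
    isoforms = list(eff_lengths)
    # Start from a uniform guess over isoforms.
    theta = {t: 1.0 / len(isoforms) for t in isoforms}
    n_reads = len(read_alignments)
    for _ in range(n_iter):
        expected = defaultdict(float)
        # E-step: split each read across its candidates in proportion to
        # theta[t] / length[t] (chance that a read from t starts here).
        for candidates in read_alignments:
            weights = {t: theta[t] / eff_lengths[t] for t in candidates}
            total = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: the new read fractions are the normalized expected counts.
        theta = {t: expected[t] / n_reads for t in isoforms}
    return theta

# Hypothetical example: 6 reads unique to A, 2 unique to B, 4 ambiguous.
reads = [["A"]] * 6 + [["B"]] * 2 + [["A", "B"]] * 4
theta = em_abundances(reads, {"A": 1000.0, "B": 1000.0})
print(theta)
```

With equal lengths the ambiguous reads end up shared 3:1, so the estimates approach 0.75 for A and 0.25 for B. A single "rescue" pass corresponds to stopping after one iteration.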
Abstract + Background
Tools with similar statistical methods: only RSEM and IsoEM handle reads that map ambiguously between both isoforms and genes
RSEM (RNA-Seq by Expectation Maximization)
- models RSPDs (read start position distributions)
- computes the posterior mean estimate (PME) and 95% credibility interval (CI)
- designed to work without a whole genome sequence
IsoEM
- maximum likelihood (ML) estimate
Abstract + Background - Related work
RSEM (RNA-Seq by Expectation Maximization): two steps
1. generate the reference transcript sequences (script: rsem-prepare-reference)
2. align reads to the reference, then estimate abundances and credibility intervals (script: rsem-calculate-expression)
Implementation
Designed to work with transcript sequences, not a whole genome
1. aligning reads to a (eukaryotic) genome is complicated: splicing and polyadenylation are challenging at the genome level
2. transcript-level alignments are easier and faster
Implementation - Reference sequence preparation
rsem-prepare-reference
- sources of reference transcripts: genome database, de novo transcriptome assembler, EST database, UCSC or Ensembl genome browser database
- produces a set of preprocessed transcript sequences
- appends poly(A) tail sequences to the reference (disabled with --no-polyA)
Implementation - Reference sequence preparation
rsem-prepare-reference
  rsem-prepare-reference --gtf mm9.gtf \
                         --transcript-to-gene-map knownIsoforms.txt \
                         --bowtie-path /sw/bowtie \
                         /mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
                         /ref/mouse_125
  or, in place of the comma-separated FASTA list: /mm9
Implementation - Reference sequence preparation
rsem-calculate-expression
- maps (aligns) reads to the reference and calculates relative abundances
- mapping tools: Bowtie (default), or another aligner that produces SAM format
- mapping conditions: all valid alignments are kept rather than a single best alignment; mismatches are limited in the first 25 bases; reads with more than 200 alignments are discarded
Implementation - Read mapping, abundance estimation
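rsem-calculate-expression
- FASTA input: position-dependent error model
- FASTQ input (paired-end or single-end): quality scores are used
- abundances estimated with EM (expectation-maximization algorithm)
Options
- --strand-specific: for strand-specific protocols; otherwise reads may come from the sense or antisense direction
- --fragment-length (SE): fragment length for single-end data; for PE data RSEM learns the fragment length distribution
Implementation - Read mapping, abundance estimation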
rsem-calculate-expression
- --estimate-rspd: estimate the read start position distribution, for data that are highly 5' or 3' biased
- --calc-ci: in addition to the maximum likelihood (ML) estimate, compute 95% credibility intervals, which capture uncertainty, and the posterior mean
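To make the two --calc-ci quantities concrete, here is a minimal sketch of how a posterior mean and a 95% credibility interval can be read off a set of posterior samples. It assumes such samples already exist for one transcript. How RSEM draws them is not covered in these slides, and `posterior_summary` and the sample values are hypothetical.

```python
import statistics

def posterior_summary(samples, level=0.95):
    """Return (posterior mean, lower bound, upper bound) from samples.

    samples: posterior draws of one transcript's abundance
             (hypothetical input; not produced by this sketch).
    """
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    # Equal-tailed interval: cut the same probability mass from each end.
    lower = ordered[round(tail * (n - 1))]
    upper = ordered[round((1.0 - tail) * (n - 1))]
    return statistics.mean(ordered), lower, upper

# Hypothetical draws for a single transcript.
pme, ci_low, ci_high = posterior_summary([4.1, 5.0, 5.2, 4.8, 6.3, 5.5, 4.9])
print(pme, ci_low, ci_high)
```

A wide interval flags a transcript whose abundance is poorly determined, for example because most of its reads also map to other isoforms.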
Implementation - Read mapping, abundance estimation
rsem-calculate-expression - output
- estimated quantity at the isoform level and gene level: can be used by edgeR, DESeq
- estimated fraction of transcripts: TPM (transcripts per million), independent of the mean expressed transcript length
- TPM is preferred over RPKM and FPKM
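The difference between TPM and FPKM is easiest to see in code. The sketch below uses the common textbook definitions and assumes per-transcript expected counts and effective lengths are available. The function names and the two-transcript example are hypothetical, and this is not RSEM's own code.

```python
def tpm_from_counts(expected_counts, eff_lengths):
    """TPM: length-normalized rates rescaled so they sum to one million."""
    rates = {t: expected_counts[t] / eff_lengths[t] for t in expected_counts}
    total = sum(rates.values())
    return {t: 1e6 * r / total for t, r in rates.items()}

def fpkm_from_counts(expected_counts, eff_lengths):
    """FPKM: fragments per kilobase of transcript per million fragments."""
    n_fragments = sum(expected_counts.values())
    return {
        t: 1e9 * expected_counts[t] / (eff_lengths[t] * n_fragments)
        for t in expected_counts
    }

# Hypothetical sample with two transcripts.
counts = {"A": 100.0, "B": 300.0}
lengths = {"A": 1000.0, "B": 3000.0}
tpm = tpm_from_counts(counts, lengths)
fpkm = fpkm_from_counts(counts, lengths)
print(sum(tpm.values()), sum(fpkm.values()))
```

TPM values always sum to one million, whereas the FPKM total shifts with the lengths of the expressed transcripts, which is why TPM compares more cleanly across samples.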
Implementation - Read mapping, abundance estimation
rsem-calculate-expression - output
--out-bam
- BAM file: alignments for viewing in a genome browser
rsem-bam2wig
- BAM to wig: the expected number of reads overlapping each genomic position
- annotation: GTF-formatted
Implementation - Visualization
rsem-plot-model
- turns rsem-calculate-expression output into a PDF report
- shows the learned fragment and read length distributions and the sequencing error parameters
Implementation - Visualization
- IsoEM: reads aligned to transcript sequences (Bowtie)
- Cufflinks (quantification mode): reads aligned to the genome sequence (TopHat)
- rQuant: reads aligned to the genome sequence (TopHat)
- RSEM (v0.6): reads aligned to transcript sequences (Bowtie)
Results and Discussion - Comparison to related tools
Test data: 20 million RNA-Seq fragments (non-strand-specific, mouse transcriptome)
- paired-end reads
- single-end reads: made by throwing out the second read of each pair
Reference transcript sets
- RefSeq: 20,852 genes, 1.2 isoforms per gene on average
- Ensembl: 22,329 genes, 3.4 isoforms per gene on average
Results and Discussion - Comparison to related tools
Accuracy measures for the tested methods
- median percent error (MPE)
- error fraction (EF), at a 10% threshold
- false positive (FP) statistic
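The three measures can be sketched as follows, under stated assumptions: percent error is taken relative to the true value, EF is the fraction of transcripts whose percent error exceeds the threshold, and FP is the fraction of transcripts with true abundance below a cutoff that are estimated at or above it. The 1 TPM cutoff, the handling of zero true values, and the name `accuracy_metrics` are assumptions for illustration, not taken from the slides.

```python
import statistics

def accuracy_metrics(true_tpm, est_tpm, ef_threshold=10.0, fp_cutoff=1.0):
    """Return (MPE, EF, FP) for estimated vs. true abundances.

    true_tpm, est_tpm: transcript name -> abundance (hypothetical inputs).
    fp_cutoff: assumed cutoff separating "not expressed" from "expressed".
    """
    errors = []
    for name, truth in true_tpm.items():
        est = est_tpm.get(name, 0.0)
        if truth > 0:
            errors.append(100.0 * abs(est - truth) / truth)
        else:
            # Assumption: a zero truth counts as no error only if the
            # estimate is also zero, and as an unbounded error otherwise.
            errors.append(0.0 if est == 0 else float("inf"))
    mpe = statistics.median(errors)
    ef = sum(e > ef_threshold for e in errors) / len(errors)
    low = [name for name, v in true_tpm.items() if v < fp_cutoff]
    if low:
        fp = sum(est_tpm.get(name, 0.0) >= fp_cutoff for name in low) / len(low)
    else:
        fp = 0.0
    return mpe, ef, fp

# Hypothetical truth and estimates for four transcripts.
truth = {"a": 100.0, "b": 50.0, "c": 0.5, "d": 0.0}
estimate = {"a": 105.0, "b": 40.0, "c": 2.0, "d": 0.0}
print(accuracy_metrics(truth, estimate))
```

All three measures need known true abundances, so they only apply to simulated data.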
Results and Discussion - Comparison to related tools
RSEM and IsoEM outperform Cufflinks and rQuant.
1. Cufflinks and rQuant do not fully handle reads that map to multiple genes
- Cufflinks: a "rescue"-like strategy, equivalent to one iteration of the EM algorithm
- rQuant: how the method handles gene multireads is not clear
Results and Discussion - Comparison to related tools
RSEM and IsoEM outperform Cufflinks and rQuant.
2. Performance gap from the reference used
- Cufflinks, rQuant: genome set; RSEM, IsoEM: transcript set
- Cufflinks does not properly handle short transcripts: abnormally high abundance estimates for transcripts shorter than the mean fragment length (280 bases)
- RSEM handles poly(A) tails, IsoEM does not
Results and Discussion - Comparison to related tools
MPE : median percent error
EF : error fraction
FP : false positive
Results and Discussion - Comparison to related tools
Microarray Quality Control (MAQC) samples
- HBR: human brain reference
- UHR: universal human reference
qRT-PCR: 1,000 genes (5%) out of a total of 19,005; 716 genes after filtering
- biased towards single-isoform genes
Results and Discussion - Comparison to related tools
Single-end reads
- at the gene level, the number of reads matters more than read length
- optimal read length: around 25 bases in mouse and maize
Paired-end reads
- help with isoform estimates for alternatively spliced genes
Results and Discussion - Paired vs. single end reads
Results and Discussion - Comparison to related tools
mouse RefSeq
- empirical: learned from training data
- profile: base-dependent
- these results hold only for the task of quantification ("We stress…")
Results and Discussion - Availability and requirements
Project name: RSEM
Project home page: http://deweylab.biostat.wisc.edu/rsem
Operating systems: Any POSIX-compatible platform (e.g., Linux, Mac OS X, Cygwin)
Programming languages: C++, Perl
Other requirements: Pthreads; Bowtie, R
Conclusions
RSEM (RNA-Seq by Expectation Maximization)
- performs quantification at the gene and isoform level
- does not require a reference genome
- supports quantification with de novo transcriptome assemblies
- visualization outputs
- credibility interval (CI) estimates
- user-friendly: two commands, reference transcript files
- single-end reads for gene-level quantification
- paired-end reads for within-gene isoform quantification in mouse and human