
RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Li and Dewey BMC Bioinformatics 2011, 12:323

Kim Dong-in

Bo Li and Colin N Dewey

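RNA-Seq

- Sequencing yields millions of reads, each taken from one end (single-end) or both ends (paired-end) of a cDNA fragment derived from RNA.
- A primary use is transcript quantification.
- Reads may map to multiple genes or isoforms.
- Quantification is based on read counts and lengths.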


Abstract + Background

Transcript quantification
- Map reads to a genome or a transcript set.
- Estimate gene and isoform abundances.
Major complication: many reads do not map uniquely to a single gene or isoform.



RSEM (RNA-Seq by Expectation Maximization)
- Works from transcript sequences rather than a reference genome, so it can be used with a de novo transcriptome assembler.
- Extends the earlier methodology to paired-end reads, variable-length reads, fragment length distributions, and quality scores.
- For the abundance of each gene and isoform, reports a maximum likelihood (ML) estimate, a posterior mean estimate (PME), and a 95% credibility interval (CI).




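Experimental findings
- At the gene level, the best quantification accuracy came from short single-end (SE) reads rather than paired-end (PE) reads, for the same amount of sequencing.
- The contribution of quality scores is not significant: for Illumina data, an error model that uses only the read sequences gives comparable quantification accuracy.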



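Earlier approaches
- Count the reads mapped uniquely to each gene, then normalize by number of reads and length.
- Problems: mappability is not taken into account, so estimates are biased; alternatively spliced genes get incorrect estimates; isoform abundances cannot be estimated.
- Methods developed to address these problems: "rescuing" reads that map to multiple genes, and modeling at the isoform level with the EM (expectation-maximization) algorithm.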
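The EM idea mentioned above can be illustrated with a small Python sketch. This is a deliberately simplified toy, not RSEM's actual model: it ignores transcript length, read start positions, fragment lengths, and sequencing errors, and only shows how ambiguously mapped reads are divided among candidate transcripts in proportion to the current abundance estimates. The function name `em_read_fractions` and the input layout are assumptions made for this illustration.

```python
def em_read_fractions(read_alignments, n_transcripts, n_iter=200, tol=1e-9):
    """Toy EM: estimate the fraction of reads originating from each transcript.

    read_alignments: one list per read, holding the indices of the
    transcripts that read aligns to (hypothetical input layout).
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform starting point
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its candidate transcripts,
        # weighted by the current abundance estimates.
        for hits in read_alignments:
            total = sum(theta[t] for t in hits)
            if total == 0.0:
                continue
            for t in hits:
                expected[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected read counts.
        n_assigned = sum(expected)
        if n_assigned == 0.0:
            return theta
        new_theta = [c / n_assigned for c in expected]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# 6 reads unique to transcript 0, 2 unique to transcript 1, 4 ambiguous.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_read_fractions(reads, n_transcripts=2)
```

In this example the ambiguous reads end up shared 3:1 between the two transcripts, matching the ratio of their unique reads. Stopping after the first pass of the loop would correspond to a single "rescue"-style reallocation rather than a converged estimate.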



Tools using similar statistical methods
- Only RSEM and IsoEM handle reads that map ambiguously between both isoforms and genes.

RSEM (RNA-Seq by Expectation Maximization)
- models read start position distributions (RSPDs)
- computes posterior mean estimates (PME) and 95% credibility intervals (CI)
- designed to work without a whole genome sequence

IsoEM
- maximum likelihood (ML) estimate


Abstract + Background - Related work

RSEM (RNA-Seq by Expectation Maximization) workflow
1. Generate a set of reference transcript sequences (script: rsem-prepare-reference).
2. Align reads to the reference, then estimate abundances and credibility intervals (script: rsem-calculate-expression).


Implementation



RSEM is designed to work with transcript sequences rather than a whole genome:

1. Aligning RNA-Seq reads to a genome is complicated, especially for eukaryotes: splicing and polyadenylation make alignment challenging at the genome level.
2. Transcript-level alignments are easier and faster.


Implementation - Reference sequence preparation

rsem-prepare-reference
Possible sources of reference transcript sequences:
- genome database
- de novo transcriptome assembler
- EST database
- UCSC or Ensembl genome browser databases
- a set of preprocessed transcript sequences
Poly(A) tail sequences are appended to the reference transcripts (disabled with --no-polyA).



rsem-prepare-reference

    rsem-prepare-reference --gtf mm9.gtf \
        --transcript-to-gene-map knownIsoforms.txt \
        --bowtie-path /sw/bowtie \
        /mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
        /ref/mouse_125

or: /mm9 in place of the comma-separated FASTA list



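rsem-calculate-expression
- Maps (aligns) reads to the reference, then calculates relative abundances.
- Mapping tools: Bowtie (default); alignments from other tools can be supplied in SAM format.
- Mapping conditions: no single best alignment is chosen; alignments are constrained by the mismatches in the first 25 bases; reads with more than 200 alignments are filtered out.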


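Implementation - Read mapping, abundance estimation

rsem-calculate-expression: input and options
- FASTA input: position-dependent error model.
- FASTQ input (paired-end or single-end): quality scores are used.
- Abundances are estimated with the EM (expectation-maximization) algorithm.
- --strand-specific: for strand-specific data (reads in the sense or antisense direction).
- --fragment-length: fragment length information for single-end (SE) data; with paired-end (PE) data the fragment length distribution is learned.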



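rsem-calculate-expression: model options
- --estimate-rspd: estimate the read start position distribution (RSPD), useful when reads are highly 5' or 3' biased.
- --calc-ci: in addition to the maximum likelihood estimates, compute posterior mean estimates and 95% credibility intervals, which capture the uncertainty of the estimates.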
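To make the posterior mean and 95% credibility interval concrete, here is a minimal Python sketch that summarizes a set of posterior samples for one transcript's abundance. It assumes the samples are already available; how RSEM draws them is not covered on these slides, and the function names `percentile` and `summarize_posterior` are hypothetical helpers for this illustration only.

```python
def percentile(sorted_vals, q):
    """Percentile by linear interpolation; q is a fraction in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize_posterior(samples, level=0.95):
    """Return (posterior mean, lower bound, upper bound) for one abundance.

    samples: posterior draws of the abundance (assumed to be given).
    The interval is the central `level` mass of the sampled values.
    """
    ordered = sorted(samples)
    mean = sum(ordered) / len(ordered)
    tail = (1.0 - level) / 2.0
    return mean, percentile(ordered, tail), percentile(ordered, 1.0 - tail)


# Made-up draws 0, 1, ..., 100, used only to exercise the function.
pme, ci_low, ci_high = summarize_posterior(list(range(101)))
```

A wide interval around the posterior mean signals that the data do not pin down the abundance well, for example when most of a transcript's reads are shared with other isoforms.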



rsem-calculate-expression: output
- Estimated quantity at the isoform level and gene level: can be used by edgeR and DESeq.
- Estimated fraction of transcripts, reported as TPM (transcripts per million): independent of the mean expressed transcript length.
- TPM is preferred over RPKM and FPKM.
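The relationship between the two output measures can be sketched in Python. This is a minimal illustration of the usual TPM definition (length-normalized counts rescaled to sum to one million), not RSEM's own code; the function name `counts_to_tpm` and the use of plain transcript lengths instead of effective lengths are simplifying assumptions.

```python
def counts_to_tpm(counts, lengths):
    """Convert per-transcript read counts to TPM (transcripts per million).

    counts:  estimated number of reads per transcript
    lengths: transcript lengths in bases (assumed positive)
    """
    if any(length <= 0 for length in lengths):
        raise ValueError("transcript lengths must be positive")
    # Reads per base: proportional to the number of transcript copies.
    rates = [c / length for c, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(counts)
    # Rescale so that the values sum to one million.
    return [1e6 * r / total for r in rates]


# Made-up example: the second transcript has twice the reads but is
# twice as long, so both get the same TPM.
tpm = counts_to_tpm([100, 200, 0], [1000, 2000, 500])
```

Because the values always sum to one million within a sample, TPM does not depend on the mean expressed transcript length, which is the property the slide highlights when comparing it with RPKM and FPKM.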



rsem-calculate-expression: output

--out-bam: writes the alignments to a BAM file that can be viewed in a genome browser.

rsem-bam2wig: converts the BAM file to a wiggle (wig) file giving the expected number of reads overlapping each genomic position; relies on a GTF-formatted annotation.


Implementation - Visualization

rsem-plot-model
- Takes the output of rsem-calculate-expression and produces a PDF report.
- The report shows the learned fragment and read length distributions and the sequencing error parameters.



Tools compared
- IsoEM: reads aligned to transcript sequences (Bowtie)
- Cufflinks (quantification mode): reads aligned to the genome sequence (TopHat)
- rQuant: reads aligned to the genome sequence (TopHat)
- RSEM (v0.6): reads aligned to transcript sequences (Bowtie)


Results and Discussion - Comparison to related tools

20 million RNA-Seq (non-strand-specific, mouse transcriptome)

Paired-end reads; single-end reads obtained by throwing out the second read of each pair

Reference transcript sets
- RefSeq: 20,852 genes, 1.2 isoforms per gene on average
- Ensembl: 22,329 genes, 3.4 isoforms per gene on average



Accuracy of the tested methods was measured with:
- median percent error (MPE)
- error fraction (EF), at a 10% threshold
- false positive (FP) statistics
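The three accuracy measures can be sketched in Python as follows. The slides only name the measures and the 10% threshold, so the exact definitions used here are assumptions: percent error is taken relative to the true abundance, EF is the fraction of transcripts whose percent error exceeds the threshold, and FP is the fraction of transcripts with true abundance below a cutoff that are estimated at or above it. The cutoff of 1 and all function names are hypothetical.

```python
import statistics


def percent_errors(true_vals, est_vals):
    """Percent error of each estimate relative to its true value."""
    errors = []
    for t, e in zip(true_vals, est_vals):
        if t == 0:
            # Assumption: a nonzero estimate for a true zero is an unbounded error.
            errors.append(0.0 if e == 0 else float("inf"))
        else:
            errors.append(100.0 * abs(e - t) / t)
    return errors


def median_percent_error(true_vals, est_vals):
    """MPE: median of the per-transcript percent errors."""
    return statistics.median(percent_errors(true_vals, est_vals))


def error_fraction(true_vals, est_vals, threshold=10.0):
    """EF: fraction of transcripts whose percent error exceeds the threshold."""
    errors = percent_errors(true_vals, est_vals)
    return sum(1 for err in errors if err > threshold) / len(errors)


def false_positive_fraction(true_vals, est_vals, cutoff=1.0):
    """FP: among transcripts truly below the cutoff, fraction estimated at or above it."""
    low = [(t, e) for t, e in zip(true_vals, est_vals) if t < cutoff]
    if not low:
        return 0.0
    return sum(1 for _, e in low if e >= cutoff) / len(low)


# Made-up abundances for four transcripts.
truth = [10, 20, 0.5, 100]
estimate = [11, 20, 2, 80]
mpe = median_percent_error(truth, estimate)
ef = error_fraction(truth, estimate)
fp = false_positive_fraction(truth, estimate)
```

MPE summarizes typical error, EF shows how often the error is large, and FP isolates the case of calling an essentially unexpressed transcript expressed, so the three are best read together.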



RSEM and IsoEM outperform Cufflinks and rQuant.

1. Cufflinks and rQuant do not fully handle reads that map to multiple genes.
- Cufflinks uses a "rescue"-like strategy, equivalent to one iteration of the EM algorithm.
- How rQuant handles gene multireads is not clear.



RSEM and IsoEM outperform Cufflinks and rQuant.

2. Part of the performance gap comes from the reference used: Cufflinks and rQuant work from the genome, RSEM and IsoEM from a transcript set.
- Cufflinks does not properly handle short transcripts: it gives abnormally high abundance estimates for transcripts shorter than the mean fragment length (280 bases).
- RSEM handles poly(A) tails; IsoEM does not.





MPE : median percent error

EF : error fraction

FP : false positive



Microarray Quality Control (MAQC) samples
- HBR: human brain reference
- UHR: universal human reference

qRT-PCR: 1,000 genes (5%) out of a total of 19,005; 716 genes after filtering

biased towards single-isoform genes

Single-end: at the gene level, the number of reads matters more than read length; the optimal read length is around 25 bases in mouse and maize.

Paired-end: helps with isoforms of alternatively spliced genes.


Results and Discussion - Paired vs. single end reads



mouse RefSeq

Empirical: training data
Profile: base-dependent

The authors stress that these results hold only for the task of quantification ("We stress…").


Results and Discussion - Availability and requirements

Project name: RSEM

Project home page: http://deweylab.biostat.wisc.edu/rsem

Operating systems: Any POSIX-compatible platform (e.g., Linux, Mac OS X, Cygwin)

Programming languages: C++, Perl

Other requirements: Pthreads; Bowtie, R


Conclusions


- performs quantification at the gene and isoform levels
- does not require a reference genome
- supports quantification with de novo transcriptome assemblies
- visualization outputs
- credibility interval (CI) estimates
- user-friendly: two commands, working from reference transcript files
- single-end reads for gene-level quantification
- paired-end reads for within-gene isoform quantification in mouse and human
