All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes

All kmers are not created equal: finding the signal from the noise in large-scale metagenomes.

Will Trimble, Metagenomic Annotation Group, Argonne National Laboratory

BEACON seminar April 23, 2014 MSU

Apology: I speak biology with an accent

•  I spent six years in dark rooms with lasers.

•  Now I use computers to analyze high-throughput sequence data.

•  I introduce myself as an applied mathematician.

•  Finding scoring functions to answer questions with ambiguous data.

•  Shoveling data from the data-producing machine into the data-consuming furnace.

Outline

•  Sequences are different (math)

•  How much did my sequencing run give me? kmerspectrumanalyzer (graphs)

•  How much did I sample? nonpareil-k (graphs)

•  Pretty pictures: thumbnailpolish (micrographs)

Sequences are different

•  Sequencing produces sequences. Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD

Instrument readings, spectra, micrographs: not categorical (10^7 channels).

Low-throughput categorical data: categories are sound (10^3 channels).

High-throughput sequence data: categorization is an art (10^11 channels).



•  Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

What is this sequence?

>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line? “be regarded as unproved until it has been checked against more exact results”

Searching

We know what to do with these puzzles. You go to this website, and type it in…



How long do reads need to be to recognize them?



To do what? To place them on a reference genome? This can be turned into a math problem, which I will illustrate with a search-engine analogy.

How long do reads need to be?

Information (Shannon, 1949, BSTJ) is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data. It has profound applicability in pattern matching and modeling.

Logarithmic measurements have units!

$$H = \sum_i p_i \log_2 \left( \frac{1}{p_i} \right)$$
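As a minimal sketch of the entropy sum on this slide, the function below evaluates it in bits for any discrete distribution. The function name and the two example base-composition distributions are illustrative assumptions, not part of the talk.

```python
import math

def shannon_entropy_bits(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i), in bits.

    Terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# Hypothetical per-base distributions over A, C, G, T.
uniform_bases = [0.25, 0.25, 0.25, 0.25]
skewed_bases = [0.4, 0.4, 0.1, 0.1]

print(shannon_entropy_bits(uniform_bases))  # 2 bits per base
print(shannon_entropy_bits(skewed_bases))   # fewer than 2 bits per base
```

The uniform case reaches the 2-bit-per-base maximum; any less flat distribution gives a smaller value.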

A word on the sign of the entropy

•  A popular straw man among mathematicians and CS people is the “random sequence model”: a uniform categorical distribution over all 4^L sequences.

•  When we learn something (for example, we collect some genomes and expect our new sequences to look like them) we implicitly construct a less flat distribution. Models always have less entropy than the model of ignorance.

How long do phrases need to be?

Exercise: Pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.

•  Information content of English words: H_word is ca. 12 bits per word.

•  Size of google books? Big libraries have a few 10^7 books, and each one has 10^5 indexed words, so the database size is about 10^12 words. log2(database size) = log2(10^12) = log2(2^39.9) ≈ 40 bits.

•  So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in google’s index.
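The back-of-the-envelope estimate above can be checked in a few lines. This is only a sketch of the arithmetic, using the slide's round numbers (12 bits per word, 10^7 books, 10^5 indexed words per book); the variable names are illustrative.

```python
import math

bits_per_word = 12            # H_word, taken from the slide
database_words = 1e7 * 1e5    # books * indexed words per book

# Bits needed to single out one position in the database.
database_bits = math.log2(database_words)

# Whole words needed to supply at least that many bits.
words_needed = math.ceil(database_bits / bits_per_word)

print(round(database_bits, 1), words_needed)
```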

Try it.



Most often it takes 4 words.

Not all phrases are equally distinctive.

•  Maximum information content of base pairs: H_read = 2L bits per length-L sequence.

•  Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = log2(2^33.2), which rounds up to 34 bits.

•  So we expect that when 2L > 34 bits, we should be able to place any sequence.

•  That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
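The same calculation for reads, as a small sketch: it assumes the random-sequence upper bound of 2 bits per base, so the result is the shortest length that could work in principle, not a guarantee for real, repetitive genomes. The function name and the 5 Mbp example are illustrative.

```python
import math

def min_read_length(genome_size_bp, bits_per_base=2.0):
    """Smallest L with bits_per_base * L >= log2(genome size)."""
    return math.ceil(math.log2(genome_size_bp) / bits_per_base)

print(min_read_length(1e10))  # the slide's 10^10 bp example
print(min_read_length(5e6))   # a hypothetical 5 Mbp microbial genome
```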

The data deluge

•  There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.

•  The data has outgrown some favorite algorithms from the 1990s (BLAST)

Picture, if you will, a HiSeq flowcell

•  Pairs of microbial genomes

•  Microbial transcriptomes + replicates

•  Environmental isolate genomes

•  Environmental extract sequencing

•  Preparation-intensive sequencing

•  Eukaryotic sequencing

•  Eukaryotic sequencing for variants

What’s in there?

Let’s count kmers!
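A minimal sketch of what counting kmers means, assuming reads are plain strings and ignoring reverse complements (production counters usually merge each kmer with its reverse complement and use far more memory-efficient structures). The function names and the toy reads are illustrative.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every length-k substring across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmer_spectrum(kmer_counts):
    """Map abundance -> number of distinct kmers seen that many times."""
    return Counter(kmer_counts.values())

reads = ["ACGTACGT", "CGTACGTA"]   # toy data
counts = count_kmers(reads, 4)
spectrum = kmer_spectrum(counts)
print(sorted(spectrum.items()))
```

The spectrum is the histogram plotted on the following slides: abundance on one axis, number of kmers with that abundance on the other.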

The kmer spectrum.

[Figure: kmer spectrum of a microbial genome. x-axis: 21mer abundance, from rare to abundant; y-axis: number of kmers.]

•  Low-abundance kmers are errors.

•  The peak contains most of the genome.

•  The high-abundance peak contains multicopy genes.

•  Really high-abundance stuff is often artifacts.

Ranked kmer spectrum

[Figure: ranked kmer spectrum. x-axis: kmer rank (cumulative sum of number of kmers); y-axis: 21mer abundance; regions labeled rare and abundant.]

Ranked kmers consumed

[Figure: ranked kmers consumed. x-axis: 21mer abundance; y-axis: fraction of observed kmers; regions labeled rare and abundant.]

The data fraction is unusually stable.
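One way to read the "fraction of observed kmers" axis, sketched under the assumption that the spectrum is a mapping from abundance r to the number n of distinct kmers with that abundance, and that each such kmer accounts for r observations. The function name and the toy spectrum are hypothetical.

```python
def fraction_of_data_at_or_above(spectrum):
    """For each abundance r, return the fraction of all kmer observations
    that belong to kmers seen at least r times."""
    total = sum(r * n for r, n in spectrum.items())
    result = {}
    running = 0
    # Walk from the most abundant kmers down to the rarest.
    for r in sorted(spectrum, reverse=True):
        running += r * spectrum[r]
        result[r] = running / total
    return result

toy_spectrum = {1: 50, 2: 10, 10: 3}   # abundance -> number of distinct kmers
fractions = fraction_of_data_at_or_above(toy_spectrum)
print(fractions)
```

In this toy case the three most abundant kmers are 3 of 63 distinct kmers but carry 30% of the observations, which is why data fractions and kmer counts tell different stories.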

Different kinds of data have different spectra

Redundancy is good

•  OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.

•  OMG! I see this sequence 10 million times.

•  OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.

•  Error correction and diginorm somewhat amusingly strive for opposite ends.

Abundance-based inferences are better in the high-abundance part of the data.

kmerspectrumanalyzer: infer genome size and depth

$$P_{NO}(x; c, \{a_n\}, s) = \sum_n a_n \, \mathrm{NB}_{\mathrm{pdf}}(x; \mu = c n, \alpha = s / n)$$

Generalization of the mixed-Poisson model to estimate how much sequence is in each peak.

[Figure 2: estimated genome size (kb) vs. complete genome size (kb); both axes run from 0 to 10000.]

Counting kmers tells you genome size

…for single genomes, most of the time.

So much for calibration data.

The kink does measure error

[Figure: kmer spectra of artificial E. coli data with varying substitution error rates: 10%, 5.5%, 4%, 3%, 1.7%, 1%, 0.5%, 0.3%, 0.1%.]

But I want to sequence everything! OK, we can count kmers in everything too.

kmerspectrumanalyzer summarizes the distribution and estimates genome size and coverage depth.

How much novelty is in my dataset?

How many sequences do you need to see before you start seeing the same ones over and over again? Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog. We can calculate this efficiently using the kmer spectrum.

$$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i r_i}{\sum_j n_j r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon r_i, 0)}$$
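A sketch of the nonunique-fraction sum from this slide, under several labeled assumptions: r_i is a kmer abundance, n_i is the number of distinct kmers with that abundance, ε is the fraction of the dataset sampled, Poisscdf(λ, k) means P(X ≤ k) for a Poisson variable with mean λ, and the two Poisson factors form a ratio (seen at least twice, given seen at least once). None of these readings is confirmed by the slide text, and the function names are hypothetical.

```python
import math

def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** j / math.factorial(j)
               for j in range(k + 1))

def nonunique_fraction(eps, spectrum):
    """Observation-weighted fraction of kmers expected to be seen more
    than once when a fraction eps (> 0) of the data is sampled.

    spectrum maps abundance r -> number n of distinct kmers.
    """
    total = sum(n * r for r, n in spectrum.items())
    fraction = 0.0
    for r, n in spectrum.items():
        lam = eps * r
        seen_twice_or_more = 1.0 - poisson_cdf(lam, 1)
        seen_once_or_more = 1.0 - poisson_cdf(lam, 0)
        fraction += (n * r / total) * seen_twice_or_more / seen_once_or_more
    return fraction

toy_spectrum = {1: 100}   # 100 distinct kmers, each seen once
for eps in (0.1, 1.0, 10.0):
    print(eps, nonunique_fraction(eps, toy_spectrum))
```

Sweeping eps traces out a rarefaction-style curve from the spectrum alone, without aligning reads against each other.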

Nonpareil: model of sequence coverage

Nonpareil-k: kmer rarefaction

summary of sequence diversity

Nonpareil: uses subset-against-all alignment to find out how much of the dataset is unique.

Nonpareil-k: crunches the kmer spectrum to approximate the unique fraction, 300x faster.

Nonpareil-k: stratify datasets by coverage distribution

Most of the dataset is likely contained in the assembly.

The assembly is likely to miss or attenuate the large unique fraction of the dataset.

kmer spectra reveal sequencing problems

•  Amok PCR: seemingly random sequences

•  Amok MDA: 10 Gbases of sequence, one gene

•  PCR duplicates: entire sequencing run was 50x exact and near-exact duplicate reads

•  Unusually high error rate: indicated by a low fraction of “solid” kmers (for isolate genomes)

•  Contaminated samples: 95% E. coli, 5% E. faecalis

[Figure 1c: PCO2 vs alpha diversity. x-axis: eigen_vectors[, "PCO2"]; y-axis: color_matrix[, "alpha-diversity"]. Linear fits: All: y = -259839.54*x + 209.62, R^2 = 0.29; Gut: y = -275950.37*x + 118.73, R^2 = 0.78; Oral: y = -369610.24*x + 298.39, R^2 = 0.7.]

[Figure 1d: HMP / quantile norm / euclidean / colored by alpha]

MG-RAST API, R package matR

Hey kid, you want some unlabeled data?

[Figure 2a]

[Figure 2b]

Hey kid, you want some pretty ordinations?

Generali<es from the kmer coun<ng mines

•  Many datasets have as much as 5-45% of the sequence yield in adapters.

•  FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find).

•  Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance.

•  Shannon entropy is oversensitive to errors. Higher-order Rényi entropy is more stable.

kmer statistical summaries

•  H0 kmer richness (VERY BAD)

•  H1 Shannon entropy (BAD)

•  H2 Rényi entropy / Simpson index (GOOD)

•  observation-weighted coverage (BAD)

•  observation-weighted size (BAD)

•  observation-median coverage (GOOD)

•  observation-median size (GOOD)

•  fraction in top 100 kmers (USEFUL)

•  fraction unique (OK but requires size correction)

Most of these give answers which vary so strongly with sampling depth as to be unusable. Observation-weighted fraction-of-data metrics behave fairly well. Fractions of the data with particular properties are stable with respect to sampling.
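For reference, H0, H1 and H2 are the Rényi entropies of order 0, 1 and 2. The sketch below computes them in bits from a kmer spectrum, assuming the spectrum maps abundance r to the number n of distinct kmers and that each kmer's probability is its share of all observations. The function name and toy spectra are illustrative, and this is not claimed to be how kmerspectrumanalyzer computes its summaries.

```python
import math

def renyi_entropy_bits(spectrum, order):
    """Renyi entropy of the kmer distribution, in bits.

    order 0: log2 of kmer richness
    order 1: Shannon entropy (the limiting case)
    order 2: -log2 of the Simpson index, sum_i p_i^2
    """
    total = sum(r * n for r, n in spectrum.items())
    if order == 0:
        return math.log2(sum(spectrum.values()))
    if order == 1:
        return -sum(n * (r / total) * math.log2(r / total)
                    for r, n in spectrum.items())
    power_sum = sum(n * (r / total) ** order for r, n in spectrum.items())
    return math.log2(power_sum) / (1 - order)

flat = {1: 4}            # four kmers, each seen once
skewed = {1: 3, 97: 1}   # one dominant kmer plus three singletons
for name, spec in (("flat", flat), ("skewed", skewed)):
    print(name, [round(renyi_entropy_bits(spec, a), 3) for a in (0, 1, 2)])
```

In the skewed case the three singletons leave H0 at 2 bits while H2 stays close to zero. Low-abundance error kmers act the same way, inflating richness far more than they move the higher-order entropy.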

thumbnailpolish

http://www.mcs.anl.gov/~trimble/flowcell/

Sometimes the sequencer has a bad day.

Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens

Formerly of Yale: Howard Ochman, David Williams

Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas

Observation: Most scientists seem to be self-taught in computing.

Observation: Most scientists waste a lot of time using computers inefficiently.

Adina and I volunteer with

We teach scientists how to get more done

Woods Hole

Tufts

U. Chicago

U. Chicago

UIC
