All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes
Transcript of All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes
![Page 1: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/1.jpg)
All kmers are not created equal: finding the signal from the noise in large-scale metagenomes.
Will Trimble, Metagenomic Annotation Group, Argonne National Laboratory
BEACON seminar, April 23, 2014, MSU
![Page 3: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/3.jpg)
Apology: I speak biology with an accent
• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to answer questions with ambiguous data.
• Shoveling data from the data-producing machine into the data-consuming furnace.
![Page 5: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/5.jpg)
Outline
• Sequences are different (math)
• How much did my sequencing run give me? kmerspectrumanalyzer (graphs)
• How much did I sample? nonpareil-k (graphs)
• Pretty pictures: thumbnailpolish (micrographs)
![Page 7: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/7.jpg)
Sequences are different
• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD
Instrument readings, spectra, micrographs: not categorical; 10^7 channels.
Low-throughput categorical data: categories are sound; 10^3 channels.
High-throughput sequence data: categorization is an art; 10^11 channels.
![Page 8: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/8.jpg)
Sequences are different
• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
• Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.
![Page 9: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/9.jpg)
Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line? “be regarded as unproved until it has been checked against more exact results”
We know what to do with these puzzles. You go to this website, and type it in…
![Page 11: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/11.jpg)
How long do reads need to be to recognize them?
To do what? To place them on a reference genome? This can be turned into a math problem, which I will illustrate with a search-engine analogy.
![Page 12: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/12.jpg)
How long do reads need to be?
Information (Shannon, 1948, BSTJ) is a quantitative summary of the uncertainty of a probability distribution (a model of the data). It has profound applicability in pattern matching and modeling.
Logarithmic measurements have units!
$$H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$$
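The entropy formula on this slide can be evaluated directly. The following is a minimal sketch; the function names and the example base frequencies are illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter

def shannon_entropy_bits(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits; zero-probability terms add nothing."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def empirical_entropy_bits(symbols):
    """Entropy of the observed symbol frequencies of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy_bits(c / total for c in counts.values())

# Four equally likely bases versus a made-up skewed base composition.
uniform_dna = shannon_entropy_bits([0.25] * 4)
skewed_dna = shannon_entropy_bits([0.4, 0.4, 0.1, 0.1])
print(uniform_dna, skewed_dna)
```

Because the logarithm is base 2, the result is in bits, which is the sense in which "logarithmic measurements have units."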
![Page 13: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/13.jpg)
A word on the sign of the entropy
• A popular straw man among mathematicians and CS people is the “random sequence model”: a uniform categorical distribution over all 4^L sequences.
• When we learn something (say we collect some genomes and expect our new sequences to look like them), we implicitly construct a less flat distribution. Models always have less entropy than the model of ignorance.
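As a worked instance of the entropy formula under the random sequence model (a sketch, with $L$ the sequence length and $q$ standing for any other distribution over the same $4^L$ sequences):

```latex
H_{\text{uniform}}
  = \sum_{i=1}^{4^L} 4^{-L} \log_2\!\left(4^{L}\right)
  = \log_2\!\left(4^{L}\right)
  = 2L \ \text{bits},
\qquad
H(q) \le 2L \ \text{bits, with equality only when } q \text{ is uniform.}
```

The inequality is the standard maximum-entropy property of the uniform distribution, which is what makes the "model of ignorance" the upper bound.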
![Page 16: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/16.jpg)
How long do phrases need to be?
Exercise: Pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.
for n in 1..10
    type the first n words into Google Books, quoted.
    break if Google identifies your book.
Most often it takes 4 words.
![Page 17: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/17.jpg)
How long do phrases need to be?
• Information content of English words: H_word is ca. 12 bits per word.
• Size of Google Books? Big libraries have a few 10^7 books, and each one has 10^5 indexed words, so a database size of 10^12 words. log2(database size) = log2(10^12) = 39.9, or about 40 bits.
• So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in Google's index.
Try it.
Not all phrases are equally distinctive.
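The arithmetic on this slide can be checked in a few lines. This is a sketch using the slide's round numbers; the function name `symbols_needed` is an illustrative assumption.

```python
import math

def symbols_needed(database_size, bits_per_symbol):
    """Bits required to address one position in the database, and the smallest
    whole number of symbols that carries at least that many bits."""
    bits_required = math.log2(database_size)
    return bits_required, math.ceil(bits_required / bits_per_symbol)

# Figures from the slide: ~10^7 books x ~10^5 indexed words, ~12 bits per English word.
books = 10**7
words_per_book = 10**5
bits, words = symbols_needed(books * words_per_book, bits_per_symbol=12)
print(f"{bits:.1f} bits to address the index, so about {words} words")
```

The estimate is an average over phrases; a phrase built from very common words carries fewer bits per word and needs to be longer.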
![Page 18: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/18.jpg)
How long do reads need to be?
• Maximum information content of base pairs: H_read is at most 2L bits per length-L sequence.
• Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = 33.2, or about 34 bits.
• So we expect that when 2L > 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
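The claim that most sufficiently long kmers are distinct can be checked empirically on a toy example. The sketch below uses a uniformly random genome (the "straw man" model from earlier in the talk) of a made-up size; real genomes contain repeats, so they will look less favorable than this.

```python
import math
import random
from collections import Counter

def unique_kmer_fraction(genome, k):
    """Fraction of genome positions whose k-mer occurs exactly once in the genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / (len(genome) - k + 1)

random.seed(0)  # fixed seed so the sketch is repeatable
G = 100_000     # toy genome size (an assumption; the slide uses ca. 10^10 bp)
genome = "".join(random.choice("ACGT") for _ in range(G))

# Smallest k with 2k >= log2(G); this is 9 for the toy genome.
threshold = math.ceil(math.log2(G) / 2)
for k in (4, 8, threshold, 12, 16):
    print(k, round(unique_kmer_fraction(genome, k), 4))
```

Below the threshold almost nothing is unique; a few bases above it nearly every position is, which is the behavior the information argument predicts.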
![Page 19: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/19.jpg)
The data deluge
• Technological breakthroughs in the mid-2000s led to inexpensive collection of tens of gigabytes of sequence data at once.
• The data has outgrown some favorite algorithms from the 1990s (BLAST).
![Page 21: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/21.jpg)
Picture, if you will, a HiSeq flowcell. What’s in there?
• Pairs of microbial genomes
• Microbial transcriptomes + replicates
• Environmental isolate genomes
• Environmental extract sequencing
• Preparation-intensive sequencing
• Eukaryotic sequencing
• Eukaryotic sequencing for variants
Let’s count kmers!
![Page 23: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/23.jpg)
The kmer spectrum.
[Figure: kmer spectrum of a microbial genome; x-axis: 21mer abundance (rare to abundant); y-axis: number of kmers]
• Low-abundance kmers: errors
• Main peak contains most of the genome
• High-abundance peak contains multicopy genes
• Really high-abundance stuff is often artifacts
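A kmer spectrum is a histogram of kmer abundances. The following minimal sketch builds one in memory from a handful of made-up reads; it is not how kmerspectrumanalyzer or any production kmer counter works, and it ignores reverse complements.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads (no reverse-complement merging)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(kmer_counts):
    """Map abundance -> number of distinct kmers observed with that abundance."""
    return Counter(kmer_counts.values())

# Toy reads; the third differs from the first by one substitution.
reads = ["ACGTACGTAC", "CGTACGTACG", "ACGTTCGTAC"]
spectrum = kmer_spectrum(count_kmers(reads, k=5))
for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

Kmers that overlap the substitution are seen only once, so they land in the low-abundance bin, which is the mechanism behind the error peak on the slide.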
![Page 24: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/24.jpg)
Ranked kmer spectrum
[Figure: x-axis: kmer rank (cumulative sum of number of kmers), abundant to rare; y-axis: 21mer abundance]
![Page 25: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/25.jpg)
Ranked kmers consumed
[Figure: x-axis: 21mer abundance (rare to abundant); y-axis: fraction of observed kmers]
The data fraction is unusually stable.
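Both the ranked spectrum and the fraction-of-data view can be derived from the spectrum alone. The sketch below is one reading of those two plots, with made-up numbers; the function name and the exact axis definitions are assumptions.

```python
def ranked_spectrum(spectrum):
    """From {abundance: number_of_distinct_kmers}, list (abundance, rank, data_fraction)
    going from the most abundant kmers to the rarest.

    rank          = cumulative number of distinct kmers with at least this abundance
    data_fraction = cumulative share of all kmer observations held by those kmers
    """
    total_observations = sum(a * n for a, n in spectrum.items())
    rows, rank, consumed = [], 0, 0
    for abundance in sorted(spectrum, reverse=True):
        rank += spectrum[abundance]
        consumed += abundance * spectrum[abundance]
        rows.append((abundance, rank, consumed / total_observations))
    return rows

# Toy spectrum (made-up numbers): a genome peak near 30x, a repeat at 60x, many singleton errors.
toy = {1: 50_000, 29: 20_000, 30: 60_000, 31: 20_000, 60: 1_000}
for abundance, rank, fraction in ranked_spectrum(toy):
    print(abundance, rank, round(fraction, 3))
```

In this toy case the singletons make up a third of the distinct kmers but under 2% of the observations, which illustrates why fractions of the data are less sensitive to errors than counts of distinct kmers.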
![Page 26: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/26.jpg)
Different kinds of data have different spectra
![Page 27: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/27.jpg)
Different kinds of data have different spectra
![Page 29: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/29.jpg)
Redundancy is good
• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction and diginorm somewhat amusingly strive for opposite ends.
Abundance-based inferences are better in the high-abundance part of the data.
![Page 30: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/30.jpg)
kmerspectrumanalyzer: infer genome size and depth
$$P_{NO}(x; c, \{a_n\}, s) = \sum_n a_n \, \mathrm{NB}_{\mathrm{pdf}}(s; \mu = cn, \alpha = s/n)$$
Generalization of the mixed-Poisson model to estimate how much sequence is in each peak.
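To make the idea of a peak-per-copy-number mixture concrete, here is a sketch of the simpler mixed-Poisson model that the slide says is being generalized. It is not the negative-binomial model above and not kmerspectrumanalyzer's code; the parameter values are hypothetical.

```python
import math

def poisson_pmf(x, mean):
    """Poisson probability of observing count x, computed in log space for stability."""
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))

def mixed_poisson_spectrum(x, c, a):
    """Expected number of distinct kmers at abundance x.

    c : coverage depth of single-copy sequence
    a : list where a[n-1] is the number of distinct kmers present in n copies
    """
    return sum(a_n * poisson_pmf(x, c * n) for n, a_n in enumerate(a, start=1))

# Hypothetical genome: 4.5 million single-copy kmers and 50,000 two-copy kmers at 30x depth.
c, a = 30.0, [4_500_000, 50_000]
model = [mixed_poisson_spectrum(x, c, a) for x in range(1, 200)]
peak_abundance = 1 + max(range(len(model)), key=model.__getitem__)
genome_size = sum(n * a_n for n, a_n in enumerate(a, start=1))
print(peak_abundance, genome_size)
```

Fitting would run this in reverse: adjust `c` and the `a` values until the model matches the observed spectrum, then read genome size off the fitted weights. Replacing the Poisson with a negative binomial adds a dispersion parameter so that peaks broader than Poisson can be fitted.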
![Page 31: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/31.jpg)
Fig 2: Counting kmers tells you genome size … for single genomes, most of the time.
[Figure: estimated genome size (kb) vs. complete genome size (kb), both axes 0 to 10000]
So much for calibration data.
![Page 32: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/32.jpg)
The kink does measure error
[Figure: kmer spectra of artificial E. coli data with varying substitution error rates: 10%, 5.5%, 4%, 3%, 1.7%, 1%, 0.5%, 0.3%, 0.1%]
![Page 33: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/33.jpg)
But I want to sequence everything! OK, we can count kmers in everything too.
kmerspectrumanalyzer summarizes the distribution and estimates genome size and coverage depth.
![Page 35: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/35.jpg)
How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again? Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog. We can calculate this efficiently using the kmer spectrum.
$$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poiss}_{\mathrm{cdf}}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poiss}_{\mathrm{cdf}}(\epsilon \cdot r_i, 0)}$$
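A calculation of this kind needs only the spectrum, not the reads. The sketch below is one reading of the slide's formula, which is flattened in this transcript: it assumes the two Poisson terms form a ratio, the probability of seeing a kmer at least twice given that it is seen at least once. The function names and the toy spectrum are assumptions.

```python
import math

def poisson_cdf(mean, k):
    """P(X <= k) for X ~ Poisson(mean), by direct summation (fine for small k)."""
    return sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(k + 1))

def nonunique_fraction(eps, spectrum):
    """Observation-weighted share of kmers expected to be seen at least twice,
    given that they are seen at least once, when a fraction eps of the data is sampled.

    spectrum maps abundance r_i -> number of distinct kmers n_i.
    """
    total = sum(n * r for r, n in spectrum.items())
    result = 0.0
    for r, n in spectrum.items():
        mean = eps * r  # expected abundance of this kmer in the subsample
        seen_at_least_twice = 1.0 - poisson_cdf(mean, 1)
        seen_at_least_once = 1.0 - poisson_cdf(mean, 0)
        result += (n * r / total) * (seen_at_least_twice / seen_at_least_once)
    return result

# Toy spectrum (made-up numbers), evaluated at increasing sampling fractions.
toy = {1: 50_000, 30: 100_000, 60: 1_000}
for eps in (0.01, 0.1, 1.0):
    print(eps, round(nonunique_fraction(eps, toy), 3))
```

Sweeping `eps` traces out a rarefaction-style curve: near zero when little has been sampled, and approaching one as the same kmers keep coming back.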
![Page 36: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/36.jpg)
Nonpareil: model of sequence coverage
Nonpareil-k: kmer rarefaction
summary of sequence diversity
Nonpareil: uses subset-against-all alignment to find out how much of the dataset is unique.
Nonpareil-k: crunches the kmer spectrum to approximate the unique fraction, 300x faster.
![Page 38: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/38.jpg)
Nonpareil-k: stratify datasets by coverage distribution
• Most of the dataset is likely contained in the assembly.
• The assembly is likely to miss or attenuate the large unique fraction of the dataset.
![Page 39: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/39.jpg)
kmer spectra reveal sequencing problems
• Amok PCR: seemingly random sequences
• Amok MDA: 10 Gbases of sequence, one gene
• PCR duplicates: entire sequencing run was 50x exact and near-exact duplicate reads
• Unusually high error rate: indicated by a low fraction of “solid” kmers (for isolate genomes)
• Contaminated samples: 95% E. coli, 5% E. faecalis
![Page 40: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/40.jpg)
Hey kid, you want some unlabeled data?
[Figure 1c: PC02 vs Alpha Diversity; x-axis: eigen_vectors[, "PCO2"]; y-axis: color_matrix[, "alpha-diversity"]]
All: y = -259839.54*x + 209.62; R^2 = 0.29
Gut: y = -275950.37*x + 118.73; R^2 = 0.78
Oral: y = -369610.24*x + 298.39; R^2 = 0.7
[Figure 1d]
HMP / quantile norm / euclidean / colored by alpha
MG-RAST API, R package matR
![Page 41: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/41.jpg)
Hey kid, you want some pretty ordinations?
[Figure 2a]
[Figure 2b]
![Page 42: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/42.jpg)
Generalities from the kmer counting mines
• Many datasets have as much as 5-45% of the sequence yield in adapters.
• FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find).
• Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance.
• Shannon entropy is oversensitive to errors. Higher-order Rényi entropy is more stable.
![Page 44: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/44.jpg)
kmer statistical summaries
• H0 kmer richness (VERY BAD)
• H1 Shannon entropy (BAD)
• H2 Rényi entropy / Simpson index (GOOD)
• observation-weighted coverage (BAD)
• observation-weighted size (BAD)
• observation-median coverage (GOOD)
• observation-median size (GOOD)
• fraction in top 100 kmers (USEFUL)
• fraction unique (OK but requires size correction)
Most of these give answers which vary so strongly with sampling depth as to be unusable. Observation-weighted fraction-of-data metrics behave fairly well. Fractions of the data with particular properties are stable with respect to sampling.
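The first three summaries in the list can be computed from a kmer spectrum, which makes their different sensitivity to error kmers easy to see. This is a sketch with made-up spectra; H2 is taken here as the order-2 Rényi entropy, the log of the inverse Simpson index.

```python
import math

def renyi_entropies(spectrum):
    """Return (H0, H1, H2) in bits from {abundance: number_of_distinct_kmers}.

    H0 = log2(richness), H1 = Shannon entropy, H2 = -log2(sum of p^2).
    """
    total = sum(r * n for r, n in spectrum.items())
    h0 = math.log2(sum(spectrum.values()))
    h1 = sum(n * (r / total) * math.log2(total / r) for r, n in spectrum.items())
    h2 = -math.log2(sum(n * (r / total) ** 2 for r, n in spectrum.items()))
    return h0, h1, h2

clean = {30: 1_000_000}                # made-up genome: 10^6 distinct kmers at 30x
noisy = {30: 1_000_000, 1: 2_000_000}  # same data plus 2 x 10^6 singleton error kmers

shifts = [after - before
          for before, after in zip(renyi_entropies(clean), renyi_entropies(noisy))]
print([round(s, 2) for s in shifts])
```

In this toy case the error kmers move H0 the most and H2 the least, in line with the ranking on the slide.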
![Page 45: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/45.jpg)
thumbnailpolish
http://www.mcs.anl.gov/~trimble/flowcell/
![Page 46: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/46.jpg)
Sometimes the sequencer has a bad day.
![Page 47: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/47.jpg)
Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens
Formerly of Yale: Howard Ochman, David Williams
Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas
![Page 48: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/48.jpg)
Observation: Most scientists seem to be self-taught in computing.
Observation: Most scientists waste a lot of time using computers inefficiently.
Adina and I volunteer with
![Page 49: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes](https://reader034.fdocument.pub/reader034/viewer/2022042715/5587d3b3d8b42ae8208b4610/html5/thumbnails/49.jpg)
We teach scientists how to get more done
Woods Hole
Tufts
U. Chicago
U. Chicago
UIC