DNA Sequencingand Assembly
DNA sequencing
How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
Which representative of the species?
Which human?
Answer one:
Answer two: it doesn’t matter
Polymorphism rate: number of letter changes between two different members of a species
Humans: ~1/1,000 – 1/10,000
Other organisms have much higher polymorphism rates
DNA sequencing – vectors
+ =
DNA
Shake
DNA fragments
VectorCircular genome(bacterium, plasmid)
Knownlocation
(restrictionsite)
Different types of vectors
VECTOR Size of insert
Plasmid2,000-10,000
Can control the size
Cosmid 40,000
BAC (Bacterial Artificial Chromosome)
70,000-300,000
YAC (Yeast Artificial Chromosome)
> 300,000Not used much
recently
DNA sequencing – gel electrophoresis
Start at primer(restriction site)
Grow DNA chain
Include dideoxynucleoside(modified a, c, g, t)
Stops reaction at allpossible points
Separate products withlength, using gel electrophoresis
Electrophoresis diagrams
Output of gel electrophoresis: a read
A read: 500-700 nucleotides
A C G A A T C A G …. A16 18 21 23 25 15 28 30 32 21
Quality scores: -10log10Prob(Error)
Reads can be obtained from leftmost, rightmost ends of the insert
Double-barreled sequencing:Both leftmost & rightmost ends are sequenced
Method to sequence segments longer than 500
cut many times at random (Shotgun)
genomic segment
Get one or two reads from each segment
~500 bp ~500 bp
Reconstructing the Sequence (Fragment Assembly)
Cover region with ~7-fold redundancy (7X)
Overlap reads and extend to reconstruct the original genomic region
reads
Definition of Coverage
Length of genomic segment: LNumber of reads: nLength of each read: l
Definition: Coverage C = nl/L
How much coverage is enough?
(Lander-Waterman model):Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides
C
Challenges with Fragment Assembly
• Sequencing errors~1-2% of bases are wrong
• Repeats
• Computation: ~ O( N2 ) where N = # reads
false overlap due to repeat
Repeats
Bacterial genomes: 5%Mammals: 50%
Repeat types:
Low-Complexity DNA (e.g. ATATATATACATA…)Microsatellite repeats: (a1…ak)N where k ~ 3-6
(e.g. CAGCAGTAGCAGCACCAG)Common Repeat Families
SINE (Short Interspersed Nuclear Elements)(e.g. ALU: ~300-long, 106 copies)
LINE (Long Interspersed Nuclear Elements)~500-5,000-long, 200,000 copies
MIRLTR/Retroviral
Other-Genes that are duplicated & then diverge (paralogs)-Recent duplications, ~100,000-long, very similar copies
What can we do about repeats?
Two main approaches:• Cluster the reads
• Link the reads
What can we do about repeats?
Two main approaches:• Cluster the reads
• Link the reads
Strategies for sequencing a whole genome
1. Hierarchical – Clone-by-clonei. Break genome into many long piecesii. Map each long piece onto the genomeiii. Sequence each piece with shotgun
Example: Yeast, Worm, Human, Rat
2. Online version of (1) – Walkingi. Break genome into many long piecesii. Start sequencing each piece with shotguniii. Construct map as you go
Example: Rice genome
3. Whole genome shotgun
One large shotgun pass on the whole genome
Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu
Hierarchical Sequencing
Hierarchical Sequencing Strategy
1. Obtain a large collection of BAC clones2. Map them onto the genome (Physical Mapping)3. Select a minimum tiling path4. Sequence each clone in the path with shotgun5. Assemble6. Put everything together
a BAC clone
mapgenome
Methods of physical mapping
Goal:
Make a map of the locations of each clone relative to one another Use the map to select a minimal set of clones to sequence
Methods:
• Hybridization• Digestion
1. Hybridization
Short words, the probes, attach to complementary words
1. Construct many probes2. Treat each BAC with all probes3. Record which ones attach to it4. Same words attaching to BACS X, Y overlap
p1 pn
Hybridization – Computational Challenge
Matrix:m probes n clones
(i, j): 1, if pi hybridizes to Cj
0, otherwise
Definition: Consecutive ones matrixA matrix 1s are consecutive
Computational problem:Reorder the probes so that matrix is in consecutive-ones form
Can be solved in O(m3) time (m >> n)Unfortunately, data is not perfect
p1 p2 …………………….pm
C1
C2 …
……
……
….C
n
1 0 1…………………...01 1 0 …………………..0
0 0 1 …………………..1
pi1pi2…………………….pim
Cj1C
j2 …
……
……
….C
jn
1 1 1 0 0 0……………..00 1 1 1 1 1……………..00 0 1 1 1 0……………..0
0 0 0 0 0 0………1 1 1 00 0 0 0 0 0………0 1 1 1
2. Digestion
Restriction enzymes cut DNA where specific words appear
1. Cut each clone separately with an enzyme2. Run fragments on a gel and measure length3. Clones Ca, Cb have fragments of length { li, lj, lk } overlap
Double digestion:Cut with enzyme A, enzyme B, then enzymes A + B
Whole-Genome Shotgun Sequencing
Whole Genome Shotgun Sequencing
cut many times at random
genome
forward-reverse linked reads
plasmids (2 – 10 Kbp)
cosmids (40 Kbp) known dist
~500 bp~500 bp
The Overlap-Layout-Consensus approach
1. Find overlapping reads
4. Derive consensus sequence ..ACGATTACAATAGGTT..
2. Merge good pairs of reads into longer contigs
3. Link contigs to form supercontigs
+ many heuristics
1. Find Overlapping Reads
• Sort all k-mers in reads (k ~ 24)
TAGATTACACAGATTAC
TAGATTACACAGATTAC|||||||||||||||||
• Find pairs of reads sharing a k-mer
• Extend to full alignment – throw away if not >95% similar
T GA
TAGA| ||
TACA
TAGT||
1. Find Overlapping Reads
One caveat: repeats
A k-mer that appears N times, initiates N2 comparisons
ALU: 1,000,000 times
Solution:
Discard all k-mers that appear more than c Coverage, (c ~ 10)
1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGA
1. Find Overlapping Reads (cont’d)
• Correct errors using multiple alignment
TAGATTACACAGATTACTGATAGATTACACAGATTACTGATAG TTACACAGATTATTGATAGATTACACAGATTACTGATAGATTACACAGATTACTGA
C: 20C: 35T: 30C: 35C: 40
C: 20C: 35C: 0C: 35C: 40
• Score alignments
• Accept alignments with good scores
A: 15A: 25A: 40A: 25-
A: 15A: 25A: 40A: 25A: 0
Basic principle of assembly
Repeats confuse us
Ability to merge two reads ability to detect repeats
We can dismiss as repeat any overlap of < t% similarity
Role of error correction:
Discards ~90% of single-letter sequencing errors
Threshold t% increases
2. Merge Reads into Contigs (cont’d)
Merge reads up to potential repeat boundaries(Myers, 1995)
repeat region
2. Merge Reads into Contigs (cont’d)
• Ignore non-maximal reads• Merge only maximal reads into contigs
repeat region
2. Merge Reads into Contigs (cont’d)
• Ignore “hanging” reads, when detecting repeat boundaries
sequencing errorrepeat boundary???
b
a
2. Merge Reads into Contigs (cont’d)
?????
Unambiguous
• Insert non-maximal reads whenever unambiguous
3. Link Contigs into Supercontigs
Too dense: Overcollapsed?
(Myers et al. 2000)
Inconsistent links: Overcollapsed?
Normal density
Find all links between unique contigs
3. Link Contigs into Supercontigs (cont’d)
Connect contigs incrementally, if 2 links
Fill gaps in supercontigs with paths of overcollapsed contigs
3. Link Contigs into Supercontigs
Define G = ( V, E )V := contigs
E := ( A, B ) such that d( A, B ) < C
Reason to do so: Efficiency; full shortest paths cannot be computed
3. Link Contigs into Supercontigs
d ( A, B )Contig A
Contig B
3. Link Contigs into Supercontigs
Contig AContig B
Define T: contigs linked to either A or B
Fill gap between A and B if there is a path in G passing only from contigs in T
4. Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAAACTATAG TTACACAGATTATTGACTTCATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGCGTAA CTATAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
Mouse Genome
Several heuristics of iteratively:Breaking supercontigs that are suspiciousRejoining supercontigs
Size of problem: 32,000,000 reads
Time: 15 days, 1 processorMemory: 28 Gb
N50 Contig size: 16.3 Kb 24.8 Kb N50 Supercontig size: .265 Mb 16.9 Mb
Mouse Assembly
Sequencing in the (near) future
…
CMOS ChipPhotodiodes
m
4m
Microfluidic Chip
m
Inlet
Outlet
m
m
Top Related