DNA SEQUENCING - IISER Punefarhat/wordpress/wp-content/uploads/2011/06/... · DE NOVO GENOME...
Farhat Habib
• Sequencers cannot read an entire genome
• Assembly is the process of reconstructing the genome from the reads
DNA SEQUENCING
Farhat Habib Oct 9, Strand Life Sciences
Fragment Assembly
• Sequencers can only read small parts of the sequence so to obtain the final sequence, these “reads” have to be assembled into a long contiguous sequence
Why hydra?
• Hydra is a model organism for study of developmental patterning, cell differentiation, and regeneration
• Many mutants in hydra are known, including one that fails to regenerate a head, ones with altered sizes, and others with defects in nematocyte production
• Methods for making transgenic hydra have been developed
Hemmrich et al, (2006), Molecular Phylogenetics and Evolution
Phylogenetic tree from 16S mitochondrial rRNA datasets; bootstrap values from ML/MP/NJ analysis
DE NOVO GENOME ASSEMBLY
Problem: given a collection of reads, i.e. short subsequences of the genomic sequence in the alphabet “A, C, G, T”, reconstruct the genome from which the reads are derived.
Challenges:
◦ Repeats in the genome
…ACCCAGTTGACTGGGATCCTTTTTAAAGACTGGGATTTTAACGCG…
Sample reads: CAGTTGACTG, ACTGGGATCC, GACTGGGATT
◦ Sequencing errors: substitutions, insertions, deletions, and others.
TTTTTATAGA (substitution), CCTT-TAAACG (deletion and insertion)
◦ Size of the data (typically 100x or more coverage)
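The repeat challenge can be made concrete with the example genome on this slide. The sketch below is only an illustration: the helper name `find_all` is hypothetical, and it locates the sample reads and the repeated segment GACTGGGAT by exact string search.

```python
def find_all(genome, pattern):
    """Return every start index at which `pattern` occurs in `genome`."""
    hits = []
    start = genome.find(pattern)
    while start != -1:
        hits.append(start)
        start = genome.find(pattern, start + 1)
    return hits

# Genome fragment and sample reads copied from the slide.
genome = "ACCCAGTTGACTGGGATCCTTTTTAAAGACTGGGATTTTAACGCG"
reads = ["CAGTTGACTG", "ACTGGGATCC", "GACTGGGATT"]

read_positions = {read: find_all(genome, read) for read in reads}
repeat_positions = find_all(genome, "GACTGGGAT")  # occurs twice
```

Each sample read has a single origin, but the 9-base repeat occurs at two positions, so a read lying entirely inside it could be placed at either copy.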
DE NOVO GENOME ASSEMBLY CURRENT SOLUTIONS
• Overlap-layout-consensus (Celera, Newbler)
◦ Suitable for low coverage, long reads
◦ Highly parallelizable
• De Bruijn graph construction (ALLPATHS-LG, ABySS, Velvet, SOAPdenovo, EULER-SR)
◦ Suitable for high coverage, short reads
◦ Fast but memory-intensive
◦ Sensitive to sequencing errors
• Burrows-Wheeler Transform based (SGA)
OVERLAP
• Find the best match between the suffix of one read and the prefix of another
• Due to sequencing errors, dynamic programming is needed to find the optimal overlap alignment
• Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
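The dynamic-programming step can be sketched as a suffix-prefix (overlap) alignment. This is a minimal illustration, not the scoring used by any particular assembler: the function name `overlap_score` and the scores (match +1, mismatch −1, gap −1) are assumptions.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning some suffix of `a` to some prefix of `b`."""
    m = len(b)
    # Row 0: nothing of `a` consumed yet, so a prefix b[:j] costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        # cur[0] = 0: the aligned suffix may start anywhere in `a` for free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + step,  # align a[i-1] with b[j-1]
                         prev[j] + gap,       # a[i-1] against a gap
                         cur[j - 1] + gap)    # b[j-1] against a gap
        prev = cur
    # The alignment must reach the end of `a` but may stop anywhere in `b`.
    return max(prev)

exact = overlap_score("TAGATTACACAG", "ACACAGATTAC")       # ACACAG overlaps exactly
with_error = overlap_score("AAAAGATTACA", "GATTTCAGGGG")   # GATTACA vs GATTTCA
```

In practice the quadratic alignment would only be run on read pairs that pass the filtration step in the last bullet.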
OVERLAP-LAYOUT-CONSENSUS
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
OVERLAP-LAYOUT-CONSENSUS
Nature Biotechnology, Volume 29, Number 11, November 2011, p. 989 (Primer)

… (e.g., AT, TG, GG, GC, CG, GT, CA and AA) can appear only once as a node of the graph. Then, connect node x to node y with a directed edge if some k-mer (e.g., ATG) has prefix x (e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer (Fig. 3d; in Box 3, we describe how this approach was originally discussed in the context of sequencing by hybridization).

Now imagine an ant that follows a different strategy: instead of visiting every node of the graph (as before), it now attempts to visit every edge of the graph exactly once. Sound familiar? This is exactly the kind of path that would solve the Bridges of Königsberg problem and is called an Eulerian cycle. As it visits all edges of the de Bruijn graph, which represent all possible k-mers, this new ant also spells out a candidate genome; for each edge that the ant traverses, one records the first nucleotide of the k-mer assigned to that edge.

Euler considered graphs for which there exists a path between every two nodes (called connected graphs). He proved that a connected graph with undirected edges contains an Eulerian cycle exactly when every node in the graph has an even number of edges touching it. For the Königsberg Bridge graph (Fig. 1b), this is not the case because each of the four nodes has an odd number of edges touching it and so the desired stroll through the city does not exist.

The case of directed graphs (that is, graphs with directed edges) is similar. For any node in a directed graph, define its indegree as the number of edges leading into it and its outdegree as the number of edges leaving it. A graph in which indegrees are equal to outdegrees for all nodes is called ‘balanced’. Euler’s theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced. In particular, Euler’s theorem implies that our de Bruijn graph contains an Eulerian cycle as long as we have located all k-mers present in the genome. Indeed, in this case, for any node, both its indegree and outdegree represent the number of times the (k–1)-mer assigned to that node occurs in the genome.

To see why Euler’s theorem must be true, first note that a graph that contains an Eulerian cycle is balanced because every time an ant traversing an Eulerian cycle passes through a particular vertex, it enters on one edge of the cycle and exits on the next edge. This pairs up all the edges touching each vertex, showing that half the edges touching the vertex lead into it and half lead out from it. It is a bit harder to see the converse, that every connected balanced …

Figure 3 Two strategies for genome assembly: from Hamiltonian cycles to Eulerian cycles. (a) An example small circular genome. (b) In traditional Sanger sequencing algorithms, reads were represented as nodes in a graph, and edges represented alignments between reads. Walking along a Hamiltonian cycle by following the edges in numerical order allows one to reconstruct the circular genome by combining alignments between successive reads. At the end of the cycle, the sequence wraps around to the start of the genome. The repeated part of the sequence is grayed out in the alignment diagram. (c) An alternative assembly technique first splits reads into all possible k-mers: with k = 3, ATGGCGT comprises ATG, TGG, GGC, GCG and CGT. Following a Hamiltonian cycle (indicated by red edges) allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive nodes) is shifted by one position. This procedure recovers the genome but does not scale well to large graphs. (d) Modern short-read assembly algorithms construct a de Bruijn graph by representing all k-mer prefixes and suffixes as nodes and then drawing edges that represent k-mers having a particular prefix and suffix. For example, the k-mer edge ATG has prefix AT and suffix TG. Finding an Eulerian cycle allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive edges) is shifted by one position. This generates the same cyclic genome sequence without performing the computationally expensive task of finding a Hamiltonian cycle.

[Figure 3 panel labels: (a) Genome, short-read sequencing; reads CGTGCAA, ATGGCGT, CAATGGC, GGCGTGC, TGCAATG. (b) Genome: ATGGCGTGCAATG. (c) Vertices are k-mers, edges are pairwise alignments; Hamiltonian cycle: visit each vertex once (harder to solve); k-mers from vertices: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT. (d) Vertices are (k−1)-mers (AT, TG, GG, GC, CG, GT, CA, AA), edges are k-mers; Eulerian cycle: visit each edge once (easier to solve); k-mers from edges.]

Approach with traditional Sanger sequencing
Compeau et al, Nature Biotechnology, 11 (2011)
OVERLAPPING READS
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
• Sort all k-mers in reads (e.g., k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment; throw away if not >95% similar
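The first two steps can be sketched as follows. Note one substitution: the slide sorts all k-mers, whereas this illustration groups reads with a hash index, which finds the same read pairs. The function name `candidate_pairs` and the toy reads are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> ids of reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

# Toy reads; real data would use a much larger k (the slide suggests k ~ 24).
reads = ["TAGATTACACAG", "TTACACAGATTA", "CCCCCCCCGGGG"]
pairs = candidate_pairs(reads, k=6)
```

Only the surviving pairs would then be extended to a full alignment and checked against the similarity threshold.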
OVERLAPPING READS AND REPEATS
• A k-mer that appears N times initiates N^2 comparisons
• An Alu that appears 10^6 times initiates 10^12 comparisons, which is too many
• Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10)
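The arithmetic and the filter can be sketched in a few lines. The function names and the toy counts below are assumptions chosen for illustration; only the threshold rule (more than t × coverage occurrences) comes from the slide.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def repetitive_kmers(counts, coverage, t=10):
    """k-mers seen more than t * coverage times are treated as repeats."""
    return {kmer for kmer, n in counts.items() if n > t * coverage}

# An element with 10^6 copies would trigger 10^12 pairwise comparisons.
alu_copies = 10**6
comparisons = alu_copies**2

# Hypothetical counts: with coverage 20 and t = 10 the threshold is 200.
toy_counts = Counter({"GGCC": 500, "ACGT": 30})
discarded = repetitive_kmers(toy_counts, coverage=20)
```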
FINDING OVERLAPPING READS
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
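FINDING OVERLAPPING READS
Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
Score alignments
Accept alignments with good scores
Substitution column (base: quality), before: C: 20, C: 35, T: 30, C: 35, C: 40
Substitution column, after correction: C: 20, C: 35, C: 0, C: 35, C: 40
Gap column (base: quality), before: A: 15, A: 25, A: 40, A: 25, and a gap (-) in one read
Gap column, after correction: A: 15, A: 25, A: 40, A: 25, A: 0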
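One way to reproduce the corrected columns above is a quality-weighted vote per alignment column. This is an assumed rule for illustration (the slide shows only the before and after values): the base with the largest summed quality wins, and any read that disagreed receives the consensus base with quality 0. The name `correct_column` is hypothetical.

```python
from collections import defaultdict

def correct_column(column):
    """column: list of (base, quality) pairs; '-' marks a gap with quality 0."""
    weight = defaultdict(int)
    for base, quality in column:
        weight[base] += quality
    consensus = max(weight, key=weight.get)
    # Disagreeing reads take the consensus base but carry no confidence.
    corrected = [(b, q) if b == consensus else (consensus, 0) for b, q in column]
    return consensus, corrected

substitution = [("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]
gap = [("A", 15), ("A", 25), ("-", 0), ("A", 40), ("A", 25)]

sub_base, sub_fixed = correct_column(substitution)
gap_base, gap_fixed = correct_column(gap)
```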
ALTERNATIVE METHOD
[Same journal page and Figure 3 of Compeau et al. as on the earlier OVERLAP-LAYOUT-CONSENSUS slide: Hamiltonian cycle (vertices are k-mers, visit each vertex once, harder to solve) versus Eulerian cycle (vertices are (k−1)-mers, edges are k-mers, visit each edge once, easier to solve).]
Compeau et al, Nature Biotechnology, 11 (2011)
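The Eulerian approach in Figure 3d can be sketched with the ten 3-mers from the figure. This is a minimal illustration rather than any assembler's implementation: the function names are hypothetical, and the cycle is found with Hierholzer's algorithm, which the slide does not name.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_cycle(graph, start):
    """Hierholzer's algorithm; assumes the graph is connected and balanced."""
    remaining = {node: list(nexts) for node, nexts in graph.items()}
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            cycle.append(stack.pop())            # dead end: emit the node
    return cycle[::-1]                           # node sequence start ... start

def spell_circular(cycle):
    """Record the first nucleotide of each traversed edge."""
    return "".join(node[0] for node in cycle[:-1])

kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
genome = spell_circular(eulerian_cycle(de_bruijn(kmers), "AT"))
```

This graph has more than one Eulerian cycle (the nodes TG and GC each branch), so the spelled genome may be ATGGCGTGCA or another circular string with exactly the same 3-mers.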
SUPERSTRING PROBLEM
• In 1946, Nicolaas de Bruijn solved the ‘superstring problem’: find a shortest circular ‘superstring’ that contains all possible ‘substrings’ of length k (k-mers) over a given alphabet.
• There exist n^k k-mers in an alphabet containing n symbols. For example, if our alphabet is 0 and 1, then all possible 3-mers are simply given by all 2^3 = 8 three-digit binary numbers: 000, 001, 010, 011, 100, 101, 110, 111.
• The circular superstring 00011101 (written out linearly with wrap-around as 0001110100) not only contains all 3-mers but also is as short as possible, as it contains each 3-mer exactly once.
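A quick check of the binary example: read circularly, the 8-symbol string 00011101 contains each of the eight 3-mers exactly once, and unrolling it with two wrap-around symbols gives 0001110100. The helper names below are hypothetical.

```python
from itertools import product

def circular_kmers(s, k):
    """All k-mers of `s` read circularly (one per starting position)."""
    unrolled = s + s[:k - 1]
    return [unrolled[i:i + k] for i in range(len(s))]

def is_de_bruijn_sequence(s, alphabet, k):
    """True if circular `s` contains every k-mer over `alphabet` exactly once."""
    all_kmers = sorted("".join(p) for p in product(alphabet, repeat=k))
    return sorted(circular_kmers(s, k)) == all_kmers

ok = is_de_bruijn_sequence("00011101", "01", 3)
too_long = is_de_bruijn_sequence("0001110100", "01", 3)  # 10 positions, 8 k-mers
```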
ANOTHER EXAMPLE CONSTRUCTING THE GRAPH (K = 4)
AGAT (8x)
ATCC (7x)
TCCG (7x)
CCGA (7x)
CGAT (6x)
GATG (5x)
ATGA (8x)
TGAG (9x)
GATC (8x)
AAGT (3x)
AGTC (7x)
GTCG (9x)
TCGA (10x)
GGCT (11x)
TAGA (16x)
AGAG (9x)
GAGA (12x)
GACA (8x)
ACAA (5x)
GCTT (8x)
GCTC (2x)
CTTT (8x)
CTCT (1x)
TTTA (8x)
TCTA (2x)
TTAG (12x)
CTAG (2x)
AGAC (9x)
CGAG (8x)
CGAC (1x)
GAGG (16x)
GACG (1x)
AGGC (16x)
ACGC (1x)
A branching vertex is caused by either a repeat in the original sequence or a sequencing error
Sequencing errors are normally detected by a coverage cutoff threshold
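The coverage cutoff can be tried on the k-mer table above. The cutoff value of 3 is an assumption chosen for this example, not a value given on the slide; with it, the seven discarded k-mers are exactly those that spell GCTCTAG and CGACGC.

```python
# k-mer coverage table copied from the slide (k = 4).
TABLE = """
AGAT 8 ATCC 7 TCCG 7 CCGA 7 CGAT 6 GATG 5 ATGA 8 TGAG 9 GATC 8
AAGT 3 AGTC 7 GTCG 9 TCGA 10 GGCT 11 TAGA 16 AGAG 9 GAGA 12
GACA 8 ACAA 5 GCTT 8 GCTC 2 CTTT 8 CTCT 1 TTTA 8 TCTA 2 TTAG 12
CTAG 2 AGAC 9 CGAG 8 CGAC 1 GAGG 16 GACG 1 AGGC 16 ACGC 1
"""

def parse_table(text):
    """Turn 'KMER count KMER count ...' into a dict."""
    tokens = text.split()
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}

def apply_cutoff(coverage, cutoff):
    """Keep k-mers seen at least `cutoff` times; report the rest as likely errors."""
    kept = {kmer: n for kmer, n in coverage.items() if n >= cutoff}
    removed = sorted(kmer for kmer in coverage if kmer not in kept)
    return kept, removed

coverage = parse_table(TABLE)
kept, removed = apply_cutoff(coverage, cutoff=3)  # cutoff is an assumed value
```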
EXAMPLE AFTER CONDENSATION
AAGTCGA
TAGA
GCTTTAG
GCTCTAG
GAGACAA
CGAG
CGACGC
GAGGCT
AGATCCGATGAG
AGAG
Merge nodes with no branches
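Condensation can be sketched on a subset of the k-mers from the example. The sketch assumes nodes are k-mers joined when they overlap by k−1 bases; the function name `condense` is hypothetical, and isolated cycles are ignored for brevity.

```python
from collections import defaultdict

def condense(kmers):
    """Merge every non-branching chain of overlapping k-mers into one node."""
    kmers = list(dict.fromkeys(kmers))  # drop duplicates, keep order
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    succ = {kmer: list(by_prefix.get(kmer[1:], [])) for kmer in kmers}
    pred = {kmer: [] for kmer in kmers}
    for u, nexts in succ.items():
        for v in nexts:
            pred[v].append(u)

    def starts_chain(v):
        # v continues a chain only if it has exactly one predecessor
        # and that predecessor has exactly one successor.
        return not (len(pred[v]) == 1 and len(succ[pred[v][0]]) == 1)

    merged = []
    for v in kmers:
        if not starts_chain(v):
            continue
        seq, cur = v, v
        while len(succ[cur]) == 1 and len(pred[succ[cur][0]]) == 1:
            cur = succ[cur][0]
            seq += cur[-1]  # each step extends the sequence by one base
        merged.append(seq)
    return merged

subset = ["AAGT", "AGTC", "GTCG", "TCGA", "CGAG", "CGAC", "GACG", "ACGC"]
nodes = condense(subset)
```

On this subset the branch at TCGA stops the first chain, giving the nodes AAGTCGA, CGAG and CGACGC that also appear on the slide.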
EXAMPLE AFTER ERROR REMOVAL
AAGTCGA
TAGA
GCTTTAG
GAGACAA
CGAG
GAGGCT
AGATCCGATGAG
AGAG
EXAMPLE AFTER RECONDENSATION
AAGTCGAG
GAGACAA
GAGGCTTTAGA
AGATCCGATGAG
AGAG
Any non-branching path in this graph corresponds to a contig in the original sequence.
RESOLVING REPEATS USING PAIRED READS
Read 1
Read 2
Insert size: a design parameter
Genome
RESOLVING REPEATS
P. Pevzner and H. Tang, Bioinformatics (2001) Suppl1:S225-33
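[Figure: the graph contains a single REPEAT node flanked by S1, S3 and S2, S4. A mate pair that matches the distance in the graph and is longer than the repeat length separates it into two paths: S1 REPEAT S2 and S3 REPEAT S4.]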
Genome: … S1 REPEAT S2 ……………. S3 REPEAT S4 …
RESOLVING REPEATS FAILURE
Mate pair transformation (Velvet, ABySS, EULER-SR)
• Find a unique path between mates in the graph.
• When multiple paths match the distance between mate pairs, mate pair transformation fails.
To resolve a repeat, insert size must be larger than the repeat length and smaller than the length of potential conjugate paths (same length paths passing through the repeat).
[Figure labels: S1, S3, REPEAT1, P1, P2, REPEAT2, S2, S4; the mate pair spans multiple paths.]
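The success and failure cases can be sketched by enumerating graph paths between the two mates and keeping those whose length matches the insert distance. This is a toy model, not the procedure of Velvet, ABySS or EULER-SR: the function name, the node lengths and the tolerance are assumptions, and every node length must be positive so the search terminates.

```python
def paths_between_mates(graph, lengths, src, dst, distance, tol):
    """Paths src -> dst whose intermediate nodes sum to distance +/- tol."""
    found = []

    def walk(node, path, gap):
        if gap > distance + tol:  # already too long, prune
            return
        for nxt in graph.get(node, []):
            if nxt == dst:
                if abs(gap - distance) <= tol:
                    found.append(path + [dst])
            else:
                walk(nxt, path + [nxt], gap + lengths[nxt])

    walk(src, [src], 0)
    return found

# Hypothetical node lengths in bases.
lengths = {"S1": 50, "S2": 50, "S3": 50, "S4": 50,
           "REPEAT": 100, "P1": 100, "P2": 100}
resolved_graph = {"S1": ["REPEAT"], "S3": ["REPEAT"], "REPEAT": ["S2", "S4"]}
ambiguous_graph = {"S1": ["P1", "P2"], "P1": ["S2"], "P2": ["S2"]}

unique = paths_between_mates(resolved_graph, lengths, "S1", "S2", 100, 10)
tied = paths_between_mates(ambiguous_graph, lengths, "S1", "S2", 100, 10)
```

A single matching path resolves the repeat; two equal-length paths (P1 and P2) both match, which is the failure case described above.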
SOME ASSEMBLERS
• PHRAP
  • Overlap O(n^2) → layout (no mate pairs) → consensus
• Celera
  • Overlap → layout → consensus
• Arachne
  • Overlap → layout → consensus
• Phusion
  • Overlap → clustering → PHRAP → assemblage → consensus
• Euler
  • Indexing → Euler graph → layout by picking paths → consensus
DE BRUIJN GRAPH BASED ASSEMBLERS
• ABySS (Simpson et al, 2009)
• Velvet (Zerbino and Birney, 2008)
• SOAPdenovo (Li et al, 2009)
• Euler-SR (Chaisson and Pevzner, 2008)
• SHARCGS (Dohm et al, 2007)
Nature Methods, Vol. 9 No. 4, April 2012, p. 334 (Technology Feature)

[Figure: Genome assembly stitches together a genome from short sequenced pieces of DNA. Credit: Michael Schatz, Cold Spring Harbor.]
1. Fragment DNA and sequence
2. Find overlaps between reads
…AGCCTAGACCTACAGGATGCGCGACACGT
               GGATGCGCGACACGTCGCATATCCGGT…
3. Assemble overlaps into contigs
4. Assemble contigs into scaffolds

[Photo caption: Competitions for genome assembly bring developers together to exchange advice and ideas, says Ivo Gut.]

… come from either end of DNA fragments that are too long to be sequenced straight through. Depending on the preparation technique, that distance can be as short as 200 base pairs or as large as several tens of kilobases. Knowing that paired reads were generated from the same piece of DNA can help link contigs into ‘scaffolds’, ordered assemblies of contigs with gaps in between. Paired-read data can also indicate the size of repetitive regions and how far apart contigs are.

Assessing quality is made more difficult because sequencing technology changes so quickly. In January of this year, Life Technologies launched new versions of its Ion Torrent machines, which can purportedly sequence a human genome in a day, for $1,000 in equipment and reagents. In February, Oxford Nanopore Technologies announced a technology that sequences tens of kilobases in continuous stretches, which would allow genome assembly with much more precision and drastically less computational work. Other companies, such as Pacific Biosciences, also have machines that produce long reads, and at least some researchers are already combining data types to glean the advantages of each.

Software engineers who write assembly programs know they need to adapt. “Every time the data changes, it’s a new problem,” says David Jaffe, who works on genome assembly methods at the Broad Institute in Cambridge, Massachusetts. “Assemblers are always trying to catch up to the data.” Of course, until a technology has been available for a while, it is hard to know how much researchers will use it. Cost, ease of use, error rates and reliability are hard to assess before a wider community gains more experience with new procedures. Luckily, ongoing efforts for evaluating short-read assemblies should make innovations easier to evaluate and incorporate.

Judging genomes
In the absence of a high-quality reference genome, new genome assemblies are often evaluated on the basis of the number of scaffolds and contigs required to represent the genome, the proportion of reads that can be assembled, the absolute length of contigs and scaffolds, and the length of contigs and scaffolds relative to the size of the genome.

The most commonly used metric is N50, the smallest scaffold or contig above which 50% of an assembly would be represented. But this metric may not accurately reflect the quality of an assembly. An early assembly of the sea squirt Ciona intestinalis had an N50 of 234 kilobases. A subsequent assembly extended the N50 more than tenfold, but an analysis by Korf and colleagues showed that this assembly lacked several conserved genes, perhaps because algorithms discarded repetitive sequences (ref. 2). This is not an isolated example: the same analysis found that an assembly of the chicken genome lacks 36 genes that are conserved across yeast, plants and other organisms. But these genes seem to be missing from the assembly rather than the organism: focused re-analysis of the raw data found most of these genes in sequences that had not been included in the assembly.

Though the sea squirt and chicken genomes were assembled several years ago, such examples are still relevant because assembly is more difficult with the shorter reads used today, says Deanna Church, a staff scientist at the US National Institutes of Health who leads efforts to improve assemblies for the mouse and human genomes. “In my experience, people do not look at assemblies critically enough,” she says.

Assessing assemblers
Right now, when researchers describe a new assembler, they often run it on a new data set, making comparisons difficult. But a few projects are examining how different assemblers perform with the same data. The goal is to learn both how to assemble high-quality genomes and how to recognize a high-quality assembly. For the Genome Assembly Gold-standard Evaluations (GAGE), scientists led by Steven Salzberg at Johns Hopkins University School of Medicine assembled four genomes (three of which had been previously published) using eight of the most popular de novo assembly algorithms (ref. 3).

Two other efforts, the Assemblathon and dnGASP (de novo Genome Assembly Assessment Project), have taken the form of competitions. Teams generally consisted of the software designers for particular assemblers, who could adjust parameters as they thought best before submitting assembled genomes for evaluation. Performance was evaluated using simulated data from computer-designed genomes (ref. 4).

The point is not identifying the best overall assembler at a particular point in time, but finding ways to assess and improve assemblers in general, says Ivo Gut, director of the National Genome Analysis Center in Spain, who ran dnGASP. dnGASP compared assembly teams’ performance on a specially designed set of artificial chromosomes: three derived from the human genome, three from the chicken genome, and others representing the fruit fly, nematode, brewer’s yeast and two species of mustard plant. In addition, the contest organizers included special ‘challenge chromosomes’ that tested assembler performance on various repetitive structures, divergent alleles and other difficult content.

The data set for these calibration chromosomes should be freely available later this year. “You can run [the reference data set] through your assembler and post the results back on the server. And then you can optimize your results,” says Gut. Researchers can tune assembly parameters for their genome of interest and benchmark the performance of new versions of their assemblers, getting an early indication of an assembler’s performance with a modest investment of computational time, he explains. The …
N50 VALUE
• The N50 length is defined as the largest length L for which the collection of all contigs of length L or longer contains at least half of the sum of the lengths of all contigs
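The definition translates directly into a few lines of code. This is a minimal sketch; the function name `n50` and the contig lengths in the example are made up for illustration.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig")

# Total is 290; the two longest contigs (80 + 70 = 150) already cover half.
example = n50([80, 70, 50, 40, 30, 20])
```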
EFFECT OF K
from plantagora.org
EFFECT OF INSERT SIZE
Repetitive DNA and next-generation sequencing: computational challenges and solutions, Todd J. Treangen & Steven L. Salzberg, Nature Reviews Genetics