DNA SEQUENCING - IISER Punefarhat/wordpress/wp-content/uploads/2011/06/... · DE NOVO GENOME...

Farhat Habib

• Sequencers cannot read an entire genome

• Assembly is the process of reconstructing the genome from the reads

DNA SEQUENCING

Farhat Habib Oct 9, Strand Life Sciences

Fragment Assembly

• Sequencers can only read small parts of the sequence so to obtain the final sequence, these “reads” have to be assembled into a long contiguous sequence

Why hydra?

• Hydra is a model organism for study of developmental patterning, cell differentiation, and regeneration

• Many mutants in hydra are known, including one that fails to regenerate a head, ones with altered sizes, and others with defects in nematocyte production

• Methods for making transgenic hydra have been developed

Hemmrich et al, (2006), Molecular Phylogenetics and Evolution

Phylogenetic tree from 16S mitochondrial rRNA datasets; bootstrap values from ML/MP/NJ analysis


DE NOVO GENOME ASSEMBLY

Problem: given a collection of reads, i.e. short subsequences of the genomic sequence over the alphabet {A, C, G, T}, reconstruct the genome from which the reads are derived.

Challenges:

◦ Repeats in the genome

…ACCCAGTTGACTGGGATCCTTTTTAAAGACTGGGATTTTAACGCG…

Sample reads: CAGTTGACTG, ACTGGGATCC, GACTGGGATT

◦ Sequencing errors: substitutions, insertions, deletions, and others.

TTTTTATAGA (substitution), CCTT—TAAACG (deletion and insertion)

◦ Size of the data (typically 100x or more coverage)


DE NOVO GENOME ASSEMBLY: CURRENT SOLUTIONS

• Overlap-layout-consensus (Celera, Newbler)

◦ Suitable for low coverage, long reads

◦ Highly parallelizable

• De Bruijn graph construction (ALLPATHS-LG, ABySS, Velvet, SOAPdenovo, EULER-SR)

◦ Suitable for high coverage, short reads

◦ Fast but memory-intensive

◦ Sensitive to sequencing errors

• Burrows-Wheeler Transform based (SGA)


OVERLAP

• Find the best match between the suffix of one read and the prefix of another

• Because of sequencing errors, dynamic programming is needed to find the optimal overlap alignment

• Apply a filtration method to discard pairs of fragments that do not share a significantly long common substring
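The suffix-prefix matching described above can be sketched as an overlap alignment computed by dynamic programming. This is a minimal illustration and not the method of any particular assembler: the function name `best_overlap` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def best_overlap(a, b, match=1, mismatch=-1, gap=-1):
    """Best-scoring alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, suffix_start, prefix_len), meaning a[suffix_start:]
    is aligned against b[:prefix_len].
    """
    n, m = len(a), len(b)
    # score[i][j]: best score of aligning some suffix of a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index in `a` where that alignment begins
    start = [[i] * (m + 1) for i in range(n + 1)]
    for j in range(1, m + 1):
        score[0][j] = j * gap          # prefix of b before any base of a is used
        start[0][j] = 0
    for i in range(1, n + 1):
        # column 0 stays 0: the overlap may begin anywhere in `a` for free
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # a[i-1] aligned to a gap
            left = score[i][j - 1] + gap    # b[j-1] aligned to a gap
            best = max(diag, up, left)
            score[i][j] = best
            if best == diag:
                start[i][j] = start[i - 1][j - 1]
            elif best == up:
                start[i][j] = start[i - 1][j]
            else:
                start[i][j] = start[i][j - 1]
    # the alignment must use `a` up to its last base, but may stop anywhere in `b`
    j_best = max(range(m + 1), key=lambda j: score[n][j])
    return score[n][j_best], start[n][j_best], j_best


result = best_overlap("CAGTTGACTG", "ACTGGGATCC")
print(result)
```

With two of the sample reads from the earlier slide, the best overlap is the four-base suffix ACTG of the first read against the prefix ACTG of the second.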


OVERLAP-LAYOUT-CONSENSUS

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors

..ACGATTACAATAGGTT..


OVERLAP-LAYOUT-CONSENSUS

NATURE BIOTECHNOLOGY VOLUME 29 NUMBER 11 NOVEMBER 2011 989

… (e.g., AT, TG, GG, GC, CG, GT, CA and AA) can appear only once as a node of the graph. Then, connect node x to node y with a directed edge if some k-mer (e.g., ATG) has prefix x (e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer (Fig. 3d; in Box 3, we describe how this approach was originally discussed in the context of sequencing by hybridization).

Now imagine an ant that follows a different strategy: instead of visiting every node of the graph (as before), it now attempts to visit every edge of the graph exactly once. Sound familiar? This is exactly the kind of path that would solve the Bridges of Königsberg problem and is called an Eulerian cycle. As it visits all edges of the de Bruijn graph, which represent all possible k-mers, this new ant also spells out a candidate genome; for each edge that the ant traverses, one records the first nucleotide of the k-mer assigned to that edge.

Euler considered graphs for which there exists a path between every two nodes (called connected graphs). He proved that a connected graph with undirected edges contains an Eulerian cycle exactly when every node in the graph has an even number of edges touching it. For the Königsberg Bridge graph (Fig. 1b), this is not the case because each of the four nodes has an odd number of edges touching it and so the desired stroll through the city does not exist.

The case of directed graphs (that is, graphs with directed edges) is similar. For any node in a directed graph, define its indegree as the number of edges leading into it and its outdegree as the number of edges leaving it. A graph in which indegrees are equal to outdegrees for all nodes is called ‘balanced’. Euler’s theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced. In particular, Euler’s theorem implies that our de Bruijn graph contains an Eulerian cycle as long as we have located all k-mers present in the genome. Indeed, in this case, for any node, both its indegree and outdegree represent the number of times the (k–1)-mer assigned to that node occurs in the genome.

To see why Euler’s theorem must be true, first note that a graph that contains an Eulerian cycle is balanced because every time an ant traversing an Eulerian cycle passes through a particular vertex, it enters on one edge of the cycle and exits on the next edge. This pairs up all the edges touching each vertex, showing that half the edges touching the vertex lead into it and half lead out from it. It is a bit harder to see the converse: that every connected balanced …

Figure 3 Two strategies for genome assembly: from Hamiltonian cycles to Eulerian cycles. (a) An example small circular genome. (b) In traditional Sanger sequencing algorithms, reads were represented as nodes in a graph, and edges represented alignments between reads. Walking along a Hamiltonian cycle by following the edges in numerical order allows one to reconstruct the circular genome by combining alignments between successive reads. At the end of the cycle, the sequence wraps around to the start of the genome. The repeated part of the sequence is grayed out in the alignment diagram. (c) An alternative assembly technique first splits reads into all possible k-mers: with k = 3, ATGGCGT comprises ATG, TGG, GGC, GCG and CGT. Following a Hamiltonian cycle (indicated by red edges) allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive nodes) is shifted by one position. This procedure recovers the genome but does not scale well to large graphs. (d) Modern short-read assembly algorithms construct a de Bruijn graph by representing all k-mer prefixes and suffixes as nodes and then drawing edges that represent k-mers having a particular prefix and suffix. For example, the k-mer edge ATG has prefix AT and suffix TG. Finding an Eulerian cycle allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive edges) is shifted by one position. This generates the same cyclic genome sequence without performing the computationally expensive task of finding a Hamiltonian cycle.

[Figure 3 panels, labels recovered from the figure:]

(a) Genome: example small circular genome.

(b) Approach with traditional Sanger sequencing. Reads: CGTGCAA, ATGGCGT, CAATGGC, GGCGTGC, TGCAATG. Hamiltonian cycle: visit each vertex once (harder to solve).

(c) Short-read sequencing, k-mers from vertices. Vertices are k-mers (ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT); edges are pairwise alignments. Hamiltonian cycle: visit each vertex once (harder to solve).

(d) Short-read sequencing, k-mers from edges. Vertices are (k−1)-mers (AT, TG, GG, GC, CG, GT, CA, AA); edges are k-mers. Eulerian cycle: visit each edge once (easier to solve).

Compeau et al, Nature Biotechnology, 11 (2011)
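The Eulerian-cycle strategy in the excerpt and caption above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the input is the list of ten 3-mers named in the figure, and Hierholzer's algorithm is used here as one standard way to find an Eulerian cycle (the source does not say which algorithm an assembler uses).

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_cycle(graph):
    """Hierholzer's algorithm; assumes the graph is connected and balanced."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    start = next(iter(remaining))
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())   # follow an unused edge
        else:
            cycle.append(stack.pop())             # dead end: emit the node
    return cycle[::-1]                            # first node equals last node


def spell_circular_genome(cycle):
    """Record the first nucleotide of the k-mer on each traversed edge."""
    return "".join(node[0] for node in cycle[:-1])


kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
genome = spell_circular_genome(eulerian_cycle(de_bruijn_graph(kmers)))
print(genome)
```

The nodes TG and GC each have two outgoing edges, so this graph has more than one Eulerian cycle. The string printed is therefore one candidate circular genome with exactly the given 3-mers, and it need not be the genome drawn in panel a.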


OVERLAPPING READS

TAGATTACACAGATTAC

TAGATTACACAGATTAC

|||||||||||||||||

• Sort all k-mers in reads (e.g., k ~ 24)

• Find pairs of reads sharing a k-mer

• Extend to full alignment – throw away if not >95% similar
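The three steps above can be sketched as follows. Several simplifications are assumptions of this sketch rather than the slide's method: a hash index replaces sorting the k-mers, the "full alignment" is an ungapped comparison along the seed diagonal instead of a dynamic-programming alignment, and k = 4 is used because the toy reads are only 10 bases long. All function names are hypothetical.

```python
from collections import defaultdict


def seed_hits(reads, k):
    """Return (read_a, read_b, offset) for every k-mer shared by two different reads.

    offset is where read_b starts relative to the start of read_a.
    """
    index = defaultdict(list)                     # k-mer -> [(read id, position)]
    for rid, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((rid, pos))
    hits = set()
    for occurrences in index.values():
        for ra, pa in occurrences:
            for rb, pb in occurrences:
                if ra < rb:
                    hits.add((ra, rb, pa - pb))
    return hits


def ungapped_identity(a, b, offset):
    """Fraction of matching bases where a and b overlap when b starts at a[offset]."""
    start, end = max(0, offset), min(len(a), offset + len(b))
    if end <= start:
        return 0.0
    matches = sum(a[i] == b[i - offset] for i in range(start, end))
    return matches / (end - start)


def overlaps(reads, k, min_identity=0.95):
    """Keep only seed hits whose extended overlap is more than min_identity similar."""
    return sorted(
        hit for hit in seed_hits(reads, k)
        if ungapped_identity(reads[hit[0]], reads[hit[1]], hit[2]) > min_identity
    )


reads = ["CAGTTGACTG", "ACTGGGATCC", "GACTGGGATT"]
print(overlaps(reads, k=4))
```

On the sample reads from the earlier slide, reads 1 and 2 share several 4-mers but are only 8/9 identical over their overlap, so that pair is discarded by the 95% threshold.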


OVERLAPPING READS AND REPEATS

• A k-mer that appears N times initiates N^2 comparisons

• An Alu that appears 10^6 times initiates 10^12 comparisons, which is too many

• Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10)
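A minimal sketch of this filter, assuming the k-mer count is taken over all reads and that `coverage` is supplied by the user; the function names and the toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def usable_seeds(reads, k, coverage, t=10):
    """Keep k-mers that occur at most t * coverage times; the rest look repetitive."""
    limit = t * coverage
    return {kmer for kmer, n in count_kmers(reads, k).items() if n <= limit}


# Toy data: one sequence repeated 50 times stands in for a high-copy repeat.
toy_reads = ["ACGTACGT"] * 50 + ["TTTTGGGG"]
seeds = usable_seeds(toy_reads, k=4, coverage=2, t=10)
print(sorted(seeds))
```

Only the surviving k-mers would then be used to seed read-pair comparisons, which avoids the quadratic blow-up caused by high-copy repeats.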


FINDING OVERLAPPING READS

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA


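FINDING OVERLAPPING READS

Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

Score alignments

Accept alignments with good scores

Column with a substitution (base: quality), before correction: C: 20, C: 35, T: 30, C: 35, C: 40

After correction: C: 20, C: 35, C: 0, C: 35, C: 40

Column with a gap, before correction: A: 15, A: 25, A: 40, A: 25 and one gap (-)

After correction: A: 15, A: 25, A: 40, A: 25, A: 0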
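The correction step can be sketched as a column-wise vote over the aligned reads. This is a simplified illustration with hypothetical function names: `consensus` uses an unweighted majority vote and assumes every column contains at least one base, while `weighted_call` sums the base qualities in a single column, using the quality values shown on the slide.

```python
from collections import Counter, defaultdict


def consensus(rows, gap=" "):
    """Majority vote per column over equal-length aligned rows; gaps do not vote."""
    called = []
    for column in zip(*rows):
        votes = Counter(base for base in column if base != gap)
        called.append(votes.most_common(1)[0][0])
    return "".join(called)


def weighted_call(column):
    """Pick the base with the largest summed quality in one alignment column."""
    totals = defaultdict(int)
    for base, quality in column:
        totals[base] += quality
    return max(totals, key=totals.get)


rows = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAG TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
]
print(consensus(rows))
print(weighted_call([("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]))
```

In the substituted column the four C calls total 130 against 30 for the single T, so the T is corrected to C; the missing base in the third read is filled in with the A that the other reads agree on.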


ALTERNATIVE METHOD










[Figure: Figure 3 of Compeau et al. (2011), shown again. It contrasts the Hamiltonian-cycle approach, in which vertices are k-mers, with the Eulerian-cycle approach on the de Bruijn graph, in which vertices are (k−1)-mers and edges are k-mers (easier to solve).]



SUPERSTRING PROBLEM

• In 1946, Nicolaas de Bruijn solved the ‘superstring problem’: find a shortest circular ‘superstring’ that contains all possible ‘substrings’ of length k (k-mers) over a given alphabet.

• There exist n^k k-mers in an alphabet containing n symbols: for example, the alphabet A, C, G, T has 4^3 = 64 3-mers.

• If our alphabet is instead 0 and 1, then all possible 3-mers are simply given by all eight 3-digit binary numbers: 000, 001, 010, 011, 100, 101, 110, 111.

• The circular superstring 00011101 (written out with its wrap-around as 0001110100) not only contains all 3-mers but also is as short as possible, as it contains each 3-mer exactly once.
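The claim about the binary example can be checked directly. The sketch below reads k-mers around a circular string and tests whether each possible k-mer occurs exactly once; the function names are hypothetical. Note that the circle itself has eight digits, 00011101, and 0001110100 is the same circle unrolled with its first two digits repeated at the end.

```python
from itertools import product


def circular_kmers(s, k):
    """All k-mers read around the circular string s, one starting at each position."""
    doubled = s + s[:k - 1]
    return [doubled[i:i + k] for i in range(len(s))]


def is_de_bruijn_sequence(s, alphabet, k):
    """True if circular s contains every k-mer over `alphabet` exactly once."""
    expected = sorted("".join(p) for p in product(alphabet, repeat=k))
    return sorted(circular_kmers(s, k)) == expected


print(circular_kmers("00011101", 3))
print(is_de_bruijn_sequence("00011101", "01", 3))
print(len(list(product("ACGT", repeat=3))))   # number of DNA 3-mers, 4^3
```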


ANOTHER EXAMPLE CONSTRUCTING THE GRAPH (K = 4)

AGAT (8x)

ATCC (7x)

TCCG (7x)

CCGA (7x)

CGAT (6x)

GATG (5x)

ATGA (8x)

TGAG (9x)

GATC (8x)

AAGT (3x)

AGTC (7x)

GTCG (9x)

TCGA (10x)

GGCT (11x)

TAGA (16x)

AGAG (9x)

GAGA (12x)

GACA (8x)

ACAA (5x)

GCTT (8x)

GCTC (2x)

CTTT (8x)

CTCT (1x)

TTTA (8x)

TCTA (2x)

TTAG (12x)

CTAG (2x)

AGAC (9x)

CGAG (8x)

CGAC (1x)

GAGG (16x)

GACG (1x)

AGGC (16x)

ACGC (1x)

A branching vertex is caused by either a repeat in the original sequence or a sequencing error

Sequencing errors are normally detected by a coverage cutoff threshold
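A coverage cutoff can be sketched on the k-mer counts listed above. The cutoff value of 3 is an assumption of this sketch (the slide does not state one); it is chosen so that the discarded k-mers are exactly the low-coverage ones that spell GCTCTAG and CGACGC. The function name is hypothetical.

```python
# k-mer coverages as listed on the slide (k = 4)
coverage = {
    "AGAT": 8, "ATCC": 7, "TCCG": 7, "CCGA": 7, "CGAT": 6, "GATG": 5,
    "ATGA": 8, "TGAG": 9, "GATC": 8, "AAGT": 3, "AGTC": 7, "GTCG": 9,
    "TCGA": 10, "GGCT": 11, "TAGA": 16, "AGAG": 9, "GAGA": 12, "GACA": 8,
    "ACAA": 5, "GCTT": 8, "GCTC": 2, "CTTT": 8, "CTCT": 1, "TTTA": 8,
    "TCTA": 2, "TTAG": 12, "CTAG": 2, "AGAC": 9, "CGAG": 8, "CGAC": 1,
    "GAGG": 16, "GACG": 1, "AGGC": 16, "ACGC": 1,
}


def split_by_coverage(coverage, cutoff):
    """Return (kept, removed): k-mers with coverage >= cutoff, and the rest."""
    kept = {kmer: c for kmer, c in coverage.items() if c >= cutoff}
    removed = sorted(kmer for kmer in coverage if kmer not in kept)
    return kept, removed


kept, removed = split_by_coverage(coverage, cutoff=3)  # cutoff is an assumed value
print(removed)
```

A low count only suggests an error; a branch with normal coverage on both sides is more likely to come from a repeat and is left in the graph.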


EXAMPLE AFTER CONDENSATION

AAGTCGA

TAGA

GCTTTAG

GCTCTAG

GAGACAA

CGAG

CGACGC

GAGGCT

AGATCCGATGAG

AGAG

Merge nodes with no branches


EXAMPLE AFTER ERROR REMOVAL

AAGTCGA

TAGA

GCTTTAG

GAGACAA

CGAG

GAGGCT

AGATCCGATGAG

AGAG


EXAMPLE AFTER RECONDENSATION

AAGTCGAG

GAGACAA

GAGGCTTTAGA

AGATCCGATGAG

AGAG

Any non-branching path in this graph corresponds to a contig in the original sequence.
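Condensation, error removal and recondensation can be sketched on a small fragment of the example. The edge list below is hypothetical: the slide's figure is not available here, so the edges were inferred for illustration from the overlaps between k-mers. The function name `condense` is also an assumption, and the sketch ignores isolated cycles.

```python
from collections import defaultdict


def condense(edges, nodes):
    """Merge chains of k-mer nodes joined by non-branching edges into sequences."""
    nodes = list(nodes)
    present = set(nodes)
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u in present and v in present:     # drop edges touching removed nodes
            succ[u].append(v)
            pred[v].append(u)

    def extends(u):
        """u has exactly one successor, and that successor has exactly one predecessor."""
        return len(succ[u]) == 1 and len(pred[succ[u][0]]) == 1

    contigs = []
    for node in nodes:
        if len(pred[node]) == 1 and extends(pred[node][0]):
            continue                          # node is swallowed by its predecessor
        seq, cur = node, node
        while extends(cur):
            cur = succ[cur][0]
            seq += cur[-1]                    # successive k-mers shift by one base
        contigs.append(seq)
    return sorted(contigs)


# Hypothetical fragment of the k = 4 graph
edges = [("AAGT", "AGTC"), ("AGTC", "GTCG"), ("GTCG", "TCGA"),
         ("TCGA", "CGAG"), ("TCGA", "CGAC"), ("CGAC", "GACG"), ("GACG", "ACGC")]
nodes = ["AAGT", "AGTC", "GTCG", "TCGA", "CGAG", "CGAC", "GACG", "ACGC"]

print(condense(edges, nodes))

low_coverage = {"CGAC", "GACG", "ACGC"}       # the 1x k-mers on the slide
print(condense(edges, [n for n in nodes if n not in low_coverage]))
```

Before error removal the branch at TCGA stops the merge, giving AAGTCGA, CGAG and CGACGC. Once the low-coverage branch is removed, TCGA has a single successor and the chain recondenses to AAGTCGAG, as on the slides.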


RESOLVING REPEATS USING PAIRED READS

Read 1

Read 2

Insert size: a design parameter

Genome


RESOLVING REPEATS

P. Pevzner and H. Tang, Bioinformatics (2001) Suppl1:S225-33

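Genome: … S1 REPEAT S2 ……………. S3 REPEAT S4 …

[Figure: in the graph, S1 and S3 lead into a single REPEAT node, and S2 and S4 lead out of it. A mate pair that is longer than the repeat length and matches the distance in the graph separates the two paths S1 REPEAT S2 and S3 REPEAT S4.]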


RESOLVING REPEATS: FAILURE

Mate pair transformation (Velvet, ABySS, EULER-SR)

• Find a unique path between mates in the graph.

• When multiple paths match the distance between mate pairs, mate pair transformation fails.

To resolve a repeat, the insert size must be larger than the repeat length and smaller than the length of potential conjugate paths (paths of the same length passing through the repeat).
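The uniqueness test behind the mate pair transformation can be sketched as a search for graph paths whose length matches the expected distance between the mates. Everything in this sketch is hypothetical: the graph, the node lengths, the tolerance, and the simplification that the distance is the summed length of the nodes strictly between the two mates.

```python
def matching_paths(graph, lengths, source, target, distance, tolerance=0):
    """Paths source -> target whose interior length is within tolerance of distance."""
    found = []

    def walk(node, path, gap):
        if gap > distance + tolerance:
            return                            # too long already; also bounds cycles
        for nxt in graph.get(node, []):
            if nxt == target:
                if abs(gap - distance) <= tolerance:
                    found.append(path + [nxt])
            else:
                walk(nxt, path + [nxt], gap + lengths[nxt])

    walk(source, [source], 0)
    return found


# Two repeats joined by two alternative paths, P1 and P2 (hypothetical layout)
graph = {
    "S1": ["REPEAT1"],
    "REPEAT1": ["P1", "P2"],
    "P1": ["REPEAT2"],
    "P2": ["REPEAT2"],
    "REPEAT2": ["S2"],
}

# Paths of different length: exactly one path fits, so the repeat is resolved.
distinct = {"REPEAT1": 20, "P1": 50, "P2": 200, "REPEAT2": 20}
print(matching_paths(graph, distinct, "S1", "S2", distance=90, tolerance=10))

# Paths of equal length: both fit, so the transformation fails.
conjugate = {"REPEAT1": 20, "P1": 50, "P2": 50, "REPEAT2": 20}
print(len(matching_paths(graph, conjugate, "S1", "S2", distance=90, tolerance=10)))
```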

[Figure: graph with nodes S1, S2, S3, S4, REPEAT1 and REPEAT2 and two paths, P1 and P2; the mate pair spans multiple paths.]


SOME ASSEMBLERS

• PHRAP
◦ Overlap O(n^2) → layout (no mate pairs) → consensus

• Celera
◦ Overlap → layout → consensus

• Arachne
◦ Overlap → layout → consensus

• Phusion
◦ Overlap → clustering → PHRAP → assemblage → consensus

• Euler
◦ Indexing → Euler graph → layout by picking paths → consensus


DE BRUIJN GRAPH BASED ASSEMBLERS

• ABySS (Simpson et al, 2009)

• Velvet (Zerbino and Birney, 2008)

• SOAPdenovo (Li et al, 2009)

• EULER-SR (Chaisson and Pevzner, 2008)

• SHARCGS (Dohm et al, 2007)

334 | VOL.9 NO.4 | APRIL 2012 | NATURE METHODS

TECHNOLOGY FEATURE

1. Fragment DNA and sequence

2. Find overlaps between reads

…AGCCTAGACCTACAGGATGCGCGACACGT GGATGCGCGACACGTCGCATATCCGGT…

3. Assemble overlaps into contigs

4. Assemble contigs into scaffolds

Genome assembly stitches together a genome from short sequenced pieces of DNA.

Michael Schatz, Cold Spring Harbor

… come from either end of DNA fragments that are too long to be sequenced straight through. Depending on the preparation technique, that distance can be as short as 200 base pairs or as large as several tens of kilobases. Knowing that paired reads were generated from the same piece of DNA can help link contigs into ‘scaffolds’, ordered assemblies of contigs with gaps in between. Paired-read data can also indicate the size of repetitive regions and how far apart contigs are.

Assessing quality is made more difficult because sequencing technology changes so quickly. In January of this year, Life Technologies launched new versions of its Ion Torrent machines, which can purportedly sequence a human genome in a day, for $1,000 in equipment and reagents. In February, Oxford Nanopore Technologies announced a technology that sequences tens of kilobases in continuous stretches, which would allow genome assembly with much more precision and drastically less computational work. Other companies, such as Pacific Biosciences, also have machines that produce long reads, and at least some researchers are already combining data types to glean the advantages of each.

Software engineers who write assembly programs know they need to adapt. “Every time the data changes, it’s a new problem,” says David Jaffe, who works on genome assembly methods at the Broad Institute in Cambridge, Massachusetts. “Assemblers are always trying to catch up to the data.” Of course, until a technology has been available for a while, it is hard to know how much researchers will use it. Cost, ease of use, error rates and reliability are hard to assess before a wider community gains more experience with new procedures. Luckily, ongoing efforts for evaluating short-read assemblies should make innovations easier to evaluate and incorporate.

Judging genomes

In the absence of a high-quality reference genome, new genome assemblies are often evaluated on the basis of the number of scaffolds and contigs required to represent the genome, the proportion of reads that can be assembled, the absolute length of contigs and scaffolds, and the length of contigs and scaffolds relative to the size of the genome.

The most commonly used metric is N50, the smallest scaffold or contig above which 50% of an assembly would be represented. But this metric may not accurately reflect the quality of an assembly. An early assembly of the sea squirt Ciona intestinalis had an N50 of 234 kilobases. A subsequent assembly extended the N50 more than tenfold, but an analysis by Korf and colleagues showed that this assembly lacked several conserved genes, perhaps because algorithms discarded repetitive sequences [2]. This is not an isolated example: the same analysis found that an assembly of the chicken genome lacks 36 genes that are conserved across yeast, plants and other organisms. But these genes seem to be missing from the assembly rather than the organism: focused re-analysis of the raw data found most of these genes in sequences that had not been included in the assembly.

Though the sea squirt and chicken genomes were assembled several years ago, such examples are still relevant because assembly is more difficult with the shorter reads used today, says Deanna Church, a staff scientist at the US National Institutes of Health who leads efforts to improve assemblies for the mouse and human genomes. “In my experience, people do not look at assemblies critically enough,” she says.

Assessing assemblers

Right now, when researchers describe a new assembler, they often run it on a new data set, making comparisons difficult. But a few projects are examining how different assemblers perform with the same data. The goal is to learn both how to assemble high-quality genomes and how to recognize a high-quality assembly. For the Genome Assembly Gold-standard Evaluations (GAGE), scientists led by Steven Salzberg at Johns Hopkins University School of Medicine assembled four genomes (three of which had been previously published) using eight of the most popular de novo assembly algorithms [3].

Two other efforts, the Assemblathon and dnGASP (de novo Genome Assembly Assessment Project), have taken the form of competitions. Teams generally consisted of the software designers for particular assemblers, who could adjust parameters as they thought best before submitting assembled genomes for evaluation. Performance was evaluated using simulated data from computer-designed genomes [4].

The point is not identifying the best overall assembler at a particular point in time, but finding ways to assess and improve assemblers in general, says Ivo Gut, director of the National Genome Analysis Center in Spain, who ran dnGASP. dnGASP compared assembly teams’ performance on a specially designed set of artificial chromosomes: three derived from the human genome, three from the chicken genome, and others representing the fruit fly, nematode, brewer’s yeast and two species of mustard plant. In addition, the contest organizers included special ‘challenge chromosomes’ that tested assembler performance on various repetitive structures, divergent alleles and other difficult content.

The data set for these calibration chromosomes should be freely available later this year. “You can run [the reference data set] through your assembler and post the results back on the server. And then you can optimize your results,” says Gut. Researchers can tune assembly parameters for their genome of interest and benchmark the performance of new versions of their assemblers, getting an early indication of an assembler’s performance with a modest investment of computational time, he explains. The …

Competitions for genome assembly bring developers together to exchange advice and ideas, says Ivo Gut.


N50 VALUE

• the N50 length is defined as the length for which the collection of all contigs of that length or longer contains at least half of the sum of the lengths of all contigs
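The definition above translates directly into code. A minimal sketch, with a hypothetical function name and made-up contig lengths:

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig")


contig_lengths = [80, 70, 50, 40, 30, 20]     # hypothetical contig lengths
print(n50(contig_lengths))
```

Here the total is 290, and the two longest contigs (80 + 70 = 150) already cover at least half of it, so the N50 is 70. Because N50 looks only at lengths, it says nothing about whether the contigs are correct or complete.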


EFFECT OF K

from plantagora.org


EFFECT OF INSERT SIZE

Repetitive DNA and next-generation sequencing: computational challenges and solutions, Todd J. Treangen & Steven L. Salzberg, Nature Reviews Genetics
