Genome Sequence determination 陳中庸 E-mail: [email protected]@cycu.edu.tw Web site: .
Complete Microbial Genomes
Genome what now?
Sequencing is…
- Determining the full nucleotide sequence of one strain of an organism
- Making predictions of genes within that sequence and predicting the function of those genes
- HARD!
Sequencing requires…
- Time, money, people, computers
Before sequencing…
- Nature of the organism, genetic code, genome size, genome structure
Sequencing means…
- Bioinformatics
- Functional assays
- More…
Genome what now?
Organism Selection
Library Creation
Sequencing
Assembly
Gap Closure
Finishing
Annotation
Which steps are computationally expensive?
Which steps have not already been exceptionally well studied?
Which step has not been subjected to a variety of approaches?
Organism Selection
Nature of an organism: Pathogen?
Genetic code
Genome size
Genome structure
Example: Vibrio vulnificus
Strain: YJ016
Genome Size: 5.2 Mb
Source: Southern Taiwan
Significance: Virulence
Strategy: Whole Genome Shotgun
Sequencing Coverage: 10X
Organism Selection
Nature of an organism: Pathogen?
Genetic code: Special Code?
Genome size
Genome structure
Genetic Code Tables: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
No. Genetic code (life forms)
1 Standard Code
2 Vertebrate Mitochondrial Code
3 Yeast Mitochondrial Code
4 Mold, Protozoan, & Mycoplasma/Spiroplasma Code
5 Invertebrate Mitochondrial Code
6 Ciliate, Dasycladacean and Hexamita Nuclear Code
9 Echinoderm and Flatworm Mitochondrial Code
10 Euplotid Nuclear Code
11 Bacterial and Plant Plastid Code
12 Alternative Yeast Nuclear Code
13 Ascidian Mitochondrial Code
14 Alternative Flatworm Mitochondrial Code
15 Blepharisma Nuclear Code
16 Chlorophycean Mitochondrial Code
21 Trematode Mitochondrial Code
22 Scenedesmus obliquus Mitochondrial Code
23 Thraustochytrium Mitochondrial Code
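Why the genetic code matters before sequencing can be shown with a small sketch. It builds the standard codon table and a variant in which TGA (UGA) encodes tryptophan instead of stop, which is the amino-acid difference between table 4 (Mold, Protozoan, Mycoplasma/Spiroplasma) and tables 1 and 11. The function and variable names are illustrative, start codons are ignored, and only this one reassignment is modelled.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def codon_table(overrides=None):
    """Return a codon -> amino acid dict, optionally with reassigned codons."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    table = dict(zip(codons, STANDARD_AAS))
    table.update(overrides or {})
    return table


# Table 11 uses the same amino-acid assignments as the standard code.
TABLE_11 = codon_table()
# Table 4 reads TGA as tryptophan (W) instead of stop.
TABLE_4 = codon_table({"TGA": "W"})


def translate(dna, table):
    """Translate in frame 1 until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


orf = "ATGTGAAAATAA"
print(translate(orf, TABLE_11))
print(translate(orf, TABLE_4))
```

Predicting genes in a Mycoplasma genome with the wrong table would truncate every ORF that contains an in-frame TGA.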
Organism Selection
Nature of an organism: Pathogen?
Genetic code: Special Code?
Genome size: How many Megabases?
Genome structure: Linear/Circular Chromosome? How many?
How to sequence a complete genome?
Sizes of bacterial genomes range from 0.6 Mb (Mycoplasma genitalium) to ~13 Mb (Myxobacteria).
• The read length of a DNA sequencing reaction is only ~600 bp (= 0.0006 Mb) ⇒ the genome obviously has to be subdivided.
• If the genome is subdivided into pieces of a size suitable for sequencing, then:
  • the individual sequences/fragments need to be put back into their "native" order;
  • the fragments must therefore overlap one another so that the pieces can be re-assembled.
⇒ There are two main sequencing strategies:
1. whole genome shotgun sequencing
2. ordered shotgun sequencing
c = coverage
A. Two ends are overlapped
B. Non-overlapped
C. Plasmid percentage in contigs
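The slide only names c. One common way to reason about coverage is the Lander-Waterman model, which is an assumption here and not something stated in the slides: c = N × L / G for N reads of length L on a genome of size G, with an expected uncovered fraction of about e^(-c) and roughly N × e^(-c) contigs. The sketch below applies it to a 5.2 Mb genome with ~600 bp reads; the function names are illustrative.

```python
import math


def coverage(n_reads, read_len, genome_size):
    """Average sequence coverage c = N * L / G."""
    return n_reads * read_len / genome_size


def fraction_uncovered(c):
    """Expected fraction of the genome hit by no read (Poisson model)."""
    return math.exp(-c)


def expected_contigs(n_reads, c):
    """Expected number of contigs, ignoring the minimum detectable overlap."""
    return n_reads * math.exp(-c)


GENOME_SIZE = 5_200_000  # bp, from the Vibrio vulnificus example
READ_LEN = 600           # bp, the approximate read length quoted above

for c in (1, 3, 5, 10):
    n_reads = math.ceil(c * GENOME_SIZE / READ_LEN)
    print(
        f"c={c:>2}: reads={n_reads:>6}, "
        f"uncovered={fraction_uncovered(c):.5f}, "
        f"contigs~{expected_contigs(n_reads, c):.0f}"
    )
```

Real projects end up with more gaps than this model predicts, because cloning bias and repeats violate its random-sampling assumption.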
Library Creation
1. Teamwork
2. Quality control (QC)
3. Timetable
4. Budget
5. Paper
Standard Operating Procedures of a Genome Project
[Flowchart. Stages: A. Decision; B. Library; C. Sequencing; D. Finish. Step labels, in order of appearance: mapping (PFG, FISH); PCR confirmation; DNA purification; shotgun library; picking; deciding the number of plates; plasmid DNA; printing labels; sequencing reactions (dye primers, dye terminators); gel running (377, 3700); assembly; annotation. The steps are covered by Protocols 1 to 12, with QC checkpoints between steps.]
Library (1): Random Shearing of Genomic DNA
1. Restriction enzyme:
   Sau3AI (GATC): affected by CG methylase
   MboI (GATC): affected by dam methylase, not affected by CG methylase
2. Sonication: sonication → Bal31 repair → T4 DNA polymerase → sizing → recovery → ligation
3. GeneMachine: easy sizing by filter
Library (2): Library clones & Sequencing clones
Chromosome I (3.3 Mb) and Chromosome II (1.8 Mb)
Shotgun libraries (each sequenced from both ends):
Library 1: 2.5-3.5 kb inserts, 7X coverage
Library 2: 5.5-7.5 kb inserts, 3X coverage
Library 3: 30 kb inserts (cosmid library), 10X clone coverage, 0.4X sequence coverage
Assemble the reads using the phred/phrap/consed software
Contig 1, Contig 2, Contig 3
Close the gaps by primer walking, PCR or re-sequencing
Annotation
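The difference between clone coverage and sequence coverage for Library 3 can be checked with a short calculation. This sketch assumes the 5.2 Mb genome size and the ~600 bp read length quoted elsewhere in the deck; the function names are illustrative.

```python
def clones_for_clone_coverage(genome_size, insert_size, clone_coverage):
    """Number of clones whose inserts together span the genome clone_coverage times."""
    return clone_coverage * genome_size / insert_size


def end_sequence_coverage(n_clones, read_len, genome_size, ends=2):
    """Sequence coverage obtained by reading only the ends of each insert."""
    return n_clones * ends * read_len / genome_size


GENOME_SIZE = 5_200_000  # bp
INSERT_SIZE = 30_000     # bp, cosmid inserts
READ_LEN = 600           # bp, assumed read length

n_clones = clones_for_clone_coverage(GENOME_SIZE, INSERT_SIZE, clone_coverage=10)
seq_cov = end_sequence_coverage(n_clones, READ_LEN, GENOME_SIZE)
print(f"cosmid clones needed: {n_clones:.0f}")
print(f"sequence coverage from both ends: {seq_cov:.2f}X")
```

Under these assumptions the result agrees with the 0.4X on the slide: the cosmid ends contribute little sequence, but their known spacing of about 30 kb helps to order and link contigs.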
Library (2): Library clones & Sequencing clones
Genome: 5,000,000 bp; 1000 bp per clone
5,000,000 / 1000 = 5000 clones ≈ 52 × 96-well plates
10× redundancy:
52 × 10 × 96-well plates = library clones
Both ends sequenced:
2 × 52 × 10 × 96-well plates ≈ 1000 plates = sequencing clones
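The plate arithmetic above can be written as a small helper, which makes it easy to redo for another genome size. This is a sketch with illustrative names; it rounds to whole plates the same way the slide does (5000 / 96 ≈ 52).

```python
def plate_counts(genome_size, bp_per_clone, redundancy, ends=2, wells=96):
    """Return (plates at 1x, library plates, sequencing plates)."""
    clones_1x = genome_size / bp_per_clone
    plates_1x = round(clones_1x / wells)       # 5000 / 96 = 52.08 -> 52
    library_plates = plates_1x * redundancy    # plates of library clones
    sequencing_plates = library_plates * ends  # one reaction per insert end
    return plates_1x, library_plates, sequencing_plates


plates_1x, library_plates, sequencing_plates = plate_counts(
    genome_size=5_000_000, bp_per_clone=1000, redundancy=10
)
print(plates_1x, library_plates, sequencing_plates)
```

The exact figure is 1040 sequencing plates, which the slide rounds to about 1000.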
Sequencing (1): Timetable
1. Throughput per instrument (one run = one 96-well plate):
   377: 2 runs per day
   3700: 6 runs per day (POP6) or 8 runs per day (POP5)
   3730: 12 runs per day
2. 377 × 2 sets = 4 runs per day; 3700 × 2 sets = 6 × 1 + 8 × 1 = 14 runs per day; total 18 runs per day
3. 1000 plates / 18 = 56 days = 11 weeks (about 3 months)
4. Today, 3730 × 4 sets = 48 runs per day; 1000 plates / 48 ≈ 21 days
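The timetable is a throughput calculation, sketched below with the runs-per-day figures from the slide. The dictionary keys and function name are illustrative, and partial days are rounded up.

```python
import math

# Runs per day per instrument, one run = one 96-well plate (figures from the slide).
RUNS_PER_DAY = {
    "377": 2,
    "3700 (POP6)": 6,
    "3700 (POP5)": 8,
    "3730": 12,
}


def days_needed(plates, instruments):
    """Days to run all plates given {instrument name: number of sets}."""
    runs_per_day = sum(RUNS_PER_DAY[name] * sets for name, sets in instruments.items())
    return math.ceil(plates / runs_per_day)


old_setup = {"377": 2, "3700 (POP6)": 1, "3700 (POP5)": 1}  # 18 runs per day
new_setup = {"3730": 4}                                     # 48 runs per day

print(days_needed(1000, old_setup))
print(days_needed(1000, new_setup))
```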
Sequencing (2): Cost
Library cost: 50,000 per genome

Item                   Cost per sample   Subtotal   With 10% fail rate
Plasmid purification   10.3              10.3       11.33
Sequencing             81.96             81.96      90.156
Maintenance            4.7               4.7        4.7
Total                                    96.96      106.186

Sequencing for 5 Mb: 500 × 2 × 96 × 106.186 = 10,193,856
Shotgun library: 50,000
Total: 10,243,856
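The cost table can be reproduced in a few lines. This is a sketch using the figures on the slide, where the 10% fail rate is applied to plasmid purification and sequencing but not to maintenance; the currency is not stated in the source, and the variable names are illustrative.

```python
FAIL_RATE = 0.10
LIBRARY_COST = 50_000

# (cost per sample, whether the fail rate applies)
ITEMS = {
    "plasmid purification": (10.3, True),
    "sequencing": (81.96, True),
    "maintenance": (4.7, False),
}


def cost_per_sample(items, fail_rate):
    """Per-sample cost after inflating the repeatable steps by the fail rate."""
    return sum(
        cost * (1 + fail_rate) if repeatable else cost
        for cost, repeatable in items.values()
    )


plates, ends, wells = 500, 2, 96
samples = plates * ends * wells  # 96,000 sequencing reactions

per_sample = cost_per_sample(ITEMS, FAIL_RATE)
sequencing_cost = samples * per_sample
total = sequencing_cost + LIBRARY_COST

print(f"per sample: {per_sample:.3f}")
print(f"sequencing: {sequencing_cost:,.0f}")
print(f"total:      {total:,.0f}")
```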
Hardware facilities
ABI 377
ABI 3700
MegaBACE 4000
ABI 3730XL
The automated production line for sample preparation at the Whitehead Institute, Center for Genome Research. The system consists of custom-designed factory-style conveyor belt robots that perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.
[Figure: Reads vs. Assembled Contigs. x-axis: assembled reads (0 to 100,000); y-axis: assembled contigs (0 to 1,500); the 5X coverage point is marked. Data labels: 166, 279, 243, 245, 328, 359.]
[Figure: Reads and Assembled Size. x-axis: assembled reads (0 to 100,000); y-axis: assembled size (Mbp, 3.5 to 5.3); the 5X coverage point is marked. Data labels: 5.17, 5.13, 5.12, 5.10, 5.08, 5.07.]
How does assembly software work?
What is Gap Closure?
What are gaps? Unsequenced regions located between the assembly-generated fragments of contiguous sequence (contigs).
What causes gaps? Host toxicity, secondary structure, ???
Back to "gap closure": producing, purifying, and sequencing, or locating, the missing regions of DNA.
How Can I Close Gaps?
Genome Walking: blind PCR extension of contigs
Multiplex PCR: combinatorial trial of every contig pair
Read Pair Analysis: use information stored by the assembler to suggest alignments, then PCR
Comparative Alignment
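To see why a combinatorial trial of every contig pair becomes expensive, the sketch below enumerates the candidate joins. It assumes each contig has two ends and that any end may neighbour any end of a different contig, which gives 2n(n-1) pairings for n contigs; the names are illustrative.

```python
from itertools import combinations


def candidate_end_pairs(contigs):
    """All pairs of contig ends that belong to different contigs."""
    ends = [(name, side) for name in contigs for side in ("left", "right")]
    return [(a, b) for a, b in combinations(ends, 2) if a[0] != b[0]]


for n in (3, 10, 100):
    contigs = [f"contig{i}" for i in range(1, n + 1)]
    print(n, len(candidate_end_pairs(contigs)))
```

Read pair analysis and comparative alignment both aim to reduce this list to the few pairings that are likely to be true neighbours before any PCR is run.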
Comparative Alignment (the Bioinformatics Approach)
- Find locations where contigs are homologous to known sequences
- Determine whether any contigs share homology in the same region of the same sequence
- Design primers
- Conduct PCR with those primers
- Sequence the product and use that sequence to close the gap
Blast Organism X (cross) Comparison
Compares contig ends to NCBI “nr” database with BlastN
Parses all hits and finds biologically possible contig pairs
Using the flanking sequence and Primer3, designs primers that will produce a PCR product spanning that gap
TTATGCTATCGAATTCCGACG GTCTGCAGGTCTTCCGACGTAG
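Primer3 does the actual primer design. As a rough sanity check on candidate primers such as the two shown on the slide, a minimal sketch can compute GC content, a Wallace-rule melting temperature (Tm = 2 × (A+T) + 4 × (G+C), a crude estimate intended for short oligos), and the reverse complement needed when a primer is taken from the opposite strand. This is not how Primer3 scores primers, and the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


for primer in ("TTATGCTATCGAATTCCGACG", "GTCTGCAGGTCTTCCGACGTAG"):
    print(primer, len(primer), f"{gc_percent(primer):.1f}%", wallace_tm(primer))
```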
Information to reduce gaps
1. The distance between the two end sequences of a clone
2. Cosmid anchors
3. Known genes
4. Comparison with other genomes
5. Good luck
Finishing Standards
1. General rules for finishing
   Phase 1: draft sequence assembled in contigs
   Phase 2: contigs ordered and linked
   Phase 3: assembled as one contig with a low error rate (0.01)
2. Strategies for finishing
   A. Primer walking
   B. Re-sequencing individual clones
   C. PCR and sequencing
   D. Screening new clones
   E. Subcloning
   F. Deletion and sequencing
   G. Changing the sequencing chemistry
   H. Restriction map
   I. End sequencing
Shotgun sequencing, by analogy: shredding several copies of Essential Cell Biology, then putting them back together via overlapping phrases
Really only good for small genomes; in 1995 it was used for the genome of Haemophilus influenzae
Problem: repetitive nucleotide sequences, which make up a large part of vertebrate genomes
(Analogy: phrases like "the human genome" and the difficulties they cause)
Repetitive sequences make correct assembly difficult
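The "overlapping phrases" idea can be sketched as a toy greedy assembler that repeatedly merges the two fragments with the longest suffix-prefix overlap. This is a teaching sketch under simplifying assumptions (error-free reads, one strand, no quality values) and is not how phred/phrap/consed work internally; the function names are illustrative.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by best overlap; returns the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing overlaps any more: what is left are contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
print(greedy_assemble(reads + ["TTTTCCCC"]))  # an unconnected read leaves a gap
```

A repeat longer than the reads gives the same overlap at several places, so a greedy merge like this can join the wrong neighbours, which is the difficulty noted above.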
Annotation: Multiple Genes

Description                                                        Gene name   Copy number
Methyl-accepting chemotaxis protein                                Tar         54
EAL domain                                                         Rtn         28
Acetyltransferases, including N-acetylases of ribosomal proteins   RimL        14
Permeases of the drug/metabolite transporter (DMT) superfamily     RhaT        20
Permeases of the major facilitator superfamily                     ProP        20
Timeline of large-scale genomic analyses. Shown are selected components of work on several non-vertebrate model organisms (red), the mouse (blue) and the human (green) from 1990; earlier projects are described in the text. SNPs, single nucleotide polymorphisms; ESTs, expressed sequence tags.
SCIENCE VOL. 277, p1453-1462, 1997
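[Timeline figure, 1998 to 2003. Event labels, in order of appearance:]
Set up genome center
NLBL mapped over 300 clones
VGH-YM team (榮陽團隊)
Ten million bases completed
Taiwan's first bacterial genome project: Vibrio vulnificus
Ganoderma (lingzhi) project
Second bacterial genome project: crucifer black rot bacterium
Third bacterial genome project: Mycoplasma
Fourth bacterial genome project: Klebsiella pneumoniae
Fifth bacterial genome project: 固甲浣菌
Academia Sinica rice genome
Food Industry Research and Development Institute Monascus genome
YMGRC/NHRI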
http://genome.nhri.org.tw/vv/
Vibrio vulnificus
Global features of the Vibrio vulnificus YJ016 genome

Feature                            Chr 1       Chr 2       Plasmid
Size (bp)                          3,340,475   1,857,025   48,508
Total number of sequencing reads   54,662      33,696      2,676
G+C percentage                     46.4        47.2        44.9
Total number of ORFs               3,147       1,625       62
Average ORF size (bp)              938         1,026       659
Percentage coding                  88.4%       89.7%       84.2%
Number of rRNA operons             9           1           0
Number of tRNAs                    87          12          0
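Two of the rows in this table can be cross-checked against the others. The sketch below estimates the coding percentage as ORF count × average ORF size / replicon size (ignoring overlapping ORFs) and the read coverage as reads × read length / size, assuming the ~600 bp read length quoted earlier in the deck. The dictionary layout and function names are illustrative.

```python
READ_LEN = 600  # bp, assumed from the approximate read length quoted earlier

REPLICONS = {
    "Chr 1": {"size": 3_340_475, "reads": 54_662, "orfs": 3_147, "avg_orf": 938},
    "Chr 2": {"size": 1_857_025, "reads": 33_696, "orfs": 1_625, "avg_orf": 1_026},
    "Plasmid": {"size": 48_508, "reads": 2_676, "orfs": 62, "avg_orf": 659},
}


def percent_coding(rep):
    """Approximate coding percentage, ignoring overlaps between ORFs."""
    return 100 * rep["orfs"] * rep["avg_orf"] / rep["size"]


def read_coverage(rep, read_len=READ_LEN):
    """Approximate sequence coverage contributed by the reads of one replicon."""
    return rep["reads"] * read_len / rep["size"]


for name, rep in REPLICONS.items():
    print(f"{name}: coding ~{percent_coding(rep):.1f}%, coverage ~{read_coverage(rep):.1f}X")

total_size = sum(rep["size"] for rep in REPLICONS.values())
total_reads = sum(rep["reads"] for rep in REPLICONS.values())
print(f"whole genome: ~{total_reads * READ_LEN / total_size:.1f}X")
```

Under these assumptions the estimates land close to the tabulated coding percentages and to the 10X coverage quoted for the project.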
[Figure: GC% of V. vulnificus Chromosome 1 & 2. Two panels: Chromosome 1 (x-axis: position, 0 to 3.5e+06 bp) and Chromosome 2 (x-axis: position, 0 to 2e+06 bp); y-axis: GC %, 25 to 55.]
[Figure: GC skew of V. vulnificus Chromosome 1 & 2. Two panels: Chromosome 1 (x-axis: position, 0 to 3.5e+06 bp; y-axis: GC skew, -0.25 to 0.25) and Chromosome 2 (x-axis: position, 0 to 2e+06 bp; y-axis: GC skew, -0.2 to 0.2).]
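Plots like these are usually produced from a sliding-window scan of the chromosome. A minimal sketch, assuming the common definitions GC% = 100 × (G+C) / window length and GC skew = (G - C) / (G + C); the window size used for the original figures is not given, so the values below are placeholders, and the function name is illustrative.

```python
def gc_windows(seq, window, step=None):
    """Return (start, GC percent, GC skew) for each full window of the sequence."""
    step = step or window
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_percent = 100 * (g + c) / window
        skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, gc_percent, skew))
    return results


# Toy sequence; for a real chromosome a window of several kb would be typical.
for start, gc_percent, skew in gc_windows("GGGCAATTGCCCATGC", window=4):
    print(start, f"{gc_percent:.0f}%", f"{skew:+.2f}")
```

In many bacterial chromosomes the sign of the GC skew switches near the replication origin and terminus, which is why the skew is plotted along the whole chromosome.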
[Figure: Comparison of the similarity between the V.v. and V.c. genomes (ACT/blastn at E-value = 1, Score > 500). Panel labels: VV1, VV2, VC1, VC2.]
[Figure: Circular presentation of the Vibrio vulnificus YJ016 genome. Chromosome 1 (3.3 Mb), Chromosome 2 (1.85 Mb), Plasmid pYJ016 (48.5 kb).]
Comparison of predicted genes of V. vulnificus YJ016, V. cholerae El Tor N16961, and E. coli K12

Category                                                        V.v.   Paralogous genes   Vulnificus-specific   V.c.   E. coli
Cellular Processes                                              783    257(67)            168(66)               631    839
  Cell envelope biogenesis and outer membrane                   156    51(17)             33(12)                124    207
  Cell motility and secretion                                   184    88(14)             11(7)                 153    159
  Cell division and chromosome partitioning                     34     12(5)              3(2)                  30     29
  Posttranslational modification, protein turnover, chaperones  108    31(16)             33(17)                91     117
  Inorganic ion transport and metabolism                        148    27(11)             58(20)                100    188
  Signal transduction mechanisms                                153    48(4)              30(8)                 133    139
Information Storage and Processing                              618    204(71)            58(20)                415    651
  DNA replication, recombination and repair                     209    69(27)             19(6)                 130    227
  Transcription                                                 220    69(20)             23(6)                 132    261
  Translation, ribosomal structure and biogenesis               189    66(24)             16(8)                 153    163
Metabolism                                                      941    379(145)           110(55)               724    1303
  Lipid metabolism                                              78     25(11)             21(8)                 58     86
  Nucleotide transport and metabolism                           75     28(13)             3(3)                  62     86
  Coenzyme metabolism                                           118    32(11)             8(3)                  108    125
  Carbohydrate transport and metabolism                         194    113(39)            17(6)                 103    347
  Amino acid transport and metabolism                           227    67(31)             32(21)                200    353
  Energy production and conversion                              173    68(32)             17(9)                 125    215
  Secondary metabolism biosynthesis, transport and catabolism   76     46(8)              12(5)                 68     91
Poorly Characterized                                            2543                                            972
  Function unknown                                              236    ND                 ND                    199    302
  General function prediction only                              269    ND                 ND                    193    335
  Hypothetical protein                                          2038   ND                 ND                    580    799
Some more technological approaches… (some of which really work!)
• Sequencing by hybridization (annealing)
• Sequencing by "ligase-edited" annealing
• Pyrosequencing
Note: there are also higher-tech versions of "classic" Sanger sequencing in the works (see http://www.helicosbio.com)
Several companies are pursuing massively parallel (= cheaper) new DNA sequencing strategies, including some that involve single-molecule analyses.
Some of the main players are given below:
454 Life Sciences (http://www.454.com/enabling-technology/the-system.asp)
Solexa (now part of Illumina) (http://www.illumina.com/pages.ilmn?ID=203)
Helicos BioSciences (http://www.helicosbio.com)
VisiGen Biotechnologies (http://www.visigenbio.com/technology.html)
Solexa sequencing technology