Genome Sequence Determination

陳中庸

E-mail: [email protected] (cycu.edu.tw)
Web site: www.cychen.idv.tw

Complete Microbial Genomes

Genome what now?

Sequencing is…
- Determining the full nucleotide sequence of one strain of an organism
- Predicting the genes within that sequence and the functions of those genes

HARD!!!! Sequencing requires…
- Time
- Money
- People
- Computers

Before sequencing…
- Nature of the organism
- Genetic code
- Genome size
- Genome structure

Sequencing means…
- Bioinformatics
- Functional assays
- More…

Genome what now?


Organism Selection

Library Creation

Sequencing

Assembly

Gap Closure

Finishing

Annotation


Which steps are computationally expensive?


Which steps have not already been exceptionally well studied?


Which step has not been subjected to a variety of approaches?


Organism Selection

Nature of an organism: Pathogen?

Genetic code

Genome size

Genome structure

Vibrio vulnificus

Strain: YJ016
Genome Size: 5.2 Mb
Source: Southern Taiwan
Significance: Virulence
Strategy: Whole Genome Shotgun
Sequencing Coverage: 10X

Organism Selection


Genetic code: Special Code?

Genome size

Genome structure

Genetic Code Tables
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c

Genetic code no. and life forms:

1 Standard Code

2 Vertebrate Mitochondrial Code

3 Yeast Mitochondrial Code

4 Mold, Protozoan, & Mycoplasma/Spiroplasma Code

5 Invertebrate Mitochondrial Code

6 Ciliate, Dasycladacean and Hexamita Nuclear Code

9 Echinoderm and Flatworm Mitochondrial Code

10 Euplotid Nuclear Code

11 Bacterial and Plant Plastid Code

12 Alternative Yeast Nuclear Code

13 Ascidian Mitochondrial Code

14 Alternative Flatworm Mitochondrial Code

15 Blepharisma Nuclear Code

16 Chlorophycean Mitochondrial Code

21 Trematode Mitochondrial Code

22 Scenedesmus obliquus Mitochondrial Code

23 Thraustochytrium Mitochondrial Code
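Why the choice of table matters can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the two 64-letter strings follow the NCBI compact notation for tables 1 and 4 (codons ordered T, C, A, G at each position), while the function names and the example sequence are made up for this sketch. Table 11 assigns the same amino acids as table 1 and differs in the start codons it allows.

```python
from itertools import product

BASES = "TCAG"

# Amino acids for the 64 codons in NCBI order (TTT, TTC, TTA, TTG, TCT, ...).
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # Standard
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # Mold/Protozoan/Mycoplasma
}


def codon_map(table_id):
    """Build a codon -> amino acid dictionary for one genetic code table."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, TABLES[table_id]))


def translate(dna, table_id=1):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    table = codon_map(table_id)
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


seq = "ATGTGATAA"  # made-up example: ATG TGA TAA
print(translate(seq, 1))  # M**  (TGA is a stop in the standard code)
print(translate(seq, 4))  # MW*  (TGA encodes Trp in table 4)
```

Gene prediction with the wrong table would cut every TGA-containing Mycoplasma gene short, which is why the genetic code is checked before sequencing starts.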


Organism Selection



Genome size: How many Megabases?

Genome structure: Linear/Circular Chromosome? How many?

How to sequence a complete genome?

Bacterial genome sizes range from 0.6 Mb (Mycoplasma genitalium) to ~13 Mb (Myxobacteria).

• The read length of a DNA sequencing reaction is only ~600 bp (= 0.0006 Mb) ⇒ the genome obviously has to be subdivided
• If the genome is subdivided into pieces small enough to sequence, then
  • the individual sequences/fragments have to be put back into their "native" order
  • therefore, the fragments must overlap one another so that the pieces can be re-assembled

⇒ There are two main sequencing strategies:
1. whole genome shotgun sequencing
2. ordered shotgun sequencing

c = coverage

A. Two ends are overlapped
B. Non-overlapped
C. Plasmid percentage in contigs
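A worked example of the coverage arithmetic, as a rough sketch. The slide only names c as coverage; the formula c = N × L / G (N reads of length L over a genome of size G) and the Lander-Waterman estimate that a fraction e^(-c) of the genome remains unsequenced are standard results brought in here as assumptions, and the function names are illustrative.

```python
import math


def coverage(n_reads, read_len_bp, genome_bp):
    """Average coverage c = N * L / G."""
    return n_reads * read_len_bp / genome_bp


def reads_needed(target_coverage, read_len_bp, genome_bp):
    """Number of reads required to reach a target coverage (rounded up)."""
    return math.ceil(target_coverage * genome_bp / read_len_bp)


def fraction_unsequenced(c):
    """Lander-Waterman estimate (assumed model): probability a base is covered by no read."""
    return math.exp(-c)


genome = 5_200_000  # bp, the genome size quoted for V. vulnificus YJ016
read_len = 600      # bp, the read length quoted on the slide

for c in (1, 5, 10):
    n = reads_needed(c, read_len, genome)
    missing = fraction_unsequenced(c) * genome
    print(f"{c:>2}X: {n} reads, about {missing:,.0f} bp expected to stay uncovered")
```

Under this idealised model the uncovered fraction falls off exponentially with coverage; real projects still end up with gaps at high coverage because cloning and sequencing are not perfectly random.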

Library Creation

1. Teamwork
2. Quality control (QC)
3. Time table
4. Budget
5. Paper

Standard Operating Procedures of a Genome Project

[Flowchart: four stages (A. Decision, B. Library, C. Sequencing, D. Finish), with Protocols 1 to 12 attached to individual steps and QC checkpoints between steps]

Steps shown in the flowchart:
- Mapping (PFG, FISH)
- PCR confirmation
- DNA purification
- Shotgun library
- Picking (decide the number of plates)
- Plasmid DNA
- Print labels
- Sequencing reactions (dye primers, dye terminators)
- Gel running (377, 3700)
- Assembly
- Annotation

Library (1): Random Shearing of Genomic DNA

1. Restriction enzymes:
   Sau3AI (GATC): affected by CG methylase
   MboI (GATC): affected by dam methylase, not affected by CG methylase

2. Sonication: sonication → Bal31 repair → T4 DNA polymerase → sizing → recovery → ligation

3. GeneMachine: easy sizing by filter

Library (2): Library clones & Sequencing clones

Chromosome I (3.3 Mb), Chromosome II (1.8 Mb)

Shotgun library
- Library 1: 2.5-3.5 kb inserts, 7X coverage, sequenced from both ends
- Library 2: 5.5-7.5 kb inserts, 3X coverage, sequenced from both ends
- Library 3: 30 kb inserts (cosmid library), 10X clone coverage, 0.4X sequence coverage, sequenced from both ends

Assemble the reads with the phred/phrap/consed software

Contig 1, Contig 2, Contig 3

Close the gaps by primer walking, PCR or re-sequencing

Annotation

Library (2): Library clones & Sequencing clones

Genome: 5,000,000 bp; 1000 bp per clone

5,000,000 / 1000 = 5000 clones ≈ 52 × 96-well plates

10X redundancy

52 × 10 = 520 × 96-well plates of library clones

Sequencing both ends

2 × 52 × 10 = 1040 ≈ 1000 × 96-well plates of sequencing clones
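The plate count above can be reproduced with a short calculation. This is a sketch of the slide's arithmetic only; the function and parameter names are invented for illustration, and rounding 52.08 plates to 52 follows the slide rather than any laboratory rule.

```python
WELLS_PER_PLATE = 96


def plates_needed(genome_bp, bp_per_clone, redundancy, ends_per_clone=2):
    """Return (clones at 1X, plates at 1X, library plates, sequencing plates)."""
    clones_1x = genome_bp // bp_per_clone
    plates_1x = round(clones_1x / WELLS_PER_PLATE)    # 52.08 -> 52, as on the slide
    library_plates = plates_1x * redundancy           # plates of picked clones
    sequencing_plates = library_plates * ends_per_clone  # one reaction plate per clone end
    return clones_1x, plates_1x, library_plates, sequencing_plates


clones, plates, library, sequencing = plates_needed(5_000_000, 1000, redundancy=10)
print(clones, plates, library, sequencing)  # 5000 52 520 1040
```

The exact figure is 1040 sequencing plates, which the slide rounds to about 1000.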

Sequencing (1): Time table

1. Runs per day per instrument (one run = one 96-well plate):
   377: 2 runs per day
   3700: 6 runs per day (POP6) or 8 runs per day (POP5)
   3730: 12 runs per day

2. 377 × 2 sets = 4 runs per day; 3700 × 2 sets = 6 × 1 + 8 × 1 = 14 runs per day; total 18 runs per day

3. 1000 plates / 18 ≈ 56 days ≈ 11 weeks of 5 working days (about 3 months)

4. Today, 3730 × 4 sets = 48 runs per day; 1000 plates / 48 ≈ 21 days
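The same schedule as a small calculation, for readers who want to try other instrument mixes. The runs-per-day figures are the ones quoted on the slide; the dictionary keys and function name are illustrative, and days are rounded up, so the 3730 case comes out at 21 days (1000 / 48 ≈ 20.8).

```python
import math

# Runs per day per instrument, one run = one 96-well plate (figures from the slide).
RUNS_PER_DAY = {
    "377": 2,
    "3700 (POP6)": 6,
    "3700 (POP5)": 8,
    "3730": 12,
}


def days_needed(plates, machines):
    """machines maps an instrument name to the number of sets available."""
    runs = sum(RUNS_PER_DAY[name] * count for name, count in machines.items())
    return runs, math.ceil(plates / runs)


old_setup = {"377": 2, "3700 (POP6)": 1, "3700 (POP5)": 1}
new_setup = {"3730": 4}

print(days_needed(1000, old_setup))  # (18, 56)
print(days_needed(1000, new_setup))  # (48, 21)
```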

Sequencing (2): Cost

Library cost: 50,000 per genome

Item                  Cost per sample   With 10% fail rate
Plasmid purification  10.3              11.33
Sequencing            81.96             90.156
Maintenance           4.7               4.7
Total                 96.96             106.186

Sequencing for 5 Mb = 500 × 2 × 96 × 106.186 = 10,193,856
Shotgun library = 50,000
Total = 10,243,856

Hardware facilities

ABI 377
ABI 3700
MegaBACE 4000
ABI 3730XL

The automated production line for sample preparation at the Whitehead Institute, Center for Genome Research. The system consists of custom-designed factory-style conveyor belt robots that perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.

[Figure: Reads vs. Assembled Contigs. x-axis: assembled reads (0 to 100,000); y-axis: assembled contigs (0 to 1500); the 5X coverage point is marked; data labels: 166, 279, 243, 245, 328, 359]

[Figure: Reads and Assembled Size. x-axis: assembled reads (0 to 100,000); y-axis: assembled size (Mbp, 3.5 to 5.3); the 5X coverage point is marked; data labels: 5.17, 5.13, 5.12, 5.10, 5.08, 5.07]

How does assembly software work?

What is Gap Closure?

What are gaps?
- Unsequenced regions located between the assembly-generated fragments of contiguous sequence (contigs)

What causes gaps?
- Host toxicity, secondary structure, ???

Back to “gap closure”
- Producing, purifying, and sequencing, or locating, the missing regions of DNA

How Can I Close Gaps?

Genome Walking
- Blind PCR extension of contigs

Multiplex PCR
- Combinatorial trial of every contig pair

Read Pair Analysis
- Use information stored by the assembler to suggest alignments, then PCR

Comparative Alignment
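To see why the combinatorial trial of every contig pair gets expensive, the candidate pairings can simply be counted. The sketch below rests on one assumption not stated on the slide: each contig has two ends, and any end may neighbour any end of a different contig, which gives 2n(n - 1) candidate end pairs for n contigs. The function name is illustrative.

```python
from itertools import combinations


def candidate_end_pairs(n_contigs):
    """Enumerate all pairs of contig ends that belong to different contigs."""
    ends = [(contig, side) for contig in range(n_contigs) for side in ("left", "right")]
    return [(a, b) for a, b in combinations(ends, 2) if a[0] != b[0]]


for n in (3, 10, 100):
    pairs = candidate_end_pairs(n)
    assert len(pairs) == 2 * n * (n - 1)  # closed form agrees with the enumeration
    print(f"{n} contigs -> {len(pairs)} candidate end pairs")
```

The quadratic growth is the motivation for the other approaches on the list, which use extra information to propose only a few likely pairs before any PCR is run.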

Comparative Alignment (the Bioinformatics Approach)

- Find locations where contigs are homologous to known sequences
- Determine whether any contigs share homology in the same region of the same sequence
- Design primers
- Conduct PCR with those primers
- Sequence the product and use that sequence to close the gap

Blast Organism X (cross) Comparison

- Compares contig ends to the NCBI “nr” database with BlastN
- Parses all hits and finds biologically possible contig pairs
- Using the flanking sequence and Primer3, designs primers that will produce a PCR product spanning that gap

TTATGCTATCGAATTCCGACG GTCTGCAGGTCTTCCGACGTAG
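The geometry of a gap-spanning primer pair can be shown with a deliberately naive sketch. It is not Primer3 and makes no attempt at melting temperature or specificity checks: the forward primer is simply the last bases of the left contig, and the reverse primer is the reverse complement of the first bases of the right contig. All function names and the toy contigs are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of G and C bases, a common first check on a primer."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def naive_gap_primers(left_contig, right_contig, length=20):
    """Pick primers that face each other across the gap between two contigs."""
    forward = left_contig[-length:].upper()               # reads rightwards into the gap
    reverse = reverse_complement(right_contig[:length])   # reads leftwards into the gap
    return forward, reverse


# Toy contigs, made up for illustration.
fwd, rev = naive_gap_primers("AAAACCCC", "GGGGTTTT", length=4)
print(fwd, rev)  # CCCC CCCC

# GC content of the first sequence shown on the slide.
print(round(gc_fraction("TTATGCTATCGAATTCCGACG"), 3))  # 0.429
```

A real design tool additionally screens candidates for melting temperature, self-complementarity and uniqueness in the genome, which is the work delegated to Primer3 in the tool described above.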

Information to reduce gaps

1. The distance between the two end sequences of a clone
2. Cosmid anchors
3. Known genes
4. Comparison with other genomes
5. Good luck
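The first item, using the two end sequences of the same clone, can be illustrated with a small sketch. It assumes the assembler can report which contig each end read was placed in; the data structure, clone names and function name are all hypothetical. A clone whose two ends land in different contigs suggests that those contigs are neighbours and that the clone itself spans the gap.

```python
from collections import Counter


def contig_links(end_placements):
    """Count clones whose two end reads fall in different contigs.

    end_placements maps clone id -> (contig of one end, contig of the other end).
    """
    links = Counter()
    for clone_id, (contig_a, contig_b) in end_placements.items():
        if contig_a != contig_b:
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


placements = {
    "clone001": ("contig1", "contig1"),  # both ends inside one contig: no link
    "clone002": ("contig1", "contig2"),
    "clone003": ("contig2", "contig1"),
    "clone004": ("contig2", "contig3"),
}

for pair, n_clones in contig_links(placements).most_common():
    print(pair, n_clones)
```

Pairs supported by several independent clones are the most promising candidates for PCR, and a linking clone can also serve directly as a template for primer walking across the gap.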

Finishing Standards

1. General rules for finishing
   Phase 1: draft sequence assembled in contigs
   Phase 2: contigs ordered and linked
   Phase 3: assembled as one contig with a low error rate (0.01)

2. Strategy of finishing
   A. Primer walking
   B. Re-sequencing individual clones
   C. PCR and sequencing
   D. Screening new clones
   E. Subcloning
   F. Deletion and sequencing
   G. Changing the sequencing chemistry
   H. Restriction map
   I. End sequencing

Shotgun sequencing

- Analogy: shredding several copies of Essential Cell Biology, then putting the text back together via overlapping phrases
- Really only good for small genomes; used in 1995 for the genome of Haemophilus influenzae
- Problem: repetitive nucleotide sequences, which make up a large part of vertebrate genomes (analogy: phrases like “the human genome” and the difficulties they cause)

[Figure: Repetitive sequences make correct assembly difficult]
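The "overlapping phrases" idea can be made concrete with a toy greedy assembler. This is a teaching sketch under simplifying assumptions (error-free reads, a single strand, exact suffix-prefix overlaps) and is not how phrap or any production assembler is implemented; the function names and reads are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more: the remaining pieces are separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

When the same stretch of sequence occurs in several places and is longer than a read, several different merges look equally good and a greedy choice can join the wrong neighbours, which is the difficulty with repeats described above.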

Annotation: Multiple Genes

Gene name  Copy number  Description
Tar        54           Methyl-accepting chemotaxis protein
Rtn        28           EAL domain
RimL       14           Acetyltransferases, including N-acetylases of ribosomal proteins
RhaT       20           Permeases of the drug/metabolite transporter (DMT) superfamily
ProP       20           Permeases of the major facilitator superfamily

Timeline of large-scale genomic analyses. Shown are selected components of work on several non-vertebrate model organisms (red), the mouse (blue) and the human (green) from 1990; earlier projects are described in the text. SNPs, single nucleotide polymorphisms; ESTs, expressed sequence tags.

SCIENCE VOL. 277, p1453-1462, 1997

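[Timeline figure, 1998 to 2003]

Events shown:
- Genome center set up
- NLBL mapped, over 300 clones
- Veterans General Hospital and Yang-Ming team (榮陽團隊)
- Ten million bases completed
- Taiwan's first bacterial genome project: Vibrio vulnificus
- Lingzhi (Ganoderma) project
- Second bacterial genome project: crucifer black rot bacterium (十字花科黑死菌)
- Third bacterial genome project: Mycoplasma
- Fourth bacterial genome project: Klebsiella pneumoniae
- Fifth bacterial genome project: 固甲浣菌
- Academia Sinica rice genome
- Food Industry Research and Development Institute (食科所) Monascus genome

YMGRC/NHRI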


http://genome.nhri.org.tw/vv/

Global features of the Vibrio vulnificus YJ016 genome

Feature                           Chr 1     Chr 2     Plasmid
Size (bp)                         3340475   1857025   48508
Total number of sequencing reads  54662     33696     2676
G+C percentage                    46.4      47.2      44.9
Total number of ORFs              3147      1625      62
Average ORF size (bp)             938       1026      659
Percentage coding                 88.4%     89.7%     84.2%
Number of rRNA operons            9         1         0
Number of tRNAs                   87        12        0
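The "percentage coding" row can be cross-checked from the other rows, since it is approximately the number of ORFs times the average ORF size divided by the replicon size. The sketch below uses only the figures from the table; the dictionary layout and function name are illustrative, and small differences are expected because the average ORF size is a rounded value.

```python
# Figures taken from the table of global features of V. vulnificus YJ016.
REPLICONS = {
    "Chr 1":   {"size_bp": 3_340_475, "orfs": 3147, "avg_orf_bp": 938,  "reported": 88.4},
    "Chr 2":   {"size_bp": 1_857_025, "orfs": 1625, "avg_orf_bp": 1026, "reported": 89.7},
    "Plasmid": {"size_bp": 48_508,    "orfs": 62,   "avg_orf_bp": 659,  "reported": 84.2},
}


def percent_coding(size_bp, orfs, avg_orf_bp):
    """Approximate coding percentage from ORF count and mean ORF length."""
    return 100 * orfs * avg_orf_bp / size_bp


for name, r in REPLICONS.items():
    estimate = percent_coding(r["size_bp"], r["orfs"], r["avg_orf_bp"])
    print(f"{name}: estimated {estimate:.1f}%, reported {r['reported']}%")
```

The estimates agree with the reported values to within about 0.1 percentage points, so the table is internally consistent.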

[Figure: GC% of V. vulnificus Chromosome 1 & 2. Two panels (Chromosome 1, Chromosome 2); x-axis ticks: 0 to 3.5e+06 and 0 to 2e+06; y-axis: GC %, 25 to 55]

[Figure: GC skew of V. vulnificus Chromosome 1 & 2. Two panels (Chromosome 1, Chromosome 2); x-axis ticks: 0 to 3.5e+06 and 0 to 2e+06; y-axis: GC skew, -0.25 to 0.25 and -0.2 to 0.2]
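Plots of GC% and GC skew along a chromosome are usually computed in sliding windows. The sketch below assumes the common definition GC skew = (G - C) / (G + C); the slides do not state the formula or the window size used, so both are assumptions here, and the toy sequence is made up.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_skew(seq):
    """GC skew (G - C) / (G + C); 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def sliding_windows(seq, window, step):
    """Yield (start position, window sequence) pairs along the sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


toy = "GGGCATGGCCATATCCCGATCCGC"  # made-up sequence, far shorter than a real window
for start, win in sliding_windows(toy, window=8, step=8):
    print(start, f"GC%={gc_percent(win):.1f}", f"skew={gc_skew(win):+.2f}")
```

In many bacterial chromosomes the sign of the skew changes near the replication origin and terminus, which is why this plot is commonly shown alongside GC content.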

[Figure: Comparison of the similarity between the V. vulnificus (VV1, VV2) and V. cholerae (VC1, VC2) chromosomes; ACT/blastn at E-value = 1, score > 500]

Circular presentation of the Vibrio vulnificus YJ016 genome

Chromosome 1: 3.3 Mb
Chromosome 2: 1.85 Mb
Plasmid pYJ016: 48.5 kb

Comparison of predicted genes of V. vulnificus YJ016 (V.v.), V. cholerae El Tor N16961 (V.c.), and E. coli K12

V.v.  Paralogous  Vulnificus-specific  V.c.  E. coli  Category
783   257(67)     168(66)              631   839      Cellular Processes
156   51(17)      33(12)               124   207        Cell envelope biogenesis and outer membrane
184   88(14)      11(7)                153   159        Cell motility and secretion
34    12(5)       3(2)                 30    29         Cell division and chromosome partitioning
108   31(16)      33(17)               91    117        Posttranslational modification, protein turnover, chaperones
148   27(11)      58(20)               100   188        Inorganic ion transport and metabolism
153   48(4)       30(8)                133   139        Signal transduction mechanisms
618   204(71)     58(20)               415   651      Information Storage and Processing
209   69(27)      19(6)                130   227        DNA replication, recombination and repair
220   69(20)      23(6)                132   261        Transcription
189   66(24)      16(8)                153   163        Translation, ribosomal structure and biogenesis
941   379(145)    110(55)              724   1303     Metabolism
78    25(11)      21(8)                58    86         Lipid metabolism
75    28(13)      3(3)                 62    86         Nucleotide transport and metabolism
118   32(11)      8(3)                 108   125        Coenzyme metabolism
194   113(39)     17(6)                103   347        Carbohydrate transport and metabolism
227   67(31)      32(21)               200   353        Amino acid transport and metabolism
173   68(32)      17(9)                125   215        Energy production and conversion
76    46(8)       12(5)                68    91         Secondary metabolites biosynthesis, transport and catabolism
2543                                   972            Poorly Characterized
236   ND          ND                   199   302        Function unknown
269   ND          ND                   193   335        General function prediction only
2038  ND          ND                   580   799        Hypothetical protein

ND: not determined. For "Poorly Characterized" only two totals are given; they match the sums of the three rows below it for V.v. (2543) and V.c. (972).

Some more technological approaches… (some of which really work!)

• Sequencing by hybridization (annealing)
• Sequencing by “ligase-edited” annealing
• Pyrosequencing

Note: there are also higher-tech versions of “classic” Sanger sequencing in the works (see http://www.helicosbio.com)

Several companies are pursuing massively parallel (= cheaper) new DNA sequencing strategies, including some that involve single-molecule analyses.

Some of the main players are given below:
- 454 Life Sciences (http://www.454.com/enabling-technology/the-system.asp)
- Solexa (now part of Illumina) (http://www.illumina.com/pages.ilmn?ID=203)
- Helicos BioSciences (http://www.helicosbio.com)
- VisiGen Biotechnologies (http://www.visigenbio.com/technology.html)

Solexa sequencing technology
