Repeated (Conserved) Patterns in Bioinformatics (2010/04/09)
-
Repeated (Conserved) Patterns in Bioinformatics
Francis Y.L. Chin {钱玉麟教授}
Taikoo Chair of Engineering
Chair Professor of Computer Science
Associate Dean of Engineering
University of Hong Kong
March, 2010
-
Bioinformatics
Using laboratory experiments to understand biological processes is difficult, laborious, expensive and time-consuming.
Nowadays, large volumes of biological data are available.
Bioinformatics aims to exploit this data to understand biological processes through computational approaches.
-
What are Repeated Patterns?
Repeated patterns are similar patterns that appear many times, e.g., common or conserved patterns.
Repeated patterns can be measured by the probability of their occurrence in a random environment (p-value).
A low p-value means “information or signal bearing” and not “artifact”.
A high p-value implies “inconclusive”.
-
Why Repeated Patterns?
Finding repeated patterns is important in bioinformatics research
Sequence analysis
Analysis of mutations
Comparative genomics
Evolutionary biology
Biodiversity
Protein-protein interaction
-
Repeated (Conserved) Patterns in DNA Sequences
-
The Central Dogma
[Figure: a gene on the DNA produces a protein. Source: http://research.microsoft.com/uai2004/Slides/FriedmanUAI2004-part-II.ppt]
-
Binding Site
[Figure: a transcription factor bound to a binding site on the DNA next to a gene; the gene produces a protein (Tyr, Leu, ...).]
Identifying the TF binding sites on a DNA sequence is an important problem.
-
Binding Sites Identification
• Find genes associated with the same TF.
• Search for short similar patterns (motifs), i.e., binding sites, in the DNA sequences.
[Figure: the TF GCN4 binds the DNA of the genes HIS7, TRP4 and ADE4. Layout of each sequence: promoter regions | gene.]
HIS7: AAAATTGAGTCATATC…GAGAATGCCGGTCGTTCACGTG…
TRP4: AGTTATGACTAATATT…TATCATGTCCGAGGCGACTTTG…
ADE4: CCGAATGACTGCTCAT…AAAAATGTGTGGTATTTTAGGTA…
-
Example
>SPR3
……CTGGTCGTAATACAAATAGAAGAGGTAAACCAATCAATGGCCC
GTTAGTTTGCCATTTGCTGCATCCTTCCCATGCAAAGTGTCTT……
>COX6
……ACAGAAAATTCCAATCAAAAAGTTGGTGTTAGGCTATACTGAT
GGCCGTATCGCTCCATACGAGCCAATCAGGGCCCCGCGCGTTA……
>QCR8
……CCACGTGACTAGTCCAAGGATTTTTTTTAAGCCAATTAAAATG
AAGAAATGCGTGATCGGAAATTACGGGTAGTACGAGAAGGAAA……
>CYC1
……GGGCTTGATCCACCAACCAACGCTCGCCAAATGAACTGGCGCT
TTGGTCTTCTGCCATCGTCCGTAAACCCCTTCCAAAGAGACCG……
Hypothesis: The binding sites are short, similar string patterns in each sequence.
-
Example
>SPR3
……CTGGTCGTAATACAAATAGAAGAGGTAAACCAATCAATGGCCC
GTTAGTTTGCCATTTGCTGCATCCTTCCCATGCAAAGTGTCTT……
>COX6
……ACAGAAAATTCCAATCAAAAAGTTGGTGTTAGGCTATACTGAT
GGCCGTATCGCTCCATACGAGCCAATCAGGGCCCCGCGCGTTA……
>QCR8
……CCACGTGACTAGTCCAAGGATTTTTTTTAAGCCAATTAAAATG
AAGAAATGCGTGATCGGAAATTACGGGTAGTACGAGAAGGAAA……
>CYC1 (reverse)
……GGGCTTGATCCACCAACCAACGCTCGCCAAATGAACTGGCGCT
TTGGTCTTCTGCCATCGTCCGTAAACCCCTTCCAAAGAGACCG……
All the highlighted strings are binding sites, and they are similar to a common pattern, CCAATCA, called the motif.
-
String Motif
C. elegans binding sites:
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
Consensus string (motif): GTTACCATGGTAAC
The motif represents the common pattern of the binding sites.
The binding sites are variants of the motif.
The consensus motif gives the minimum total number of errors (finding the motif with the minimum maximum error is NP-complete) (Li et al., JCSS 2002).
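The consensus string can be reproduced with a column-wise majority vote over the five binding sites. The following is a minimal sketch, not the method of the cited paper; the names `consensus` and `hamming` are illustrative, and ties (position 4 has two A and two T) are broken by alphabet order, which is an assumption.

```python
from collections import Counter

# The five 14-bp binding sites listed on this slide.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def consensus(sites, alphabet="ACGT"):
    """Column-wise majority vote; ties go to the earlier letter in `alphabet`."""
    result = []
    for column in zip(*sites):
        counts = Counter(column)
        result.append(max(alphabet, key=lambda base: counts[base]))
    return "".join(result)

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

motif = consensus(SITES)
total_errors = sum(hamming(motif, site) for site in SITES)
print(motif, total_errors)
```

A majority vote minimizes the total number of mismatches column by column, which is why the consensus is easy to compute while the minimum-maximum-error variant is not.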
-
Planted (l,d)-Motif Problem (PMP) (Pevzner and Sze, ISMB 2000)
l = length of the motif M
d = Hamming distance
Input: T = t length-n sequences, each with at least one binding site.
Problem: Find M and the binding sites (substrings within Hamming distance d from M).
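For very small l the problem can be solved by exhaustive search over all 4^l candidate strings. The sketch below is only meant to make the problem statement concrete; the function names and the toy input are hypothetical, and the approach does not scale to realistic motif lengths.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurrences(seq, motif, d):
    """Start positions of substrings of `seq` within Hamming distance d of `motif`."""
    l = len(motif)
    return [i for i in range(len(seq) - l + 1)
            if hamming(seq[i:i + l], motif) <= d]

def planted_motifs(seqs, l, d):
    """Test all 4^l candidate strings; keep those with a site in every sequence."""
    found = {}
    for candidate in product("ACGT", repeat=l):
        motif = "".join(candidate)
        sites = [occurrences(s, motif, d) for s in seqs]
        if all(sites):  # every sequence holds at least one binding site
            found[motif] = sites
    return found

toy = ["AACGTA", "TACGTT"]  # hypothetical input, not from the slides
print(planted_motifs(toy, l=4, d=0))
```

With d = 0 the toy input has a single solution; allowing d > 0 typically returns several candidate motifs, which is the ambiguity the next slide discusses.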
-
Can Motif Always be Found?
Many methods exist to find the motif: EM, Gibbs sampling, exhaustive search, maximal clique, ...
The motif can never be found when:
Too few sequences/binding sites (t and n are too small)
Binding sites are “too” short (l is too small)
Binding sites vary too much (d is too large)
This is because the p-value of those similar patterns is too high, i.e., more than one possible solution will result.
Motif finding is successful only when the “similar” patterns are “many” and “long” (low p-values), i.e., when the probability that such patterns exist at random is very low.
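One rough way to quantify "exists at random" is to estimate how many length-l strings would have an occurrence within distance d in every one of t random sequences. The sketch below uses a common approximation that treats positions as independent and bases as uniform; these are assumptions, not statements from the slides, and the parameter values (t = 20, n = 600) are only illustrative.

```python
from math import comb

def neighbour_prob(l, d):
    """P(a uniformly random l-mer is within Hamming distance d of a fixed l-mer)."""
    return sum(comb(l, i) * 0.75**i * 0.25**(l - i) for i in range(d + 1))

def expected_random_motifs(l, d, t, n):
    """Approximate expected number of length-l strings that have an occurrence
    (within distance d) in each of t random length-n sequences."""
    p = neighbour_prob(l, d)
    q = 1 - (1 - p) ** (n - l + 1)  # at least one occurrence in one sequence
    return 4**l * q**t

for l, d in [(9, 2), (15, 4)]:
    print(l, d, expected_random_motifs(l, d, t=20, n=600))
```

When the estimate is around 1 or larger, random patterns are as plausible as the real motif; when it is far below 1, a pattern that does occur in all sequences is unlikely to be an artifact.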
-
Problems with String Motif
Binding sites:
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
Consensus string as motif: GTTACCATGGTAAC
GTTACCATGGTAAC is a motif, but CTTACCATGGTAAC is not a binding site.
Hamming distance cannot model the real situation.
Hypothesis: The binding sites are short, similar string patterns in each sequence, while some positions are conserved.
-
Motif represented by Matrix
Probability matrix M: M(α,j) = probability that the jth position is α
Binding sites: TCCATGG, CGCATGG, CCCATGC, GCCATGG, CCCTTGG
Matrix motif (PSSM):
      1     2     3     4     5     6     7
A     0     0     0     0.8   0     0     0
C     0.6   0.8   1     0     0     0     0.2
G     0.2   0.2   0     0     0     1     0.8
T     0.2   0     0     0.2   1     0     0
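The matrix on this slide is simply the per-position base frequency of the five listed binding sites. A minimal sketch of that computation follows; the function name `probability_matrix` is illustrative, and positions are 0-based in the code.

```python
SITES = ["TCCATGG", "CGCATGG", "CCCATGC", "GCCATGG", "CCCTTGG"]

def probability_matrix(sites, alphabet="ACGT"):
    """M[base][j] = fraction of sites that have `base` at position j (0-based)."""
    length = len(sites[0])
    return {
        base: [sum(site[j] == base for site in sites) / len(sites)
               for j in range(length)]
        for base in alphabet
    }

M = probability_matrix(SITES)
for base, row in M.items():
    print(base, row)
```

In practice pseudocounts are often added so that no entry is exactly zero; the slide's matrix does not use them, so neither does this sketch.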
-
Problems about Matrix Representation
Given the matrix, what are the binding sites? (In the string representation: those within a given Hamming distance.)
Which matrix can be the motif? (In the string representation: the pattern which gives the largest number of binding sites.)
-
Motif Representation
Given matrix M, is σ = GGCTTGC a binding site?
Pr(σ generated by M):
$p(M,\sigma) = \prod_{j=1}^{l} M(\sigma[j], j)$
      1     2     3     4     5     6     7
A     0     0     0     0.8   0     0     0
C     0.6   0.8   1     0     0     0     0.2
G     0.2   0.2   0     0     0     1     0.8
T     0.2   0     0     0.2   1     0     0
p(M,“GGCTTGC”) = 0.2 × 0.2 × 1 × 0.2 × 1 × 1 × 0.2 = 0.0016
-
Motif Representation
Pr(σ generated by Background):
$p(B,\sigma) = \prod_{j=1}^{l} B(\sigma[j])$
B(A) = 0.2, B(C) = 0.3, B(G) = 0.3, B(T) = 0.2
p(B,“GGCTTGC”) = 0.3 × 0.3 × 0.3 × 0.2 × 0.2 × 0.3 × 0.3 = 0.0000972
-
What are the Binding Sites?
σ is a binding site iff
$\log\left(\frac{p(M,\sigma)}{p(B,\sigma)}\right) \ge t \quad \text{(threshold)}$
Large ⇒ likely to be a binding site
Example: “GGCTTGC”
$\log\left(\frac{p(M,\text{GGCTTGC})}{p(B,\text{GGCTTGC})}\right) = \log\left(\frac{0.0016}{0.0000972}\right) \approx \log 16.5 \approx 1.218$ (a binding site if t = 1)
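The worked example for “GGCTTGC” can be checked numerically. The sketch below hard-codes the matrix M and background B from the slides; the function names are illustrative, and the base-10 logarithm is an assumption chosen because it reproduces the slide's value of about 1.22.

```python
from math import log10, prod

M = {  # probability matrix from the slides, one row per base
    "A": [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0],
    "C": [0.6, 0.8, 1.0, 0.0, 0.0, 0.0, 0.2],
    "G": [0.2, 0.2, 0.0, 0.0, 0.0, 1.0, 0.8],
    "T": [0.2, 0.0, 0.0, 0.2, 1.0, 0.0, 0.0],
}
B = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}  # background distribution

def p_motif(M, site):
    """p(M, sigma): product of M(sigma[j], j) over all positions j."""
    return prod(M[base][j] for j, base in enumerate(site))

def p_background(B, site):
    """p(B, sigma): product of B(sigma[j]) over all positions j."""
    return prod(B[base] for base in site)

def log_odds(M, B, site):
    """log10 of p(M, sigma) / p(B, sigma); undefined when p(M, sigma) is 0."""
    return log10(p_motif(M, site) / p_background(B, site))

site = "GGCTTGC"
print(p_motif(M, site), p_background(B, site), log_odds(M, B, site))
```

The unrounded ratio is about 16.46, so the score is about 1.2165; the slide rounds the ratio to 16.5 first. Either way the score exceeds t = 1.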
-
Which is the Correct Matrix?
Each matrix M is given a score (Information Content), IC(M)
The score increases with:
the number of binding sites
their similarity with M, log(p(M,σ)/p(B,σ)) - t
The correct matrix M* has the maximum score.
$IC(M) = \sum_{\sigma:\ \log(p(M,\sigma)/p(B,\sigma)) \ge t} \left( \log\left(\frac{p(M,\sigma)}{p(B,\sigma)}\right) - t \right)$
-
Motif-Finding Problem
Input: A set of sequences bound by a particular transcription factor
Output:
A motif (probability matrix)
Positions of binding sites in each sequence
Leung and Chin, "Finding Exact Optimal Motif in Matrix Representation by Partitioning", Bioinformatics, Vol 21, Supp 2, ECCB/JBI, ii86-92 (September 2005)
-
False positives
The pattern “CGCGCG” appears many times in the sequences, but its occurrences are not binding sites. Why?
Some genes are not regulated (they contain no binding sites), yet their promoter regions also contain many “CGCGCG” patterns.
These sequences can be used as a negative set (control set).
Hypothesis: Those sequences without binding sites probably do not contain any string patterns similar to the motif.
-
Generalized Motif-Finding Problem
Input:
T = sequences containing binding sites
F = sequences not containing binding sites
Output: Motif M and the binding sites
Leung and Chin, "Finding Motifs from All Sequences With and Without Binding Sites," Bioinformatics 2006
-
Color Ratio and Binding Energy
Microarray experiments can be used to indicate gene expression, which is measured by color intensity (the probability of TF binding).
The amount of binding energy (strong or weak) can be estimated by the color ratio.
Hypothesis: Each binding site, depending on its pattern, has different binding energy with the TF.
Hypothesis: Higher color intensity means stronger binding.
-
Energy-based Model
[Figure: four sequences (Seq 1 to Seq 4) with binding energies -4.8, -5.1, -4.6 and -0.5.]
-
Energy-based Model
Sequences (Seq 1 to Seq 4):
CCAGATGAGATG
GACGATGAACGC
AGTGCTGAGGCT
CCACCAGCTATT
Binding energies: -5.1, -4.6, -4.8, -0.5
Energy Matrix:
      1      2      3      4      5      6
A    -0.5   -0.7    0.5   -0.5   -0.6    0.1
C     0.3   -0.4   -0.1    0.3    0.1    0.1
G    -1.1    0.5    0.2   -0.8   -0.2   -0.4
T     0.1    0.8   -1.5   -0.2    0.3   -0.2
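One simple reading of the energy model is that each length-6 window is scored by summing the matching matrix entries, and the binding energy of a sequence is that of its lowest-energy window. The sketch below assumes exactly that; it reproduces the slide's values for the first three sequences. The fourth sequence is left out because its extracted value could not be reproduced under this assumption.

```python
ENERGY = {  # energy matrix from the slide, one row per base
    "A": [-0.5, -0.7, 0.5, -0.5, -0.6, 0.1],
    "C": [0.3, -0.4, -0.1, 0.3, 0.1, 0.1],
    "G": [-1.1, 0.5, 0.2, -0.8, -0.2, -0.4],
    "T": [0.1, 0.8, -1.5, -0.2, 0.3, -0.2],
}

def window_energy(E, window):
    """Sum of the matrix entries selected by the bases of `window`."""
    return sum(E[base][j] for j, base in enumerate(window))

def binding_energy(E, seq):
    """Lowest-energy window of `seq` and its energy (rounded to 1 decimal)."""
    width = len(E["A"])
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    best = min(windows, key=lambda w: window_energy(E, w))
    return best, round(window_energy(E, best), 1)

for seq in ["CCAGATGAGATG", "GACGATGAACGC", "AGTGCTGAGGCT"]:
    print(seq, binding_energy(ENERGY, seq))
```

The lowest-energy windows found this way (GATGAG, GATGAA, GCTGAG) are the candidate binding sites, in line with the statement that binding sites are the patterns with the lowest energy.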
-
Problem with binding energy
Input:
A set of DNA sequences
The binding energy between the TF and each sequence (color intensity in microarray)
Output: The motif (energy matrix) M which produces the binding energy of each sequence
The binding sites are those patterns in the sequence with the lowest energy.
-
Simulated Data
         Expected number   EBMF           AlignACE       MEME
         of matrices       Find?  rank    Find?  rank    Find?  rank
B = 7    149475            yes    1       no     -       no     -
B = 8    0.000439          yes    1       no     -       yes    1
B = 9    7.7×10^-7         yes    1       yes    1       yes    1
Results of the algorithms on simulated data for 200 sequences of length 700 where 10 of them contain B binding sites of length 17 with expected likelihood -10
Leung, Chin, Yiu, et al, "Finding Motifs with Insufficient Number of Strong Binding Sites", Journal of Computational Biology, 2005, preliminary version appeared in RECOMB 2004.
-
Real Data
GAL4 (motif pattern is CGG-N11-CCG)
                                                             EBMF           AlignACE       MEME
                                                             Find?  rank    Find?  rank    Find?  rank
Using the top 100 sequences in the original data             yes    2       yes    1       yes    1
Using the top 100 sequences except sequences 2, 3, 4 and 6   yes    1       no     -       no     -
Using the top 100 sequences except sequences 1 to 6          yes    10      no     -       no     -
Using the top 100 sequences except sequences 1 to 8          yes    5       no     -       no     -
-
Further Information about Transcription Factors
Proteins (transcription factors) have a 3D structure and can be grouped into classes.
Their binding sites have different characteristics.
[Figure: Zinc finger, Leucine zipper, Helix-Turn-Helix]
-
Six Major Classes
Guess the motif class based on the characteristics of binding sites
Search motifs in that class by modifying their likelihood accordingly
Classes            Sub-Classes                  Characteristics                                  Freq
Zinc Finger        I. Cys2 His2                 G..G | G..G..G | [CG]..[CG]..[CG]                13%
Zinc Finger        II. Cys4                     AGGTCA | TGACCT                                  13%
Leucine zipper     III. bZip                    TGA .* TCA                                       23%
Helix-Turn-Helix   IV. bHLH                     CA..TG                                           3%
Helix-Turn-Helix   V. Homeodomain               TAAT | ATTA                                      11%
                   VI. Others (e.g. Forkhead)   unknown                                          33%
-
3D Binding Domains and DNA Binding Sites
TFs: classified by 3D binding domains.
3D binding domains and DNA binding sites are related.
DNA binding sites: classified accordingly.
Hypothesis: Most transcription factors can be classified into a few protein structures and binding domains, most binding sites should have a few patterns.
Leung and Chin, "Discovering Motifs with Transcription Factor Domain Knowledge", PSB2007
-
Experimental Results (MEME / DIMDom)
Number of motifs with known sites discovered: MEME: 38, DIMDom: 47
Average accuracy: MEME: 0.3141, DIMDom: 0.4471
Higher accuracy: MEME: 9 data sets, DIMDom: 26 data sets, same accuracy: 3 data sets
-
Protein Sequences: A Motif-Pair for Binding
Leung, Siu, Yiu, Chin, Sung, "Finding Linear Motif Pair from Protein Interaction Networks: A Probabilistic Approach", CSB2007
-
Protein-Protein Interaction (PPI) Network
Vertex: protein sequence
Edge (interaction) between u and v: protein u and protein v can bind together
Many interactions are missing or erroneous
-
Motif Pairs for Protein Interaction
Problem: When two proteins interact, where are their binding sites (domains)?
-
Binding sites and Motif
1) Different proteins may have similar binding sites.
GLFPSNY
GFIPGNY
GVFPGNY
GIFPLNY
2) The proteins that these proteins bind to also contain another set of similar binding sites.
PTLPPR
PIKPPR
PTAPQR
PTLPSR
PPLPNR
PPLPTR
Interaction
-
Hypothesis
If M and M’ are two motifs representing two sets of real binding sites that interact, we expect that the sequences containing instances of the corresponding motifs should have many interactions.
-
The Motif Pair Finding Problem
Input: protein-protein interaction network
Problem: To find a pair of motifs (M1, M2) such that sequences containing M1 and sequences containing M2 have an unexpectedly large number of interactions.
[Figure: one pair of motifs that may be a real motif pair, and one pair that is not a real motif pair.]
-
Another Problem related to PPI: Predicting Protein Complexes
Leung, Xiang, Yiu and Chin, "Predicting Protein Complexes from PPI Data: A Core-Attachment Approach", Journal of Computational Biology, 2009. Preliminary version presented in RECOMB Satellite 2008.
-
Protein Complex
A group of proteins that bind together
Represented as a connected subgraph in the PPI network
Problem: To predict protein complexes from PPI network
Hypothesis: Proteins in the same complex have more interactions among them.
-
Protein complexes (dense subgraphs)
-
Heuristics to Find Dense Sub-graphs
Markov Cluster (MCL) (Enright et al., Nucl Acids Res 2002)
  Bootstrapping through random walks in graphs, i.e., walks get trapped in clusters and rarely go out to other clusters
Molecular Complex Detection (MCODE) (Bader and Hogue, BMC Bioinf 2003)
  Start with high-degree vertices and recursively merge with neighbors, ensuring that the density stays above a given threshold
CFinder (Adamcsek et al., Bioinformatics 2006)
  Locate overlapping cliques by merging two k-cliques if they share k-1 nodes
Core and Attachment (Leung et al., JCB 2009)
  Based on the biological information that each complex has a core.
-
Difficulties
Many interactions are missing
The protein complexes may not necessarily be dense subgraphs, especially when the complexes are large
Some proteins are present in multiple complexes
-
Structure of Protein Complex
Proteins in a complex consist of [Gavin et al. 2006]:
Core proteins
Attachments
Each protein complex has a unique set of core proteins
[Figure: a complex with its core, attachments and modules]
-
Core Proteins
Relatively more interactions among themselves
  Dense subgraphs
Attachments bind to core proteins to form complexes
  Attachments are neighbors of cores
Each protein complex has a unique set of core proteins
  Cores are disjoint
  Cores are not present in other complexes
-
Experimental Results on Cores
Compare with Mediator [Andreopoulos et al. 2007] on Gavin dataset
# of correct cores   acc ≥ 0.4   acc ≥ 0.6   acc ≥ 0.8
Mediator             29          8           0
Ours                 267         169         103
-
Experiments on Complexes
Compare with 3 methods (MCL, MCODE and CFinder) on 3 datasets
Datasets   Number of proteins   Number of interactions   Average degree
DIP        4,928                17,201                   6.98
Krogan     2,675                7,080                    5.29
Gavin      1,430                6,531                    9.13
-
Comparison with MCL
# of correct complexes (MCL / Ours)
          acc ≥ 0.6   acc ≥ 0.7   acc ≥ 0.8
DIP       30 / 36     20 / 26     10 / 15
Krogan    28 / 37     16 / 21     7 / 11
Gavin     32 / 35     26 / 29     11 / 17
-
Comparison with MCODE
# of correct complexes (MCODE / Ours)
          acc ≥ 0.6   acc ≥ 0.7   acc ≥ 0.8
DIP       17 / 29     13 / 22     7 / 13
Krogan    6 / 24      5 / 16      2 / 8
Gavin     23 / 32     19 / 27     8 / 17
-
Comparison with CFinder
# of correct complexes (CFinder / Ours)
          acc ≥ 0.6   acc ≥ 0.7   acc ≥ 0.8
DIP       19 / 28     14 / 22     10 / 13
Krogan    28 / 28     13 / 19     5 / 11
Gavin     25 / 29     20 / 26     13 / 16
-
Different Biological Processes between Two Groups of Species
Given two groups of species, each with a metabolic network
Find those reactions or metabolic pathways which belong to most of the networks in one group but not in the other.
Hypothesis: There must exist a set of reactions or metabolic pathways in one group of species but not in the other.
-
Repeated Patterns Make Genome Assembly Difficult
Peng, Leung, Yiu and Chin, “IDBA - Iterative de Bruijn Graph de Novo Assembler”, RECOMB 2010 (to appear).
-
De novo Assembling
[Figure: a genome with unknown sequence is sequenced into reads (45-140bp); assembling the reads gives a genome with known sequence.]
However, there are many problems
-
Problems
Error in reads: 1-2% error rate per nucleotide
  e.g. 1% error rate, 75bp read length: about 1 - (1 - 1%)^75 = 53% of reads have an error
Gap: positions with no read cover
Repeat: length of repeat ≥ read length, impossible to assemble
Repeats in E. coli:
Length   Repeat #
30       3899
40       2784
50       2248
100      1074
200      536
300      345
500      200
1000     101
-
De novo Assembling
[Figure: sequencing and assembling a genome; gaps, repeats and read errors break the assembly into contigs and introduce new gaps.]
-
Assembling Problem
Input: A set of reads from a genome
Objective: Construct contigs of the genome
Accuracy (>99.9%)
Coverage
Length: N50 (length of the shortest contig in the set of longest contigs that covers ≥ 50% of the genome)
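The N50 definition can be made concrete with a few lines of code. This is a minimal sketch; the function name `n50` and the optional `genome_length` parameter are illustrative, and when the genome length is unknown the total contig length is used instead, which is a common convention and an assumption here.

```python
def n50(contig_lengths, genome_length=None):
    """Length of the shortest contig in the smallest set of longest contigs
    whose total length covers at least half of the genome."""
    target = (genome_length or sum(contig_lengths)) / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0  # the contigs cover less than half of the genome

lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
print(n50(lengths), n50(lengths, genome_length=500))
```

Measuring against the genome length rather than the assembly length penalizes assemblies that leave much of the genome uncovered.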
-
Existing Approaches
Greedy: SSAKE (Warren, 2007), SHARCGS (Dohm, 2007)
Works well on high-coverage and error-free data
[Figure: a contig is extended while overlap ≥ k. Extension stops when overlap < k (favors small k) or at an error or repeat (favors large k).]
-
String Graph
Edena (Hernandez, 2008)
Vertex: read
Edge: overlap ≥ k
Simple path: contig
Handles some errors (dead-ends); other errors remain
Large graph (requires memory)
Works on high-coverage and error-free data
[Figure: string graph with k = 5 over the reads GTACTCT, TACTCTA, CTCTAGC, CTAGCTG, ...; the simple path gives the contig GTACTCTAGCTG, and an erroneous read (CTGCTCC) forms a dead-end.]
-
De Bruijn Graph
Velvet (Zerbino, 2008)
Vertex: k-mer in read (length-k substring)
Edge: overlap k - 1 bp
Simple path: contig
Advantages:
Gain information in error reads
Better filtering
Ex: k = 5
Reads: GTACTCT, TACTCTA, CTCTAGC, CTAGCTG
k-mers: GTACT, TACTC, ACTCT, CTCTA, TCTAG, CTAGC, TAGCT, AGCTG
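The k = 5 example can be reproduced with a small de Bruijn graph built from the four reads GTACTCT, TACTCTA, CTCTAGC and CTAGCTG. The sketch below is illustrative only and is not the Velvet implementation; edges are added between consecutive k-mers of a read, and the contig is spelled only when the whole graph is a single simple path.

```python
READS = ["GTACTCT", "TACTCTA", "CTCTAGC", "CTAGCTG"]

def de_bruijn(reads, k):
    """Vertices are k-mers; an edge u -> v is added when u and v are
    consecutive k-mers in some read (so they overlap by k - 1 bases)."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k):
            u, v = read[i:i + k], read[i + 1:i + k + 1]
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set())
    return graph

def simple_path_contig(graph):
    """Spell the contig when the whole graph is one simple path."""
    indegree = {v: 0 for v in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1
    starts = [u for u in graph if indegree[u] == 0]
    if len(starts) != 1:
        raise ValueError("graph is not a single path")
    node = starts[0]
    contig, seen = node, {node}
    while graph[node]:
        if len(graph[node]) > 1:
            raise ValueError("branch found")
        node = next(iter(graph[node]))
        if node in seen:
            raise ValueError("cycle found")
        seen.add(node)
        contig += node[-1]  # each step contributes one new base
    return contig

graph = de_bruijn(READS, k=5)
print(len(graph), simple_path_contig(graph))
```

Real assemblers compress such simple paths in a graph that also contains branches and tips; this sketch only handles the branch-free case shown on the slide.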
-
Filtering
Sequencing depth: 75
Read length: 75bp
String graph:
  a correct read appears ~1 time
  an error read appears ~1 time
De Bruijn graph:
  a correct 30-mer appears ~45 times
  an error 30-mer appears ~1-2 times
Error reads can be filtered easily
Reads: ACGCCATCACGTTC, CGCCTTCACGTTCT, CCATCACGTTCTCA, CCATCACGCTCTCA, CATCACGTTCTCAG
Correct 7-mers: CATCACG: 4 times, ATCACGT: 3 times, TCACGTT: 4 times
Error 7-mers: CTTCACG: 1 time, TTCACGT: 1 time, ATCACGC: 1 time, TCACGCT: 1 time
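The 7-mer counts on this slide can be reproduced by counting every k-mer across the five 14-bp reads (ACGCCATCACGTTC, CGCCTTCACGTTCT, CCATCACGTTCTCA, CCATCACGCTCTCA, CATCACGTTCTCAG). The sketch below also includes an idealized estimate of k-mer depth, depth × (L - k + 1) / L, which assumes uniformly placed, error-free reads; the function names and the `min_count` cutoff are illustrative.

```python
from collections import Counter

READS = [
    "ACGCCATCACGTTC",
    "CGCCTTCACGTTCT",
    "CCATCACGTTCTCA",
    "CCATCACGCTCTCA",
    "CATCACGTTCTCAG",
]

def kmer_counts(reads, k):
    """Count every length-k substring over all reads."""
    return Counter(read[i:i + k]
                   for read in reads
                   for i in range(len(read) - k + 1))

def solid_kmers(counts, min_count):
    """k-mers seen at least `min_count` times; rarer ones are treated as errors."""
    return {kmer for kmer, count in counts.items() if count >= min_count}

def expected_kmer_depth(depth, read_len, k):
    """Idealized number of reads covering a k-mer (uniform, error-free reads)."""
    return depth * (read_len - k + 1) / read_len

counts = kmer_counts(READS, 7)
print(counts["CATCACG"], counts["ATCACGT"], counts["ATCACGC"])
print(expected_kmer_depth(75, 75, 30))
```

With depth 75 and 75bp reads the estimate for 30-mers is 46, in line with the slide's figure of about 45, while a whole read is expected only about once.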
-
Problem of Existing Algorithms
-
Find Suitable k
Sequence: GTACTACTATGC
Reads: GTACTAC, TACTACT, ACTACTA, CTACTAT, ACTATGC
k = 5: GTACT, TACTA, ACTAC, CTACT, ACTAT, CTATG, TATGC (branch problem)
k = 6: GTACTA, TACTAC, ACTACT, CTACTA, TACTAT, ACTATG, CTATGC (gap problem)
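The trade-off between the two values of k can be checked on the five reads of this example (GTACTAC, TACTACT, ACTACTA, CTACTAT, ACTATGC). The sketch below builds a de Bruijn graph for each k and reports vertices with more than one successor (branches) and vertices with no predecessor (path starts); more than one start is used here as a simple, assumed indicator of a gap.

```python
READS = ["GTACTAC", "TACTACT", "ACTACTA", "CTACTAT", "ACTATGC"]

def de_bruijn(reads, k):
    """Edges join consecutive k-mers of a read (overlap of k - 1 bases)."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k):
            u, v = read[i:i + k], read[i + 1:i + k + 1]
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set())
    return graph

def branches_and_starts(reads, k):
    """Return (vertices with several successors, vertices without predecessor)."""
    graph = de_bruijn(reads, k)
    indegree = {v: 0 for v in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1
    branches = sorted(u for u in graph if len(graph[u]) > 1)
    starts = sorted(u for u in graph if indegree[u] == 0)
    return branches, starts

for k in (5, 6):
    print(k, branches_and_starts(READS, k))
```

At k = 5 the repeated 5-mer TACTA has two successors, so the path is ambiguous; at k = 6 the branch disappears, but no read links TACTAT to ACTATG, so the graph falls into two pieces.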
-
No suitable k
                                            Large k   Small k   Target
Branch problem (due to repeat / error)      Less      More      Less
Gap problem (due to low coverage / error)   More      Less      Less
Solved by selecting a suitable k
Still produces short contigs
Remark: these problems appear in greedy and string graph algorithms too
-
Iterative de Bruijn Graph Assembler
Construct an algorithm that achieves a better result than any single selection of k.
Start with small k:
  fewer gaps but more branches
  many short contigs
Iterate to higher k:
  to resolve branches
  to get longer contigs by merging.
-
Simulated Data
Genome: Escherichia coli (O157:H7 str. EC4115), 5.6 million bp
Sequencing depth: 30x
Read length: 75bp
Error rate: 1%
Paired-end, insert distance: 250bp
Requirements for contigs:
  Accuracy > 99.9%
  Length > 100bp
-
Comparison (Simulated)
N50: the minimum length such that all contigs at least that long cover more than 50% of the genome
[Figure: E. coli, L = 75, depth = 30, error rate = 1%]
-
Other Statistics
          time    Mem     k       Contigs#   N50     max len.   avg. len.   cov.    wrong #   wrong len.
velvet    155s    1641M   30      1369       19284   96905      2652        94.6%   19        9813
edena     957s    678M    40      4672       5104    46908      900         97.2%   650       72019
abyss     1113s   1749M   40      1390       22109   87118      2966        95.1%   66        34998
IDBA      370s    360M    25-50   1550       63218   217365     2210        97.5%   10        3935
optimal                   50      1561       63218   217365     2051        99.1%
-
Other Setting
                           High Coverage,    Low Coverage,     High Coverage,    Low Coverage,
                           Low Error Rate    Low Error Rate    High Error Rate   High Error Rate
                           (100x, 0.5%)      (30x, 1%)         (100x, 2%)        (30x, 2%)
Edena (string graph)       63256             5104              53491             147
Velvet (de Bruijn graph)   63214             24772             59285             16527
Abyss (de Bruijn graph)    58678             22109             50009             10992
IDBA (our algorithm)       63218             63218             59287             32612
-
Real Biological Data
Genome: Bacillus subtilis
Sequencing depth: ~45x
Read length: 75bp
Error rate: ~1%
Requirement for contigs: Length > 100bp
-
Comparison (Real)
[Figure: Bacillus subtilis, L = 75, depth = 45, error rate = 1%]
-
Other Statistics
          time   Mem    k       Contigs#   N50      max len.   avg. len.   total len.
velvet    89s    893M   35      476        35136    164023     8580        4.08M
edena     649s   632M   40      926        19423    66455      4444        4.11M
abyss     729s   923M   45      445        30081    134067     9184        4.09M
IDBA      313s   310M   25-50   283        122574   602412     14489       4.1M
-
Conclusions
IDBA outperforms existing assembling algorithms
  Contig length
  Accuracy
Main idea:
  Using multiple values of k
  Guarantees better results by increasing the value of k
Can be downloaded at http://www.cs.hku.hk/~alse/idba/
-
Future Works
Develop a more effective algorithm for paired-end reads
Assembling the transcriptome (RNA)
  Different expression levels
  Alternative splicing
Metagenomic sequencing
  Mixed genomes with different coverages
-
Thanks and Questions