Repeated (Conserved) Patterns in Bioinformatics · 2010/04/09

Repeated (Conserved) Patterns in Bioinformatics Francis Y.L. Chin {钱玉麟教授} Taikoo Chair of Engineering Chair Professor of Computer Science Associate Dean of Engineering University of Hong Kong March, 2010


  • Repeated (Conserved) Patterns in Bioinformatics

    Francis Y.L. Chin {钱玉麟教授}

    Taikoo Chair of Engineering

    Chair Professor of Computer Science

    Associate Dean of Engineering

    University of Hong Kong

    March, 2010

  • Bioinformatics

    Using laboratory experiments to understand biological processes is difficult, laborious, expensive and time-consuming.

    Nowadays, large volumes of biological data are available.

    Bioinformatics aims to exploit this data to understand biological processes through computational approaches.

  • What are Repeated Patterns?

    Repeated patterns are similar patterns that appear many times, e.g., common or conserved patterns.

    Repeated patterns can be measured by the probability of their occurrence in a random environment (p-value).

    A low p-value means the pattern is “information or signal bearing” and not an “artifact”.

    A high p-value implies “inconclusive”.

  • Why Repeated Patterns?

    Finding repeated patterns is important in bioinformatics research

    Sequence analysis
    Analysis of mutations
    Comparative genomics
    Evolutionary biology
    Biodiversity
    Protein-protein interaction

  • Repeated (Conserved) Patterns in DNA Sequences

  • The Central Dogma

    DNA (gene) produces protein.

    [Figure source: http://research.microsoft.com/uai2004/Slides/FriedmanUAI2004-part-II.ppt]

  • Binding Site

    [Figure: DNA with a gene and a binding site; a transcription factor sits on the binding site, and the gene produces a protein (Tyr, Leu, ...).]

    Identifying the TF binding sites on a DNA sequence is an important problem.

  • Binding Sites Identification

    Find genes associated with the same TF.

    Search for short similar patterns (motifs), i.e., binding sites, in the DNA sequences.

    Example: GCN4 binds to the DNA of each of the following genes (promoter region | gene):

    HIS7   AAAATTGAGTCATATC…GAGAATGCCGGTCGTTCACGTG…
    TRP4   AGTTATGACTAATATT…TATCATGTCCGAGGCGACTTTG…
    ADE4   CCGAATGACTGCTCAT…AAAAATGTGTGGTATTTTAGGTA…

  • Example

    >SPR3
    ……CTGGTCGTAATACAAATAGAAGAGGTAAACCAATCAATGGCCC
    GTTAGTTTGCCATTTGCTGCATCCTTCCCATGCAAAGTGTCTT……
    >COX6
    ……ACAGAAAATTCCAATCAAAAAGTTGGTGTTAGGCTATACTGAT
    GGCCGTATCGCTCCATACGAGCCAATCAGGGCCCCGCGCGTTA……
    >QCR8
    ……CCACGTGACTAGTCCAAGGATTTTTTTTAAGCCAATTAAAATG
    AAGAAATGCGTGATCGGAAATTACGGGTAGTACGAGAAGGAAA……
    >CYC1
    ……GGGCTTGATCCACCAACCAACGCTCGCCAAATGAACTGGCGCT
    TTGGTCTTCTGCCATCGTCCGTAAACCCCTTCCAAAGAGACCG……

    Hypothesis: The binding sites are short, similar string patterns in each sequence.

  • Example

    >SPR3
    ……CTGGTCGTAATACAAATAGAAGAGGTAAACCAATCAATGGCCC
    GTTAGTTTGCCATTTGCTGCATCCTTCCCATGCAAAGTGTCTT……
    >COX6
    ……ACAGAAAATTCCAATCAAAAAGTTGGTGTTAGGCTATACTGAT
    GGCCGTATCGCTCCATACGAGCCAATCAGGGCCCCGCGCGTTA……
    >QCR8
    ……CCACGTGACTAGTCCAAGGATTTTTTTTAAGCCAATTAAAATG
    AAGAAATGCGTGATCGGAAATTACGGGTAGTACGAGAAGGAAA……
    >CYC1(reverse)
    ……GGGCTTGATCCACCAACCAACGCTCGCCAAATGAACTGGCGCT
    TTGGTCTTCTGCCATCGTCCGTAAACCCCTTCCAAAGAGACCG……

    All these strings are binding sites; they are similar to a common pattern, CCAATCA, called the motif.

  • String Motif

    C. elegans binding sites:

    GTTGTCATGGTGAC
    GTTTCCATGGAAAC
    GCTACCATGGCAAC
    GTTACCATAGTAAC
    GTTTCCATGGTAAC

    GTTACCATGGTAAC – Consensus string (motif)

    A motif represents the common pattern of the binding sites.

    The binding sites are variants of the motif.

    The consensus motif gives the minimum total number of errors (finding the motif with the minimum maximum error is NP-complete) (Li et al, JCSS 2002).
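    The consensus string can be reproduced with a column-wise majority vote over the five C. elegans binding sites. The sketch below is only an illustration: the function names are made up here, and ties (position 4 has two A's and two T's) are broken in favor of the alphabetically first base, which is an assumption rather than something the slide states.

```python
def consensus(sites):
    """Column-wise majority vote; ties go to the alphabetically first base."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must have equal length"
    result = []
    for j in range(length):
        column = [s[j] for s in sites]
        # max() returns the first maximal element, so "ACGT" order breaks ties.
        result.append(max("ACGT", key=column.count))
    return "".join(result)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


sites = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]
motif = consensus(sites)
total_errors = sum(hamming(motif, s) for s in sites)
print(motif, total_errors)  # GTTACCATGGTAAC 9
```

    Either tie-break choice at position 4 gives the same total number of errors, since A and T both match two of the five sites there.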

  • Planted (l,d)-Motif Problem (PMP) (Pevzner and Sze, ISMB 2000)

    M = motif
    l = length of motif
    d = Hamming distance

    Input: T = t length-n sequences, each with at least one binding site.

    Problem: Find M and the binding sites (substrings within Hamming distance d from M).
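    As a small illustration of the problem statement, the sketch below checks whether a candidate motif M is a valid (l,d)-motif for a set of sequences by scanning every window. It only verifies a given candidate and is not one of the motif-finding algorithms; the function names and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def sites_within(sequence, motif, d):
    """All (position, substring) windows within Hamming distance d of motif."""
    l = len(motif)
    return [
        (i, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
        if hamming(sequence[i:i + l], motif) <= d
    ]


def is_planted_motif(sequences, motif, d):
    """True if every sequence contains at least one binding site of motif."""
    return all(sites_within(s, motif, d) for s in sequences)


# Toy input (not from the slides): "ACGT" planted exactly and with one mutation.
toy = ["AAACGTAA", "TTAGGTCC"]
print(sites_within(toy[0], "ACGT", 0))
print(is_planted_motif(toy, "ACGT", 1))
```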

  • Can Motif Always be Found?

    Many methods exist to find the motif: EM, Gibbs sampling, exhaustive search, maximal clique, …

    The motif cannot be found when
      there are too few sequences/binding sites (t and n are too small),
      the binding sites are “too” short (l is too small), or
      the binding sites vary too much (d is too large),

    because the p-value of those similar patterns is then too high, i.e., more than one possible solution will result.

    Motif finding is successful only when the “similar” patterns are “many” and “long” (low p-values), i.e., the probability that such patterns exist by chance is very low.
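    One rough way to see when a motif is "drowned" by chance is to estimate how many length-l strings would look like a planted (l,d)-motif in purely random sequences. The sketch below assumes uniform, independent bases and treats overlapping windows as independent, which is a simplification; the function names and the example parameter values are illustrative and not taken from the slides.

```python
from math import comb


def prob_random_match(l, d):
    """Probability that a uniform random l-mer is within Hamming distance d
    of one fixed l-mer."""
    return sum(comb(l, i) * (3 / 4) ** i * (1 / 4) ** (l - i) for i in range(d + 1))


def expected_random_motifs(l, d, t, n):
    """Approximate expected number of l-mers that have a site within distance d
    in each of t random sequences of length n."""
    p = prob_random_match(l, d)
    # Chance that one sequence has at least one matching window
    # (windows treated as independent, which is only an approximation).
    q = 1 - (1 - p) ** (n - l + 1)
    return 4 ** l * q ** t


for l, d in [(9, 2), (15, 4)]:
    print(l, d, expected_random_motifs(l, d, t=20, n=600))
```

    When the estimate is well below 1, a planted motif is unlikely to be confused with random patterns; when it is around 1 or larger, several equally good "motifs" are expected by chance.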

  • Problems with String Motif

    Binding sites:

    GTTGTCATGGTGAC
    GTTTCCATGGAAAC
    GCTACCATGGCAAC
    GTTACCATAGTAAC
    GTTTCCATGGTAAC

    GTTACCATGGTAAC – Consensus string as motif

    GTTACCATGGTAAC is a motif, but CTTACCATGGTAAC is not a binding site, although it differs from the motif in only one position.

    Hamming distance cannot model the real situation.

    Hypothesis: The binding sites are short, similar string patterns in each sequence, while some positions are conserved.

  • Motif represented by Matrix

    Probability matrix M: M(α,j) = probability that the jth position is α

    Binding sites:

    TCCATGG
    CGCATGG
    CCCATGC
    GCCATGG
    CCCTTGG

    Matrix motif (PSSM):

          1    2    3    4    5    6    7
    A     0    0    0   .8    0    0    0
    C    .6   .8    1    0    0    0   .2
    G    .2   .2    0    0    0    1   .8
    T    .2    0    0   .2    1    0    0
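    The probability matrix on this slide is simply the column-wise base frequency of the five binding sites. A minimal sketch follows (the function name is made up for illustration, and no pseudocounts are used, matching the zeros in the slide's matrix):

```python
def probability_matrix(sites, alphabet="ACGT"):
    """M[a][j] = fraction of sites that have base a at position j."""
    n = len(sites)
    length = len(sites[0])
    return {
        a: [sum(s[j] == a for s in sites) / n for j in range(length)]
        for a in alphabet
    }


sites = ["TCCATGG", "CGCATGG", "CCCATGC", "GCCATGG", "CCCTTGG"]
M = probability_matrix(sites)
for base in "ACGT":
    print(base, M[base])
```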

  • Problems about Matrix Representation

    Given the matrix, what are the binding sites?
      (In the string representation: the strings within a given Hamming distance of the motif.)

    Which matrix can be the motif?
      (In the string representation: the pattern that gives the largest number of binding sites.)

  • Motif Representation

    Given matrix M, is σ = GGCTTGC a binding site?

    Pr(σ generated by M):

    $$p(M,\sigma) = \prod_{j=1}^{l} M(\sigma[j], j)$$

          1    2    3    4    5    6    7
    A     0    0    0   .8    0    0    0
    C    .6   .8    1    0    0    0   .2
    G    .2   .2    0    0    0    1   .8
    T    .2    0    0   .2    1    0    0

    p(M,“GGCTTGC”) = 0.2 × 0.2 × 1 × 0.2 × 1 × 1 × 0.2 = 0.0016

  • Motif Representation

    Pr(σ generated by Background):

    $$p(B,\sigma) = \prod_{j=1}^{l} B(\sigma[j])$$

    B(A) = 0.2, B(C) = 0.3, B(G) = 0.3, B(T) = 0.2

    p(B,“GGCTTGC”) = 0.3 × 0.3 × 0.3 × 0.2 × 0.2 × 0.3 × 0.3 = 0.0000972

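  • What are the Binding Sites?

    σ is a binding site iff

    $$\log\left(\frac{p(M,\sigma)}{p(B,\sigma)}\right) \ge t \quad \text{(threshold)}$$

    Large ⇒ likely to be a binding site

    Example: “GGCTTGC”

    $$\log\left(\frac{p(M,\text{GGCTTGC})}{p(B,\text{GGCTTGC})}\right) = \log\left(\frac{0.0016}{0.0000972}\right) \approx \log 16.5 \approx 1.22$$

    (a binding site if t = 1; the numbers correspond to a base-10 logarithm)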
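    The worked example for “GGCTTGC” can be checked numerically. The sketch below hard-codes the slide's matrix M and background B; the function names are made up for illustration, and the base-10 logarithm is an assumption inferred from the slide's numbers.

```python
import math

# Probability matrix M (rows A, C, G, T; 7 positions) and background B from the slides.
M = {
    "A": [0, 0, 0, 0.8, 0, 0, 0],
    "C": [0.6, 0.8, 1, 0, 0, 0, 0.2],
    "G": [0.2, 0.2, 0, 0, 0, 1, 0.8],
    "T": [0.2, 0, 0, 0.2, 1, 0, 0],
}
B = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}


def prob_under_matrix(M, sigma):
    """p(M, sigma): product of M[sigma[j]][j] over all positions j."""
    p = 1.0
    for j, base in enumerate(sigma):
        p *= M[base][j]
    return p


def prob_under_background(B, sigma):
    """p(B, sigma): product of the background probabilities of each base."""
    p = 1.0
    for base in sigma:
        p *= B[base]
    return p


def log_odds(M, B, sigma):
    """log10(p(M, sigma) / p(B, sigma)); -inf if M cannot generate sigma."""
    pm = prob_under_matrix(M, sigma)
    if pm == 0:
        return float("-inf")
    return math.log10(pm / prob_under_background(B, sigma))


score = log_odds(M, B, "GGCTTGC")
print(round(score, 3), score >= 1)  # compare against threshold t = 1
```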

  • Which is the Correct Matrix?

    Each matrix M is given a score (Information Content), IC(M).

    The score increases with
      the number of binding sites, and
      their similarity with M, log(p(M,σ)/p(B,σ)) – t.

    $$IC(M) = \sum_{\sigma:\ \log(p(M,\sigma)/p(B,\sigma)) \ge t} \left( \log\left(\frac{p(M,\sigma)}{p(B,\sigma)}\right) - t \right)$$

    The correct matrix M* has the maximum score.

  • Motif-Finding Problem

    Input: A set of sequences bound by a particular transcription factor

    Output:
      A motif (probability matrix)
      Positions of binding sites in each sequence

    Leung and Chin, "Finding Exact Optimal Motif in Matrix Representation by Partitioning", Bioinformatics, Vol 21, Supp 2, ECCB/JBI, ii86-92 (September 2005)

  • False positives

    The pattern “CGCGCG” appears many times in the sequences, but its occurrences are not binding sites. Why?

    Some genes are not regulated (they contain no binding sites), yet their promoter regions also contain many “CGCGCG” patterns.

    These sequences can be used as a negative set (control set).

    Hypothesis: Those sequences without binding sites probably do not contain any string patterns similar to the motif.

  • Generalized Motif-Finding Problem

    Input:
      T = sequences containing binding sites
      F = sequences not containing binding sites

    Output: Motif M and the binding sites

    Leung and Chin, "Finding Motifs from All Sequences With and Without Binding Sites," Bioinformatics 2006

  • Color Ratio and Binding Energy

    Microarray experiments can be used to indicate gene expression, which is measured by color intensity (the probability of TF binding).

    The amount of binding energy (strong or weak) can be estimated by the color ratio.

    Hypothesis: Each binding site, depending on its pattern, has different binding energy with the TF.

    Hypothesis: Higher color intensity means stronger binding.

  • Energy-based Model

    [Figure: four sequences, labelled Seq 1 to Seq 4, annotated with binding energies -4.8, -5.1, -4.6 and -0.5.]

  • Energy-based Model

    Sequences and their binding energies:

    CCAGATGAGATG   -5.1
    GACGATGAACGC   -4.6
    AGTGCTGAGGCT   -4.8
    CCACCAGCTATT   -0.5

    Energy Matrix:

           1      2      3      4      5      6
    A   -0.5   -0.7    0.5   -0.5   -0.6    0.1
    C    0.3   -0.4   -0.1    0.3    0.1    0.1
    G   -1.1    0.5    0.2   -0.8   -0.2   -0.4
    T    0.1    0.8   -1.5   -0.2    0.3   -0.2

  • Problem with binding energy

    Input:
      A set of DNA sequences
      The binding energy between the TF and each sequence (color intensity in microarray)

    Output: The motif (energy matrix) M which produces the binding energy of each sequence

    The binding sites are those patterns in the sequence with the lowest energy.
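    Under the energy-based model, the energy of a sequence is the lowest energy over all of its length-6 windows, and that window is the predicted binding site. The sketch below uses the slide's energy matrix; the function names are illustrative. Only the first three example sequences are scored, since they reproduce the slide's energies, while the transcription of the fourth sequence is less certain.

```python
# Energy matrix from the slide (rows A, C, G, T; 6 positions).
ENERGY = {
    "A": [-0.5, -0.7, 0.5, -0.5, -0.6, 0.1],
    "C": [0.3, -0.4, -0.1, 0.3, 0.1, 0.1],
    "G": [-1.1, 0.5, 0.2, -0.8, -0.2, -0.4],
    "T": [0.1, 0.8, -1.5, -0.2, 0.3, -0.2],
}


def window_energy(E, window):
    """Sum of the per-position energies of one candidate site."""
    return sum(E[base][j] for j, base in enumerate(window))


def sequence_energy(E, seq):
    """Lowest-energy window of seq and its energy (rounded to one decimal)."""
    width = len(E["A"])
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    best = min(windows, key=lambda w: window_energy(E, w))
    return best, round(window_energy(E, best), 1)


for seq in ["CCAGATGAGATG", "GACGATGAACGC", "AGTGCTGAGGCT"]:
    print(seq, sequence_energy(ENERGY, seq))
```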

  • Simulated Data

             Expected number    EBMF           AlignACE       MEME
             of matrices        Find?  rank    Find?  rank    Find?  rank
    B = 7    149475             yes    1       no     -       no     -
    B = 8    0.000439           yes    1       no     -       yes    1
    B = 9    7.7×10^-7          yes    1       yes    1       yes    1

    Results of the algorithms on simulated data for 200 sequences of length 700, where 10 of them contain B binding sites of length 17 with expected likelihood -10.

    Leung, Chin, Yiu, et al, "Finding Motifs with Insufficient Number of Strong Binding Sites", Journal of Computational Biology, 2005, preliminary version appeared in RECOMB 2004.

  • Real Data

    GAL4 (motif pattern is CGG N11 CCG)

                                                        EBMF           AlignACE       MEME
    Data used                                           Find?  rank    Find?  rank    Find?  rank
    Top 100 sequences in the original data              yes    2       yes    1       yes    1
    Top 100 sequences except sequences 2, 3, 4 and 6    yes    1       no     -       no     -
    Top 100 sequences except sequences 1 to 6           yes    10      no     -       no     -
    Top 100 sequences except sequences 1 to 8           yes    5       no     -       no     -

  • Further Information about Transcription Factors

    Proteins (transcription factors) have 3D structures and can be grouped into classes.

    Their binding sites have different characteristics.

    Examples: Zinc finger, Leucine zipper, Helix-Turn-Helix

  • Six Major Classes

    Guess the motif class based on the characteristics of binding sites

    Search motifs in that class by modifying their likelihood accordingly

    Classes            Sub-Classes                  Characteristics                                    Freq
    Zinc Finger        I. Cys2 His2                 G . . G | G . . G . . G | [CG] . . [CG] . . [CG]   13%
                       II. Cys4                     AGGTCA | TGACCT                                    13%
    Leucine zipper     III. bZip                    TGA .* TCA                                         23%
    Helix-Turn-Helix   IV. bHLH                     CA . . TG                                          3%
                       V. Homeodomain               TAAT | ATTA                                        11%
                       VI. Others (e.g. Forkhead)   unknown                                            33%

  • 3D Binding Domains and DNA Binding Sites

    TFs are classified by their 3D binding domains.

    3D binding domains and DNA binding sites are related.

    DNA binding sites are classified accordingly.

    Hypothesis: Most transcription factors can be classified into a few protein structures and binding domains, so most binding sites should follow a few patterns.

    Leung and Chin, "Discovering Motifs with Transcription Factor Domain Knowledge", PSB2007

  • Experimental Results

                                                    MEME      DIMDom
    Number of motifs with known sites discovered    38        47
    Average accuracy                                0.3141    0.4471
    Data sets with higher accuracy                  9         26

    Same accuracy: 3 data sets

  • Protein Sequences: A Motif-Pair for Binding

    Leung, Siu, Yiu, Chin, Sung, "Finding Linear Motif Pair from Protein Interaction Networks: A Probabilistic Approach", CSB2007

  • Protein-Protein Interaction (PPI) Network

    Vertex: protein sequence

    Edge (interaction) between u and v: protein u and protein v can bind together

    Many interactions are missing or erroneous

  • Motif Pairs for Protein Interaction

    Problem: When two proteins interact, where are their binding sites (domains)?

  • Binding sites and Motif

    1) Different proteins may have similar binding sites.

    GLFPSNY

    GFIPGNY

    GVFPGNY

    GIFPLNY

    2) The proteins that these proteins bind to also contain another set of similar binding sites.

    PTLPPR

    PIKPPR

    PTAPQR

    PTLPSR

    PPLPNR

    PPLPTR

    Interaction

  • Hypothesis

    If M and M’ are two motifs representing two sets of real binding sites that interact, we expect that the sequences containing instances of the corresponding motifs should have many interactions.

  • The Motif Pair Finding Problem

    Input: protein-protein interaction network

    Problem: To find a pair of motifs (M1, M2) such that sequences containing M1 and sequences containing M2 have an unexpectedly large number of interactions.

    [Figure: example network in which one motif pair may be a real motif pair and another pair is not.]
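    The quantity being maximized can be illustrated by counting the interactions between proteins that contain an instance of M1 and proteins that contain an instance of M2. In the sketch below everything is hypothetical: the toy proteins, the toy network, and the two regular expressions, which are hand-written generalizations of the example sites shown earlier (GLFPSNY, PTLPPR, ...). Deciding whether a count is "unexpectedly large" would additionally require a background model, which is not shown.

```python
import re


def count_pair_interactions(proteins, edges, m1, m2):
    """Number of edges joining a protein matching m1 with a protein matching m2."""
    has1 = {name for name, seq in proteins.items() if re.search(m1, seq)}
    has2 = {name for name, seq in proteins.items() if re.search(m2, seq)}
    count = 0
    for u, v in edges:
        if (u in has1 and v in has2) or (u in has2 and v in has1):
            count += 1
    return count


# Hypothetical toy data: sequences embed sites like those on the earlier slide.
proteins = {
    "P1": "MKGLFPSNYAA",
    "P2": "TTGFIPGNYLL",
    "P3": "AAPTLPPRKK",
    "P4": "CCPIKPPRDD",
    "P5": "MMMMMMMM",
}
edges = [("P1", "P3"), ("P2", "P4"), ("P1", "P4"), ("P3", "P5")]
m1 = r"G[LFVI][FI]P[SGL]NY"   # assumed pattern for the first set of sites
m2 = r"P[TIP][LKA]P[PQSNT]R"  # assumed pattern for the second set of sites
print(count_pair_interactions(proteins, edges, m1, m2))
```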

  • Another Problem related to PPI: Predicting Protein Complexes

    Leung, Xiang, Yiu and Chin, "Predicting Protein Complexes from PPI Data: A Core-Attachment Approach", Journal of Computational Biology, 2009. Preliminary version presented in RECOMB Satellite 2008.

  • Protein Complex

    Protein complex:
      A group of proteins that bind together
      Represented as a connected subgraph in the PPI network

    Problem: To predict protein complexes from PPI network

    Hypothesis: Proteins in the same complex have more interactions among them.

  • Protein complexes (dense subgraphs)
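    "Dense" can be made concrete as the fraction of possible protein pairs in a candidate complex that actually interact. The sketch below is one common definition of subgraph density; the function name and the toy network are illustrative and not taken from the slides.

```python
from itertools import combinations


def density(nodes, edges):
    """Fraction of all node pairs within `nodes` that are joined by an edge."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edge_set = {frozenset(e) for e in edges}
    present = sum(frozenset(pair) in edge_set for pair in combinations(nodes, 2))
    return present / (n * (n - 1) / 2)


# Toy PPI network: a triangle a-b-c plus a pendant protein d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(density({"a", "b", "c"}, toy_edges))       # a clique has density 1.0
print(density({"a", "b", "c", "d"}, toy_edges))  # 4 of 6 possible pairs
```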

  • Heuristics to Find Dense Sub-graphs

    Markov Cluster (MCL) (Enright et al., Nucl Acids Res 2002)
      Bootstrapping through random walks in graphs, i.e., walks get trapped in clusters and rarely go out to other clusters

    Molecular Complex Detection (MCODE) (Bader and Hogue, BMC Bioinf 2003)
      Start with high-degree vertices and recursively merge with neighbors, ensuring that the density stays above a given threshold

    CFinder (Adamcsek et al., Bioinformatics 2006)
      Locate overlapping cliques by merging two k-cliques if they share k-1 nodes

    Core and Attachment (Leung et al., JCB 2009)
      Based on biological information that each complex has a core.

  • Difficulties

    Many interactions are missing

    The protein complexes may not necessarily be dense subgraphs, especially when the complexes are large

    Some proteins are present in multiple complexes

  • Structure of Protein Complex

    Proteins in a complex consist of [Gavin et al. 2006]
      Core proteins
      Attachments

    Each protein complex has a unique set of core proteins

    [Figure: a protein complex with its core, attachments and modules.]

  • Core Proteins

    Relatively more interactions among themselves
      Dense subgraphs

    Attachments bind to core proteins to form complexes
      Attachments are neighbors of cores

    Each protein complex has a unique set of core proteins
      Cores are disjoint
      Core proteins are not present in other complexes

  • Experimental Results on Cores

    Compare with Mediator [Andreopoulos et al. 2007] on Gavin dataset

    # of correct cores    acc ≥ 0.4    acc ≥ 0.6    acc ≥ 0.8
    Mediator              29           8            0
    Ours                  267          169          103

  • Experiments on Complexes

    Compare with 3 methods (MCL, MCODE and CFinder) on 3 datasets

    Datasets    Number of proteins    Number of interactions    Average degree
    DIP         4,928                 17,201                    6.98
    Krogan      2,675                 7,080                     5.29
    Gavin       1,430                 6,531                     9.13

  • Comparison with MCL

    # of correct complexes (MCL / Ours)

              acc ≥ 0.6    acc ≥ 0.7    acc ≥ 0.8
    DIP       30 / 36      20 / 26      10 / 15
    Krogan    28 / 37      16 / 21      7 / 11
    Gavin     32 / 35      26 / 29      11 / 17

  • Comparison with MCODE

    # of correct complexes (MCODE / Ours)

              acc ≥ 0.6    acc ≥ 0.7    acc ≥ 0.8
    DIP       17 / 29      13 / 22      7 / 13
    Krogan    6 / 24       5 / 16       2 / 8
    Gavin     23 / 32      19 / 27      8 / 17

  • Comparison with CFinder

    # of correct complexes (CFinder / Ours)

              acc ≥ 0.6    acc ≥ 0.7    acc ≥ 0.8
    DIP       19 / 28      14 / 22      10 / 13
    Krogan    28 / 28      13 / 19      5 / 11
    Gavin     25 / 29      20 / 26      13 / 16

  • Different Biological Processes between Two Groups of Species

    Given two groups of species, each with a metabolic network

    Find those reactions or metabolic pathways which belong to most of the networks in one group but not in the other.

    Hypothesis: There must exist a set of reactions or metabolic pathways in one group of species but not in the other.

  • Repeated Patterns Make Genome Assembly Difficult

    Peng, Leung, Yiu and Chin, “IDBA - Iterative de Bruijn Graph de Novo Assembler”, RECOMB 2010 (to appear).

  • De novo Assembling

    [Figure: genome with unknown sequence → sequencing → reads (45 – 140bp) → assembling → genome with known sequence]

    However, there are many problems

  • Problems

    Error in reads
      1-2% error rate per nucleotide
      e.g. 1% error rate, 75bp read length: ~1 – (1 – 1%)^75 = 53% of reads have an error

    Gap
      Positions with no read cover

    Repeat
      length of repeat ≥ read length: impossible to assemble

    Repeats in E. coli:

    Length    Number of repeats
    30        3899
    40        2784
    50        2248
    100       1074
    200       536
    300       345
    500       200
    1000      101
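    The 53% figure follows from assuming that each base of a read is wrong independently with the per-nucleotide error rate. A minimal check (the function name is illustrative):

```python
def fraction_reads_with_error(error_rate, read_length):
    """Probability that a read contains at least one erroneous base,
    assuming independent errors at every position."""
    return 1 - (1 - error_rate) ** read_length


print(round(fraction_reads_with_error(0.01, 75), 2))  # 0.53
print(round(fraction_reads_with_error(0.02, 75), 2))
```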

  • De novo Assembling

    [Figure: sequencing and assembling a genome; labels: gap, repeat, error, contig, new gap]

  • Assembling Problem

    Input: A set of reads from a genome

    Objective: Construct contigs of the genome
      Accuracy (>99.9%)
      Coverage
      Length: N50 (length of the shortest contig in the set of longest contigs that together cover ≥ 50% of the genome)
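    N50 as defined on this slide can be computed by adding up contig lengths from longest to shortest until half of the genome is covered. The sketch below assumes the contigs do not overlap, so that summed length stands in for coverage; the function name and the toy numbers are illustrative.

```python
def n50(contig_lengths, genome_length):
    """Length of the shortest contig among the longest contigs that together
    cover at least 50% of the genome; 0 if the contigs never reach 50%."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_length:
            return length
    return 0


# Toy example: 10 + 8 = 18 >= 15, so the answer is 8.
print(n50([2, 4, 6, 8, 10], genome_length=30))
```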

  • Existing Approaches

    Greedy
      SSAKE (Warren, 2007), SHARCGS (Dohm, 2007)
      Work well in high-coverage and error-free data

    [Figure: a contig is extended while the overlap is ≥ k; extension stops when the overlap is < k (favors small k) or at an error or repeat (favors large k)]

  • String Graph

    Edena (Hernandez, 2008)
      Vertex: read
      Edge: overlap ≥ k
      Simple path: contig

    Handles some errors (dead-ends); other errors remain

    Large graph (requires memory)

    Works in high-coverage and error-free data

    [Figure, Ex: k = 5: reads GTACTCT, TACTCTA, CTCTAGC, CTAGCTG, TAGCTGA, AGCTGCT, CTGCTCC; contig: GTACTCTAGCTG; the branch CTCC… is a dead-end]

  • De Bruijn Graph

    Velvet (Zerbino, 2008)
      Vertex: k-mer in read (length-k substring)
      Edge: overlap of k – 1 bp
      Simple path: contig

    Advantage:
      Gain information in error reads
      Better filtering

    [Figure, Ex: k = 5: reads GTACTCT, TACTCTA, CTCTAGC, CTAGCTG; vertices GTACT, TACTC, ACTCT, CTCTA, TCTAG, CTAGC, TAGCT, AGCTG]
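    The k = 5 example can be rebuilt in a few lines: every k-mer of every read becomes a vertex, consecutive k-mers within a read are joined by an edge, and each maximal simple path is read off as a contig. This is only a sketch of the general de Bruijn idea with made-up function names; it is not Velvet's implementation and ignores reverse complements, errors and coverage.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Vertices are k-mers; edges join consecutive k-mers within a read."""
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def simple_path_contigs(nodes, succ, pred):
    """Spell out every maximal non-branching path as a contig."""
    contigs = []
    for start in sorted(nodes):
        # Skip k-mers that merely continue an unambiguous path.
        if len(pred[start]) == 1 and len(succ[next(iter(pred[start]))]) == 1:
            continue
        contig, node = start, start
        while len(succ[node]) == 1:
            nxt = next(iter(succ[node]))
            if len(pred[nxt]) != 1:
                break
            contig += nxt[-1]  # each step adds one new base
            node = nxt
        contigs.append(contig)
    return contigs


reads = ["GTACTCT", "TACTCTA", "CTCTAGC", "CTAGCTG"]
nodes, succ, pred = de_bruijn(reads, k=5)
contigs = simple_path_contigs(nodes, succ, pred)
print(len(nodes), contigs)  # 8 ['GTACTCTAGCTG']
```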

  • Filtering

    Sequencing depth: 75
    Read length: 75bp

    String graph
      a correct read appears ~1 time
      an error read appears ~1 time

    De Bruijn graph
      a correct 30-mer appears ~45 times
      an error 30-mer appears ~1 – 2 times
      Error reads can be filtered easily

    Reads:

    ACGCCATCACGTTC
    CGCCTTCACGTTCT
    CCATCACGTTCTCA
    CCATCACGCTCTCA
    CATCACGTTCTCAG

    Correct 7-mers
      CATCACG: 4 times
      ATCACGT: 3 times
      TCACGTT: 4 times

    Error 7-mers
      CTTCACG: 1 time
      TTCACGT: 1 time
      ATCACGC: 1 time
      TCACGCT: 1 time
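    The 7-mer counts on this slide can be reproduced by counting every length-7 substring of the five reads. The sketch below uses a made-up function name; in a real assembler a multiplicity threshold would then be applied to discard rare k-mers, but choosing that threshold is outside this example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


reads = [
    "ACGCCATCACGTTC",
    "CGCCTTCACGTTCT",
    "CCATCACGTTCTCA",
    "CCATCACGCTCTCA",
    "CATCACGTTCTCAG",
]
counts = kmer_counts(reads, 7)
for kmer in ["CATCACG", "ATCACGT", "TCACGTT", "CTTCACG", "TTCACGT", "ATCACGC", "TCACGCT"]:
    print(kmer, counts[kmer])
```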

  • Problem of Existing Algorithms

  • Find Suitable k

    Target sequence: GTACTACTATGC

    Reads: GTACTAC, TACTACT, ACTACTA, CTACTAT, ACTATGC

    k = 5 (branch problem): vertices GTACT, TACTA, ACTAC, CTACT, ACTAT, CTATG, TATGC

    k = 6 (gap problem): vertices GTACTA, TACTAC, ACTACT, CTACTA, TACTAT, ACTATG, CTATGC

  • No suitable k

                                                  Large k    Small k    Target
    Branch problem (due to repeat / error)        Less       More       Less
    Gap problem (due to low coverage / error)     More       Less       Less

    Partly solved by selecting a suitable k, but short contigs are still produced.

    Remark: these problems appear in greedy and string graph algorithms too.
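    The trade-off can be observed on the five reads of the "Find Suitable k" example (GTACTAC, TACTACT, ACTACTA, CTACTAT, ACTATGC). The sketch below builds the de Bruijn graph for k = 5 and k = 6 and counts branching vertices (more than one successor) and vertices with no successor; a second such vertex beyond the true end of the sequence indicates a gap. The function name and the two counts are illustrative measures chosen here, not metrics from the slides.

```python
from collections import defaultdict


def graph_stats(reads, k):
    """Return (number of branching k-mers, number of k-mers without successor)."""
    succ = defaultdict(set)
    nodes = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
    branches = sum(1 for n in nodes if len(succ[n]) > 1)
    dead_ends = sum(1 for n in nodes if len(succ[n]) == 0)
    return branches, dead_ends


reads = ["GTACTAC", "TACTACT", "ACTACTA", "CTACTAT", "ACTATGC"]
print(graph_stats(reads, 5))  # (1, 1): one branch at TACTA, no gap
print(graph_stats(reads, 6))  # (0, 2): no branch, but the path breaks after TACTAT
```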

  • Iterative de Bruijn Graph Assembler

    Construct an algorithm that achieves a better result than any single choice of k.

    Start with small k
      fewer gaps but more branches
      many short contigs

    Iterate to higher k
      to resolve branches
      to get longer contigs by merging.

  • Simulated Data

    Genome
      Escherichia coli (O157:H7 str. EC4115)
      5.6 million bp

    Sequencing depth: 30x
    Read length: 75bp
    Error rate: 1%
    Paired-end
      Insert distance: 250bp

    Requirements for contigs
      Accuracy > 99.9%
      Length > 100bp

  • Comparison (Simulated)

    N50: the minimum length such that all contigs at least that long together cover more than 50% of the genome

    [Figure: E. coli, L = 75, depth = 30, error rate = 1%]

  • Other Statistics

               time     Mem      k        Contigs#   N50      max len.   avg. len.   cov.     wrong #   wrong len.
    velvet     155s     1641M    30       1369       19284    96905      2652        94.6%    19        9813
    edena      957s     678M     40       4672       5104     46908      900         97.2%    650       72019
    abyss      1113s    1749M    40       1390       22109    87118      2966        95.1%    66        34998
    IDBA       370s     360M     25-50    1550       63218    217365     2210        97.5%    10        3935
    optimal                      50       1561       63218    217365     2051        99.1%

  • Other Setting

                                High Coverage     Low Coverage      High Coverage     Low Coverage
                                Low Error Rate    Low Error Rate    High Error Rate   High Error Rate
                                (100x, 0.5%)      (30x, 1%)         (100x, 2%)        (30x, 2%)
    Edena (string graph)        63256             5104              53491             147
    Velvet (de Bruijn graph)    63214             24772             59285             16527
    Abyss (de Bruijn graph)     58678             22109             50009             10992
    IDBA (our algorithm)        63218             63218             59287             32612

  • Real Biological Data

    Genome
      Bacillus Subtilis

    Sequencing depth: ~45x
    Read length: 75bp
    Error rate: ~1%

    Requirement for contigs
      Length > 100bp

  • Comparison (Real)

    Bacillus Subtilis, L = 75, depth = 45, error rate = 1%


  • Other Statistics

               time    Mem     k        Contigs#   N50       max len.   avg. len.   total len.
    velvet     89s     893M    35       476        35136     164023     8580        4.08M
    edena      649s    632M    40       926        19423     66455      4444        4.11M
    abyss      729s    923M    45       445        30081     134067     9184        4.09M
    IDBA       313s    310M    25-50    283        122574    602412     14489       4.1M

  • Conclusions

    IDBA outperforms existing assembling algorithms in
      Contig length
      Accuracy

    Main idea
      Using multiple values of k
      Guarantee better results by increasing values of k

    Can be downloaded at http://www.cs.hku.hk/~alse/idba/

  • Future Works

    Develop a more effective algorithm for paired-end reads

    Assembling transcriptome (RNA)
      Different expression levels
      Alternative splicing

    Metagenomic sequencing
      Mixed genomes with different coverages

  • Thanks and Questions
