Bioinformatics 2007-I, Prof. Mirko Zimic
Monday: Pairwise sequence alignment. Local and global alignment. Scoring matrices. Dynamic programming algorithms. Dot plot.
Wednesday: Pairwise sequence alignment: using the programs Clustal and Macaw, and online servers.
“Nothing in biology makes sense except in the light of evolution”
T. Dobzhansky
“To align” = “to compare”
Finches of the Galápagos Islands observed by Charles Darwin on the voyage of HMS Beagle
Sequence alignment is similar to other types of comparative analysis
Involves scoring similarities and differences among a group of related entities
Homology
“Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about.”
David B. Wake
Similarity ≠ Homology
1) At least 25% similarity over at least 100 amino acids likely indicates homology.
2) Homology is an evolutionary statement that means “descent from a common ancestor”:
- common 3D structure
- usually common function
- all or nothing: one cannot say "50% homologous"
C O M P A R A T I V E A N A L Y S I S
Alignment algorithms model evolutionary processes
Derivation from a common ancestor through incremental change due to DNA replication errors, mutations, damage, or unequal crossing-over.
[Figure: tree of sequences derived from the ancestor GATTACCA (GATGACCA, GATTATCA, GATCATCA, GATTGATCA, GATACCA), with branches labelled substitution, insertion and deletion.]
Only extant sequences are known; ancestral sequences are postulated.
The term homology implies a common ancestry, which may be inferred from observations of sequence similarity.
Mutations that do not kill the host may carry over to the population. Rarely are mutations kept or rejected by natural selection.
Sequence Alignments
• Why align?
- Can delineate sequence elements that are functionally significant
- Illuminates phylogenetic relationships
• Algorithms for sequence alignment
- Dynamic programming
- Dot-matrix
- Word-based algorithms
- Bayesian methods
What is Meant by Alignment?
Identical nucleotide sequences (trivial example)
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTCAGTGCTAGA
Score = 20 (20 × 1)
Imperfect match
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTGCTAGA
Score = 11
A better alignment
ATTCGGCATTCAGTGCTAGA
ATTCGGCATT----GCTAGA
Score = 14 = 10 + 6 + 4 × (-0.5), where -0.5 is the gap penalty per gap position
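The scores on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name `score_alignment` is made up for illustration, and the mismatch score of 0 is an assumption inferred from the slide's numbers (the slide states only the match score of 1 and the gap penalty of -0.5).

```python
def score_alignment(top, bottom, match=1.0, mismatch=0.0, gap=-0.5):
    """Score two already-aligned strings column by column.

    mismatch=0.0 is an assumption; the slide does not state it explicitly.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0.0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one penalty per gap position
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("ATTCGGCATTCAGTGCTAGA", "ATTCGGCATTCAGTGCTAGA"))  # 20.0
print(score_alignment("ATTCGGCATTCAGTGCTAGA", "ATTCGGCATT----GCTAGA"))  # 14.0
```

Introducing the four-position gap costs 2 points but recovers six matches that the ungapped comparison loses.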
Beware of aligning apples and oranges [and grapefruit]!
Paralogous versus orthologous;
genomic versus cDNA;
mature versus precursor.
Alignments can be performed on DNA sequences as well as on protein sequences…
Why Do We Want To Compare Sequences?
wheat  --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
?????  KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
[Figure labels: SwissProt, Homology?, EXTRAPOLATE, ??????]
Why Does It Make Sense To Align Sequences ?
-Evolution is our Real Tool.
-Nature is LAZY and Keeps re-using Stuff.
-Evolution is mostly DIVERGENT
Same Sequence -> Same Ancestor
Same Sequence -> Same Function, Same 3D Fold, Same Origin
Comparing Is Reconstructing Evolution
An Alignment is a STORY
[Figure: the ancestral sequence ADKPKRPLSAYMLWLN is duplicated; mutations + selection turn the two copies into ADKPKRPKPRLSAYMLWLN and ADKPRRPLS-YMLWLN.]
Aligning the two descendants tells the story:
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
The alignment shows a mutation, an insertion and a deletion.
Evolution is NOT Always Divergent…
[Figure: two AFGPs with (ThrAlaAla)n repeats, labelled N and S: one similar to trypsinogen, one NOT similar to trypsinogen.]
SIMILAR sequences BUT DIFFERENT origin
…But in MOST cases, you may assume it is.
How Do Sequences Evolve ?
CONSTRAINED Genome Positions Evolve SLOWLY
EVERY Protein Family Has its Own Level Of Constraint
Family            Ks    Ka
Histone 3         6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
Globin            5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8
Rates in substitutions/site/billion years, as measured on mouse vs human (80 million years). Ks: synonymous mutations; Ka: non-synonymous (non-neutral) mutations.
How Do Sequences Evolve? The Amino Acid Venn Diagram
To make things worse, every residue has its own personality.
[Figure: Venn diagram of amino acid properties, with overlapping sets labelled Hydrophobic, Aliphatic, Aromatic, Polar and Small.]
How Do Sequences Evolve ?
In a structure, each Amino Acid plays a Special Role
OmpR, C-terminal domain
In the core, SIZE MATTERS
On the surface, CHARGE MATTERS
How Do Sequences Evolve ?
Accepted Mutations Depend on the Structure
In the core: Big -> Big, Small -> Small, NO DELETION
On the surface: Charged -> Charged, Small <-> Big or Small, DELETIONS
How Can We Compare Sequences ?
To Compare Two Sequences, We need:
Their Function
Their Structure
We Do Not Have Them !!!
How Can We Compare Sequences ?
We will Need To Replace Structural Information With Sequence Information.
Same Sequence -> Same Function, Same 3D Fold, Same Origin
It CANNOT Work ALL THE TIME !!!
How Can We Compare Sequences ?
To compare sequences, we need to compare residues.
We need to know how much it COSTS to SUBSTITUTE
- an Alanine into an Isoleucine
- a Tryptophan into a Glycine
- …
The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX.
How to derive that matrix?
How Can We Compare Sequences? Making a Substitution Matrix
-Take 100 nice pairs of protein sequences, easy to align (80% identical).
-Align them…
-Count each mutation in the alignments:
  -25 Tryptophans into Phenylalanine
  -30 Isoleucines into Leucine
  …
-For each mutation, set the substitution score to the log-odds ratio:
Score = log(Observed / Expected by chance)
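A minimal sketch of the log-odds calculation follows. The function name, the base-2 logarithm, the background frequencies and the total number of aligned pairs are all assumptions chosen for illustration; only the idea of dividing the observed frequency by the frequency expected by chance comes from the slide.

```python
import math


def log_odds_score(pair_count, total_pairs, freq_a, freq_b):
    """Log-odds score for aligning residue a with a different residue b.

    pair_count / total_pairs is the observed frequency of the pair.
    2 * freq_a * freq_b is the frequency expected by chance (either order).
    Base 2 is an arbitrary choice made for this sketch.
    """
    observed = pair_count / total_pairs
    expected = 2 * freq_a * freq_b
    return math.log2(observed / expected)


# Hypothetical numbers: 30 Ile/Leu pairs among 1000 aligned pairs,
# with background frequencies of 5% (Ile) and 9% (Leu).
score = log_odds_score(pair_count=30, total_pairs=1000, freq_a=0.05, freq_b=0.09)
print(round(score, 2))
```

A positive score means the pair is seen more often than chance predicts, so the substitution is tolerated; a negative score means it is seen less often.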
How Can We Compare Sequences? Making a Substitution Matrix
The diagonal indicates how conserved a residue tends to be. W is VERY conserved.
Some residues mutate more easily into other, similar residues.
Cysteines that make disulfide bridges and those that do not get averaged.
How Can We Compare Sequences? Using a Substitution Matrix
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation, Insertion, Deletion
Given two sequences and a substitution matrix, we must compute the CHEAPEST alignment.
Most popular substitution matrices:
• PAM250
• Blosum62 (most widely used)
Raw Score
TPEA
APGA
Score = 1 + 6 + 0 + 2 = 9
• Question: Is it possible to get such a good alignment by chance only?
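The raw score is just the sum of the substitution scores of the aligned pairs. The sketch below uses a tiny lookup table holding only the four pair scores that appear on the slide; the names `SCORES`, `pair_score` and `raw_score` are invented for this example, and a real matrix would cover all 20 x 20 residue pairs.

```python
# Pair scores copied from the slide's worked example only.
SCORES = {("T", "A"): 1, ("P", "P"): 6, ("E", "G"): 0, ("A", "A"): 2}


def pair_score(a, b):
    """Look up a substitution score; the table is treated as symmetric."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def raw_score(seq1, seq2):
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(raw_score("TPEA", "APGA"))  # 9
```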
Scoring an Alignment
Insertions and Deletions
Gap Penalties
• Opening a gap is more expensive than extending it
Seq A  GARFIELDTHE----CAT
       |||||||||||    |||
Seq B  GARFIELDTHELASTCAT
The first position of the gap costs the Gap Opening Penalty; each further position costs the Gap Extension Penalty.
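As a rough illustration of an affine gap penalty, the sketch below charges the opening penalty once and the extension penalty for every further position. The function name and the values 10 and 0.5 are hypothetical, and the convention that the opening penalty covers the first gap position is an assumption; programs differ on this point.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of the given length under an affine penalty.

    Assumed convention: gap_open pays for the first position,
    gap_extend for each additional one. Values are illustrative only.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# The 4-position gap in the GARFIELD example, as one gap:
print(affine_gap_cost(4))      # 11.5
# The same 4 positions as four separate one-position gaps:
print(4 * affine_gap_cost(1))  # 40.0
```

Because opening costs more than extending, one long gap is much cheaper than several short ones, which is why the alignment keeps LAST together.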
How Can We Compare Sequences? Limits of the Substitution Matrices
They ignore non-local interactions and assume that identical residues are equal.
They assume the rate of evolution to be constant.
How Can We Compare Sequences? Limits of the Substitution Matrices
Substitution Matrices Cannot Work !!!
I know… But at least, could I get some idea of when they are likely to do all right?
How Can We Compare Sequences? The Twilight Zone
[Figure: % sequence identity (up to 100) plotted against length. Above about 30% identity: similar sequence, similar structure (same 3D fold). Below about 30%: the Twilight Zone, different sequence, structure ????]
How Can We Compare Sequences? The Twilight Zone
Substitution matrices work reasonably well on sequences that have more than 30% identity over more than 100 residues.
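The rule of thumb above can be written as a small check. This is a sketch under assumptions: identity is computed over the columns where neither sequence has a gap (other definitions exist), and the function names are invented for this example.

```python
def percent_identity(aligned1, aligned2):
    """Percent identity over the columns where neither sequence has a gap."""
    columns = [(a, b) for a, b in zip(aligned1, aligned2)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


def above_twilight_zone(aligned1, aligned2, min_identity=30.0, min_length=100):
    """Apply the slide's rule: more than 30% identity over more than 100 residues."""
    compared = sum(1 for a, b in zip(aligned1, aligned2)
                   if a != "-" and b != "-")
    return (compared > min_length
            and percent_identity(aligned1, aligned2) > min_identity)


top = "ADKPRRP---LS-YMLWLN"
bottom = "ADKPKRPKPRLSAYMLWLN"
print(round(percent_identity(top, bottom), 1))  # 93.3
print(above_twilight_zone(top, bottom))         # False
```

The short example alignment is highly identical but covers only 15 residues, so the rule alone does not vouch for it.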
Major Differences between PAM and BLOSUM
PAM | BLOSUM
Built from global alignments | Built from local alignments
Built from a small amount of data | Built from a vast amount of data
Counting is based on minimum replacement or maximum parsimony | Counting is based on groups of related sequences counted as one
Performs better for finding global alignments and remote homologs | Better for finding local alignments
Higher PAM series means more divergence | Lower BLOSUM series means more divergence
How Can We Compare Sequences? Which Matrix Shall I Use
PAM: Distant Proteins, High Index (PAM 350)
BLOSUM: Distant Proteins, Low Index (Blosum30)
Choosing The Right Matrix may be Tricky…
• GONNET 250 > BLOSUM62 > PAM 250.
• But this will depend on:
  • The Family.
  • The Program Used and Its Tuning.
• Insertions, Deletions?
HOW Can We Align Two Sequences ?
- Dot Matrices
- Global Alignments
- Local Alignments
Global Alignments
-Take 2 Nice Protein Sequences
-A good Substitution Matrix (BLOSUM)
-A Gap Opening Penalty (GOP)
-A Gap Extension Penalty (GEP)
[Figure: affine gap penalty, cost plotted against gap length L, with GOP and GEP marked.]
Parsimony: Evolution takes the simplest path
(So We Think…)
Global Alignments
>Seq1
THEFATCAT
>Seq2
THEFASTCAT
DYNAMIC PROGRAMMING
THEFA-TCAT
THEFASTCAT
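Global Alignments
[Figure: FAST and FAT placed along the two axes of a grid.]
Some of the possible alignments:
----FAT
FAST---

---FAT-
FAST---

--F-AT-
FAST---
Brute Force Enumeration: (L1 + L2)! / (L1! × L2!) alignments to score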
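To get a feel for why brute-force enumeration is hopeless, the count (L1 + L2)! / (L1! × L2!) from the slide can be evaluated directly. The function name below is made up for this sketch; `math.comb` computes the same binomial coefficient without building the large factorials by hand.

```python
import math


def alignment_count(len1, len2):
    """(L1 + L2)! / (L1! * L2!), the count shown on the slide."""
    return math.comb(len1 + len2, len1)


print(alignment_count(3, 4))      # FAT vs FAST: 35
print(alignment_count(100, 100))  # two 100-residue proteins: astronomically many
```

Even for two short proteins the count is far beyond what can be enumerated, which motivates dynamic programming.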
DYNAMIC PROGRAMMING
Construct an optimal alignment of these two sequences:
GATACTA
GATTACCA
Using these scoring rules:
Match: +1
Mismatch: -1
Gap: -1
D Y N A M I C P R O G R A M M I N G
Dynamic Programming Example
[Figure, repeated on each of the following steps: GATACTA along one axis of a lattice and GATTACCA along the other.]
Arrange the sequence residues along a two-dimensional lattice. Vertices of the lattice fall between letters.
The goal is to find the optimal path from one corner of the lattice to the opposite corner.
Each path corresponds to a unique alignment. Which one is optimal?
The score for a path is the sum of its incremental edge scores:
- A aligned with A: Match = +1
- A aligned with T: Mismatch = -1
- T aligned with NULL, or NULL aligned with T: Gap = -1
Incrementally extend the path. Remember the best sub-path leading to each point on the lattice.
[Figure sequence: the lattice is filled in step by step with the best sub-path score at each vertex, starting from 0 in the first corner.]
Trace back to get the optimal path and alignment.
Print out the alignment:
GA-TACTA
GATTACCA
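The lattice walk-through above can be sketched in Python as follows. This is a minimal illustration with the slide's scoring rules (match +1, mismatch -1, gap -1) and a single linear gap penalty, not a reference implementation; the function name is chosen for this example. When several alignments share the best score, the traceback returns only one of them, which may differ from the one printed on the slide.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming (linear gap penalty)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # seq1 residue vs gap
                              score[i][j - 1] + gap)      # gap vs seq2 residue
    # Trace back from the last corner to the first one.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("GATACTA", "GATTACCA")
print(best)  # 4
print(top)
print(bottom)
```

The score of 4 corresponds to six matches, one mismatch and one gap. Filling the table takes time proportional to the product of the two lengths, in contrast to the brute-force enumeration.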
Global Alignments: DYNAMIC PROGRAMMING (Needleman and Wunsch)
Match = 1, Mismatch = -1, Gap = -1
[Figure sequence: the dynamic programming matrix for FAT against FAST, filled in step by step and then traced back.]
FAST
FA-T
Local Alignments
GLOBAL Alignment
LOCAL Alignment
Smith And Waterman (SW)=LOCAL Alignment
Two different types of Alignment
Global Alignment methods:
Needleman & Wunsch (J. Mol. Biol. (1970) 48, 443-453): Problem of finding the best path. Revelation: any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to create a matrix of path “scores”, each entry being the score of the best path leading to that point. Trace the optimal path from one end to the other of the two sequences.
Local Alignment methods:
Smith & Waterman (J. Mol. Biol. (1981) 147, 195-197): Use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest-scoring points in the path graph.
FASTP (Lipman & Pearson (1985), Science 227, 1435-1441)
BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410): don’t report all overlapping paths, but only attempt to find paths if there are words that are high-scoring. This speeds up the alignments considerably.
Global vs. Local Alignment
[Figure: a global alignment and a local alignment of the same pair of sequences, with a high-scoring subsequence and a gap marked.]
Global alignment: best overall alignment, independent of whether local high-scoring sequences are included
Local alignment: alignments involving high-scoring sequences take precedence over global features
G L O B A L & L O C A L S I M I L A R I T Y
Implementations of dynamic programming for global and local similarities
Optimal global alignment
Needleman & Wunsch (1970)
Sequences align essentially from end to end
Optimal local alignment
Smith & Waterman (1981)
Sequences align only in small, isolated regions
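For comparison with the global case, here is a minimal sketch of local alignment in the style of Smith & Waterman. It assumes simple match, mismatch and gap scores of +1, -1 and -1 (illustrative values, not from the slides), keeps only the single best local alignment, and uses made-up example sequences. The two changes relative to global alignment are that cell scores never drop below 0 and that the traceback starts at the highest-scoring cell and stops at a 0.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Best local alignment by dynamic programming (linear gap penalty)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local path
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical sequences sharing only the region GATTACA.
local_score, local_top, local_bottom = smith_waterman("CCCGATTACACCC", "TTGATTACATT")
print(local_score)  # 7
print(local_top)    # GATTACA
print(local_bottom) # GATTACA
```

The unrelated flanking residues are simply left out of the alignment instead of being forced to pair up, which is the behaviour described above as aligning "only in small, isolated regions".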
Filtering low complexity sequences
• Filters out short repeats and low complexity regions from the query sequences before searching the database
• Filtering helps to obtain statistically significant results and reduce the background noise resulting from matches with repeats and low complexity regions
• The output shows which regions of the query sequence were masked
Sequence Periodicities in Kinetoplast DNA
Marini et al. Proc. Natl. Acad. Sci. USA 79, 7664-7668 (1982)
Local Alignments
We now have a PairWise Comparison Algorithm,
We are ready to search Databases
Database Search
[Figure: a QUERY is passed through a comparison engine (SW) against a database; each database sequence gets a score and an E-value, with E-values ranging from about 1e-100 to 10.]
E-values: How many times do we expect such an alignment by chance?