PatternHunter: faster and more sensitive homology search
By Bin Ma, John Tromp and Ming Li
B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086 洪錫全 B92902087 郭立翔
Agenda
PatternHunter: Spaced Seed, Algorithm, Performance
PatternHunter II: Algorithm, Performance
Translated PatternHunter
PatternHunter – Spaced Seed
Outline
A short review of BLAST.
Some definitions and background.
What are the differences and similarities between BLAST and PatternHunter?
Why is PatternHunter better?
Nonconsecutive seeds
Proof
BLAST Algorithm
Find seeded matches
Extend to HSPs (High-scoring Segment Pairs)
Gapped extension, dynamic programming
Report significant local alignments
A short review about BLAST
Find hits. BLAST first scans the database for words that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit.
A short review about BLAST
Find HSPs. An HSP (High-scoring Segment Pair) is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal within a relatively short distance of one another.
A short review about BLAST
Generate gapped alignments. This means that two or more HSPs in BLAST with scores well below 38 bits can, in combination, rise to statistical significance. If any one of these HSPs is missed, so may be the combined result.
A short review about BLAST
In summary, the new gapped BLAST algorithm requires two non-overlapping hits of score at least T, within a distance A of one another, to invoke an ungapped extension of the second hit. If the HSP generated has a normalized score of at least Sg bits, then a gapped extension is triggered.
Some definitions, some background
Similarity
How similar are two sequences? Usually this means the probability that the same symbol appears at any given aligned position of the two sequences.
Sensitivity
The probability of finding a local alignment.
Specificity
Of all the local alignments found, the fraction that are homologous.
Define the Seed
Defining the seed:
w -> weight, the number of positions to match (Blastn: 11, MegaBlast: 28)
model -> relative positions of the letters for each w
m -> length of the model “window”
Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics Vol. 18 no. 3, 2002
PatternHunter's most sensitive model:
1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1
m = 18
w = 11
Seed parameters (letters 0 and 1):
1: exact match required
0: no match required, any value
The Blastn seed is all “1”s
Seed, Hit, Homology
What is a seed? Seeds determine how an algorithm looks for hits.
What is a hit? Hits indicate a similarity that may indicate a homology.
Human-Mouse genome homology (four alignment blocks, two rows each; a hit is marked in the original figure):
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
Example:
Consider the following two sequences:
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
What is the difference between BLAST and PatternHunter in finding the seed?
BLAST uses “consecutive seeds”
In BLAST, we often use the consecutive model with weight 11.
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
→ 11111111111 → … → 11111111111 ←
However, it fails to find the alignment between the two sequences: the longest run of consecutive matches is only 9.
Consecutive seeds
There is also a dilemma for BLAST-type searches.
Dilemma
Sensitivity needs shorter seeds: too many random hits, slow computation.
Speed needs longer seeds: distant homologies are lost.
PatternHunter uses “non-consecutive seeds”
In PatternHunter, we often use the spaced model with weight 11 and length 18.
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
       111010010100110111
Consecutive vs. Nonconsecutive?
The non-consecutive seed is the primary difference and strength of PatternHunter
Blastn: 1 1 1 1 1 1 1 1 1 1 1
PatternHunter: 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1
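The comparison on the example pair can be checked with a few lines of Python. This is only a sketch: the function name seed_hits is made up for illustration, and the two sequences are assumed to be already aligned without gaps, as on the slide.

```python
def seed_hits(seq_a, seq_b, model):
    """Return every offset where all '1' positions of the model fall on matching letters."""
    care = [i for i, c in enumerate(model) if c == "1"]
    usable = min(len(seq_a), len(seq_b))
    hits = []
    for start in range(usable - len(model) + 1):
        if all(seq_a[start + i] == seq_b[start + i] for i in care):
            hits.append(start)
    return hits


S1 = "GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT"
S2 = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"
BLASTN_MODEL = "1" * 11                 # consecutive seed, weight 11
PH_MODEL = "111010010100110111"         # spaced seed, weight 11, length 18

print(seed_hits(S1, S2, BLASTN_MODEL))  # → []
print(seed_hits(S1, S2, PH_MODEL))      # → [7]
```

The consecutive seed finds nothing, while the spaced seed hits once, at 0-based offset 7.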
A trivial comparison between spaced and consecutive seeds
Consider the seeds 111 and 1101.
To make seed 111 fail, we can use 110110110110…, which has 66.66% similarity.
But we can prove that seed 1101 hits every sufficiently long region whose similarity is above 61%.
Reference: Ming Li, NHC2005
Proof
Suppose there is a length-100 region which is not hit by 1101.
We can break the region into blocks of the form 1^a 0^b.
Apart from the last block, the blocks fall into the following few cases:
10^b for b >= 1
110^b for b >= 2
1110^b for b >= 2
In each block, similarity <= 3/5. The last block has at most 3 matches.
So, in total there are at most 61 matches in 100 positions. The similarity is <= 61%.
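The bound can also be checked by exhaustive dynamic programming rather than by the block argument. The sketch below (the function name is hypothetical) tracks the last few bits of a 0/1 match string and keeps, for each suffix, the largest number of 1s reachable without the seed ever hitting.

```python
def max_matches_without_hit(seed, length):
    """Largest number of 1s in a 0/1 string of the given length that the seed never hits."""
    m = len(seed)
    care = [i for i, c in enumerate(seed) if c == "1"]
    best = {(): 0}  # suffix (up to m-1 most recent bits) -> max number of 1s so far
    for _ in range(length):
        nxt = {}
        for suffix, ones in best.items():
            for bit in (0, 1):
                window = suffix + (bit,)
                if len(window) == m and all(window[i] for i in care):
                    continue  # the seed would hit here, so this extension is not allowed
                key = window[1:] if len(window) == m else window
                if nxt.get(key, -1) < ones + bit:
                    nxt[key] = ones + bit
        best = nxt
    return max(best.values())


print(max_matches_without_hit("111", 100))   # → 67
print(max_matches_without_hit("1101", 100))  # → 61
```

Seed 111 can be dodged by a region with 67 matches out of 100, while seed 1101 can be dodged with at most 61, in agreement with the proof.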
Formalize
Given an i.i.d. sequence (homology region) with Pr(1) = p and Pr(0) = 1 - p for each bit:
1100111011101101011101101011111011101
Which seed is more likely to hit this region?
BLAST seed: 11111111111
Spaced seed: 111*1**1*1**11*111
Expect Less, Get More
Lemma: The expected number of hits of a weight-W, length-M seed model within a length-L region with homology level p is (L-M+1) p^W.
Proof. E(#hits) = ∑_{i=1 … L-M+1} p^W = (L-M+1) p^W ■
Example: In a region of length 64 with p = 0.7:
Pr(BLAST seed hits) = 0.3
E(# of hits by BLAST seed) = 1.07
Pr(optimal spaced seed hits) = 0.466, 50% more
E(# of hits by spaced seed) = 0.93, 14% less
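The two expected values in the example follow directly from the lemma. A minimal sketch, with a made-up function name:

```python
def expected_hits(region_len, model, p):
    """Expected number of hits (L - M + 1) * p**W for a seed model given as a 0/1 string."""
    weight = model.count("1")                    # W
    positions = region_len - len(model) + 1      # L - M + 1
    return max(positions, 0) * p ** weight


print(round(expected_hits(64, "1" * 11, 0.7), 2))              # → 1.07
print(round(expected_hits(64, "111010010100110111", 0.7), 2))  # → 0.93
```

Both seeds have weight 11, so the only difference is the number of positions: 54 for the length-11 seed and 47 for the length-18 seed.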
Why Is Spaced Seed Better?
A wrong, but intuitive, proof: seed s, interval I, similarity p.
E(#hits) = Pr(s hits) × E(#hits | s hits)
Thus: Pr(s hits) = L p^w / E(#hits | s hits)
For the optimized spaced seed, E(#hits | s hits) comes from shifting the seed against itself:
Shifted seed              Non-overlap   Prob
111*1**1*1**11*111
 111*1**1*1**11*111       6             p^6
  111*1**1*1**11*111      6             p^6
   111*1**1*1**11*111     6             p^6
    111*1**1*1**11*111    7             p^7
…
For the spaced seed, the divisor is 1 + p^6 + p^6 + p^6 + p^7 + …
For the BLAST seed, the divisor is bigger: 1 + p + p^2 + p^3 + …
Simulated sensitivity curves
Observations of spaced seeds
Seed models with different shapes can detect different homologies.
Two consequences:
Some models may detect more homologies than others. → More sensitive homology search: PatternHunter I
Several seed models can be used simultaneously to hit more homologies. → Approaching 100% sensitive homology search: PatternHunter II
PatternHunter – Algorithm & Performance
Outline
Hit generation
Hit extension
Gapped extension
Performance
Hit generation
An index is created for each position in the query sequence.
Hit generation
Similar to MegaBlast: hash tables.
Encode A, T, C, G as the binary codes 00, 01, 10, 11 respectively.
Compute the index for every window of one of the sequences and record the offsets in the hash table.
Hit generation
An example: we want to find hits between sequences S and T.
For sequence T:
Seed (window of T): A T A T G C A T
Model (spaced seed): 1 1 0 1 0 1 1 0
Codes: A = 00, T = 01, C = 10, G = 11
Letters under the 1s: A T T C A = 0001011000 = 88
Scan the window along T.
Weight = 5, so the value is between 0 and 2^10 - 1.
After filling in the hash table…
[Figure: hash table indexed 0, 1, 2, 3, …, 87, 88, …; each entry lists the positions in T that produce that index, or NULL]
For each position in S:
1. Calculate the int value.
2. Find hits by looking the value up in the table.
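The indexing and lookup steps can be sketched as follows. The encoding is the one on the slide; the function names and the second sequence used at the end are made up for illustration, and letters other than A, T, C, G are not handled.

```python
from collections import defaultdict

CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}  # encoding from the slide


def spaced_index(window, model):
    """Pack the letters under the model's 1s into an integer, 2 bits per letter."""
    value = 0
    for letter, flag in zip(window, model):
        if flag == "1":
            value = (value << 2) | CODE[letter]
    return value


def build_table(t, model):
    """Map each index value to the list of positions in T that produce it."""
    table = defaultdict(list)
    for pos in range(len(t) - len(model) + 1):
        table[spaced_index(t[pos:pos + len(model)], model)].append(pos)
    return table


def find_hits(s, table, model):
    """For each position in S, look its index up and report (S position, T position) pairs."""
    hits = []
    for pos in range(len(s) - len(model) + 1):
        key = spaced_index(s[pos:pos + len(model)], model)
        hits.extend((pos, t_pos) for t_pos in table.get(key, []))
    return hits


MODEL = "11010110"
print(spaced_index("ATATGCAT", MODEL))                          # → 88
print(find_hits("ATCTACAT", build_table("ATATGCAT", MODEL), MODEL))  # → [(0, 0)]
```

The two windows differ at the positions where the model has 0s, yet they share the index 88 and are reported as a hit.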
Hash tables: space required
[Figure: the same hash table, with the index array and the lists of positions in T]
Index array: 4^w integers
Position lists: |T| integers
Total: 4^(w+1) + 4|T| bytes
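The total is just the two arrays at 4 bytes per integer. A small sketch, where the function name and the int_bytes parameter are assumptions for illustration:

```python
def hash_table_bytes(weight, text_len, int_bytes=4):
    """Bytes for 4**weight index entries plus one position entry per letter of T."""
    return int_bytes * 4 ** weight + int_bytes * text_len


print(hash_table_bytes(5, 1000))   # → 8096
print(hash_table_bytes(11, 0))     # → 16777216
```

For weight 11 the index array alone takes 4^12 bytes, which is 16 MiB, before any positions are stored.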
Does building a hash table cost a lot?
If the number of hits found per index is large, the cost of computing the index is negligible by comparison.
Hit extension
HSP: High-scoring Segment Pair
Scan the hits with a window and choose the highest-scoring one.
[Figure: dot plot of S against T with the chosen hit marked]
Hit extension
Set the midpoint of the chosen hit as the cut point and split the graph into 4 quadrants.
Hit extension
Then run Smith-Waterman in 2 of the 4 quadrants, until the dropoff score is reached.
[Figure: dot plot of S against T with the two Smith-Waterman quadrants marked]
Cost = 1/2 * O(mn)
Hit extension
If the resulting segment pair has a score below a certain minimum, ignore it.
Otherwise we have an HSP and go on to the next step, gapped extension.
Hit extension
A question: when extending in 2 directions, how do we synchronize the score?
Find the best way to extend an HSP to the left across gaps.
To extend an HSP, we try all candidates from a diagonal-sorted set.
Penalty for gap open + gap extension + cropping
Gapped Extension
[Figure: the search front moves from left to right; candidate HSPs are marked “Optimal Left” or “Too Far Right”]
Descriptions in the paper
We use a red-black tree for this.
An HSP is inserted when the optimal alignment to its left is found.
It is retired from the tree once newly generated HSPs are too far beyond its right endpoint to make use of it.
Thought 1
The first one will be inserted. Fast, but may not find the best one.
[Figure: candidate HSPs between Start and End, labeled Better and Worse]
Thought 1
Not a complete HSP.
Insert an HSP when the optimal alignment to its left is found.
Thought 2
Close but short (bad). Far but long (good). Insert both HSPs.
[Figure: next turn, with Tree 1 and Tree 2]
Thought 1
Retired alignments are put into a priority queue according to their scores.
Performance
Ref. Bin Ma, John Tromp, Ming Li, Bioinformatics Vol. 18 no. 3, 2002
Ref. Altschul, S.F. et al. (1997) Nucleic Acids Res., 25, 3389–3402.
PatternHunter II
Outline
Overview
PatternHunter II design
Computing hit probability
Finding seeds set
Seed performance
PHII performance
Overview
PatternHunter: spaced seed
PH2: designed for better sensitivity
“Achieve a sensitivity approaching that of Smith-Waterman with a speed similar to the default Blastn”
Extend the single spaced seed to multiple ones
Two main problems:
Large memory required for multiple hash tables
Complexity of finding the optimal seed combination
PatternHunter II design
A hash table is built for each seed.
All hits generated from all hash tables are used for gapped extension.
In two-hit mode, two nearby hits can be from different hash tables.
PatternHunter II design (cont.)
Large memory problem:
Divide into smaller segments
e.g., with k = 8, w = 11, and n = 32 x 10^6, the hash tables use about 256 MBytes of memory
Extend alignments across division boundaries
Some alignments may still be lost
Computing hit probability
Use DP, but extend the algorithm from a single seed to multiple seeds.
Definition
Homologous region R with length L
The substring from i to j is denoted by R[i : j]
A set of k seeds A = {a1, … , ak}
A hits R if there is an ai that hits R
p is called the similarity level of R if R has p% identities
Computing hit probability (cont.)
For a binary string b and an index i with |b| <= i <= L, define f(i, b) as the probability that A hits R[0 : i], given that R[0 : i] ends with b.
The goal is to find f(L, ε).
For any i > |b|, we have f(i, b) = (1 - p) f(i, 0b) + p f(i, 1b).
We can compute f(i, b) from other f(i', b') computed earlier.
Computing hit probability (cont.)
Definition
b is compatible with a seed a if b[|b|-j] = 1 whenever a[|a|-j] = 1, for 0 < j ≦ min(|a|, |b|).
Define B to be the set of binary strings that are not hit by A but are compatible with some a in A.
B(x) denotes the longest proper prefix of x that is in B.
Computing hit probability (cont.)
First, ε is in B.
Suppose b is in B; then b is compatible with some a in A by definition. Therefore, 1b is also compatible with some a in A.
If 1b is not in B, it must be hit by some a' in A, so f(i, 1b) = 1.
If 0b is not in B, it cannot be hit by A, therefore it cannot be compatible with any a in A, so f(i, 0b) = f(i-|b|+|b'|, 0b'), where 0b' = B(0b).
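To make the quantity concrete, here is a simplified dynamic program for the hit probability of a seed set. It is not the paper's Algorithm DP: instead of restricting the states to the set B, it simply tracks every possible suffix of the last (longest seed length - 1) bits, so it is only practical for short seeds. All names are made up for illustration, and the brute-force function is there only as a cross-check on tiny regions.

```python
from itertools import product


def hits(region, seeds):
    """True if any seed matches somewhere in the 0/1 region."""
    for seed in seeds:
        care = [i for i, c in enumerate(seed) if c == "1"]
        for start in range(len(region) - len(seed) + 1):
            if all(region[start + i] == 1 for i in care):
                return True
    return False


def hit_probability(seeds, length, p):
    """Pr[some seed hits an i.i.d. region with Pr(1) = p], by DP over recent-bit suffixes."""
    span = max(len(s) for s in seeds)
    cares = [(len(s), [i for i, c in enumerate(s) if c == "1"]) for s in seeds]
    no_hit = {(): 1.0}  # suffix -> probability of reaching it with no hit so far
    for _ in range(length):
        nxt = {}
        for suffix, prob in no_hit.items():
            for bit, weight in ((1, p), (0, 1.0 - p)):
                window = suffix + (bit,)
                # A seed of length n hits if it matches the last n bits of the window.
                if any(len(window) >= n and all(window[len(window) - n + i] for i in care)
                       for n, care in cares):
                    continue
                key = window[-(span - 1):] if span > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + prob * weight
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


def brute_force(seeds, length, p):
    """Same probability by enumerating all 2**length regions (tiny lengths only)."""
    total = 0.0
    for region in product((0, 1), repeat=length):
        if hits(region, seeds):
            ones = sum(region)
            total += p ** ones * (1 - p) ** (length - ones)
    return total


print(hit_probability(["111"], 3, 0.5))  # → 0.125
```

The number of states grows as 2^(span - 1), which is the cost the restriction to compatible strings in B is meant to avoid.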
Computing hit probability (cont.)
Ref. Li, M. et al. (2004) Comput. Biol., 2, 417–440.
Computing hit probability (cont.)
The k-hits probability can also be computed.
Change f(i, b) to f(i, b, k).
We already have k = 1. By induction, compute each f(i, b, k) from f(i, b, k-1).
Computing hit probability (cont.)
Computing hit probability (cont.)
Complexity
It is proved that computing the hit probability of multiple seeds is NP-hard.
The time complexity of the algorithm is [expression missing]
Computing hit probability (cont.)
Implementing Algorithm DP on a PC
It took 0.70 sec to compute the hit probability for a set of 16 weight-11 seeds with length < 21 on a random region with length 64.
It only took 0.37 sec for the same number of seeds and the same length when the weight was changed to 12.
The running time largely depends on the maximum number of 0s in every seed.
Finding seeds set
We cannot enumerate all possible seed sets with Algorithm DP: their number is exponential!
Also, finding the optimal spaced seed set is proved NP-hard.
Use a “greedy” method.
Finding seeds set (cont.)
Compute the first seed a1, which maximizes the hit probability of the set {a1}.
Then compute the second seed a2 for the set {a1, a2}. Then a3…
Compute ai until either:
the desired number of seeds is reached, or
the desired hit probability is reached.
Finding seeds set (cont.)
The greedy method may not optimize the hit probability.
It is still time-consuming: e.g., it took 12 CPU days for a Pentium 4 3GHz PC to compute a set of 16 weight-11 seeds, each no longer than 21.
It takes much longer if the seeds become slightly longer.
A different approach is needed.
Finding seeds set (cont.)
Suppose we already have N seeds, and C is the candidate set for the (N+1)-th seed.
For each c in C, estimate the hit probability in m random region samples.
m is reasonably large, such as 500.
Remove the worst-performing half from C, and increase m to 2m.
Repeat until only one seed is left.
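The halving procedure can be sketched as below. This is an illustration under stated assumptions, not the PatternHunter II code: the function names, the default region length of 64 and similarity 0.7, and the fixed random seed are all choices made here.

```python
import random


def hits(region, seeds):
    """True if any seed matches somewhere in the 0/1 region."""
    for seed in seeds:
        care = [i for i, c in enumerate(seed) if c == "1"]
        for start in range(len(region) - len(seed) + 1):
            if all(region[start + i] == 1 for i in care):
                return True
    return False


def estimate(seeds, p, region_len, samples, rng):
    """Fraction of random i.i.d. regions that the seed set hits."""
    hit = 0
    for _ in range(samples):
        region = [1 if rng.random() < p else 0 for _ in range(region_len)]
        hit += hits(region, seeds)
    return hit / samples


def pick_next_seed(chosen, candidates, p=0.7, region_len=64, m=500, rng=None):
    """Drop the worse half of the candidates each round, doubling the sample size m."""
    rng = rng or random.Random(0)
    pool = list(candidates)
    while len(pool) > 1:
        ranked = sorted(pool,
                        key=lambda c: estimate(chosen + [c], p, region_len, m, rng),
                        reverse=True)
        pool = ranked[:(len(pool) + 1) // 2]  # keep the better half
        m *= 2
    return pool[0]


print(pick_next_seed([], ["1" * 30, "11"], region_len=20, m=50))  # → 11
```

Early rounds spend few samples on many candidates; the larger sample sizes are reserved for the few candidates that survive.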
Seed performance
Two ways to increase the sensitivity:
Increase the number of seeds
Reduce the weight of a single seed
Both increase running time.
The sensitivity gained by doubling the number of seeds is approximately equal to that gained by reducing the weight of a single seed by 1.
At high similarity levels, doubling the number of seeds achieves better sensitivity.
Seed performance (cont.)
[Figure: sensitivity curves] From low to high:
Solid curves: using the first k = (1, 2, 4, 8, 16) weight-11 seeds
Dashed curves: single optimal seeds of weight w = (10, 9, 8, 7)
Comparison
Sensitivity / Speed
PatternHunter II
Blast
Smith-Waterman algorithm: SSearch
SSearch Configuration
Smith-Waterman algorithm
A sub-program in the FASTA package
FASTA package: ftp://ftp.virginia.edu/pub/FASTA/
Common Environment
Score scheme
Match = 1
Mismatch = -1
Gap open = -5
Gap extension = -1
Local alignments with scores >= 16
Common Environment
DNA sequences
2 sets of human and mouse EST sequences
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
month.est_human.Z
month.est_mouse.Z
Pentium IV 3GHz Linux PC
Term Explanation
EST: Expressed Sequence Tag
A unique stretch of DNA within a coding region of a gene that is useful for identifying the gene.
A short sub-sequence of a transcribed sequence.
Term Explanation
Coding Regions
Regions of DNA/RNA sequences that code for proteins.
A coding region usually starts with a start codon (ATG) and ends with a stop codon.
The coding region of a gene is the portion of DNA that is transcribed into mRNA and translated into proteins.
Repeat Masking
Fact:
Long runs of identical letters occur, especially of As and Ts (an example will be shown later).
Solution:
Turn every run of ten or more identical letters into Ns.
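The masking rule fits in one regular expression. A minimal sketch, with a made-up function name, assuming upper-case sequences over A, C, G, T:

```python
import re


def mask_repeats(seq, min_run=10):
    """Replace every run of min_run or more identical letters with Ns of the same length."""
    pattern = r"([ACGT])\1{%d,}" % (min_run - 1)
    return re.sub(pattern, lambda match: "N" * len(match.group(0)), seq)


print(mask_repeats("ACG" + "A" * 12 + "TT"))  # → ACGNNNNNNNNNNNNTT
```

Runs shorter than ten letters are left untouched, and the sequence length is preserved so that positions still line up.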
SSearch Result
Number of human ESTs: 4
Number of mouse ESTs: 2005
EST example (show)
Optimal Versus Sub-Optimal
Neither PatternHunter nor Blast tries to compute the optimal alignments for the homologies they have found.
Q: Why not find the optimal alignments?
Ans: use Blast or PH2 to “detect” the homologies, then compute the optimal alignments.
Found
If SSearch finds a local alignment with score = x for a pair of ESTs,
and PatternHunter II finds a local alignment with score >= x/2,
then the pair of ESTs counts as “found”.
Sensitivity Definition
Smith-Waterman finds y pairs of ESTs with local alignment score at least x.
Another program finds y' of the y pairs, each with alignment score >= x/2.
Sensitivity is the ratio y' / y.
Blastn Configuration
Version 2.2.6, from NCBI’s website
-F F option, to turn off the low-complexity region filtering
Weight-11 seed: 11111111111
Speed comparison
Sensitivity comparison
From low to high:
Dashed: Blastn, seed weight 11
Solid: PH II with 1, 2, 4, 8 seeds of weight 11
Compare with other seeds
From left to right:
PH II, two weight-11 seeds
PH II, one weight-10 seed
1101100101000101101, HMM model
Seed Selection
Use heuristic or exponential-time algorithms.
For the general seed selection problem: PTAS (polynomial time approximation scheme).
Homology Search
Time-consuming
DNA-DNA searches: Blastn
Translated DNA-protein searches: tBlastx, tPH
Protein-protein searches
Small query and database sizes
Conclusion
Optimized spaced seeds
Blastn & PH II: same sensitivity, speeds up by 5-100 times
Optimized multiple spaced seeds
PH II & Smith-Waterman: approximately the same sensitivity, >1000 times faster
Translated PatternHunter
Outline
What’s translated search?
BLAST’s translated search
Translated PatternHunter
Performance
What’s translated search?
To translate a DNA sequence into a protein sequence for alignment with another protein sequence.
But what’s “translation”?
What’s translation?
In biology, “translation” means to translate a nucleotide sequence into amino acids (AA) with a universal genetic code map, three nucleotides (one codon) at a time.
The DNA sequence is first transcribed into an RNA sequence in which all T’s are replaced by U’s.
AUG UCA CUA GAA UCG UUA UAG
Met Ser Leu Glu Ser Leu .
The Genetic code
We can use translation in homology search since the genetic code is universal.
Degeneracy: some codons map to the same AA. They usually differ in the third position of the codon.
Translation is one-way: DNA → Protein
AUG UCA CUA GAA UCG UUA UAG
Met Ser Leu Glu Ser Leu .
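The codon-by-codon mapping can be sketched in a few lines. The table below is the standard genetic code written in the usual T, C, A, G ordering; the function name and the use of "X" for unknown codons are choices made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a DNA or RNA string, reading complete codons from the first letter."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


print(translate("AUGUCACUAGAAUCGUUAUAG"))  # → MSLESL*
```

In one-letter codes the example reads Met Ser Leu Glu Ser Leu followed by a stop.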
Why do we need translated search?
When a DNA database or a protein database is not available:
Blastx: DNA query, protein database
tBlastn: protein query, DNA database
To find very distant homologies:
tBlastx: DNA query & database, both translated
Slowest, but finds more functional & structural homology in addition to sequence homology. Why?
Substitution Matrix
Some AAs are similar in their chemical or physical properties.
Substitution is no longer just match/mismatch!
The stop codon is assigned the most negative score in BLAST and tPH.
PAM (Point Accepted Mutation): based on global alignment of closely related proteins (1% divergence for PAM1)
BLOSUM (BLOcks SUbstitution Matrix): based on local alignment of divergent proteins (62% similarity for BLOSUM 62)
Substitution Matrix
Short alignments need to be relatively strong to rise above background noise, so they can only detect closely related homologies.

Query Length   Substitution Matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)

From top to bottom: related to divergent.
Adapted from NCBI: substitution matrix
BLAST’s translated search
The same in tBlast, tBlastn, tBlastx
Aligns the 6-frame translations of the DNA sequence against another protein sequence
Reading Frame of DNA Sequence
The DNA sequence can be read in six reading frames, three in the forward and three in the reverse direction.
[Figure: a double-stranded sequence with its six reading-frame translations; one frame is labeled “Open Reading Frame”]
Translations shown in the figure:
Asn Asp Thr Arg Ile Val Ile
Met Thr Val Glu Ser Leu .
. His . Asn Arg Tyr Ser
His Cys . Phe Arg . Leu
. Val Leu Ile Thr Ile Ala
Ile Val Ser Ser Asp Asn Tyr
BLAST’s translated search
1. Translate the DNA sequence into all 6 possible frames.
2. Align each frame against the protein sequence, just like BLASTp.
3. The pairs with significant scores are reported.
How good is significant?
The expected number of alignments scoring S or greater between two sequences of lengths m and n is
E = m n K e^(–λS), or E = m n e^(–S’) with S’ = λS – ln K
where K and λ, used for normalization, depend on the sequence composition.
A different K, λ is used for each frame.
Non-coding sequences tend to yield alignments of marginal significance.
Translated PatternHunter
The version of PH for translated search.
Compared with PatternHunter, tPH uses very different algorithms for hit generation and gapped extension.
Hit Generation in tPH
Weight = 5 instead of 11
Space complexity: 20^5 ~ 4^11 in PH
Length = 6 or 7
Does not require exact matches
Hit = all five pairs have scores ≥ 0 and the total score is above a tolerance T
Uses BLOSUM 62
Multiple seeds are used
Hit Generation in tPH
Seed = 1011, T = 7
[Figure: a DNA subject with three of its reading-frame translations, indexed and compared against the query]
Query: Met Phe Ala Gln Ser Val Leu
Indexed subject frames:
Met Thr Val Glu Ser Leu .
. His . Asn Arg Tyr Ser
Asn Asp Thr Arg Ile Val Ile
All possible hits, first group:
Words: Met X Ala Gln, Met X Val Glu, Met X Ala Glu
Scores: 5 + 4 + 5 ≥ 6, 5 + 0 + 5 ≥ 6, 5 + 0 + 2 ≥ 6
All possible hits, second group:
Words: Gln X Val Leu, Arg X Val Ile, Arg X Val Leu
Scores: 5 + 4 + 4 ≥ 6, 5 + 4 + 2 ≥ 6, 1 + 4 + 2 ≥ 6
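The hit test can be sketched with a small score table. Only the handful of BLOSUM 62 entries needed for the first query word are included; the function name, the error on missing pairs, and the use of >= against the threshold are assumptions made for this illustration.

```python
# A few BLOSUM 62 entries, enough for the words used below (not the full matrix).
SCORES = {
    ("M", "M"): 5, ("A", "A"): 4, ("V", "V"): 4, ("Q", "Q"): 5,
    ("E", "E"): 5, ("A", "V"): 0, ("Q", "E"): 2,
}


def pair_score(a, b):
    """Look a pair up in either order; raise KeyError if it is not in the small table."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def tph_hit(query_word, subject_word, model, threshold):
    """Return (is_hit, total): every scored pair must be >= 0 and the total must reach the threshold."""
    scores = [pair_score(q, s)
              for q, s, flag in zip(query_word, subject_word, model) if flag == "1"]
    total = sum(scores)
    return all(s >= 0 for s in scores) and total >= threshold, total


print(tph_hit("MFAQ", "MTVE", "1011", 7))  # → (True, 7)
print(tph_hit("MFAQ", "MTAQ", "1011", 7))  # → (True, 14)
```

The first call scores Met/Met, Ala/Val and Gln/Glu as 5 + 0 + 2; the position under the 0 of the seed (Phe against Thr) is never looked at.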
Gapped Extension in tPH
The same as in BLAST?
BLAST can’t handle frame shift errors.
Huh?
Frame Shift Error
When a single nucleotide is deleted or inserted, it causes the reading frame to shift.
[Figure: the example sequence read in two frames, with translations Met Thr Val Glu Ser Leu . and Asn Asp Thr Arg Ile Val Ile]
BLAST can’t detect such variation: it aligns the 6 frames with the subject independently.
In fact, most frame shift mutations completely abolish the protein’s function. They are usually lethal.
Frame Shift Error
In this example:
BLAST can find at most two separate segments.
tPH can connect them with a single deletion of “C”.
How?
Gapped Extension in tPH
tPH regards the DNA sequence as a sequence of overlapping codons.
It uses a modified Smith-Waterman algorithm that takes frame shifts into account:
Substitution: S(i-1, j-3) + σ(p_i, n[j-2..j])
Insertion of DNA: S(i, j-1) + frameshift
Insertion of DNA: S(i, j-2) + frameshift
Insertion of AA: S(i, j-3) + gap
Deletion of AA: S(i-1, j) + gap
Scoring Scheme
[Figure: dynamic programming matrix for n = GACACUAGAAUCG against P = Asp Arg Tyr Ser; the first row and column are 0, and the cells shown along the path hold 6, 4, 3, 8, 6, 10]
Query: GAC ACU A-- GAA --- UCG
Asp Thr --- Glu Tyr Ser
Subject: Asp --- --- Arg Tyr Ser
Recurrence with the scores used in the example:
S(i-1, j-3) + σ(p_i, n[j-2..j])
S(i, j-1) + frameshift (-1)
S(i, j-2) + frameshift (-1)
S(i, j-3) + gap (-2)
S(i-1, j) + gap (-2)
Step labels in the figure: Insertion, Frameshift, Deletion, Substitution (three times)
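The recurrence can be turned into a short local-alignment sketch. This is not the tPH implementation: the substitution score σ is replaced here by a plain match/mismatch score (5 and -4, chosen for illustration) instead of BLOSUM 62, while the gap (-2) and frameshift (-1) penalties are the ones on the slide. Only the best score is returned, with no traceback.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def frameshift_sw(protein, dna, match=5, mismatch=-4, gap=-2, frameshift=-1):
    """Best local score of a protein (one-letter codes) against DNA, allowing frame shifts."""
    dna = dna.upper().replace("U", "T")
    rows, cols = len(protein) + 1, len(dna) + 1
    S = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            options = [0]                                   # local alignment may restart
            if j >= 1:
                options.append(S[i][j - 1] + frameshift)    # skip one nucleotide
            if j >= 2:
                options.append(S[i][j - 2] + frameshift)    # skip two nucleotides
            if j >= 3:
                options.append(S[i][j - 3] + gap)           # codon with no amino acid
            if i >= 1:
                options.append(S[i - 1][j] + gap)           # amino acid with no codon
            if i >= 1 and j >= 3:
                aa = CODON_TABLE.get(dna[j - 3:j], "X")     # codon ending at position j
                sigma = match if aa == protein[i - 1] else mismatch
                options.append(S[i - 1][j - 3] + sigma)     # substitution
            S[i][j] = max(options)
            best = max(best, S[i][j])
    return best


print(frameshift_sw("MK", "ATGAAA"))   # → 10
print(frameshift_sw("MK", "ATGCAAA"))  # → 9
```

In the second call one extra C sits between the two codons; the frameshift step absorbs it for a penalty of 1, so both codons stay in a single alignment instead of two separate segments.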
Performance Evaluation
4407 human expressed sequence tag (EST) sequences
Each is split in the middle into subject and query
Number of Alignments Found
[Figure: number of alignments found] T = 12 for BLAST; 3x speed; higher sensitivity
Ref. Derek Kisman et al., Bioinformatics Vol. 21 no. 4, 2005
Unique Alignments Found
Most contain frameshifts
Using 4 Seeds
Differs from PH2
Short seeds
High dependency between seeds
Reference
PatternHunter
Bin Ma, John Tromp, Ming Li, Bioinformatics Vol. 18 no. 3, 2002
Ming Li, NHC2005
PatternHunter II
Li, M., Ma, B., Kisman, D. and Tromp, J. (2004) Comput. Biol., 2, 417–440.
NTU R94922059 林語君’s PowerPoint
tPatternHunter
Derek Kisman, Ming Li, Bin Ma, and Li Wang, Bioinformatics Vol. 21 no. 4, 2005
Others
Wikipedia http://en.wikipedia.org/wiki
NCBI http://www.ncbi.nlm.nih.gov
Thank you for your attention!