Page 1:

PatternHunter: faster and more sensitive homology search

By Bin Ma, John Tromp and Ming Li

B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086 洪錫全 B92902087 郭立翔

Page 2:

Agenda

PatternHunter: Spaced Seed, Algorithm, Performance

PatternHunter II: Algorithm, Performance

Translated PatternHunter

Page 3:

PatternHunter – Spaced Seed

Page 4:

Outline

A short review of BLAST.

Some definitions and background.

The differences and similarities between BLAST and PatternHunter.

Why is PatternHunter better?

Nonconsecutive seeds

Proof

Page 5:

BLAST Algorithm

Find seeded matches

Extend to HSPs (High-scoring Segment Pairs)

Gapped extension, dynamic programming

Report significant local alignments

Page 6:

A short review about BLAST

Find hits. BLAST first scans the database for words that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit.

Page 7:

A short review about BLAST

Find HSPs. An HSP (High-scoring Segment Pair) is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal within a relatively short distance of one another.

Page 8:

A short review about BLAST

Generate gapped alignment. This means that two or more HSPs in BLAST with scores well below 38 bits can, in combination, rise to statistical significance. If any one of these HSPs is missed, so may be the combined result.

Page 9:

A short review about BLAST

In summary, the new gapped BLAST algorithm requires two non-overlapping hits of score at least T, within a distance A of one another, to invoke an ungapped extension of the second hit. If the HSP generated has a normalized score of at least Sg bits, then a gapped extension is triggered.

Page 10:

Some definitions, some background

Similarity: How similar are two sequences? Usually this means the probability that the same symbol appears at a given aligned position of the two sequences.

Sensitivity: The probability of finding a local alignment.

Specificity: Among all the local alignments found, the fraction that are homologous.

Page 11:

Define the Seed

Defining the seed:

w -> weight, or number of positions to match (Blastn: 11, MegaBlast: 28)

model -> relative position of letters for each w

m -> length of model "window"

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 12:

PatternHunter's most sensitive model

model: 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1

m = 18

w = 11

Seed parameters (letters: 0, 1):

1 – exact match required

0 – no match required, any value

The Blastn seed is all "1"s.

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 13:

Seed, Hit, Homology

What is a seed? Seeds determine how an algorithm looks for hits.

What is a hit? Hits indicate a similarity that may indicate a homology.

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 14:

Human-Mouse genome homology (each pair of lines is one aligned block; a hit is marked on the slide)

GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 15:

Example:

Consider the following two sequences:

GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

What is the difference between BLAST and PatternHunter in finding the seed?

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 16:

BLAST uses "consecutive seeds"

In BLAST, we often use the consecutive model with weight 11.

GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

11111111111 (slid along the alignment from left to right)

However, it fails to find the alignment between the two sequences: the longest run of consecutive matches is 9, which is shorter than the seed.

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 17:

Consecutive seeds

There is also a dilemma for BLAST-type search.

Dilemma:

Sensitivity – needs shorter seeds: too many random hits, slow computation

Speed – needs longer seeds: lose distant homologies

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 18:

PatternHunter uses "non-consecutive seed"

In PatternHunter, we often use the spaced model with weight 11 and length 18.

GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
|| ||||||||| |||||||| ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT

111010010100110111

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002
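
The contrast between the two seed types on the example sequences can be checked directly. The sketch below is only an illustration, not PatternHunter's implementation; the helper names `match_string` and `seed_hits` are made up for this example.

```python
def match_string(a: str, b: str) -> str:
    """Return '1' where two equal-length sequences agree and '0' elsewhere."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def seed_hits(seed: str, matches: str) -> list:
    """Offsets at which every '1' position of the seed falls on a match."""
    care = [j for j, ch in enumerate(seed) if ch == "1"]
    return [
        start
        for start in range(len(matches) - len(seed) + 1)
        if all(matches[start + j] == "1" for j in care)
    ]


S1 = "GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT"
S2 = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"
matches = match_string(S1, S2)

blast_hits = seed_hits("1" * 11, matches)               # consecutive seed, weight 11
spaced_hits = seed_hits("111010010100110111", matches)  # spaced seed, weight 11

print(blast_hits)   # [] because no run of 11 matches exists
print(spaced_hits)  # [7]
```

Both seeds require 11 matching positions, but only the spaced seed can step over the mismatches at positions 12 and 21 of the alignment.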

Page 19:

Consecutive vs. Nonconsecutive?

The non-consecutive seed is the primary difference and strength of PatternHunter.

Blastn: 1 1 1 1 1 1 1 1 1 1 1

PatternHunter: 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1

Reference: Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Page 20:

A trivial comparison between spaced and consecutive seed

Consider 111 and 1101.

To make seed 111 fail, we can use 110110110110… (66.66% similarity).

But we can prove that seed 1101 will hit every sufficiently long region whose similarity is above 61%.

Reference: Ming Li, NHC2005

Page 21:

Proof

Suppose there is a length 100 region which is not hit by 1101.

We can break the region into blocks of the form 1^a 0^b (a ones followed by b zeros).

Apart from the last block, the blocks fall into the following few cases:

1 0^b for b >= 1

11 0^b for b >= 2

111 0^b for b >= 2

In each such block, similarity <= 3/5. The last block has at most 3 matches.

So, in total there are at most 61 matches in 100 positions. The similarity is <= 61%.

Reference: Ming Li, NHC2005
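
The bound in the proof can be cross-checked numerically. The sketch below is an independent check, not part of the original slides: a small dynamic program (the function name `max_matches_without_hit` is hypothetical) that finds the largest number of 1s a 0/1 region of a given length can contain while no placement of the seed hits it.

```python
def max_matches_without_hit(seed: str, length: int) -> int:
    """Largest number of 1s in a 0/1 string of this length that the seed never hits."""
    m = len(seed)
    # best[suffix] = most 1s seen so far among hit-free strings ending in `suffix`
    best = {"": 0}
    for _ in range(length):
        nxt = {}
        for suffix, ones in best.items():
            for bit in "01":
                window = suffix + bit
                # A full-length window is a hit if all '1' positions of the seed match.
                if len(window) == m and all(
                    w == "1" for s, w in zip(seed, window) if s == "1"
                ):
                    continue
                key = window[-(m - 1):] if m > 1 else ""
                value = ones + (bit == "1")
                if nxt.get(key, -1) < value:
                    nxt[key] = value
        best = nxt
    return max(best.values())


print(max_matches_without_hit("1101", 100))  # 61
print(max_matches_without_hit("111", 100))   # 67
```

So a length 100 region can avoid seed 111 with up to 67 matches, but it can avoid seed 1101 with at most 61, in agreement with the proof.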

Page 22:

Formalize

Given an i.i.d. sequence (homology region) with Pr(1) = p and Pr(0) = 1 - p for each bit:

1100111011101101011101101011111011101

Which seed is more likely to hit this region?

BLAST seed: 11111111111

Spaced seed: 111*1**1*1**11*111

Reference: Ming Li, NHC2005

Page 23:

Expect Less, Get More

Lemma: The expected number of hits of a weight W, length M seed model within a length L region with homology level p is

(L - M + 1) p^W

Proof. E(#hits) = sum over i = 1 … L-M+1 of p^W = (L - M + 1) p^W ■

Example: In a region of length 64 with p = 0.7

Pr(BLAST seed hits) = 0.3

E(# of hits by BLAST seed) = 1.07

Pr(optimal spaced seed hits) = 0.466, 50% more

E(# of hits by spaced seed) = 0.93, 14% less

Reference: Ming Li, NHC2005
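
The numbers in this example can be recomputed. The sketch below is one possible way to do it, not the method used in the paper: `expected_hits` applies the lemma, and `hit_probability` runs a small exact dynamic program whose state records which prefixes of the seed are still "alive" at the current position. Both function names are assumptions made for this illustration.

```python
def expected_hits(seed: str, length: int, p: float) -> float:
    """Lemma: (L - M + 1) * p^W for a weight-W, length-M seed."""
    return (length - len(seed) + 1) * p ** seed.count("1")


def hit_probability(seed: str, length: int, p: float) -> float:
    """Exact Pr(seed hits an i.i.d. region of this length with Pr(1) = p)."""
    m = len(seed)
    dont_care = sum(1 << j for j, ch in enumerate(seed) if ch == "0")
    hit_bit = 1 << m
    # Bit j of a state is set when the first j seed positions agree with the
    # last j letters read; bit 0 (the empty prefix) is always set.
    states = {1: 1.0}
    hit = 0.0
    for _ in range(length):
        nxt = {}
        for alive, prob in states.items():
            for is_match, weight in ((True, p), (False, 1.0 - p)):
                # On a mismatch only prefixes whose next seed position is '0' survive.
                survivors = alive if is_match else alive & dont_care
                new_alive = (survivors << 1) | 1
                if new_alive & hit_bit:
                    hit += prob * weight      # absorbed: the region has been hit
                else:
                    nxt[new_alive] = nxt.get(new_alive, 0.0) + prob * weight
        states = nxt
    return hit


BLAST_SEED = "1" * 11
SPACED_SEED = "111010010100110111"
for name, seed in (("BLAST", BLAST_SEED), ("spaced", SPACED_SEED)):
    print(name, round(expected_hits(seed, 64, 0.7), 2),
          round(hit_probability(seed, 64, 0.7), 3))
```

The printed values can be compared with the figures quoted on the slide (1.07 and 0.3 for the BLAST seed, 0.93 and 0.466 for the spaced seed).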

Page 24:

Why Is Spaced Seed Better?

A wrong, but intuitive, proof: seed s, interval I, similarity p

E(#hits) = Pr(s hits) E(#hits | s hits)

Thus: Pr(s hits) = L p^w / E(#hits | s hits)

For the optimized spaced seed, E(#hits | s hits):

Shifted seed              Non-overlap   Prob
111*1**1*1**11*111
 111*1**1*1**11*111       6             p^6
  111*1**1*1**11*111      6             p^6
   111*1**1*1**11*111     6             p^6
    111*1**1*1**11*111    7             p^7
…..

For the spaced seed: the divisor is 1 + p^6 + p^6 + p^6 + p^7 + …

For the BLAST seed: the divisor is bigger: 1 + p + p^2 + p^3 + …

Reference: Ming Li, NHC2005
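
The "non-overlap" column can be reproduced by counting how many required positions of a shifted copy of the seed fall outside the required positions of the original. This is a small illustrative sketch; `non_overlap` is a name chosen here, not one taken from the slides.

```python
def non_overlap(seed: str, shift: int) -> int:
    """Number of '1' positions of the shifted seed not shared with the unshifted seed."""
    care = {j for j, ch in enumerate(seed) if ch == "1"}
    shifted = {j + shift for j in care}
    return len(shifted - care)


SPACED_SEED = "111*1**1*1**11*111".replace("*", "0")
BLAST_SEED = "1" * 11

spaced_counts = [non_overlap(SPACED_SEED, k) for k in range(1, 5)]
blast_counts = [non_overlap(BLAST_SEED, k) for k in range(1, 5)]
print(spaced_counts)  # [6, 6, 6, 7]
print(blast_counts)   # [1, 2, 3, 4]
```

A second hit of the spaced seed one position later needs 6 additional matches, while a second hit of the consecutive seed needs only 1, which is why the consecutive seed's hits cluster and its divisor is larger.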

Page 25:

Simulated sensitivity curves

Reference: Ming Li, NHC2005

Page 26:

Observations of spaced seeds

Seed models with different shapes can detect different homologies.

Two consequences:

Some models may detect more homologies than others. More sensitive homology search: PatternHunter I

Several seed models can be used simultaneously to hit more homologies. Approaching 100% sensitive homology search: PatternHunter II

Reference: Ming Li, NHC2005

Page 27:

PatternHunter – Algorithm & Performance

Page 28:

Outline

Hit generation

Hit extension

Gapped extension

Performance

Page 29:

Hit generation

Index created for each position in the query sequence

Page 30:

Hit generation

Similar to MegaBlast: hash tables

Encode A, T, C, G into the binary codes 00, 01, 10, 11 respectively.

Compute the value at each position of one of the sequences and record the offsets in the hash table.

Page 31:

Hit generation

An example: now we want to find hits between sequences S and T.

Page 32:

Spaced seed

For sequence T:

Sequence: A T A T G C A T

Model:    1 1 0 1 0 1 1 0

Encoding: A = 00, T = 01, C = 10, G = 11

Selected letters: A T T C A -> 0001011000 = 88

Scan the model along the sequence.

Weight = 5, so the value is between 0 and 2^10 - 1.
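
The value computed on this slide can be reproduced in a few lines. The sketch below assumes the letter encoding shown above and only the four letters A, T, C, G; `CODE` and `seed_key` are illustrative names.

```python
# 2-bit encoding from the slide: A = 00, T = 01, C = 10, G = 11
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def seed_key(seq: str, start: int, model: str) -> int:
    """Pack the letters under the '1' positions of the model into one integer."""
    key = 0
    for j, ch in enumerate(model):
        if ch == "1":
            key = (key << 2) | CODE[seq[start + j]]
    return key


key = seed_key("ATATGCAT", 0, "11010110")
print(format(key, "010b"), key)  # 0001011000 88
```

Because each selected letter contributes 2 bits, a model of weight w always yields a value below 4^w, which is what lets the value index a table directly.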

Page 33:

After filling in the hash table…

[Figure: hash table indexed by value (0, 1, 2, 3, …, 87, 88, …). Each entry holds a list of positions in T, for example (10, 19, 34), (NULL), (14), (10, 48, 134), (2, 8, 33).]

For each position in S:

1. Calculate the integer value.

2. Find hits for S by looking up the value.
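
The two steps above can be sketched as follows. This is a simplified illustration: it uses a Python dictionary in place of the fixed-size table of 4^w entries, assumes the sequences contain only A, T, C and G, and all function names are made up for the example.

```python
from collections import defaultdict

CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def seed_key(seq: str, start: int, model: str) -> int:
    """Pack the letters under the '1' positions of the model into one integer."""
    key = 0
    for j, ch in enumerate(model):
        if ch == "1":
            key = (key << 2) | CODE[seq[start + j]]
    return key


def build_index(t: str, model: str) -> dict:
    """Map each value to the list of positions in T that produce it."""
    table = defaultdict(list)
    for pos in range(len(t) - len(model) + 1):
        table[seed_key(t, pos, model)].append(pos)
    return table


def find_hits(s: str, table: dict, model: str) -> list:
    """Return (position in S, position in T) pairs that share a value."""
    hits = []
    for i in range(len(s) - len(model) + 1):
        for j in table.get(seed_key(s, i, model), ()):
            hits.append((i, j))
    return hits


MODEL = "11010110"
T = "ATATGCAT"
S = "GGATCTTCATGG"   # hypothetical query, chosen so that one window matches T
hits = find_hits(S, build_index(T, MODEL), MODEL)
print(hits)  # [(2, 0)]
```

The window of S starting at position 2 differs from T at the model's "0" positions, yet it still produces the same value 88 and is reported as a hit.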

Page 34:

Hash tables: space required

[Figure: the same hash table. The index array has 4^w entries; the position lists hold |T| entries in total.]

Index array: 4^w integers

Positions in T: |T| integers

Total: 4^(w+1) + 4|T| bytes (4 bytes per integer)

Page 35:

Does it cost a lot to build a hash table?

If the number of hits found for one index is large, the cost of computing the index is relatively negligible.

Page 36:

Hit extension

HSP: High-scoring Segment Pair

Scan those hits with a window, and choose the highest-scoring one.

Page 37:

Hit extension

[Figure: S plotted against T, with the chosen hit marked.]

Page 38:

Hit extension

Set the midpoint of the chosen hit as the cut point, and split the graph into 4 parts.

Page 39:

Hit extension

[Figure: S plotted against T, split into 4 parts at the cut point.]

Page 40:

Hit extension

Then run Smith-Waterman in 2 of the 4 parts, until it reaches the drop-off score.

Page 41:

Hit extension

[Figure: S plotted against T, with Smith-Waterman applied in two of the four parts.]

Cost = 1/2 * O(mn)

Page 42:

Hit extension

If the resulting segment pair has a score below a certain minimum, ignore it.

Otherwise we have an HSP and go on to the next step, gapped extension.

Page 43:

Hit extension

A question: when extending in 2 directions, how do we synchronize the score?

Page 44:

Gapped Extension

The goal is to find the best way to extend an HSP to the left across gaps.

To extend an HSP we try all candidates from a diagonal-sorted set.

Penalty for gap open + gap extension + cropping

Page 45:

Gapped Extension

[Figure: search front.]

Page 46:

[Figure: HSPs processed from left to right, labeled "Optimal Left" and "Too Far Right".]

Page 47:

[Figure: HSPs processed from left to right, labeled "Too Far Right" and "Optimal Left".]

Page 48:

Descriptions in the paper

We use a red-black tree for this.

An HSP is inserted when the optimal alignment to its left is found.

It is retired from the tree once newly generated HSPs are too far beyond its right endpoint to make use of it.

Page 49:

Thought 1

The first one will be inserted.

Fast

Page 50:

Thought 1

[Figure: candidate extensions between "Start" and "End", labeled "Better" and "Worse".]

May not find the best one.

Page 51:

Thought 2

Not a complete HSP

Insert an HSP when the optimal alignment to its left is found.

Page 52:

Thought 2

[Figure: two candidate HSPs, one "Close but short (Bad)" and one "Far but long (Good)".]

Insert both HSPs.

Next turn

Page 53:

Thought 1

[Figure: Tree 1 and Tree 2.]

Retired alignments are put into a priority queue according to their scores.

Page 54:

Performance

Ref. Bin Ma, John Tromp, Ming Li, Bioinformatics, Vol. 18 no. 3, 2002

Ref. Altschul, S.F. et al. (1997) Nucleic Acids Res., 25, 3389–3402.

Page 55:

PatternHunter II

Page 56:

Outline

Overview

PatternHunter II design

Computing hit probability

Finding seeds set

Seed performance

PHII performance

Page 57:

Overview

PatternHunter: spaced seed

PH2: designed for better sensitivity

"Achieve a sensitivity approaching that of Smith-Waterman with a speed similar to the default Blastn"

Extend the single spaced seed to multiple ones.

Two main problems:

Large memory required for multiple hash tables

Complexity of finding the optimal seed combination

Page 58:

PatternHunter II design

A hash table is built for each seed.

All hits generated from all hash tables are used for gap extension.

In two-hit mode, two nearby hits can be from different hash tables.

Page 59:

PatternHunter II design (cont.)

Large memory problem:

Divide into smaller segments, e.g., with k = 8, w = 11, and n = 32 x 10^6, the hash tables use about 256 MBytes of memory.

Extend alignments across division boundaries.

Still may lose alignments.

Page 60:

Computing hit probability

Use DP, but extend the algorithm from a single seed to multiple seeds.

Definitions:

Homologous region R with length L

The substring from i to j is denoted by R[i : j]

A set of k seeds A = {a1, … , ak}

A hits R if there is an ai that hits R

p is called the similarity level of R if R has p% identities

Page 61:

Computing hit probability (cont.)

For a binary string b and an integer i with |b| <= i <= L, define f(i, b) as the probability that A hits R[0 : i], given that R[0 : i] ends with b.

The goal is to find f(L, ε). For any i > |b|, we have

f(i, b) = (1 - p) f(i, 0b) + p f(i, 1b)

We can compute f(i, b) from other f(i', b') computed earlier.

Page 62:

Computing hit probability (cont.)

Definition: b is compatible with a seed a if b[|b|-j] = 1 whenever a[|a|-j] = 1, for 0 < j ≦ min(|a|, |b|).

Let B be the set of binary strings that are not hit by A but are compatible with some a in A.

B(x) denotes the longest proper prefix of x that is in B.

Page 63:

Computing hit probability (cont.)

First, ε is in B.

Suppose b is in B; then b is compatible with some a in A by definition. Therefore, 1b is also compatible with some a in A.

If 1b is not in B, it must be hit by some a' in A, so f(i, 1b) = 1.

If 0b is not in B, it cannot be hit by A; therefore it cannot be compatible with any a in A, so f(i, 0b) = f(i - |b| + |b'|, 0b'), where 0b' = B(0b).
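
For experimenting with multiple seeds, the hit probability of a seed set can also be computed with a simpler state-based dynamic program. The sketch below is not Algorithm DP from the paper and does not use the f(i, b) recurrences; it is an alternative formulation, written for this illustration, that tracks for every seed which of its prefixes are still consistent with the letters read so far. The function name is hypothetical.

```python
def seed_set_hit_probability(seeds: list, length: int, p: float) -> float:
    """Exact Pr(at least one seed hits an i.i.d. region with Pr(1) = p)."""
    dont_care = [sum(1 << j for j, ch in enumerate(s) if ch == "0") for s in seeds]
    hit_bits = [1 << len(s) for s in seeds]
    # One bitmask per seed: bit j is set when the seed's first j positions
    # agree with the last j letters read (bit 0 is always set).
    states = {tuple(1 for _ in seeds): 1.0}
    hit = 0.0
    for _ in range(length):
        nxt = {}
        for alive, prob in states.items():
            for is_match, weight in ((True, p), (False, 1.0 - p)):
                new = tuple(
                    ((a if is_match else a & dc) << 1) | 1
                    for a, dc in zip(alive, dont_care)
                )
                if any(n & h for n, h in zip(new, hit_bits)):
                    hit += prob * weight   # some seed has just been completed
                else:
                    nxt[new] = nxt.get(new, 0.0) + prob * weight
        states = nxt
    return hit


single = seed_set_hit_probability(["1101"], 20, 0.7)
pair = seed_set_hit_probability(["1101", "111"], 20, 0.7)
print(round(single, 4), round(pair, 4))
```

Adding a seed can only raise the hit probability, which is the effect PatternHunter II relies on. The number of states grows with the number of "0" positions in the seeds, so this sketch is meant for small examples.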

Page 64:

Computing hit probability (cont.)

Ref. Li, M. et al. (2004) Comput. Biol., 2, 417–440.

Page 65:

Computing hit probability (cont.)

We can also compute the k-hits probability.

Change f(i, b) to f(i, b, k).

We already have k = 1. By induction, compute each f(i, b, k) from f(i, b, k-1).

Page 66:

Computing hit probability (cont.)

Ref. Li, M. et al. (2004) Comput. Biol., 2, 417–440.

Page 67:

Computing hit probability (cont.)

Complexity

It is proved that computing the hit probability of multiple seeds is NP-hard.

The time complexity of the algorithm is [formula not recoverable from the transcript].

Page 68:

Computing hit probability (cont.)

Implementing Algorithm DP on a PC:

It took 0.70 sec to compute the hit probability for a set of 16 weight-11 seeds with length < 21 on a random region of length 64.

It took only 0.37 sec for the same number of seeds and the same length when the weight was changed to 12.

The running time largely depends on the maximum number of 0s in any seed.

Page 69:

Finding seeds set

We cannot enumerate all possible seed sets with Algorithm DP: their number is exponential!

Also, finding the optimal spaced seed set is proved NP-hard.

Use a "greedy" method.

Page 70:

Finding seeds set (cont.)

Compute the first seed a1, which maximizes the hit probability of the set {a1}.

Then compute the second seed a2 for the set {a1, a2}. Then a3…

Compute ai until we either

achieve the desired number of seeds, or

achieve the desired hit probability.
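
The greedy procedure can be sketched on a toy scale as follows. This is an illustration under simplifying assumptions, not the PatternHunter II code: candidates are all seeds of a given weight and bounded length that start and end with 1 (weight at least 2), and the hit probability is computed with a simple state-based dynamic program written for this example. All names and parameter values are hypothetical.

```python
from itertools import combinations


def hit_probability(seeds: list, length: int, p: float) -> float:
    """Exact Pr(at least one seed hits an i.i.d. region with Pr(1) = p)."""
    dont_care = [sum(1 << j for j, ch in enumerate(s) if ch == "0") for s in seeds]
    hit_bits = [1 << len(s) for s in seeds]
    states, hit = {tuple(1 for _ in seeds): 1.0}, 0.0
    for _ in range(length):
        nxt = {}
        for alive, prob in states.items():
            for is_match, weight in ((True, p), (False, 1.0 - p)):
                new = tuple(((a if is_match else a & dc) << 1) | 1
                            for a, dc in zip(alive, dont_care))
                if any(n & h for n, h in zip(new, hit_bits)):
                    hit += prob * weight
                else:
                    nxt[new] = nxt.get(new, 0.0) + prob * weight
        states = nxt
    return hit


def candidate_seeds(weight: int, max_len: int) -> list:
    """All seeds of this weight (>= 2), up to max_len long, starting and ending with '1'."""
    seeds = []
    for m in range(weight, max_len + 1):
        for inner in combinations(range(1, m - 1), weight - 2):
            ones = {0, m - 1, *inner}
            seeds.append("".join("1" if j in ones else "0" for j in range(m)))
    return seeds


def greedy_seed_set(weight: int, max_len: int, k: int, length: int, p: float):
    """Add, one at a time, the seed that maximizes the hit probability of the set."""
    pool, chosen, probs = candidate_seeds(weight, max_len), [], []
    for _ in range(k):
        best = max((c for c in pool if c not in chosen),
                   key=lambda c: hit_probability(chosen + [c], length, p))
        chosen.append(best)
        probs.append(hit_probability(chosen, length, p))
    return chosen, probs


chosen, probs = greedy_seed_set(weight=3, max_len=5, k=2, length=16, p=0.7)
print(chosen, [round(x, 4) for x in probs])
```

Each round evaluates every remaining candidate exactly, which is affordable only for such tiny parameters; a stopping rule on the hit probability could replace the fixed k.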

Page 71:

Finding seeds set (cont.)

It may not optimize the hit probability.

It is still time-consuming, e.g. it took 12 CPU days for a Pentium 4 3GHz PC to compute a set of 16 weight-11 seeds, each no longer than 21.

It takes much longer if the seeds become slightly longer.

A different approach is needed.

Page 72:

Finding seeds set (cont.)

Suppose we already have N seeds, and C is the candidate set for the (N+1)-th seed.

For each c in C, estimate the hit probability on m random region samples.

m is reasonably large, such as 500.

Remove the worst-performing half from C, and increase m to 2m.

Repeat until only one seed is left.
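
The halving procedure can be sketched as below. It is a minimal illustration under assumptions of this example: regions are random 0/1 strings with Pr(1) = p, a candidate is scored by how many sampled regions it hits that the existing seeds miss (which ranks candidates the same way as the estimated hit probability of the enlarged set), and the function names, the default region length and the fixed random seed are all hypothetical.

```python
import random


def hits(seed: str, region: str) -> bool:
    """True if some placement of the seed has all its '1' positions on '1's."""
    care = [j for j, ch in enumerate(seed) if ch == "1"]
    return any(
        all(region[start + j] == "1" for j in care)
        for start in range(len(region) - len(seed) + 1)
    )


def pick_next_seed(existing, candidates, length=64, p=0.7, m=500, rng=None):
    """Halve the candidate pool on doubling sample sizes until one seed is left."""
    rng = rng or random.Random(0)
    pool = list(candidates)          # must contain at least one candidate
    while len(pool) > 1:
        regions = [
            "".join("1" if rng.random() < p else "0" for _ in range(length))
            for _ in range(m)
        ]
        # Only regions missed by the existing seeds can tell candidates apart.
        missed = [r for r in regions if not any(hits(s, r) for s in existing)]
        ranked = sorted(pool, key=lambda c: sum(hits(c, r) for r in missed),
                        reverse=True)
        pool = ranked[: (len(pool) + 1) // 2]   # drop the worst-performing half
        m *= 2
    return pool[0]


best = pick_next_seed(["111"], ["1101", "1011", "11001", "10101"], length=20, m=50)
print(best)
```

Weak candidates are discarded after a few cheap samples, so the larger sample sizes are spent only on the candidates that are hard to separate.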

Page 73:

Seed performance

Two ways to increase the sensitivity:

Increase the number of seeds

Reduce the weight of a single seed

Both increase running time.

The sensitivity of "doubling the number of seeds" is approximately equal to that of "reducing the weight of a single seed by 1".

At high level, doubling the number of seeds achieves better sensitivity.

Page 74:

Seed performance (cont.)

From low to high:

Solid curves: using the first k = (1, 2, 4, 8, 16) weight-11 seeds

Dashed curves: single optimal weight w = (10, 9, 8, 7) seeds

Ref. Li, M. et al. (2004) Comput. Biol., 2, 417–440.

Page 75:

Comparison

Sensitivity / Speed

PatternHunter II

Blast

Smith-Waterman algorithm: SSearch

Page 76:

SSearch Configuration

Smith-Waterman algorithm

A sub-program in the FASTA package

FASTA package: ftp://ftp.virginia.edu/pub/FASTA/

Page 77:

Common Environment

Score scheme:

Match = 1

Mismatch = -1

Gap open = -5

Gap extension = -1

Local alignment scores >= 16

Page 78:

Common Environment

DNA sequences: 2 sets of human and mouse EST sequences

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/

month.est_human.Z

month.est_mouse.Z

Pentium IV 3GHz Linux PC

Page 79:

Term Explanation

EST: Expressed Sequence Tag

A unique stretch of DNA within a coding region of a gene that is useful for identifying the gene.

A short sub-sequence of a transcribed sequence.

Page 80:

Term Explanation

Coding Regions

Regions of DNA/RNA sequences that code for proteins. A coding region usually starts with a start codon (ATG) and ends with a stop codon.

The coding region of a gene is the portion of DNA that is transcribed into mRNA and translated into proteins.

Page 81:

Repeat Masking

Fact: there are long runs of identical letters, especially of As and Ts (an example will be shown later).

Solution: turn every run of ten or more identical letters into Ns.
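
The masking rule is simple enough to state in code. The sketch below assumes upper-case sequences over A, C, G, T and uses a hypothetical function name; it is an illustration of the rule on this slide, not the preprocessing script used in the experiments.

```python
import re


def mask_repeats(seq: str, min_run: int = 10) -> str:
    """Replace every run of min_run or more identical letters with Ns."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_repeats("GATTACA" + "A" * 12 + "CGT"))
```

Here the final A of GATTACA joins the following twelve As, so a run of 13 letters is replaced by 13 Ns, and the sequence keeps its length and coordinates.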

Page 82:

SSearch Result

Number of human ESTs: 4

Number of mouse ESTs: 2005

EST example (shown)

Ref. Li, M. et al. (2004) Comput. Biol., 2, 417–440.

Page 83: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Optimal Versus Sub-Optimal

Neither PatternHunter nor Blast tries to compute the optimal alignments for the homologies they have found.

Q: Why not find the optimal alignments?

Ans: Use Blast or PH2 to “detect” the homologies, then compute the optimal alignments.

Page 84: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Found

SSearch finds a local alignment with score x for a pair of ESTs

PatternHunter II finds a local alignment with score >= x/2 for the same pair

Then the pair of ESTs counts as “found”

Page 85: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Sensitivity Definition

Smith-Waterman: finds y pairs of ESTs with local alignment score at least x

Other programs: y’ of the y pairs are found, with alignment score >= x/2

Sensitivity ratio: y’ / y
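The ratio can be computed as follows. This is a sketch under one reading of the definition: a pair counts as found when the other program's score is at least half of the Smith-Waterman score for that same pair. The dictionaries and the function name `sensitivity` are hypothetical.

```python
def sensitivity(sw_scores, other_scores, x):
    """Return y'/y for a Smith-Waterman score threshold x.

    sw_scores and other_scores map an EST pair to its best local alignment score.
    """
    # y: pairs for which Smith-Waterman reports a score of at least x.
    reference = [pair for pair, score in sw_scores.items() if score >= x]
    if not reference:
        raise ValueError("no pair reaches the threshold x")
    # y': reference pairs that the other program finds with at least half the score.
    found = sum(
        1 for pair in reference
        if other_scores.get(pair, 0) >= sw_scores[pair] / 2
    )
    return found / len(reference)


sw = {("h1", "m1"): 100, ("h2", "m2"): 80, ("h3", "m3"): 30}
ph = {("h1", "m1"): 60, ("h2", "m2"): 30}
print(sensitivity(sw, ph, x=50))
```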

Page 86: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Blastn Configuration

Version 2.2.6, from NCBI’s website

-F F option: turns off the low-complexity region filtering

Weight 11 seed: 11111111111

Page 87: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Speed comparison

Ref. Li,M. et al. (2004) Comput. Biol., 2, 417–440.

Page 88: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Sensitivity comparison

From low to high:

Dashed: Blastn, seed weight 11

Solid: PH II with 1, 2, 4, 8 seeds of weight 11

Page 89: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Compare with other seeds

From left to right: PH II, two weight 11 seeds; PH II, one weight 10 seed

1101100101000101101 HMM model

Page 90: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Seed Selection

Use heuristic or exponential-time algorithms for the general seed selection problem

PTAS: polynomial time approximation scheme

Page 91: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Homology Search

Time-consuming

DNA-DNA searches: Blastn

Translated DNA-protein searches: tBlastx, tPH

Protein-protein searches: small query and database sizes

Page 92: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Conclusion

Optimized spaced seeds: Blastn & PH II reach the same sensitivity, with PH II 5-100 times faster

Optimized multiple spaced seeds: PH II & Smith-Waterman reach approximately the same sensitivity, with PH II >1000 times faster

Page 93: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Translated PatternHunter

Page 94: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Outline

What’s translated search?

BLAST’s translated search

Translated PatternHunter

Performance

Page 95: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

What’s translated search?

To translate a DNA sequence into a protein sequence for alignment with another protein sequence

But what’s “translation”?

Page 96: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

What’s translation?

In biology, “translation” means translating DNA into amino acids (AA) with a universal genetic code that maps each 3-nucleotide codon to one amino acid.

The DNA sequence is first transcribed into an RNA sequence in which all T’s are replaced by U’s

AUG UCA CUA GAA UCG UUA UAG

Met Ser Leu Glu Ser Leu .
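The codon-by-codon lookup can be sketched as below. The table is the standard genetic code written in the usual compact U, C, A, G order, and the names `transcribe` and `translate` are illustrative. Under the standard code the third codon of the example, CUA, decodes to Leu.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons UUU, UUC, UUA, ..., GGG; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Rewrite a DNA coding strand as RNA by replacing every T with U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


print(translate(transcribe("ATGTCACTAGAATCGTTATAG")))
```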

Page 97: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

The Genetic code

We can use translation in homology search since the genetic code is universal

Degeneracy: some DNA codons map to the same AA; they usually differ in the third nucleotide of the codon

Translation is one-way: DNA → Protein

AUG UCA CUA GAA UCG UUA UAG

Met Ser Leu Glu Ser Leu .

Page 98: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Why do we need translated search?

When a DNA database or a protein database is not available

Blastx: DNA query, protein database

tBlastn: protein query, DNA database

To find very distant homologies

tBlastx: DNA query & database, both translated

Slowest, but finds more functional & structural homology in addition to sequential homology

Why?

Page 99: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Substitution Matrix

Some AAs are similar in their chemical or physical properties

Substitution scoring is no longer just match/mismatch!

Stop codon is assigned the most negative score in BLAST and tPH

PAM (Point Accepted Mutation): based on global alignment of closely related proteins (1% divergence for PAM1)

BLOSUM (BLOcks SUbstitution Matrix): based on local alignment of divergent proteins (62% similarity for BLOSUM 62)

Page 100: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Substitution Matrix

Short alignments need to be relatively strong to rise above background noise, so short queries can only detect closely related homologies

Query Length   Substitution Matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)

The rows run from closely related (top) to divergent (bottom).

adapted from NCBI: substitution matrix
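The table reads naturally as a lookup on query length. In this sketch the values are copied from the slide, while the function name `choose_matrix` and the handling of the boundary lengths 50 and 85 (assigned to the shorter bracket) are assumptions.

```python
def choose_matrix(query_length):
    """Return (matrix name, gap costs) for a protein query of the given length."""
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (10, 1)


print(choose_matrix(40))
```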

Page 101: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

BLAST’s translated search

The same in Blastx, tBlastn, tBlastx

Aligns the 6-frame translations of the DNA sequence against another protein sequence

Page 102: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Reading Frame of DNA Sequence

The DNA sequence can be read in six reading frames, three in the forward and three in the reverse direction.

[Figure: the two complementary strands of the example sequence, with the amino acids of the six reading frames written alongside]

Forward frames:

Asn Asp Thr Arg Ile Val Ile

Met Thr Val Glu Ser Leu . (Open Reading Frame)

. His . Asn Arg Tyr Ser

Reverse frames:

His Cys . Phe Arg . Leu

. Val Leu Ile Thr Ile Ala

Ile Val Ser Ser Asp Asn Tyr

Page 103: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

BLAST’s translated search

1. Translate the DNA sequence into all 6 possible frames

2. Align each frame against the protein sequence, just like BLASTp.

3. The pairs with significant scores are reported
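Step 1 can be sketched as follows: three offsets on the given strand plus three on its reverse complement. The code assumes RNA letters (A, C, G, U) and the standard genetic code; `six_frames` is an illustrative name, not BLAST's implementation.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in U, C, A, G order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def translate(rna):
    """Translate codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def six_frames(rna):
    """Return the three forward and three reverse-complement translations."""
    reverse = rna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (rna, reverse)
            for offset in range(3)]


for frame in six_frames("AAUGACACUAGAAUCGUUAUAGC"):
    print(frame)
```

Each of the six strings would then be aligned against the protein sequence separately, as in step 2.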

Page 104: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

How good is significant?

The expected number of alignments scoring S or greater between two sequences of lengths m and n is

$E = mnKe^{-\lambda S}$ or $E = mne^{-S'}$, with $S' = \lambda S - \ln K$

where K and λ, used for normalization, depend on the sequence composition

Different K and λ are used for each frame

Non-coding sequences tend to yield alignments of marginal significance
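A small numeric sketch of the formula. The K and λ values in the usage line are placeholders chosen for illustration, not parameters of any real scoring system, and the function names are hypothetical.

```python
import math


def expected_alignments(m, n, score, K, lam):
    """E = m * n * K * exp(-lambda * S) for sequences of lengths m and n."""
    return m * n * K * math.exp(-lam * score)


def normalized_score(score, K, lam):
    """S' = lambda * S - ln K, so that E = m * n * exp(-S')."""
    return lam * score - math.log(K)


# Placeholder parameters, for illustration only.
print(expected_alignments(m=300, n=1_000_000, score=60, K=0.1, lam=0.3))
```

The expected count drops exponentially with the score, so a modest increase in S turns a marginal alignment into a significant one.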

Page 105: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Translated PatternHunter

The version of PH for translated search

Compared with PatternHunter, tPH uses very different algorithms for hit generation and gapped extensions

Page 106: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Hit Generation in tPH

Weight = 5 instead of 11

Space complexity: $20^5 \sim 4^{11}$ in PH

Length = 6 or 7

Does not require exact matches

Hit = all five pairs have scores ≥ 0 and the total score is above a tolerance T

Use BLOSUM 62

Multiple seeds are used

Page 107: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Hit Generation in tPH

Seed = 1011, T=7

Query: Met Phe Ala Gln Ser Val Leu

Indexed Subject:

[Figure: subject strand with its three forward reading frames]

Met Thr Val Glu Ser Leu .

. His . Asn Arg Tyr Ser

Asn Asp Thr Arg Ile Val Ile

All possible hits

Met X Ala Gln

Met X Val Glu

Met X Ala Glu

5 + 4 + 5 ≥ 6

5 + 0 + 5 ≥ 6

5 + 0 + 2 ≥ 6

Gln X Val Leu

Arg X Val Ile

Arg X Val Leu

5 + 4 + 4 ≥ 6

5 + 4 + 2 ≥ 6

1 + 4 + 2 ≥ 6
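The hit test from the example above can be sketched like this. Only the few BLOSUM62 entries needed for the example are included, "above a tolerance T" is read as "at least T" (an assumption), and `is_hit` is an illustrative name rather than tPH's code.

```python
# Excerpt of BLOSUM62 covering only the residue pairs used in the example.
BLOSUM62_EXCERPT = {
    ("M", "M"): 5, ("A", "A"): 4, ("A", "V"): 0, ("V", "V"): 4,
    ("Q", "Q"): 5, ("Q", "E"): 2, ("E", "E"): 5, ("Q", "R"): 1,
    ("L", "L"): 4, ("L", "I"): 2,
}


def pair_score(a, b):
    """Look up a symmetric score; raises KeyError for pairs outside the excerpt."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def is_hit(query, subject, seed, threshold):
    """A hit needs every seeded pair to score >= 0 and the total to reach the threshold."""
    scores = [pair_score(q, s)
              for q, s, bit in zip(query, subject, seed) if bit == "1"]
    return all(score >= 0 for score in scores) and sum(scores) >= threshold


# Met Phe Ala Gln against Met Thr Val Glu under seed 1011: 5 + 0 + 2 = 7.
print(is_hit("MFAQ", "MTVE", "1011", threshold=7))
```

Positions where the seed has a 0 are never looked up, which is why Phe against Thr does not affect the result.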

Page 108: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Gapped Extension in tPH

The same as in BLAST?

BLAST can’t handle frame shift errors

Huh?

Page 109: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Frame Shift Error

When a single nucleotide is deleted or inserted, it causes the reading frame to shift

[Figure: the example strand after a single-nucleotide deletion, with the frames Met Thr Val Glu Ser Leu . and Asn Asp Thr Arg Ile Val Ile]

BLAST can’t detect such variation: it aligns the 6 frames with the subject independently

In fact, most frame shift mutations completely abolish the protein’s function; they are usually lethal

Page 110: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Frame Shift Error

In this example:

BLAST can only find at most two separated segments

tPH can connect them with a single deletion of “C”

How?

Page 111: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Gapped Extension in tPH

tPH regards the DNA sequence as a sequence of overlapping codons

Use a modified Smith-Waterman algorithm that can take frame shift into account

Substitution: S(i-1, j-3) + σ(p_i, n[j-2..j])

Insertion of DNA (1 nucleotide): S(i, j-1) + frameshift

Insertion of DNA (2 nucleotides): S(i, j-2) + frameshift

Insertion of AA: S(i, j-3) + gap

Deletion of AA: S(i-1, j) + gap
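The recurrence can be sketched as a small dynamic program. The penalties (-1 for a frameshift, -2 for a gap) follow the scoring-scheme slide; the match/mismatch function `sigma` is a stand-in for BLOSUM 62, and all names here are illustrative rather than tPH's actual code.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in U, C, A, G order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def sigma(aa, codon, match=5, mismatch=-4):
    """Stand-in substitution score (assumption): a real run would use BLOSUM 62."""
    return match if CODON_TABLE[codon] == aa else mismatch


def frameshift_sw(protein, rna, frameshift=-1, gap=-2):
    """Best local score of a protein (one-letter codes) against RNA, allowing frame shifts."""
    m, n = len(protein), len(rna)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(m + 1):
        for j in range(n + 1):
            candidates = [0]  # local alignment: a path may restart anywhere
            if i >= 1 and j >= 3:  # substitution: residue i against codon n[j-2..j]
                candidates.append(S[i - 1][j - 3] + sigma(protein[i - 1], rna[j - 3:j]))
            if j >= 1:  # one extra nucleotide
                candidates.append(S[i][j - 1] + frameshift)
            if j >= 2:  # two extra nucleotides
                candidates.append(S[i][j - 2] + frameshift)
            if j >= 3:  # a whole codon with no matching residue
                candidates.append(S[i][j - 3] + gap)
            if i >= 1:  # a residue with no matching codon
                candidates.append(S[i - 1][j] + gap)
            S[i][j] = max(candidates)
            best = max(best, S[i][j])
    return best


# The second call removes one C from the RNA; the two halves stay connected.
print(frameshift_sw("MSLESL", "AUGUCACUAGAAUCGUUA"))
print(frameshift_sw("MSLESL", "AUGUCAUAGAAUCGUUA"))
```

Without the two frameshift moves, the second call could only score the segments before and after the deletion separately, which is the limitation attributed to BLAST here.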

Page 112: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Scoring Scheme

[Figure: dynamic-programming matrix S with the DNA n = GACACUAGAAUCG along one axis and the protein p = Asp Arg Tyr Ser along the other; the border cells are 0 and the cells along the alignment path hold 6, 4, 3, 8, 6, 10]

Query: GAC ACU A-- GAA --- UCG

Asp Thr --- Glu Tyr Ser

Subject: Asp --- --- Arg Tyr Ser

Step labels on the slide: Insertion, Frameshift, Deletion, Substitution (three times)

S(i-1, j-3) + σ(p_i, n[j-2..j])

S(i, j-1) + frameshift (-1)

S(i, j-2) + frameshift (-1)

S(i, j-3) + gap (-2)

S(i-1, j) + gap (-2)

Page 113: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Performance Evaluation

4407 human expressed sequence tag (EST) sequences

Split in the middle as subject and query

Page 114: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Number of Alignments Found

T=12 for BLAST

3x speed

Higher sensitivity

Ref. Derek Kisman et al, Bioinformatics Vol. 21 no. 4 2005

Page 115: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Unique Alignment Found

Most contain frameshifts

Ref. Derek Kisman et al, Bioinformatics Vol. 21 no. 4 2005

Page 116: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Using 4 Seeds

Differs from PH2

Short seeds

High dependency between seeds

Ref. Derek Kisman et al, Bioinformatics Vol. 21 no. 4 2005

Page 117: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Reference

PatternHunter: Bin Ma, John Tromp, Ming Li, Bioinformatics Vol. 18 no. 3 2002

Ming Li, NHC2005

PatternHunter II: Li,M., Ma,B., Kisman,D. and Tromp,J. (2004) Comput. Biol., 2, 417–440.

NTU R94922059 林語君’s powerpoint

Page 118: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Reference

tPatternHunter: Derek Kisman, Ming Li, Bin Ma, and Li Wang, Bioinformatics Vol. 21 no. 4 2005

Others

Wikipedia http://en.wikipedia.org/wiki

NCBI http://www.ncbi.nlm.nih.gov

Page 119: PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086.

Thank you for your attention!