서열정렬

서열정렬

2010.11.4Written By 배형섭

Outline

서열분석 목적 Pairwise alignment/Multiple Sequence Align-ment dot-matrix representation Needleman-wunsch algorithm Smith-Waterman algorithm PAM BLOSUM

서열분석의 목적1. 분자생물학의 경험적 지식 사용 - 두 분자가 유사한 서열을 가지면 , 진화적인 관계나 물리 화학적인 제약때문에 유사한 3 차원구조와 비 슷한 생물학적 기능을 갖기 쉽다 .

2. 뉴클레오티드 또는 아미노산 서열로 쓰여진 고차원 의 구조와 기능적인 정보를 밝혀내기 위함

- 구조적 , 기능적 속성으로 확장 될 수 있는 서열 특 징을 찾고자 함

서열분석 방법서열 해석에 대한 두가지 접근법

1) 서열유사성 검색 또는 상동성 검색2) 모티브 검색

여전히 가장 선호 되는 접근 방법

모든 형태의 기능 속성에 대한 경험적인 법칙들을 만드는 것이 가능 하지 않음

서열 정렬 최적의 서열 정렬을 얻는 문제는 전반적인 유사성을 나타내는 점수 함수 (Score Function) 를 최적화 하는것

공백은 일치하는 문자쌍의 수를 최대화하거나 , 치환 , 그리고 삽입 / 삭제와 같 은 편집 연산의 수를 최소화 하기 위해서 추가

Score Function 은 아미노산 또는 뉴클레오티드들 사이의 유사성 측정값과 삽입 및 삭제에 대한 벌점 (penalties) 을 통해 계산

서열 정렬법 Global 정렬 & Local 정렬 : 전체 / 부분적인 유사 영역 비교 Pair wise 정렬 & Multiple 정렬 : 두개 / 세개 이상의 서열 정렬 3 차원 정렬 : 두가지 단백질의 3 차원 구조 비교 3 차 -1 차 정렬 : 3 차원 구조 라이브러리에 대한 아미노산 서열 비교

The best alignment over the entire length of two sequences

Suitable when the two sequences are of similar length, with a signicant degree of similarity throughout

Example: SIMILARITY PI-LLAR--

Global sequence alignment

Involving stretches that are shorter than the en-tire sequences, possibly more than one.

Suitable when comparing substantially different sequences, which possibly differ signicantly in length, and have only a short patches of similar-ity.

For example, the local alignment of SIMILARITY and PILLAR:

MILAR ILLAR

Local sequence alignment

Pairwise Alignment

Multiple Alignment

dot-matrix representation

최적 정렬 방법

1. Optimal alignment - proven mathmatically to produce the best-scoring between two sequences for a given scoring system

2. Global and local alignment - A global alignment program -> Needleman-Wunsch algorithm - A local alignment program -> Smith-Waterman algorithm

3. Choice in dynamic programming - alignments obtained depend on the choice of a scoring system for compar-

ing character pairs and on penalty scores for gaps

4. Alignments, alignment scores, and alterative alignments -5. Analysis goals

-Dayhoff PAM matrices -> evolutionary model of protein change

- BLOSUM matrices -> identify members of the same family

Dynamic Programming Method

Needleman-wunsch algo-rithm

A matrix D(I, j) indexed by residues of each sequence is built recursively, such that

D(i; j) = max D(i-1, j -1) + s(xi,yj) D(i -1 1; j) + g D(i; j -,1) + g subject to a boundary conditions. s(I, j) is the substitution score for residues i and j, and g is the gap penalty

The mathematics

The Needleman-Wunsch algorithm consists of three steps:

1. Initialisation of the score matrix 2. Calculation of scores and filling the traceback matrix 3. Deducing the alignment from the trace-

back matrix

overview

The alignment of two sequences SEND and AND with the

BLOSUM62 substitution matrix and gap opening penalty

of 10 (no gap extension): SEND -AND score: +1 A-ND score: +3 the best AN-D score: -3 AND- score: -8

Consider the simple exam-ple

C(1,1)

C(1,2)

C(1,3)

C(1,4)

C(1,5)

C(2,1)

C(3,1)

C(4,1)

The score and traceback ma-tricesThe cells of the score matrix are labelled C(I,j) wherei = 1; 2; :::;N and j = 1; 2; :::;M

C(1,1)

C(1,2)

C(1,3)

C(1,4)

C(1,5)

C(2,1)

C(3,1)

C(4,1)

S E N D

A

N

D

Initialization

0 -10 -20 -30 -40

-10

-20

-30

The first row and the the first column of the score andtraceback matrices are filled during the initialization

S E N D

A

N

D

done

left

left

left left

up

up

up

S E N D

The score matrix cells are filled by row starting from the cell C(2; 2) The score of any cell C(I,j) is the maximum of: qdiag = C(i-1, j-1) + S(I, j) qup = C(i–1, j) + g qleft = C(i, j-1) + g

where S(i; j) is the substitution score for Letters i and j, and g is the gap penalty

Scoring

The calculation for the cell C(2; 2): qdiag = C(1, 1) + S(S, A) = 0 + 1 = 1 qup = C(1, 2) + g = 10 + (-10) = -20 qleft = C(2, 1) + g = 10 + (-10) = -20

Where C(1,1), C(1, 2), and C(2,1) are read from the score matrix, and S(S,A) is the score for the S <-> A taken from the BLO-SUM62 matrix.

The value of C(2,2)

Filling the score and traceback ma-tricesFor the score matrix C(2, 2) = qdiag which is 1. The corresponding cell of the traceback matrix is "diag":

0 -10 -20 -30 -40

-10 1

-20

-30

done

left left

left

left

up diag

up

up

S E N D

A

N

D

The traceback path

done

left left

left

left

up diag left

up diag

up diag

The traceback performed on the completed traceback matrix

S E N D

A

N

DTraceback starts here

S E N DA- N D

The alignment is over the entire length of two se-quences:

the traceback starts from the lower right corner of thetraceback matrix, and completes in the upper left cell ofthe matrix.

The Needleman-Wunsch algorithm works in the same Way regardless of the length or complexity ofsequences, and guarantees to find the best alignment.

The Needleman-Wunsch algorithm is appropriate forfinding the best alignment of two sequences which are (i) of the similar length; (ii) similar across

their Entire lengths

The summary

Smith-Waterman algo-rithm

Local Alignment 알고리즘 주어진 치환행렬과 갭 감점을 기반으로 두서열의 유 사도가 높은 영역을정렬 최적의점수를기록한서열의정렬을찾아냄 부분적인정렬스코어를가진 2

차원의테이블을만듦 - 테이블안의 각셀은 최적의 부분정렬을 찾아 낼수 있는 스코어를 포함 두서열의 상동성 검색에 사용됨

overview

Smith-Waterman Algo-rithm

The mathematics

Substitution Matrix

What is Substitution Matrix(Score Matrix)? 정의 :

- 어떤 아미노산이가 다른 아미노산로 치환되었을때 , 전체적인 단백질에영향을 미치는 확률적인 정도 .- 서열정렬시 , 아미노산이 서로 비슷한 정도 .

치환행렬의 기본형태 :log ( 공통조상에 의한 확률 / 우연에 의한 확률 )

가장 단순한 형태 :실무율 (all-or-nothing)Uij= 1 or 0 (1 for i=j, 0 for the rest)

Dayhoff 등이 1960 년대 중반부터 72 개 family 에 속하는 1572 개의 단백질 서열데이터와 Pairwise alignment 방법을

이용하여 20 가지 아미노산이 치환될 확률을 통계적으로 분석 .

-PAM1, PAM10, PAM20,…., PAM500 처럼 뒤에 나오는 숫자는 단위 진화시간이 그숫자 만큼 반복할

경우 나타날 아미노산 치환확률이다 . 즉 , PAM10 은 PAM1 을 10 번 곱해서 만들고 PAM1 단위의 진화적 시간이 10 번 반복될경우 나타나는 아미노산의 치환확률의 수치화된 표현이다 .

- 일반적 기본값 : PAM120 or PAM250

PAM (Point Accepted Muta-tion)

1 PAM 이란 ?:100 개의 아미노산중 치환돌연변이 가 1 개 존재할때 ( 따라서 PAM 단위는 진화적 거리의 척도로도 사용된다 .)

PAM Unit

PAM Matrix

-Steven Henikoff 등이 1992 년 PAM 을 개선하기위하여 고안되었고 따라서 PAM 보다 우수한결과를 제공한다 .

-BLOSUM(m) 의 형태 : m 은 다중서열일치정도

- 일반적 기본값 : BLOSUM62 (NCBI 의 BLASTP) or BLOSUM50 (FASTA)

BLOSUM (Blocks Substitution Matrix)

BLOSUM 행렬 형성 과정

BLOSUM Matrix

PAM vs. BLOSUM

서열정렬

Documents

Transcript of 서열정렬