Building Protein Families Based on Clustering


Team leader: 许坤

Team members: 高晨曦, 曹天骄, 韩蕊

Presenters: 曹天骄, 韩蕊

Contact: [email protected]

Final Project Proposal 2010


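Candidate proposals

· Multilingual dictionary
· Multimedia recommendation system
· Protein alignment analysis and processing system
· Earthquake / weather prediction system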


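Feasibility analysis

Evaluation criteria

· How relevant the main techniques required by the proposal are to the course
· Whether the size and source of the data set required by the proposal can meet the course project requirements
· Problems that may arise when implementing the proposal
· Existing ...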


Selected proposal

Protein alignment analysis and processing system


Introduction

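The total number of protein types in the world is hard to estimate; a single cell alone contains thousands of proteins that differ in structure, function, and molecular mass.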


What is a protein?

Components of organisms:
· Enzymes (metabolism)
· Transport (O2, membrane …)
· Movement (muscles)
· Antibodies (immunity)
· Brain … …
· Protection (horns, skins…)


Amino acids

· 20 standard amino acids (primary structure)
· Protein structure and function (tertiary structure)

Each amino acid has:
· a carboxyl group
· an amino group
· side chains, or R groups


Homologous proteins

Protein sequences can elucidate the history of life on earth. The study of molecular evolution generally focuses on families of closely related proteins. The members of protein families are called homologous proteins or homologs.

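Homologous proteins can occur within a species or across species. How closely two proteins are related reflects how closely the corresponding species are related in evolution. A protein's amino acid sequence contains all the information needed to judge this relationship, so aligning amino acid sequences makes it possible to derive a phylogenetic tree of the species.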


Protein amino acid sequence database (about 80 GB, downloaded from UniProt)

ftp://ftp.ncbi.nlm.nih.gov (National Center for Biotechnology Information, USA)
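
As a sketch of how such a download might be loaded, assume the sequences are stored in FASTA format (the slides do not state the file format). The function `parse_fasta` and the two sample records are hypothetical; the code only shows one way to turn the raw text into protein name / amino acid sequence pairs.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record name (first word after '>') to its full sequence."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; store the previous one first.
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            # Sequence lines may wrap, so collect them until the next header.
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Hypothetical sample input, not real UniProt entries.
example = """>protA hypothetical record
MKTAYIAK
QRQISFVK
>protB hypothetical record
MKTAYLAK
""".splitlines()

records = parse_fasta(example)
for protein_name, sequence in records.items():
    print(protein_name, sequence)
```

For a file of about 80 GB, the same loop would have to stream records one at a time rather than hold the whole dictionary in memory.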


Expectation

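By aligning protein amino acid sequences, obtain the similarity between proteins and thereby identify highly homologous proteins.

Ultimately, build protein families.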


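Preliminary approach

1. Input and output:
Input: amino acid sequences of proteins
Key/value: protein name / amino acid sequence
Output: highly homologous protein sequences

2. Method: clustering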


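3. Abstract model:

(1) Setting up the coordinate system:

· Dimensions: the number of amino acids in the longest protein sequence is used as the number of dimensions that span the space; each coordinate axis has 20 discrete tick values (one numeric value per amino acid);

· Coordinates: a formula based on the parameters of each amino acid determines the numeric value assigned to that amino acid;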


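(2) Locating the points in space: each protein is mapped to a point in the space according to its amino acid sequence.

(3) Distance formula between two points (alignment):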
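
No distance formula is given at this point, so the following is only a sketch of the abstract model under assumed choices. The values 1 to 20 in `VALUE` are placeholders for the per-amino-acid formula, shorter sequences are padded with an assumed coordinate of 0, and Euclidean distance stands in for the distance formula. The names `to_point` and `euclidean` and the three sample sequences are hypothetical.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes

# Placeholder tick values 1..20; the proposal leaves the real formula open.
VALUE: Dict[str, int] = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

PAD = 0  # assumed coordinate for positions past the end of a sequence


def to_point(seq: str, dims: int) -> List[int]:
    """Map a sequence to a point with one coordinate per position."""
    coords = [VALUE[aa] for aa in seq]
    return coords + [PAD] * (dims - len(coords))


def euclidean(p: List[int], q: List[int]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical toy sequences, not real proteins.
sequences = {"protA": "ACDE", "protB": "ACDF", "protC": "WY"}

# The longest sequence fixes the number of dimensions.
dims = max(len(s) for s in sequences.values())
points = {name: to_point(s, dims) for name, s in sequences.items()}

d_ab = euclidean(points["protA"], points["protB"])
d_ac = euclidean(points["protA"], points["protC"])
print(d_ab, d_ac)
```

A position-by-position distance like this one changes every coordinate after a single insertion or deletion, which is worth keeping in mind when choosing the actual formula.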


参考模型


Set-Similarity Join

· partition the data across nodes
· balance the workload
· minimize the need for replication
· self-join and R-S join cases
· control the amount of data kept in main memory on each node, even if we use the most fine-grained partitioning, the data ...
· experiments on UniProt datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop
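
To make the idea of a set-similarity self-join concrete, here is a minimal single-machine sketch. It is not the Hadoop algorithm referred to above: it assumes each sequence is represented as its set of overlapping 3-letter substrings, uses Jaccard similarity, and compares every pair directly. The names `kmers`, `jaccard`, and `self_join`, the threshold 0.3, and the sample records are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def self_join(records: Dict[str, str], threshold: float,
              k: int = 3) -> List[Tuple[str, str, float]]:
    """Return every pair of records whose similarity reaches the threshold."""
    sets = {name: kmers(seq, k) for name, seq in records.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        sim = jaccard(sets[x], sets[y])
        if sim >= threshold:
            pairs.append((x, y, sim))
    return pairs


# Hypothetical toy sequences, not real proteins.
records = {
    "protA": "MKTAYIAK",
    "protB": "MKTAYLAK",
    "protC": "GGSWPLVN",
}

similar = self_join(records, threshold=0.3)
print(similar)
```

The all-pairs loop is quadratic in the number of records, which is why a distributed version has to partition the data across nodes and limit replication, as listed above.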


Clustering


Evaluation of results


References

Nelson, D. L., and Cox, M. M. (2005) Lehninger Principles of Biochemistry, fourth edition, Worth Publishers.
