Integrated Genomic and Proteomic Databases


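Description

Integrated Genomic and Proteomic Databases. Chih-Chin Liu (劉志俊), Department of Computer Science and Information Engineering, Chung Hua University, July 2008. Outline: bioinformatics from a database perspective; the four major data types in bioinformatics; biological database design and UML; the integrated biological database UniBio; the pig / native chicken genome databases; the proteome database. When biology meets informatics: biology (molecular genetics, molecular biology, biochemistry, cell biology, protein science, immunology) and informatics (programming languages, data structures, algorithms, databases, parallel processing, data mining) meet in bioinformatics.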

Transcript of Integrated Genomic and Proteomic Databases

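Page 1: Integrated Genomic and Proteomic Databases

Chih-Chin Liu (劉志俊)

Department of Computer Science and Information Engineering, Chung Hua University

July 2008

Integrated Genomic and Proteomic Databases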

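Page 2: Integrated Genomic and Proteomic Databases

Assistant Prof. Chih-Chin Liu Page 2

Outline

Bioinformatics: A Database Perspective

The Four Major Data Types in Bioinformatics

Biological Database Design and UML

An Integrated Biological Database: UniBio

Pig / Native Chicken Genome Databases

Proteome Database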

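Page 3: Integrated Genomic and Proteomic Databases

When Biology Meets Informatics

Biology: molecular genetics, molecular biology, biochemistry, cell biology, protein science, immunology

Informatics: programming languages, data structures, algorithms, databases, parallel processing, data mining

Bioinformatics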

Page 4: Integrated Genomic and Proteomic Databases

Genome, Transcriptome, Proteome, Metabolome

Genome: The complete set of genetic material of an organism.

Transcriptome: The complement of expressed genes that are found in a particular cell or tissue.

Proteome: The complement of proteins that are found in a particular cell or tissue.

Metabolome: The assembly of substrates, metabolites, and other small molecules that are present in a population of cells.

Page 5: Integrated Genomic and Proteomic Databases

More "-omes"

Structurome: ∑ Structures

SNPome: ∑ SNPs (variations)

Literaturome: ∑ Literatures

Transductome: ∑ Transductions (signal transduction)

Pathwayome: ∑ Pathways (reaction pathways)

Diseasome: ∑ Diseases (genetic diseases)

"-ome" databases

Page 6: Integrated Genomic and Proteomic Databases

Research Issues in Biological Databases

Data Modeling: How to store and represent biological data

Data Retrieval: How to retrieve similar biological objects

Data Mining: How to find the rules behind biological data

Simulation: Pathway Simulation, Virtual Cell, Virtual Life

Page 7: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

Large Strings: DNA Sequences, Protein Sequences

Biological Images: 2D Gels, Microarray Images

3D Structures: Proteins, Compounds

Network: Pathways

Page 8: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

Large Strings: DNA Sequences

The complete sequence of human chromosome 1 is 245,564,334 bp long, the longest single sequence record in GenBank.

Page 9: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

Large Strings: Protein Sequences

PIR: I38344

The longest protein sequence in the PIR database: 26,926 amino acids

titin, cardiac muscle [validated] - human

Page 10: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

Images: Microarray (Stanford Microarray Database)

Page 11: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

Images: 1D-Gel, 2D-Gel

Page 12: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

3D Structures: Chemical Compound

Page 13: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

3D Structures

Page 14: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

3D Structures

ATOM 1 N VAL 1 -4.004 15.224 13.636 1.00 32.64 N
ANISOU 1 N VAL 1 4512 3449 4441 -335 -2675 320 N
ATOM 2 CA VAL 1 -3.526 15.758 14.900 1.00 18.42 C
ANISOU 2 CA VAL 1 1478 2233 3289 -286 -467 555 C
ATOM 3 C VAL 1 -2.662 14.733 15.628 1.00 17.06 C
ANISOU 3 C VAL 1 1603 1981 2899 -152 -466 234 C
ATOM 4 O VAL 1 -3.053 13.569 15.714 1.00 18.61 O
ANISOU 4 O VAL 1 1758 2150 3163 -489 -394 501 O
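The ATOM records on this slide can be read with a few lines of Python. What follows is only a minimal sketch, under two assumptions: the dangling "-" in the transcript is the sign of the x coordinate, and the fields are whitespace-separated in the order serial, atom name, residue name, residue number, x, y, z, occupancy, B-factor, element, with no chain identifier. Real PDB files use fixed columns, so a production parser would slice by column instead. The names Atom and parse_atom_line are hypothetical.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM line laid out as on the slide."""
    fields = line.split()
    if len(fields) != 11 or fields[0] != "ATOM":
        raise ValueError(f"not a slide-style ATOM record: {line!r}")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        residue_number=int(fields[4]),
        x=float(fields[5]),
        y=float(fields[6]),
        z=float(fields[7]),
        occupancy=float(fields[8]),
        b_factor=float(fields[9]),
        element=fields[10],
    )


# Two records copied from the slide (x sign assumed, see above).
records = [
    "ATOM 1 N VAL 1 -4.004 15.224 13.636 1.00 32.64 N",
    "ATOM 2 CA VAL 1 -3.526 15.758 14.900 1.00 18.42 C",
]
atoms = [parse_atom_line(r) for r in records]
print(atoms[0].x, atoms[1].name)
```

Storing coordinates as typed fields rather than raw text is what makes 3D structures queryable as a data type in their own right.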

Page 15: Integrated Genomic and Proteomic Databases

New Data Types in Bio-Databases

Network: Pathways

Page 16: Integrated Genomic and Proteomic Databases

Database Design

Conceptual Database Design: Class Diagram (ER Model, UML Class Diagram)

Entities (Classes), Relationships, Attributes

Logical Database Design: Relational Schema

Normalization, ER to Relational Data Model Mapping

Physical Database Design: Implementation (e.g. Oracle, MySQL, SQL Server)

Indexes and Storage Methods

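Page 17: Integrated Genomic and Proteomic Databases

The UniBio Project

Completeness: collect every downloadable biology-related database

Integration: all data cross-reference one another, forming a single logical database

Chinese localization: provide corresponding Chinese-language data wherever possible to lower the learning barrier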

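Page 18: Integrated Genomic and Proteomic Databases

The UniBio Project

Download bioinformatics data in its original format

Convert the data format

Biological database

Biological database design

Biological database construction

Bioinformatics website

Tools: Perl, MySQL, UML, phpMyAdmin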

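Page 19: Integrated Genomic and Proteomic Databases

The UniBio Project: Development Environment

RedHat Linux 9.0 (free, stable, high performance)

MySQL (free, the fastest database in the author's assessment)

Apache (free, stable, powerful, high performance)

Perl (free, the main programming language of bioinformatics, concise code, cross-platform)

PHP (free, many built-in functions, easy to write, cross-platform)

C/C++ (free, long-established, powerful)

Java (free, can be displayed on the Web, cross-platform)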

Page 20: Integrated Genomic and Proteomic Databases

The UniBio Project

http://140.126.11.172/

Page 21: Integrated Genomic and Proteomic Databases

Genome Data Management

Workflow: Sampling -> Cloning -> Sequencing -> BLASTing -> Submitting

Data stores: Sample Database, Clone Database, cDNA Database, BLAST Report Database, GenBank Submission Files

External databases: GenBank, EMBL, DDBJ, RefSeq, TIGR TGI, UniGene

Page 22: Integrated Genomic and Proteomic Databases

Functional Genome Data Management

Workflow: Gene Expression -> Gene Expression Profile -> in silico Simulation -> in situ Verification -> in vivo Testing -> New Drug $$$

Data stores: MicroArray Database, Profile Database, Simulation Result Database, Verification Report Database, ???

Reference databases: Enzyme, KEGG, cDNA Database

Page 23: Integrated Genomic and Proteomic Databases

Pig / Native Chicken Genome Databases

Page 24: Integrated Genomic and Proteomic Databases

Pig / Native Chicken Genome Databases

Page 25: Integrated Genomic and Proteomic Databases

Pig / Native Chicken Genome Databases

Page 26: Integrated Genomic and Proteomic Databases

Pig / Native Chicken Genome Databases

Page 27: Integrated Genomic and Proteomic Databases

Pig / Native Chicken Genome Databases: BLAST Results (GenBank)

Page 28: Integrated Genomic and Proteomic Databases

Pig / Native Chicken Genome Databases: dbEST Submission

TYPE: EST
STATUS: New
CONT_NAME: Wen-Chuan Lee
CITATION:
Porcine testis EST project
LIBRARY: Porcine testis cDNA library I
EST#: PDUts1001A02
CLONE: PDUts1001A02
SOURCE: Division of Biotechnology, Animal Technology Institute Taiwan
...
SEQ_PRIMER: T7 promoter primer
HIQUAL_START: 1
HIQUAL_STOP: 306
DNA_TYPE: cDNA
PUBLIC: 12/31/2005
SEQUENCE:
CTCAACCATTGATGGAGCATATTTCTCTATTTTTAGTAGATCTAGAAAAAAATAGTATGAAGTTAGATATCCTAAGAAGAGCAATTACCGCTATTTCATTATATTTTGCTTAAAAAAAAACAAGATTATTTTAATGGATATATCAAATCCTCGTGCACGATGTACAAAAATTAAAGCACGTCTGGGGCCACAAAGCACATCTCGATGAACTCTGAATAGATAGTACCAAGCAATTAGGTTATAAATTAATACTTTACAAGAGAATTTAGAAAATTTCATAGTTGCCCAGTGTAAGCTACCTTTCTA
||
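A submission record of this kind is a list of "TAG: value" fields closed by a "||" line, so it can be loaded into a dictionary before being stored in a database. The sketch below is one possible way to do that, not the project's actual loader. It assumes one field per line, upper-case tags, and continuation lines for the citation title and the sequence. The function name parse_dbest_record is hypothetical, and the sample keeps only a few fields and the first 20 bases of the sequence.

```python
def parse_dbest_record(text: str) -> dict[str, str]:
    """Parse one 'TAG: value' record that ends with a '||' line."""
    fields: dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":  # record terminator
            break
        if not line:
            continue
        tag, sep, value = line.partition(":")
        # A new field starts with an upper-case tag such as TYPE or EST#.
        if sep and tag.replace("_", "").replace("#", "").isupper():
            current = tag
            fields[current] = value.strip()
        elif current is not None:
            # Continuation line: sequence chunks are concatenated directly,
            # other text is joined with a space.
            joiner = "" if current == "SEQUENCE" else " "
            fields[current] = (fields[current] + joiner + line).strip()
    return fields


# Shortened sample: a subset of the fields, sequence truncated to 20 bases.
sample = """\
TYPE: EST
STATUS: New
CONT_NAME: Wen-Chuan Lee
CITATION:
Porcine testis EST project
EST#: PDUts1001A02
DNA_TYPE: cDNA
SEQUENCE:
CTCAACCATT
GATGGAGCAT
||
"""
record = parse_dbest_record(sample)
print(record["EST#"], len(record["SEQUENCE"]))
```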

Page 29: Integrated Genomic and Proteomic Databases

Integrated Proteomic Database

SWISS-PROT, KEGG, PDB, PIR, MIPS/JIPID, CATH, SCOP, LIGAND, ENZYME, BRENDA, PROSITE, PRINTS, BLOCKS, Pfam, EMOTIF, Dali/FSSP, BioCyc, WIT, Siena-2DPAGE, PMMA-2DPAGE, SWISS-2DPAGE, RESID, Plasma-2DPAGE, ATIT-2DPAGE, UniProt, MassSpec

Page 30: Integrated Genomic and Proteomic Databases

2D Gel Electrophoresis

Molecular Weight Markers

Axes: Separation by Charge (pI); Separation by Molecular Weight (MW)

Page 31: Integrated Genomic and Proteomic Databases

Exploring Diseases

Detect the spots that changed.

Identify which proteins they are by PMF (Peptide Mass Fingerprinting).

They could be candidates for drug screening.

Page 32: Integrated Genomic and Proteomic Databases

2D-PAGE Example

2D123456_1.tif

Page 33: Integrated Genomic and Proteomic Databases

2D-PAGE Spot Examples

2D123456_1.out

"SSP" "MR" "PI" "TA20040301PH4~7"
"" "" "" "quantity"
0105 14.000000 0.940249 17718.58
0304 20.000000 0.100000 3015.93
0409 27.025288 2.881626 4703.69
0410 28.200542 3.015601 7963.92
0411 26.410089 3.035875 5168.19
0510 30.000000 0.100000 568.17
0610 45.000000 -1.000000 256.19
0708 70.379211 4.008969 12372.92
0709 60.177605 4.017597 60490.97
0710 71.341202 4.018401 20098.13
0711 68.146568 4.018714 25632.64
0712 57.148594 4.023514 73912.91
0713 66.000000 -1.000000 940.28
0902 116.400002 4.000000 160499.94

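A spot listing like the one on this slide can be turned into typed records before it is loaded into a gel database. The sketch below is illustrative only. It assumes one spot per line with four whitespace-separated columns (SSP, MR, PI, quantity) and two quoted header lines, which is an interpretation of the flattened slide text rather than a documented file format. Spot and parse_spot_file are hypothetical names.

```python
from typing import NamedTuple


class Spot(NamedTuple):
    ssp: str        # spot identifier, kept as text to preserve leading zeros
    mr: float       # value of the "MR" column
    pi: float       # value of the "PI" column
    quantity: float


def parse_spot_file(text: str) -> list[Spot]:
    """Collect data rows; quoted header lines and malformed lines are skipped."""
    spots = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4 or parts[0].startswith('"'):
            continue
        ssp, mr, pi, qty = parts
        spots.append(Spot(ssp, float(mr), float(pi), float(qty)))
    return spots


# Header lines and three rows taken from the slide.
sample = '''\
"SSP" "MR" "PI" "TA20040301PH4~7"
"" "" "" "quantity"
0105 14.000000 0.940249 17718.58
0304 20.000000 0.100000 3015.93
0610 45.000000 -1.000000 256.19
'''
spots = parse_spot_file(sample)
print(len(spots), spots[0])
```

Keeping SSP as a string avoids silently turning "0105" into 105, which would break lookups against the original gel analysis output.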
Page 34: Integrated Genomic and Proteomic Databases

Gel Database

A Gel UML Class Diagram for Modeling 2D-PAGE Images and Their Spots

Database: GelDB

Date: 2004/03/05

DBA: Chih-Chin Liu

Class Spot: SSP, MW, PI, Qty

Class Sample: Sample_ID, Description, Date, Qty, Method, Prepare, SampleType, Species, Organ, Tissue, Sex, Age, Genotype, Phenotype

Class Gel: Gel_ID, Expt_No, ImageFile, IPG_Strip, pH_Low, pH_High, Linear, pI_Low, pI_High, MW_Low, MW_High, Complexity, Property

Multiplicities: 1 to 1..n (two associations)

Association label: electrophoresis
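Mapping the class diagram to a relational schema might look like the sketch below. It is an illustration under stated assumptions, not the actual GelDB schema. The direction of the two one-to-many associations (one Sample to many Gels, one Gel to many Spots) is inferred, only a few attributes per class are kept, the column types are guesses, and SQLite stands in for MySQL so the example runs without a server. The inserted row values are illustrative.

```python
import sqlite3

# Assumed mapping: each 1 to 1..n association becomes a foreign key
# on the "many" side.
SCHEMA = """
CREATE TABLE Sample (
    Sample_ID   INTEGER PRIMARY KEY,
    Description TEXT,
    Species     TEXT
);
CREATE TABLE Gel (
    Gel_ID    INTEGER PRIMARY KEY,
    Sample_ID INTEGER NOT NULL REFERENCES Sample(Sample_ID),
    ImageFile TEXT,
    pH_Low    REAL,
    pH_High   REAL
);
CREATE TABLE Spot (
    Gel_ID INTEGER NOT NULL REFERENCES Gel(Gel_ID),
    SSP    TEXT NOT NULL,
    MW     REAL,
    PI     REAL,
    Qty    REAL,
    PRIMARY KEY (Gel_ID, SSP)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript(SCHEMA)

# Illustrative rows, not real project data.
conn.execute("INSERT INTO Sample VALUES (1, 'example sample', 'Sus scrofa')")
conn.execute("INSERT INTO Gel VALUES (1, 1, '2D123456_1.tif', 4.0, 7.0)")
conn.execute("INSERT INTO Spot VALUES (1, '0105', 14.0, 0.940249, 17718.58)")

rows = conn.execute(
    "SELECT s.Species, g.ImageFile, p.SSP, p.Qty "
    "FROM Spot p JOIN Gel g USING (Gel_ID) JOIN Sample s USING (Sample_ID)"
).fetchall()
print(rows)
```

The composite key (Gel_ID, SSP) reflects the assumption that SSP numbers are unique only within one gel.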

Page 35: Integrated Genomic and Proteomic Databases

MassSpec Database

Samples

MassSpec Analysis Results (.pkl)

Mascot Configuration

Mascot Query

Mascot Result (.dat)

Mascot Protein Reports

Mascot Peptide Reports

Page 36: Integrated Genomic and Proteomic Databases

MassSpec Sample

Page 37: Integrated Genomic and Proteomic Databases

MassSpec Instruments

Page 38: Integrated Genomic and Proteomic Databases

Mass Spectrum Example

MIxxxxxx.pkl

Page 39: Integrated Genomic and Proteomic Databases

Mascot Query Example

Page 40: Integrated Genomic and Proteomic Databases

Mascot Search Result

Fxxxxxx.dat

Page 41: Integrated Genomic and Proteomic Databases

MassSpec Database

A MassSpec UML Class Diagram for Modeling Mascot Search Results

Database: MassSpec

Date: 2003/12/20

DBA: Chih-Chin Liu

Class Peak: PeakMass, PeakIntensity

Class PeakList: MassMin, MassMax, IntMin, IntMax, NumPeaks

Class MassSpecResult: FileName, FileType, Instrument

Class MascotConfig: FastaVer, MascotVer, MSParserVer, Database, NumSeqs, NumResidues

Class MascotQuery: UserName, UserEmail, TaxonomyFilter, CleaveEnzyme, MissedCleave, StaticMods, ICAT, PeptideTol, PeptideTolUnit, FragmentTol, FragmentTolUnit, ChargeState, MassType, TypeOfSearch, PrecursorMass, CTermMass, NTermMass

Class MascotResult: FileName, NumHits, ExecTime, ObservedMass, ObservedCharge, ObservedMrValue, RepeatSearchString

Class MS_Protein: Accession, Description, Score, Mass, Frame, Coverage, NumPeptides

Class MS_Peptide: Query, Rank, PrettyRank, Matched, MissedCleave, MrCalc, Delta, Observed, Charge, MrExp, IonsMatched, PeptideStr, PeaksUsed1, VarModsStr, VarMods, IonsScore, SeriesUsed, PeaksUsed2, PeaksUsed3, PeptideIdTh, HomologyTh, ProbOfPep

Class 2D_PAGE_Spot

Association labels: config, query, hit, contain, associated_with

Multiplicities shown: 1 to 1, 1 to 0..*, 1 to 1..n
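The PeakList class stores summary statistics (MassMin, MassMax, IntMin, IntMax, NumPeaks) for the individual Peak rows. One way those values could be derived from a peak list is sketched below. The sketch assumes a plain text layout with one "mass intensity" pair per line, which is an assumption about the .pkl files rather than something stated on the slides. The sample peak values are invented for illustration, and PeakListSummary and summarize_peaks are hypothetical names.

```python
from typing import NamedTuple


class PeakListSummary(NamedTuple):
    mass_min: float
    mass_max: float
    int_min: float
    int_max: float
    num_peaks: int


def summarize_peaks(text: str) -> PeakListSummary:
    """Summarize 'mass intensity' lines; lines with fewer than two fields are ignored."""
    peaks = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        peaks.append((float(parts[0]), float(parts[1])))
    if not peaks:
        raise ValueError("no peaks found")
    masses = [mass for mass, _ in peaks]
    intensities = [intensity for _, intensity in peaks]
    return PeakListSummary(
        min(masses), max(masses), min(intensities), max(intensities), len(peaks)
    )


# Hypothetical peak values, for illustration only.
sample = """\
842.51 1200.0
1045.56 860.5
2211.10 430.25
"""
summary = summarize_peaks(sample)
print(summary)
```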

Page 42: Integrated Genomic and Proteomic Databases

Flowchart

*.txt, *.pkl -> Mascot Search (PMF) -> *.dat -> Mascot Parser -> MassSpec Database

Page 43: Integrated Genomic and Proteomic Databases

Proteome Data Management

Stages: Sample, 2D-PAGE, Spot, Mass Spectrum, Protein/Peptide Report

Files: *.tiff, *.out, *.pkl, *.dat

Actions: key-in, upload, upload, upload/parsing, upload/parsing

Databases: Gel Database, MassSpec Database

Page 44: Integrated Genomic and Proteomic Databases

Proteome Database

Page 45: Integrated Genomic and Proteomic Databases

Proteome Database

Page 46: Integrated Genomic and Proteomic Databases

Proteome Database

Page 47: Integrated Genomic and Proteomic Databases

Proteome Database