XIE Huimin ( 谢惠民 ) Department of Mathematics, Suzhou University and HAO Bailin ( 郝柏林 )

Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts

XIE Huimin ( 谢惠民 )

Department of Mathematics, Suzhou University

HAO Bailin ( 郝柏林 )

T-Life Research Center, Fudan University

Beijing Genomics Institute, Academia Sinica

Institute of Theoretical Physics, Academia Sinica

Prokaryote Complete Genomes (PCG)

Biology-inspired mathematics

Combinatorics: Goulden-Jackson cluster method

Factorizable language

Avoided and rare K-strings (K = 6-9, 15, 18)

Species-specificity of avoidance

Phylogeny (failure)

Phylogeny based on PCG compositional distance (success)

Decomposition and reconstruction of AA sequences

Graph theory: Euler paths

1. 2D Histogram of K-Tuples

Thermoanaerobacter tengcongensis

(K = 8)


The Algorithm (Hao Histogram) Implemented at:

National Institute of Standards and Technology (NIST)

http://math.nist.gov/~FHunt/GenPatterns/

European Bioinformatics Institute (EBI)

http://industry.ebi.ac.uk/openBSA/bsa_viewers

However, both implement only the 2D histogram; there are no 1D histograms.

Two Mathematical Problems

Dimensions of the complementary sets of portraits of tagged strings.

Number of true and redundant missing strings.

The two problems turn out to be one and the same, the first being graphic representation of the second.

Two Methods to Solve the Problem

Combinatorial solution: Goulden-Jackson cluster method (1979); number of dirty and clean words.

Language theory solution: factorizable language, minimal deterministic finite-state automaton.

2. 1D Histogram of K-Tuples

Collect those K-tuples whose counts fall within a given bin.

Plot the number of such K-tuples versus the counts.

This is a 1D histogram, or an expectation curve.
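The two steps above can be sketched in a few lines of Python. This is only an illustration, assuming a bin width of 1, overlapping K-tuples, and a sequence written over the letters a, c, g, t; the function names `count_ktuples` and `k_histogram` are hypothetical and not taken from the talk.

```python
from collections import Counter

def count_ktuples(seq, k):
    """Count every overlapping K-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def k_histogram(seq, k, alphabet="acgt"):
    """Return {n: number of distinct K-tuples that occur exactly n times}."""
    counts = count_ktuples(seq, k)
    hist = Counter(counts.values())
    # K-tuples that never occur belong to the n = 0 bin
    # (assumes seq contains only letters from `alphabet`).
    hist[0] = len(alphabet) ** k - len(counts)
    return dict(sorted(hist.items()))

example = k_histogram("acgtacgtaa", 2)
```

Plotting the values of the returned dictionary against its keys gives the 1D K-histogram of the sequence.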

The effect of G+C content on the 2D histograms of the original genome and the randomized sequence:

Escherichia coli original genome

Escherichia coli randomized sequence

Haemophilus influenzae randomized sequence

Mycobacterium leprae original genome

Mycobacterium leprae randomized sequence

Mycobacterium tuberculosis original genome

Mycobacterium tuberculosis randomized sequence

G+C Content of Some Bacteria

Species G+C Content

H. influenzae 38.15%

E. coli 50.79%

M. leprae 57.80%

M. tuberculosis 65.61%

3. Three Artificial Models Generating Sequences

Eiid: equal-probability independently and identically distributed model.

Niid: nonequal-probability independently and identically distributed model.

MMn: Markov model of order n
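A minimal sketch of the three generators, using only Python's standard library. The function names, the `trans` table layout (context string mapped to four weights in the order a, c, g, t), and the `init` starting context are assumptions made for illustration; the sketch covers Markov order 1 and higher.

```python
import random

ALPHABET = "acgt"

def eiid(length, rng):
    """Eiid: each letter drawn independently with probability 1/4."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def niid(length, probs, rng):
    """Niid: letters drawn independently with weights probs (order a, c, g, t)."""
    return "".join(rng.choices(ALPHABET, weights=probs, k=length))

def markov(length, order, trans, init, rng):
    """MMn with order >= 1: the next letter depends on the previous `order` letters.

    trans maps every context of `order` letters to four weights;
    init is the starting context and must have exactly `order` letters.
    """
    if order < 1 or len(init) != order:
        raise ValueError("order must be >= 1 and init must have `order` letters")
    seq = list(init)
    while len(seq) < length:
        context = "".join(seq[len(seq) - order:])
        seq.append(rng.choices(ALPHABET, weights=trans[context], k=1)[0])
    return "".join(seq[:length])

rng = random.Random(2002)  # fixed seed so that runs are repeatable
sample = niid(60, [0.15, 0.35, 0.35, 0.15], rng)
```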

Monte Carlo estimation of the expectation (ex) and standard deviation (sd) for an niid model

(the composition of a, c, g, t is 15:35:35:15, the sequence length is $10^6$, and K = 8.)
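The Monte Carlo estimate can be sketched as follows: generate many niid sequences, compute the K-histogram of each, and take the sample mean and sample standard deviation bin by bin. The function names, the number of runs, and the small parameters in the example call are assumptions chosen to keep the run short; they are not the settings used on the slide.

```python
import random
from collections import Counter
from statistics import mean, stdev

ALPHABET = "acgt"

def niid_sequence(length, probs, rng):
    """Draw an niid sequence with letter weights probs (order a, c, g, t)."""
    return "".join(rng.choices(ALPHABET, weights=probs, k=length))

def k_histogram(seq, k, max_n):
    """hist[n] = number of K-tuples occurring exactly n times, for n = 0..max_n."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    hist = [0] * (max_n + 1)
    hist[0] = len(ALPHABET) ** k - len(counts)  # K-tuples that never occur
    for c in counts.values():
        if c <= max_n:  # counts above max_n are dropped
            hist[c] += 1
    return hist

def monte_carlo_ex_sd(length, probs, k, runs, max_n, seed=0):
    """Estimate ex[n] and sd[n] from `runs` (>= 2) simulated sequences."""
    rng = random.Random(seed)
    samples = [k_histogram(niid_sequence(length, probs, rng), k, max_n)
               for _ in range(runs)]
    columns = list(zip(*samples))  # one column per value of n
    ex = [mean(col) for col in columns]
    sd = [stdev(col) for col in columns]
    return ex, sd

ex, sd = monte_carlo_ex_sd(length=5000, probs=[15, 35, 35, 15], k=4, runs=20, max_n=200)
```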

Validation of the Robustness of K-Histograms: a comparison of the absolute error from ex in an experiment, with sd as reference

Compare the population obtained by shuffling a given sequence with the population of sequences generated from a stochastic model.

F-test, t-test

Definition. For each $n \ge 0$, define a random variable

$X_n = I_{n,1} + I_{n,2} + \cdots + I_{n,4^K} = \sum_{i=1}^{4^K} I_{n,i},$

where the random variable $I_{n,i}$ takes the value 1 if the i-th K-tuple occurs exactly n times in the sequence, and the value 0 otherwise.

$p_{n,i} = E(I_{n,i})$

4. A Theory for the Expectation Curve (1)

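Theorem. For each $n \ge 0$, the mathematical expectation of the random variable $X_n$ is given by

$E(X_n) = p_{n,1} + p_{n,2} + \cdots + p_{n,4^K} = \sum_{i=1}^{4^K} p_{n,i},$

where $p_{n,i}$ is the probability that the K-tuple of the i-th type occurs exactly n times in the sequence.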

A Theory for the Expectation Curve (2)

The Exact Computation of Expectation Curve

In order to compute the expectation curve we need to know the probability $p_{n,i}$ for each $i \in \{1, 2, \ldots, 4^K\}$ and $n \ge 0$.

The Goulden-Jackson cluster method can be used successfully for the eiid model.

The computation is still difficult for the other models.

Two Experiments (for the eiid model):

compare with a K-histogram

compare with the Monte Carlo method

The red curves are the standard deviation estimates obtained by the Monte Carlo method.

Poisson Approximation for the Expectation Curve

For each K-tuple, calculate the expected number $\lambda_i$ of its occurrences in a sequence of length N, then apply the probability function of the Poisson distribution and sum over all K-tuples:

$E(X_n) \approx \sum_{i=1}^{4^K} e^{-\lambda_i} \frac{\lambda_i^n}{n!}, \quad n \ge 0.$

Remark. This follows from a theorem in Percus and Whitlock, ACM Transactions on Modeling and Computer Simulation, 5 (1995) 87-100 (the model, however, can only be eiid, and the tuples must be overlapless).
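A sketch of the Poisson sum for an independent-letter model, under stated assumptions: the expected count of a K-tuple is taken as (N - K + 1) times the product of its letter probabilities, and K-tuples with the same letter composition are grouped so that only a few hundred terms are summed instead of 4^K. The function name `poisson_expectation_curve` is hypothetical, and all letter probabilities must be positive.

```python
import math
from itertools import product

def poisson_expectation_curve(n_len, k, probs, max_n):
    """Poisson approximation of the expectation curve for n = 0..max_n.

    probs holds the four letter probabilities (order a, c, g, t).
    """
    positions = n_len - k + 1  # number of overlapping K-tuple positions
    curve = [0.0] * (max_n + 1)
    # For an i.i.d. model the expected count depends only on how many
    # times each letter appears in the K-tuple, so group by composition.
    for comp in product(range(k + 1), repeat=4):
        if sum(comp) != k:
            continue
        multiplicity = math.factorial(k)  # becomes the multinomial coefficient
        lam = float(positions)
        for cnt, p in zip(comp, probs):
            multiplicity //= math.factorial(cnt)
            lam *= p ** cnt
        log_lam = math.log(lam)
        for n in range(max_n + 1):
            # Poisson probability computed in log space to avoid overflow.
            log_pmf = -lam + n * log_lam - math.lgamma(n + 1)
            curve[n] += multiplicity * math.exp(log_pmf)
    return curve

curve = poisson_expectation_curve(17, 2, [0.25, 0.25, 0.25, 0.25], 50)
```

In the example call every one of the 16 two-letter tuples has expected count 1, so the curve is 16 times the Poisson(1) probability function.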

Comparison of Poisson approximation with K-histogram for U. urealyticum

Comparison of Poisson approximation with 7-histogram for Haemophilus influenzae

Comparison of Poisson approximation with 8-histogram for Haemophilus influenzae

A comparison of Poisson approximation with the Monte Carlo method

In this computation the model is an niid model whose parameters are taken from the randomized sequence of H. influenzae.

5. Analysis of the Mechanism of Multi-Modal K-Histograms

An example for H. influenzae. The length of its genome is 1830023. Under the simplified conditions

$p_a = p_t = 0.30925, \quad p_c = p_g = 0.19075$

for K = 8, there are only 9 different values of the expected count $\lambda_k$, where k is the number of letters c or g in the 8-tuple, as shown in the following list.

k            0      1     2     3     4     5     6    7    8
$\lambda_k$  153.1  94.4  58.2  35.9  22.2  13.7  8.4  5.2  3.2

The following figure shows the nine individual probability functions and their sum.

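Notice the effect of the ratio of successive modes, that is, of the expected counts $\lambda_k$ of 8-tuples containing k letters c or g:

$\frac{\lambda_{k+1}}{\lambda_k} = \frac{0.19075}{0.30925} = 0.616815 \quad \text{for } k = 0, 1, \ldots, 7.$

For E. coli the ratio is 0.968931, so the modes nearly coincide and the result is quite different.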
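The nine mode positions for H. influenzae can be checked with a few lines of Python. This is a sketch under the simplified conditions stated on the slide; the weighting of each Poisson component by the number of 8-tuples with k letters c or g, and the function names, are assumptions added for illustration.

```python
import math

N, K = 1830023, 8              # genome length of H. influenzae, tuple length
P_AT, P_CG = 0.30925, 0.19075  # p_a = p_t and p_c = p_g

# Expected count of an 8-tuple that contains k letters from {c, g}.
lambdas = [N * P_AT ** (K - k) * P_CG ** k for k in range(K + 1)]

# Number of 8-tuples with exactly k letters from {c, g}: C(8, k) * 2^8.
multiplicity = [math.comb(K, k) * 2 ** K for k in range(K + 1)]

def poisson_pmf(n, lam):
    """Poisson probability of n, computed in log space."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def expectation_curve(n):
    """Sum of the nine weighted Poisson probability functions at n."""
    return sum(m * poisson_pmf(n, lam) for m, lam in zip(multiplicity, lambdas))

ratio = P_CG / P_AT  # ratio of successive modes
print([round(lam, 1) for lam in lambdas], round(ratio, 6))
```

Because the ratio is well below 1, the nine Poisson components are centred at clearly separated counts, which is what produces several modes in the summed curve.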

6. Analysis of Short-Range Correlation by K-Histograms

Two 8-histograms for E. coli: the left one is from its genome, and the right one is from its Markov model of order 1.

Compare the 8-histograms of Markov models of orders 2 to 7 for E. coli
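One way to obtain a Markov model of a given order from a genome is to count, for every context of that many letters, how often each letter follows it, and then to sample new sequences from those counts. The sketch below assumes order 1 or higher and falls back to a uniform draw for contexts never seen in the training sequence; the function names and the fallback rule are illustrative assumptions, not the procedure used in the talk.

```python
import random
from collections import Counter, defaultdict

def fit_markov(seq, order):
    """Count which letter follows each context of `order` letters in seq."""
    trans = defaultdict(Counter)
    for i in range(len(seq) - order):
        trans[seq[i:i + order]][seq[i + order]] += 1
    return trans

def simulate_markov(trans, order, length, start, rng):
    """Generate a sequence of the given length, starting from context `start`."""
    seq = list(start)
    while len(seq) < length:
        context = "".join(seq[len(seq) - order:])
        followers = trans.get(context)
        if not followers:
            # Context never observed: fall back to a uniform letter (assumption).
            seq.append(rng.choice("acgt"))
            continue
        letters, weights = zip(*followers.items())
        seq.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(seq[:length])

model = fit_markov("acgtacgtacgt", 1)
generated = simulate_markov(model, 1, 10, "a", random.Random(1))
```

The 8-histogram of such simulated sequences can then be compared with the 8-histogram of the genome itself.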

Using a Markov model of order 5 (MM5) and the Monte Carlo method to compare the 8-histogram of E. coli's complete genome sequence with the ex and sd of MM5.

The red curve is the expectation curve estimated from 50 simulations.

This is the ratio curve

$\frac{|X_n - \mathrm{ex}|}{\mathrm{sd}}$ for $n = 0, \ldots, 200.$
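The ratio curve |X_n - ex| / sd measures, bin by bin, how many standard deviations the observed K-histogram lies from the model expectation. A minimal sketch, assuming the observed histogram and the Monte Carlo estimates ex and sd are given as lists indexed by n; the function name and the handling of bins with zero standard deviation are assumptions.

```python
import math

def ratio_curve(observed, ex, sd):
    """Return |X_n - ex_n| / sd_n for each n; NaN where sd_n is zero."""
    return [abs(x - e) / s if s > 0 else math.nan
            for x, e, s in zip(observed, ex, sd)]

ratios = ratio_curve([5, 3, 7], [3.0, 3.0, 7.0], [1.0, 2.0, 0.0])
```

Values far above 1 indicate bins where the genome deviates from the model by much more than the sampling fluctuation of the model itself.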

Reference:

Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42.

7. Discussion

Most of the results shown above are experimental in nature; many problems are left for future study.

How should the value of K be selected?

How can 1D visualization be applied to proteins?

What are the properties of the random variables $X_n$?

How can the expectation curve be computed exactly for the niid and MMn models?

Why is the Poisson approximation effective without considering the overlap of K-tuples?
