Modeling protein sequence evolution: Lets get real(er)!


Page 1: Modeling protein sequence evolution: Lets get real(er)!

Modeling protein sequence evolution: Lets get real(er)!

Andrew J. Roger

Dept. of Biochemistry & Molecular Biology, Dalhousie University, Halifax, N.S. Canada

Page 2: Modeling protein sequence evolution: Lets get real(er)!

Karen Li (smart summer student)

Dr. Ed Susko (Dept. of Math/Stats)

Dr. Huaichun Wang (postdoctoral fellow)

Dan Gaston (Bioinf./Comp. Biol. M.Sc. student)

Dr. Christian Blouin (Fac. of Comp Sci)

Dr. Matt Spencer (Univ. of Liverpool)

Page 3: Modeling protein sequence evolution: Lets get real(er)!

A ‘super-alignment’ of proteins

Taxa: Lactobacillus, E. coli, Human, Shiitake mush.

…STTTGHLIYKCGGIDKR…
…STTMGNLAYQLGVFDQR…
…STTVGNLAFQLGAIDAR…
…STTVGMLSYQLGAVDKR…

[Figure labels: protein g, site x, states i and j, branch e]

Probability of going from state i to j at protein g, site x, branch e: $P_{ij}$

Page 4: Modeling protein sequence evolution: Lets get real(er)!

Current phylogenetic models of protein evolution

• Codon models
– parameterized in terms of rates of interchange between synonymous and non-synonymous codons
• Models of amino acid interchange are assembled from frequencies of changes observed in large databases
– PAM, JTT, VT, mtREV, WAG, PMB
• Usually combined with a model of among-site rate variation
– e.g. JTT+Γ or JTT+Γ+invariable sites models
• Adjust the matrix to reflect the equilibrium (stationary) frequencies of amino acids in your dataset
– JTT+F+Γ

Page 5: Modeling protein sequence evolution: Lets get real(er)!

Probability of going from state i to j at protein g, site x, edge e:

$P_{ij} = [\exp(R \Pi t_e r_x)]_{ij}$, where $\Pi = \mathrm{diag}(\pi_A, \pi_R, \ldots, \pi_V)$

[Figure: tree of Human, Shiitake mushroom, E. coli and Lactobacillus with states i and j on edge e; panels over sites A-F: Uniform rate model, Rates-across-sites model, Punctuated rates-across-sites model, Covarion model; site rates r1, r2, r3]
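As a rough illustration of the formula on this slide, the sketch below computes P = exp(RΠ t_e r_x) for a toy 4-state alphabet instead of 20 amino acids. The exchangeabilities, frequencies, edge length and site rate are made-up values, and the helper names (`build_q`, `mat_exp`, `transition_probs`) are hypothetical. The matrix exponential is a plain scaling-and-squaring Taylor series, chosen only to keep the example dependency-free; it is not how any particular phylogenetics package does it.

```python
def build_q(R, pi):
    """Q = R * Pi off the diagonal; diagonal set so that every row sums to zero."""
    n = len(pi)
    Q = [[R[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])
    return Q


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, squarings=10, terms=12):
    """exp(M) via scaling and squaring with a truncated Taylor series."""
    n = len(M)
    scale = 2.0 ** squarings
    A = [[M[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, A)]  # A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probs(R, pi, t_e, r_x):
    """P = exp(R Pi t_e r_x): edge length t_e scaled by the site rate r_x."""
    Q = build_q(R, pi)
    return mat_exp([[q * t_e * r_x for q in row] for row in Q])


# Toy symmetric exchangeabilities and stationary frequencies (assumed values).
R = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
pi = [0.1, 0.2, 0.3, 0.4]
P = transition_probs(R, pi, t_e=0.5, r_x=2.0)
```

Each row of `P` is a probability distribution over end states, and for very long edges every row approaches the stationary frequencies `pi`.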

Page 6: Modeling protein sequence evolution: Lets get real(er)!

The problem…

• Such models are a DRASTIC over-simplification of what is really going on
– Average over sites, average over lineages, average across families
• Sites in proteins can change function over time
– sites under purifying selection <--> neutral <--> positive selection
• Every amino acid site in a protein has a unique structural/functional context
– Hydrophobicity, polarity, charge, dihedral angle, size, functional group… etc.
– Different sites have different exchangeabilities to different aa’s
– Different “frequencies” of aa’s occur at different sites

Page 7: Modeling protein sequence evolution: Lets get real(er)!

Probability of going from state i to j at protein g, site x, branch e:

$P_{ij} = [\exp(R \Pi l_e r_x)]_{ij}$

[Figure: tree of Human, Shiitake mushroom, E. coli and Lactobacillus with states i and j; panels over sites A-F: Uniform rate model, Rates-across-sites model, Punctuated rates-across-sites model, Covarion model; site rates r1, r2, r3]

Assumptions
– ‘fast-evolving’ positions are always fast and slow-evolving positions are always slow
– Sites (x’s) have the same rate of evolution ($r_x$) on different branches (e’s)

Page 8: Modeling protein sequence evolution: Lets get real(er)!

Changing rates of evolution at sites in different parts of the tree of life (heterotachy)

[Figure: site rates (slow vs. fast) compared between Archaebacteria EF-1α and Eukaryotes EF-1α]

Page 9: Modeling protein sequence evolution: Lets get real(er)!

Models that 'deal' with heterotachy (changing site rates across the tree)

• Covarion models (stationary)
– Tuffley and Steel (1998)
– Galtier (2001)
– Huelsenbeck (2002)
– Wang et al. (2007)
• Discrete rate-shift models
– Gu 1999, 2002
– Bivariate rates: Susko et al. (2002)
– Pupko and Galtier (2001) - LRT for diff. site rates in subtrees
– Knudsen and Miyamoto (2001)
• Mixture of edge-length models
– Kolaczkowski and Thornton (2005)
– Spencer et al. (2005)
– Zhou et al. (2007)

Page 10: Modeling protein sequence evolution: Lets get real(er)!

Probability of going from state i to j at protein g, site x, branch e:

$P_{ij} = [\exp(R \Pi l_e r_x)]_{ij}$, with $Q = R\Pi$ and $\Pi = \mathrm{diag}(\pi_A, \pi_R, \ldots, \pi_V)$

[Figure: tree of Human, Shiitake mushroom, E. coli and Lactobacillus with states i and j on branch e]

Assumptions
– different sites (x’s) and branches (e’s) all evolve according to the same general ‘rules’
– i.e. rate matrices (R’s) and frequencies (π’s) are the ‘same’ for all x and e

Page 11: Modeling protein sequence evolution: Lets get real(er)!

Hydrophobic amino acids

Page 12: Modeling protein sequence evolution: Lets get real(er)!

Hydrophobic amino acids

Acidic; Basic

Page 13: Modeling protein sequence evolution: Lets get real(er)!

Evolution of chaperonin 60 over ~1.5 billion years

Groups: Plants, Fungi, Animals, Protists, Bacteria

Example site patterns: V or L; D or E; R or K; C, V or A

Page 14: Modeling protein sequence evolution: Lets get real(er)!

Distribution of the number of different amino acid states in alignment columns

[Figure: histogram; x-axis: number of amino acid states observed at site (1 to 19); y-axis: number of sites (0 to 120); series: HSP90 protein, and data simulated under the JTT+F+Γ model on the HSP90 tree (1×10^5 sites)]
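The statistic behind this histogram is easy to reproduce. The following is a minimal sketch, not the authors' code: it counts the distinct amino acid states in each alignment column and tabulates how many columns show each count. The function names and the choice of gap characters are assumptions; the toy alignment reuses the four sequence fragments from the 'super-alignment' slide.

```python
from collections import Counter


def states_per_column(alignment, gap_chars="-?X"):
    """Number of distinct amino acid states in each alignment column."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same length")
    return [len({aa for aa in column if aa not in gap_chars})
            for column in zip(*alignment)]


def histogram(counts):
    """Map 'number of states' -> 'number of sites with that many states'."""
    return dict(sorted(Counter(counts).items()))


alignment = [
    "STTTGHLIYKCGGIDKR",
    "STTMGNLAYQLGVFDQR",
    "STTVGNLAFQLGAIDAR",
    "STTVGMLSYQLGAVDKR",
]
counts = states_per_column(alignment)
print(histogram(counts))
```

Running the same function on an observed alignment and on one simulated under a fitted model gives the two distributions that the slide compares.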

Page 15: Modeling protein sequence evolution: Lets get real(er)!

p-values from the χ² tests

Columns: Protein family (sites) | Z-test (uniformity) | χ² test (states) for RATE 1, RATE 2, RATE 3, RATE 4
Significance marks: ** p < 0.001, * p < 0.01

Numeric p-values that survive for each family (the starred cells and the column positions are not recoverable from the transcript):

EF-2 (669)
ILVD_EDD (310): 0.1954
HSP90 (459)
NuoF (405)
Glu_synth_NTN (253): 0.01174
Poty_coat (212): 0.1897
CTP synthetase (212)
SecA (203)
EF1 (361): 0.2872
tubulin (375)
HSP70 (432): 0.3127
DNA topo IV (228): 0.213
Usher (317): 0.08051
-tubulin (382): 0.01767
CPN60 (466): 0.1826, 0.04338
Carboxyl_trans (212): 0.9667, 0.04754
MreB (275): 0.4971, 0.1046, 0.02768
actin (363)
MPP (203): 0.04491, 0.2412, 0.03161, 0.3224
MCM (220): 0.6576, 0.11
Filament (210): 0.3517, 0.09121, 0.9233, 0.4505, 0.6625

Page 16: Modeling protein sequence evolution: Lets get real(er)!

How do we model the site-specific nature of protein evolution?

Use information from tertiary (3D) structure of the protein under examination:
– Parisi & Echave (2001)
– Robinson et al. (2003)
– Rodrigue et al. (2005)

‘Dayhoff’ type matrices for structural classes from databases of alignments + characterized structures:
– Lio, Goldman et al. (1998)
– Gascuel et al. (?)

Use site-specific frequency classes to parameterize a model:
– Bruno (1996)
– Lartillot et al. (2004)

Page 17: Modeling protein sequence evolution: Lets get real(er)!

Principal Components Analysis (PCA) of aa-frequency matrices (from 21 globular protein alignments)

Page 18: Modeling protein sequence evolution: Lets get real(er)!

Can be cut up into at least 4 classes

[Figure labels: D,E; G (A,S); V,I,L (M)]

Page 19: Modeling protein sequence evolution: Lets get real(er)!
Page 20: Modeling protein sequence evolution: Lets get real(er)!

A simple class frequency (cF) mixture model....

Use 4 frequency classes from PCA and add a fifth corresponding to the whole dataset frequencies (F):

$L(x_i) = \sum_{c=1}^{5} P(\pi_c) \sum_{j} P(x_i \mid r_j, \pi_c)\, P(r_j)$

where $\pi_1, \ldots, \pi_4$ are PCA-derived classes and $\pi_5 = F$

This way JTT+F+Γ is a special case of JTT+cF+Γ where $P(\pi_1) \ldots P(\pi_4) = 0$

Can do likelihood ratio test where:

$\Delta \ln L = \ln L_{JTT+cF+\Gamma} - \ln L_{JTT+F+\Gamma}$

$p\text{-value} = P(\chi^2_4 \ge 2\,\Delta \ln L)$
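The two formulas on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `cond_lik[c][j]` stands for a precomputed conditional likelihood P(x_i | r_j, π_c), which in practice would come from a pruning-algorithm pass over the tree, and all function names are hypothetical. The p-value uses the closed-form upper tail of a chi-square distribution with 4 degrees of freedom, exp(-x/2)(1 + x/2).

```python
import math


def mixture_site_likelihood(cond_lik, class_weights, rate_weights):
    """L(x_i) = sum_c P(pi_c) * sum_j P(x_i | r_j, pi_c) * P(r_j)."""
    total = 0.0
    for c, w_class in enumerate(class_weights):
        total += w_class * sum(lik * w_rate
                               for lik, w_rate in zip(cond_lik[c], rate_weights))
    return total


def chi2_sf_df4(x):
    """P(chi^2 with 4 d.f. >= x), closed form valid for 4 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def lrt_p_value(delta_lnl):
    """p-value for the cF vs. F comparison: the test statistic is 2 * delta lnL."""
    return chi2_sf_df4(2.0 * delta_lnl)


# Toy conditional likelihoods: 5 frequency classes x 4 rate categories (made up).
cond_lik = [
    [0.02, 0.05, 0.04, 0.01],
    [0.01, 0.01, 0.02, 0.02],
    [0.00, 0.01, 0.01, 0.00],
    [0.03, 0.06, 0.05, 0.02],
    [0.10, 0.20, 0.30, 0.40],
]
rate_weights = [0.25, 0.25, 0.25, 0.25]

# With all weight on the fifth class the mixture reduces to the plain +F model.
special_case = mixture_site_likelihood(cond_lik, [0, 0, 0, 0, 1], rate_weights)
p = lrt_p_value(14.17)  # smallest delta lnL among the original 21 families
```

The smallest ΔlnL among the 21 families used to derive the classes (14.17, Filament) gives p of roughly 1e-5, in line with the "<0.01" entries on the next slide.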

Page 21: Modeling protein sequence evolution: Lets get real(er)!

Likelihood ratio tests

Protein          P(1)   P(2)   P(3)   P(4)   P(F)   ΔlnL    p (df=4/df=80)

From which PCA classes were derived
Carboxyl_trans   0.095  0.05   0      0.1    0.755  61.38   <0.01*
CTP-synthetase   0.235  0.06   0.01   0.255  0.44   228.28  <0.01*
DNA topo IV      0.13   0.04   0.005  0.2    0.625  153.21  <0.01*
Filament         0.06   0      0.02   0.05   0.87   14.17   <0.01/1
Glu_synth_NTN    0.13   0.035  0.005  0.17   0.66   79.11   <0.01*
HSP70            0.165  0.025  0      0.16   0.65   132.84  <0.01*
ILVD_EDD         0.11   0.04   0.005  0.155  0.69   174.82  <0.01*
MCM              0.175  0.03   0      0.135  0.66   69.84   <0.01*
MreB             0.185  0.065  0      0.215  0.535  139.55  <0.01*
Poty_coat        0.16   0.035  0.015  0.165  0.625  115.3   <0.01*
SecA             0.2    0.05   0.015  0.225  0.51   218.56  <0.01*
Usher            0.095  0.015  0.005  0.095  0.79   73.24   <0.01*
Hsp90            0.19   0.045  0.085  0.295  0.385  269.47  <0.01*
NuoF             0.2    0.11   0.04   0.265  0.385  179.12  <0.01*
Cpn60            0.185  0.04   0.025  0.215  0.535  244.07  <0.01*
Mpp              0.125  0.025  0      0.105  0.745  70.65   <0.01*
alpha-tubulin    0.155  0.035  0.005  0.325  0.48   88.76   <0.01*
beta-tubulin     0.145  0.025  0.015  0.205  0.61   66.88   <0.01*
Actin            0.115  0.03   0.02   0.24   0.595  39.76   <0.01/0.48
EF-1alpha        0.145  0.05   0      0.205  0.6    99.74   <0.01*
EF-2             0.15   0.065  0.03   0.215  0.54   263.99  <0.01*

New datasets
enolase          0.12   0.055  0      0.19   0.635  46.12   <0.01
myoglobin        0.14   0.045  0.03   0.165  0.62   41.89   <0.01
lipoprotein      0.1    0.02   0.005  0.105  0.77   68.65   <0.01
lysozyme         0.115  0.02   0.015  0.215  0.635  18.61   <0.01

Page 22: Modeling protein sequence evolution: Lets get real(er)!

How do we model the site-specific nature of protein evolution?

Use information from tertiary (3D) structure of the protein under examination:
– Parisi & Echave (2001)
– Robinson et al. (2003)
– Rodrigue et al. (2005)

‘Dayhoff’ type matrices for structural classes from databases of alignments + characterized structures:
– Lio, Goldman et al. (1998)
– Gascuel et al. (?)

Use site-specific frequency classes to parameterize a model:
– Bruno (1996)
– Lartillot et al. (2004)

Page 23: Modeling protein sequence evolution: Lets get real(er)!

Anfinsen’s corollary

Christian B. Anfinsen, 1916-1995

The native state of the protein is the conformation of minimum energy

[Figure: energy (y-axis) over conformation ‘space’ (x-axis), with the ‘native’ state marked]

Page 24: Modeling protein sequence evolution: Lets get real(er)!

We are not the first to do this...

Simulation-based approach
• Parisi and Echave (2001) Mol. Biol. Evol. 18:750-756

Parameterized Markov Modeling approach
• Robinson et al. (2003) Mol. Biol. Evol. 20:1692-1704
– model is at the codon-level
– 'ground-breaking'
• Rodrigue et al. (2005) Gene 347:207 & (2006) Mol. Biol. Evol. 23:1762
– models at the amino acid level

Key features of the Robinson and Rodrigue models:
• Bayesian approaches - explicitly context dependent (not i.i.d.)
• difference in energy between sequence i and j on a fixed structure is used to parameterize the Q matrix
• $Q_{ij}$ --> instantaneous rate of sequence i changing to sequence j
• these are $4^n \times 4^n$ (nucleotides) or $20^n \times 20^n$ (amino acids) Q matrices where n is the number of sites (typically n > 100)......yikes.
• Use MCMC to sample character change histories
• extremely high dimensional model --> how good are the approximations??

Page 25: Modeling protein sequence evolution: Lets get real(er)!

Boltzmann’s principle

Ludwig Boltzmann, the Austrian physicist (1844-1906)

The energy of a given state is related to the probability that state is occupied at equilibrium:

$E_r = -kT \ln p_r$

$E_r$ = energy of state r
T = temperature
k = Boltzmann’s constant
$p_r$ = probability of state r
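A small numeric sketch of the relation on this slide, assuming energies are measured in units where kT = 1 by default. The function names are hypothetical. Going from probabilities to energies and back recovers the original distribution because the normalizing constant cancels.

```python
import math


def energies_from_probs(probs, kT=1.0):
    """E_r = -kT ln p_r for each state r."""
    return [-kT * math.log(p) for p in probs]


def probs_from_energies(energies, kT=1.0):
    """Inverse relation: p_r is proportional to exp(-E_r / kT)."""
    weights = [math.exp(-e / kT) for e in energies]
    z = sum(weights)  # normalizing constant (partition function)
    return [w / z for w in weights]


probs = [0.7, 0.2, 0.1]
energies = energies_from_probs(probs)
recovered = probs_from_energies(energies)
```

The most frequently occupied state gets the lowest energy, which is the link used on the following slides to turn observed frequencies in structure databases into 'energies'.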

Page 26: Modeling protein sequence evolution: Lets get real(er)!

How the ‘mean force potentials’ are derived:

Contact energy ($E^{(p)}$)

For all amino acid pairs (i,j) at each distance slice v in a database of thousands of structures:

$E^{(p)}(i, j \mid d_v) = -kT \ln \dfrac{p(i, j, d_v)}{p(d_v)}$

To get the ‘total energy’ for site x in a given structure, sum the energy contributions over all sites within a given distance threshold of x ($d_v < t$):

$E_x^{(p)} = \sum_{y \neq x,\, v} E_{xy}^{(p)}$

• Solvation energy ($E^{(s)}$) is calculated similarly
• Implemented in Sippl’s PROSA 2003 program (http://www.came.sbg.ac.at)

[Figure: residues i and j separated by distance $d_v$ around site x]
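To make the derivation concrete, here is a toy sketch that turns pair observations into energies and sums them for one site. It takes the slide's ratio literally: p(i,j,d_v) is read as the relative frequency of pair (i,j) in slice v among all observations, and p(d_v) as the relative frequency of slice v. That reading, the data, and the function names are assumptions; the reference state actually used by PROSA may be normalized differently.

```python
import math
from collections import Counter


def contact_energies(observations, kT=1.0):
    """Energies -kT ln[p(i,j,d_v) / p(d_v)] from (aa_i, aa_j, slice) observations."""
    pair_counts = Counter()
    slice_counts = Counter()
    total = 0
    for a, b, v in observations:
        pair_counts[(min(a, b), max(a, b), v)] += 1  # treat pairs as unordered
        slice_counts[v] += 1
        total += 1
    energies = {}
    for (a, b, v), n in pair_counts.items():
        p_joint = n / total
        p_slice = slice_counts[v] / total
        energies[(a, b, v)] = -kT * math.log(p_joint / p_slice)
    return energies


def site_energy(x, residues, neighbours, energies):
    """Sum pair energies between site x and each (site y, slice v) near x."""
    total = 0.0
    for y, v in neighbours[x]:
        a, b = sorted((residues[x], residues[y]))
        total += energies[(a, b, v)]
    return total


# Made-up observations: (amino acid, amino acid, distance-slice index).
observations = ([("L", "V", 0)] * 6 + [("D", "K", 0)] * 3 + [("L", "D", 0)] * 1
                + [("L", "V", 1)] * 5 + [("L", "D", 1)] * 5)
energies = contact_energies(observations)

residues = {1: "L", 2: "V", 3: "D"}    # residue at each site of a toy structure
neighbours = {1: [(2, 0), (3, 0)]}     # sites within the threshold of site 1
e_site1 = site_energy(1, residues, neighbours, energies)
```

In this toy data set the frequently observed L-V pair in slice 0 gets a lower energy than the rare L-D pair, so a structure that places L next to V scores better than one that places it next to D.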

Page 27: Modeling protein sequence evolution: Lets get real(er)!

Some details

• can measure distances between two residues from the 'backbone' carbon (Cα) or from the first side-chain carbon (Cβ)
– the latter makes more sense biochemically (but early structures sometimes did not have good resolution of side chains)
• a fast approximation to 'full energy' calculations considers one distance slice corresponding to residues in 'contact' (within ~4-6Å)
– Bastolla et al. (2005)... contact map
• Robinson et al. (2003) used 'full energy' calculation, whereas Rodrigue et al. (2005) and (2006) used Bastolla contact map based energies (how good is this?)

Page 28: Modeling protein sequence evolution: Lets get real(er)!

An ‘energy-based’ model where sites are independent

If substitution of amino acid j for i at a site x:
– increases energy --> ‘bad’ --> should occur less often
– decreases energy --> ‘good’ --> should occur more often

$Q_{ij}^{(E)} = f_j \exp\left\{ -\left[ s \left( E_x^{(s)}(j) - E_x^{(s)}(i) \right) + p \left( E_x^{(p)}(j) - E_x^{(p)}(i) \right) \right] \right\}$

where $f_j$ is a function of amino acid frequencies in the alignment, and s and p are weight parameters.

But it's not all about energy….

$Q_{ij}^{(E+JTT)} = Q_{ij}^{(E)} \, Q_{ij}^{(JTT)}$

Plus add rates, r, from a discretized gamma distribution to get the E+JTT+Γ model....
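A minimal sketch of how such a site-specific rate matrix could be assembled, using a toy 3-state alphabet. It assumes positive weights and a negative sign in the exponent, so that moves to higher energy get lower rates, as the bullets on this slide describe. The function names, the toy values, and the choice to reset the diagonal so that rows sum to zero are assumptions for illustration only.

```python
import math


def energy_rate_matrix(f, e_solv, e_pair, s, p):
    """Off-diagonal Q_ij = f_j * exp(-(s * dE_solv + p * dE_pair)) for one site."""
    n = len(f)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = s * (e_solv[j] - e_solv[i]) + p * (e_pair[j] - e_pair[i])
                Q[i][j] = f[j] * math.exp(-d)
        Q[i][i] = -sum(Q[i])  # assumption: rows of a rate matrix sum to zero
    return Q


def combine(Q_energy, Q_base):
    """Elementwise product of off-diagonal rates, e.g. energy terms times JTT."""
    n = len(Q_energy)
    Q = [[Q_energy[i][j] * Q_base[i][j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])
    return Q


# Toy per-state energies at one site (made-up numbers, arbitrary units).
f = [0.5, 0.3, 0.2]
e_solv = [0.0, 1.0, -1.0]
e_pair = [0.0, 0.5, -0.5]
Q_E = energy_rate_matrix(f, e_solv, e_pair, s=1.0, p=1.0)

# Stand-in for an empirical matrix such as JTT (values are not real JTT rates).
Q_base = [[0.0, 2.0, 1.0],
          [2.0, 0.0, 3.0],
          [1.0, 3.0, 0.0]]
Q_combined = combine(Q_E, Q_base)
```

With both weights set to zero the energy terms vanish and the off-diagonal rates reduce to the frequencies f_j.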

Page 29: Modeling protein sequence evolution: Lets get real(er)!

How do we get site specific energy differences between states?

Two approaches:

Structure
For every site x, mutate state to 19 other aa's:
…STTMGNL… --> A .. .. .. j
PROSA-2003 gives $E_x^{(s)}(j) - E_x^{(s)}(i)$ and $E_x^{(p)}(j) - E_x^{(p)}(i)$

Average
For each sequence q, for each site x, mutate to 19 other aa's, then average:
…STTTGHL…
…STTMGNL…
…STTVGNL…
…STTVGML…
PROSA-2003 gives the averaged $E_x^{(s)}(j) - E_x^{(s)}(i)$ and $E_x^{(p)}(j) - E_x^{(p)}(i)$

Page 30: Modeling protein sequence evolution: Lets get real(er)!

Performance - likelihood ratio tests

[Table residue: rows include "av. (no JTT)": 0.43, 1.00, 0.19, 0.07, -415.52 and "cF model (df=4)": 0.43, 92.24; energy types listed: average, average, average, contact; P-value (df=3): 0.000 for all eight entries]

Similar results with two other proteins -- lipoxygenase and myoglobin

Page 31: Modeling protein sequence evolution: Lets get real(er)!

Site-likelihood differences (energy model vs. JTT) versus # of contacts at site

For site x,

$\Delta \ln L_x = \ln L_x^{(energy+JTT)} - \ln L_x^{(JTT)}$

[Figure: $\Delta \ln L_x$ (y-axis) versus number of contacts (x-axis)]

Page 32: Modeling protein sequence evolution: Lets get real(er)!

Site-likelihood differences (energy model vs. JTT) versus % solvent accessibility

[Figure: $\Delta \ln L_x$ (y-axis) versus % solvent accessible (x-axis)]

Page 33: Modeling protein sequence evolution: Lets get real(er)!

lnL(energy+JTT) - lnLJTT

Energy does best!

Page 34: Modeling protein sequence evolution: Lets get real(er)!

E126

Energy does best!

Page 35: Modeling protein sequence evolution: Lets get real(er)!

Energies at 126 predict stationary amino acid frequencies better than JTT

[Figure: bar chart for site 126; x-axis: amino acids A R N D C Q E G H I L K M F P S T W Y V; y-axis: frequency (0 to 0.8); series: observed frequencies, frequencies from contact energy, frequencies from solvation energy, JTT frequencies]

Page 36: Modeling protein sequence evolution: Lets get real(er)!

lnLenergy+JTT - lnLJTT

Energy sucks

Page 37: Modeling protein sequence evolution: Lets get real(er)!

lnLenergy+JTT - lnLJTT

S306

Page 38: Modeling protein sequence evolution: Lets get real(er)!

lnLenergy+JTT - lnLJTT

S306

Page 39: Modeling protein sequence evolution: Lets get real(er)!

Energies at 306 predict site-specific amino acid stationary frequencies worse than JTT

[Figure: bar chart for site 306; x-axis: amino acids A R N D C Q E G H I L K M F P S T W Y V; y-axis: frequency (0 to 0.4); series: observed frequencies, frequencies from contact energy, frequencies from solvation energy, JTT frequencies]

Page 40: Modeling protein sequence evolution: Lets get real(er)!

S306 W302

6.55Å

Lobster enolase (1PDZ) aligned with minimized Schistosoma structure model

Page 41: Modeling protein sequence evolution: Lets get real(er)!

P306 W302

7.73Å

Lobster enolase (1PDZ) aligned with minimized Schistosoma structure model

Page 42: Modeling protein sequence evolution: Lets get real(er)!

Summary

• Traditional 'average' protein models are useful but their assumptions are often seriously violated
• Need to address:
– heterotachy
– site-specific nature of substitution process
– coevolution
– changing state frequencies over the tree
• Often SEVERAL of these factors may be important for a given protein family
– ignoring them may cause phylogenetic artefacts
• New models come with new assumptions and new problems....e.g.:
– energy models currently assume that structures do not change across species and that they are static entities
– complex models may not be identifiable (Allman and Rhodes and others)

Page 43: Modeling protein sequence evolution: Lets get real(er)!

Be careful of believing too much in our models

Page 44: Modeling protein sequence evolution: Lets get real(er)!

Acknowledgements

Group members: Gabino Sanchez-Perez, Huaichun Wang, Jessica Leigh, Daniel Gaston, Karen Li

Collaborators: Ed Susko, Matt Spencer, Christian Blouin