PPI Network Alignment


陳琨、朱安強、林晏禕、翁翊鐘、陳縕儂、呂哲安、楊孟翰

PROTEIN-PROTEIN INTERACTION NETWORK ALIGNMENT

Protein Biosynthesis

From DNA to life

Biology Technology

How do we measure protein interactions?
◦ Two-hybrid screens
◦ Co-immunoprecipitation

Two-hybrid screens

[Figure: two-hybrid screen; each panel shows a UAS upstream of the reporter gene (LacZ)]

A. Regular transcription of the reporter gene

B. One fusion protein only (Gal4-BD + Bait): no transcription

C. One fusion protein only (Gal4-AD + Prey): no transcription

D. Two fusion proteins with interacting Bait and Prey

Co-immunoprecipitation

[Figure: co-immunoprecipitation. Labels: known viral protein, Protein A, antibody, unknown proteins X and Y]

Protein-Protein Interaction Networks?

Proteins are nodes

Interactions are edges

Yeast PPI network

Network comparisons

Query for a module

Predict functions of a module

Predict protein functions

Validate protein interactions

Predict protein interactions

Random network

Connect each pair of nodes with probability p

The expected number of edges is pN(N-1)/2

Poisson degree distribution
◦ Nodes with high degree are rare
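A minimal sketch of the random network model, assuming N nodes labeled 0..N-1. It connects each pair independently with probability p, so the edge count can be compared with pN(N-1)/2. The function names and the fixed seed are illustrative choices, not part of the slides.

```python
import random
from itertools import combinations


def random_network(n, p, seed=0):
    """Return the edge list of a random graph on n nodes.

    Each of the n*(n-1)/2 node pairs is connected independently
    with probability p.
    """
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def expected_edges(n, p):
    """Expected number of edges, p * N * (N - 1) / 2."""
    return p * n * (n - 1) / 2


if __name__ == "__main__":
    n, p = 200, 0.05
    edges = random_network(n, p)
    print("observed edges:", len(edges))
    print("expected edges:", expected_edges(n, p))
```

Because every pair is treated alike, node degrees concentrate around p(N-1), which is why high-degree nodes are rare in this model.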

Scale-free network

Power-law degree distribution

Hubs and nodes

When a node is added to the network, it prefers to link to hubs
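The "prefers to link to hubs" rule can be sketched as preferential attachment. This is a simplified illustration that assumes each new node adds exactly one edge and picks its target with probability proportional to the target's current degree. The function names are hypothetical.

```python
import random


def preferential_attachment(n, seed=0):
    """Grow a graph to n >= 2 nodes, one new node and one new edge at a time.

    A new node links to an existing node chosen with probability
    proportional to that node's degree, so hubs keep attracting links.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints += [new, target]
    return edges


def degrees(edges):
    """Map each node to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


if __name__ == "__main__":
    deg = degrees(preferential_attachment(2000))
    print("largest degrees:", sorted(deg.values(), reverse=True)[:5])
```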

The Network Alignment Problem

Given k different protein interaction networks belonging to different species, we wish to find conserved sub-networks within these networks

Conserved in terms of protein sequence similarity (node similarity) and interaction similarity (network topology similarity)

General Framework For Network Alignment Algorithms

PATHBLAST

Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Brian P. Kelley, Roded Sharan, Richard M. Karp, Taylor Sittler, David E. Root, Brent R. Stockwell, and Trey Ideker (2003)

Protein Similarity

Homologous proteins: two proteins that have common ancestry.

Orthologous proteins: two proteins from different species that diverged after a speciation event.

Paralogous proteins: two proteins from the same species that diverged after a duplication event.

Source: Roded Sharan, Protein-protein Interaction: Network Alignment Lecture Note

PathBLAST

PathBLAST is a strategy for aligning two protein interaction networks to elucidate their conserved pathways.

This method identifies pairs of interaction paths, drawn from the networks of different species or from different processes within a species, where proteins at equivalent path positions share strong sequence homology.

Source: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 2003.

Alignment Graph

Vertical solid line: protein-protein interactions.

Horizontal dotted line: significant sequence similarity.

Node: a homologous protein pair.

Link: protein interaction relations of three types: direct, gap, and mismatch.

Source: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 2003.
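A rough sketch of how such an alignment graph could be assembled. It assumes one common reading of the three link types: "direct" when both protein pairs interact directly, "gap" when one pair interacts directly and the other only through a shared neighbor, and "mismatch" when both pairs are connected only through a shared neighbor. The function names, the input layout, and the choice to skip node pairs that reuse a protein are assumptions made for illustration.

```python
from itertools import combinations


def distance_at_most_two(adj, x, y):
    """Return 1 if x and y interact, 2 if they share a neighbor, else None."""
    if y in adj.get(x, set()):
        return 1
    if adj.get(x, set()) & adj.get(y, set()):
        return 2
    return None


def alignment_graph(adj1, adj2, homologs):
    """Build labeled links between homologous protein pairs.

    adj1, adj2: dict mapping a protein to the set of its interaction partners.
    homologs:   iterable of (a, b) pairs, a from network 1 and b from network 2.
    Returns a dict {((a, b), (a2, b2)): "direct" | "gap" | "mismatch"}.
    """
    kinds = {(1, 1): "direct", (1, 2): "gap", (2, 1): "gap", (2, 2): "mismatch"}
    links = {}
    for (a, b), (a2, b2) in combinations(sorted(homologs), 2):
        if a == a2 or b == b2:
            continue  # assumption: a link must join two distinct proteins per species
        d1 = distance_at_most_two(adj1, a, a2)
        d2 = distance_at_most_two(adj2, b, b2)
        if d1 and d2:
            links[((a, b), (a2, b2))] = kinds[(d1, d2)]
    return links
```

Allowing distance-two connections is what lets gaps and mismatches absorb interactions missing from one of the two data sets.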

Yeast & Bacteria PPI Alignment graph

The yeast and bacteria global alignment graph is compared with randomized networks obtained by permuting the protein names.

This suggests that both species share conserved interaction pathways.

“Direct” interactions are rare. “Mismatches” and “gaps” were permitted, which allows the method to overcome false negatives.


Scoring Function

p(v) is the probability of true homology within the protein pair represented by v.

q(e) is the probability that the protein-protein interactions represented by e are real.

The background probabilities are the expected values of p(v) and q(e) over the global alignment graph.
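A minimal sketch of how a path could be scored from these quantities. It assumes the log-odds form S(P) = Σ_v log10(p(v) / p_random) + Σ_e log10(q(e) / q_random), where p_random and q_random are the background probabilities. The function name and argument layout are hypothetical.

```python
import math


def path_score(p_values, q_values, p_random, q_random):
    """Log-odds score of a path through the alignment graph.

    p_values: p(v) for every node v on the path
    q_values: q(e) for every edge e on the path
    p_random, q_random: background (expected) values of p(v) and q(e)
    """
    node_term = sum(math.log10(p / p_random) for p in p_values)
    edge_term = sum(math.log10(q / q_random) for q in q_values)
    return node_term + edge_term
```

Under this form, a path scores above zero only when its nodes and edges are, on balance, more reliable than the background.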

Pathways & Protein Complexes

PathBLAST is used to find conserved paths, and then overlapping paths are merged into complexes.


Yeast vs. Bacteria: Orthologous Pathways

Select the 150 highest-scoring pathways of length four from the alignment graph.

After combining overlapping pathways, these fell into 5 network regions.

The figure on the right shows the union of 6 paths with similar function.

Solid link: direct interactions; dotted link: gaps or mismatches.


Yeast vs. Yeast: Paralogous Pathways

Proteins were not allowed to pair with themselves or their neighbors.

The 150 highest-scoring pathway alignments of length 4 from the alignment graph were analyzed.

The aligned pathways are distinct but homologous in function.


Pathway Queries

PATHBLAST identified two other well-known MAPK pathways as the highest-scoring hits, indicating that the algorithm was sufficiently sensitive and specific to identify known paralogous pathways.


Identification of Protein Complexes

Roded Sharan, Trey Ideker, Brian P. Kelley, Ron Shamir, Richard M. Karp:

Identification of Protein Complexes by Comparative Analysis of Yeast and Bacterial Protein Interaction Data.

Journal of Computational Biology 12(6): 835-846 (2005)

State-of-The-Art

Flashback

[Input] The alignment graph of 2 PPI networks.

We can already handle the problem of finding conserved linear pathways.

This is not the end: how can we go a step further?

Motivation

Finding more complex conserved structures is of practical interest.

[Reduction] We can merge overlapping paths into complexes.

Or we can develop another model to identify conserved complexes.

A New Model: The Main Idea

How do you recognize protein complexes?
◦ Dense Subgraphs
◦ Comparative Analysis

"When I use a word," Humpty Dumpty said in a rather a scornful tone, "it means just what I choose it to mean -- neither more nor less."

Lewis Carroll, Through the Looking-Glass

Dense Subgraph: Likelihood

Likelihood Formula 0.1: given an induced subgraph C = (Vc, Ec),
◦ L(C) = |Ec| / { ½ * |Vc| * ( |Vc| - 1 ) }

It makes sense: graphs with more edges have higher likelihood.
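Formula 0.1 is the edge density of the induced subgraph. A small sketch, with a hypothetical function name, that counts the edges falling inside a node set and divides by the number of possible pairs:

```python
def density(nodes, edges):
    """Edge density |Ec| / (0.5 * |Vc| * (|Vc| - 1)) of an induced subgraph.

    nodes: the vertex set Vc
    edges: edges of the whole network as (u, v) pairs; only edges with both
           endpoints in `nodes` count toward Ec.
    """
    nodes = set(nodes)
    induced = {
        frozenset(e) for e in edges if set(e) <= nodes and len(set(e)) == 2
    }
    possible = len(nodes) * (len(nodes) - 1) / 2
    return len(induced) / possible if possible else 0.0
```

A triangle scores 1 and a three-node path scores 2/3, which matches the intuition that more edges mean a higher value.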

Dense Subgraph: Likelihood (Cont.)


We only consider the structure of graphs.

Problems of link analysis are often data-dependent.



Likelihood Formula 0.2: given an induced subgraph,

What the hell is it?


What do you expect of the behavior of the revised formula?

Higher likelihood: the scores of dense graphs are higher.

Adjustment: the weakest link
◦ Bonus: an interaction with low probability occurs.


Higher likelihood: The scores of dense graphs are higher.

We assume that every 2 proteins in a complex interact with some probability p (0.8 is used in this work).

We can use the model as a baseline for comparing density.


Adjustment: The weakest link!

p(u,v) is defined to be the fraction of graphs in FG that include this edge.
◦ FG: the family of graphs with the same vertex set V and the same degree sequence.

Edges incident on vertices with higher degrees have higher probability.



What the hell is it?
◦ For p(u,v) = 0.2, we have 4 and ¼ on the two sides.
◦ For p(u,v) = 0.6, we have 4/3 and 1/2 on the two sides.
◦ It makes sense! We emphasize the weakest link.
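The numbers above are consistent with a likelihood ratio whose per-pair factor is 0.8 / p(u,v) when the edge is present and (1 - 0.8) / (1 - p(u,v)) when it is absent. For p(u,v) = 0.2 this gives 4 and ¼, and for p(u,v) = 0.6 it gives 4/3 and 1/2. The sketch below assumes that form. The name `beta` for the 0.8 complex-model probability and the callable `p_null` are introduced here for illustration.

```python
import math
from itertools import combinations


def factors(p, beta=0.8):
    """Per-pair likelihood-ratio factors: (edge present, edge absent)."""
    return beta / p, (1 - beta) / (1 - p)


def log_likelihood_ratio(nodes, edges, p_null, beta=0.8):
    """Log of the product of per-pair factors over all node pairs.

    nodes:  vertices of the candidate complex
    edges:  observed interactions as (u, v) pairs
    p_null: function (u, v) -> p(u, v), the edge probability under the
            degree-preserving random model
    beta:   assumed interaction probability inside a complex
    """
    present = {frozenset(e) for e in edges}
    score = 0.0
    for u, v in combinations(sorted(nodes), 2):
        on, off = factors(p_null(u, v), beta)
        score += math.log(on if frozenset((u, v)) in present else off)
    return score
```

An observed edge between two low-degree proteins (small p(u,v)) earns a large bonus, while an edge between two hubs adds little. This is the sense in which the weakest link is emphasized.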





The Main Idea Revisited

How do you recognize protein complexes?
◦ Dense Subgraphs

We have a revised formula for density in a PPI network.

◦ Comparative Analysis

Comparative Analysis

Idea: If a structure occurs in different species, it is likely to be a meaningful structure.

How do you define dense substructures on alignment graphs?

Comparative Analysis (Cont.)

Consider two subsets U1 = {u1, ..., uk} and V2 = {v1, ..., vk}, and a many-to-many correspondence Θ: U1 → V2.

Since you already have

You may derive the formula 1.1 as follows:

Does it make sense?

Comparative Analysis (Cont.)

Θ is useful information:

You have the formula 1.2:

{ A/(A+B) }/ {X/(X+Y)}




◦ Comparative Analysis

We have a revised formula for density in an alignment network.

Search the Complexes

Now we only need to find heavy subgraphs in the alignment graph.

The problem is NP-hard.

Search the Complexes (Cont.)

[Seed] Compute a seed around each node v.

[Restrict the Size] Keep seeds small!

[Refined Seed] Enumerate all subsets of the seed that have size 3 and contain v.

[Local Search] Iteratively modify the refined seed.

[Output Heavy Subgraphs] For each node, we record at most k heaviest subgraphs.

[Filter Overlapping Ones] A greedy method is used!
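The local-search and filtering steps can be sketched as follows. This is an illustrative simplification, not the authors' procedure: it assumes a caller-supplied `score` function over node sets, moves by adding or removing a single node, and treats two subgraphs as overlapping when their shared nodes exceed a fraction of the smaller one. The names `max_size` and `max_overlap` and their defaults are hypothetical.

```python
def local_search(graph, seed, score, max_size=15):
    """Greedily add or remove one node at a time while the score improves.

    graph: dict mapping each node to the set of its neighbors
    seed:  starting node set (for example a refined seed of size 3)
    score: function taking a set of nodes and returning a number
    """
    current = set(seed)
    best = score(current)
    improved = True
    while improved:
        improved = False
        frontier = set().union(*(graph[v] for v in current)) - current
        candidates = [current | {v} for v in frontier if len(current) < max_size]
        candidates += [current - {v} for v in current if len(current) > 2]
        for cand in candidates:
            s = score(cand)
            if s > best:
                best, best_set, improved = s, cand, True
        if improved:
            current = best_set
    return current, best


def filter_overlapping(scored, max_overlap=0.8):
    """Greedy filter: visit subgraphs from heaviest to lightest and keep one
    only if it does not overlap too much with any subgraph already kept."""
    kept = []
    for nodes, s in sorted(scored, key=lambda item: -item[1]):
        if all(
            len(nodes & k) / min(len(nodes), len(k)) <= max_overlap
            for k, _ in kept
        ):
            kept.append((nodes, s))
    return kept
```

Because the underlying problem is NP-hard, a search like this only guarantees a local optimum. Starting it from many seeds, one per node, is what gives coverage of the graph.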




◦ Comparative Analysis

We have a revised formula for density in an alignment network.

Finally, we have a practical method to search for complexes!

PATH QUERIES

Path Queries

Problem definition

Input
◦ A target network represented as an undirected weighted graph G(V, E), with a weight function on the edges w: E → R
◦ A path query Q = (q1, …, qk)
◦ A scoring function of node similarity H: Q×V → R

Output: a set of best matching pathways P=(p1,…,pl) in G, where a good match is measured in two respects:

1. The matched nodes are similar by scoring function H.

2. The reliability of edges in the matched pathway is high.

Algorithm

1. Introduce a mapping M from Q to P∪{0} where deleted query nodes are mapped to 0 by M.

2. Path Scoring:
• interaction score and sequence score

$$\sum_{i=1}^{l-1} w(p_i, p_{i+1}) + \sum_{\substack{1 \le i \le k \\ M(q_i) \neq 0}} H(q_i, M(q_i))$$

Interaction score
◦ Edge weights represent the logarithm of the reliability of the interaction between two proteins.

Sequence score
◦ BLAST E-value for the two proteins, normalized by the maximal E-value over all pairs of proteins from the two networks.

Algorithm

Avoiding cycles
◦ N. Alon, R. Yuster, and U. Zwick: Color-coding. J. ACM, 1995.

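Finding the best matching paths:

$$W(i, j, S, \theta_{del}) = \max_{m \in V} \begin{cases} W(i-1, m, S - c(j), \theta_{del}) + w(m, j) + H(q_i, j), & (m, j) \in E \\ W(i, m, S - c(j), \theta_{del}) + w(m, j), & (m, j) \in E \\ W(i-1, j, S, \theta_{del} - 1), & \theta_{del} \le N_{del} \end{cases}$$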
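A simplified sketch of the color-coding idea behind this dynamic program. It assumes no insertions or deletions, so every query node is matched to exactly one network node and the deletion counter is dropped. Vertices receive random colors, the table is indexed by the last matched vertex and the set S of colors used so far, and a path is extended only with a vertex whose color c(j) is not yet in S, which rules out cycles. The function names and the number of trials are illustrative choices.

```python
import random


def best_colorful_path(adj, w, H, query, colors):
    """Best-scoring path that matches `query` node by node and uses every
    color at most once (hence never revisits a vertex).

    adj: dict vertex -> iterable of neighbors;  w(m, j): edge weight
    H(q, j): similarity of query node q and vertex j
    colors: dict vertex -> color
    Returns (score, path) or (-inf, ()) if no colorful path exists.
    """
    layer = {
        (j, frozenset([colors[j]])): (H(query[0], j), (j,)) for j in adj
    }
    for q in query[1:]:
        nxt = {}
        for (m, used), (score, path) in layer.items():
            for j in adj[m]:
                if colors[j] in used:
                    continue  # color already on the path: skip to avoid cycles
                cand = score + w(m, j) + H(q, j)
                key = (j, used | {colors[j]})
                if key not in nxt or cand > nxt[key][0]:
                    nxt[key] = (cand, path + (j,))
        layer = nxt
    return max(layer.values(), default=(float("-inf"), ()))


def query_path(adj, w, H, query, trials=100, seed=0):
    """Repeat random colorings and keep the best path found."""
    rng = random.Random(seed)
    k = len(query)
    best = (float("-inf"), ())
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        best = max(best, best_colorful_path(adj, w, H, query, colors))
    return best
```

A given simple path of k vertices is colorful in one random coloring with probability k!/k^k, so repeating the coloring drives the chance of missing the optimum down geometrically.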

Dataset and Results

Yeast and fly PPI networks

◦ The yeast (S. cerevisiae) PPI network contains 4,726 proteins and 15,166 known interactions between them.

◦ The fly (D. melanogaster) PPI network contains 7,028 proteins and 22,837 interactions.

271 pathways were discovered in the yeast PPI network that scored better than 99% of randomly chosen pathways; these were then used as queries against the fly PPI network.

Results

APPLICATION OF PPI NETWORK ALIGNMENT: ORTHOLOGY MAPPING

S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006

Introduction

Annotating protein function across species is often complicated by the presence of paralogous proteins

Most methods for dealing with this problem are sequence-based: protein sequences from different species are compared to find groups of proteins that share the same functional annotation

A protein and its functional ortholog are likely to interact with proteins in their respective networks that are themselves functional orthologs

This work introduced a strategy for identifying functionally related proteins that supplements sequence-based comparisons with information on conserved protein-protein interactions

Introduction (cont’d)

[Figure: proteins a and b and their counterparts a’ and b’ in the two networks]

Functional orthology

When the protein in question has similarity to not one but many paralogous proteins, it is harder to distinguish which of these is the true ortholog, the protein that is directly inherited from a common ancestor

Definite functional orthologs are defined as proteins that are functionally equivalent as a result of direct ancestry

Model review

The protein interaction networks of two species are aligned by assigning proteins to sequence homology groups using the Inparanoid algorithm

Networks are aligned into a merged graph representation

Probabilistic inference is performed on the aligned networks to identify pairs of proteins, one from each species, that are likely to retain the same function based on conservation of their interacting partners

A logistic function is used to compute the probability of functional orthology for a protein pair i given the states of functional orthology for its network neighbors

The previous probability is updated for each pair over successive iterations of Gibbs sampling

Model review (cont’d)

Conservation index

Consider an alignment graph G

◦Nodes represent sequence-similar protein pairs

◦Edges link nodes (a, b) and (a’, b’) if one of (a, a’) or (b, b’) directly interacts, and the other interacts via a neighbor, which is directly connected to them

◦An edge is strongly conserved if its endpoints are true functional orthologs

Conservation index (cont’d)

$$c(i) = \frac{2\,d(i)}{d(a) + d(b)}$$

c(i): the conservation index of a node i

d(i): the number of strongly conserved links involving node i

d(a): the degree of protein a in its network

d(b): the degree of protein b in its network
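A one-function sketch of the conservation index, assuming the form c(i) = 2 d(i) / (d(a) + d(b)) for a node i = (a, b). The function name and the zero-degree convention are choices made here for illustration.

```python
def conservation_index(conserved_links, degree_a, degree_b):
    """Conservation index c(i) = 2 * d(i) / (d(a) + d(b)).

    conserved_links: d(i), strongly conserved links involving node i = (a, b)
    degree_a:        d(a), degree of protein a in its own network
    degree_b:        d(b), degree of protein b in its own network
    Returns 0.0 when both proteins have no interactions (a convention
    assumed here to avoid dividing by zero).
    """
    total = degree_a + degree_b
    return 2 * conserved_links / total if total else 0.0
```

For example, a pair whose proteins have degrees 3 and 5 and share one strongly conserved link gets c(i) = 2 / 8 = 0.25.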

Probabilistic model

The probability of functional orthology for a pair of proteins is influenced by the probabilities of functional orthology for their network neighbors, which in turn depend on their network neighbors, and so on

This type of probabilistic model is known as a Markov random field

Probabilistic model (cont’d)

Positive training examples: the definite functional orthologs having at least one conserved interaction

Negative training examples: a protein paired with its best BLAST E-value match that the Inparanoid algorithm does not place in the same cluster

$$P(z_i \mid Z_{N(i)}) = \frac{1}{1 + \exp\{-\alpha - \beta\, c(i)\}}$$

z_i: the state of node i

N(i): the set of neighbors of node i

Z_{N(i)}: the set of all z_j such that j ∈ N(i)

Parameters α and β are optimized by maximizing the product of P(z_i | Z_{N(i)}) over all positive training examples and (1 − P(z_i | Z_{N(i)})) over all negative training examples

Orthology inference

The above model was used to estimate the final posterior probabilities P(zi) using Gibbs sampling

Nodes representing ambiguous functional orthologs are each assigned a temporary state z=0 or z=1, initially at random

At each iteration, a node i is sampled (with replacement) and its value of zi is updated given the states of its neighbors, ZN(i). The new value of zi is set to 0 or 1 with probability P(zi|ZN(i))

Over all iterations, the nodes designated as definite functional orthologs and non-orthologs are forced to states of 1 and 0, respectively
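A compact sketch of this sampling loop, under several assumptions that are not in the slides: the logistic form P(zi = 1 | ZN(i)) = 1 / (1 + exp(-α - β c(i))), a caller-supplied `conservation` function that recomputes c(i) from the current neighbor states, and a burn-in period before states are averaged. All names and default values are hypothetical, and at least one ambiguous node is assumed.

```python
import math
import random


def p_ortholog(c_i, alpha, beta):
    """Logistic probability of functional orthology given c(i)."""
    return 1.0 / (1.0 + math.exp(-alpha - beta * c_i))


def gibbs(neighbors, fixed, conservation, alpha, beta,
          iters=2000, burn_in=500, seed=0):
    """Estimate posterior P(z_i = 1) for the ambiguous nodes.

    neighbors:    dict node -> list of neighboring nodes
    fixed:        dict node -> 1 (definite ortholog) or 0 (non-ortholog);
                  these states are never resampled
    conservation: function (node, state) -> c(i) under the current states
    """
    rng = random.Random(seed)
    free = [v for v in neighbors if v not in fixed]
    state = dict(fixed)
    state.update({v: rng.randint(0, 1) for v in free})  # random initial states
    counts = {v: 0 for v in free}
    for it in range(iters):
        v = rng.choice(free)  # sample a node with replacement
        p = p_ortholog(conservation(v, state), alpha, beta)
        state[v] = 1 if rng.random() < p else 0
        if it >= burn_in:
            for u in free:
                counts[u] += state[u]
    kept = iters - burn_in
    return {v: counts[v] / kept for v in free}
```

Averaging the sampled states after burn-in gives the fraction of time each ambiguous pair spends in state 1, which serves as its estimated posterior probability.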

Experimental results

A total of 2244 clusters were generated by the Inparanoid algorithm, covering 2834 proteins in yeast and 3881 proteins in fly

Of these, 1552 clusters contained only a single yeast and fly protein pair and were assumed to represent definite functional orthologs

They applied the above method to resolve the remaining 692 clusters, which were assumed to represent ambiguous functional orthologs, and found that 121 contained at least one protein pair with conserved interactions between networks

In 60 of these, the highest probability was assigned to the protein pair that was also the most sequence-similar via BLAST

Experimental results (cont’d)

Conclusion

These findings confirm that yeast/fly proteins classified as definite functional orthologs are more likely to have equivalent functional roles in the protein network

The conserved network context could be used to help discriminate functional orthology from general sequence similarity

MULTIPLE NETWORK ALIGNMENT

R. Sharan, S. Suthram, R.M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R.M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species. PNAS, 102(6):1974–1979, 2005

The alignment graph

Each node in this graph consists of a group of sequence-similar proteins, one from each species

Each link between a pair of nodes in the alignment graph represents conserved protein interactions between the corresponding protein groups

A search over the alignment graph is performed to identify:
1. Short linear paths of interacting proteins, which model signal transduction pathways
2. Dense clusters of interactions, which model protein complexes

The alignment graph (cont’d)

Experimental results

They applied the multiple network alignment framework to three PPI networks:
◦ Yeast: 14319 interactions among 4389 proteins
◦ Worm: 3926 interactions among 2718 proteins
◦ Fly: 20720 interactions among 7038 proteins

It identified 183 protein clusters and 240 paths conserved at a significance level of P < 0.01; groups of conserved clusters overlap to define 71 distinct network regions

Experimental results (cont’d)

Prediction of protein function

Whenever the set of proteins in a conserved cluster or path (over all species) was significantly enriched for a particular GO annotation and at least half of the proteins in the cluster or path had that annotation, all remaining proteins in the sub-network were predicted to have that annotation
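The prediction rule can be sketched as a small function. The enrichment test itself is not implemented here: the caller passes its outcome as the boolean `enriched`. The function name, argument layout, and the placeholder term label used in the usage note are hypothetical.

```python
def predict_annotations(members, annotations, term, enriched):
    """Return the proteins newly predicted to carry `term`.

    members:     proteins in a conserved cluster or path (over all species)
    annotations: dict protein -> set of known GO terms
    term:        the GO annotation being considered
    enriched:    True if a separate statistical test found the cluster
                 significantly enriched for `term`
    """
    annotated = [p for p in members if term in annotations.get(p, set())]
    # Rule: significant enrichment AND at least half of the members annotated.
    if enriched and 2 * len(annotated) >= len(members):
        return [p for p in members if p not in annotated]
    return []
```

With four members of which two carry a placeholder term such as "GO:X", and an enrichment test that passed, the remaining two proteins are predicted to carry the term as well.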

Fast and accurate alignment of multiple PPI networks

By Maxim Kalaev, Vineet Bafna, and Roded Sharan, 2007

Drawback of the alignment graph: exponential growth of the graph with the number of species

They introduced a new algorithm avoiding the explicit representation of every set of potentially orthologous proteins, thereby reducing time and memory requirements

The layered alignment graph (1/3)

Given k PPI networks (for k species respectively)

A layered alignment graph: each layer corresponds to a species and contains the corresponding network. Additional edges connect proteins from different layers if they are sequence similar

A k-spine: a sub-graph of size k which includes a vertex from each of the layers. A k-spine corresponds to a set of truly orthologous proteins

A collection of connected k-spines induces a candidate conserved sub-network

The layered alignment graph (2/3)

[Figure: layers for Species 1, Species 2, ..., Species k with vertex sets U1, U2, U3, ..., Uk; a k-spine U[3]; inter-layer edges and PPI edges]

The layered alignment graph (3/3)

Consider every k-spine to be a node in a graph

An m-subnet: a collection U of k multi-sets Ui = {ui[1], …, ui[m]}
◦ For all 1 ≦ i ≦ k and 1 ≦ j ≦ m, ui[j] belongs to Vi
◦ For all 1 ≦ j ≦ m, the set U[j] = {u1[j], u2[j], …, uk[j]} is a k-spine

The task is to look for high scoring m-subnets, for a fixed m
