PPI Network Alignment
陳琨、朱安強、林晏禕、翁翊鐘、陳縕儂、呂哲安、楊孟翰
PROTEIN-PROTEIN INTERACTION NETWORK ALIGNMENT
Protein Biosynthesis
From DNA to life
Biology Technology
How do we measure protein interactions?
◦ Two-hybrid screens
◦ Co-immunoprecipitation
Two-hybrid screens
A. Regular transcription of the reporter gene
B. One fusion protein only (Gal4-BD + Bait) – no transcription
C. One fusion protein only (Gal4-AD + Prey) – no transcription
D. Two fusion proteins with interacting Bait and Prey
[Figure labels, each panel: UAS, reporter gene (LacZ)]
Co-immunoprecipitation
[Figure labels: known viral protein, Protein A, antibody, unknown protein, X, Y]
Protein-Protein Interaction Networks?
Proteins are nodes
Interactions are edges
Yeast PPI network
Network comparisons
Query for a module
Predict functions of a module
Predict protein functions
Validate protein interactions
Predict protein interactions
Random network
Connect each pair of nodes with probability p
The expected number of edges is pN(N-1)/2
The degree distribution is approximately Poisson
◦ Nodes with high degree are rare
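A minimal Python sketch of the random-network model, assuming N nodes labelled 0..N-1; the function names `expected_edges` and `random_network` are illustrative only.

```python
import random
from itertools import combinations

def expected_edges(n: int, p: float) -> float:
    """Expected edge count when each of the n(n-1)/2 pairs is linked with probability p."""
    return p * n * (n - 1) / 2

def random_network(n: int, p: float, seed: int = 0) -> set:
    """Sample one random network as a set of undirected edges (u, v) with u < v."""
    rng = random.Random(seed)
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

sampled = random_network(200, 0.05)
print(len(sampled), expected_edges(200, 0.05))
```

The sampled edge count should land close to the expected value, and most nodes end up with a degree near the mean p(N-1).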
Scale-free network
Power-law degree distribution
Hubs and nodes
When a node is added to the network, it prefers to link to hubs
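The "prefers to link to hubs" rule can be sketched as a simple growth process. This is only an illustration, assuming each new node adds exactly one link and picks its target with probability proportional to the target's current degree; `preferential_attachment` is a hypothetical name.

```python
import random
from collections import Counter

def preferential_attachment(n: int, seed: int = 0) -> list:
    """Grow a network node by node; each new node links to one existing node
    chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]        # start from a single edge
    endpoints = [0, 1]      # every node appears once per incident edge
    for new in range(2, n):
        target = rng.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints += [new, target]
    return edges

grown = preferential_attachment(2000)
degree = Counter(v for e in grown for v in e)
print("largest degree:", max(degree.values()))
```

Sampling from the list of edge endpoints is what makes the choice degree-proportional: a node with degree d appears d times in that list.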
The Network Alignment Problem
Given k different protein interaction networks belonging to different species, we wish to find conserved sub-networks within these networks
Conserved in terms of protein sequence similarity (node similarity) and interaction similarity (network topology similarity)
General Framework For Network Alignment Algorithms
PATHBLAST
Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Brian P. Kelley, Roded Sharan, Richard M. Karp, Taylor Sittler, David E. Root, Brent R. Stockwell, and Trey Ideker (2003)
Protein Similarity
Homologous proteins: two proteins that have common ancestry.
Orthologous proteins: two proteins from different species that diverged after a speciation event.
Paralogous proteins: two proteins from the same species that diverged after a duplication event.
Source: Roded Sharan, Protein-protein Interaction: Network Alignment Lecture Note
PathBLAST
PathBLAST is a strategy for aligning two protein interaction networks to elucidate their conserved pathways.
This method identifies pairs of interaction paths, drawn from the networks of different species or from different processes within a species, where proteins at equivalent path positions share strong sequence homology.
Source: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 2003.
Alignment Graph
Vertical solid line: protein-protein interactions.
Horizontal dotted line: significant sequence similarity.
Node: a homologous protein pair.
Link: protein interaction relations of three types: direct, gap, and mismatch.
Source: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 2003.
Yeast & Bacteria PPI Alignment Graph
Comparison of the yeast and bacteria global alignment graph with alignment graphs of randomized networks obtained by permuting the protein names.
This suggests that both species share conserved interaction pathways.
"Direct interactions" are rare. "Mismatches" and "gaps" were permitted, which helps to overcome false negatives.
Source: Roded Sharan, Protein-protein Interaction: Network Alignment Lecture Note
Scoring Function
p(v) is the probability of true homology within the protein pair represented by v.
q(e) is the probability that the protein-protein interactions represented by e are real.
The background probabilities are the expected values of p(v) and q(e) over the global alignment graph.
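The scoring formula itself is not preserved in this transcript. The sketch below assumes a log-odds form, summing log10(p(v)/p_background) over the nodes of a path and log10(q(e)/q_background) over its edges; this form and the name `path_score` are assumptions, and the probabilities in the example are made up.

```python
import math

def path_score(p_values, q_values, p_background, q_background):
    """Log-odds score of one path in the alignment graph (assumed form).

    p_values: p(v) for each node v on the path
    q_values: q(e) for each edge e on the path
    p_background, q_background: expected p(v) and q(e) over the whole graph
    """
    node_term = sum(math.log10(p / p_background) for p in p_values)
    edge_term = sum(math.log10(q / q_background) for q in q_values)
    return node_term + edge_term

# Hypothetical path with four nodes and three edges.
print(path_score([0.9, 0.8, 0.9, 0.7], [0.6, 0.5, 0.6], 0.1, 0.1))
```

Under this form a path scores above zero only when its nodes and edges are, on balance, more reliable than the background.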
Pathways & Protein Complexes
PathBLAST is used to find conserved paths, and then overlapping paths are merged into complexes.
Source: Roded Sharan, Protein-protein Interaction: Network Alignment Lecture Note
Yeast vs. Bacteria
Orthologous Pathways
Select the 150 highest-scoring pathways of length four from the alignment graph.
After combining overlapping pathways, these fell into 5 network regions.
The figure on the right shows the union of 6 paths with similar function.
Solid link: direct interactions; dotted link: gaps or mismatches.
Source: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 2003.
Yeast vs. Yeast
Paralogous Pathways
Proteins were not allowed to pair with themselves or their neighbors.
Analyzed the 150 highest-scoring pathway alignments of length 4 from the alignment graph.
Distinct alignments, but homologous in function.
Source: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 2003.
Pathway Queries
PATHBLAST identified two other well-known MAPK pathways as the highest-scoring hits, indicating that the algorithm was sufficiently sensitive and specific to identify known paralogous pathways.
Source: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. PNAS, 2003.
Identification of Protein Complexes
Roded Sharan, Trey Ideker, Brian P. Kelley, Ron Shamir, Richard M. Karp:
Identification of Protein Complexes by Comparative Analysis of Yeast and Bacterial Protein Interaction Data.
Journal of Computational Biology 12(6): 835-846 (2005)
State-of-The-Art
Flashback
[Input] The alignment graph of 2 PPI networks.
We can already handle the problem of finding conserved linear pathways.
But this is not the end: how can we go a step further?
Motivation
Finding more complex conserved structures is of practical interest.
[Reduction] Now we can merge overlapping paths into complexes.
Or we can develop another model to identify conserved complexes.
A New Model: The Main Idea
How do you recognize protein complexes?
◦ Dense Subgraphs
◦ Comparative Analysis
"When I use a word," Humpty Dumpty said in rather a scornful tone, "it means just what I choose it to mean -- neither more nor less."
Lewis Carroll, Through the Looking-Glass
Dense Subgraph: Likelihood
Likelihood Formula 0.1: given an induced subgraph,
◦ L(C) = |Ec| / { ½ * |Vc| * ( |Vc| - 1 ) }
It makes sense: graphs with more edges have higher likelihood.
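Formula 0.1 is plain edge density. A small sketch, assuming edges are given as pairs of vertex names; `density` is an illustrative name.

```python
def density(vertices, edges):
    """Formula 0.1: fraction of vertex pairs in C that are joined by an edge."""
    vs = set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0
    # Keep only edges whose two distinct endpoints both lie in the vertex set.
    induced = {frozenset(e) for e in edges if set(e) <= vs and len(set(e)) == 2}
    return len(induced) / (0.5 * n * (n - 1))

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(density(["a", "b", "c"], triangle))        # 1.0
print(density(["a", "b", "c", "d"], triangle))   # 0.5
```

Adding the isolated vertex d halves the score, since only 3 of the 6 possible pairs are then connected.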
Dense Subgraph: Likelihood (Cont.)
We only consider the structure of graphs.
Problems of link analysis are often data-dependent.
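Dense Subgraph: Likelihood (Cont.)
Likelihood Formula 0.1: given an induced subgraph,
◦ L(C) = |Ec| / { ½ * |Vc| * ( |Vc| - 1 ) }
Likelihood Formula 0.2: given an induced subgraph,
\[ L(C) = \prod_{(u,v) \in E_c} \frac{\beta}{p(u,v)} \prod_{(u,v) \notin E_c} \frac{1-\beta}{1-p(u,v)} \]
(β: the probability that two proteins in a complex interact; p(u,v): the probability of the edge (u,v) in a random graph)
What the hell is it?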
Dense Subgraph: Likelihood (Cont.)
What do you expect about the behavior of the revised formulas?
Higher likelihood: the scores of dense graphs are higher.
Adjustment: the weakest link
◦ Bonus: an interaction with low probability occurs.
Dense Subgraph: Likelihood (Cont.)
Higher likelihood: the scores of dense graphs are higher.
We assume that every 2 proteins in a complex interact with some probability β (0.8 is used in this work).
We can use the model as a baseline for comparing density.
Dense Subgraph: Likelihood (Cont.)
Adjustment: the weakest link!
p(u,v) is defined to be the fraction of graphs in FG that include this edge.
◦ FG: the family of graphs with vertex set V and the same degree sequence.
Edges incident on vertices with higher degrees have higher probability.
Dense Subgraph: Likelihood (Cont.)
Likelihood Formula 0.2: given an induced subgraph,
What the hell is it?
◦ For p(u,v) = 0.2, a pair contributes a factor of 4 if it interacts and ¼ if it does not.
◦ For p(u,v) = 0.6, a pair contributes a factor of 4/3 if it interacts and 1/2 if it does not.
◦ It makes sense! We emphasize the weakest link.
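The numbers above can be reproduced under one assumption: that Formula 0.2 multiplies β/p(u,v) for every interacting pair and (1-β)/(1-p(u,v)) for every non-interacting pair, with β = 0.8. This form is inferred from the values quoted on the slide, not read from the formula image. The names `pair_factor` and `likelihood_ratio` are illustrative.

```python
from fractions import Fraction

BETA = Fraction(8, 10)   # assumed interaction probability inside a complex

def pair_factor(p_uv, has_edge, beta=BETA):
    """Factor that one vertex pair contributes to the likelihood ratio."""
    p_uv = Fraction(p_uv)
    return beta / p_uv if has_edge else (1 - beta) / (1 - p_uv)

def likelihood_ratio(pairs, beta=BETA):
    """pairs: iterable of (p_uv, has_edge) over all vertex pairs of the subgraph."""
    score = Fraction(1)
    for p_uv, has_edge in pairs:
        score *= pair_factor(p_uv, has_edge, beta)
    return score

for p in ("0.2", "0.6"):
    print(p, pair_factor(p, True), pair_factor(p, False))
```

An observed edge that the random model considers unlikely (small p(u,v)) earns the largest bonus, which is the "weakest link" emphasis.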
Dense Subgraph: Likelihood (Cont.)
Likelihood Formula 0.1: given an induced subgraph,
◦ L(C) = |Ec| / { ½ * |Vc| * ( |Vc| - 1 ) }
Likelihood Formula 0.2: given an induced subgraph,
Likelihood Formula 0.3: given an induced subgraph,
The Main Idea Revisited
How do you recognize protein complexes?
◦ Dense Subgraphs
   We have some revised formulas for density in a PPI network.
◦ Comparative Analysis
Comparative Analysis
Idea: if some structure occurs in different species, it is highly likely to be a meaningful structure.
How do you define dense substructures on alignment graphs?
Comparative Analysis (Cont.)
Consider two subsets U1 = {u1, ..., uk} and V2 = {v1, ..., vk}, and let Θ: U1 → V2 be a many-to-many correspondence.
Since you already have
You may derive the formula 1.1 as follows:
Does it make sense?
Comparative Analysis (Cont.)
Θ is useful information:
You have the formula 1.2:
{ A/(A+B) }/ {X/(X+Y)}
The Main Idea Revisited
How do you recognize protein complexes?
◦ Dense Subgraphs
   We have some revised formulas for density in a PPI network.
◦ Comparative Analysis
   We have some revised formulas for density in an alignment network.
Search the Complexes
Now we only need to find heavy subgraphs in the alignment graph.
The problem is NP-hard.
Search the Complexes (Cont.)
[Seed] Compute a seed around each node v.
[Restrict the Size] Keep seeds small!
[Refined Seed] Enumerate all subsets of the seed that have size 3 and contain v.
[Local Search] Iteratively modify the refined seed.
[Output Heavy Subgraphs] For each node, we record at most k heaviest subgraphs.
[Filtering overlapping ones] A greedy method is used!
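The search steps can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes a seed is a node plus a capped number of its neighbours, the weight of a subgraph is the sum of pair weights (with a fixed penalty for pairs that have no entry), and overlap is measured relative to the smaller set. All function names and default parameters are hypothetical.

```python
from itertools import combinations

def subgraph_weight(nodes, weight, missing=-1.0):
    """Sum of pair weights inside `nodes`; pairs without an entry score `missing`."""
    return sum(weight.get(frozenset(pair), missing) for pair in combinations(nodes, 2))

def local_search(v, start, adj, weight, max_size):
    """Add or remove one node at a time while the weight strictly improves."""
    current = set(start)
    while True:
        moves = [current | {u} for u in adj if u not in current and len(current) < max_size]
        moves += [current - {u} for u in current if u != v and len(current) > 3]
        best = max(moves, key=lambda s: subgraph_weight(s, weight), default=current)
        if subgraph_weight(best, weight) <= subgraph_weight(current, weight):
            return current
        current = best

def heavy_subgraphs(adj, weight, max_seed=6, max_size=8, keep=2):
    results = {}
    for v in adj:
        seed = sorted(adj[v])[: max_seed - 1]        # [Seed] and [Restrict the Size]
        found = set()
        for pair in combinations(seed, 2):           # [Refined Seed]: size 3, contains v
            final = local_search(v, {v, *pair}, adj, weight, max_size)
            found.add((subgraph_weight(final, weight), frozenset(final)))
        # [Output Heavy Subgraphs]: at most `keep` heaviest per node
        results[v] = sorted(found, key=lambda t: -t[0])[:keep]
    return results

def filter_overlaps(scored, max_overlap=0.8):
    """[Filtering overlapping ones]: greedy pass from heaviest to lightest."""
    kept = []
    for w, s in sorted(scored, key=lambda t: -t[0]):
        if all(len(s & k) / min(len(s), len(k)) <= max_overlap for _, k in kept):
            kept.append((w, s))
    return kept

# Toy graph: a 4-clique a-b-c-d plus a separate edge e-f.
clique = "abcd"
adj = {u: {x for x in clique if x != u} for u in clique}
adj["e"], adj["f"] = {"f"}, {"e"}
weight = {frozenset(p): 1.0 for p in combinations(clique, 2)}
weight[frozenset("ef")] = 1.0
per_node = heavy_subgraphs(adj, weight)
print(filter_overlaps([item for items in per_node.values() for item in items]))
```

Every refined seed inside the clique grows to the full clique, and the greedy filter then keeps a single copy of it.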
The Main Idea Revisited
How do you recognize protein complexes?
◦ Dense Subgraphs
   We have some revised formulas for density in a PPI network.
◦ Comparative Analysis
   We have some revised formulas for density in an alignment network.
Finally, we have a practical method to search for complexes!
PATH QUERIES
Path Queries
Problem definition
Input
◦ A target network represented as an undirected weighted graph G(V, E), with a weight function on the edges w: E → R
◦ A path query Q = (q1, …, qk)
◦ A scoring function of node similarity H: Q × V → R
Output: a set of best matching pathways P = (p1, …, pl) in G, where a good match is measured in two respects:
1. The matched nodes are similar by the scoring function H.
2. The reliability of edges in the matched pathway is high.
Algorithm
1. Introduce a mapping M from Q to P∪{0} where deleted query nodes are mapped to 0 by M.
2. Path scoring: interaction score and sequence score
\[ \sum_{i=1}^{l-1} w(p_i, p_{i+1}) \;+\; \sum_{\substack{i=1 \\ M(q_i) \neq 0}}^{k} H(q_i, M(q_i)) \]
Interaction score
◦ Edge weights represent the logarithm of the reliability of the interaction between two proteins.
Sequence score
◦ BLAST E-value for the two proteins, normalized by the maximal E-value over all pairs of proteins from the two networks.
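A small sketch of the path score as the sum of the two parts just described, assuming w and H are supplied as lookup tables. The function name `match_score` and the example values are made up for illustration.

```python
def match_score(path, mapping, w, H):
    """Score of matching a query to `path` under `mapping`.

    path:    matched vertices (p1, ..., pl) in the target network
    mapping: dict query node -> matched vertex, or 0 for a deleted query node
    w:       dict frozenset({u, v}) -> interaction score of edge (u, v)
    H:       dict (query node, vertex) -> sequence score
    """
    interaction = sum(w[frozenset((a, b))] for a, b in zip(path, path[1:]))
    sequence = sum(H[(q, v)] for q, v in mapping.items() if v != 0)
    return interaction + sequence

# Hypothetical query (q1, q2, q3) in which q2 is deleted.
path = ["x", "y"]
mapping = {"q1": "x", "q2": 0, "q3": "y"}
w = {frozenset(("x", "y")): -0.5}
H = {("q1", "x"): 1.0, ("q3", "y"): 0.75}
print(match_score(path, mapping, w, H))   # 1.25
```

Deleted query nodes contribute nothing to the sequence score, which is why q2 needs no entry in H.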
Algorithm
Avoiding cycles
◦ N. Alon, R. Yuster, and U. Zwick: Color-coding. J. ACM, 1995.
Finding the best matching paths:
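\[
W(i, j, S, N_{del}) = \max_{m \in V}
\begin{cases}
W(i-1, m, S - c(j), N_{del}) + w(m, j) + H(q_i, j), & (m, j) \in E \\
W(i, m, S - c(j), N_{del}) + w(m, j), & (m, j) \in E \\
W(i-1, j, S, N_{del} - 1)
\end{cases}
\]
Here W(i, j, S, N_del) is the best score of aligning the first i query nodes to a path that ends at vertex j, uses the color set S, and deletes at most N_del query nodes; c(j) is the color of vertex j.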
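The cycle-avoidance idea can be shown on a simpler problem than the full query recurrence: finding the heaviest simple path on k vertices. The sketch below is generic color-coding with a dynamic program over (end vertex, color set). It leaves out the query nodes and the deletion budget, and its names and toy graph are illustrative.

```python
import random

def best_colorful_path(adj, w, k, colors):
    """Max total edge weight of a path on k vertices whose colors are all distinct.

    best[(j, S)] is the best weight of a path ending at vertex j using color set S.
    """
    best = {(j, frozenset([colors[j]])): 0.0 for j in adj}
    for _ in range(k - 1):
        nxt = {}
        for (m, S), score in best.items():
            for j in adj[m]:
                if colors[j] in S:
                    continue          # distinct colors guarantee distinct vertices
                key = (j, S | {colors[j]})
                cand = score + w[frozenset((m, j))]
                if cand > nxt.get(key, float("-inf")):
                    nxt[key] = cand
        best = nxt
    return max(best.values(), default=float("-inf"))

def color_coding(adj, w, k, trials=200, seed=0):
    """Repeat random k-colorings; a fixed optimal path is colorful with probability k!/k^k."""
    rng = random.Random(seed)
    result = float("-inf")
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        result = max(result, best_colorful_path(adj, w, k, colors))
    return result

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
w = {frozenset("ab"): 1, frozenset("bc"): 2, frozenset("cd"): 3, frozenset("ac"): 5}
print(color_coding(adj, w, 3))   # heaviest 3-vertex path is a-c-d
```

Tracking the set of used colors instead of the set of visited vertices keeps the table size at |V|·2^k rather than exponential in |V|.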
Dataset and Results
Yeast and fly PPI networks
◦ The yeast (S. cerevisiae) PPI network contains 4,726 proteins and 15,166 known interactions between them.
◦ The fly (D. melanogaster) PPI network contains 7,028 proteins and 22,837 interactions.
271 pathways were discovered in the yeast PPI network that scored better than 99% of randomly chosen pathways; these were then used as queries against the fly PPI network.
Results
APPLICATION OF PPI NETWORK ALIGNMENT: ORTHOLOGY MAPPING
S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006
Introduction
Annotating protein function across species is often complicated by the presence of paralogous proteins
Most methods for dealing with this problem are sequence-based: protein sequences from different species are compared to find a group of proteins that have the same functional annotation
A protein and its functional ortholog are likely to interact with proteins in their respective networks that are themselves functional orthologs
This work introduces a strategy for identifying functionally related proteins that supplements sequence-based comparisons with information on conserved protein-protein interactions
Introduction (cont’d)
[Figure labels: a, b, a’, b’]
Functional orthology
When the protein in question is similar to not one but many paralogous proteins, it is harder to distinguish which of these is the true ortholog, the protein that is directly inherited from a common ancestor
Definite functional orthologs are defined as proteins that are functionally equivalent as a result of direct ancestry
Model review
The protein interaction networks of two species are aligned by assigning proteins to sequence homology groups using the Inparanoid algorithm
Networks are aligned into a merged graph representation
Probabilistic inference is performed on the aligned networks to identify pairs of proteins, one from each species, that are likely to retain the same function based on conservation of their interacting partners
A logistic function is used to compute the probability of functional orthology for a protein pair i given the states of functional orthology for its network neighbors
This probability is updated for each pair over successive iterations of Gibbs sampling
Model review (cont’d)
Conservation index
Consider an alignment graph G
◦Nodes represent sequence-similar protein pairs
◦Edges link nodes (a, b) and (a’, b’) if one of the pairs (a, a’) or (b, b’) interacts directly and the other interacts via a common neighbor that is directly connected to both proteins
◦An edge is strongly conserved if its endpoints are true functional orthologs
Conservation index (cont’d)
\[ c(i) = \frac{2\, d(i)}{d(a) + d(b)} \]
c(i): the conservation index of a node i
d(i): the number of strongly conserved links involving node i
d(a): the degree of protein a in its network
d(b): the degree of protein b in its network
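A minimal sketch of the conservation index, assuming it is twice the number of strongly conserved links divided by the summed degrees of the two proteins, as the definitions above suggest. The function name is illustrative.

```python
def conservation_index(conserved_links: int, degree_a: int, degree_b: int) -> float:
    """c(i) = 2 * d(i) / (d(a) + d(b)) for the node i = (a, b)."""
    total = degree_a + degree_b
    return 2 * conserved_links / total if total else 0.0

# Hypothetical node: 3 strongly conserved links, protein degrees 4 and 8.
print(conservation_index(3, 4, 8))   # 0.5
```

The index reaches 1 only when every interaction of both proteins is matched by a strongly conserved link.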
Probabilistic model
The probability of functional orthology for a pair of proteins is influenced by the probabilities of functional orthology for their network neighbors, which in turn depend on their network neighbors, and so on
This type of probabilistic model is known as a Markov random field
Probabilistic model (cont’d)
Positive training examples: the definite functional orthologs having at least one conserved interaction
Negative training examples: pairs in which a protein is matched with its best BLAST E-value hit that is not in the same cluster according to the Inparanoid algorithm
\[ P(z_i \mid Z_{N(i)}) = \frac{1}{1 + \exp\{-(\alpha + \beta\, c(i))\}} \]
z_i: the state of node i
N(i): the set of neighbors of node i
Z_{N(i)}: the set of all z_j such that j ∈ N(i)
Parameters α and β are optimized by maximizing the product of P(z_i | Z_{N(i)}) over all positive training examples and (1 − P(z_i | Z_{N(i)})) over all negative training examples
Orthology inference
The above model was used to estimate the final posterior probabilities P(zi) using Gibbs sampling
Nodes representing ambiguous functional orthologs are each assigned a temporary state z=0 or z=1, initially at random
At each iteration, a node i is sampled (with replacement) and its value zi is updated given the states of its neighbors, ZN(i). The new value of zi is set to 1 with probability P(zi|ZN(i)) and to 0 otherwise
Over all iterations, the nodes designated as definite functional orthologs and non-orthologs are forced to states of 1 and 0, respectively
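The sampling loop can be sketched as below. This is a rough illustration, assuming a logistic link of the form 1/(1+exp(-(α+β·c(i)))) with parameters that are already fitted, and counting a link as strongly conserved when the neighbouring node is currently in state 1. The function name, the burn-in length and the toy parameter values are all assumptions.

```python
import math
import random

def gibbs_orthology(neighbors, degree_sum, fixed, alpha, beta,
                    sweeps=2000, burn_in=500, seed=0):
    """Estimate P(z_i = 1) for every ambiguous node of the alignment graph.

    neighbors:  dict node -> set of neighbouring nodes
    degree_sum: dict node -> d(a) + d(b) for the protein pair behind the node
    fixed:      dict node -> 0 or 1 for definite non-orthologs / orthologs
    """
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    free = [i for i in nodes if i not in fixed]
    z = {i: fixed.get(i, rng.randint(0, 1)) for i in nodes}   # random initial states
    ones = {i: 0 for i in free}
    for sweep in range(sweeps):
        i = rng.choice(free)                       # sample a node with replacement
        d_i = sum(z[j] for j in neighbors[i])      # neighbours currently in state 1
        c_i = 2 * d_i / degree_sum[i]              # conservation index
        p = 1 / (1 + math.exp(-(alpha + beta * c_i)))
        z[i] = 1 if rng.random() < p else 0
        if sweep >= burn_in:                       # accumulate after burn-in
            for n in free:
                ones[n] += z[n]
    return {i: ones[i] / (sweeps - burn_in) for i in free}

# Toy graph: x is linked to two definite orthologs, y to a definite non-ortholog.
neighbors = {"x": {"o1", "o2"}, "y": {"n1"}, "o1": {"x"}, "o2": {"x"}, "n1": {"y"}}
probs = gibbs_orthology(neighbors, {"x": 4, "y": 2}, {"o1": 1, "o2": 1, "n1": 0},
                        alpha=-3.0, beta=6.0)
print(probs)
```

Fixed nodes never appear in the list of sampled nodes, which is how the definite orthologs and non-orthologs stay clamped to 1 and 0.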
Experimental results
A total of 2244 clusters were generated by the Inparanoid algorithm, covering 2834 proteins in yeast and 3881 proteins in fly
Of these, 1552 clusters contained only a single yeast and fly protein pair and were assumed to represent definite functional orthologs
The above method was applied to resolve the remaining 692 clusters, which were assumed to represent ambiguous functional orthologs; 121 of them contained at least one protein pair with conserved interactions between the networks
In 60 of these, the highest probability was assigned to the protein pair that was also the most sequence-similar via BLAST
Experimental results (cont’d)
Conclusion
These findings confirm that yeast/fly proteins classified as definite functional orthologs are more likely to have equivalent functional roles in the protein network
The conserved network context could be used to help discriminate functional orthology from general sequence similarity
MULTIPLE NETWORK ALIGNMENT
R. Sharan, S. Suthram, R.M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R.M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species. PNAS, 102(6):1974–1979, 2005
The alignment graph
Each node in this graph consists of a group of sequence-similar proteins, one from each species
Each link between a pair of nodes in the alignment graph represents conserved protein interactions between the corresponding protein groups
A search over the alignment graph is performed to identify:
1. Short linear paths of interacting proteins, which model signal transduction pathways
2. Dense clusters of interactions, which model protein complexes
The alignment graph (cont’d)
Experimental results
They applied the multiple network alignment framework to three PPI networks:
◦ Yeast: 14319 interactions among 4389 proteins
◦ Worm: 3926 interactions among 2718 proteins
◦ Fly: 20720 interactions among 7038 proteins
It identified 183 protein clusters and 240 paths conserved at a significance level of P < 0.01; groups of conserved clusters overlap to define 71 distinct network regions
Experimental results (cont’d)
Experimental results (cont’d)
Prediction of protein function
Whenever the set of proteins in a conserved cluster or path (over all species) was significantly enriched for a particular GO annotation and at least half of the proteins in the cluster or path had that annotation, all remaining proteins in the sub-network were predicted to have that annotation
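The transfer rule can be sketched as follows, assuming the enrichment test has already been run elsewhere and its result is passed in as a set of enriched terms. The function name and the GO identifiers in the example are placeholders.

```python
from collections import Counter

def predict_annotations(cluster, annotations, enriched):
    """Transfer an annotation to the members of a conserved cluster that lack it.

    cluster:     iterable of protein names
    annotations: dict protein -> set of GO terms (missing key = no annotation)
    enriched:    set of GO terms already judged significantly enriched in the cluster
    """
    cluster = list(cluster)
    counts = Counter(t for p in cluster for t in annotations.get(p, set()))
    predictions = {}
    for term in enriched:
        if 2 * counts[term] >= len(cluster):       # at least half carry the term
            for p in cluster:
                if term not in annotations.get(p, set()):
                    predictions.setdefault(p, set()).add(term)
    return predictions

annotations = {"A": {"GO:1"}, "B": {"GO:1"}, "C": {"GO:2"}}
print(predict_annotations(["A", "B", "C", "D"], annotations, {"GO:1", "GO:2"}))
```

In the example only GO:1 passes the "at least half" condition, so it is predicted for C and D while GO:2 is not transferred.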
Fast and accurate alignment of multiple PPI networks
By Maxim Kalaev, Vineet Bafna, and Roded Sharan, 2007
Drawback of the alignment graph: exponential growth of the graph with the number of species
They introduced a new algorithm that avoids the explicit representation of every set of potentially orthologous proteins, thereby reducing time and memory requirements
The layered alignment graph (1/3)
Given k PPI networks (one for each of k species)
A layered alignment graph: each layer corresponds to a species and contains the corresponding network. Additional edges connect proteins from different layers if they are sequence-similar
A k-spine: a sub-graph of size k which includes a vertex from each of the layers. A k-spine corresponds to a set of truly orthologous proteins
A collection of connected k-spines induces a candidate conserved sub-network
The layered alignment graph (2/3)
[Figure: layers for Species 1, Species 2, …, Species k containing the sets U1, U2, U3, …, Uk; the k-spine U[3] is marked; legend: inter-layer edge, PPI edge]
The layered alignment graph (3/3)
Consider every k-spine to be a node in a graph
An m-subnet: a collection U of k multi-sets Ui = {ui[1], …, ui[m]}
◦ For all 1 ≦ i ≦ k and 1 ≦ j ≦ m, ui[j] belongs to Vi
◦ For all 1 ≦ j ≦ m, the set U[j] = {u1[j], u2[j], …, uk[j]} is a k-spine
The task is to look for high-scoring m-subnets, for a fixed m
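The two conditions of an m-subnet can be checked mechanically. The sketch below assumes layers are numbered 0..k-1 and that a k-spine must be connected through inter-layer (sequence-similarity) edges. The connectivity requirement is an assumption of this sketch, and the function and protein names are illustrative.

```python
def is_k_spine(spine, layer_of, similar, k):
    """True if `spine` has one protein per layer and is connected by inter-layer edges."""
    if len(spine) != k or {layer_of[u] for u in spine} != set(range(k)):
        return False
    # Connectivity over sequence-similarity edges (an assumption of this sketch).
    seen, stack = {spine[0]}, [spine[0]]
    while stack:
        u = stack.pop()
        for v in spine:
            if v not in seen and frozenset((u, v)) in similar:
                seen.add(v)
                stack.append(v)
    return len(seen) == k

def is_m_subnet(U, layer_of, similar):
    """U[i][j] plays the role of u_i[j]: row i is the multi-set U_i from layer i."""
    k, m = len(U), len(U[0])
    if any(len(row) != m for row in U):
        return False
    if any(layer_of[u] != i for i, row in enumerate(U) for u in row):
        return False
    return all(is_k_spine([U[i][j] for i in range(k)], layer_of, similar, k)
               for j in range(m))

# Three species (layers 0..2), two proteins each.
layer_of = {"y1": 0, "y2": 0, "w1": 1, "w2": 1, "f1": 2, "f2": 2}
similar = {frozenset(e) for e in [("y1", "w1"), ("w1", "f1"), ("y2", "w2"), ("y2", "f2")]}
print(is_m_subnet([["y1", "y2"], ["w1", "w2"], ["f1", "f2"]], layer_of, similar))
```

Swapping w1 and w2 in the second row breaks both columns, because y1, w2 and f1 are no longer joined by similarity edges.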