Gene Expression Analysis and Modeling

61
Gene Expression Analysis and Modeling Guillaume Bourque Centre de Recherches Mathématiques Université de Montréal August 2003

description

Gene Expression Analysis and Modeling. Guillaume Bourque Centre de Recherches Mathématiques Université de Montréal August 2003. http://www.sri.com/pharmdisc/cancer_biology/laderoute.html. DNA Microarrays. Experiment design Noise reduction Normalization … Data analysis. Outline. - PowerPoint PPT Presentation

Transcript of Gene Expression Analysis and Modeling

Page 1: Gene Expression Analysis and Modeling

Gene Expression Analysis and Modeling

Guillaume Bourque

Centre de Recherches MathématiquesUniversité de Montréal

August 2003

Page 2: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 2/61

DNA Microarrays

http://www.sri.com/pharmdisc/cancer_biology/laderoute.html

• Experiment design

• Noise reduction

• Normalization

• …

• Data analysis

Page 3: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 3/61

Outline

• Microarray data analysis techniques– Clustering: hierarchical and k-means– SVD and PCA– SVM – Support Vector Machines

• Gene network modeling– Boolean networks– Bayesian models– Differential equations

Page 4: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 4/61

Outline

• Microarray data analysis techniques– Clustering: hierarchical and k-means– SVD and PCA– SVM – Support Vector Machines

• Gene network modeling– Boolean networks– Bayesian models– Differential equations

Page 5: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 5/61

Gene Expression Data

Page 6: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 6/61

Gene Expression Matrix

Given an experiment with m genes and n assays we produce a matrix X where:

xij = expression level of the ith gene in the jth assay.

gi = Transcriptional

response of the ith gene

aj = Expression profile of the jth assay

Page 7: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 7/61

Goals of Clustering

• Clustering genes:– Classify genes by their transcriptional response and get an

idea of how groups of genes are regulated.

– Potentially infer gene functions of unknown genes.

• Clustering assays:– Classify diseased versus normal samples by their

expression profile.

– Track the expression levels at different stages in the cell.

– Study the impact of external stimuli.

Page 8: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 8/61

Clustering Genes

X

n assays

m genes

m genes

m genes

similarity matrix

clustergenes basedon similarity

Page 9: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 9/61

Clustering Steps

• Choose a similarity metric to compare the transcriptional response or the expression profiles:– Pearson Correlation

– Spearman Correlation

– Euclidean Distance

– …

• Choose a clustering algorithm:– Hierarchical

– K-means

– …

Page 10: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 10/61

Similarity Metric

• Choice of the best metric depends on the normalization procedure.

• Must be cautious of potential pitfalls.• Correlations:

Correlation coefficients are values from –1 to 1, with 1 indicating a similar behavior, –1 indicating an opposite behavior and 0 indicating no direct relation.

• Euclidean distance:

Page 11: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 11/61

Hierarchical Clustering

g1 g2 g3 g4 g5

g1 0.23 0.00 0.95 -0.63

g2 0.91 0.56 0.56

g3 0.32 0.77

g4 -0.36

g5

g1 g4

g1 g2 g3 g4 g5

g1 0.23 0.00 0.95 -0.63

g2 0.91 0.56 0.56

g3 0.32 0.77

g4 -0.36

g5

• Find largest value is similarity matrix.

• Join clusters together.

• Recompute matrix and iterate.

Page 12: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 12/61

Hierarchical Clustering

g1 , g4 g2 g3 g5

g1 , g4 0.37 0.16 -0.52

g2 0.91 0.56

g3 0.77

g5

g1 g4 g2 g3

g1 , g4 g2 g3 g5

g1 , g4 0.37 0.16 -0.52

g2 0.91 0.56

g3 0.77

g5

• Find largest value is similarity matrix.

• Join clusters together.

• Recompute matrix and iterate.

Page 13: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 13/61

Hierarchical Clustering

g1 , g4 g2 , g3 g5

g1 , g4 0.27 -0.52

g2 , g3 0.68

g5

g1 g4 g2 g3g5

g1 , g4 g2 , g3 g5

g1 , g4 0.27 -0.52

g2 , g3 0.68

g5

• Find largest value is similarity matrix.

• Join clusters together.

• Recompute similarity matrix and iterate.

Page 14: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 14/61

Cluster JoiningOne of the issue with hierarchical clustering is how to recompute the similarity matrix after joining clusters. Here are 3 common solutions that define different types of hierarchical clustering:

• Single-link: minimum distance between any member of one cluster to any member of the other cluster.

• Complete-link: maximum distance between any member of one cluster to any member of the other cluster.

• Average-link: average distance between any member of one cluster to any member of the other cluster.

Page 15: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 15/61

Interpreting the Results

g1 g4 g2 g3g5

2 clusters ?

3 clusters ?

Page 16: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 16/61

Clustering Example

Eisen et al. (1998), PNAS, 95(25): 14863-14868

Page 17: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 17/61

K-means Clustering

• Expression profiles are displayed in n dimensional space.

• First cluster center is picked at random between all data points.

• Other cluster centers are picked as far as possible from previous clusters centers.

k = 3

Page 18: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 18/61

K-means Clustering

• Associate each data point to the closest cluster center.

• Recompute cluster centers based on new clusters.

k = 3

Page 19: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 19/61

K-means Clustering

• Associate each data point to the closest cluster center.

• Recompute cluster centers based on new clusters.

• Iterate until the clusters remain unchanged.

k = 3

Page 20: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 20/61

K-means Clustering

• Associate each data point to the closest cluster center.

• Recompute cluster centers based on new clusters.

• Iterate until the clusters remain unchanged.

k = 3

Page 21: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 21/61

K-means Clustering

• Associate each data point to the closest cluster center.

• Recompute cluster centers based on new clusters.

• Iterate until the clusters remain unchanged.

k = 3

Page 22: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 22/61

Singular Value Decomposition

Xm x n = Um x n Sn x n V T n x n (n m)

=

uk = eigenassay

sk = singular

value

vk = eigengene

geneexpression

matrix

Page 23: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 23/61

Singular Value Matrix

Sn x n = =

Singular values are organized from largest to smallest:s1 s2 … sk … sn .

Page 24: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 24/61

Why SVD?

• SVD extracts from the gene expression matrix:• n eigenassays

• m eigengenes

• n singular values

• We can represent the transcriptional response of each gene as a linear combination of the eigengenes.

• We can represent the expression profile of each assay as a linear combination of the eigenassays.

• Allows for dimensionality reduction and for the identification of important components.

Page 25: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 25/61

SVD and PCA• There is a direct correspondence between SVD and PCA

(Principal Component Analysis) when calculated on covariance matrices.

• If we normalize X so that it’s columns have a 0 mean. We get that the eigengenes are the principal components of the transcriptional responses.

• If we normalize X so that it’s rows have a 0 mean. We get that the eigenassays are the principal components of the expression profiles.

• In both cases, we get that the square of the singular values are proportional to the variance of the principal components.

Page 26: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 26/61

SVD Special Property

X =

U S V T

X(r) =

U S(r)

00

0

V T

where r is the numberof non-null rows

X(r) is the closestrank r

approximationof X

Page 27: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 27/61

Applications of SVD

• Detects redundancies and allows for the representation of the data with the minimal set of essential features (components). These features can themselves represent signals (e.g. cell-cyle).

• Data visualization. SVD can identify subspaces that capture most of the variance in the data which allows for the visualization of high-dimensional data in 1, 2 or 3-dimensional subspace.

• Signal extraction in noisy data.

Page 28: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 28/61

Essential Features

Alter et al. (2000), PNAS, 97(18): 10101-10106

Page 29: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 29/61

Data Visualization

Yeung and Ruzzo. (2001), Bioinformatics, 17(9): 763-774

Page 30: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 30/61

Support Vector Machines (SVM)

• Instead of trying to identify clusters directly in the data, we assume the genes are already pre-clustered into different classes. The goal is the find a model that best predicts these classes.

• We need to find the hyperplane that best divide the data points.

• We must do so while minimize the error rates of the predictions.

Page 31: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 31/61

References• Clustering

– deRisi et al. (1997), Science, 278(5338): 680-686.

– Eisen et al. (1998), PNAS, 95(25): 14863-14868.

• SVD and PCA– Alter et al. (2000), PNAS, 97(18): 10101-10106.

– Holter et al. (2000), PNAS, 97(15): 8409-8414.

– Yeung and Ruzzo. (2001), Bioinformatics, 17(9): 763-774.

– Wall et al. (2003), A Pratical Approach to Microarray Data Analysis, Chapter 5.

• SVM– Brown et al. (2000), PNAS, 97(1), 262-267.

Page 32: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 32/61

Outline

• Microarray data analysis techniques– Clustering: hierarchical and k-means– SVD and PCA– SVM – Support Vector Machines

• Gene network modeling– Boolean networks– Bayesian models– Differential equations

Page 33: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 33/61

Problem

Time series

x2

x1

x4

x3

_

_

+

+ _

_

+

_

?

Gene network

Page 34: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 34/61

Boolean Networks

• Genes are assumed to be ON or OFF.

• At any given time, combining the gene states gives a gene activity pattern (GAP).

• Given a GAP at time t, a deterministic function (a set of logical rules) provides the GAP at time t +1.

• GAPs can be classified into attractor and transient states.

Page 35: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 35/61

Boolean Network Example

x2

x1

x1 x3

x2 x3

or nor nand

t

t+1

t 0 1 2 3 4

x11 1 0 1 1

x21 0 0 0 0

x31 0 1 1 0

transient attractors

t 0 1 2 3 4

x11

x21

x31

Page 36: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 36/61

State Space

Picture generated using the program DDLab.Wuensche,A., (1998), Proceedings of Complex Systems '98 .

Page 37: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 37/61

Boolean Network Example

I. Shmulevich et al., Bioinformatics (2002), 18 (2): 261-274

AND

NOT

NAND

Page 38: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 38/61

Issues with Boolean Networks

• Gene trajectories are continuous and modeling them as ON/OFF might be inadequate.

• A deterministic set of logical rules forces a very stringent model. – It doesn’t allow for external input.– Very susceptible to noise.

• Probability Boolean Networks aims at fixing some of these issues by combining multiple sets of rules (related to Bayesian Networks).

Page 39: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 39/61

Threshold(s)

ON

OFF

Page 40: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 40/61

Bayesian Networks

• A gene regulatory network is represented by directed acyclic graph:– Vertices correspond to genes.

– Edges correspond to direct influence or interaction.

• For each gene xi, a conditional distribution

p(xi | ancestors(xi) ) is defined.

• The graph and the conditional distributions, uniquely specify the joint probability distribution.

Page 41: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 41/61

Bayesian Network Example

x3x4

x5

x1 x2

Conditional distributions:p(x1), p(x2), p(x3| x2),

p(x4| x1,x2), p(x5| x4)

p(X) = p(x1) p(x2| x1) p(x3| x1,x2) p(x4| x1,x2, x3) p(x5| x1,x2, x3,x4)

p(X) = p(x1) p(x2) p(x3| x2) p(x4| x1,x2) p(x5| x4)

Page 42: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 42/61

Learning Bayesian Models

• Using gene expression data, the goal is to find the

bayesian network that best matches the data.

• Recovering optimal conditional probability

distributions when the graph is known is “easy”.

• Recovering the structure of the graph is NP-hard.

• But, good statistics are available:– What is the likelihood of a specific assignment?– What is the distribution of xi given xj?– …

Page 43: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 43/61

Issues with Bayesian Models

• Computationally intensive.

• Requires lots of data.

• Does not allow for feedback loops which are known

to play an important role in biological networks.

• Does not make use of the temporal aspect of the data.

• Dynamical Bayesian Networks aim at solving some

of these issues but they require even more data.

Page 44: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 44/61

Differential Equations

• Typically uses linear differential equations to model the gene trajectories:dxi(t) / dt = a0 + ai,1 x1(t)+ ai,2 x2(t)+ … + ai,n xn(t)

• Several reasons for that choice:– lower number of parameters implies that we are

less likely to over fit the data

– sufficient to model complex interactions between the genes

Page 45: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 45/61

Small Network Example

dx1(t) / dt = 0.491 - 0.248 x1(t)

dx2(t) / dt = -0.473 x3(t) + 0.374 x4(t)

dx3(t) / dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)

dx4(t) / dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)

x2

x1

x4

x3

_

_

+

+ _

_

+

_

Page 46: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 46/61

Small Network Example

dx1(t) / dt = 0.491 - 0.248 x1(t)

dx2(t) / dt = -0.473 x3(t) + 0.374 x4(t)

dx3(t) / dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)

dx4(t) / dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)

x2

x1

x4

x3

_

_

+

+ _

_

+

_

one interactioncoefficient

Page 47: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 47/61

Small Network Example

dx1(t) / dt = 0.491 - 0.248 x1(t)

dx2(t) / dt = -0.473 x3(t) + 0.374 x4(t)

dx3(t) / dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)

dx4(t) / dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)

x2

x1

x4

x3

_

_

+

+ _

_

+

_

constantcoefficients

Page 48: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 48/61

Problem Revisited

a0,i a1,i a2,i a3,i a4,i

x1 .431 -.248 0 0 0

x2 0 0 0 -.473 .374

x3 -.427 .376 0 -.241 0

x4 0 .435 0 -.315 -.437

Given the time-series data, can we find the interactions coefficients?

Page 49: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 49/61

Issues with Differential Equations• Even under the simplest linear model, there are

m(m+1) unknown parameters to estimate:• m(m-1) directional effects

• m self effects

• m constant effects

• Number of data points is mn and we typically have that n << m (few time-points).

• To avoid over fitting, extra constraints must be incorporated into the model such as:

• Smoothness of the equations

• Sparseness of the network (few non-null interaction coefficients)

Page 50: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 50/61

Algorithm for Network Inference

• To recover the interaction coefficients, we use stepwise multiple linear regression.

• Why?– This procedure only finds coefficient that

significantly improve the fit in the regression. Hence it limits the number of non-zero coefficients (i.e. it finds sparse networks) a feature we were seeking.

– It is highly flexible and provides p-value scores which can be interpreted easily.

Page 51: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 51/61

Partial F Test• The procedure finds the interaction coefficients

iteratively for each gene xi.• A partial F test is constructed to compare the total

square error of the predicted gene trajectory with a specific subset of coefficients being added or removed.

• If the p-value obtained from the test exceeds a certain cutoff, the subset of coefficients is significant and will be added or removed.

• The procedures iterates until no more subsets of coefficients are either added or removed.

Page 52: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 52/61

Simulations

• Difficult to find coefficients that will produce

realistic gene trajectories.

• We select coefficients such that the resulting

trajectories satisfy 3 conditions:• They are bounded

• The correlation of any pair is not too high

• They are not too stable

• We add gaussian noise to model errors.

Page 53: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 53/61

Noise

Page 54: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 54/61

Network Inferencea0,i a1,i a2,i a3,i a4,i

x1 .431 -.248 0 0 0

x2 0 0 0 -.473 .374

x3 -.427 .376 0 -.241 0

x4 0 .435 0 -.315 -.437

Procedure recovers perfectly this network with 4 genes and 10 interactions coefficients.

x2

x1

x4

x3

_

_

+

+ _

_

+_

Page 55: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 55/61

10 Genes

Procedure also recovers perfectly this network with 10 genes and 22 interactions coefficients.

Page 56: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 56/61

Multiple Networks

Page 57: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 57/61

Multiple Network Problem

• Multiple networks related by a graph or a tree can arise from various situations:– Different species

– Different developments stages

– Different tissues

• The goal is now not only to maximize the fit (with as few interactions as possible) but also to minimize an evolutionary score on the graph of the network.

Page 58: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 58/61

Multiple Network Inference

• The stepwise regression algorithm is modified to act directly on the edges of the graph and to take into account the evolutionary score.

• The inference is done concurrently in all the networks.

• Results: the comparative framework actually simplifies the inference process especially when more genes or noise are involved.

Page 59: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 59/61

Simulation Tests

Page 60: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 60/61

Simple?

Page 61: Gene Expression Analysis and Modeling

Guillaume Bourque, CRM Summer School 61/61

References• Boolean Networks

– Kauffman (1993), The Origins of Order

– Lian et al. (1998), PSB, 3: 18-29.

• Bayesian Networks– Friedman et al. (2000), RECOMB 2000.

– Hartemink et al. (2001), PSB, 6: 422-433.

• Differential Equations– Chen et al. (1999), PSB, 4: 29-40.

– D’haeseleer et al. (1999), PSB, 4: 41-52.

– Yeung et al. (2002), PNAS, 99(9): 6163-6168.

• Literature Review– De Jong (2002). JCB, 9(1): 67-103.