Introduction to Bioinformatics, Spring 2010: Lectures 3-4
![Page 1: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/1.jpg)
Gene Expression - Microarrays
Misha Kapushesky, European Bioinformatics Institute, EMBL
St. Petersburg, Russia, May 2010
![Page 2: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/2.jpg)
![Page 3: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/3.jpg)
![Page 4: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/4.jpg)
![Page 5: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/5.jpg)
![Page 6: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/6.jpg)
Compare gene expression in this cell type…
…after drug treatment
…at a later developmental time
…in a different body region
…after viral infection
…in samples from patients
…relative to a knockout
![Page 7: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/7.jpg)
Gene expression is context-dependent, and is regulated in several basic ways:
• by region (e.g. brain versus kidney)
• in development (e.g. fetal versus adult tissue)
• in dynamic response to environmental signals (e.g. immediate-early response genes)
• in disease states
• by gene activity
Page 297
![Page 8: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/8.jpg)
Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing
• normalization
• scatter plots
Inferential statistics
• t-test
• ANOVA
Exploratory (descriptive) statistics
• distances
• clustering
• principal components analysis (PCA)
![Page 9: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/9.jpg)
Microarrays: tools for gene expression
A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array.
Page 312
![Page 10: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/10.jpg)
Microarrays: tools for gene expression
The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes.
![Page 12: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/12.jpg)
How it works
Complementary hybridization:
- Put a part of the gene sequence on the array
- Convert mRNA to cDNA using reverse transcriptase
![Page 13: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/13.jpg)
Spotted Arrays
• Robot puts little spots of DNA on glass slides
• Each spot is a DNA analog of the mRNA we want to detect
![Page 14: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/14.jpg)
Spotted Arrays
• Two channel technology for comparing two samples – relative measurements
• Two mRNA samples (reference, test) are reverse transcribed to cDNA, labeled with fluorescent dyes (Cy3, Cy5) and allowed to hybridize to the array
![Page 15: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/15.jpg)
Spotted Arrays
• Read out two images by scanning the array with lasers, one for each dye
![Page 16: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/16.jpg)
Oligonucleotide Arrays
• One channel technology – absolute measurements
• Instead of putting entire genes on the array, put multiple oligonucleotide probes: short, fixed length DNA sequences (25-60 nucleotides)
• Oligos are synthesized in situ
• Affymetrix uses a photolithography process, similar to that used to make semiconductor chips
• Other technologies available (e.g. mirror arrays)
![Page 17: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/17.jpg)
Oligonucleotide Arrays
• For each gene, construct a probeset – a set of n-mers specific to this gene
![Page 18: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/18.jpg)
Advantages of microarray experiments
• Fast: data on >20,000 transcripts within weeks
• Comprehensive: entire yeast or mouse genome on a chip
• Flexible: custom arrays can be made to represent genes of interest
• Easy: submit RNA samples to a core facility
• Cheap?: chip representing 20,000 genes for $300
![Page 19: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/19.jpg)
Disadvantages of microarray experiments
Cost
■ Some researchers can’t afford to do appropriate numbers of controls, replicates
RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet represented on microarrays
Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians
![Page 20: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/20.jpg)
[Flow diagram: Sample acquisition → Data acquisition → Data analysis → Data confirmation → Biological insight]
![Page 21: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/21.jpg)
Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays
Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Microarray databases
![Page 22: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/22.jpg)
Stage 1: Experimental design
[1] Biological samples (technical and biological replicates): determine the data analysis approach at the outset
[2] RNA extraction, conversion, labeling, hybridization: except for RNA isolation, routinely performed at core facilities
[3] Arrangement of array elements on a surface: randomization can reduce spatially-based artifacts
Page 314
![Page 23: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/23.jpg)
Stage 2: RNA preparation
For Affymetrix chips, need total RNA (about 5 ug)
Confirm purity by running agarose gel
Measure A260/A280 to confirm purity, quantity
One of the greatest sources of error in microarray experiments is artifacts associated with RNA isolation; an appropriately balanced, randomized experimental design is necessary.
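The absorbance check can be sketched in a few lines of Python. This is only an illustration: the conversion factor (1 A260 unit ≈ 40 µg/mL of RNA) and the rule of thumb that a ratio near 2.0 suggests clean RNA are standard bench conventions rather than values stated on the slide, and the readings and function names are made up.

```python
# Assumption: 1 absorbance unit at 260 nm corresponds to roughly 40 ug/mL of RNA.
RNA_UG_PER_ML_PER_A260 = 40.0

def rna_concentration(a260, dilution_factor=1.0):
    """Estimated RNA concentration in ug/mL from an A260 reading."""
    return a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor

def purity_ratio(a260, a280):
    """A260/A280 ratio; values near 2.0 are commonly taken to indicate clean RNA."""
    return a260 / a280

# Made-up readings for a sample diluted 10-fold before measurement.
a260, a280 = 0.50, 0.25
print(rna_concentration(a260, dilution_factor=10))  # 200.0
print(purity_ratio(a260, a280))                     # 2.0
```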
![Page 24: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/24.jpg)
Stage 3: Hybridization to DNA arrays
The array consists of cDNA or oligonucleotides
Oligonucleotides can be deposited by photolithography
The sample is converted to cRNA or cDNA
(Note that the terms “probe” and “target” may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.)
![Page 25: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/25.jpg)
Stage 4: Image analysis
RNA transcript levels are quantitated
Fluorescence intensity is measured with a scanner.
![Page 26: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/26.jpg)
Rett
Control
Differential Gene Expression on a cDNA Microarray
α B Crystallin is over-expressed in Rett Syndrome
![Page 27: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/27.jpg)
![Page 28: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/28.jpg)
Fig. 8.21, Page 319
![Page 29: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/29.jpg)
![Page 30: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/30.jpg)
Fig. 8.21, Page 319
![Page 31: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/31.jpg)
Stage 5: Microarray data analysis
Page 318
Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?
Clustering
• Are there meaningful patterns in the data (e.g. groups)?
Classification
• Do RNA transcripts predict predefined groups, such as disease subtypes?
![Page 32: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/32.jpg)
Stage 6: Biological confirmation
Page 320
Microarray experiments can be thought of as “hypothesis-generating” experiments.
The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as
-- Northern blots
-- reverse transcription polymerase chain reaction (RT-PCR)
-- in situ hybridization
![Page 33: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/33.jpg)
Stage 7: Microarray databases
There are two main repositories:
Gene Expression Omnibus (GEO) at NCBI
ArrayExpress at the European Bioinformatics Institute (EBI)
![Page 34: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/34.jpg)
Microarray Overview I
[Diagram: microbial ORFs (design PCR primers → PCR products) and eukaryotic genes (select cDNA clones → PCR products) are collected in microtiter plates; many different plates contain different genes. For each plate set, many identical replicas are made. Result: a microarray slide with 60,000 or more spotted genes.]
![Page 35: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/35.jpg)
Microarray Overview II
[Diagram: prepare fluorescently labeled probes (control, test) → hybridize, wash → measure fluorescence in 2 channels (red/green) → analyze the data to identify patterns of gene expression]
![Page 36: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/36.jpg)
Affymetrix GeneChip™ Expression Analysis
[Diagram: obtain RNA samples → prepare fluorescently labeled probes (control, test) → hybridize and wash chips → scan chips → analyze; the chip image is labeled PM and MM]
![Page 37: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/37.jpg)
Microarray Expression Analysis
[Diagram labels: tissue selection; differential state/stage selection; RNA preparation and labeling; competitive hybridization; gene; spots on an array; fluorescence intensity; expression measurement]
![Page 38: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/38.jpg)
Steps in the Process
• Select array elements and annotate them
• Build a database to manage stuff
• Print arrays and manage the lab
• Hybridize and analyze images; manage data
• Analyze hybridization data and get results
![Page 39: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/39.jpg)
MIAME
In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information:
► experimental design
► microarray design
► sample preparation
► hybridization procedures
► image analysis
► controls for normalization
Visit http://www.mged.org
![Page 40: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/40.jpg)
Interpretation of RNA analyses
The relationship of DNA, RNA, and protein:
DNA is transcribed to RNA. RNA quantities and half-lives vary. There tends to be a low positive correlation between RNA and protein levels.

The pervasive nature of transcription:
The Encyclopedia of DNA Elements (ENCODE) project identified functional features of genomic DNA, initially in 30 megabases (1% of the human genome). One of its observations was the “pervasive nature of transcription”: the vast majority of DNA is transcribed, although the function is unknown.
![Page 41: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/41.jpg)
Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing
• normalization
• scatter plots
Inferential statistics
• t-test
• ANOVA
Exploratory (descriptive) statistics
• distances
• clustering
• principal components analysis (PCA)
![Page 42: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/42.jpg)
Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
[Figure: data matrix; rows are genes (RNA transcript levels)]
![Page 43: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/43.jpg)
Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
Typically, there are many genes (>> 20,000) and few samples (~ 10)
Fig. 9.1, Page 333
![Page 44: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/44.jpg)
Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
Preprocessing
Inferential statistics Descriptive statistics
![Page 45: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/45.jpg)
Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
![Page 46: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/46.jpg)
Microarray data analysis: preprocessing
The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.
A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.
![Page 47: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/47.jpg)
Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.
![Page 48: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/48.jpg)
Data analysis: global normalization
Global normalization is used to correct two or more data sets
Example: total fluorescence in Cy3 channel = 4 million units, Cy5 channel = 2 million units
Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.
![Page 49: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/49.jpg)
Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
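The two steps can be sketched in Python as follows. This is a minimal illustration, not the procedure of any particular package: it rescales one channel so that the two channel totals match, which is one simple way to bring the overall ratio to 1. The intensities, the background value and the function names are invented for the example.

```python
def subtract_background(intensities, background):
    """Step 1: subtract a background estimate, flooring the result at zero."""
    return [max(x - background, 0.0) for x in intensities]

def global_normalize(cy3, cy5):
    """Step 2: rescale Cy5 so that both channels have the same total intensity."""
    scale = sum(cy3) / sum(cy5)
    return [x * scale for x in cy5], scale

# Toy data: after background subtraction the Cy3 channel is uniformly twice as bright as Cy5.
cy3 = subtract_background([2100.0, 4100.0, 6100.0], background=100.0)
cy5 = subtract_background([1100.0, 2100.0, 3100.0], background=100.0)

cy5_scaled, scale = global_normalize(cy3, cy5)
ratios = [g / r for g, r in zip(cy3, cy5_scaled)]
print(scale)   # 2.0
print(ratios)  # [1.0, 1.0, 1.0]
```

Before scaling, every gene in this toy set would show an apparent 2-fold difference; after scaling, none does.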
![Page 50: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/50.jpg)
Scatter plots
Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
Each dot corresponds to a gene expression value
Most dots fall along a line
Outliers represent up-regulated or down-regulated genes
![Page 51: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/51.jpg)
Differential Gene Expression in Different Tissue and Cell Types
[Figure panels: Brain, Astrocyte, Astrocyte, Fibroblast]
![Page 52: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/52.jpg)
[Scatter plot: expression level (sample 1) versus expression level (sample 2); axes run from low to high]
![Page 53: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/53.jpg)
Log-log transformation
![Page 54: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/54.jpg)
Scatter plots
Typically, data are plotted on log-log coordinates
Visually, this spreads out the data and offers symmetry
| time | behavior | raw ratio value | log2 ratio value |
|------|-------------|-----|------|
| t=0  | basal       | 1.0 | 0.0  |
| t=1h | no change   | 1.0 | 0.0  |
| t=2h | 2-fold up   | 2.0 | 1.0  |
| t=3h | 2-fold down | 0.5 | -1.0 |
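The symmetry of the log2 scale is easy to check numerically. The short Python sketch below recomputes the log2 column from the raw ratios in the table; the dictionary layout is just an illustrative choice.

```python
import math

# Raw ratios from the table, keyed by time point.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# A 2-fold increase and a 2-fold decrease become +1 and -1 on the log2 scale.
log2_ratios = {t: math.log2(r) for t, r in raw_ratios.items()}
print(log2_ratios)  # {'t=0': 0.0, 't=1h': 0.0, 't=2h': 1.0, 't=3h': -1.0}
```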
![Page 55: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/55.jpg)
[Plot: log ratio (up/down) versus mean log intensity (expression level, low to high)]
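The two axes of such a plot can be computed directly from the two channel intensities. The sketch below assumes the usual convention M = log2(R) − log2(G) for the log ratio and A = ½(log2 R + log2 G) for the mean log intensity; the names `red`, `green` and `ma_values` and the numbers are illustrative.

```python
import math

def ma_values(red, green):
    """Return (M, A): log ratio and mean log intensity for each spot."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

# Three made-up spots: 4-fold up, unchanged, 4-fold down.
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
m, a = ma_values(red, green)
print(m)  # [2.0, 0.0, -2.0]
print(a)  # [9.0, 8.0, 7.0]
```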
![Page 56: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/56.jpg)
You can make these plots in Excel…
…but for many bioinformatics applications use R. Visit http://www.r-project.org to download it.
![Page 57: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/57.jpg)
![Page 58: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/58.jpg)
There are limits to what you can measure
![Page 59: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/59.jpg)
The Limits of log-ratios: The space we explore
![Page 60: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/60.jpg)
The Limits of log-ratios: The space we explore
![Page 61: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/61.jpg)
The Limits of log-ratios: The space we explore
![Page 62: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/62.jpg)
Good Data
![Page 63: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/63.jpg)
Bad Data from Parts Unknown
Gary Churchill
Each “pin group” is colored differently
![Page 64: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/64.jpg)
Lowess Normalization
Why LOWESS?
A SD = 0.346
1. Intensity-dependent structure
2. Data not mean centered at log2(ratio) = 0
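Both problems can be addressed by fitting a smooth trend of log ratio against intensity and subtracting it. The sketch below is a simplified stand-in for LOWESS: a locally weighted linear fit with tricube weights and no robustness iterations, written in plain Python. The function names, the `frac` parameter and the toy data are assumptions made for this example; a real analysis would use an established implementation.

```python
def local_linear_fit(x, y, frac=0.5):
    """Fit y on x with tricube-weighted local lines; returns the fitted value at each x."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours defining each local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def trend_normalize(a, m, frac=0.5):
    """Subtract the intensity-dependent trend from the log ratios M."""
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# Toy data: log ratios that drift with intensity and are not centered at zero.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
m = [0.1 * ai - 1.0 for ai in a]
m_norm = trend_normalize(a, m)
print(max(abs(v) for v in m_norm) < 1e-9)  # True
```

In this toy case the whole signal is trend, so the corrected log ratios are all essentially zero; with real data the genuinely regulated genes remain as deviations from the fitted curve.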
![Page 65: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/65.jpg)
Ratio Cy3/Cy5 for the same RNA, sorted from least to most expressed
![Page 66: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/66.jpg)
LOWESS Results
![Page 67: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/67.jpg)
Affymetrix Chips
![Page 68: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/68.jpg)
Mismatch (MM) probes
• MM probes are used to measure background signals due to non-specific sources and scanner offset.
• Using a MM probe as an estimate of background is questionable, because the MM signal is often >= the PM signal
• Some would claim that subtraction of the mismatch probe adds noise for little gain.
![Page 69: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/69.jpg)
Computing expression summaries: a three-step process
• Background/Signal adjustment
• Normalization (can happen at the probe-pair or the probe-set level)
• Summarization of probe-pairs into probe-set or gene level information
![Page 70: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/70.jpg)
Background/Signal Adjustment
• A method which does some or all of the following:
  - Corrects for background noise, processing effects
  - Adjusts for cross hybridization
  - Adjusts estimated expression values to fall on proper scale
• Probe intensities are used in background adjustment to compute correction (unlike cDNA arrays where area surrounding spot might be used)
![Page 71: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/71.jpg)
Normalization Methods
• Complete data (no reference chip, information from all arrays used)
  - Quantile normalization (Bolstad et al. 2003)
• Baseline (normalized using reference chip)
  - Scaling (Affymetrix)
  - Non linear (Li-Wong)
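Quantile normalization forces every array to share the same distribution of intensities: each value is replaced by the mean of the values that hold the same rank across all arrays. The Python sketch below illustrates the idea on two tiny arrays; the function name and data are invented, and ties are broken arbitrarily rather than averaged.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one per chip. Returns normalized copies."""
    n = len(arrays[0])
    sorted_arrays = [sorted(arr) for arr in arrays]
    # Mean of the values that share each rank across all arrays.
    rank_means = [sum(arr[i] for arr in sorted_arrays) / len(arrays) for i in range(n)]
    result = []
    for arr in arrays:
        order = sorted(range(n), key=lambda i: arr[i])  # indices from lowest to highest value
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both chips contain exactly the same set of values (1.5, 3.5, 5.5); only the assignment of values to probes differs.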
![Page 72: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/72.jpg)
Summarization
• Reduce the 11-20 probe intensities on each array to a single number for gene expression
• Main approaches
  Single chip
  - AvDiff (Affymetrix) – no longer recommended for use due to many flaws
  - MAS 5.0 (Affymetrix) – use a 1-step Tukey biweight to combine the probe intensities in log scale
  Multiple chip
  - MBEI (Li-Wong dChip) – a multiplicative model
  - RMA – a robust multi-chip linear model fit on the log scale
![Page 73: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/73.jpg)
Robust multi-array analysis (RMA)
• Developed by Rafael Irizarry (Dept. of Biostatistics), Terry Speed, and others
• Available at www.bioconductor.org as an R package
• Also available in various software packages (including Partek, www.partek.com and Iobion Gene Traffic)
• See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4

There are three steps:
[1] Background adjustment based on a normal plus exponential model (no mismatch data are used)
[2] Quantile normalization (nonparametric fitting of signal intensity data to normalize their distribution)
[3] Fitting a log scale additive model robustly. The model is additive: probe effect + sample effect
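Step [3] can be illustrated with median polish, the robust fitting procedure commonly associated with RMA. The sketch below is a plain-Python toy version, not the Bioconductor implementation: rows are probes of one probeset, columns are arrays, and the values are assumed to be log2 intensities that have already been background-adjusted and normalized. Function and variable names are illustrative.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Fit value = overall + probe effect + array effect by alternating median sweeps."""
    resid = [row[:] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    probe_eff = [0.0] * n_rows
    array_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for i in range(n_rows):
            med = median(resid[i])
            probe_eff[i] += med
            resid[i] = [v - med for v in resid[i]]
        med = median(array_eff)
        array_eff = [c - med for c in array_eff]
        overall += med
        # Sweep out column (array) medians.
        for j in range(n_cols):
            med = median(resid[i][j] for i in range(n_rows))
            array_eff[j] += med
            for i in range(n_rows):
                resid[i][j] -= med
        med = median(probe_eff)
        probe_eff = [r - med for r in probe_eff]
        overall += med
    return overall, probe_eff, array_eff

# Toy probeset: 3 probes (rows) on 3 arrays (columns), exactly additive on the log2 scale.
log_pm = [[8.0, 9.0, 10.0],
          [9.0, 10.0, 11.0],
          [7.0, 8.0, 9.0]]
overall, probe_eff, array_eff = median_polish(log_pm)
expression = [overall + c for c in array_eff]  # one summary value per array
print(expression)  # [8.0, 9.0, 10.0]
print(probe_eff)   # [0.0, 1.0, -1.0]
```

Because medians rather than means are swept out, a single aberrant probe has little influence on the per-array summary.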
![Page 74: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/74.jpg)
GCRMA
• GC-RMA is a modified version of RMA that models intensity of probe level data as a function of GC-content
• expect to see higher intensity values for probes that are GC rich due to increased binding
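The covariate in question is simple to compute. The Python sketch below only shows how the GC fraction of a probe sequence might be obtained; it is not the GC-RMA model itself, and the sequences and function name are made up for illustration.

```python
def gc_content(probe):
    """Fraction of G and C bases in a probe sequence."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)

print(gc_content("GCGCGATATA"))  # 0.5
print(gc_content("ATATATATAT"))  # 0.0
```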
![Page 75: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/75.jpg)
![Page 76: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/76.jpg)
[MA plots: M versus A, two panels]
After RMA (a normalization procedure), the median is near zero, and skewing is corrected.
Scatterplots display the effects of normalization.
![Page 77: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/77.jpg)
vsn: variance stabilizing normalization
• Variance depends on signal intensity in microarray data
• A transformation can be found after which the variance is approximately constant
• The transformation behaves like the logarithm at the upper end of the intensity range and is approximately linear at the lower end
• Also incorporates the estimation of "normalization" parameters (shift and scale)
• Assumes that less than half of the genes on the arrays are differentially transcribed across the experiment.
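A transformation with this behaviour is the inverse hyperbolic sine (arsinh), the function family on which vsn is based. The Python sketch below only demonstrates the two limiting regimes; the shift `a` and scale `b` stand in for the normalization parameters that vsn would estimate from the data, and the default values here are arbitrary.

```python
import math

def glog(x, a=0.0, b=1.0):
    """arsinh-type transform: roughly linear for small a + b*x, roughly logarithmic for large."""
    return math.asinh(a + b * x)

small, large = 1e-3, 1e4
print(abs(glog(small) - small) < 1e-6)                # True: close to x itself
print(abs(glog(large) - math.log(2 * large)) < 1e-6)  # True: close to log(2x)
```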
![Page 78: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/78.jpg)
vsn: post-normalization plot
![Page 79: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/79.jpg)
[Two panels: log signal intensity (y-axis) for each array (x-axis)]
Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied.
![Page 80: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/80.jpg)
RMA can adjust for the effect of GC content
[Plot: log intensity versus GC content]
![Page 81: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/81.jpg)
Robust multi-array analysis (RMA)
RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software).
[Precision plot: log expression SD versus average log expression, for RMA and MAS 5.0]
![Page 82: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/82.jpg)
Robust multi-array analysis (RMA)
RMA offers comparable accuracy to MAS 5.0.
[Accuracy plot: observed log expression versus log nominal concentration]
![Page 83: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/83.jpg)
Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing
• normalization
• scatter plots
Inferential statistics
• t-test
• ANOVA
Exploratory (descriptive) statistics
• distances
• clustering
• principal components analysis (PCA)
![Page 84: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/84.jpg)
Inferential statistics
Inferential statistics are used to make inferences about a population from a sample.

Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference.

We use a test statistic to decide whether to reject the null hypothesis. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.
![Page 85: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/85.jpg)
Analyzing expression data
Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease.
[1] Obtain a matrix of genes (rows) and expression values (columns). Here there are 20,000 rows of genes of which the first six are shown. There are three control samples and three disease samples. Calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.
[Table column groups: control, disease]
![Page 86: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/86.jpg)
[2] Calculate the ratios of control versus disease.
Also note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest such as two-fold.
Analyzing expression data
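Steps [1] and [2] can be sketched in a few lines of Python. This is only an illustration: the gene names and expression values are hypothetical, and the ratio is taken as disease over control, which is an assumption because the slide does not fix the direction.

```python
from statistics import mean

# Hypothetical expression values: three control and three disease samples per gene.
expression = {
    "gene_a": {"control": [100.0, 110.0, 90.0], "disease": [200.0, 190.0, 210.0]},
    "gene_b": {"control": [50.0, 55.0, 45.0], "disease": [52.0, 49.0, 55.0]},
}


def fold_ratio(values):
    """Ratio of group means (assumed direction: disease mean / control mean)."""
    return mean(values["disease"]) / mean(values["control"])


def passes_cutoff(ratio, cutoff=2.0):
    """True if the change is at least `cutoff`-fold in either direction."""
    return ratio >= cutoff or ratio <= 1.0 / cutoff


ratios = {gene: fold_ratio(values) for gene, values in expression.items()}
flagged = [gene for gene, ratio in ratios.items() if passes_cutoff(ratio)]
```

A fold-change cut-off of this kind says nothing about variability within each group, which is why the following slides turn to a test statistic.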
![Page 87: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/87.jpg)
Figure panels: a significant difference (two panels); probably not (two panels).
![Page 88: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/88.jpg)
Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}}$$
Questions
• Is the sample size (n) adequate?
• Are the data normally distributed?
• Is the variance of the data known?
• Is the variance the same in the two groups?
• Is it appropriate to set the significance level to p < 0.05?
![Page 89: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/89.jpg)
Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}}$$
Notes
• t is a ratio (it thus has no units)
• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t test, the degrees of freedom is N - 2.
• For any value of t, P gets smaller as df gets larger
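As a rough illustration of the formula, the sketch below computes t and the degrees of freedom (N - 2) for one transcript. It assumes equal variance in the two groups (one of the questions raised earlier) and uses a pooled standard error; the sample values are hypothetical, and the critical value 2.776 is the standard table entry for a two-sided test at α = 0.05 with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group1, group2):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2  # N - 2 for a two-sample test
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))  # standard error of the difference
    return (mean(group1) - mean(group2)) / se, df


# Hypothetical values for one transcript.
control = [100.0, 110.0, 90.0]
disease = [200.0, 190.0, 210.0]

t, df = pooled_t(disease, control)
T_CRITICAL_DF4 = 2.776  # table value: two-sided, alpha = 0.05, df = 4
significant = abs(t) > T_CRITICAL_DF4
```

With only three samples per group the table value is large, so a transcript needs a big difference relative to its variability to pass.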
![Page 90: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/90.jpg)
[3] Perform a t-test. Hypothesis is that the transcript in the disease group is up (or down) relative to controls.
Analyzing expression data
![Page 91: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/91.jpg)
[3] Note the results: you can have…
a small p value (<0.05) with a big ratio difference
a small p value (<0.05) with a trivial ratio difference
a large p value (>0.05) with a big ratio difference
a large p value (>0.05) with a trivial ratio difference
Analyzing expression data
![Page 92: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/92.jpg)
Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05.
You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes:
$p < 0.05 / 10{,}000$, or $p < 5 \times 10^{-6}$
The Bonferroni correction is generally considered to be too conservative.
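The arithmetic above can be checked with a short sketch. The gene names and p values are hypothetical; only the 10,000 genes and the 0.05 level come from the slide.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


n_genes = 10_000
alpha = 0.05

expected_by_chance = alpha * n_genes  # about 500 genes pass p < 0.05 by chance
threshold = bonferroni_threshold(alpha, n_genes)  # about 5e-6

# Hypothetical p values for three transcripts.
p_values = {"gene_a": 1e-7, "gene_b": 0.003, "gene_c": 0.04}
significant = [gene for gene, p in p_values.items() if p < threshold]
```

In this toy case only gene_a survives the correction, although all three pass the uncorrected 0.05 level.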
![Page 93: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/93.jpg)
Inferential statistics: false discovery rate
The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery.
The expected number of false positives equals the p value threshold of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). The FDR is the expected proportion of false discoveries among the transcripts called regulated. You can adjust the false discovery rate. For example:

| FDR | # regulated transcripts | # false discoveries |
|------|-----|----|
| 0.1 | 100 | 10 |
| 0.05 | 45 | 3 |
| 0.01 | 20 | 1 |

Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
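The slides do not say which FDR procedure is meant. One widely used choice is the Benjamini-Hochberg step-up procedure, sketched below as an assumption rather than as the lecture's method; the p values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of hypotheses rejected at the given FDR (Benjamini-Hochberg)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p value is <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is called significant.
    return sorted(order[:cutoff_rank])


# Hypothetical p values for ten transcripts.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, fdr=0.05)
```

Raising the `fdr` argument reports more transcripts at the price of more expected false discoveries, which is the trade-off posed in the question above.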
![Page 94: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/94.jpg)
Inferential statistics: other methods used
• t-test for two sample groups, SAM and t-tests with permutation testing
• ANOVA for multiple factors
• Linear models with Bayesian moderation of variance
  Smyth G. (2004) “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments”
• Simultaneous inference: multivariate t-distributions for simultaneous confidence intervals
  Hsu et al. (1996) “Multiple Comparisons: Theory and Methods”
  Hsu et al. (2006) “Screening for Differential Gene Expressions from Microarray Data”
![Page 95: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/95.jpg)
A volcano plot displays both p values and fold change
Figure axes: log fold change (treated/untreated) on the x-axis; p value (treated versus control) on the y-axis.
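The coordinates of one point on such a plot could be computed as below. The base-2 logarithm for fold change and the -log10 transform of the p value are common conventions assumed here, not something the slide states, and the input values are hypothetical.

```python
from math import log2, log10


def volcano_point(mean_treated, mean_untreated, p_value):
    """Return (log2 fold change, -log10 p value) for one transcript."""
    return log2(mean_treated / mean_untreated), -log10(p_value)


# A two-fold increase with p = 0.001.
x, y = volcano_point(200.0, 100.0, 0.001)
```

Transcripts of interest then sit in the upper left and upper right corners: large change in either direction and a small p value.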
![Page 96: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/96.jpg)
Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing: normalization, scatter plots
Inferential statistics: t-test, ANOVA
Exploratory (descriptive) statistics: distances, clustering, principal components analysis (PCA)
![Page 97: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/97.jpg)
![Page 98: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/98.jpg)
![Page 99: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/99.jpg)
![Page 100: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/100.jpg)
Descriptive statistics
Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples.
Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
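Both measures are easy to write out by hand. The sketch below uses two hypothetical expression profiles with the same shape but different levels, so the Euclidean distance is large while the Pearson correlation is close to +1. Since Pearson's r is a similarity, a quantity such as 1 - r is often used when a distance is needed (an assumption here, not taken from the slide).

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical profiles: same trend, shifted by a constant.
profile_a = [1.0, 2.0, 3.0]
profile_b = [3.0, 4.0, 5.0]

distance = euclidean(profile_a, profile_b)
correlation = pearson(profile_a, profile_b)
```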
![Page 101: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/101.jpg)
What is a cluster?
A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.
![Page 102: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/102.jpg)
Data matrix (20 genes and 3 time points, from Chu et al., 1998)
Software: S-PLUS package
Figure: rows are genes; columns are samples (time points).
![Page 103: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/103.jpg)
3D plot (using S-PLUS software)
Figure axes: t = 0, t = 0.5, t = 2.0.
![Page 104: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/104.jpg)
Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions of microarray data.
Genes may be clustered, or samples, or both.
We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first).
In each case, we end up with a tree having branches and nodes.
Page 355
![Page 105: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/105.jpg)
Distance Is Defined by a Metric
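| Distance metric (D) | Euclidean | Pearson* |
|---|---|---|
| First pair of profiles | 6.0 | +1.00 |
| Second pair of profiles | 1.4 | -0.05 |

Figure y-axis: log2(cy5/cy3), scale -3 to 3.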
![Page 106: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/106.jpg)
Distance is Defined by a Metric
| Distance metric (D) | Euclidean | Pearson (r*-1) |
|---|---|---|
| First pair of profiles | 4.2 | -1.00 |
| Second pair of profiles | 1.4 | -0.90 |

Figure y-axis: log2(cy5/cy3), scale -2 to 2.
![Page 107: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/107.jpg)
Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”

Distance Matrix

|       | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | Gene 6 |
|-------|------|------|------|------|------|------|
| Gene1 | 0 | 1.5 | 1.2 | 0.25 | 0.75 | 1.4 |
| Gene2 | 1.5 | 0 | 1.3 | 0.55 | 2.0 | 1.5 |
| Gene3 | 1.2 | 1.3 | 0 | 1.3 | 0.75 | 0.3 |
| Gene4 | 0.25 | 0.55 | 1.3 | 0 | 0.25 | 0.4 |
| Gene5 | 0.75 | 2.0 | 0.75 | 0.25 | 0 | 1.2 |
| Gene6 | 1.4 | 1.5 | 0.3 | 0.4 | 1.2 | 0 |

The elements of this matrix are the pair-wise distances. Note that the matrix is symmetric about the diagonal.
![Page 108: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/108.jpg)
Agglomerative clustering
Figure: objects a, b, c, d, e on a step axis with ticks 0 to 4; a and b are merged first into {a,b}.
Adapted from Kaufman and Rousseeuw (1990)
![Page 109: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/109.jpg)
Agglomerative clustering
Figure: objects a, b, c, d, e on a step axis with ticks 0 to 4; clusters so far: {a,b}, {d,e}.
![Page 110: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/110.jpg)
Agglomerative clustering
Figure: objects a, b, c, d, e on a step axis with ticks 0 to 4; clusters so far: {a,b}, {d,e}, {c,d,e}.
![Page 111: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/111.jpg)
Agglomerative clustering
Figure: objects a, b, c, d, e on a step axis with ticks 0 to 4; clusters so far: {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}.
…tree is constructed
![Page 112: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/112.jpg)
Divisive clustering
Figure: step axis with ticks 4 to 0; start from the single cluster {a,b,c,d,e}.
![Page 113: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/113.jpg)
Divisive clustering
Figure: step axis with ticks 4 to 0; clusters so far: {a,b,c,d,e}, {c,d,e}.
![Page 114: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/114.jpg)
Divisive clustering
Figure: step axis with ticks 4 to 0; clusters so far: {a,b,c,d,e}, {c,d,e}, {d,e}.
![Page 115: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/115.jpg)
Divisive clustering
Figure: step axis with ticks 4 to 0; clusters so far: {a,b,c,d,e}, {c,d,e}, {d,e}, {a,b}.
![Page 116: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/116.jpg)
Divisive clustering
Figure: step axis with ticks 4 to 0; clusters {a,b,c,d,e}, {c,d,e}, {d,e}, {a,b}, and finally the individual objects a, b, c, d, e.
…tree is constructed
![Page 117: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/117.jpg)
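Figure: one tree over objects a, b, c, d, e with clusters {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}, shown with two step axes (ticks 0 to 4), one labeled agglomerative and one labeled divisive.
Adapted from Kaufman and Rousseeuw (1990)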
![Page 118: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/118.jpg)
![Page 119: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/119.jpg)
![Page 120: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/120.jpg)
Agglomerative and divisive clustering sometimes give conflicting results, as shown here
![Page 121: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/121.jpg)
Agglomerative Linkage Methods
Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
Three linkage methods that are commonly used are:
• Single Linkage
• Average Linkage
• Complete Linkage
![Page 122: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/122.jpg)
Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$$D_{AB} = \min_{i,j} \, d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
![Page 123: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/123.jpg)
Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
![Page 124: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/124.jpg)
Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$$D_{AB} = \max_{i,j} \, d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
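To make the three linkage rules concrete, here is a minimal sketch of agglomerative clustering applied to the six-gene distance matrix shown earlier in the lecture. The function name, the return format, and the tie-breaking rule (first pair found wins) are illustrative choices, not part of the slides or of any particular package.

```python
def agglomerate(dist, labels, linkage="single"):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of labels.
    """
    combine = {
        "single": min,  # minimum pairwise distance
        "complete": max,  # maximum pairwise distance
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest cluster-to-cluster distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairwise = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(pairwise)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            tuple(labels[i] for i in clusters[a]),
            tuple(labels[i] for i in clusters[b]),
            d,
        ))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


labels = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"]
dist = [
    [0, 1.5, 1.2, 0.25, 0.75, 1.4],
    [1.5, 0, 1.3, 0.55, 2.0, 1.5],
    [1.2, 1.3, 0, 1.3, 0.75, 0.3],
    [0.25, 0.55, 1.3, 0, 0.25, 0.4],
    [0.75, 2.0, 0.75, 0.25, 0, 1.2],
    [1.4, 1.5, 0.3, 0.4, 1.2, 0],
]

single = agglomerate(dist, labels, "single")
complete = agglomerate(dist, labels, "complete")
```

All linkage rules agree on the first merge (Gene1 with Gene4), but the later merge distances differ: single linkage chains genes on at small distances, while complete linkage reports the largest distance inside each new cluster.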
![Page 125: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/125.jpg)
Comparison of Linkage Methods
Figure panels: Single, Average, Complete.
![Page 126: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/126.jpg)
Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)
![Page 127: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/127.jpg)
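Figure: points A and B and their counterparts A′ and B′ on axes x1 and x2 (ticks 0.5, 1, 1.5), with coordinates a1, a2, b1, b2, a′1, a′2, b′1, b′2 and angles α, β, γ, illustrating Euclidean distance, chord distance, and angle distance.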
![Page 128: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/128.jpg)
K-Means/Medians Clustering – 1
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
Figure: genes G1 to G13.
![Page 129: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/129.jpg)
K-Means/Medians Clustering – 2
3. Calculate mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene’s expression profile.
Figure: genes G1 to G13 regrouped into clusters.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
k-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should belong to.
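Steps 1 to 5 can be sketched as follows, using means and Euclidean distance. This is a minimal illustration, not the implementation of any particular tool: the profiles are hypothetical, the seed parameter and the round-robin dealing of a shuffled gene order (which keeps every cluster non-empty at the start, assuming k is no larger than the number of genes) are choices made for this sketch.

```python
import random
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_profile(profiles):
    """Component-wise mean of a non-empty list of equal-length profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def k_means(profiles, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: random initial assignment of genes to clusters.
    order = list(range(len(profiles)))
    rng.shuffle(order)
    assignment = [0] * len(profiles)
    for position, gene in enumerate(order):
        assignment[gene] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 3: mean profile per cluster (an emptied cluster keeps its old centroid).
        for cluster in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == cluster]
            if members:
                centroids[cluster] = mean_profile(members)
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        # Step 5: stop when no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


# Hypothetical two-sample profiles forming two well-separated groups.
profiles = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
assignment, centroids = k_means(profiles, k=2)
```

On real data the result depends on the random start, which is the motivation for repeating the run several times.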
![Page 130: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/130.jpg)
K-Means / K-Medians Support (KMS)
Because of the random initialization of K-Means/K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means/K-Medians.
Run K-Means / K-Medians multiple times.
The KMS module generates clusters in which the member genes frequently group together in the same clusters (“consensus clusters”) across multiple runs of K-Means / K-Medians.
The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.
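The idea of consensus clusters can be sketched from the cluster labels of several runs. The slides do not describe how the KMS module builds its clusters internally, so the grouping rule below (link any two genes that co-cluster in at least the threshold fraction of runs, then take connected groups) is an assumption for illustration, and the run labels are hypothetical.

```python
from itertools import combinations


def co_clustering_fraction(runs):
    """Fraction of runs in which each pair of genes shares a cluster."""
    n_genes = len(runs[0])
    return {
        (i, j): sum(run[i] == run[j] for run in runs) / len(runs)
        for i, j in combinations(range(n_genes), 2)
    }


def consensus_clusters(runs, threshold):
    """Groups of genes linked by pairs that co-cluster in >= threshold of the runs."""
    n_genes = len(runs[0])
    parent = list(range(n_genes))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for (i, j), fraction in co_clustering_fraction(runs).items():
        if fraction >= threshold:
            parent[find(j)] = find(i)  # join the two groups

    groups = {}
    for gene in range(n_genes):
        groups.setdefault(find(gene), []).append(gene)
    return sorted(groups.values())


# Hypothetical cluster labels for five genes over four runs.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
clusters = consensus_clusters(runs, threshold=0.75)
```

Cluster labels are arbitrary between runs, so only co-membership is compared, never the label values themselves.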
![Page 131: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/131.jpg)
Principal components analysis (PCA)
An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D
For a matrix of m genes x n samples, create a new covariance matrix of size n x n
Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
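As a small illustration, the sketch below builds the n x n covariance matrix between samples from an m x n matrix and finds only the first principal component by power iteration. Power iteration is a simple stand-in chosen to avoid external libraries; a full PCA would compute all eigenvectors. The data are hypothetical, and the method assumes a non-zero covariance matrix and a starting vector that is not orthogonal to the leading component.

```python
from math import sqrt


def covariance_matrix(rows):
    """Sample covariance between columns; `rows` is m genes x n samples."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (m - 1) for b in range(n)]
        for a in range(n)
    ]


def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    n = len(cov)
    vector = [1.0 / sqrt(n)] * n  # assumed not orthogonal to the leading component
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(n)) for a in range(n)]
        norm = sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    eigenvalue = sum(
        vector[a] * cov[a][b] * vector[b] for a in range(n) for b in range(n)
    )
    return vector, eigenvalue


# Hypothetical matrix: 4 genes x 2 samples, with the second sample twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(data)
pc1, variance_explained = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
```

Here the two samples are perfectly correlated, so the first component carries all of the variance and points along the direction (1, 2).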
![Page 132: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/132.jpg)
Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combination of variables
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g. genes/samples)
• to identify outliers
![Page 133: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/133.jpg)
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm
![Page 134: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/134.jpg)
![Page 135: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/135.jpg)
![Page 136: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/136.jpg)
![Page 137: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/137.jpg)
![Page 138: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/138.jpg)
![Page 139: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/139.jpg)
![Page 141: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/141.jpg)
RNA-seq
• Sequencing technology is making fast progress
• Idea: sequencing is so cheap that we can sequence mRNA molecules directly
“Digital Gene Expression”
![Page 142: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/142.jpg)
RNA-seq (a) After two rounds of poly(A) selection, RNA is fragmented to an average length of 200 nt by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming. The cDNA is then converted into a molecular library for Illumina/Solexa 1G sequencing, and the resulting 25-bp reads are mapped onto the genome. Normalized transcript prevalence is calculated with an algorithm from the ERANGE package.
(b) Primary data from mouse muscle RNAs that map uniquely in the genome to a 1-kb region of the Myf6 locus, including reads that span introns. The RNA-Seq graph above the gene model summarizes the quantity of reads, so that each point represents the number of reads covering each nucleotide, per million mapped reads (normalized scale of 0–5.5 reads).
(c) Detection and quantification of differential expression. Mouse poly(A)-selected RNAs from brain, liver and skeletal muscle for a 20-kb region of chromosome 10 containing Myf6 and its paralog Myf5, which are muscle specific. Myf6 is highly expressed in mature muscle, whereas Myf5 is expressed at very low levels from a small number of cells. The specificity of RNA-Seq is high: Myf6 expression is known to be highly muscle specific, and only 4 reads out of 71 million total liver and brain mapped reads were assigned to the Myf6 gene model.
![Page 144: Введение в биоинформатику, весна 2010: Лекции 3-4](https://reader034.fdocument.pub/reader034/viewer/2022042707/58f349b41a28ab26458b45d7/html5/thumbnails/144.jpg)
Acknowledgements
• This presentation uses slides/graphics from:
  J. Pevsner (Johns Hopkins, http://www.bioinfbook.org)
  J. Quackenbush (DFCI, Harvard)
  C. Dewey (Wisconsin, http://www.biostat.wisc.edu/bmi576)