Post on 02-Jan-2016
Dragon Star Course: Cancer Bioinformatics, Hands-on Computer Lab
Sha Cao (曹莎)  Email: scaorobin@sina.com
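Course outline
• Introduction to the data types; a quick start in R
• Correlation between gene expression data and protein expression data
• Testing for differential expression, false discovery rate (FDR) control, batch effects
• Gene mutation data and pathway enrichment analysis
• Correlation within gene expression data and biclustering analysis
• Integrating data types: gene expression data with metabolic profiling data; gene expression data with epigenetic data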
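Introduction to data types: gene expression data
• Microarray
  – High-throughput measurement of tens of thousands of probes
  – Relatively low precision
• Where can it be obtained?
  – GEO DataSets, ArrayExpress, TCGA
• What information do these data contain?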
What to check before using microarray data
• Organism
• Experimental design
• Sample list (sample distribution, sample size)
• Platform
• Important!
Introduction to data types: gene expression data
• RNA-seq
• Where can it be obtained?
  – TCGA, SRA
• What information do these data contain?
Data levels and data types
• https://tcga-data.nci.nih.gov/tcga/tcgaDataType.jsp
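Introduction to data types: genomic data
• Somatic point mutation
• Where can it be obtained?
  – TCGA, GEO, SRA
• What do these data measure, and what information do they contain?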
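Introduction to data types: epigenetic data
• DNA methylation data
• Where can it be obtained?
  – TCGA, GEO DataSets
• What do these data measure, and what information do they contain?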
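Introduction to data types: epigenetic data
• Histone modification data
• Where can it be obtained?
  – Very limited
• What do these data measure, and what information do they contain?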
Introduction to data types: proteomic data
• Protein array
• Where can it be obtained?
  – TCGA, literature search
• What do these data measure, and what information do they contain?
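Introduction to data types: metabolomic data
• Metabolic profiling
• Where can it be obtained?
  – Literature search
• What do these data measure, and what information do they contain?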
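A quick start in R
• Simple data manipulation
• Statistical tests
• Statistical modeling (regression, matrix factorization, etc.)
• Visualization
• print(matrix(c(1,2,3,4), 2, 2))
• print(list("a","b","c"))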
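Basic functions
• ls()       # list the objects in the workspace
• rm()       # remove objects from the workspace
• c()        # create a vector; c() is itself a function
• mode()     # storage mode of an object
• class()    # class of an object
• mean(x)
• median(x)
• sd(x)
• var(x)
• cor(x, y)  # correlation between x and y
• cov(x, y)  # covariance between x and y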
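The summary functions listed above can be tried on a pair of small vectors. This is only a sketch: the vectors x and y are made-up toy values, not course data.

```r
# Hypothetical toy vectors, chosen only for illustration
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 9)

x_mean   <- mean(x)     # arithmetic mean
x_median <- median(x)   # middle value
x_sd     <- sd(x)       # sample standard deviation (n - 1 denominator)
x_var    <- var(x)      # sample variance, equal to sd(x)^2
xy_cor   <- cor(x, y)   # Pearson correlation by default
xy_cov   <- cov(x, y)   # sample covariance

print(c(mean = x_mean, median = x_median, sd = x_sd, var = x_var))
print(c(cor = xy_cor, cov = xy_cov))
```

Note that cor(x, y) equals cov(x, y) divided by sd(x) * sd(y), which is a quick way to check the two functions against each other.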
Creating sequences
• 1:5
• 5:1
• seq(from=0, to=20, by=5)
• 1.1:10.1
• 1.1:10.3
• a<-rep(0,3)
• rep(c(1,2,a),2)
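The two non-integer sequences above are the least obvious ones, so a short sketch may help. It stores each sequence in a variable (the names s1 to s5 and r are introduced here only for illustration) and prints the lengths.

```r
s1 <- 1:5                             # counts up by 1
s2 <- 5:1                             # counts down by 1
s3 <- seq(from = 0, to = 20, by = 5)  # explicit step size
s4 <- 1.1:10.1                        # starts at 1.1, steps by 1, ends at 10.1
s5 <- 1.1:10.3                        # same start; the next step (11.1) would pass 10.3
a  <- rep(0, 3)                       # three zeros
r  <- rep(c(1, 2, a), 2)              # the vector 1 2 0 0 0, repeated twice

print(s4)
print(s5)
print(c(length(s4), length(s5)))      # both sequences have the same length
print(r)
```

The colon operator always steps by 1 from the left value and stops before exceeding the right value, which is why 1.1:10.1 and 1.1:10.3 give the same result.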
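Basic calculations
• +
• -
• *
• /
• %%     # modulo (remainder)
• ^      # power
• %*%    # matrix multiply
• log(x)
• sin(x)
• exp(x)
• exp(1) # Euler's number e (R has no built-in constant named e)
• pi     # lowercase in R
• Inf
• NA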
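A brief sketch of the less familiar operators and special values above. The matrix A is a made-up example; the point is the difference between * (element by element) and %*% (matrix product).

```r
r_mod <- 7 %% 3            # remainder after dividing 7 by 3
r_pow <- 2 ^ 10            # 2 to the power 10

A <- matrix(1:4, 2, 2)     # filled column by column: rows are (1, 3) and (2, 4)
elementwise <- A * A       # squares each entry
matprod     <- A %*% A     # true matrix product

e_const  <- exp(1)         # Euler's number
circle   <- 2 * pi         # pi is a built-in constant
infinite <- 1 / 0          # division by zero gives Inf, not an error
missing  <- NA + 1         # arithmetic with NA stays NA

print(elementwise)
print(matprod)
print(c(r_mod, r_pow, e_const, circle, infinite, missing))
```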
Data mode: physical type
> mode(3.1415) # Mode of a number
[1] "numeric"
> mode(c(2.7182, 3.1415)) # Mode of a vector of numbers
[1] "numeric"
> mode("Moe") # Mode of a character string
[1] "character"
Data class: abstract type
• scalar
• array (vector)
• matrix
  – From array (vector) to matrix
• factor (looks like a vector but has special properties; used for categorical variables or grouping)
• data.frame
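The difference between mode (physical type) and class (abstract type) is easiest to see side by side. A minimal sketch, using made-up values; the stage labels are hypothetical and not taken from the course data.

```r
v <- 1:6
m <- matrix(v, nrow = 2, ncol = 3)           # from vector to matrix, filled column by column
stage <- factor(c("II", "I", "II", "III"))   # hypothetical categorical labels

print(mode(v));     print(class(v))          # same numbers, two views of their type
print(is.matrix(m)); print(dim(m))
print(levels(stage))                         # the distinct categories
print(table(stage))                          # counts per category
print(mode(stage)); print(class(stage))      # stored as numbers, behaves as a factor
print(as.integer(stage))                     # the integer codes behind the labels
```

A factor prints like a character vector, but it is stored as integer codes plus a set of levels, which is why mode() and class() disagree for it.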
data.frame versus matrix

data.frame
• Same data mode within each column
• Unique row/column names (rownames, colnames)
• One row of a data.frame is a data.frame
• as.data.frame(****)

matrix
• Same data mode in the whole matrix
• Can have repeated row/column names
• One row of a matrix is an array (vector)
• as.matrix(****)
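The contrast above can be checked directly. This sketch builds a tiny data.frame (the column names id and age and their values are hypothetical), converts it to a matrix, and compares what a single row looks like in each.

```r
# Hypothetical two-sample table, for illustration only
df <- data.frame(id = c("s1", "s2"), age = c(61, 47), stringsAsFactors = FALSE)

col_modes <- sapply(df, mode)   # each column keeps its own mode
m <- as.matrix(df)              # one mode for the whole matrix: everything becomes character

row_df <- df[1, ]               # one row of a data.frame is still a data.frame
row_m  <- m[1, ]                # one row of a matrix is a plain (named) vector

print(col_modes)
print(mode(m))
print(class(row_df))
print(is.null(dim(row_m)))
```

Because a matrix holds a single mode, mixing text and numbers in as.matrix() silently turns the numbers into strings; this is one reason clinical tables are usually kept as data.frames.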
Data types handled in this course
• Clinical data -> data.frame
• Experimental data -> data.frame or matrix
  – Microarray data
  – RNA-seq data
  – Somatic mutation data
  – Protein array
  – DNA methylation data
Data combining
• cbind
  – Combine data by column
• rbind
  – Combine data by row
• E.g.
a<-matrix(0,2,2)
b<-matrix(1,2,2)
cbind(a,b)
rbind(a,b)
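Checking the dimensions of the results makes the difference between the two functions concrete. A small sketch reusing the matrices a and b from the example; the result names side_by_side and stacked are introduced here only for illustration.

```r
a <- matrix(0, 2, 2)
b <- matrix(1, 2, 2)

side_by_side <- cbind(a, b)   # columns of b are appended to the right of a
stacked      <- rbind(a, b)   # rows of b are appended below a

print(dim(side_by_side))      # rows stay the same, columns add up
print(dim(stacked))           # columns stay the same, rows add up
```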
length
• a<-c(1:5)
• length(a)
apply
• Apply functions over array margins
• apply(DATA, MARGIN, FUNCTION, ...)
  – MARGIN = 1 for rows; 2 for columns
• E.g.
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
apply(m, 1, mean)
apply(m, 2, mean)
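As a sanity check on MARGIN, the apply() results can be compared with the built-in rowMeans() and colMeans(). The sketch also passes a small anonymous function (the range of each column), which is an addition for illustration and not part of the original example.

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

row_means <- apply(m, 1, mean)    # MARGIN = 1: one value per row
col_means <- apply(m, 2, mean)    # MARGIN = 2: one value per column

# Any function of a vector can be used, including one written inline
col_ranges <- apply(m, 2, function(v) max(v) - min(v))

print(length(row_means))
print(col_means)
print(col_ranges)
print(all.equal(row_means, rowMeans(m)))
print(all.equal(col_means, colMeans(m)))
```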
Finding patterns
• The which command
• which(****): **** should be a logical expression
• which(****) returns the indices of the TRUE elements of the logical expression
• E.g.
x<- floor(10*runif(10))
x
which(x<5)
x[which(x<5)]
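A short sketch of how which() relates to plain logical indexing. The call to set.seed() is an addition so that the random draw is repeatable; the variable names are chosen here for illustration.

```r
set.seed(1)                      # fixed seed, added only to make the sketch repeatable
x <- floor(10 * runif(10))       # ten random integers between 0 and 9

is_small    <- x < 5             # logical vector, one TRUE/FALSE per element
idx         <- which(is_small)   # positions of the TRUE elements
via_which   <- x[idx]            # subset by position
via_logical <- x[is_small]       # subset by the logical vector directly

print(x)
print(idx)
print(identical(via_which, via_logical))
print(sum(is_small))             # TRUE counts as 1, so this counts the matches
```

When the vector has no NA values the two subsets are the same; which() is mainly useful when the positions themselves are needed, as in the clinical data example later in this lab.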
For loop
For loop: http://en.wikipedia.org/wiki/For_loop
In computer science, a for loop is a programming language statement that allows code to be executed repeatedly.
Question: calculate the sum of all the values in the vector x<- floor(10*runif(10))
A real computer program! E.g.
for(i in 1:100){
  print("Hello world!")
  print(i*i)
}
General form of a for loop:
for(*** in ***){}
for(VARIABLE in TARGETSET){}
for(i in 1:100){}
Answer to the question:
x <- floor(10*runif(10))
total_x <- 0
for(i in 1:length(x)){
  print(i)
  print(x[i])
  total_x <- total_x + x[i]
}
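The loop result can be checked against the built-in sum(). In this sketch set.seed() and seq_along() are substitutions made here: the seed makes the draw repeatable, and seq_along(x) behaves like 1:length(x) but is also safe when x is empty.

```r
set.seed(1)                       # fixed seed, added only to make the sketch repeatable
x <- floor(10 * runif(10))

total_x <- 0
for (i in seq_along(x)) {         # i takes the values 1, 2, ..., length(x)
  total_x <- total_x + x[i]       # accumulate the running total
}

print(total_x)
print(sum(x))                     # vectorized equivalent of the whole loop
```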
Working directory
• getwd()
• setwd("****")
• list.files()
• load("****")
• save.image("****")
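A minimal round trip through an .RData file, as a sketch. It uses save() on two named objects rather than save.image(), which writes the entire workspace; the objects expr and note and the file name toy_workspace.RData are hypothetical, and the file goes to a temporary directory so nothing in the working directory is touched.

```r
# Hypothetical objects and file name, used only for illustration
expr <- matrix(1:6, nrow = 2)
note <- "toy workspace"
rdata_file <- file.path(tempdir(), "toy_workspace.RData")

save(expr, note, file = rdata_file)   # write the two objects to disk
rm(expr, note)                        # remove them from the workspace

restored <- load(rdata_file)          # load() returns the names of the objects it restored
print(restored)
print(expr)
print(basename(rdata_file) %in% list.files(tempdir()))
```

Capturing the return value of load() is a convenient way to find out what an unfamiliar .RData file contains.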
Worked example
• Extract all stage I and stage II samples from the colon cancer clinical information
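Steps
• Load the data
• Find all the stage values present in the data
• Use a for loop to pull out all stage I and stage II samples, and combine them into one data set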
R code
setwd("D:\\DragonStar\\dragon_star_data\\TCGA_colon_cancer_data")
rm(list=ls())
list.files()
load("COAD_clinical_data.RData")
data_clinical<-COAD_clinical_data
# Work with the stage labels as plain character strings
data_clinical$pathologic_stage<-as.character(data_clinical$pathologic_stage)
head(data_clinical)
table(data_clinical$pathologic_stage)
all_stages<-unique(sort(data_clinical$pathologic_stage))
# Pick the stage I and II labels by position (depends on which labels occur in the data) ...
all_I_II_stages<-all_stages[c(2,3,4,5,6,7)]
# ... or list them by name, which overrides the line above
all_I_II_stages<-c("Stage I","Stage IA","Stage II","Stage IIA","Stage IIB","Stage IIC")
# Collect the matching rows, one stage label at a time
data_stage_I_II<-c()
for(i in 1:length(all_I_II_stages)){
  id<-which(data_clinical$pathologic_stage==all_I_II_stages[i])
  data_stage_I_II<-rbind(data_stage_I_II,data_clinical[id,])
}
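The same selection can also be written without a loop by using %in%. This is an alternative sketch, not the code from the course: because COAD_clinical_data.RData is not available here, it runs on a small made-up table whose patient_id values and stage assignments are hypothetical. Only the column name pathologic_stage and the stage labels come from the code above.

```r
# Hypothetical stand-in for the clinical table (not real TCGA data)
data_clinical <- data.frame(
  patient_id = paste0("P", 1:6),
  pathologic_stage = c("Stage I", "Stage IIA", "Stage III",
                       "Stage IV", "Stage II", "Stage IIIB"),
  stringsAsFactors = FALSE
)

all_I_II_stages <- c("Stage I", "Stage IA", "Stage II",
                     "Stage IIA", "Stage IIB", "Stage IIC")

# TRUE for every row whose stage is one of the wanted labels
keep <- data_clinical$pathologic_stage %in% all_I_II_stages
data_stage_I_II <- data_clinical[keep, ]

print(data_stage_I_II)
print(table(data_stage_I_II$pathologic_stage))
```

One difference from the loop version: the loop groups the rows by stage label in the order of all_I_II_stages, while the %in% version keeps the rows in their original order. The set of selected samples is the same.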