R lecture oga
Handling quantitative data using statistical software R
Osamu Ogasawara
2015.01.19
Contents
1. What is R?
2. An Introductory Example
3. Types and Data Structures (in C and R)
4. Functional Programming (apply() function)
5. R Graphics
6. Bioinformatics (RNA-seq)
What is the R language?
Computer Language Popularity
The TIOBE index is a weighted mean of the following form:
(hits(PL,SE1)/hits(SE1) + ... + hits(PL,SEn)/hits(SEn)) / n
where PL is a search query of the pattern +"<language> programming" and SE1, ..., SEn are the n search engines.
Computer Language Popularity
C language and its derivatives
(General purpose) Script languages
Domain specific language
Computer Language Popularity
Domain Specific Languages
Script languages
The others
Classification of Computer Languages
by abstraction levels
Assembly Languages
High Level Languages: C, C++, Java, …
Very High Level Languages (VHLL)
Scripting languages: Perl, Python, Ruby, …
Domain Specific Languages: R (statistics), Matlab, …
A higher-level language is closer to natural language.
Introductory Examples
Simple Example (1) histogram
> x<-rnorm(100000000)
> head(x)
[1] 0.4667083 0.8907642 0.8147121 0.4839252 0.5811472 0.4941122
> hist(x)
> system.time(x<-rnorm(100000000))
   user  system elapsed
  8.771   0.249   9.020
Simple Example (2) t-test
> group1 <- c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2.0)
> group2 <- c(1.9, 0.8, 1.1, 0.1,-0.1,4.4,5.5,1.6,4.6,3.4)
> group1
 [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0
> group2
 [1]  1.9  0.8  1.1  0.1 -0.1  4.4  5.5  1.6  4.6  3.4
> boxplot(group1, group2)
> t.test(group1, group2, var.equal=T)

        Two Sample t-test

data:  group1 and group2
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.363874  0.203874
sample estimates:
mean of x mean of y
     0.75      2.33
http://cse.naro.affrc.go.jp/takezawa/r-tips/r/65.html
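The statistic reported by t.test() with var.equal=T can be reproduced by hand, which makes the pooled-variance formula explicit. The following is a minimal sketch; the names n1, n2, sp2, se, t.stat and p.value are chosen here for illustration and do not appear in the slides.

```r
group1 <- c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2.0)
group2 <- c(1.9, 0.8, 1.1, 0.1,-0.1,4.4,5.5,1.6,4.6,3.4)

n1 <- length(group1)
n2 <- length(group2)
df <- n1 + n2 - 2                 # degrees of freedom of the pooled test

# Pooled variance: weighted mean of the two sample variances
sp2 <- ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / df

# Standard error of the difference between the two means
se <- sqrt(sp2 * (1 / n1 + 1 / n2))

t.stat  <- (mean(group1) - mean(group2)) / se
p.value <- 2 * pt(-abs(t.stat), df)   # two-sided p-value

cat("t =", round(t.stat, 4), " df =", df, " p-value =", round(p.value, 5), "\n")
```

The printed values should agree with the t.test() output above.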
Getting Help in R
Display the contents of the R manual (if you know the name of the function):
?rnorm
help("rnorm")
Search functions by keywords:
??"normal distribution"
help.search("normal distribution")
Search functions by (partial) matching of function names:
find("rnorm")
apropos("rnorm")
The R Graphical manual
R manual
Probability Distributions
dnorm() : Density function
pnorm() : (cumulative) probability distribution function
qnorm() : Quantile
rnorm() : Random number generation
“Quick-R” site: http://www.statmethods.net/advgraphs/probability.html
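The four functions are related to each other: pnorm() is the integral of dnorm(), qnorm() is the inverse of pnorm(), and samples from rnorm() follow the distribution described by pnorm(). The following minimal sketch checks these relations numerically; the evaluation point x = 1.5, the seed and the sample size are arbitrary choices made here.

```r
x <- 1.5

# pnorm() is the integral of dnorm() from -Inf to x
area <- integrate(dnorm, lower = -Inf, upper = x)$value
cat("integral of dnorm:", area, " pnorm:", pnorm(x), "\n")

# qnorm() is the inverse of pnorm()
print(all.equal(qnorm(pnorm(x)), x))

# rnorm() draws samples; the fraction of samples below x approximates pnorm(x)
set.seed(1)                       # arbitrary seed, only for reproducibility
samples <- rnorm(100000)
frac.below <- mean(samples < x)
cat("empirical:", frac.below, " theoretical:", pnorm(x), "\n")
```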
Plotting the density function (1/2)
> x<-seq(-4,4,length=100)
> x
  [1] -4.00000000 -3.91919192 -3.83838384 -3.75757576 -3.67676768 -3.59595960
  [7] -3.51515152 -3.43434343 -3.35353535 -3.27272727 -3.19191919 -3.11111111
 [13] -3.03030303 -2.94949495 -2.86868687 -2.78787879 -2.70707071 -2.62626263
… omitted
> dx<-dnorm(x)
Plotting the density function (2/2)
> plot(x,dx,type="l",xlab="x",ylab="y",main="The normal distribution")
Plotting the probability distribution function
> x<-seq(-4,4,length=100)
> px<-pnorm(x)
> plot(x,px,type="l",xlab="x",ylab="y",main="The normal distribution")
Quantile (1/5)
plot(x,dnorm(x), type="n", ylim=c(0,1))
http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.
Quantile (2/5)
plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
Quantile (3/5)
plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)
Quantile (4/5)
plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)
abline(h=0.05)
abline(h=0.95)
Quantile (5/5)
x<-seq(-4,4,length=100)
plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)
abline(h=0.05)
abline(h=0.95)
lower.alpha5<-qnorm(0.05)
upper.alpha5<-qnorm(0.95)
abline(v=lower.alpha5)
abline(v=upper.alpha5)
points(lower.alpha5, 0.05, cex=3.0, pch="*")
points(upper.alpha5, 0.95, cex=3.0, pch="*")
Calculation of the p-value for a numeric vector x.
http://d.hatena.ne.jp/hoxo_m/20130213/p1
norm.dist.p <- function(x) {
  n <- length(x)
  mean <- mean(x)
  sd <- sd(x) / sqrt(n)
  p <- pnorm(-abs(mean), mean=0, sd=sd) * 2
  p
}
x <- rnorm(10, mean=0)
p <- norm.dist.p(x)
cat("p =", p, "\n")
Bias in small samples
alpha = 0.05
ps <- sapply(1:10000, function(i) {
  x <- rnorm(10)
  p <- norm.dist.p(x)
  p
})
fp <- sum(ps < alpha) / length(ps)
cat("alpha error rate =", fp, "\n")
alpha error rate = 0.0812
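The error rate exceeds alpha because norm.dist.p() estimates the standard deviation from only 10 values but still refers the statistic to the normal distribution. One common remedy is to use the t distribution with n - 1 degrees of freedom instead. The following is a minimal sketch of that variant; the function name t.dist.p and the seed are chosen here for illustration and are not part of the original slides.

```r
# Variant of norm.dist.p() that refers the statistic to a t distribution
# with n - 1 degrees of freedom instead of the normal distribution.
t.dist.p <- function(x) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)           # estimated standard error of the mean
  t  <- mean(x) / se
  2 * pt(-abs(t), df = n - 1)     # two-sided p-value
}

set.seed(1)                        # arbitrary seed, only for reproducibility
alpha <- 0.05
ps <- sapply(1:10000, function(i) t.dist.p(rnorm(10)))
fp <- mean(ps < alpha)
cat("alpha error rate =", fp, "\n")
```

With this variant the simulated error rate should be close to alpha, and t.dist.p(x) should agree with t.test(x)$p.value.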
Types and Data Structures
Types in C (partial)
Integer Types
Floating-Point Types
Memory Layout of C Programs
1. Text segment (Code segment)
2. Initialized data segment (initialized global variables and static variables)
3. Uninitialized data segment
4. Stack (automatic variables)
5. Heap (for dynamic memory allocation by malloc(), free(), …)
http://www.geeksforgeeks.org/memory-layout-of-c-program/
Stack frame and function call
int main() {
    int x = 0;
    a();
    return 0;
}

int a() {
    int x = 1;
    b();
    c();
    return 0;
}
http://www.tenouk.com/ModuleZ.html
Recursion in C
#include <stdio.h>

/* Returns f! for f >= 1 by calling itself with f - 1. */
int Fact(int f) {
    if (f == 1) return 1;
    return (f * Fact(f - 1));  /* Fact is called only once inside the function */
}

int main() {
    int fact;
    fact = Fact(5);
    printf("Factorial is %d", fact);
    return 0;
}
http://www.programmingspark.com/2013/03/Working-of-Recursion-in-detail-using-Stack.html
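For comparison, the same recursion can be written in R. This is a minimal sketch; the function name fact is chosen here for illustration and does not appear in the slides.

```r
# Recursive factorial, mirroring the C function Fact()
fact <- function(f) {
  if (f <= 1) return(1)           # base case stops the recursion
  f * fact(f - 1)                 # each call waits for the result of the next one
}

cat("Factorial is", fact(5), "\n")
```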
C pointers
int b = 17;
int* a = &b;   /* a holds the address of b */
int x = *a;    /* x = 17 */
Arrays and Linked Lists
Adding an element to the containers
Linked List
C Array (R vector)
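Because an R vector is stored like a C array, adding an element at the end may require allocating a larger block and copying the existing elements. The following minimal sketch contrasts growing a vector one element at a time with allocating it once; the function names grow and prealloc and the size n are chosen here for illustration, and the timings depend on the machine and the R version.

```r
n <- 10000

# (a) Grow the vector one element at a time; R may have to allocate a
#     larger block and copy the existing elements on each step
grow <- function(n) {
  v <- numeric(0)
  for (i in 1:n) v <- c(v, i)
  v
}

# (b) Allocate the full length once, then fill the elements in place
prealloc <- function(n) {
  v <- numeric(n)
  for (i in 1:n) v[i] <- i
  v
}

print(identical(grow(n), prealloc(n)))   # both build the same vector
print(system.time(grow(n)))
print(system.time(prealloc(n)))
```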
Types in R
Logical : TRUE, T, FALSE, F
Numerical (double): 1, 1.0, 1.4e+3
Complex: 3.5+4i
Character : "abc"
> typeof(TRUE)
[1] "logical"
> typeof(1)
[1] "double"
> typeof(1.0)
[1] "double"
> typeof(3.5+4i)
[1] "complex"
> typeof("abc")
[1] "character"
> is.vector(TRUE)
[1] TRUE
> is.vector(1)
[1] TRUE
> is.vector(3.5+4i)
[1] TRUE
> is.vector("abc")
[1] TRUE
Creation of R vectors
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> 1:5
[1] 1 2 3 4 5
> 5.1:-1.2
[1]  5.1  4.1  3.1  2.1  1.1  0.1 -0.9
> seq(1,3,0.5)
[1] 1.0 1.5 2.0 2.5 3.0
> rep(
> numeric(10)
 [1] 0 0 0 0 0 0 0 0 0 0
> logical(10)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> character(10)
 [1] "" "" "" "" "" "" "" "" "" ""
> complex(10)
 [1] 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i
Operation on vectors
> 1:10*2
 [1]  2  4  6  8 10 12 14 16 18 20
> 2*(3^(0:4))
[1]   2   6  18  54 162
> v1<-1:10
> v2<-10:1
> v1+v2
 [1] 11 11 11 11 11 11 11 11 11 11
> v1<-c(1,2,3)
> v1
[1] 1 2 3
> v1[1]
[1] 1
> v1[4]
[1] NA
> v1[5]<-10
> v1
[1]  1  2  3 NA 10
> v1[6]<-"a"
> v1
[1] "1"  "2"  "3"  NA   "10" "a"
> v2<-runif(10, 1,10)
> v2
 [1] 4.851027 7.618278 5.371393 3.940181 1.002870 9.511409 2.364836 5.246343
 [9] 3.361870 9.435904
> v2<5
 [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
> v2[v2<5]
[1] 4.851027 3.940181 1.002870 2.364836 3.361870
> v2[1:3]
[1] 4.851027 7.618278 5.371393
> v2[1:3*2]
[1] 7.618278 3.940181 9.511409
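The v1 example shows that a vector holds a single type: assigning "a" converts every element to character. The following minimal sketch lists the coercion order for the types introduced earlier and repeats the v1 steps; the variable name v is chosen here for illustration.

```r
# Coercion order: logical -> integer -> double -> complex -> character
print(typeof(c(TRUE, 1L)))        # logical and integer
print(typeof(c(1L, 2.5)))         # integer and double
print(typeof(c(2.5, 3+4i)))       # double and complex
print(typeof(c(2.5, "a")))        # double and character

v <- c(1, 2, 3)
v[5] <- 10                        # position 4 is filled with NA
v[6] <- "a"                       # the whole vector becomes character
print(v)
print(typeof(v))

# Converting back: "a" has no numeric value, so it becomes NA (with a warning)
print(as.numeric(v))
```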
R Lists
Creation of R Lists
> w1<-list("a", 10, TRUE)
> w1
[[1]]
[1] "a"

[[2]]
[1] 10

[[3]]
[1] TRUE

> w2 <- as.list(c(1,2,3))
> w2
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3
Data structure of R objects
Type information pointers data (vector)
R List
> w1<-list(1:3,"ab",TRUE)
> w1
[[1]]
[1] 1 2 3

[[2]]
[1] "ab"

[[3]]
[1] TRUE
w1[1] returns a sublist.
w1[[1]] returns the content of the list element.
> typeof(w1)
[1] "list"
> typeof(w1[1])
[1] "list"
> typeof(w1[[1]])
[1] "integer"
> w1[1]
[[1]]
[1] 1 2 3

> w1[[1]]
[1] 1 2 3
> w1[[1]][1]
[1] 1
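The difference between single and double brackets matters as soon as the result is used in a computation. The following is a minimal sketch; the names w, sub, elem and res are chosen here for illustration.

```r
w <- list(1:3, "ab", TRUE)

sub  <- w[1]       # a list of length 1 that contains the integer vector
elem <- w[[1]]     # the integer vector itself

print(typeof(sub))
print(typeof(elem))

# Arithmetic works on the element, not on the sublist
print(sum(elem))
res <- try(sum(sub), silent = TRUE)
print(inherits(res, "try-error"))   # TRUE if sum() rejected the list

# Selecting several elements at once is done with single brackets
print(length(w[c(1, 3)]))
```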
> w2<-w1[c(1,2)]
> remove(w1)
> w1
Error: object 'w1' not found
> w2
[[1]]
[1] 1 2

[[2]]
[1] 3 4
R List and “names”
> w3<-list(a=1:3, b="abc", NA)
> w3
$a
[1] 1 2 3

$b
[1] "abc"

[[3]]
[1] NA

> w3[[1]]
[1] 1 2 3
> w3$a
[1] 1 2 3
> w3[1]
$a
[1] 1 2 3
Attributes of an R object
> w3<-list(a=1:3,b="ab",TRUE)
> attributes(w3)
$names
[1] "a" "b" ""

> attr(w3,"names")<-NULL
> w3
[[1]]
[1] 1 2 3

[[2]]
[1] "ab"

[[3]]
[1] TRUE
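Attributes are not specific to lists. The following minimal sketch attaches the "names" and "dim" attributes to plain vectors; the variable names v and m are chosen here for illustration.

```r
v <- 1:6

# names is an ordinary attribute of the vector
names(v) <- c("a", "b", "c", "d", "e", "f")
print(attributes(v))

# The dim attribute turns a plain vector into a matrix
m <- 1:6
attr(m, "dim") <- c(2, 3)
print(m)
print(is.matrix(m))

# Removing the attribute gives back the plain vector
attr(m, "dim") <- NULL
print(identical(m, 1:6))
```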
data.frame : List of vectors
> phenotype<-read.table("bodymap_phenodata.txt", header=T, row.names=1, sep=" ", quote="")
> phenotype
          num.tech.reps tissue.type gender age             race
ERS025098             2     adipose      F  73        caucasian
ERS025092             2     adrenal      M  60        caucasian
ERS025085             2       brain      F  77        caucasian
ERS025088             2      breast      F  29        caucasian
ERS025089             2       colon      F  68        caucasian
ERS025082             2       heart      M  77        caucasian
ERS025081             2      kidney      F  60        caucasian
ERS025096             2       liver      M  37        caucasian
ERS025099             2        lung      M  65        caucasian
ERS025086             2   lymphnode      F  86        caucasian
ERS025084             6     mixture   <NA>  NA        caucasian
ERS025087             5     mixture   <NA>  NA        caucasian
ERS025093             5     mixture   <NA>  NA        caucasian
ERS025083             2       ovary      F  47 african_american
ERS025095             2    prostate      M  73        caucasian
… omitted
RNA-seq
http://www.bgisequence.com/jp/services/sequencing-services/rna-sequencing/rna-seq/
http://bowtie-bio.sourceforge.net/recount/
bodymap_count_table.txt
Tab-delimited format.
The first line is a list of sample identifiers (19 human organs).
The first column is a list of gene identifiers (Ensembl genes).
bodymap_phenodata.txt
Read a data table to a data frame
> phenotype<-read.table("bodymap_phenodata.txt", header=T, row.names=1, sep=" ", quote="")
> phenotype
          num.tech.reps tissue.type gender age             race
ERS025098             2     adipose      F  73        caucasian
ERS025092             2     adrenal      M  60        caucasian
ERS025085             2       brain      F  77        caucasian
ERS025088             2      breast      F  29        caucasian
ERS025089             2       colon      F  68        caucasian
ERS025082             2       heart      M  77        caucasian
ERS025081             2      kidney      F  60        caucasian
ERS025096             2       liver      M  37        caucasian
ERS025099             2        lung      M  65        caucasian
ERS025086             2   lymphnode      F  86        caucasian
ERS025084             6     mixture   <NA>  NA        caucasian
ERS025087             5     mixture   <NA>  NA        caucasian
ERS025093             5     mixture   <NA>  NA        caucasian
ERS025083             2       ovary      F  47 african_american
ERS025095             2    prostate      M  73        caucasian
… omitted
Inspect the type and attribute of the data frame
> typeof(phenotype)
[1] "list"
> attributes(phenotype)
$names
[1] "num.tech.reps" "tissue.type"   "gender"        "age"
[5] "race"

$class
[1] "data.frame"

$row.names
 [1] "ERS025098" "ERS025092" "ERS025085" "ERS025088" "ERS025089" "ERS025082"
 [7] "ERS025081" "ERS025096" "ERS025099" "ERS025086" "ERS025084" "ERS025087"
[13] "ERS025093" "ERS025083" "ERS025095" "ERS025097" "ERS025094" "ERS025090"
[19] "ERS025091"
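Since a data frame is a list of equal-length vectors with the attributes "names", "class" and "row.names", one can be assembled by hand from exactly these parts. The following minimal sketch uses three rows copied from the table above; the variable names cols and df are chosen here for illustration, and in practice data.frame() or read.table() would be used instead.

```r
# Three columns and three rows copied from the phenotype table
cols <- list(tissue.type = c("adipose", "adrenal", "brain"),
             gender      = c("F", "M", "F"),
             age         = c(73, 60, 77))

# Attach the class and row.names attributes to the list
df <- structure(cols,
                class     = "data.frame",
                row.names = c("ERS025098", "ERS025092", "ERS025085"))

print(is.data.frame(df))
print(typeof(df))
print(df)

# Because it is a list, list indexing works on the columns
print(df[["age"]])
print(df$age[df$gender == "F"])
```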
Read the count table
> data <- read.table("bodymap_count_table.txt", header=T, row.names=1, sep="\t", quote="")
> head(data)
                ERS025098 ERS025092 ERS025085 ERS025088 ERS025089 ERS025082
ENSG00000000003      1354       216       215       924       725       125
ENSG00000000005       712       134         4      1495       119        20
ENSG00000000419       450       547       516       529       808       680
ENSG00000000457       188       368       196       386       156       259
ENSG00000000460        66        29         1        26        11         9
ENSG00000000938       104        79         7        29         0         3
… omitted
Replace the column names: from the IDs to the tissue type descriptions
> colnames(data)
 [1] "ERS025098" "ERS025092" "ERS025085" "ERS025088" "ERS025089" "ERS025082"
 [7] "ERS025081" "ERS025096" "ERS025099" "ERS025086" "ERS025084" "ERS025087"
[13] "ERS025093" "ERS025083" "ERS025095" "ERS025097" "ERS025094" "ERS025090"
[19] "ERS025091"
> colnames(data)<-phenotype$tissue.type
> colnames(data)
 [1] "adipose"          "adrenal"          "brain"            "breast"
 [5] "colon"            "heart"            "kidney"           "liver"
 [9] "lung"             "lymphnode"        "mixture"          "mixture"
[13] "mixture"          "ovary"            "prostate"         "skeletal_muscle"
[17] "testes"           "thyroid"          "white_blood_cell"
> head(data)
                adipose adrenal brain breast colon heart kidney liver lung
ENSG00000000003    1354     216   215    924   725   125    796  1954  815
ENSG00000000005     712     134     4   1495   119    20      7     0    0
ENSG00000000419     450     547   516    529   808   680    744   369  636
ENSG00000000457     188     368   196    386   156   259    436   288  187
ENSG00000000460      66      29     1     26    11     9     25    42   12
ENSG00000000938     104      79     7     29     0     3      1    20  243
Looking into the data frame
> head(data$adipose, 100)
  [1] 1354 712 450 188 66 104 0 1323 0 858 0 0
 [13] 13 6346 0 0 0 0 0 3 0 485 0 0
 [25] 36 0 0 0 0 1002 1360 0 4179 12 424 0
 [37] 97 0 0 0 0 0 0 0 2577 0 0 0
 [49] 0 0 5 2241 0 0 115 3678 0 14104 18 1662
 [61] 0 0 0 0 6 0 0 7839 0 2 1313 1997
 [73] 40 5390 0 0 0 208 180 1277 1460 0 0 1002
 [85] 30 177 84 441 0 2986 1598 0 13925 94 5565 0
 [97] 0 0 0 0
> length(data$adipose)
[1] 52580
> length(data$adipose[data$adipose>0])
[1] 9992
Distribution of the data
> hist(data$adipose)
> hist(log10(data$adipose))
> summary(log10(data$adipose))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -Inf    -Inf    -Inf    -Inf    -Inf       6
> summary(log10(data$adipose[data$adipose>0]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.462   2.382   2.287   3.109   6.200
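The -Inf values come from log10(0). Besides dropping the zeros, as done above, another frequently used option is to add a pseudocount of 1 before taking the logarithm, which keeps the zero counts in the data and maps them to 0. The following minimal sketch compares the three variants on a small made-up vector; the values in counts are hypothetical and are not taken from the bodymap table.

```r
# Hypothetical counts, including zeros (not taken from the bodymap table)
counts <- c(0, 0, 5, 120, 0, 3400, 18)

print(log10(counts))                        # zeros become -Inf
print(summary(log10(counts[counts > 0])))   # zeros dropped, as on the slide
print(summary(log10(counts + 1)))           # zeros kept and mapped to 0
```

Which variant is appropriate depends on whether the zero counts should take part in the summary.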
attach() and detach(): adding the column names of a data frame to, and removing them from, the “environment”
> attach(data)
> head(adipose, 100)
  [1] 1354 712 450 188 66 104 0 1323 0 858 0 0
 [13] 13 6346 0 0 0 0 0 3 0 485 0 0
 [25] 36 0 0 0 0 1002 1360 0 4179 12 424 0
 [37] 97 0 0 0 0 0 0 0 2577 0 0 0
 [49] 0 0 5 2241 0 0 115 3678 0 14104 18 1662
 [61] 0 0 0 0 6 0 0 7839 0 2 1313 1997
 [73] 40 5390 0 0 0 208 180 1277 1460 0 0 1002
 [85] 30 177 84 441 0 2986 1598 0 13925 94 5565 0
 [97] 0 0 0 0
> length(adipose)
[1] 52580
> detach(data)
> length(adipose)
Error: object 'adipose' not found
> length(data$adipose)
[1] 52580
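When the column names are needed only for a single expression, with() makes them visible temporarily, so there is nothing to detach afterwards. The following is a minimal sketch; the data frame d is a hypothetical stand-in for the count table, and n.expressed is a name chosen here for illustration.

```r
# Hypothetical small data frame standing in for the count table
d <- data.frame(adipose = c(1354, 712, 0, 188),
                brain   = c(215, 4, 0, 196))

# with() evaluates the expression with the columns visible by name
n.expressed <- with(d, sum(adipose > 0))
print(n.expressed)

# Outside with(), the column name is not bound to anything
print(exists("adipose"))
```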
Environment (1/2)
Environment basics : http://adv-r.had.co.nz/Environments.html
The job of an environment is to associate, or bind, a set of names to a set of values.
You can think of an environment as a bag of names:
• If an object has no names pointing to it, it gets automatically deleted by the garbage collector.
• Every object in an environment has a unique name.
• The objects in an environment are not ordered (i.e., it doesn’t make sense to ask what the first object in an environment is).
Environment (2/2)
Most environments are created as a consequence of using functions.
An environment has a parent environment.
http://adv-r.had.co.nz/Environments.html
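The points above can be tried out directly in the console. The following minimal sketch creates an environment, binds two names in it, and inspects the environment created by a function call; the names e, f and call.env are chosen here for illustration.

```r
e <- new.env()

# Bind names to values inside the environment
assign("x", 10, envir = e)
e$y <- "abc"

print(ls(e))                      # the names bound in e
print(get("x", envir = e))

# A function call creates a new environment whose parent is the
# environment in which the function was defined
f <- function() {
  z <- 1
  environment()                   # return the environment of this call
}
call.env <- f()
print(ls(call.env))
print(identical(parent.env(call.env), environment(f)))
```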
the apply() function
> apply(data, 2, sum)
         adipose          adrenal            brain           breast
        23957600         18987359         20995462         23426900
           colon            heart           kidney            liver
        23397325         26762377         22630393         29314904
            lung        lymphnode          mixture          mixture
        23426381         19489508         31135063         57697453
         mixture            ovary         prostate  skeletal_muscle
        52460922         22857384         25215879         28400943
          testes          thyroid white_blood_cell
        27261469         24465463         27871222
> s <- apply(data, 2, sum)
> png(filename="bar001.png")
> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)
> dev.off()
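The second argument of apply() selects the margin: 1 applies the function to each row, 2 to each column. The following minimal sketch shows both on a small matrix and compares the result with colSums() and with an explicit loop; the matrix m and its values are hypothetical and are not the bodymap data.

```r
# Hypothetical 3 x 2 count matrix (not the bodymap data); filled column by column
m <- matrix(c(10, 0, 5,
              2, 8, 0),
            nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("adipose", "brain")))

col.sums <- apply(m, 2, sum)      # MARGIN = 2: apply sum() to each column
row.sums <- apply(m, 1, sum)      # MARGIN = 1: apply sum() to each row
print(col.sums)
print(row.sums)

# For sums, colSums() gives the same result
print(all.equal(col.sums, colSums(m)))

# The same column sums written as an explicit loop
s <- numeric(ncol(m))
names(s) <- colnames(m)
for (j in seq_len(ncol(m))) s[j] <- sum(m[, j])
print(identical(s, col.sums))
```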
Customizing (Traditional) Graphics
> s=apply(data, 2, sum)
> s
         adipose          adrenal            brain           breast
        23957600         18987359         20995462         23426900
           colon            heart           kidney            liver
        23397325         26762377         22630393         29314904
            lung        lymphnode          mixture          mixture
        23426381         19489508         31135063         57697453
         mixture            ovary         prostate  skeletal_muscle
        52460922         22857384         25215879         28400943
          testes          thyroid white_blood_cell
        27261469         24465463         27871222
> barplot(s)
Customizing (Traditional) Graphics
barplot(s, horiz=TRUE)
Customizing (Traditional) Graphics
> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)
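par() changes the settings of the current graphics device until they are changed again, so it is common to save the old values and restore them after the plot. The following minimal sketch does this with the margin setting used above; the vector s holds only the first three column sums from the slide, out.file is a hypothetical temporary file, and pdf() is used here only because it is available in every R installation.

```r
# First three column sums from the slide
s <- c(adipose = 23957600, adrenal = 18987359, brain = 20995462)

out.file <- tempfile(fileext = ".pdf")   # hypothetical output location
pdf(out.file)

op <- par(mai = c(1, 2, 1, 1))    # margins in inches: bottom, left, top, right
barplot(s, horiz = TRUE, las = 1) # las = 1 draws the axis labels horizontally
par(op)                           # restore the previous margins

barplot(s)                        # drawn with the default margins again
dev.off()
```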
Customizing Traditional Graphics with par() function
Paul Murrell, R Graphics, 2nd ed. (2011)
How many plot types are there?
Winston Chang, R Graphics Cookbook, O’Reilly (2013)
ggplot2 and traditional graphics
Functional programming with the apply() function
> apply(log10(data), 2, mean)
         adipose          adrenal            brain           breast
            -Inf             -Inf             -Inf             -Inf
           colon            heart           kidney            liver
            -Inf             -Inf             -Inf             -Inf
            lung        lymphnode          mixture          mixture
            -Inf             -Inf             -Inf             -Inf
         mixture            ovary         prostate  skeletal_muscle
            -Inf             -Inf             -Inf             -Inf
          testes          thyroid white_blood_cell
            -Inf             -Inf             -Inf
> mean2<-function(x) { mean(x[x>0]) }
> apply(log10(data), 2, mean2)
         adipose          adrenal            brain           breast
        2.335220         2.344531         2.278299         2.346041
           colon            heart           kidney            liver
        2.380096         2.226729         2.415721         2.236490
            lung        lymphnode          mixture          mixture
        2.484701         2.502548         2.531860         2.776740
         mixture            ovary         prostate  skeletal_muscle
        2.670258         2.402131         2.503051         2.464915
          testes          thyroid white_blood_cell
        2.486507         2.439520         2.597849
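Note that mean2() is applied after the logarithm, so the condition x > 0 removes the -Inf values and also the values that are exactly 0, that is, genes with a count of 1. The following minimal sketch compares it with a variant that removes only the non-finite values; the function name mean.finite and the vector counts are hypothetical and chosen here for illustration.

```r
# Hypothetical counts, including zeros and a count of exactly 1
counts <- c(0, 1, 10, 100, 0)
lg <- log10(counts)               # -Inf 0 1 2 -Inf

mean2 <- function(x) { mean(x[x > 0]) }               # definition from the slide
mean.finite <- function(x) { mean(x[is.finite(x)]) }  # drops only -Inf, Inf, NA, NaN

print(mean2(lg))                  # mean of 1 and 2
print(mean.finite(lg))            # mean of 0, 1 and 2
```

Whether counts of 1 should contribute to the mean depends on the analysis; either function can be passed to apply() in the same way.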
Quick-R: http://www.statmethods.net/management/userfunctions.html
Quick-R: http://www.statmethods.net/management/controlstructures.html