R lecture oga

Handling quantitative data using statistical software R Osamu Ogasawara 2015.01.19


Page 1: R lecture oga

Handling quantitative data using statistical software R

Osamu Ogasawara, 2015.01.19

Page 2: R lecture oga

Contents

1. What is R?

2. An Introductory Example

3. Types and Data Structures (in C and R)

4. Functional Programming (apply() function)

5. R Graphics

6. Bioinformatics (RNA-seq)

Page 3: R lecture oga

What is the R language?

Page 4: R lecture oga

Computer Language Popularity

The TIOBE index is the mean, over n search engines, of the normalized hit counts:

(hits(PL,SE1)/hits(SE1) + ... + hits(PL,SEn)/hits(SEn)) / n

where PL is a search query of the form +"<language> programming" and SE1, ..., SEn are the search engines.
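As a rough illustration of the formula above, the sketch below averages the per-search-engine hit ratios in R. The hit counts and the helper name tiobe_like are made up for this example; they are not real TIOBE data or an official implementation.

```r
# Hypothetical hit counts for one language on three search engines.
hits_pl <- c(SE1 = 120, SE2 = 300, SE3 = 45)     # hits(PL, SEi)
hits_se <- c(SE1 = 1000, SE2 = 2000, SE3 = 500)  # hits(SEi)

# Mean of the per-engine ratios hits(PL, SEi) / hits(SEi).
tiobe_like <- function(hits_pl, hits_se) {
  stopifnot(length(hits_pl) == length(hits_se), all(hits_se > 0))
  mean(hits_pl / hits_se)
}

rating <- tiobe_like(hits_pl, hits_se)
print(rating)
```

Dividing by hits(SEi) first keeps a search engine with a very large index from dominating the average.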

Page 5: R lecture oga

Computer Language Popularity

C language and its derivatives

(General purpose) Script languages

Domain specific language

Page 6: R lecture oga

Computer Language Popularity

Domain Specific Languages

Script languages

The others

Page 7: R lecture oga

Classification of Computer Languages

by abstraction levels

Assembly Languages

High Level Languages: C, C++, Java, …

Very High Level Languages (VHLL)

• Scripting languages: Perl, Python, Ruby, …

• Domain Specific Languages: R (statistics), Matlab, …

A higher-level language is closer to natural language.

Page 8: R lecture oga

Introductory Examples

Page 9: R lecture oga

Simple Example (1) histogram

> x<-rnorm(100000000)
> head(x)
[1] 0.4667083 0.8907642 0.8147121 0.4839252 0.5811472 0.4941122
> hist(x)

> system.time(x<-rnorm(100000000))
   user  system elapsed
  8.771   0.249   9.020

Page 10: R lecture oga

Simple Example (2) t-test

> group1 <- c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2.0)
> group2 <- c(1.9, 0.8, 1.1, 0.1,-0.1,4.4,5.5,1.6,4.6,3.4)
> group1
 [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0
> group2
 [1]  1.9  0.8  1.1  0.1 -0.1  4.4  5.5  1.6  4.6  3.4
> boxplot(group1, group2)
> t.test(group1, group2, var.equal=T)

        Two Sample t-test

data:  group1 and group2
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.363874  0.203874
sample estimates:
mean of x mean of y
     0.75      2.33

http://cse.naro.affrc.go.jp/takezawa/r-tips/r/65.html

Page 11: R lecture oga

Getting Help in R

Display the R manual page for a function (if you know the name of the function):

?rnorm
help("rnorm")

Search functions by keywords:

??"normal distribution"
help.search("normal distribution")

Find where an object is defined, or search function names by (partial) matching:

find("rnorm")
apropos("rnorm")

Page 12: R lecture oga

The R Graphical manual

Page 13: R lecture oga

R manual

Page 14: R lecture oga

Probability Distributions

dnorm() : Density function

pnorm() : Cumulative distribution function

qnorm() : Quantile function (the inverse of pnorm())

rnorm() : Random number generation

"Quick-R" site: http://www.statmethods.net/advgraphs/probability.html
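The four functions are linked to one another, and a short sketch may make that concrete. The values 1.96 and 0.975 and the sample size are arbitrary choices for illustration.

```r
# pnorm() and qnorm() are inverses of each other.
p <- pnorm(1.96)                 # P(X <= 1.96) for a standard normal X
q <- qnorm(0.975)                # the point below which 97.5% of the mass lies
roundtrip <- qnorm(pnorm(1.5))   # recovers 1.5 up to rounding error

# pnorm() is the integral of the density dnorm().
area <- integrate(dnorm, -Inf, 1.96)$value

# rnorm() draws samples; their empirical fraction below 1.96 approximates p.
set.seed(1)
x <- rnorm(10000)
frac <- mean(x <= 1.96)

print(c(p = p, q = q, area = area, frac = frac))
```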

Page 15: R lecture oga

Plotting the density function (1/2)

> x<-seq(-4,4,length=100)
> x
 [1] -4.00000000 -3.91919192 -3.83838384 -3.75757576 -3.67676768 -3.59595960
 [7] -3.51515152 -3.43434343 -3.35353535 -3.27272727 -3.19191919 -3.11111111
[13] -3.03030303 -2.94949495 -2.86868687 -2.78787879 -2.70707071 -2.62626263
… omitted
> dx<-dnorm(x)

Page 16: R lecture oga

Plotting the density function (2/2)

> plot(x,dx,type="l",xlab="x",ylab="y",main="The normal distribution")

Page 17: R lecture oga

Plotting the probability distribution function

> x<-seq(-4,4,length=100)
> px<-pnorm(x)
> plot(x,px,type="l",xlab="x",ylab="y",main="The normal distribution")

Page 18: R lecture oga

Quantile (1/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))

http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.

Page 19: R lecture oga

Quantile (2/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)

http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.

Page 20: R lecture oga

Quantile (3/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)

http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.

Page 21: R lecture oga

Quantile (4/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)
abline(h=0.05)
abline(h=0.95)

http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.

Page 22: R lecture oga

Quantile (5/5)

x<-seq(-4,4,length=100)
plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)
abline(h=0.05)
abline(h=0.95)

lower.alpha5<-qnorm(0.05)
upper.alpha5<-qnorm(0.95)
abline(v=lower.alpha5)
abline(v=upper.alpha5)
points(lower.alpha5, 0.05, cex=3.0, pch="*")
points(upper.alpha5, 0.95, cex=3.0, pch="*")

http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.

Page 23: R lecture oga

Calculation of the p-value for the mean of a numeric vector x (null hypothesis: mean = 0).

http://d.hatena.ne.jp/hoxo_m/20130213/p1

norm.dist.p <- function(x) {
  n <- length(x)
  mean <- mean(x)
  sd <- sd(x) / sqrt(n)   # standard error of the mean
  # two-sided p-value, using the normal distribution
  p <- pnorm(-abs(mean), mean=0, sd=sd) * 2
  p
}
x <- rnorm(10, mean=0)
p <- norm.dist.p(x)
cat("p =", p, "\n")

Page 24: R lecture oga

Bias in small samples

alpha = 0.05
ps <- sapply(1:10000, function(i) {
  x <- rnorm(10)
  p <- norm.dist.p(x)
  p
})
fp <- sum(ps < alpha) / length(ps)
cat("alpha error rate =", fp, "\n")

alpha error rate = 0.0812
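The error rate exceeds alpha because the standard deviation is estimated from only 10 observations, yet the statistic is compared against a normal distribution. A common remedy, sketched here, is to use a t distribution with n - 1 degrees of freedom. The name t.dist.p and the seed are choices made for this example and do not come from the slides.

```r
# Same statistic as norm.dist.p, but referred to a t distribution
# with n - 1 degrees of freedom (t.dist.p is a hypothetical helper name).
t.dist.p <- function(x) {
  n <- length(x)
  se <- sd(x) / sqrt(n)               # standard error of the mean
  2 * pt(-abs(mean(x) / se), df = n - 1)
}

set.seed(42)                          # arbitrary seed, for reproducibility
alpha <- 0.05
ps <- sapply(1:10000, function(i) t.dist.p(rnorm(10)))
fp <- mean(ps < alpha)
cat("alpha error rate =", fp, "\n")   # should now be close to alpha

# The helper agrees with the built-in one-sample t-test.
x <- rnorm(10)
same <- isTRUE(all.equal(t.dist.p(x), t.test(x)$p.value))
print(same)
```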

Page 25: R lecture oga

Types and Data Structures

Page 26: R lecture oga

Types in C (partial)

Integer Types

Floating-Point Types

Page 27: R lecture oga

Memory Layout of C Programs

1. Text segment (Code segment)

2. Initialized data segment (initialized global variables and static variables)

3. Uninitialized data segment

4. Stack (automatic variables)

5. Heap (for dynamic memory allocation by malloc(), free(), …)

http://www.geeksforgeeks.org/memory-layout-of-c-program/

Page 28: R lecture oga

Stack frame and function call

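int a(void);
void b(void) { }   /* stubs so that the example compiles */
void c(void) { }

int main() {
    int x = 0;     /* lives in the stack frame of main() */
    a();
    return 0;
}

int a() {
    int x = 1;     /* a different x, in the stack frame of a() */
    b();
    c();
    return 0;
}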

http://www.tenouk.com/ModuleZ.html

Page 29: R lecture oga

Recursion in C

#include <stdio.h>

int Fact(int f) {
    if (f <= 1) return 1;       /* base case: stops the recursion */
    return (f * Fact(f - 1));   /* called in function only once */
}

int main() {
    int fact;
    fact = Fact(5);
    printf("Factorial is %d\n", fact);
    return 0;
}

http://www.programmingspark.com/2013/03/Working-of-Recursion-in-detail-using-Stack.html

Page 30: R lecture oga

Recursion in C

http://www.programmingspark.com/2013/03/Working-of-Recursion-in-detail-using-Stack.html

Page 31: R lecture oga

C pointers

int b = 17;

int* a = &b;    /* a holds the address of b */

int x = *a;     /* x = 17 */

Page 32: R lecture oga

Arrays and Linked Lists

Page 33: R lecture oga

Adding an element to the containers

Linked List

C Array (R vector)

Page 34: R lecture oga

Types in R

Logical : TRUE, T, FALSE, F

Numerical (double): 1, 1.0, 1.4e+3

Complex: 3.5+4i

Character : "abc"

> typeof(TRUE)
[1] "logical"
> typeof(1)
[1] "double"
> typeof(1.0)
[1] "double"
> typeof(3.5+4i)
[1] "complex"
> typeof("abc")
[1] "character"

> is.vector(TRUE)
[1] TRUE
> is.vector(1)
[1] TRUE
> is.vector(3.5+4i)
[1] TRUE
> is.vector("abc")
[1] TRUE
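Two points follow from the output above and can be checked directly: a single value is a vector of length 1, and a vector holds one type only, so mixed values are coerced. The short sketch below also shows the integer type, which the slide does not list; the variable names are arbitrary.

```r
t_int <- typeof(1L)          # the L suffix makes an integer literal
t_seq <- typeof(1:3)         # the colon operator also yields integers
len   <- length(3.5+4i)      # a "scalar" is a vector of length 1

# Mixing types in c() coerces everything to the most general type.
mixed   <- c(1, "a", TRUE)
t_mixed <- typeof(mixed)

print(list(t_int, t_seq, len, mixed, t_mixed))
```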

Page 35: R lecture oga

Creation of R vectors

> c(1,2,3,4,5)
[1] 1 2 3 4 5

> 1:5
[1] 1 2 3 4 5

> 5.1:-1.2
[1]  5.1  4.1  3.1  2.1  1.1  0.1 -0.9

> seq(1,3,0.5)
[1] 1.0 1.5 2.0 2.5 3.0

> rep(

> numeric(10)
 [1] 0 0 0 0 0 0 0 0 0 0
> logical(10)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> character(10)
 [1] "" "" "" "" "" "" "" "" "" ""
> complex(10)
 [1] 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i

Page 36: R lecture oga

Operation on vectors

> 1:10*2
 [1]  2  4  6  8 10 12 14 16 18 20

> 2*(3^(0:4))
[1]   2   6  18  54 162

> v1<-1:10
> v2<-10:1
> v1+v2
 [1] 11 11 11 11 11 11 11 11 11 11
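Two details behind these results are easy to miss: the colon operator binds more tightly than arithmetic operators, and a shorter operand is recycled to the length of the longer one. A minimal sketch with arbitrary values:

```r
a <- 1:10 * 2          # parsed as (1:10) * 2, ten elements
b <- 1:(10 * 2)        # parentheses change the meaning: 1, 2, ..., 20

# Recycling: c(10, 20) is reused three times to match the six elements.
v <- 1:6 + c(10, 20)   # 11 22 13 24 15 26

print(a); print(b); print(v)
```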

Page 37: R lecture oga

> v1<-c(1,2,3)
> v1
[1] 1 2 3
> v1[1]
[1] 1
> v1[4]
[1] NA
> v1[5]<-10
> v1
[1]  1  2  3 NA 10
> v1[6]<-"a"
> v1
[1] "1"  "2"  "3"  NA   "10" "a"

> v2<-runif(10, 1,10)
> v2
 [1] 4.851027 7.618278 5.371393 3.940181 1.002870 9.511409 2.364836 5.246343
 [9] 3.361870 9.435904
> v2<5
 [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
> v2[v2<5]
[1] 4.851027 3.940181 1.002870 2.364836 3.361870
> v2[1:3]
[1] 4.851027 7.618278 5.371393
> v2[1:3*2]
[1] 7.618278 3.940181 9.511409

Page 38: R lecture oga

R Lists

Page 39: R lecture oga

Creation of R Lists

> w1<-list("a", 10, TRUE)
> w1
[[1]]
[1] "a"

[[2]]
[1] 10

[[3]]
[1] TRUE

> w2 <- as.list(c(1,2,3))
> w2
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

Page 40: R lecture oga

Data structure of R objects

Type information pointers data (vector)

Page 41: R lecture oga

R List

> w1<-list(1:3,"ab",TRUE)
> w1
[[1]]
[1] 1 2 3

[[2]]
[1] "ab"

[[3]]
[1] TRUE

[Figure: the list w1 and the three vectors it refers to: 1 2 3, "ab", TRUE]

Page 42: R lecture oga

w1[1] returns a sublist; w1[[1]] returns the content of the list element.

[Figure: the list w1 and the three vectors it refers to: 1 2 3, "ab", TRUE]

> typeof(w1)
[1] "list"
> typeof(w1[1])
[1] "list"
> typeof(w1[[1]])
[1] "integer"

> w1[1]
[[1]]
[1] 1 2 3

> w1[[1]]
[1] 1 2 3

> w1[[1]][1]
[1] 1

Page 43: R lecture oga

w2<-w1[c(1,2)]

[Figure: the lists w1 and w2 and the vectors 1 2 3, "ab", TRUE]

> remove(w1)
> w1
Error: object 'w1' not found
> w2
[[1]]
[1] 1 2 3

[[2]]
[1] "ab"
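The point of this slide can be reproduced in a few lines: removing the name w1 does not affect w2, because w2 keeps its own references to the elements it was built from. The sketch assumes w1 is defined as on the earlier "R List" slide.

```r
w1 <- list(1:3, "ab", TRUE)   # as defined on the "R List" slide
w2 <- w1[c(1, 2)]             # sublist holding the first two elements

rm(w1)                        # only the name w1 is removed
print(w2)                     # w2 is still intact
```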

Page 44: R lecture oga

R List and “names”

> w3<-list(a=1:3, b="abc", NA)
> w3
$a
[1] 1 2 3

$b
[1] "abc"

[[3]]
[1] NA

> w3[[1]]
[1] 1 2 3
> w3$a
[1] 1 2 3
> w3[1]
$a
[1] 1 2 3

Page 45: R lecture oga

Attributes of an R object

[Figure: the list w3 with its vectors (1 2 3, "ab", TRUE) and its "names" attribute: "a" "b" ""]

> w3<-list(a=1:3,b="ab",TRUE)
> attributes(w3)
$names
[1] "a" "b" ""

> attr(w3,"names")<-NULL
> w3
[[1]]
[1] 1 2 3

[[2]]
[1] "ab"

[[3]]
[1] TRUE

Page 46: R lecture oga

data.frame : List of vectors

> phenotype<-read.table("bodymap_phenodata.txt", header=T, row.names=1, sep=" ", quote="")
> phenotype
          num.tech.reps tissue.type gender age             race
ERS025098             2     adipose      F  73        caucasian
ERS025092             2     adrenal      M  60        caucasian
ERS025085             2       brain      F  77        caucasian
ERS025088             2      breast      F  29        caucasian
ERS025089             2       colon      F  68        caucasian
ERS025082             2       heart      M  77        caucasian
ERS025081             2      kidney      F  60        caucasian
ERS025096             2       liver      M  37        caucasian
ERS025099             2        lung      M  65        caucasian
ERS025086             2   lymphnode      F  86        caucasian
ERS025084             6     mixture   <NA>  NA        caucasian
ERS025087             5     mixture   <NA>  NA        caucasian
ERS025093             5     mixture   <NA>  NA        caucasian
ERS025083             2       ovary      F  47 african_american
ERS025095             2    prostate      M  73        caucasian
… omitted

Page 47: R lecture oga

RNA-seq

http://www.bgisequence.com/jp/services/sequencing-services/rna-sequencing/rna-seq/

Page 48: R lecture oga

http://bowtie-bio.sourceforge.net/recount/

Page 49: R lecture oga

bodymap_count_table.txt

Tab-delimited format.

The first line is a list of sample identifiers (19 human organs).

The first column is a list of gene identifiers (Ensembl genes).

Page 50: R lecture oga

bodymap_phenodata.txt

Page 51: R lecture oga

Read a data table to a data frame

> phenotype<-read.table("bodymap_phenodata.txt", header=T, row.names=1, sep=" ", quote="")
> phenotype
          num.tech.reps tissue.type gender age             race
ERS025098             2     adipose      F  73        caucasian
ERS025092             2     adrenal      M  60        caucasian
ERS025085             2       brain      F  77        caucasian
ERS025088             2      breast      F  29        caucasian
ERS025089             2       colon      F  68        caucasian
ERS025082             2       heart      M  77        caucasian
ERS025081             2      kidney      F  60        caucasian
ERS025096             2       liver      M  37        caucasian
ERS025099             2        lung      M  65        caucasian
ERS025086             2   lymphnode      F  86        caucasian
ERS025084             6     mixture   <NA>  NA        caucasian
ERS025087             5     mixture   <NA>  NA        caucasian
ERS025093             5     mixture   <NA>  NA        caucasian
ERS025083             2       ovary      F  47 african_american
ERS025095             2    prostate      M  73        caucasian
… omitted

Page 52: R lecture oga

Inspect the type and attribute of the data frame

> typeof(phenotype)
[1] "list"
> attributes(phenotype)
$names
[1] "num.tech.reps" "tissue.type"   "gender"        "age"
[5] "race"

$class
[1] "data.frame"

$row.names
 [1] "ERS025098" "ERS025092" "ERS025085" "ERS025088" "ERS025089" "ERS025082"
 [7] "ERS025081" "ERS025096" "ERS025099" "ERS025086" "ERS025084" "ERS025087"
[13] "ERS025093" "ERS025083" "ERS025095" "ERS025097" "ERS025094" "ERS025090"
[19] "ERS025091"

Page 53: R lecture oga

Read the count table

> data <- read.table("bodymap_count_table.txt", header=T, row.names=1, sep="\t", quote="")

> head(data)
                ERS025098 ERS025092 ERS025085 ERS025088 ERS025089 ERS025082
ENSG00000000003      1354       216       215       924       725       125
ENSG00000000005       712       134         4      1495       119        20
ENSG00000000419       450       547       516       529       808       680
ENSG00000000457       188       368       196       386       156       259
ENSG00000000460        66        29         1        26        11         9
ENSG00000000938       104        79         7        29         0         3
… omitted

Page 54: R lecture oga

Replace the column names: from the IDs to the tissue type descriptions

> colnames(data)
 [1] "ERS025098" "ERS025092" "ERS025085" "ERS025088" "ERS025089" "ERS025082"
 [7] "ERS025081" "ERS025096" "ERS025099" "ERS025086" "ERS025084" "ERS025087"
[13] "ERS025093" "ERS025083" "ERS025095" "ERS025097" "ERS025094" "ERS025090"
[19] "ERS025091"
> colnames(data)<-phenotype$tissue.type
> colnames(data)
 [1] "adipose"          "adrenal"          "brain"            "breast"
 [5] "colon"            "heart"            "kidney"           "liver"
 [9] "lung"             "lymphnode"        "mixture"          "mixture"
[13] "mixture"          "ovary"            "prostate"         "skeletal_muscle"
[17] "testes"           "thyroid"          "white_blood_cell"
> head(data)
                adipose adrenal brain breast colon heart kidney liver lung
ENSG00000000003    1354     216   215    924   725   125    796  1954  815
ENSG00000000005     712     134     4   1495   119    20      7     0    0
ENSG00000000419     450     547   516    529   808   680    744   369  636
ENSG00000000457     188     368   196    386   156   259    436   288  187
ENSG00000000460      66      29     1     26    11     9     25    42   12
ENSG00000000938     104      79     7     29     0     3      1    20  243
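This renaming relies on the samples appearing in the same order in both tables, and it produces three columns all named "mixture". A more defensive variant is sketched below on toy data. The data frames counts and pheno and their values are made up for illustration and are not part of the BodyMap files.

```r
# Toy stand-ins for the count table and the phenotype table (hypothetical values).
counts <- data.frame(ERS1 = c(5, 0), ERS2 = c(3, 7), ERS3 = c(1, 2),
                     row.names = c("geneA", "geneB"))
pheno <- data.frame(tissue.type = c("brain", "mixture", "mixture"),
                    row.names = c("ERS1", "ERS2", "ERS3"))

# The renaming is only valid if the sample order matches in both tables.
stopifnot(identical(colnames(counts), rownames(pheno)))

# as.character() guards against factor columns;
# make.unique() disambiguates repeated names by adding a numeric suffix.
colnames(counts) <- make.unique(as.character(pheno$tissue.type))
print(colnames(counts))
```

Unique column names matter later, since an expression such as data$mixture only ever returns the first matching column.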

Page 55: R lecture oga

Looking into the data frame

> head(data$adipose, 100)
 [1] 1354 712 450 188 66 104 0 1323 0 858 0 0
[13] 13 6346 0 0 0 0 0 3 0 485 0 0
[25] 36 0 0 0 0 1002 1360 0 4179 12 424 0
[37] 97 0 0 0 0 0 0 0 2577 0 0 0
[49] 0 0 5 2241 0 0 115 3678 0 14104 18 1662
[61] 0 0 0 0 6 0 0 7839 0 2 1313 1997
[73] 40 5390 0 0 0 208 180 1277 1460 0 0 1002
[85] 30 177 84 441 0 2986 1598 0 13925 94 5565 0
[97] 0 0 0 0

> length(data$adipose)
[1] 52580
> length(data$adipose[data$adipose>0])
[1] 9992

Page 56: R lecture oga

Distribution of the data

> hist(data$adipose)

> hist(log10(data$adipose))

> summary(log10(data$adipose))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -Inf    -Inf    -Inf    -Inf    -Inf       6
> summary(log10(data$adipose[data$adipose>0]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.462   2.382   2.287   3.109   6.200
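The -Inf values appear because log10(0) is -Inf and most genes have a count of zero. Besides dropping the zeros as above, a common alternative is to add a pseudocount of 1 before taking the logarithm. The sketch uses a small made-up vector rather than the BodyMap data, and the pseudocount is an option added here, not something the slides use.

```r
counts <- c(0, 0, 1, 9, 99, 0, 999)   # hypothetical read counts

log_raw <- log10(counts)              # zeros become -Inf
n_pos   <- sum(counts > 0)            # counts the TRUE values directly

# Option 1 (as on the slide): drop the zeros before taking the log.
log_pos <- log10(counts[counts > 0])

# Option 2 (assumption of this sketch): add a pseudocount so that 0 maps to 0.
log_p1 <- log10(counts + 1)

print(summary(log_pos))
print(summary(log_p1))
```

Note that sum(counts > 0) gives the same number as length(counts[counts > 0]) without building the intermediate vector.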

Page 57: R lecture oga

attach() and detach() the column header names to its "environment"

> attach(data)
> head(adipose, 100)
 [1] 1354 712 450 188 66 104 0 1323 0 858 0 0
[13] 13 6346 0 0 0 0 0 3 0 485 0 0
[25] 36 0 0 0 0 1002 1360 0 4179 12 424 0
[37] 97 0 0 0 0 0 0 0 2577 0 0 0
[49] 0 0 5 2241 0 0 115 3678 0 14104 18 1662
[61] 0 0 0 0 6 0 0 7839 0 2 1313 1997
[73] 40 5390 0 0 0 208 180 1277 1460 0 0 1002
[85] 30 177 84 441 0 2986 1598 0 13925 94 5565 0
[97] 0 0 0 0
> length(adipose)
[1] 52580
> detach(data)
> length(adipose)
Error: object 'adipose' not found
> length(data$adipose)
[1] 52580
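When the column names are needed for only one expression, with() gives the same convenience as attach() without changing the search path, so there is nothing to detach afterwards. A minimal sketch on a toy data frame (df and its values are made up):

```r
df <- data.frame(adipose = c(10, 0, 5), brain = c(1, 2, 3))   # toy data

# The column names are visible only inside the with() call.
n        <- with(df, length(adipose))
positive <- with(df, sum(adipose > 0))

print(c(n = n, positive = positive))
```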

Page 58: R lecture oga

Environment (1/2)

Environment basics: http://adv-r.had.co.nz/Environments.html

The job of an environment is to associate, or bind, a set of names to a set of values. You can think of an environment as a bag of names:

• If an object has no names pointing to it, it gets automatically deleted by the garbage collector.

• Every object in an environment has a unique name.

• The objects in an environment are not ordered (i.e., it doesn’t make sense to ask what the first object in an environment is).
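These properties can be tried out on an explicitly created environment. The sketch below uses only base R functions; the names e, a, and b are arbitrary.

```r
e <- new.env()                # an empty "bag of names"
assign("a", 1, envir = e)     # bind the name a to the value 1
e$b <- "text"                 # $ is a shorthand for the same operation

e$a <- 2                      # rebinding: a name refers to one value at a time

bound <- sort(ls(e))          # ls() lists names; sorted here for a stable order
a_val <- get("a", envir = e)

print(bound)
print(a_val)
```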

Page 59: R lecture oga

Environment (2/2)

Most environments are created as a consequence of using functions.

An environment has a parent environment.

http://adv-r.had.co.nz/Environments.html
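Both statements can be observed directly: each function call creates a new environment for its local variables, and the parent of that environment is the one in which the function was defined. The function f below is a made-up example.

```r
f <- function() {
  y <- 1
  environment()       # return the environment created for this call
}

e1 <- f()
e2 <- f()

fresh  <- !identical(e1, e2)                          # one environment per call
parent <- identical(parent.env(e1), environment(f))   # parent = where f was defined
y_val  <- get("y", envir = e1)                        # the local variable lives in e1

print(c(fresh = fresh, parent = parent, y = y_val))
```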

Page 60: R lecture oga

the apply() function

> s <- apply(data, 2, sum)
> s
         adipose          adrenal            brain           breast
        23957600         18987359         20995462         23426900
           colon            heart           kidney            liver
        23397325         26762377         22630393         29314904
            lung        lymphnode          mixture          mixture
        23426381         19489508         31135063         57697453
         mixture            ovary         prostate  skeletal_muscle
        52460922         22857384         25215879         28400943
          testes          thyroid white_blood_cell
        27261469         24465463         27871222

> png(filename="bar001.png")
> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)
> dev.off()
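The second argument of apply() selects the dimension that is kept: 1 applies the function to each row, 2 to each column. A small matrix with made-up values shows the difference:

```r
# Two genes (rows) by three samples (columns); the values are arbitrary.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))

col_totals <- apply(m, 2, sum)   # one value per column (sample)
row_totals <- apply(m, 1, sum)   # one value per row (gene)

print(col_totals)
print(row_totals)
```

For plain sums, colSums() and rowSums() return the same values; apply() is the general form that accepts any function.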

Page 61: R lecture oga

Customizing (Traditional) Graphics

> s=apply(data, 2, sum)
> s
         adipose          adrenal            brain           breast
        23957600         18987359         20995462         23426900
           colon            heart           kidney            liver
        23397325         26762377         22630393         29314904
            lung        lymphnode          mixture          mixture
        23426381         19489508         31135063         57697453
         mixture            ovary         prostate  skeletal_muscle
        52460922         22857384         25215879         28400943
          testes          thyroid white_blood_cell
        27261469         24465463         27871222

> barplot(s)

Page 62: R lecture oga

Customizing (Traditional) Graphics

barplot(s, horiz=TRUE)

Page 63: R lecture oga

Customizing (Traditional) Graphics

> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)

Page 64: R lecture oga

Customizing Traditional Graphics with par() function

Paul Murrell, R Graphics, 2nd ed. (2011)

Page 65: R lecture oga

Customizing Traditional Graphics with par() function

Paul Murrell, R Graphics, 2nd ed. (2011)

Page 66: R lecture oga

Paul Murrell, R Graphics, 2nd ed. (2011)

Page 67: R lecture oga
Page 68: R lecture oga

How many plot types are there?

Page 69: R lecture oga

Winston Chang, R Graphics Cookbook, O'Reilly (2013)

ggplot2 and traditional graphics

Page 70: R lecture oga

Functional programming with the apply() function

> apply(log10(data), 2, mean)
         adipose          adrenal            brain           breast
            -Inf             -Inf             -Inf             -Inf
           colon            heart           kidney            liver
            -Inf             -Inf             -Inf             -Inf
            lung        lymphnode          mixture          mixture
            -Inf             -Inf             -Inf             -Inf
         mixture            ovary         prostate  skeletal_muscle
            -Inf             -Inf             -Inf             -Inf
          testes          thyroid white_blood_cell
            -Inf             -Inf             -Inf
> mean2<-function(x) { mean(x[x>0]) }
> apply(log10(data), 2, mean2)
         adipose          adrenal            brain           breast
        2.335220         2.344531         2.278299         2.346041
           colon            heart           kidney            liver
        2.380096         2.226729         2.415721         2.236490
            lung        lymphnode          mixture          mixture
        2.484701         2.502548         2.531860         2.776740
         mixture            ovary         prostate  skeletal_muscle
        2.670258         2.402131         2.503051         2.464915
          testes          thyroid white_blood_cell
        2.486507         2.439520         2.597849
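One subtlety: mean2() is applied after the log transform, so the filter x>0 removes not only the -Inf entries (count 0) but also entries equal to 0 (count 1). If the intent is to exclude only the zero counts, filtering on finite values is one alternative. The sketch uses a toy data frame with made-up counts, and mean_finite is a hypothetical helper name.

```r
toy <- data.frame(a = c(0, 1, 10, 100), b = c(0, 0, 1000, 10))   # hypothetical counts

mean2       <- function(x) { mean(x[x>0]) }         # as on the slide
mean_finite <- function(x) mean(x[is.finite(x)])    # keeps log10(1) = 0

r_slide  <- apply(log10(toy), 2, mean2)        # column a drops the count of 1
r_finite <- apply(log10(toy), 2, mean_finite)  # column a keeps the count of 1

# The same result as r_finite, filtering the counts before the transform.
r_direct <- sapply(toy, function(x) mean(log10(x[x > 0])))

print(rbind(r_slide, r_finite, r_direct))
```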

Page 71: R lecture oga

Quick-R: http://www.statmethods.net/management/userfunctions.html

Page 72: R lecture oga

Quick-R: http://www.statmethods.net/management/controlstructures.html

Page 73: R lecture oga