R lecture oga

Handling quantitative data using the statistical software R

Osamu Ogasawara, 2015.01.19

Contents

1. What is R?

2. An Introductory Example

3. Types and Data Structures (in C and R)

4. Functional Programming (apply() function)

5. R Graphics

6. Bioinformatics (RNA-seq)

What is the R language?

Computer Language Popularity

The TIOBE index is a weighted mean of the following form:

(hits(PL,SE1)/hits(SE1) + ... + hits(PL,SEn)/hits(SEn)) / n

where PL is a search query of the pattern +"<language> programming" and SE1, ..., SEn are the n search engines.
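The formula can be tried out on made-up numbers. The following is a minimal sketch in R, assuming that hits(SEi) means the total number of hits on search engine i; the engine names and all hit counts are invented for illustration and are not real TIOBE data.

```r
# Hypothetical hit counts; none of these numbers are real measurements.
# hits_pl[i]    : hits for the query +"R programming" on search engine i
# hits_total[i] : assumed total hits on search engine i, i.e. hits(SEi)
hits_pl    <- c(se1 = 1200,  se2 = 300,   se3 = 4500)
hits_total <- c(se1 = 60000, se2 = 20000, se3 = 150000)

# (hits(PL,SE1)/hits(SE1) + ... + hits(PL,SEn)/hits(SEn)) / n
rating <- function(pl, total) {
  sum(pl / total) / length(pl)
}

r_rating <- rating(hits_pl, hits_total)
cat("rating =", r_rating, "\n")
```

Because each term is divided by the hits of its own search engine, a large engine does not dominate the mean simply by returning more results.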


(Chart of language popularity, with the languages grouped into: the C language and its derivatives; general-purpose script languages; domain-specific languages; the others.)

Classification of Computer Languages

by abstraction levels

Assembly Languages

High Level Languages: C, C++, Java, …

Very High Level Languages (VHLL)

• Scripting languages: Perl, Python, Ruby, …

• Domain Specific Languages: R (statistics), Matlab, …

The higher the level of a language, the closer it is to natural language.

Introductory Examples

Simple Example (1) histogram

> x <- rnorm(100000000)
> head(x)
[1] 0.4667083 0.8907642 0.8147121 0.4839252 0.5811472 0.4941122
> hist(x)

> system.time(x <- rnorm(100000000))
   user  system elapsed
  8.771   0.249   9.020

Simple Example (2) t-test

> group1 <- c(0.7,-1.6,-0.2,-1.2,-0.1,3.4,3.7,0.8,0.0,2.0)
> group2 <- c(1.9, 0.8, 1.1, 0.1,-0.1,4.4,5.5,1.6,4.6,3.4)
> group1
 [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0
> group2
 [1]  1.9  0.8  1.1  0.1 -0.1  4.4  5.5  1.6  4.6  3.4
> boxplot(group1, group2)
> t.test(group1, group2, var.equal=T)

        Two Sample t-test

data:  group1 and group2
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.363874  0.203874
sample estimates:
mean of x mean of y
     0.75      2.33

http://cse.naro.affrc.go.jp/takezawa/r-tips/r/65.html

Getting Help in R

Display the contents of the R manual (if you know the name of the function):

?rnorm
help("rnorm")

Search functions by keywords:

??"normal distribution"
help.search("normal distribution")

Search functions by (partial) matching of function names:

find("rnorm")
apropos("rnorm")

The R Graphical manual

R manual

Probability Distributions

dnorm() : Density function

pnorm() : (cumulative) probability distribution function

qnorm() : Quantile

rnorm() : Random number generation
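The four prefixes are related to each other, which a few lines of R can illustrate. This is a small sketch using only base R functions; the evaluation points (1.96 and 1), the step size h, and the seed are arbitrary choices made for this illustration.

```r
# p and q are inverse functions of each other
p <- pnorm(1.96)     # P(X <= 1.96) for a standard normal X
q <- qnorm(p)        # recovers 1.96 (up to rounding)

# d is the derivative of p: compare dnorm(1) with a difference quotient
h <- 1e-6
slope <- (pnorm(1 + h) - pnorm(1 - h)) / (2 * h)

# r draws random numbers; set.seed() makes the draw reproducible
set.seed(1)
r <- rnorm(5)

cat("pnorm(1.96) =", p, "\n")
cat("qnorm(p)    =", q, "\n")
cat("slope of pnorm at 1 =", slope, " dnorm(1) =", dnorm(1), "\n")
cat("five random draws:", r, "\n")
```

The same d/p/q/r naming scheme applies to the other distributions in R, for example dunif(), punif(), qunif() and runif().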

"Quick-R" site: http://www.statmethods.net/advgraphs/probability.html

Plotting the density function (1/2)

> x <- seq(-4,4,length=100)
> x
  [1] -4.00000000 -3.91919192 -3.83838384 -3.75757576 -3.67676768 -3.59595960
  [7] -3.51515152 -3.43434343 -3.35353535 -3.27272727 -3.19191919 -3.11111111
 [13] -3.03030303 -2.94949495 -2.86868687 -2.78787879 -2.70707071 -2.62626263
… omitted
> dx <- dnorm(x)

Plotting the density function (2/2)

> plot(x,dx,type="l",xlab="x",ylab="y",main="The normal distribution")

Plotting the probability distribution function

> x <- seq(-4,4,length=100)
> px <- pnorm(x)
> plot(x,px,type="l",xlab="x",ylab="y",main="The normal distribution")

Quantile (1/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))

http://cse.niaes.affrc.go.jp/minaka/R/R-normal.html
Copyright (c) 2004 by MINAKA Nobuhiro. All rights reserved.

Quantile (2/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)

Quantile (3/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)

Quantile (4/5)

plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)
abline(h=0.05)
abline(h=0.95)

Quantile (5/5)

x <- seq(-4,4,length=100)
plot(x,dnorm(x), type="n", ylim=c(0,1))
curve(dnorm(x), type="l", add=T)
curve(pnorm(x), type="l", lty=3, add=T)
abline(h=0.05)
abline(h=0.95)

lower.alpha5 <- qnorm(0.05)
upper.alpha5 <- qnorm(0.95)
abline(v=lower.alpha5)
abline(v=upper.alpha5)
points(lower.alpha5, 0.05, cex=3.0, pch="*")
points(upper.alpha5, 0.95, cex=3.0, pch="*")


Calculation of the p-value of a numeric vector x.

http://d.hatena.ne.jp/hoxo_m/20130213/p1

norm.dist.p <- function(x) {
  n <- length(x)
  mean <- mean(x)
  sd <- sd(x) / sqrt(n)    # standard error of the mean
  # two-sided p-value for H0: mean = 0, using the normal distribution
  p <- pnorm(-abs(mean), mean=0, sd=sd) * 2
  p
}
x <- rnorm(10, mean=0)
p <- norm.dist.p(x)
cat("p =", p, "\n")

Bias in small samples

alpha = 0.05
ps <- sapply(1:10000, function(i) {
  x <- rnorm(10)
  p <- norm.dist.p(x)
  p
})
fp <- sum(ps < alpha) / length(ps)
cat("alpha error rate =", fp, "\n")

alpha error rate = 0.0812
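The error rate exceeds the nominal 0.05 because the standard deviation is itself estimated from only 10 values, while the statistic is compared against a normal distribution. A minimal sketch of the usual remedy follows, assuming the standard one-sample t-test setting (t distribution with n - 1 degrees of freedom). The function name t.dist.p and the seed are hypothetical choices for this illustration and do not come from the lecture.

```r
# Hypothetical variant of norm.dist.p(): same statistic, but referred to
# a t distribution with n - 1 degrees of freedom instead of a normal one.
t.dist.p <- function(x) {
  n <- length(x)
  se <- sd(x) / sqrt(n)            # estimated standard error of the mean
  tstat <- mean(x) / se            # t statistic for H0: mean = 0
  2 * pt(-abs(tstat), df = n - 1)  # two-sided p-value
}

set.seed(42)                       # arbitrary seed, for reproducibility
alpha <- 0.05
ps <- sapply(1:10000, function(i) t.dist.p(rnorm(10)))
fp <- mean(ps < alpha)
cat("alpha error rate =", fp, "\n")

# The p-value agrees with R's built-in one-sample t-test
x <- rnorm(10)
print(all.equal(t.dist.p(x), t.test(x)$p.value))
```

With this variant the simulated error rate should land close to 0.05, up to simulation noise.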

Types and Data Structures

Types in C (partial)

Integer Types

Floating-Point Types

Memory Layout of C Programs

1. Text segment (Code segment)

2. Initialized data segment (initialized global variables and static variables)

3. Uninitialized data segment

4. Stack (automatic variables)

5. Heap (for dynamic memory allocation by malloc(), free(), …)

http://www.geeksforgeeks.org/memory-layout-of-c-program/

Stack frame and function call

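/* b() and c() are not shown on the slide; empty stubs are assumed here
   so that the example compiles. */
int b(void) { return 0; }
int c(void) { return 0; }

int a(void) {
    int x = 1;    /* lives in the stack frame of a() */
    b();
    c();
    return 0;
}

int main(void) {
    int x = 0;    /* lives in the stack frame of main() */
    a();
    return 0;
}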

http://www.tenouk.com/ModuleZ.html

Recursion in C

#include <stdio.h>

int Fact(int f) {
    if (f <= 1)
        return 1;              /* base case: ends the recursion */
    return f * Fact(f - 1);    /* each call gets its own stack frame */
}

int main() {
    int fact;
    fact = Fact(5);
    printf("Factorial is %d", fact);
    return 0;
}

http://www.programmingspark.com/2013/03/Working-of-Recursion-in-detail-using-Stack.html

C pointers

int b = 17;
int* a = &b;    /* a holds the address of b */
int x = *a;     /* x = 17 */

Arrays and Linked Lists

Adding an element to the containers

(Figure: a linked list compared with a C array (R vector).)
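Since an R vector is stored like a C array, appending with c() builds a new, longer vector and copies the old elements each time. The sketch below contrasts growing a vector element by element with filling a preallocated one. The function names grow and prealloc and the size n are made up for this illustration, and actual timings depend on the machine.

```r
n <- 10000

# Grow the vector one element at a time: each c() call allocates a new
# vector and copies everything collected so far.
grow <- function(n) {
  v <- numeric(0)
  for (i in seq_len(n)) v <- c(v, i)
  v
}

# Allocate the full length once, then fill the slots in place.
prealloc <- function(n) {
  v <- numeric(n)
  for (i in seq_len(n)) v[i] <- i
  v
}

# Both approaches produce the same vector
same <- identical(grow(n), prealloc(n))
cat("same result:", same, "\n")

# Timings vary by machine; preallocation usually does less copying
print(system.time(grow(n)))
print(system.time(prealloc(n)))
```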

Types in R

Logical: TRUE, T, FALSE, F

Numerical (double): 1, 1.0, 1.4e+3

Complex: 3.5+4i

Character: "abc"

> typeof(TRUE)
[1] "logical"
> typeof(1)
[1] "double"
> typeof(1.0)
[1] "double"
> typeof(3.5+4i)
[1] "complex"
> typeof("abc")
[1] "character"

> is.vector(TRUE)
[1] TRUE
> is.vector(1)
[1] TRUE
> is.vector(3.5+4i)
[1] TRUE
> is.vector("abc")
[1] TRUE

Creation of R vectors

> c(1,2,3,4,5)
[1] 1 2 3 4 5

> 1:5
[1] 1 2 3 4 5

> 5.1:-1.2
[1]  5.1  4.1  3.1  2.1  1.1  0.1 -0.9

> seq(1,3,0.5)
[1] 1.0 1.5 2.0 2.5 3.0

> rep(

> numeric(10)
 [1] 0 0 0 0 0 0 0 0 0 0
> logical(10)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> character(10)
 [1] "" "" "" "" "" "" "" "" "" ""
> complex(10)
 [1] 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i

Operation on vectors

> 1:10*2
 [1]  2  4  6  8 10 12 14 16 18 20

> 2*(3^(0:4))
[1]   2   6  18  54 162

> v1 <- 1:10
> v2 <- 10:1
> v1+v2
 [1] 11 11 11 11 11 11 11 11 11 11

> v1 <- c(1,2,3)
> v1
[1] 1 2 3
> v1[1]
[1] 1
> v1[4]
[1] NA
> v1[5] <- 10
> v1
[1]  1  2  3 NA 10
> v1[6] <- "a"
> v1
[1] "1"  "2"  "3"  NA   "10" "a"

> v2 <- runif(10, 1,10)
> v2
 [1] 4.851027 7.618278 5.371393 3.940181 1.002870 9.511409 2.364836 5.246343
 [9] 3.361870 9.435904
> v2<5
 [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
> v2[v2<5]
[1] 4.851027 3.940181 1.002870 2.364836 3.361870
> v2[1:3]
[1] 4.851027 7.618278 5.371393
> v2[1:3*2]
[1] 7.618278 3.940181 9.511409

R Lists

Creation of R Lists

> w1 <- list("a", 10, TRUE)
> w1
[[1]]
[1] "a"

[[2]]
[1] 10

[[3]]
[1] TRUE

> w2 <- as.list(c(1,2,3))
> w2
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

Data structure of R objects

Type information pointers data (vector)

R List

> w1 <- list(1:3,"ab",TRUE)
> w1
[[1]]
[1] 1 2 3

[[2]]
[1] "ab"

[[3]]
[1] TRUE

w1[1] returns a sublist; w1[[1]] returns the content of a list element.

> typeof(w1)
[1] "list"
> typeof(w1[1])
[1] "list"
> typeof(w1[[1]])
[1] "integer"

> w1[1]
[[1]]
[1] 1 2 3

> w1[[1]]
[1] 1 2 3

> w1[[1]][1]
[1] 1

> w2 <- w1[c(1,2)]

(Figure: the lists w1 and w2 and the data they point to.)

> remove(w1)
> w1
Error: object 'w1' not found
> w2
[[1]]
[1] 1 2

[[2]]
[1] 3 4

R List and “names”

> w3 <- list(a=1:3, b="abc", NA)
> w3
$a
[1] 1 2 3

$b
[1] "abc"

[[3]]
[1] NA

> w3[[1]]
[1] 1 2 3
> w3$a
[1] 1 2 3
> w3[1]
$a
[1] 1 2 3

Attributes of an R object

> w3 <- list(a=1:3,b="ab",TRUE)
> attributes(w3)
$names
[1] "a" "b" ""

> attr(w3,"names") <- NULL
> w3
[[1]]
[1] 1 2 3

[[2]]
[1] "ab"

[[3]]
[1] TRUE

data.frame : List of vectors

> phenotype <- read.table("bodymap_phenodata.txt", header=T, row.names=1, sep=" ", quote="")
> phenotype
          num.tech.reps tissue.type gender age             race
ERS025098             2     adipose      F  73        caucasian
ERS025092             2     adrenal      M  60        caucasian
ERS025085             2       brain      F  77        caucasian
ERS025088             2      breast      F  29        caucasian
ERS025089             2       colon      F  68        caucasian
ERS025082             2       heart      M  77        caucasian
ERS025081             2      kidney      F  60        caucasian
ERS025096             2       liver      M  37        caucasian
ERS025099             2        lung      M  65        caucasian
ERS025086             2   lymphnode      F  86        caucasian
ERS025084             6     mixture   <NA>  NA        caucasian
ERS025087             5     mixture   <NA>  NA        caucasian
ERS025093             5     mixture   <NA>  NA        caucasian
ERS025083             2       ovary      F  47 african_american
ERS025095             2    prostate      M  73        caucasian
… omitted

RNA-seq

http://www.bgisequence.com/jp/services/sequencing-services/rna-sequencing/rna-seq/

http://bowtie-bio.sourceforge.net/recount/

bodymap_count_table.txt

Tab-delimited format. The first line is a list of sample identifiers (19 samples from human organs). The first column is a list of gene identifiers (Ensembl genes).

bodymap_phenodata.txt

Read a data table to a data frame

> phenotype <- read.table("bodymap_phenodata.txt", header=T, row.names=1, sep=" ", quote="")
> phenotype

(The output is the same phenotype listing as shown earlier under "data.frame : List of vectors".)

Inspect the type and attribute of the data frame

> typeof(phenotype)
[1] "list"
> attributes(phenotype)
$names
[1] "num.tech.reps" "tissue.type"   "gender"        "age"
[5] "race"

$class
[1] "data.frame"

$row.names
 [1] "ERS025098" "ERS025092" "ERS025085" "ERS025088" "ERS025089" "ERS025082"
 [7] "ERS025081" "ERS025096" "ERS025099" "ERS025086" "ERS025084" "ERS025087"
[13] "ERS025093" "ERS025083" "ERS025095" "ERS025097" "ERS025094" "ERS025090"
[19] "ERS025091"
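That a data frame is a list of equal-length column vectors, with the attributes names, class and row.names, can be checked on a toy example. The following sketch uses a made-up data frame; the column values and row names are invented and are not taken from bodymap_phenodata.txt.

```r
# Toy data frame; all values are made up for illustration
df <- data.frame(tissue = c("adipose", "brain", "liver"),
                 age    = c(73, 77, 37),
                 row.names = c("s1", "s2", "s3"))

print(typeof(df))             # the underlying type is "list"
print(length(df))             # the list has one element per column
print(names(attributes(df)))  # which attributes are attached

age_col <- df[["age"]]        # [[ ]] extracts a column as a plain vector
sub_df  <- df["age"]          # [ ] keeps the data frame (sublist) structure

print(age_col)
print(class(sub_df))
```

The difference between df[["age"]] and df["age"] is the same as between w1[[1]] and w1[1] for ordinary lists.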

Read the count table

> data <- read.table("bodymap_count_table.txt", header=T, row.names=1, sep="\t", quote="")

> head(data)
                ERS025098 ERS025092 ERS025085 ERS025088 ERS025089 ERS025082
ENSG00000000003      1354       216       215       924       725       125
ENSG00000000005       712       134         4      1495       119        20
ENSG00000000419       450       547       516       529       808       680
ENSG00000000457       188       368       196       386       156       259
ENSG00000000460        66        29         1        26        11         9
ENSG00000000938       104        79         7        29         0         3
… omitted

Replace the column names: from the IDs to the tissue type descriptions

> colnames(data)
 [1] "ERS025098" "ERS025092" "ERS025085" "ERS025088" "ERS025089" "ERS025082"
 [7] "ERS025081" "ERS025096" "ERS025099" "ERS025086" "ERS025084" "ERS025087"
[13] "ERS025093" "ERS025083" "ERS025095" "ERS025097" "ERS025094" "ERS025090"
[19] "ERS025091"
> colnames(data) <- phenotype$tissue.type
> colnames(data)
 [1] "adipose"          "adrenal"          "brain"            "breast"
 [5] "colon"            "heart"            "kidney"           "liver"
 [9] "lung"             "lymphnode"        "mixture"          "mixture"
[13] "mixture"          "ovary"            "prostate"         "skeletal_muscle"
[17] "testes"           "thyroid"          "white_blood_cell"
> head(data)
                adipose adrenal brain breast colon heart kidney liver lung
ENSG00000000003    1354     216   215    924   725   125    796  1954  815
ENSG00000000005     712     134     4   1495   119    20      7     0    0
ENSG00000000419     450     547   516    529   808   680    744   369  636
ENSG00000000457     188     368   196    386   156   259    436   288  187
ENSG00000000460      66      29     1     26    11     9     25    42   12
ENSG00000000938     104      79     7     29     0     3      1    20  243

Looking into the data frame

> head(data$adipose, 100)
  [1]  1354   712   450   188    66   104     0  1323     0   858     0     0
 [13]    13  6346     0     0     0     0     0     3     0   485     0     0
 [25]    36     0     0     0     0  1002  1360     0  4179    12   424     0
 [37]    97     0     0     0     0     0     0     0  2577     0     0     0
 [49]     0     0     5  2241     0     0   115  3678     0 14104    18  1662
 [61]     0     0     0     0     6     0     0  7839     0     2  1313  1997
 [73]    40  5390     0     0     0   208   180  1277  1460     0     0  1002
 [85]    30   177    84   441     0  2986  1598     0 13925    94  5565     0
 [97]     0     0     0     0

> length(data$adipose)
[1] 52580
> length(data$adipose[data$adipose>0])
[1] 9992

Distribution of the data

> hist(data$adipose)

> hist(log10(data$adipose))

> summary(log10(data$adipose))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -Inf    -Inf    -Inf    -Inf    -Inf       6
> summary(log10(data$adipose[data$adipose>0]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.462   2.382   2.287   3.109   6.200
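The -Inf values arise because log10(0) is -Inf, and a single -Inf drags the mean and the quartiles down with it. The slides handle this by dropping the zeros. A commonly used alternative, which is not part of the lecture, is to add a pseudocount of 1 before taking the logarithm. The sketch below compares both on a made-up count vector.

```r
# Made-up counts standing in for one column of the count table
counts <- c(0, 0, 1, 9, 99, 999)

raw_log     <- log10(counts)              # zeros become -Inf
nonzero_log <- log10(counts[counts > 0])  # approach used on the slides
pseudo_log  <- log10(counts + 1)          # alternative: pseudocount of 1

print(summary(raw_log))
print(summary(nonzero_log))
print(summary(pseudo_log))
```

Dropping zeros changes the number of observations, while the pseudocount keeps every gene but shifts small counts noticeably; which is preferable depends on the question being asked.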

attach() and detach() the column header names to its "environment"

> attach(data)
> head(adipose, 100)
  [1]  1354   712   450   188    66   104     0  1323     0   858     0     0
 [13]    13  6346     0     0     0     0     0     3     0   485     0     0
 [25]    36     0     0     0     0  1002  1360     0  4179    12   424     0
 [37]    97     0     0     0     0     0     0     0  2577     0     0     0
 [49]     0     0     5  2241     0     0   115  3678     0 14104    18  1662
 [61]     0     0     0     0     6     0     0  7839     0     2  1313  1997
 [73]    40  5390     0     0     0   208   180  1277  1460     0     0  1002
 [85]    30   177    84   441     0  2986  1598     0 13925    94  5565     0
 [97]     0     0     0     0
> length(adipose)
[1] 52580
> detach(data)
> length(adipose)
Error: object 'adipose' not found
> length(data$adipose)
[1] 52580
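attach() changes the search path for the whole session until detach() is called, so a forgotten detach() can leave column names visible where they are not expected. For a single expression, base R's with() gives the same convenience without touching the search path. The sketch below uses a small made-up data frame rather than the real count table.

```r
# Made-up stand-in for the count table
df <- data.frame(adipose = c(1354, 712, 0),
                 brain   = c(215, 4, 516))

# Column names are visible only inside the with() call
n_genes     <- with(df, length(adipose))
n_expressed <- with(df, sum(adipose > 0))

cat("genes:", n_genes, " expressed in adipose:", n_expressed, "\n")
```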

Environment (1/2)

Environment basics: http://adv-r.had.co.nz/Environments.html

The job of an environment is to associate, or bind, a set of names to a set of values. You can think of an environment as a bag of names:

• If an object has no names pointing to it, it gets automatically deleted by the garbage collector.

• Every object in an environment has a unique name.

• The objects in an environment are not ordered (i.e., it doesn’t make sense to ask what the first object in an environment is).
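The "bag of names" picture can be tried directly with base R functions. This is a minimal sketch; the environment e and the names a, b and zzz are arbitrary choices for the illustration.

```r
e <- new.env()

# Bind names to values inside the environment
assign("a", 1:3, envir = e)
e$b <- "abc"

# ls() lists the names; sorting is applied because the environment
# itself carries no meaningful order
bound_names <- sort(ls(e))
print(bound_names)

# Look a name up in e only, without searching parent environments
has_zzz <- exists("zzz", envir = e, inherits = FALSE)
print(has_zzz)

# Environments are not copied on assignment: f and e are the same bag
f <- e
f$a <- 0
print(e$a)
```

Rebinding a through f is visible through e as well, which differs from vectors and lists, where assignment behaves like a copy.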

Environment (2/2)

Most environments are created as a consequence of using functions.

An environment has a parent environment.

http://adv-r.had.co.nz/Environments.html
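Both statements on this slide can be observed with a small closure. The sketch below is only an illustration; the names make_counter, counter and count are hypothetical. Each call to make_counter() creates a fresh environment holding count, and the parent of that environment is the environment in which make_counter was defined.

```r
make_counter <- function() {
  count <- 0               # lives in the environment of this call
  function() {
    count <<- count + 1    # <<- assigns in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()
second <- counter()
print(second)

# The environment created by the call to make_counter()
env <- environment(counter)
print(get("count", envir = env))

# Its parent is the environment where make_counter was defined
defined_in <- environment(make_counter)
print(identical(parent.env(env), defined_in))
```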

The apply() function

> s <- apply(data, 2, sum)    # column sums: total count per sample
> s
         adipose          adrenal            brain           breast
        23957600         18987359         20995462         23426900
           colon            heart           kidney            liver
        23397325         26762377         22630393         29314904
            lung        lymphnode          mixture          mixture
        23426381         19489508         31135063         57697453
         mixture            ovary         prostate  skeletal_muscle
        52460922         22857384         25215879         28400943
          testes          thyroid white_blood_cell
        27261469         24465463         27871222

> png(filename="bar001.png")
> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)
> dev.off()

Customizing (Traditional) Graphics

> s=apply(data, 2, sum)
> s
         adipose          adrenal            brain           breast
        23957600         18987359         20995462         23426900
           colon            heart           kidney            liver
        23397325         26762377         22630393         29314904
            lung        lymphnode          mixture          mixture
        23426381         19489508         31135063         57697453
         mixture            ovary         prostate  skeletal_muscle
        52460922         22857384         25215879         28400943
          testes          thyroid white_blood_cell
        27261469         24465463         27871222

> barplot(s)

Customizing (Traditional) Graphics

barplot(s, horiz=TRUE)

Customizing (Traditional) Graphics

> par(mai=c(1,2,1,1))
> barplot(s,horiz=T,las=1)

Customizing Traditional Graphics with the par() function

Paul Murrell, R Graphics, 2nd ed. (2011)

How many plot types are there?

Winston Chang, R Graphics Cookbook, O'Reilly (2013)

ggplot2 and traditional graphics

Functional programming with the apply() function

> apply(log10(data), 2, mean)
         adipose          adrenal            brain           breast
            -Inf             -Inf             -Inf             -Inf
           colon            heart           kidney            liver
            -Inf             -Inf             -Inf             -Inf
            lung        lymphnode          mixture          mixture
            -Inf             -Inf             -Inf             -Inf
         mixture            ovary         prostate  skeletal_muscle
            -Inf             -Inf             -Inf             -Inf
          testes          thyroid white_blood_cell
            -Inf             -Inf             -Inf
> mean2 <- function(x) { mean(x[x>0]) }
> apply(log10(data), 2, mean2)
         adipose          adrenal            brain           breast
        2.335220         2.344531         2.278299         2.346041
           colon            heart           kidney            liver
        2.380096         2.226729         2.415721         2.236490
            lung        lymphnode          mixture          mixture
        2.484701         2.502548         2.531860         2.776740
         mixture            ovary         prostate  skeletal_muscle
        2.670258         2.402131         2.503051         2.464915
          testes          thyroid white_blood_cell
        2.486507         2.439520         2.597849
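One detail of mean2 is worth noting: it is applied to the log-transformed values, so x > 0 removes not only the -Inf entries (counts of 0) but also entries equal to 0 on the log scale (counts of exactly 1). The sketch below shows the effect on a made-up matrix; mean_finite is a hypothetical variant that drops only the non-finite values.

```r
# Made-up counts: two "samples" (columns), four "genes" (rows)
m <- cbind(a = c(0, 1, 10, 100),
           b = c(0, 0, 1000, 10))

# apply(m, 2, f) calls f once per column
col_totals <- apply(m, 2, sum)
print(col_totals)
print(colSums(m))              # built-in shortcut for the same sums

mean2       <- function(x) mean(x[x > 0])          # as in the lecture
mean_finite <- function(x) mean(x[is.finite(x)])   # hypothetical variant

res_mean2  <- apply(log10(m), 2, mean2)
res_finite <- apply(log10(m), 2, mean_finite)
print(res_mean2)
print(res_finite)
```

In column a the count of 1 has log10 equal to 0, so mean2 averages only 1 and 2, while mean_finite averages 0, 1 and 2. In column b, which has no count of exactly 1, both functions agree.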

Quick-R: http://www.statmethods.net/management/userfunctions.html

Quick-R: http://www.statmethods.net/management/controlstructures.html
