Packages for Data Wrangling

Hiroki

Packages for Data Wrangling


• readr

• rio

• readxl


• quantmod


• dplyr

• data.table

• tidyr

• sqldf

• zoo


Data wrangling packages

Package: plyr
Use: data wrangling
Comment: While dplyr is my go-to package for wrangling data frames, the older plyr package still comes in handy when working with other types of R data such as lists. CRAN.
Example: llply(mylist, myfunction)
Author: Hadley Wickham

Package: reshape2
Use: data wrangling
Comment: Change data row and column formats from "wide" to "long"; turn variables into column names or column names into variables and more. The tidyr package is a newer, more focused option, but I still use reshape2. CRAN.
Example: See my tutorial
Author: Hadley Wickham

Package: stringr
Use: data wrangling
Comment: Numerous functions for text manipulation. Some are similar to existing base R functions but in a more standard format, including working with regular expressions. Some of my favorites: str_pad and str_trim. CRAN.
Example: str_pad(myzipcodevector, 5, "left", "0")
Author: Hadley Wickham

Package: lubridate
Use: data wrangling
Comment: Everything you ever wanted to do with date arithmetic, although understanding and using the available functionality can be somewhat complex. CRAN.
Example: mdy("05/06/2015") + months(1). More examples in the package vignette.
Author: Garrett Grolemund, Hadley Wickham & others

Package: sqldf
Use: data wrangling, data analysis
Comment: Do you know a great SQL query you'd use if your R data frame were in a SQL database? Run SQL queries on your data frame with sqldf. CRAN.
Example: sqldf("select * from mydf where mycol > 4")
Author: G. Grothendieck

Package: dplyr
Use: data wrangling, data analysis
Comment: The essential data-munging R package when working with data frames. Especially useful for operating on data by categories. CRAN.
Example: See the intro vignette
Author: Hadley Wickham

Package: data.table
Use: data wrangling, data analysis
Comment: Popular package for heavy-duty data wrangling. While I typically prefer dplyr, data.table has many fans for its speed with large data sets. CRAN.
Example: Useful tutorial
Author: Matt Dowle & others

Package: zoo
Use: data wrangling, data analysis
Comment: Robust package with a slew of functions for dealing with time series data; I like the handy rollmean function for calculating moving averages. CRAN.
Example: rollmean(mydf, 7)
Author: Achim Zeileis & others
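The one-line examples in the table can be tried on toy data. This is a minimal sketch, assuming stringr, lubridate, and zoo are installed; the vectors zips and daily are made up for illustration and do not come from the slides.

```r
library(stringr)
library(lubridate)
library(zoo)

# Hypothetical ZIP codes that lost their leading zeros
zips <- c("2138", "501", "90210")
padded <- str_pad(zips, width = 5, side = "left", pad = "0")

# Date arithmetic: parse month/day/year, then add one calendar month
next_month <- mdy("05/06/2015") + months(1)

# Moving average over a window of 7 values (hypothetical daily counts)
daily <- c(3, 5, 4, 6, 8, 7, 9, 10, 12, 11)
smoothed <- rollmean(daily, k = 7)

print(padded)
print(next_month)
print(smoothed)
```

With a window of 7 over 10 values, rollmean returns 4 averages because incomplete windows at the edges are dropped by default.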

Data Wrangling

Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. (Wikipedia)





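[Figure: two bars. Upper bar: preprocessing | analysis and other work. Lower bar: internal and external coordination, data acquisition, environment setup, etc. | analysis.]

It is often said that 70 to 80 percent of the effort in data analysis goes to preprocessing, but once coordination, data acquisition, and environment setup are counted as well, the share left for the analysis itself is even smaller:

30% * 30% < 10%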



apply family {base}


(Example 1) Mean of each numeric column of iris {datasets}

> apply(iris[, -5], 2, mean, na.rm = TRUE)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    5.843333     3.057333     3.758000     1.199333

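(Example 2) Convert every factor column of a data frame to character

> # stringsAsFactors = TRUE reproduces the pre-4.0 default used on the slide
> df <- data.frame(X = LETTERS, x = letters, stringsAsFactors = TRUE)
> str(df)
'data.frame':	26 obs. of  2 variables:
 $ X: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ x: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> # df[] keeps the data.frame structure while replacing every column
> df[] <- lapply(df, as.character)
> str(df)
'data.frame':	26 obs. of  2 variables:
 $ X: chr  "A" "B" "C" "D" ...
 $ x: chr  "a" "b" "c" "d" ...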
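Other members of the apply family follow the same pattern. The sketch below uses only base R and the built-in iris data; the variable names are illustrative.

```r
# sapply() simplifies the list returned by lapply() into a named vector
col_means <- sapply(iris[, -5], mean)

# vapply() is like sapply() but checks the type and length of each result
col_sds <- vapply(iris[, -5], sd, FUN.VALUE = numeric(1))

# tapply() applies a function within the groups defined by a factor
len_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)

print(round(col_means, 3))
print(round(col_sds, 3))
print(len_by_species)
```

vapply is usually the safer choice inside functions because it fails loudly when a result does not have the declared shape.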

ddply {plyr}

> library(plyr)
Warning message:
package 'plyr' was built under R version 3.1.3
> df <- data.frame(
+   group = c(rep("A", 8), rep("B", 15), rep("C", 6)),
+   sex   = sample(c("M", "F"), size = 29, replace = TRUE),
+   age   = runif(n = 29, min = 18, max = 54)
+ )
> ddply(df, .(group, sex), summarize,
+       mean = mean(age),
+       sd = sd(age))
Error in withCallingHandlers(tryCatch(evalq((function (i) :
  object '.rcpp_warning_recorder' not found

> # Running the same call a second time succeeds
> ddply(df, .(group, sex), summarize,
+       mean = mean(age),
+       sd = sd(age))
  group sex     mean        sd
1     A   F 42.43033  8.996826
2     A   M 30.09450 13.311536
3     B   F 35.64277 11.060713
4     B   M 38.96056  6.731923
5     C   F 25.01813  4.588658
6     C   M 49.29878        NA
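The NA in the last row is what sd() returns for a group with a single observation, which is presumably the case for group C, sex M here. A minimal base R sketch with a made-up data frame (df_small is hypothetical, not the data from the slide) shows the effect and a way to check cell sizes:

```r
# Hypothetical miniature data set: group C contains only one male
df_small <- data.frame(
  group = c("C", "C", "C", "C"),
  sex   = c("F", "F", "F", "M"),
  age   = c(21, 25, 29, 49)
)

# The sample standard deviation divides by n - 1, so one value gives NA
single_sd <- sd(49)

# Base R counterparts of the grouped summary: cell counts and cell sd
cell_n  <- aggregate(age ~ group + sex, data = df_small, FUN = length)
cell_sd <- aggregate(age ~ group + sex, data = df_small, FUN = sd)

print(single_sd)
print(cell_n)
print(cell_sd)
```

Reporting the count next to the standard deviation makes it obvious which cells are too small to summarize.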

> head(df)
  group sex      age
1     A   M 20.23535
2     A   F 34.10908
3     A   M 45.23656
4     A   F 52.72067
5     A   M 24.81160
6     A   F 37.51441
> library(dplyr)
> df %>% group_by(sex) %>% summarise(mean = mean(age), sd = sd(age))
Source: local data frame [2 x 3]

  sex     mean        sd
1   F 34.51422 10.940603
2   M 37.60556  9.497813
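To get the same breakdown as the ddply call (by group and sex), both variables can be passed to group_by. This is a sketch under a few assumptions: dplyr 1.0 or later is installed (for the .groups argument), and the seed is arbitrary and only there to make the run repeatable. Calling dplyr::summarise explicitly is a precaution, because when plyr is attached after dplyr, plyr's summarise masks dplyr's and ignores the grouping.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, not from the slides
df <- data.frame(
  group = c(rep("A", 8), rep("B", 15), rep("C", 6)),
  sex   = sample(c("M", "F"), size = 29, replace = TRUE),
  age   = runif(n = 29, min = 18, max = 54)
)

# One row per (group, sex) cell, with the cell size alongside the statistics
by_cell <- df %>%
  group_by(group, sex) %>%
  dplyr::summarise(n = n(), mean = mean(age), sd = sd(age), .groups = "drop")

print(by_cell)
```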



> library(data.table)
> iris.tbl <- data.table(iris)
> iris.tbl
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica
> class(iris.tbl)
[1] "data.table" "data.frame"
> setkey(iris.tbl, Species)
> tables()
[1,] iris.tbl 150 5 1 Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species Species
Total: 1MB
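Once a key is set, it can be used for subsetting, and grouped summaries are written with by. A minimal sketch, assuming the data.table package is installed; iris_dt and the column name mean_sepal_length are illustrative.

```r
library(data.table)

iris_dt <- data.table(iris)
setkey(iris_dt, Species)   # sorts by Species and marks it as the key

# Keyed subset: rows whose key value is "setosa"
setosa <- iris_dt[.("setosa")]

# Grouped aggregation: one row per Species
means <- iris_dt[, .(mean_sepal_length = mean(Sepal.Length)), by = Species]

print(key(iris_dt))
print(nrow(setosa))
print(means)
```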



> library(Nippon)
> zen2han("１２３４５ＡＢＣ")
[1] "12345ABC"
> x <- "１２３４５ＡＢＣ"
> x
[1] "１２３４５ＡＢＣ"
> zen2han(x)
[1] "12345ABC"
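zen2han converts full-width (zenkaku) characters to their half-width (hankaku) forms. For the ASCII range, the idea can be sketched in base R by shifting Unicode code points. The helper below is hypothetical and is not how the Nippon package is implemented; it covers only full-width ASCII variants and the ideographic space, not katakana.

```r
# Hypothetical helper: full-width ASCII variants -> plain ASCII
fullwidth_to_halfwidth <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    shift <- cp >= 0xFF01 & cp <= 0xFF5E   # full-width forms of ! through ~
    cp[shift] <- cp[shift] - 0xFEE0        # map onto U+0021..U+007E
    cp[cp == 0x3000] <- 0x20               # ideographic space -> ASCII space
    intToUtf8(cp)
  }, FUN.VALUE = character(1), USE.NAMES = FALSE)
}

# Full-width "12345ABC", written with escapes so the file encoding does not matter
x <- "\uFF11\uFF12\uFF13\uFF14\uFF15\uFF21\uFF22\uFF23"
print(fullwidth_to_halfwidth(x))
```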




> install.packages("lubridate", type = "source")
> library(lubridate)
> ymd("19810322")
Error in gsub("+", "*", fixed = T, gsub(">", "_e>", num)) :
  invalid multibyte string at
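When lubridate cannot be used in a given environment, base R can parse this particular string. A minimal sketch, assuming the input is always in YYYYMMDD form:

```r
# Base R parse of the same compact date string
d <- as.Date("19810322", format = "%Y%m%d")

print(d)
print(format(d, "%Y/%m/%d"))
```

Unlike ymd, as.Date needs the format spelled out and does not guess separators.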








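• as.Date: when the date alone is enough

• as.POSIXct: when you need to handle dates with times

• as.POSIXlt: when you want to extract individual components such as hours, minutes, and seconds

• as.integer: when you need to process (regular or irregular) time series data

• as.ts: when you use time series functions

• as.zoo, as.xts: when you use time series packages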
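The base R conversions in the list above can be compared on a single timestamp. This is a minimal sketch: the timestamp and the monthly values are made up, the time zone is fixed to UTC so the result does not depend on the local setting, and as.zoo and as.xts are left out because they require additional packages.

```r
stamp <- "1981-03-22 14:05:30"   # hypothetical timestamp

d  <- as.Date(stamp)                  # date only; the time of day is dropped
ct <- as.POSIXct(stamp, tz = "UTC")   # one number: seconds since 1970-01-01
lt <- as.POSIXlt(stamp, tz = "UTC")   # list-like: one field per component

# Components are directly accessible on POSIXlt
parts <- c(hour = lt$hour, min = lt$min, sec = lt$sec)

# A Date is stored as days since 1970-01-01
days_since_epoch <- as.integer(d)

# A regular monthly series starting in March 1981 (made-up values)
monthly <- ts(c(10, 12, 9, 14), start = c(1981, 3), frequency = 12)

print(d)
print(ct)
print(parts)
print(days_since_epoch)
print(monthly)
```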



