R Packages githubinstall magicfor dplyr.teradata
-
Upload
hoxom -
Category
Data & Analytics
-
view
602 -
download
1
Transcript of R Packages githubinstall magicfor dplyr.teradata
No package? OK, develop it !!!
Koji Makiyama Global Tokyo.R #2
2017/04/01
1
About Me
HOXO-M Inc.
President & CEO
2
HOXO-M Inc.
• Consists of – Awesome young data scientists and – R-Ojisan: people who too love R.
• Our Stance: “No package? OK, develop it !!!” (なければ作る!それがホクソエムの誓い!)
3
Developed Packages
• CRAN githubinstall batade RODBCDBI densratio jpmesh magicfor
• GitHub pforeach jaguchi RFinanceJ SparkRext easyRFM healthplanet dplyr.teradata rOpenWeatherMap etc.
4
Today’s Story
• Introduce Our Packages Picked up – githubinstall – magicfor – dplyr.teradata
5
githubinstall
6
githubinstall
• Install R Packages on GitHub without Developer Names
• Using devtools >install_github("hadley/dplyr")
• Using githubinstall >githubinstall("dplyr")
7
Motivation
• There is a great package ggfortify. • In the past, it is not on CRAN. • I used to install it using devtools
>install_github("sinhrks/ggfortify")
• It is hard for me to remember the correct spelling of “sinhrks”
8
Motivation
• I wanted to install packages even if I forget who created them.
• “No package? OK, develop it !!!”
9
How Does It Work?
• Gepuro Task Views http://rpkg.gepuro.net – Crawling GitHub Repositories Every Day
• Atsushi Hayakawa – One of the Awesome – HOXO-M
10
Preparation
• Install >install.packages("githubinstall")
• Load Library>library(githubinstall)
11
Basics
• Install Packages Hosted on GitHub >githubinstall("package_name")
• Example: twitter/AnomalyDetection >githubinstall("AnomalyDetection")
12
Fuzzy Matching
• Example: DiagrammeR
>githubinstall("DiagramR")Suggestion:-rich-iannone/DiagrammeRDoyouwanttoinstallthepackage?(Y/n)
13
Specify Git References
• From Branchesgithubinstall("ggplot2",ref="sf")
• From Tags Githubinstall("ggplot2",ref="v1.1.0")
• From Commits githubinstall("ggplot2",ref="f4398b")
14
Suggest
• To Know Repository Names w/o Install >gh_suggest("DiagramR")[1]"rich-iannone/DiagrammeR"
• Fuzzy Search for Developer Names >gh_suggest_username("yuhui")[1]"yihui"
15
List Packages• List Packages by Developer Names • Example: Packages Created by Hadley >hadleyverse<-gh_list_packages(username="hadley")>head(hadleyverse)usernamepackage_nametitle1hadleyRcppDateTime2hadleyS3HelpersforProgrammingwith3hadleyassertthatUserfriendlyassertionsfor4hadleybabynamesAnRpackagecontainallbab5hadleybenchBechmarkingtoolsfor6hadleybigrqueryAninterfacetoGoogle'sbig
16
Search Packages• Search Packages by Keywords • Example: Search Packages Related Lasso >lasso_packages<-gh_search_packages("lasso")>head(lasso_packages)usernamepackage_nametitleCY-devsparseSVMSolutionPathsofSparChingChuan-Chenmilrmultiple-instancelogiFrankDfuserFusedlassoforhigh-dManuSettySeqGLSeqGLisagrouplassoPingYangChenmilrmultiple-instancelogiTaddyLabgamlrGammalassoregression
17
Summary
18
• The githubinstall package provides helper functions to install and find packages hosted on GitHub.
magicfor
19
magicfor• Remember Printed Values in for Loops >magic_for(print)>for(iin1:3){+squared<-i^2+print(squared)+}>magic_result_as_vector()[1]149
20
Motivation
• Printed Values in for loops go away. >for(iin1:3){+squared<-i^2+print(squared)+} [1]1[1]4[1]9
21
Motivation
• To keep it, we need to change the code. >result<-vector("numeric",3)>for(iin1:3){+squared<-i^2+result[i]<-squared+}>result[1]149
22
Motivation
• Too much hassle to carefully do that to: – Prepare some containers – With the correct length and – Add assignment statements.
• I don’t want to do that.
• “No package? OK, develop it !!!”
23
magicfor
• Insert one line spell magic_for() >magic_for(print)>for(iin1:3){+squared<-i^2+print(squared)+}
• You can take the values out at later. >magic_result_as_vector()[1]149
24
How Does It Work?
• Magic
25
How Does It Work?
• If you want really to know the magic, read “Advanced R.”
• Then check out magicfor codes on GitHub.
https://github.com/hoxo-m/magicfor
26
Preparation
• Install >install.packages("magicfor")
• Load Library>library(magicfor)
27
Basics• magic_for(func,progress,test,silent)
• Arguments func: function to print values (e.g. print, cat) progress: whether to display a progress bar test: number of iteration for test silent: whether to suppress messages
28
Choose Print Function
• You can choose function to print values >magic_for(cat)>for(iin1:3){+squared<-i^2+cat(squared)+}>magic_result_as_vector()[1]149
• Default print function is put().
29
put()
• put() displays values with high flexibility. >x<-2;y<-3>put(x)x:2>put(x,y)x:2,y:3>put(x,x^2,x^3)x:2,x^2:4,x^3:8>put(x,squared=x^2,cubed=x^3)x:2,squared:4,cubed:8
30
magicfor with put()• magicfor & put() are very compatible >magic_for()>for(iin1:3){+put(x=i,squared=i^2,cubed=i^3)+}>magic_result_as_dataframe(F)xsquaredcubed1111224833927
31
Summary
• The magicfor package provides a magic function to store values in for loops automatically.
32
dplyr.teradata
33
dplyr.teradata
• Teradata Backend for dplyr >tera_db<-src_teradata("schema_name")>table<-tbl(tera_db,"table_name”)
>query<-count(table,gender)
>collect(query)gendern1♀1232♂456
34
Motivation
• I would like to extract data with dplyr verbs from Teradata.
• I found teradata.dplyr package on GitHub but it is not maintained.
• “No package? OK, develop it !!!”
35
How Does it Work?
• There is a way to add new SQL backends to dplyr.
>vignette("new-sql-backend")
36
Preparation
• Install Teradata ODBC Driver • Install Package >library(githubinstall)>githubinstall("dplyr.teradata")
• Load Library>library(dplyr.teradata)
37
Usage• Connect to Teradata >tera_db<-src_teradata("schema_name")
• Create Table Object >table<-tbl(tera_db,"table_name")
• Construct Query >query<-count(table,column_name)
• Send Query >result<-collect(query)
38
Construct Query
• Construct Queries using dplyr Verbs >query<-table%>%+select(gender,reg_date)%>%+filter(reg_date=="2017-04-01")%>%+group_by(gender)%>%+summarise(count=n())
• They are converted to SQL implicitly.
39
Check SQL
• Check Converted SQL >show_query(query)<SQL>SELECT"gender",count(*)AS"count"FROM(SELECT*FROM(SELECT"gender"AS"gender","reg_date"AS"reg_date"FROMtable_name)AS"nwkksckhfq"WHERE("reg_date"='2017-04-01')AS"vmmugivqkw”GROUPBY"gender"
40
NOTE
• Generated SQL are usually redundant. • We can only pray that the compiler will
do well.
41
TIPS
• In most cases, a large amount of data is stored in Teradata.
• We should take care some points when extracting data for saving time. – Establish connection with schema – Use filter() and select()– Use summarise() whenever possible – Check explain(query) before send
42
Query Execution Plan >explain(query)<PLAN>1)First,welockstable.table_nameforreadonareservedRowHashtopreventglobaldeadlock.2)Next,welockstable.table_nameforread.3)Wedoanall-AMPsSUMsteptoaggregatefrom..5)Finally,wesendoutanENDTRANSACTIONsteptoallAMPsinvolvedinprocessingtherequest.->ThecontentsofSpool1aresentbacktotheuserastheresultofstatement1.Thetotalestimatedtimeis2.05seconds.
43
Join Across Schema
• When you need to join tables across schemas, you should establish connections without schemas.
>db<-src_teradata("")>table1<-tbl(db,"schema1.table1")>table2<-tbl(db,"schema2.table2")>left_join(table1,table2,by="id")
44
Summary
• The dplyr.teradata package provides a way to extract data using dplyr verbs from Teradata.
• It is a beta version. • We welcome your bug reports and
contributions. https://github.com/hoxo-m/dplyr.teradata
45