Data analysis with Pandas and Spark

Analytics pipelines with Jupyter and Spark


Page 1: Data analysis with Pandas and Spark

Analytics pipelines with Jupyter and Spark

Page 2: Data analysis with Pandas and Spark

Who we are

● NETOPIA
● mobilPay
● mobilPay Wallet
● web2sms
● btko.in
● kartela.ro
● mobilender.mx

Page 3: Data analysis with Pandas and Spark

Challenges

Page 4: Data analysis with Pandas and Spark

Three dimensional problem

● Time: Past events or crystal ball?
● Profile: Who is looking at the data?
● Quantity: How much data is there to look at?

Page 5: Data analysis with Pandas and Spark

Profile

● Data Scientist
● Data Engineer

Page 6: Data analysis with Pandas and Spark

Quantity

● Hundreds of MB to a few GB
● Up to millions of events/records

vs.

● GB to TB to PB
● Hundreds of millions to billions of events/records, and beyond

Page 7: Data analysis with Pandas and Spark

Also

● Computing vs. Storage
● Vertical vs. Horizontal scalability
● Distributed/ML libraries
● Dependency hell

Page 8: Data analysis with Pandas and Spark

Time

[Timeline: Past, NOW, Future]

● Past: Analytics
● Future: Forecasting (a.k.a. Prediction)

Page 9: Data analysis with Pandas and Spark

“Classic” Approach

● Data Engineer
○ Small Data: grep, sed, awk
○ Big Data: Java, Scala, Python, Pig, Hadoop, lately Spark & others
● Data Scientist
○ Small Data: R/RStudio
○ Big Data: No way, José!

Page 10: Data analysis with Pandas and Spark

New Approach

Small Data and Big Data, Data Engineer and Data Scientist alike:

Notebook technologies: Jupyter (most used), Zeppelin, but also lesser-known ones (Rodeo, Beaker)

Page 11: Data analysis with Pandas and Spark

Data analysis with Jupyter, Pandas and Spark

Page 12: Data analysis with Pandas and Spark

Outline

About the data:

● Set of mobile transactions
● Set (separate) of retail transactions

About the tools: Jupyter, Pandas and Spark

Our experience

Future work

Page 13: Data analysis with Pandas and Spark

Datasets

● Mobile transactions
○ Elements of analysis: Transactions
○ We know: Transaction value, User identifier, Merchant
○ We don’t know: What product was bought
○ Size: Hundreds of thousands of entries
○ Status: Building prediction models
● Retail data
○ Elements of analysis: Transactions, Products, Stock data
○ We know: Transaction value, Sold products, Merchant
○ We don’t know: Who the user is
○ Size: Hundreds of millions of entries
○ Status: Gathering data

Page 14: Data analysis with Pandas and Spark

Mobile transactions data

Page 15: Data analysis with Pandas and Spark

Mobile data: Environment

[Diagram: Jupyter notebooks in a Docker container with Anaconda]

● Data sources and storage: SQL database, CSV files, pickle files, other input sources
● Preprocessing notebooks
● Analysis and model testing notebooks: Pandas, R (with rpy2), scikit-learn, custom code
● Stages shown: raw data, diagnostics, cleaning, feature building, models, visualizations

Page 16: Data analysis with Pandas and Spark

Docker image… with Anaconda

● Anaconda: package manager for data science
● Using docker-compose for setting up container parameters
● Many available images
● Our base image:
○ pyspark from Jupyter Docker Stacks
○ Extended with required libraries
● Libraries are added or updated with docker build:
○ Self-contained
○ Easy versioning

Page 17: Data analysis with Pandas and Spark

Jupyter Notebook (1)

Web application for creating documents with live code, explanations and visualizations

● Initially, part of IPython
● Narrative with live code
● Protocol for interactive exploration
○ Run blocks of code
○ Embedded JS
● Executable documents
○ Code
○ HTML and Markdown
○ Metadata
● Kernels for multiple languages
○ Python
○ R
○ Scala
○ Bash
● Internal format: JSON
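
Because the internal format is JSON, a notebook can be built or inspected with any JSON library. The following is a minimal sketch, assuming the nbformat 4 layout; the kernel name and the cell contents are made up for illustration.

```python
import json

# A minimal notebook document, assuming the nbformat 4 layout.
# The kernel name and cell contents are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {
            "name": "python3",
            "display_name": "Python 3",
            "language": "python",
        }
    },
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Transactions overview\n", "Narrative text goes here."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello from a code cell')"],
        },
    ],
}

# Serialize as it would be stored in an .ipynb file, then read it back.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)
cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)
```

Outputs and execution counts live in the same file as the code, which is one reason diffs of notebooks tend to be noisy.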

Page 18: Data analysis with Pandas and Spark

Jupyter Notebook (2)

Web application for creating documents with live code, explanations and visualizations

● Plugins and widgets
● Easy to share (formats: Notebook, PDF, HTML, …)
● Large ecosystem
○ JupyterLab / JupyterHub
○ GitHub visualizations
○ Blog integration
○ Education: teaching, evaluation
○ Microsoft, Google, Bloomberg, IBM, O'Reilly
○ Executable books
● Versioning is complicated

Page 19: Data analysis with Pandas and Spark

Pandas

Python library for data manipulation and analysis

● DataFrame objects
○ Tabular data structures
○ Each column has one data type
● Based on numpy (fast)
● Processing is (mostly) done in memory
● Data manipulation:
○ Hierarchical indexing
○ Reshaping, pivoting, grouping
○ String operations
○ Time series operations
● Reading / writing from / to many formats (CSV, JSON, HDF5, …)
● Visualization: matplotlib, Seaborn, Bokeh, …
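
A short sketch of the grouping, pivoting and time series operations listed on this slide. The column names and values are hypothetical and not taken from the datasets in this talk.

```python
import pandas as pd

# Hypothetical transactions; column names and values are made up.
transactions = pd.DataFrame(
    {
        "merchant": ["A", "A", "B", "B", "B"],
        "date": pd.to_datetime(
            ["2016-01-05", "2016-02-10", "2016-01-20", "2016-02-03", "2016-02-25"]
        ),
        "value": [10.0, 20.0, 5.0, 7.5, 2.5],
    }
)

# Time series operation: derive a monthly period from each timestamp.
transactions["month"] = transactions["date"].dt.to_period("M")

# Grouping: total value per merchant.
total_per_merchant = transactions.groupby("merchant")["value"].sum()

# Pivoting: merchants as rows, months as columns, summed values as cells.
monthly = transactions.pivot_table(
    index="merchant", columns="month", values="value", aggfunc="sum"
)
print(total_per_merchant)
print(monthly)
```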

Page 20: Data analysis with Pandas and Spark

rpy2

Interface between Python and R

● Translates DataFrames between Python and R
● Python in Jupyter: use %%R
● Direct access to R objects (rpy2.robjects)

Page 21: Data analysis with Pandas and Spark

Jupyter, Pandas and R

[Screenshot: notebook combining HTML and Markdown cells, Python code, and R code run with rpy2]

Page 22: Data analysis with Pandas and Spark

Mobile data: User retention

Active users:

● Classic: 1+ transactions in a given period
● Rolling: 1+ transactions in a given or subsequent period

Plots:

● X: period (day, week, month)
● Y (cohort): period or another type of segment
● By transaction criteria (merchant, product, etc.)

Results:

● Response to campaigns
● Activity recurrence

[Figure: cohort plot, with periods on the X axis and cohorts on the Y axis]
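
One possible way to compute such a cohort table with Pandas is sketched below. It covers only the "classic" definition of active users, takes the cohort to be the month of a user's first transaction (an assumption; the slide allows other segment types), and uses a made-up transaction log.

```python
import pandas as pd

# Hypothetical transaction log; user ids and dates are made up.
tx = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u2", "u3", "u1"],
        "date": pd.to_datetime(
            [
                "2016-01-03",
                "2016-02-11",
                "2016-01-20",
                "2016-03-02",
                "2016-02-07",
                "2016-03-15",
            ]
        ),
    }
)

# Period of each transaction (monthly here; day or week work the same way).
tx["period"] = tx["date"].dt.to_period("M")

# Cohort (assumed definition): the month of each user's first transaction.
tx["cohort"] = tx.groupby("user")["date"].transform("min").dt.to_period("M")

# "Classic" active users: distinct users with 1+ transactions in a period,
# laid out with cohorts as rows and periods as columns.
active = tx.groupby(["cohort", "period"])["user"].nunique().unstack(fill_value=0)

# Retention: share of each cohort's users that are active in each period.
cohort_size = tx.groupby("cohort")["user"].nunique()
retention = active.div(cohort_size, axis=0)
print(retention)
```

Each row of `retention` can then be drawn as one line or one heatmap row, with the periods on the X axis.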

Page 23: Data analysis with Pandas and Spark

Mobile data: Correlations

Features:

● How similar are two features?

Merchants:

● Which merchants have common users?

Products:

● Which products are sold together?
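
The merchant question can be answered with a co-occurrence matrix. This is a minimal sketch on hypothetical (user, merchant) pairs, not necessarily the method used for the results in this talk.

```python
import pandas as pd

# Hypothetical (user, merchant) pairs taken from a transaction log.
tx = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u2", "u3", "u3", "u3"],
        "merchant": ["A", "B", "A", "B", "B", "C", "C"],
    }
)

# Binary user x merchant matrix: 1 if the user ever bought from the merchant.
presence = (pd.crosstab(tx["user"], tx["merchant"]) > 0).astype(int)

# Merchant x merchant matrix: entry (i, j) counts the users shared by i and j;
# the diagonal holds each merchant's own user count.
common_users = presence.T.dot(presence)
print(common_users)
```

The same construction applies to products sold together, with transactions in place of users and products in place of merchants. For numeric features, `DataFrame.corr()` gives pairwise correlations.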

Page 24: Data analysis with Pandas and Spark

Mobile data: Clusters

● Group users by behavior
● Identify outliers
● Future: automatic cluster labeling
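
A sketch of both steps with scikit-learn. K-means and the distance-to-center outlier rule are assumptions chosen for illustration (the slides do not name the algorithm), and the per-user features are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic per-user features: [transactions per month, mean transaction value].
rng = np.random.default_rng(0)
low_spenders = rng.normal(loc=[2.0, 10.0], scale=0.5, size=(20, 2))
high_spenders = rng.normal(loc=[10.0, 50.0], scale=0.5, size=(20, 2))
odd_user = np.array([[10.0, 65.0]])  # deliberately away from both groups
features = np.vstack([low_spenders, high_spenders, odd_user])

# Group users by behavior (k-means is an assumed choice of algorithm).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = model.labels_

# Distance from each user to the center of its assigned cluster.
distances = np.linalg.norm(features - model.cluster_centers_[labels], axis=1)

# Flag the user farthest from its cluster center as an outlier candidate.
outlier_index = int(np.argmax(distances))
print(labels, outlier_index)
```

With real features on different scales, standardizing the columns first (for example with `sklearn.preprocessing.StandardScaler`) is usually advisable.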

Page 25: Data analysis with Pandas and Spark

Retail transactions data

Page 26: Data analysis with Pandas and Spark

Retail data: Our experience

First try: Out-of-core processing with HDF5

● Data did not fit in memory
● HDF5: format for large data
● Pandas + HDF5, Blaze, Dask, Odo
● Easy-to-use functions
● Library incompatibilities
● Slow queries; use indexes
● Occasional runtime errors

Page 27: Data analysis with Pandas and Spark

Retail data: Environment

[Diagram: Jupyter notebooks in Docker containers with Spark and Anaconda]

● Data sources and storage: Cassandra, CSV files, Apache Parquet, other input sources
● Preprocessing notebooks
● Analysis and model testing notebooks
○ Large data: Spark ML + scikit-learn
○ Small (selection) data: Pandas, scikit-learn and R
● Stages shown: raw data, diagnostics, cleaning, feature building, models, visualizations
● Part of the diagram is marked "In progress"

Page 28: Data analysis with Pandas and Spark

Spark

Engine for big data processing

● DataFrames
○ Built on top of RDDs
○ Similar to Pandas and R
○ SQL queries
○ Automatic query optimization through query plan
○ String, date-time and statistics functions
○ Group by, filters
● Jupyter integration: work in progress

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Page 29: Data analysis with Pandas and Spark

Spark Machine Learning

MLlib and ML

● MLlib
○ Uses RDDs
○ Summaries, correlations, sampling
○ SVMs, logistic regression, decision trees, ensembles and Naive Bayes
○ Clustering
○ Feature transformation
● ML
○ Works with DataFrames
○ Many wrappers for MLlib
○ Pipelines:
■ Transformers, Estimators, Parameters
■ labelCol, featuresCol, predictionCol, ...
○ R formulas (y ~ x1 + x2)

Page 30: Data analysis with Pandas and Spark

Retail data: Our experience

Current: Spark + Docker

● No issues at current size (several GBs)
● Docker Compose for creating master, workers and Jupyter container (driver)
● ML libraries are easy to work with
● Incomplete Python API for ML (e.g., summaries)
● Documentation needs improvement
● Model diagnostics
○ Some metrics are available
○ Supplement with scikit-learn (example: build ROC curves)
● scikit-learn or R on top of Spark
○ Parallelize parameter search (e.g., grid search)
○ Spark sklearn (github.com/databricks/spark-sklearn): Grid Search
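
Supplementing Spark's metrics with scikit-learn could look like the sketch below. It assumes the labels and predicted scores of a small evaluation sample have already been collected to the driver as arrays; the four values used here are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores, assumed to have been collected
# to the driver from a model's predictions on a small evaluation sample.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve: false positive rate vs. true positive rate
# at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve as a single summary number.
auc = roc_auc_score(y_true, y_score)
print(fpr, tpr, auc)
```

Plotting `fpr` against `tpr` (for example with matplotlib) gives the ROC curve itself.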

Page 31: Data analysis with Pandas and Spark

Future work

Mobile wallet transactions:

● Data fits in memory
● Use Spark for distributing workload

ERP transactions:

● Some data fits in memory, after processing
● Build a web app for data exploration
● Forecast
○ Sales
○ Inventory requirements
● Try Spark Streaming

http://xkcd.com/1425/