SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?

©AE 2012. Bram Vanschoenwinkel, Senior Data Scientist, AE (@bvschoen, @AE_NV). R & Hadoop: The perfect marriage for your analytics? Evening conference (Avondconferentie), 19/06/2014.

description

Think big, act small, start now

Not only the seemingly endless flow of data but also its variety and complexity are typical of the Digital Era. This evolution offers companies the opportunity to gain new and valuable insights. Some examples of analytics:

- A customer segmentation analysis divides customers into several groups based on specific characteristics. This allows us to target them better, offer them tailor-made products and services, or make better use of cross-selling and up-selling opportunities.
- Churn prediction makes it possible to predict, even in real time, which customers are about to leave us. This insight enables us to take proactive action to prevent it.

At the same time we are confronted with new challenges, and we need to change the way we handle data. Big data and analytics are the key to gaining new insights, which organizations can incorporate into their strategic decisions as well as into their operational way of working.

The key question is: how do you start? The answer is simple: build up the basic competences, start today and keep it simple, prove the added value, and add complexity along the way.

During this AE foyer two open source solutions (and market standards), R and Hadoop, will be discussed. We will present their characteristics in detail and illustrate (in an accessible way) how to use them and which quick results you can expect. Furthermore, a realistic reference architecture will be shown, helping you to make the right choices based on your needs and ambitions.

Don’t miss out: discover how you can take advantage of the opportunities of the Digital Era in an innovative and pragmatic manner!

Transcript of SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?

Page 1: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?


Bram Vanschoenwinkel, Senior Data Scientist, AE

@bvschoen @AE_NV

R & Hadoop: The perfect marriage for your analytics?

Evening conference, 19/06/2014

Page 2

Agenda

1. It’s a (R)evolution

2. Intelligent Decision Support in the Digital Age

3. The R Project for Statistical Computing

4. The World of Hadoop

5. Case: A Customer Intelligence Platform

6. Conclusions

Page 3

It’s a (R)evolution

[Chart: DATA VOLUME over TIME (2000, 2010, 2015); MAJORITY UNSTRUCTURED DATA]

Page 4

Abundance of Data

BEYOND

WEB

CRM

ERP

PURCHASE DETAIL

PRODUCTION

PAYMENT DETAIL

PLANNING

CONTACT INFORMATION

LEADS

OFFERS

SEGMENTATION

PROSPECTS

CLICK STREAM DATA

WEB SHOPS

SOCIAL MEDIA

VIDEO

IMAGES

TEXT

ONLINE SERVICES

AUDIO

OPEN DATA

MOBILE DEVICES

INTERNET OF THINGS

RFID

GPS

SENSORS

USER GENERATED CONTENT

SMART DEVICES

SENSORS

REMOTE MONITORING

CLOUD

MEDICAL

WEARABLES

Page 5

Opportunities

OPERATIONAL EXCELLENCE

INNOVATIVE BUSINESS MODELS

INSIGHTS, STRATEGY AND POLICY

Page 6

Challenges

SHORT LIFESPAN OF THE DATA

FAST MOVING DATA

FAST DATA PROCESSING

HIGH VARIETY OF DATA

Page 7

Intelligent Decision Support in the Digital Age

WHAT WE SEE

ABUNDANCE OF HETEROGENEOUS DATA

THE WAY WE INTERACT WITH THE WORLD HAS CHANGED

OPPORTUNITIES

OPERATIONAL EXCELLENCE

BETTER DECISION SUPPORT

INNOVATING BUSINESS MODELS

CHALLENGES

ANALYSIS GAP

VOLUME, VARIETY, VELOCITY

COMPETENCES

Page 8

Decision Support in the Digital Age

Facing the Challenges and realizing the Opportunities

Business Analytics

Big Data

Page 9

Elements of a Holistic Information Management Framework

- Data Sources
- Internal & External
- From Data to Information

- Improving data quality
- Integrality of data
- From Information to Knowledge

Intelligent Decision Support:

- Reporting
- Business Analytics
- From Knowledge to Intelligence

[Pyramid levels: Data, Information, Knowledge, Intelligence, Wisdom/Insight]

Page 10

Decision Support in the Digital Age

“Business Analytics is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.”

Page 11

Business Analytics vs Business Intelligence

Page 12

New Insights

[Figure labels, translated from Dutch "stoppen" (stop/quit): 8 stop, 132 stop, 10 stop, 53 stop, 64 stop, 14 stop, 4 stop, 11 stop]

Page 13

Innovating Business Models

Front-end Application(s)

Security

Analytics (on Hadoop)

Web Click Streaming

Social Media

Connectivity

External Application Integration

Operational Data Processing on Hadoop

Page 14

From Analytics…

Statistics Algorithms

Biology

Psychology

Databases

Page 15

…to Business Analytics

Page 16

Analytics Approach

Analytics

Incremental and iterative

Think big act small

Proof-of-Concept

Open source tools

Architecture & Deployment

(Non-)functional requirements

Information Architecture

Technology

Embedded into operations

Two Phase Approach

Analytics

Architecture Deployment

Page 17

Analytics Churn Prediction Example

Invoicing CRM Call Center Application

John Doe – 43 years – Antwerp – Man – 7 calls – 3 weeks – 30% down invoicing

Jane Dan – 32 years – Brussels – Woman – 2 calls – 12 weeks – 10% up invoicing

…

Operations

CHURN SCORES

REGION

PRODUCT

CHURN SCORES

MANAGEMENT DASHBOARD

OPERATIONS

DATA DUMP

Analytics Engine

Data Warehouse
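
The flow on this slide (customer records in, churn scores out) can be illustrated with a small Python sketch. It parses the two sample records shown above and assigns each a score between 0 and 1. Everything beyond the record fields is an assumption made for illustration: the `CustomerRecord` and `churn_score` names, the reading of "weeks" as the period over which the calls were counted, and above all the hand-picked weights, which stand in for a model that would normally be fitted on historical data.

```python
import math
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    name: str
    age: int
    city: str
    gender: str
    calls: int               # call-center contacts
    weeks: int               # ASSUMPTION: period (in weeks) over which calls were counted
    invoicing_change: float  # percent; negative means invoicing went down


def parse_record(line: str) -> CustomerRecord:
    """Parse one flattened record in the layout shown on the slide."""
    name, age, city, gender, calls, weeks, invoicing = line.split(" – ")
    amount, direction, _ = invoicing.replace("%", " ").split()
    sign = -1.0 if direction == "down" else 1.0
    return CustomerRecord(
        name=name,
        age=int(age.replace("years", "")),
        city=city,
        gender=gender,
        calls=int(calls.replace("calls", "")),
        weeks=int(weeks.replace("weeks", "")),
        invoicing_change=sign * float(amount),
    )


def churn_score(record: CustomerRecord) -> float:
    """Toy logistic score. The weights are HYPOTHETICAL, not a fitted model."""
    z = 1.0 * (record.calls / record.weeks) - 0.05 * record.invoicing_change - 2.0
    return 1.0 / (1.0 + math.exp(-z))


lines = [
    "John Doe – 43years – Antwerp – Man – 7calls – 3weeks – 30%down invoicing",
    "Jane Dan – 32years – Brussels – Woman – 2calls – 12weeks – 10%up invoicing",
]
records = [parse_record(line) for line in lines]
scores = {r.name: churn_score(r) for r in records}
for name, score in scores.items():
    print(f"{name}: churn score {score:.2f}")
```

With these illustrative weights, a customer who calls often and whose invoicing is dropping scores higher than one who calls rarely and whose invoicing is rising. In practice the scores would be aggregated (for example per region or product) before being sent to the dashboard and to operations.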

Page 18

Big Data

“Big data is high-volume, high-velocity, high-complexity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)

Page 19

Four V’s and a C

Not only volume makes big data big, it’s all about the three V’s: High Volume, Variety, Velocity

High Value!

In addition the data is very complex in nature, often unstructured: Text documents, emails, images and videos, etc.

Click stream data, social media feed data, etc.

Page 20

Innovative Forms of Information Processing

Traditional methods don’t suffice anymore.

New forms of information processing have emerged.

DISTRIBUTED DATA STORAGE

COMPUTATION

NoSQL DATA STORES

Page 21

Innovative Forms of Information Processing

Page 22

The R Project for Statistical Computing

R is a dialect of the S language

S was developed by John Chambers and others at Bell Labs

S was initiated in 1976

Now owned by TIBCO and sold under the name S-PLUS

INTERACTIVE, NOT PROGRAMMING

GRADUALLY MOVING INTO

PROGRAMMING WHEN SYSTEM ASPECTS BECOME IMPORTANT

Page 23

Advantages of R

Most widely used data analysis software: created and used by 2M+ data scientists, statisticians and analysts

Most powerful statistical programming language: flexible, extensible & comprehensive for productivity, 4800+ packages

Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data

Thriving open-source community: leading edge of analytics research

Fills the talent gap: new graduates prefer R

Page 24

Drawbacks of R

Steep learning curve
-> Vibrant community to help you

Objects must be stored in physical memory, little thought to memory management
-> Recent advancements to deal with this

Functionality is based on consumer demand and user contributions
-> If a package is useful to many people, it will quickly evolve into a robust product

Documentation is sometimes patchy and terse, and impenetrable to the non-statistician
-> Vibrant community to help you

Page 25

Exploding growth and Demand for R

R is the highest paid IT skill – Dice.com, Jan 2014

R most-used data science language after SQL – O’Reilly, Jan 2014

R is used by 70% of data miners – Rexer, Sep 2013

R is #15 of all programming languages – RedMonk, Jan 2014

R growing faster than any other data science language – KDnuggets, Aug 2013

More than 2 million users worldwide

Page 26

Great Adoption of R by Many Companies

Commercial vendors offering general support and developing specific R-based products, e.g.: Oracle, Revolution Analytics.

Companies using R for advanced statistics and analytics, e.g.: Thomas Cook, Google, Twitter.

Also in the AE customer base we see different companies looking into R as an alternative or complement to the traditional tools.

Page 27

Example Packages

twitteR: Provides an interface to the Twitter web API.

tm: Provides text mining functionality such as word stemming, stopword removal, etc.

wordcloud: Provides methods for producing word clouds in different forms, shapes and colors.

Page 28

Apache Hadoop

Open-source software framework.

Storage and large-scale processing of data on clusters of commodity hardware.

Apache top-level project built and used by a global community.

Two core components:

1. Hadoop Distributed File System (HDFS)

2. MapReduce

Page 29

Apache Hadoop

MapReduce/HDFS based on Google's MapReduce and Google File System.

Other components are:

Hadoop Common – libraries and utilities needed by other Hadoop modules

Hadoop YARN – a resource-management platform

The entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well: Pig, Hive, HBase, …

Created by Doug Cutting and Mike Cafarella at Yahoo in 2005, originally to support distribution for the Apache Nutch search engine project.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.

Page 30

The World of Hadoop

Page 31

Key Properties of Apache Hadoop

Transforms commodity hardware into a service that:
• Stores petabytes of data reliably.
• Allows huge distributed computations.

Key properties:
• Designed for batch processing.
• Write-once-read-many access model for files.
• Extremely powerful.

Scalability:
• Scales linearly with cores and disks.
• Machines can be added and removed from the cluster.
• Write code once, same program runs on 1, 1000, 4000 machines.

Reliable and fault-tolerant:
• Failed tasks/data transfers are automatically retried.
• Data replication, redundancy.

Page 32

A Typical Hadoop Cluster

[Diagram: a Client and a Master Node (Name Node, Job Tracker) connected to Slave Nodes spread over Rack 1, Rack 2 and Rack 3; each slave node runs a Data Node and a Task Tracker with Map and Reduce tasks. Arrow labels: DATA ASSIGNMENT TO NODES, DATA READ, DATA WRITE, METADATA FOR BLOCK INFO, JOB ASSIGNMENT, TASK ASSIGNMENT]

1. Client

2. Master Node: Name Node, Job Tracker

3. Slave Nodes: Data Nodes, Task Trackers, Map / Reduce

Page 33

Hadoop File System (HDFS)

1. Client consults Name Node

2. Client writes block to Data Node

3. Data Node replicates block

4. Cycle repeats for next blocks

[Diagram: a Client writing a FILE to Data Nodes 1 to 9 spread over Rack 1, Rack 2 and Rack 3, coordinated by the Name Node. Arrow labels: DATA ASSIGNMENT TO NODES, DATA READ, DATA WRITE, METADATA FOR BLOCK INFO. Name Node metadata: Rack 1: Data Node 1, Data Node 2, …; Rack 2: Data Node 3, …]
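
The four-step write cycle above can be sketched as a toy, in-memory simulation in Python. This is not HDFS client code: the class names, the tiny block size, the rack layout and the placement rule (first replica on one rack, the remaining replicas on the next rack) are simplifying assumptions chosen for illustration. A replication factor of 3 is the usual HDFS default.

```python
BLOCK_SIZE = 8    # bytes; tiny on purpose, real HDFS blocks are many megabytes
REPLICATION = 3   # usual HDFS default


class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block, then forward it to the next node (step 3)."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])


class NameNode:
    """Keeps only metadata: which nodes hold which block."""

    def __init__(self, racks):
        self.racks = racks        # rack id -> list of DataNode objects
        self.block_map = {}       # (filename, block index) -> node names

    def allocate(self, filename, index):
        # ASSUMED placement rule: one replica on a "local" rack,
        # the other replicas on the next rack.
        rack_ids = sorted(self.racks)
        local = self.racks[rack_ids[index % len(rack_ids)]]
        remote = self.racks[rack_ids[(index + 1) % len(rack_ids)]]
        pipeline = [local[index % len(local)]] + remote[:REPLICATION - 1]
        self.block_map[(filename, index)] = [node.name for node in pipeline]
        return pipeline


def write_file(name_node, filename, data):
    for index, start in enumerate(range(0, len(data), BLOCK_SIZE)):
        block = data[start:start + BLOCK_SIZE]
        pipeline = name_node.allocate(filename, index)                 # step 1
        pipeline[0].write((filename, index), block, pipeline[1:])      # steps 2-3
        # step 4: the loop repeats for the next block


# Hypothetical layout: 3 racks with 3 data nodes each.
racks = {
    f"rack{r}": [DataNode(f"dn{3 * (r - 1) + i}") for i in range(1, 4)]
    for r in (1, 2, 3)
}
name_node = NameNode(racks)
payload = b"the quick brown fox ate the mouse"
write_file(name_node, "demo.txt", payload)

for block_id, holders in sorted(name_node.block_map.items()):
    print(block_id, "->", holders)
```

Note that the Name Node never touches the data itself; it only records where each block lives, which is why the client can later read any block from any of its replicas.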

Page 34

MapReduce

Stages: Input, Splitting, Map, Shuffle/Sort, Reduce, Output

Map:

the, 1; quick, 1; brown, 1; fox, 1

the, 1; fox, 1; ate, 1; the, 1; mouse, 1

how, 1; now, 1; brown, 1; cow, 1

Shuffle/Sort:

the, 1; the, 1; the, 1

fox, 1; fox, 1

quick, 1

brown, 1; brown, 1

ate, 1

mouse, 1

how, 1

now, 1

cow, 1

Reduce:

the, 3

fox, 2

quick, 1

brown, 2

ate, 1

mouse, 1

how, 1

now, 1

cow, 1

Output:

the, 3; fox, 2; quick, 1; brown, 2; ate, 1; mouse, 1; how, 1; now, 1; cow, 1

The Map function processes one line at a time, splits it into tokens separated by whitespace and emits a key-value pair <word, 1> for each token.

The Reduce function just sums up the values, which are the occurrence counts for each key (i.e. each word in this example).
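
The word-count flow on this slide can be mimicked in a few lines of plain Python running on a single machine. This is only a sketch of the map, shuffle/sort and reduce stages, not Hadoop code: the function names are illustrative, and the three input lines are reconstructed from the key-value pairs shown in the Map stage.

```python
from collections import defaultdict


def map_line(line):
    """Map: split a line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/Sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: sum the occurrence counts for one key."""
    return word, sum(counts)


# Input splits, one line each (reconstructed from the slide's Map output).
splits = [
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
]

mapped = [pair for line in splits for pair in map_line(line)]
grouped = shuffle(mapped)
result = dict(reduce_counts(word, counts) for word, counts in sorted(grouped.items()))

for word, count in result.items():
    print(f"{word}, {count}")
```

The resulting counts match the Output stage of the slide. On a real cluster the map and reduce calls run in parallel on different nodes and the framework performs the shuffle; the logic of the two user-written functions stays the same.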

Page 35

Hadoop Distributions

Fully equipped, scalable and flexible cloud solutions.

Different on-premise solutions are also offered.

Choice depends on specific requirements: Data Privacy, Scalability, Security, Data Mastership, Configuration, Flexibility, Price-Performance Ratio, Automation, …

How to get started? Free to download!

Business model is based on training, consulting, support and additional “tooling” (Enterprise Editions).

Many free trial cloud versions available to play around with.

Many tutorials, trainings, blogs, user groups etc.

Page 36

RHadoop

A collection of four R packages that allow users to manage and analyze data with Hadoop:

rmr: Hadoop MapReduce functionality in R

rhdfs: file management of the HDFS from within R

rhbase: database management for the HBase distributed database

plyrmr: a recently released package providing a familiar interface while hiding many of the MapReduce details (like Hive, Pig and Mahout).

R and all RHadoop packages should be installed on all nodes in the Hadoop cluster.

Combining the advantages of R with the power of Hadoop.

Page 37

MapReduce Wordcount Example in R

Map function.

Reduce function.

Reading the input from HDFS: from.dfs().

Writing the results back to HDFS: to.dfs().

Page 38

Case: A Customer Intelligence Platform

* Non Disclosure Agreement: Contact AE via www.ae.be/contact for more information

Page 39

Conclusions

The Digital Age brings many opportunities but also challenges.

Big Data and Analytics can help you face the challenges and realize the opportunities.

It is within anyone’s grasp: do it incrementally and iteratively.

R and Hadoop: Open source software, active user groups and support.

A great way to start exploring!

Combined power gives you the advantage of 1 + 1 = 3.

Sometimes alternatives are better.

Page 40

Conclusions

You don’t always need Big Data to do Analytics; it depends on the requirements.

Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).

Many differences between Hadoop distributions, which are constantly evolving (and getting better).

Need for good Data Scientists in a mixed team of competences to make the right choices.

Page 41

What’s next?

Ask yourself the following questions:

What opportunities do I see for myself?

What strategic and competitive advantages can I realize?

Is Analytics the right solution for me? Do I need Big Data?

What about my Data Warehouse environment?

And what about the quality of my operational data?

Do I have the right infrastructure in place?

Do I have the right competences in house?

Now you should know what’s in it for you, but also the challenges you will most probably be facing.

Page 42

What’s next?

You have a case you would like to discuss…?

You have any questions…?

Please feel free to contact me: Bram Vanschoenwinkel

[email protected]

+32 478 741738

@bvschoen

be.linkedin.com/in/bramvanschoenwinkel/

Page 43

@bvschoen / @ae_nv

www.ae.be

blog.ae.be