Machine Learning on dirty data - Dataiku - Forum du GFII 2014


Transcript of Machine Learning on dirty data - Dataiku - Forum du GFII 2014

Page 1: Machine Learning on dirty data - Dataiku - Forum du GFII 2014

www.dataiku.com

Machine Learning On Dirty Data

Page 2

Dataiku in short

Software publisher behind Data Science Studio, the “Photoshop for Data Science”

Our objective: to make data science accessible to all types of profiles

Page 3

Our clients

• They build applications with their data:
  – Predicting parking spot availability
  – Web activity analysis and behavioural segmentation
  – Customer churn anticipation and marketing activation
  – Preventive maintenance and reduction of the impact of equipment breakdowns
  – Fraud detection
  – …

• They shorten their innovation cycles:
  – DSS lowers their entry barriers and makes it easy to retrain internal teams
  – Standardisation of practices and reduction of the number of tools necessary
  – Easy collaboration between data analysts, business experts, and IT engineers on one platform

Page 4

Turn Device Logs Into Next Year's Business

Parking ticket machine data

OpenStreetMap data

Data Science Studio

• Cleaning and enrichment of data: each street is segmented into small pieces that are enriched with geospatial information.

• Crossing data: the parking ticket history is joined with the points of interest from OpenStreetMap.

• Creation of a predictive algorithm: the availability of parking spots is predicted per street segment from the joined data.

• Availability of the predictions: the algorithm is finally integrated in the iPhone app “Find me a space”.
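As a rough illustration of the crossing step on this slide, the sketch below joins a parking ticket history with OpenStreetMap-style points of interest on a street segment identifier, then derives a naive availability estimate per segment and hour. The field names (segment_id, hour, kind), the sample records, the capacity table, and the estimation rule are all assumptions made for illustration; the slides do not describe the features or the algorithm actually used.

```python
from collections import defaultdict

# Hypothetical sample data: none of these records or field names come from the slides.
tickets = [
    {"segment_id": "s1", "hour": 9},
    {"segment_id": "s1", "hour": 9},
    {"segment_id": "s1", "hour": 14},
    {"segment_id": "s2", "hour": 9},
]
points_of_interest = [
    {"segment_id": "s1", "kind": "restaurant"},
    {"segment_id": "s1", "kind": "school"},
    {"segment_id": "s2", "kind": "park"},
]
capacity = {"s1": 4, "s2": 2}  # assumed number of spots per street segment


def enrich_segments(tickets, pois):
    """Join ticket history and points of interest on the segment identifier."""
    segments = defaultdict(
        lambda: {"tickets_by_hour": defaultdict(int), "poi_kinds": set()}
    )
    for ticket in tickets:
        segments[ticket["segment_id"]]["tickets_by_hour"][ticket["hour"]] += 1
    for poi in pois:
        segments[poi["segment_id"]]["poi_kinds"].add(poi["kind"])
    return segments


def estimated_availability(segments, capacity, segment_id, hour):
    """Naive estimate: share of spots not covered by a ticket bought at that hour."""
    sold = segments[segment_id]["tickets_by_hour"][hour]
    free = max(capacity[segment_id] - sold, 0)
    return free / capacity[segment_id]


segments = enrich_segments(tickets, points_of_interest)
print(estimated_availability(segments, capacity, "s1", 9))
```

A real model would presumably learn from the joined features (ticket counts, nearby points of interest, time of day) rather than apply a fixed rule; the point here is only the shape of the join.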

Page 5

Predictive Monitoring for Search Engine Relevance

Web logs

User searches

Data cleaning and enrichment

Customized algorithm

Algorithm used with all data

Long-term monitoring of unsuccessful searches

Web logs with clicks and bounce rates are imported into the studio.

Web logs are enriched (time spent on the website per user, location, etc.).

Words within the search queries are analysed by the studio.

The Data Science team of PagesJaunes identifies unsuccessful searches and trains a customized algorithm.

“Dataiku’s technology enabled us to rationalise our work thanks to machine learning on millions of searches. The process is optimized: we know what to do and how to do it.”

Erwan Pigneul, Project Manager, PagesJaunes
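To make the idea of monitoring unsuccessful searches more concrete, here is a minimal sketch that flags searches as unsuccessful from click and bounce information and computes a failure rate per query word. The log fields, the sample records, and the rule "no click or a bounce means failure" are assumptions for illustration only; the slides do not say how PagesJaunes defines an unsuccessful search or which algorithm it trains.

```python
from collections import Counter

# Hypothetical log records; the fields and the definition of "unsuccessful" are assumptions.
search_logs = [
    {"query": "plumber paris", "clicks": 2, "bounced": False},
    {"query": "plumber paris", "clicks": 0, "bounced": True},
    {"query": "dentist lyon", "clicks": 0, "bounced": True},
    {"query": "dentist lyon", "clicks": 0, "bounced": False},
    {"query": "florist nice", "clicks": 1, "bounced": False},
]


def is_unsuccessful(record):
    """Assumed rule: a search fails if nothing was clicked or the user bounced."""
    return record["clicks"] == 0 or record["bounced"]


def failure_rate_by_word(logs):
    """Share of unsuccessful searches among the searches containing each word."""
    total, failed = Counter(), Counter()
    for record in logs:
        for word in set(record["query"].split()):
            total[word] += 1
            if is_unsuccessful(record):
                failed[word] += 1
    return {word: failed[word] / total[word] for word in total}


rates = failure_rate_by_word(search_logs)
for word, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{word}: {rate:.2f}")
```

Tracking such rates over time is one simple way to approximate the long-term monitoring mentioned on the slide.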

Page 6

Optimizing Last Mile with Data Science Studio

Data Science Studio

Historical delivery and retrieval data

Modeling of a score for each delivery

Cleaning and temporal enrichment of data

Data aggregation by geographic location

Incorporation of new deliveries to the existing model


Page 7

Predictive Model To Optimize Restaurant Pages

Data Science Studio

Restaurant data (place, type…)

User feedback (comments, length…)

Traffic logs (visits, clicks, time…)

Centralizing the data

Cleaning and Enriching

Analysis and modeling

Scoring of a restaurant’s page parameters in terms of customer satisfaction

Increase website traffic by optimizing the correct parameters

Page 8

Create value with data driven applications


DATA IN

ENRICH / COMBINE / COMPUTE

VALUE OUT

Page 9

Churn

Volume Forecast

Recommender

Segmentation

Lifetime Value

Risk Score

Hot Location

Pricing

Ranking

Fraud

Event Paths

A MODEL: an automated way to make a computer take a decision from raw (historical) data

The model can be used to take immediate (real-time) actions through an API
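A minimal sketch of that definition, under stated assumptions: a decision rule is fitted on historical data, then wrapped in a function that stands in for a real-time API call. The churn example, the single feature (days since last purchase), the sample history, and the threshold-fitting rule are all hypothetical; a production model would normally use a proper machine learning library and a real web service.

```python
# Hypothetical history: (days since last purchase, did the customer churn?).
history = [(3, False), (10, False), (25, False), (40, True), (55, True), (70, True)]


def fit_threshold(history):
    """Pick the cut-off that misclassifies the fewest historical examples."""
    candidates = sorted({days for days, _ in history})

    def errors(threshold):
        return sum((days >= threshold) != churned for days, churned in history)

    return min(candidates, key=errors)


def make_decision_endpoint(threshold):
    """Return a function standing in for a real-time API call (an assumption)."""

    def decide(request):
        at_risk = request["days_since_last_purchase"] >= threshold
        return {"churn_risk": at_risk, "action": "send_offer" if at_risk else "none"}

    return decide


threshold = fit_threshold(history)
decide = make_decision_endpoint(threshold)
print(threshold, decide({"days_since_last_purchase": 60}))
```

The split between fitting (done once on historical data) and deciding (done per request) mirrors the two sentences on the slide.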

Page 10

2015: BUILD YOUR FACTORY

Multiple Data Sources: CRM, Logs

Analyst Team

Server Cluster

Light Software

Many Models:
• Personalised Experience Model
• Acquisition Cost Opportunity Model
• Stock Optimisation Model
• Optimize Delivery

Page 11

But …

“Data Science Superstars are really hard to hire.”

“I spend too much time cleaning up my data with inappropriate tools.”

“Our models are quite difficult to set up so they are rarely deployed into production.”

“There is too much plumbing involved in making all these Big Data technologies work together and then in successfully deploying applications with them.”

Page 12

Data Science Studio

A studio for all your data-driven applications

Load and prepare your data

Analyse and build your models

Publish and run your projects

For all profiles

Collaborative

Open and controlled

Page 13

Data preparation

• Connect to all your data sources

• Explore them visually

• Transform and enrich them interactively

• Save your ‘recipes’ and reuse them later

Page 14

Analyse and model

• Discover correlations and significant variables

• Easily build your first models in a visual interface

• Test and improve several models alongside one another

• Deploy the models’ results directly inside your infrastructure

Page 15

Deploy into production

• Go quickly from prototypes to large scale production

• Manage data inputs and outputs from the interface

• Export and publish your results in several forms

• Control the updates with options such as scheduling, partitions, and replications…

Page 16

Collaborative work

• Enjoy a web interface and a shared platform

• Organise your work by projects and by teams

• Reuse the team’s work at any time

• Make sure everyone is always on the same page: share insights, graphics, comments, etc. with your team

Page 17

Open and controlled

• Take advantage of open source technologies such as Hadoop, IPython, scikit-learn, R…

• Integrate your own libraries and scripts

• Keep the data safe in your own infrastructure

• Keep your innovations under control: algorithms and predictions belong to you

Page 18

http://www.dataiku.com/dss/trynow/

Dataiku HQ

2 rue Jean Lantier

75001 Paris France

Dataiku West

2423A Durant Avenue

Berkeley, CA 94704

Florian [email protected]

You have ideas

“My data is too dirty. I don’t even know where to start.”

“We could probably better understand our users. But how?”

“There’s a trend here, but our full historical data is just too big.”

You have data

You need a tool

Page 19

ANNEXES

Page 20

A predictive application?

Data + Algorithms (Machine Learning) = Predictions

Collection, Preparation, Crossing

Knowledge, Iterations, Calculation

Industrialization, Deployment

Requirements: