www.dataiku.com
Machine Learning On
Dirty Data
Dataiku in short
Software vendor behind Data Science Studio, the « Photoshop for Data Science »
Our objective: to make data science accessible to all types of profiles
• They build applications with their data:
  – Predicting parking spot availability
  – Analysis of web activity and behaviour segmentation
  – Customer churn anticipation and marketing activation
  – Preventive maintenance and reduction of the impact of equipment breakdowns
  – Fraud detection
  – …
• They shorten their innovation cycles:
  – DSS lowers their barriers to entry and makes it easy to retrain internal teams
  – Standardisation of practices and reduction of the number of tools necessary
  – Easy collaboration between data analysts, business experts, and IT engineers on one platform
Our clients
Turn Device Logs Into Next Year's Business
Parking ticket machine data
OpenStreetMap data
Cleaning and enrichment of data
Crossing data
Data Science Studio
Creation of a predictive algorithm
Availability of the predictions
Each street is segmented into small pieces that are enriched with geospatial information.
The parking ticket history is joined with the points of interest from OpenStreetMap.
The availability of parking spots is predicted per street segment from the joined data.
The algorithm is finally integrated into the iPhone app « Find me a space ».
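The join and prediction steps on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the segment identifiers, the sample figures, the `capacity` table, and the use of tickets sold as a rough proxy for occupancy. The baseline is a plain historical average, not the algorithm actually built in Data Science Studio.

```python
from collections import defaultdict

# Hypothetical ticket machine history: (segment_id, hour_of_day, tickets_sold)
ticket_history = [
    ("seg-1", 9, 12), ("seg-1", 9, 8), ("seg-1", 14, 3),
    ("seg-2", 9, 2), ("seg-2", 14, 1),
]
# Hypothetical OpenStreetMap points of interest, already mapped to segments
poi_by_segment = {"seg-1": ["bakery", "school", "bank"], "seg-2": ["park"]}
# Hypothetical number of parking spots per segment
capacity = {"seg-1": 10, "seg-2": 5}


def build_features(history, pois):
    """Join ticket history with POI counts: one row per (segment, hour)."""
    tickets = defaultdict(list)
    for segment, hour, sold in history:
        tickets[(segment, hour)].append(sold)
    return {
        key: {
            "mean_tickets": sum(values) / len(values),
            "poi_count": len(pois.get(key[0], [])),
        }
        for key, values in tickets.items()
    }


def predict_availability(features, capacity, segment, hour):
    """Baseline: share of free spots = 1 - mean tickets / capacity, in [0, 1]."""
    occupancy = features[(segment, hour)]["mean_tickets"] / capacity[segment]
    return max(0.0, min(1.0, 1.0 - occupancy))


features = build_features(ticket_history, poi_by_segment)
```

A call such as `predict_availability(features, capacity, "seg-2", 9)` returns the estimated share of free spots for that segment and hour. The `poi_count` column is not used by this baseline; it stands in for the enrichment features a trained model would consume.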
Predictive Monitoring for Search Engine Relevance
Web logs
User searches
Data cleaning and enrichment
Customized algorithm
Algorithm used with all data
The Data Science team of PagesJaunes identifies unsuccessful searches and trains a customized algorithm.
Words within the requests are analysed by the studio.
Web logs with clicks and bounce rates are imported into the studio.
Long-term monitoring of unsuccessful searches
Web logs are enriched (time spent on the website per user, location, etc.)
Erwan Pigneul, Project Manager, PagesJaunes
Dataiku’s technology enabled us to rationalise our work thanks to machine learning on millions of searches. The process is optimized: we know what to do and how to do it.
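The monitoring step on this slide, spotting unsuccessful searches from click and bounce data, could look like the sketch below. The log layout, the sample queries, and the `BOUNCE_SECONDS` threshold are all hypothetical; the rule used here (no click, or a very short visit) is one plausible definition, not necessarily the one PagesJaunes uses.

```python
from collections import Counter

# Hypothetical enriched log rows: (query, clicked_a_result, seconds_on_site)
search_log = [
    ("plumber paris", True, 95),
    ("plombier pas cher", False, 4),
    ("plombier pas cher", False, 6),
    ("dentist lyon", True, 3),
    ("dentist lyon", True, 120),
]

BOUNCE_SECONDS = 10  # assumed threshold below which a visit counts as a bounce


def is_unsuccessful(clicked, seconds_on_site):
    """A search fails if nothing was clicked or the user left almost at once."""
    return (not clicked) or seconds_on_site < BOUNCE_SECONDS


def failure_rates(log):
    """Return the share of unsuccessful searches for each distinct query."""
    total, failed = Counter(), Counter()
    for query, clicked, seconds in log:
        total[query] += 1
        if is_unsuccessful(clicked, seconds):
            failed[query] += 1
    return {query: failed[query] / total[query] for query in total}


rates = failure_rates(search_log)
```

Queries with a high failure rate are the ones worth tracking over time and feeding into a customized relevance model.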
Optimizing Last Mile with Data Science Studio
Data Science Studio
Historical delivery and retrieval data
Modeling of a score for each delivery
Cleaning and temporal enrichment of data
Data aggregation by geographic location
Incorporation of new deliveries to the existing model
Predictive Model To Optimize Restaurant Pages
Data Science Studio
Restaurant data (place, type…)
Scoring of a restaurant’s page parameters in terms of customer satisfaction
Centralizing the data
User feedback (comments, length…)
Traffic logs (visits, clicks, time…)
Analysis and modeling
Increase website traffic by optimizing the correct parameters
Cleaning and Enriching
Create value with data-driven applications
DATA IN
ENRICH / COMBINE / COMPUTE
VALUE OUT
Churn
Volume Forecast
Recommender
Segmentation
Lifetime Value
Risk Score
Hot Location
Pricing
Ranking
Fraud
Event Paths
A MODEL: an automated way to make a computer take a decision from raw (historical) data
The model can be used to take immediate (real-time) actions through an API
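As a rough illustration of a model turning a raw record into an immediate decision, the sketch below wraps a small churn score in a request handler that takes and returns JSON. The feature names, weights, threshold, and action labels are invented for the example, and `handle_request` is a hypothetical function, not a Data Science Studio API.

```python
import json
import math

# Hypothetical coefficients, as if fitted offline on historical data
WEIGHTS = {"days_since_last_login": 0.08, "support_tickets": 0.5}
BIAS = -3.0
THRESHOLD = 0.5  # assumed decision threshold


def churn_probability(record):
    """Logistic score computed from the raw fields of one customer record."""
    z = BIAS + sum(w * float(record.get(name, 0)) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(body):
    """Take a JSON request body and return a JSON decision."""
    record = json.loads(body)
    probability = churn_probability(record)
    action = "send_retention_offer" if probability >= THRESHOLD else "no_action"
    return json.dumps({"churn_probability": round(probability, 3), "action": action})


response = handle_request('{"days_since_last_login": 40, "support_tickets": 2}')
```

In a real deployment the same function body would sit behind an HTTP endpoint, so that the calling application receives the decision at request time.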
Multiple Data Sources
Analyst Team
Many Models
CRM
Logs
2015: BUILD YOUR FACTORY
Server Cluster
Light Software
Personalised Experience Model
Acquisition Cost Opportunity Model
Stock Optimisation Model
Optimize Delivery
But …
“Data Science Superstars are really hard to hire.”
“I spend too much time cleaning up my data with inappropriate tools.”
“Our models are quite difficult to set up so they are rarely deployed into production.”
“There is too much plumbing involved in making all these Big Data technologies work together and then in successfully deploying applications with them.”
Data Science Studio
A studio for all your data-driven applications
Load and prepare your data
Analyse and build your models
Publish and run your projects
For all profiles
Collaborative
Open and controlled
Data preparation
• Connect to all your data sources
• Explore them visually
• Transform and enrich them interactively
• Save your ‘recipes’ and reuse them later
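The idea of a saved, reusable 'recipe' can be sketched as an ordered list of cleaning steps applied to every row. This is only an analogy written in plain Python: the step functions, the column names, and the sample rows are hypothetical, and the actual recipe format in Data Science Studio is not shown in this deck.

```python
def strip_whitespace(row):
    """Trim leading and trailing spaces from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def normalise_city(row):
    """Give city names a consistent capitalisation."""
    row = dict(row)
    row["city"] = row.get("city", "").title()
    return row


def parse_amount(row):
    """Turn '12,5' style strings into floats; unparseable values become None."""
    row = dict(row)
    raw = str(row.get("amount", "")).replace(",", ".")
    try:
        row["amount"] = float(raw)
    except ValueError:
        row["amount"] = None
    return row


# The 'recipe' is just the ordered list of steps, so it can be stored and reused
RECIPE = [strip_whitespace, normalise_city, parse_amount]


def apply_recipe(rows, recipe):
    """Run every step of the recipe, in order, on every row."""
    cleaned = []
    for row in rows:
        for step in recipe:
            row = step(row)
        cleaned.append(row)
    return cleaned


dirty = [
    {"city": "  paris ", "amount": "12,5"},
    {"city": "LYON", "amount": "n/a"},
]
clean = apply_recipe(dirty, RECIPE)
```

Because the recipe is data rather than a one-off script, the same `RECIPE` list can be applied unchanged to a new extract of the same source.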
Analyse and model
• Discover correlations and significant variables
• Easily build your first models in a visual interface
• Test and improve several models alongside one another
• Deploy the models’ results directly inside your infrastructure
Deploy into production
• Go quickly from prototypes to large scale production
• Manage data inputs and outputs from the interface
• Export and publish your results in several forms
• Control the updates with options such as scheduling, partitions, and replications…
Collaborative work
• Enjoy a web interface and a shared platform
• Organise your work by projects and by teams
• Reuse the team’s work at any time
• Make sure everyone is always on the same page: share insights, graphics, comments, etc. with your team
Open and controlled
• Take advantage of open source technologies such as Hadoop, IPython, scikit-learn, R…
• Integrate your own libraries and scripts
• Keep the data safe in your own infrastructure
• Keep your innovations under control: algorithms and predictions belong to you
http://www.dataiku.com/dss/trynow/
Dataiku HQ
2 rue Jean Lantier
75001 Paris France
Dataiku West
2423A Durant Avenue
Berkeley, CA 94704
Florian [email protected]
You have ideas
“My data is too dirty. I don’t even know where to start.”
“We could probably understand our users better. But how?”
“There’s a trend here, but our full historical data is just too big”
You have data
You need a tool
ANNEXES
A predictive application?
Data + Algorithms (Machine Learning) = Predictions
Requirements:
Collection, Preparation, Crossing
Knowledge, Iterations, Calculation
Industrialization, Deployment