Cloudera Toolkit (Dark) 2018 · Notable Custom Programs 3 of the top 5 US Banks The largest US...
© Cloudera, Inc. All rights reserved.
DATA SCIENCE WAVES
?
Source : https://blog.exploratory.io/data-science-by-you-dawn-of-third-wave-e89f2999d994
GARTNER : DEMOCRATIZED BY AUGMENTED ML
Democratized AI will be one of the major trends shaping our future technologies.
DEMOCRATIZATION ALREADY?
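+38,000 people
High school students now solve ML problems at the level of researchers three years ago…
Source: https://www.youtube.com/watch?v=ZZXnecufXPU
Deep learning development cost < the price of a pair of shoes: Caffe installation 10 yuan, 5 yuan per CNN layer, 8 yuan per RNN layer
Source: Zhongguancun, China (China's Silicon Valley)
+23,000 people
AI KOREA (Deep Learning)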
ACADEMY & OSG : AUTOMATIC MACHINE LEARNING
Reference: Efficient and Robust Automated Machine Learning, Feurer et al., Advances in Neural Information Processing Systems 28 (NIPS 2015).
• Coding style similar to scikit-learn, which is widely used in the data scientist community
• Automatically searches the parameter search space
• Wraps algorithms whose interfaces differ from one CRAN package to another
• Includes a wide variety of algorithms (>160)
• (Semi-)automated, yet enables far more efficient analysis work than before:
  • Pre-processing (missing values, transformations, etc.) and post-processing
  • Hyper-parameter tuning
  • Data observed during modeling, such as learning curves
Reference: https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html
Auto-Keras
Reference: efficient neural architecture search (https://arxiv.org/abs/1806.10282)
[Network morphing process] [Performance of the automated NN]
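The automatic parameter search named on this slide can be illustrated with a plain random search. This is a minimal sketch in standard-library Python, not the API of auto-sklearn, mlr, or Auto-Keras. The search space, the function `validation_error`, and its optimum are hypothetical stand-ins for "train a model and measure its validation error".

```python
import math
import random

# Hypothetical search space: each parameter name maps to a sampler.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, 0),  # log-uniform in [1e-4, 1]
    "max_depth": lambda rng: rng.randint(1, 10),            # integer in [1, 10]
}

def validation_error(params):
    """Toy objective (assumed for illustration only).

    It is lowest near learning_rate = 0.01 and max_depth = 5.
    """
    lr_term = (math.log10(params["learning_rate"]) + 2) ** 2
    depth_term = (params["max_depth"] - 5) ** 2 / 25
    return lr_term + depth_term

def random_search(objective, space, n_trials=200, seed=0):
    """Sample n_trials configurations and keep the one with the lowest error."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(validation_error, SEARCH_SPACE)
print(best_params, best_score)
```

AutoML tools typically replace the blind sampling loop with smarter strategies (for example Bayesian optimization), but the interface idea is the same: the user supplies data and a search space, and the tool returns the best configuration found.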
ENTERPRISE NEEDS : SCALING DATA SCIENCE
The small pool of data scientists and the large amount of time needed to research, construct, and deploy models leave many businesses unable to deliver time-sensitive projects quickly.
[Chart: Predictive Algorithm Demand vs. Supply of Internal Resources over Time; the gap between them is the Unmet Demand for Data Science]
HIGH COSTS
HIGH TURNOVER
SLOW, COSTLY INTEGRATION
FEWER INSIGHTS
DATA SCIENTIST NEEDS : AGONIES
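CRISP-DM methodology
① Iterations
  o Endless repetition
  o Stopping condition
★ Accuracy vs. explainability
  o Explanations that business users can understand
  o The more complex the model, the more accurate it is
★ Open-ended
  o Acquiring additional data
  o Which data to acquire
② Comprehensive
  o All analysis models (<4)
  o Parameter tuning
③ Re-training
  o Growing error
  o Data changes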
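The re-training concern (error that grows as the data changes) is often handled with a simple monitoring rule. A minimal sketch, assuming a hypothetical `RetrainMonitor` class, a rolling window of recent absolute errors, and a tolerance factor chosen purely for illustration:

```python
from collections import deque

class RetrainMonitor:
    """Flags re-training when the recent mean error exceeds the baseline by a factor."""

    def __init__(self, baseline_error, window=50, tolerance=1.5):
        self.baseline_error = baseline_error  # error measured at deployment time
        self.tolerance = tolerance            # allowed growth factor (assumed value)
        self.errors = deque(maxlen=window)    # rolling window of recent absolute errors

    def observe(self, y_true, y_pred):
        """Record the absolute error of one prediction once the actual value is known."""
        self.errors.append(abs(y_true - y_pred))

    def needs_retraining(self):
        """Return True only when the window is full and the mean error has grown too much."""
        if len(self.errors) < self.errors.maxlen:
            return False
        recent = sum(self.errors) / len(self.errors)
        return recent > self.tolerance * self.baseline_error

monitor = RetrainMonitor(baseline_error=1.0, window=3)

# Errors close to the baseline: no re-training needed.
for y_true, y_pred in [(10, 9.5), (10, 9.0), (10, 9.2)]:
    monitor.observe(y_true, y_pred)
stable = monitor.needs_retraining()

# The data has shifted and errors have grown: re-training is flagged.
for y_true, y_pred in [(10, 7.0), (10, 6.5), (10, 6.0)]:
    monitor.observe(y_true, y_pred)
drifted = monitor.needs_retraining()
```

The window size and tolerance trade off reaction speed against false alarms; both would need tuning for a real workload.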
WHY DATAROBOT FOR DEMOCRATIZATION
Confidential | Copyright © DataRobot, Inc. | All Rights Reserved
“DataRobot is a machine learning platform for analysts and data scientists to build and deploy accurate predictive models in a fraction of the time it used to take.”
DATAROBOT'S ANSWER
More productive data scientists, and wider adoption of AI across the enterprise
Hacking Skills
Math & Stats
Domain Expertise
Do much more with little to no coding + expanded modeling toolkit
The world’s most advanced Automated Machine Learning platform
$220M+ IN FUNDING
200+ DATA SCIENTISTS & ENGINEERS (OF 450+)
750,000,000+ MODELS BUILT ON DATAROBOT CLOUD
50+ TOP 3 FINISHES
4 #1 RANKED DATA SCIENTISTS
FOUNDED 2012, HQ in Boston, MA
INSURANCE & BANKING, HEALTHCARE, FINTECH, ON-DEMAND SERVICES, MANY MORE
Notable Custom Programs
3 of the top 5 US Banks
The largest US for-profit Healthcare System
The largest US Supermarket chain
The largest US Pharmacy chain
The world’s largest Retailer
The world’s largest Auto Manufacturer
3 of the top 5 global Reinsurers
2 of the world’s largest Biotech companies
Global Telecommunication companies
Major League Baseball teams
Federal & Public Sector
Customer Success Tuned for Enterprise Customers
DATAROBOT SOLUTION FEATURES (1/4)
Accumulated analytics knowledge and technology
The top ranked Data Scientists in the world
Jeremy Achin, CEO & Co-Founder (MASTER), Highest: 20th
Xavier Conort, Chief Data Scientist (MASTER), Highest: 1st
Tom de Godoy, CTO & Co-Founder (MASTER), Highest: 20th
Owen Zhang, Product Advisor (MASTER), Highest: 1st
Sergey Yurgenson, Data Scientist (MASTER), Highest: 1st
Amanda Schierz, Data Scientist (MASTER), Current: 1st Female, 1st in UK
The best technologies in the world
DATAROBOT SOLUTION FEATURES (2/4)
Automated analysis: business users can also build and use predictive models
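DATAROBOT SOLUTION FEATURES (3/4)
Explainability: data-driven explanations are provided for each and every algorithm
[Feature Impact]
• How does the importance of each variable differ?
• Does the importance ranking agree with business knowledge?
• Are there new insights?
[Feature Effect]
• How does each variable relate to the target?
• Does the functional relationship reflect business knowledge?
• Are there new insights?
[Prediction Explanation]
• On what basis is each prediction generated?
• Can the model's predictions be trusted?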
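One common model-agnostic way to answer the Feature Impact questions is permutation importance: shuffle one feature column and measure how much the error grows. The sketch below is a generic illustration in standard-library Python and makes no claim about how DataRobot computes its scores. The toy model, the synthetic data, and the function names are assumptions made for the example.

```python
import random

def mean_squared_error(model, rows, targets):
    """Average squared difference between model predictions and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_features, seed=0):
    """Return, per feature, the increase in error after shuffling that feature."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mean_squared_error(model, shuffled, targets) - baseline)
    return importances

# Synthetic data (illustration only): the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random(), data_rng.random()) for _ in range(200)]
model = lambda r: 5.0 * r[0] + 0.5 * r[1]
targets = [model(r) for r in rows]

importances = permutation_importance(model, rows, targets, n_features=3)
print(importances)
```

The resulting ranking can then be compared with business knowledge, which is the check the slide asks for: a feature that ranks high but makes no business sense is a signal to inspect the data for leakage.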
DATAROBOT SOLUTION FEATURES (4/4)
Integration via API
Application server
Prediction worker
Notebook
REST API, R/Python
The API enables a wide range of analysis-related tasks:
Model Factory
Automatic Model Refresh
Model Diags & Viz
Feature Engineering
App. Integration
Cloud
Hadoop
Web UI
Console
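Integration over a REST API usually means an application posts rows as JSON to a prediction server and reads scores back. The sketch below only builds such a request with the Python standard library and does not send it. The host name, the `/predictions` path, the bearer-token header, and the payload layout are all hypothetical; the real endpoint and authentication scheme must come from the product's API documentation.

```python
import json
import urllib.request

def build_prediction_request(base_url, api_token, rows):
    """Build (but do not send) a JSON POST request for scoring rows.

    The URL path, headers, and payload layout are assumptions for illustration.
    """
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predictions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )

request = build_prediction_request(
    "https://prediction-server.example.com",  # placeholder host
    "YOUR_API_TOKEN",                         # placeholder credential
    [{"loan_amount": 10000, "term_months": 36}],
)

# Sending would look like this (left commented out because the host is a placeholder):
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)
```

Keeping request construction separate from sending makes the integration easy to unit-test without a live server.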
CLOUDERA & DATAROBOT
Data Science Workbench
DataRobot
CDSW gives data scientists easy and secure access to the data and distributed processing provided by Cloudera Enterprise. Data scientists can develop models in Python, R, or Scala without having to worry about the details of Hadoop and Spark. The focus is on the coding data scientist. CDSW can also leverage libraries available in DataRobot.
DataRobot offers an automated machine learning platform that empowers users of all skill levels to make better predictions faster. The integration with DataRobot lets CDSW users either build models manually in R and Python or use the machine learning automation in DataRobot, all from the same workbench. The focus is on both business analysts and data scientists.
CLOUDERA & DATAROBOT INTEGRATION
DataRobot has the highest level of integration with Cloudera
Cloudera Parcels: A few clicks to install DataRobot from Cloudera Manager
Cloudera CSDs: Can use all the functionality of Cloudera Manager (monitoring, resource management, …)
Kerberos / Sentry: Secured authentication
YARN: All the resources consumed by DataRobot are managed by YARN
Spark: DataRobot uses Spark for Hadoop scoring
Fault Detection
Heavy industry manufacturing
Challenge: Reducing the need for human inspection in processes that are difficult to control
Data: Grinding, hitting, etc.; especially effective in processes where physics modeling is difficult
Results: Accurate alerts when products are likely to have faults. The model is refreshed hourly and deployed immediately to reflect changes in machine settings.
“Extremely high accuracy and highly automated process only possible with DataRobot”
- SI vendor working to implement the system at a heavy industry manufacturer
Predictive Maintenance
Gas utility company
Challenge: Optimizing maintenance cost by predicting failing assets
Data: Age, construction materials, text description, location, power grid, previous repairs, etc.
Results: Allowed this energy company to predict most incidents that were unrelated to weather (chance)
The ability to predict failing assets reduces the need for human inspection
Sales Forecasting
International Retail
Challenge: Preventing opportunity loss while minimizing excess production
Data: Time series sales data for thousands of products
Results: Allowed forecasting of all products, not just a few
Accuracy over 80% for over 70% of products
As good as human expert prediction
Insurance Underwriting
GLOBAL REINSURANCE COMPANY
Identifying 10% of customers with 5x higher than average mortality risk
Claim cost savings by rejecting riskiest patients
Identifying simple underwriting rules to segment patients
● Replaced inaccurate & hard-to-maintain medical expert rules
More accurate results achieved in 4 hours vs. 2 weeks; 85% vs. 64% (AUC)
Portfolio ROI = $10M per year
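The accuracy comparison on this slide is reported as AUC. For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal standard-library sketch follows; the labels and scores are made-up toy values, not data from the case study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 3 positives and 3 negatives, so 9 pairs in total.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # 7 of the 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a move from 64% to 85% is a large improvement for a risk model.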
Fraud Detection
Identifying claim fraud to support payments
Potential to greatly reduce large claim costs
More accurate models built faster
REST API: faster, simpler deployment
“We’ve looked at just about every viable vendor in this space & we have not seen anyone do what DataRobot can do.”
- SVP of Technology Innovation
Customer Churn
Targeting customers likely not to renew their next contract
Potential $10M in additional revenue
Increased accuracy in targeting high churn risk customers
Better identification of customers who can be persuaded to stay
Faster data analysis
“We cannot find or pay for the data scientists necessary to accomplish our goals, but with DataRobot we can get there.”
- SVP of Data Analytics
LIVE DEMO DATA
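Loan risk modeling
• Problem
  • Predict default risk from each loan applicant's profile
  • Use the prediction to optimize approve/reject decisions
• Data
  • Loan information (requested amount, repayment term)
  • Personal information (employer, annual salary, address, etc.)
  • Credit history (number of accounts, etc.)
[LeadingTree case]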
THANK YOU
Woonpyo Hong, Data Scientist, DataRobot