AI&BigData Lab 2016. Peter Rudenko: Specifics of training,...


Training, tuning, selecting & serving of machine learning models at scale

Peter Rudenko (@peter_rud)

peter.rudenko@datarobot.com

Input data

Balanced vs skewed target distribution

The devil is in the details:
○ Partitioning
○ Leakage
○ Sample size

http://blog.mrtz.org/2015/03/09/competition.html
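A classic illustration of the leakage pitfall (a minimal sketch, not from the talk; assumes scikit-learn is available): fitting preprocessing on all rows before partitioning lets test-set statistics leak into training.

import numpy
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = numpy.random.rand(1000, 5)
y = numpy.random.randint(0, 2, 1000)

# Wrong: scaler statistics are computed over train AND test rows (leakage).
X_leaky = StandardScaler().fit_transform(X)

# Right: partition first, fit preprocessing on the training part only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)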

# IPython session (Python 2: iterating over .data yields single-byte strings, hence ord());
# assumes `import numpy`. The same 2-D array can be laid out row-major ('C') or column-major ('F').

In [42]: ar2d = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='C')  # row-major

In [43]: ' '.join(str(ord(x)) for x in ar2d.data)  # bytes in physical memory order
Out[43]: '1 2 3 11 12 13 10 20 40'                 # rows are contiguous

In [44]: ar2df = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='F')  # column-major

In [45]: ' '.join(str(ord(x)) for x in ar2df.data)
Out[45]: '1 11 10 2 12 20 3 13 40'                 # columns are contiguous

Big Data?

Criteo 1 TB dataset:

Data size:
● ~46 GB/day
● ~180,000,000 examples/day
● ~3.5% event rate

Raw data: 35 TB @ 0.1% event rate

Data: 1 TB @ 3.5% (189 GB in columnar Parquet format)

Balanced classes: 70 GB (12 GB Parquet)
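A back-of-the-envelope check of these numbers (my reconstruction of the arithmetic, not from the slides): keep all positive events and downsample the negatives.

raw_tb, raw_rate = 35.0, 0.001      # 35 TB of raw logs @ ~0.1% events
positives_tb = raw_tb * raw_rate    # ~0.035 TB of positive events
sampled_tb = positives_tb / 0.035   # all positives at a 3.5% rate
balanced_tb = positives_tb * 2      # 1:1 positives/negatives
print(sampled_tb)                   # -> 1.0 (TB)
print(balanced_tb * 1000)           # -> 70.0 (GB)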

Scalability! But at what COST?

“You can have a second computer once you’ve shown you know how to use the first one.” – Paul Barham

50 shades of machine learning

● Supervised / Unsupervised / Semi-supervised
● Classification
● Regression
● Sequence prediction
● Structure prediction
● Reinforcement learning
● Time series forecasting
● Clustering
● Dimensionality reduction
● Topic modeling
● Recommendation
● Online/Streaming ML
● Ranking
● Survival analysis
● Anomaly detection

Buzzword maker: REALTIME + BIGDATA + 1 or 2 of the items above = Profit

Model state (knowledge) vs hyperparameters

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION

* Pedro Domingos, "A Few Useful Things to Know about Machine Learning", Communications of the ACM, 2012.

Evaluation = LossFunction(Prediction, True label)
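For example, log loss as the evaluation function for binary classification (a minimal sketch, not from the slides):

import numpy

def log_loss(y_true, y_pred, eps=1e-15):
    p = numpy.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -numpy.mean(y_true * numpy.log(p) + (1 - y_true) * numpy.log(1 - p))

print(log_loss(numpy.array([1, 0, 1]), numpy.array([0.9, 0.1, 0.8])))  # -> ~0.14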

Optimization: model parameters and hyperparameters

Combinatorial optimization:
● Greedy search
● Beam search
● Branch-and-bound

Continuous optimization:
❖ Unconstrained
  ❏ Gradient descent
  ❏ Conjugate gradient
  ❏ Quasi-Newton methods
❖ Constrained
  ❏ Linear programming
  ❏ Quadratic programming
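A minimal sketch of gradient descent from the list above, on a least-squares objective (illustrative, not from the slides):

import numpy

def gradient_descent(X, y, lr=0.1, n_steps=1000):
    w = numpy.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T.dot(X.dot(w) - y) / len(y)  # gradient of the mean squared error
        w -= lr * grad                             # step against the gradient
    return w

X = numpy.random.rand(100, 3)
print(gradient_descent(X, X.dot(numpy.array([1.0, -2.0, 0.5]))))  # -> ~[1, -2, 0.5]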

Hyperparameter optimization:
● Grid search
● Random search
● Bayesian optimization
● Tree of Parzen Estimators (TPE)
● Gradient-based optimization
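A hand-rolled random-search sketch (illustrative; `train_eval_fn` is a hypothetical callable that trains a model with the given params and returns a validation score):

import numpy

def random_search(train_eval_fn, n_trials=20):
    best_params, best_score = None, -numpy.inf
    for _ in range(n_trials):
        params = {
            'learning_rate': 10 ** numpy.random.uniform(-3, 0),  # log-uniform in [0.001, 1]
            'max_depth': numpy.random.randint(2, 10),
        }
        score = train_eval_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score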

Distributed Machine Learning

Does the model fit in memory? Does the data fit in memory?
● Data doesn't fit, model does: distributed data (hdfs, spark)
● Neither model nor data fits: distributed data, distributed models

Distributed Machine Learning

Model & data parallelism:
● Data parallelism: Data 1 → Model 1, ..., Data N → Model N (a model replica per data shard)
● Model parallelism: a single model partitioned across workers

Parameter servers and related frameworks:
http://parameterserver.org/
https://github.com/intel-machine-learning/DistML
http://www.dmtk.io/
https://petuum.github.io/bosen.html
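A toy synchronous data-parallel step (a sketch of the idea only, not of any framework above): each worker computes a gradient on its own shard, and the averaged gradient updates the shared parameters.

import numpy

def data_parallel_step(w, shards, lr=0.1):
    # In a real system each (X, y) shard lives on a different worker;
    # here they are just items of a list.
    grads = [2 * X.T.dot(X.dot(w) - y) / len(y) for X, y in shards]  # local MSE gradients
    return w - lr * numpy.mean(grads, axis=0)                        # averaged, synchronous update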

Speed up distributed machine learning

● Approximate all the things
● Update asynchronously
● Early stopping
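For instance, early stopping as a wrapper around a training loop (a sketch; `step_fn` and `val_loss_fn` are hypothetical callables):

def train_with_early_stopping(step_fn, val_loss_fn, max_iters=1000, patience=5):
    best, bad_rounds = float('inf'), 0
    for _ in range(max_iters):
        step_fn()                       # one training step
        loss = val_loss_fn()            # loss on a held-out validation set
        if loss < best:
            best, bad_rounds = loss, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:  # no improvement for `patience` rounds
                break
    return best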

We draw inspiration from the high-level programming models of dataflow systems, and the low-level efficiency of parameter servers.

TensorFlow: A system for large-scale machine learning

A better model when time is the constraint

Cost-based optimization

Automating Model Search for Large Scale Machine Learning

Apache SystemML: automatic optimization
Algorithms specified in DML and PyDML are dynamically compiled and optimized based on data and cluster characteristics using rule-based and cost-based optimization techniques. The optimizer automatically generates hybrid runtime execution plans ranging from in-memory single-node execution to distributed computations on Spark or Hadoop. This ensures both efficiency and scalability. Automatic optimization reduces or eliminates the need to hand-tune distributed runtime execution plans and system configurations.

Ensembles:
● Bagging
● Boosting
● Blending
● Stacking
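A minimal bagging sketch (illustrative; `train_fn` and `predict_fn` are hypothetical callables): train on bootstrap resamples of the data and average the predictions.

import numpy

def bagging_predict(train_fn, predict_fn, X, y, X_test, n_models=10):
    preds = []
    for _ in range(n_models):
        idx = numpy.random.randint(0, len(X), len(X))  # bootstrap sample (with replacement)
        model = train_fn(X[idx], y[idx])
        preds.append(predict_fn(model, X_test))
    return numpy.mean(preds, axis=0)  # averaging reduces variance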

Dark knowledge

http://www.ttic.edu/dl/dark14.pdf
https://www.youtube.com/watch?v=EK61htlw8hY
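The core trick behind dark knowledge / distillation, as presented in the linked talk (a minimal sketch of softmax with temperature): dividing the logits by a temperature T > 1 exposes small between-class similarities as soft training targets.

import numpy

def softmax_with_temperature(logits, T=1.0):
    z = numpy.asarray(logits, dtype=float) / T
    e = numpy.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = [9.0, 5.0, 1.0]
print(softmax_with_temperature(logits, T=1))  # hard: ~[0.98, 0.02, 0.00]
print(softmax_with_temperature(logits, T=5))  # soft: ~[0.61, 0.27, 0.12]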

Test time prediction

● Different environment
● Different hardware
● Different requirements

Types of model transfer:
1. Model serialization:
   - Bound to a single language
   - Bound to a single version
2. Metadata + data (Spark 2.0; https://tensorflow.github.io/serving/)
3. PMML (http://dmg.org/pmml/v4-2-1/GeneralStructure.html)
4. PFA (http://dmg.org/pfa/index.html)
5. Code generation (h2o.ai)
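A sketch of why option 1 is limiting (illustrative): a pickled model can only be restored by a compatible Python and library version on the serving side.

import pickle

model = {'weights': [0.1, -0.3], 'intercept': 0.05}  # stand-in for a trained model object

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)    # the artifact is bound to Python (and, for real models,
                             # to the exact library version that defined the class)
with open('model.pkl', 'rb') as f:
    restored = pickle.load(f)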

Thanks! Q&A