AI&BigData Lab 2016. Peter Rudenko: Specifics of training,...


Training, tuning, selecting & serving of machine learning models at scale

Peter Rudenko (@peter_rud)

peter.rudenko@datarobot.com

Input data

Balanced vs skewed target distribution

The devil is in the details:
○ Partitioning
○ Leakage
○ Sample size

http://blog.mrtz.org/2015/03/09/competition.html
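A classic illustration of the leakage pitfall (a minimal sketch, not from the talk; assumes scikit-learn is available): fitting preprocessing on all rows before partitioning lets test-set statistics leak into training.

import numpy
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = numpy.random.rand(1000, 5)
y = numpy.random.randint(0, 2, 1000)

# Wrong: scaler statistics are computed over train AND test rows (leakage).
X_leaky = StandardScaler().fit_transform(X)

# Right: partition first, fit preprocessing on the training part only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)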

# IPython session (Python 2: iterating over .data yields single-byte strings, hence ord());
# assumes `import numpy`. The same 2-D array can be laid out row-major ('C') or column-major ('F').

In [42]: ar2d = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='C')  # row-major

In [43]: ' '.join(str(ord(x)) for x in ar2d.data)  # bytes in physical memory order
Out[43]: '1 2 3 11 12 13 10 20 40'                 # rows are contiguous

In [44]: ar2df = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='F')  # column-major

In [45]: ' '.join(str(ord(x)) for x in ar2df.data)
Out[45]: '1 11 10 2 12 20 3 13 40'                 # columns are contiguous

Big Data?

Criteo 1 TB dataset:

Data size:
● ~46 GB/day
● ~180,000,000 examples/day
● ~3.5% event rate

Raw data: 35 TB @ 0.1% event rate

Data: 1 TB @ 3.5% (189 GB in columnar Parquet format)

Balanced classes: 70 GB (12 GB Parquet)
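A back-of-the-envelope check of these numbers (my reconstruction of the arithmetic, not from the slides): keep all positive events and downsample the negatives.

raw_tb, raw_rate = 35.0, 0.001      # 35 TB of raw logs @ ~0.1% events
positives_tb = raw_tb * raw_rate    # ~0.035 TB of positive events
sampled_tb = positives_tb / 0.035   # all positives at a 3.5% rate
balanced_tb = positives_tb * 2      # 1:1 positives/negatives
print(sampled_tb)                   # -> 1.0 (TB)
print(balanced_tb * 1000)           # -> 70.0 (GB)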

Scalability! But at what COST?

“You can have a second computer once you’ve shown you know how to use the first one.” – Paul Barham

50 shades of machine learning

● Supervised / Unsupervised / Semi-supervised
● Classification
● Regression
● Sequence prediction
● Structure prediction
● Reinforcement learning
● Time series forecasting
● Clustering
● Dimensionality reduction
● Topic modeling
● Recommendation
● Online/Streaming ML
● Ranking
● Survival analysis
● Anomaly detection

Buzzword maker: REALTIME + BIGDATA + 1 or 2 of the items above = Profit

Model state (knowledge) vs hyperparameters

LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION

* Pedro Domingos, "A Few Useful Things to Know about Machine Learning", Communications of the ACM, 2012.

Evaluation = LossFunction(Prediction, True label)
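For example, log loss as the evaluation function for binary classification (a minimal sketch, not from the slides):

import numpy

def log_loss(y_true, y_pred, eps=1e-15):
    p = numpy.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -numpy.mean(y_true * numpy.log(p) + (1 - y_true) * numpy.log(1 - p))

print(log_loss(numpy.array([1, 0, 1]), numpy.array([0.9, 0.1, 0.8])))  # -> ~0.14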

Optimization: model parameters and hyperparameters

Combinatorial optimization:
● Greedy search
● Beam search
● Branch-and-bound

Continuous optimization:
❖ Unconstrained
  ❏ Gradient descent
  ❏ Conjugate gradient
  ❏ Quasi-Newton methods
❖ Constrained
  ❏ Linear programming
  ❏ Quadratic programming
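A minimal sketch of gradient descent from the list above, on a least-squares objective (illustrative, not from the slides):

import numpy

def gradient_descent(X, y, lr=0.1, n_steps=1000):
    w = numpy.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T.dot(X.dot(w) - y) / len(y)  # gradient of the mean squared error
        w -= lr * grad                             # step against the gradient
    return w

X = numpy.random.rand(100, 3)
print(gradient_descent(X, X.dot(numpy.array([1.0, -2.0, 0.5]))))  # -> ~[1, -2, 0.5]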

Hyperparameter optimization:
● Grid search
● Random search
● Bayesian optimization
● Tree of Parzen Estimators (TPE)
● Gradient-based optimization
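A hand-rolled random-search sketch (illustrative; `train_eval_fn` is a hypothetical callable that trains a model with the given params and returns a validation score):

import numpy

def random_search(train_eval_fn, n_trials=20):
    best_params, best_score = None, -numpy.inf
    for _ in range(n_trials):
        params = {
            'learning_rate': 10 ** numpy.random.uniform(-3, 0),  # log-uniform in [0.001, 1]
            'max_depth': numpy.random.randint(2, 10),
        }
        score = train_eval_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score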

Distributed Machine Learning

Does the model fit in memory? Does the data fit in memory?
● Data doesn't fit, model does: distributed data (hdfs, spark)
● Neither model nor data fits: distributed data, distributed models

Distributed Machine Learning

Model & data parallelism:
● Data parallelism: Data 1 → Model 1, ..., Data N → Model N (a model replica per data shard)
● Model parallelism: a single model partitioned across workers

Parameter servers and related frameworks:
http://parameterserver.org/
https://github.com/intel-machine-learning/DistML
http://www.dmtk.io/
https://petuum.github.io/bosen.html
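A toy synchronous data-parallel step (a sketch of the idea only, not of any framework above): each worker computes a gradient on its own shard, and the averaged gradient updates the shared parameters.

import numpy

def data_parallel_step(w, shards, lr=0.1):
    # In a real system each (X, y) shard lives on a different worker;
    # here they are just items of a list.
    grads = [2 * X.T.dot(X.dot(w) - y) / len(y) for X, y in shards]  # local MSE gradients
    return w - lr * numpy.mean(grads, axis=0)                        # averaged, synchronous update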

Speed up distributed machine learning

● Approximate all the things
● Update asynchronously
● Early stopping
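For instance, early stopping as a wrapper around a training loop (a sketch; `step_fn` and `val_loss_fn` are hypothetical callables):

def train_with_early_stopping(step_fn, val_loss_fn, max_iters=1000, patience=5):
    best, bad_rounds = float('inf'), 0
    for _ in range(max_iters):
        step_fn()                       # one training step
        loss = val_loss_fn()            # loss on a held-out validation set
        if loss < best:
            best, bad_rounds = loss, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:  # no improvement for `patience` rounds
                break
    return best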

We draw inspiration from the high-level programming models of dataflow systems, and the low-level efficiency of parameter servers.

TensorFlow: A system for large-scale machine learning

A better model when time is the constraint

Cost-based optimization

Automating Model Search for Large Scale Machine Learning

Apache SystemML: automatic optimization
Algorithms specified in DML and PyDML are dynamically compiled and optimized based on data and cluster characteristics using rule-based and cost-based optimization techniques. The optimizer automatically generates hybrid runtime execution plans ranging from in-memory single-node execution to distributed computations on Spark or Hadoop. This ensures both efficiency and scalability. Automatic optimization reduces or eliminates the need to hand-tune distributed runtime execution plans and system configurations.

Ensembles:
● Bagging
● Boosting
● Blending
● Stacking
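A minimal bagging sketch (illustrative; `train_fn` and `predict_fn` are hypothetical callables): train on bootstrap resamples of the data and average the predictions.

import numpy

def bagging_predict(train_fn, predict_fn, X, y, X_test, n_models=10):
    preds = []
    for _ in range(n_models):
        idx = numpy.random.randint(0, len(X), len(X))  # bootstrap sample (with replacement)
        model = train_fn(X[idx], y[idx])
        preds.append(predict_fn(model, X_test))
    return numpy.mean(preds, axis=0)  # averaging reduces variance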

Dark knowledge

http://www.ttic.edu/dl/dark14.pdf
https://www.youtube.com/watch?v=EK61htlw8hY
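The core trick behind dark knowledge / distillation, as presented in the linked talk (a minimal sketch of softmax with temperature): dividing the logits by a temperature T > 1 exposes small between-class similarities as soft training targets.

import numpy

def softmax_with_temperature(logits, T=1.0):
    z = numpy.asarray(logits, dtype=float) / T
    e = numpy.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = [9.0, 5.0, 1.0]
print(softmax_with_temperature(logits, T=1))  # hard: ~[0.98, 0.02, 0.00]
print(softmax_with_temperature(logits, T=5))  # soft: ~[0.61, 0.27, 0.12]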

Test time prediction

● Different environment
● Different hardware
● Different requirements

Types of model transfer:
1. Model serialization:
   - Bound to a single language
   - Bound to a single version
2. Metadata + data (Spark 2.0; https://tensorflow.github.io/serving/)
3. PMML (http://dmg.org/pmml/v4-2-1/GeneralStructure.html)
4. PFA (http://dmg.org/pfa/index.html)
5. Code generation (h2o.ai)
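A sketch of why option 1 is limiting (illustrative): a pickled model can only be restored by a compatible Python and library version on the serving side.

import pickle

model = {'weights': [0.1, -0.3], 'intercept': 0.05}  # stand-in for a trained model object

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)    # the artifact is bound to Python (and, for real models,
                             # to the exact library version that defined the class)
with open('model.pkl', 'rb') as f:
    restored = pickle.load(f)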

Thanks! Q&A