HadoopCon'16, Taipei @myui

Hivemall: Machine Learning Library for Apache Hive/Spark
Makoto YUI (油井 誠), Research Engineer, @myui
2016/09/09, HadoopCon'16, Taipei

Transcript of HadoopCon'16, Taipei @myui

Page 1: HadoopCon'16, Taipei @myui

Hivemall: Machine Learning Library for Apache Hive/Spark

Research Engineer Makoto YUI (油井 誠) @myui

2016/09/09, HadoopCon'16, Taipei

Page 2: HadoopCon'16, Taipei @myui

A little about me..

• 2015.04~: Research Engineer at Treasure Data, Inc.
  • My mission is developing ML-as-a-Service in a Hadoop-as-a-service company
• 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Developed Hivemall as a personal research project
• 2009.03: Ph.D. in Computer Science from NAIST
  • Majored in Parallel Data Processing, not ML at the time
• Visiting scholar at CWI, Amsterdam and the University of Edinburgh

Page 3: HadoopCon'16, Taipei @myui

Treasure Data

Team
• Hiro Yoshikawa, CEO: open source business veteran
• Kaz Ota, CTO: founder of the world's largest Hadoop group
• Sada Furuhashi, Chief Architect: invented Fluentd and MessagePack

History
• 2012: Founded in Mountain View, CA
• 2013: New office in Tokyo, Japan
• 2015: New office in Seoul, Korea
• Today: 100+ employees, 30M+ funding

Investors
• Jerry Yang, Yahoo! founder
• Bill Tai, angel investor
• Yukihiro Matsumoto, Ruby inventor
• Sierra Ventures (Tim Guleri): enterprise software
• Scale Ventures (Andy Vitus): B2B SaaS

Page 4: HadoopCon'16, Taipei @myui

We open-source! TD invented..

• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Streaming query processor
• Machine learning on Hadoop
• Workflow engine (beta): digdag.io

Page 5: HadoopCon'16, Taipei @myui

Our technology users

Point: Microsoft Operations Management Suite and Google Cloud Platform (Kubernetes) are using Fluentd for log collection.

Page 7: HadoopCon'16, Taipei @myui

Treasure Data's Solution

Page 8: HadoopCon'16, Taipei @myui

Big Data Stats in TD

Page 9: HadoopCon'16, Taipei @myui

Our Customers

• Ad-tech: Agency/Trading Desk, DMP/DSP, Ad-Network
• IoT
• EC, Media, Game/SNS: Gaming, e-Commerce, Internet Service
• Other domains: Retail, Finance, Technology, Telecommunication, Maker

[Customer logos not captured in the transcript; they include 三菱重工 (Mitsubishi Heavy Industries) and Diverse.]

Page 11: HadoopCon'16, Taipei @myui

Agenda

1. What is Hivemall (introduction)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
5. Future roadmap

Page 12: HadoopCon'16, Taipei @myui

What is Hivemall

A scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2.

https://github.com/myui/hivemall

Page 13: HadoopCon'16, Taipei @myui

Hivemall's Technology Stack

• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, MESOS
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3

Page 14: HadoopCon'16, Taipei @myui

Hivemall's Vision: ML on SQL

Classification with Mahout [code screenshot not captured in the transcript]

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop.

Page 15: HadoopCon'16, Taipei @myui

List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

• SCW is a good first choice. Try RandomForest if SCW does not work.
• Logistic regression is good for getting the probability of a positive class.
• Factorization Machines are good where features are sparse and categorical.

Page 16: HadoopCon'16, Taipei @myui

List of Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items.
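Note: the idea behind recommending top-k items can be sketched outside Hive as "keep the k best-scoring items per group". The Python below is only an illustration of that idea. The function name, the row layout, and the sample data are assumptions made for this sketch, and it does not reproduce the signature of Hivemall's each_top_k.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Return the k highest-scoring items for each group.

    rows: iterable of (group, item, score) tuples.
    Result: dict mapping group -> list of (item, score), best score first.
    """
    by_group = defaultdict(list)
    for group, item, score in rows:
        by_group[group].append((item, score))
    return {
        group: heapq.nlargest(k, items, key=lambda pair: pair[1])
        for group, items in by_group.items()
    }


# Hypothetical (user, item, predicted score) rows.
scored = [
    ("alice", "item1", 0.9), ("alice", "item2", 0.4), ("alice", "item3", 0.7),
    ("bob", "item1", 0.2), ("bob", "item4", 0.8),
]
recommendations = top_k_per_group(scored, k=2)
```

Here each user plays the role of the group and the predicted score is the ranking key.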

Page 17: HadoopCon'16, Taipei @myui

Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LoF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Page 19: HadoopCon'16, Taipei @myui

Industry use cases of Hivemall

• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo

minne.com

Problem: Recommendation using hot items is hard in a hand-crafted product market because each creator sells only a few single items (which soon go out of stock).

Page 21: HadoopCon'16, Taipei @myui

Industry use cases of Hivemall

• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
• Value prediction of real estate
  • Algorithm: Regression
  • Livesense
• User score calculation
  • Algorithm: Regression
  • Klout

klout.com: influencer marketing

bit.ly/klout-hivemall

Page 22: HadoopCon'16, Taipei @myui

Churn Detection of Monthly Payment Service

OISIX, a leading food delivery service company in Japan, used Hivemall's Logistic Regression to get churn probability.

The churn rate dropped by almost half after giving gift points to customers who were predicted to leave :)

Page 24: HadoopCon'16, Taipei @myui

Motivation - Why a new ML framework?

Machine Learning frameworks out there that run with Hadoop:
• Mahout?
• Vowpal Wabbit? (w/ Hadoop streaming)
• Spark MLlib?
• 0xdata H2O?
• Cloudera Oryx?

Quick Poll: How many people in this room are using them?

Page 25: HadoopCon'16, Taipei @myui

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (e.g. height: 173cm, weight: 60kg, age: 34, gender: man) → Machine Learning]

Page 26: HadoopCon'16, Taipei @myui

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (e.g. height: 173cm, weight: 60kg, age: 34, gender: man) → Machine Learning]

Extract-Transform-Load step: need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory).

Page 27: HadoopCon'16, Taipei @myui

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector file (e.g. height: 173cm, weight: 60kg, age: 34, gender: man) → Machine Learning]

Machine Learning step: does not scale; have to learn R/Python APIs.

Page 28: HadoopCon'16, Taipei @myui

Hivemall's Vision: ML on SQL (again)

Classification with Mahout [code screenshot not captured in the transcript]

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop.

Page 29: HadoopCon'16, Taipei @myui

Hivemall on Apache Spark

Installation is very easy, as follows:

$ spark-shell --packages maropu:hivemall-spark:0.0.6

Page 31: HadoopCon'16, Taipei @myui

How Hivemall works in training

Machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs). A UDTF is a function that returns a relation.

[Figure: the training table holds rows of tuple<label, array<features>>, such as (+1, <1,2>) .. (+1, <1,7,9>) and (-1, <1,3,9>) .. (+1, <3,8>). The rows are split across parallel train UDTF tasks; each task emits tuple<feature, weights>. The outputs are shuffled by feature and combined by param-mix into the prediction model, a Relation<feature, weights>.]

● The resulting prediction model is a relation of features and their weights
● The numbers of mappers and reducers are configurable

Parallelism is Powerful

Page 32: HadoopCon'16, Taipei @myui

Parameter averaging (bagging)

[Figure: the training table (tuple<label, features>) is split into partitions, e.g. (+1, <1,2>) .. (+1, <1,7,9>), (-1, <1,3,9>) .. (+1, <3,8>), and (-1, <2,7,9>) .. (+1, <3,8>). Each partition is learned by its own train task, which produces an array<weight>; the weight arrays are combined in a MIX step.]
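Note: parameter averaging itself is simple enough to show in a few lines. The sketch below assumes each trainer produces a sparse model as a feature-to-weight mapping; the function name and the sample weights are made up for illustration, and it only mirrors what a GROUP BY feature with avg(weight) computes, not Hivemall's internals.

```python
from collections import defaultdict


def average_models(models):
    """Average weights per feature across independently trained models.

    models: list of dicts mapping feature -> weight (one dict per trainer).
    A feature is averaged only over the models that actually emitted it,
    which is what GROUP BY feature with avg(weight) does in SQL.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model in models:
        for feature, weight in model.items():
            totals[feature] += weight
            counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Hypothetical models learned on two partitions of the training table.
model_a = {1: 0.5, 2: -0.25, 9: 1.0}
model_b = {1: 1.5, 3: 0.75, 9: 0.0}
mixed = average_models([model_a, model_b])
```

Features seen by only one trainer (2 and 3 here) keep that trainer's weight, while shared features are averaged.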

Page 33: HadoopCon'16, Taipei @myui

Alternative Approach in Hivemall

Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps.

SET hivevar:xtimes=3;

CREATE VIEW training_x3
as
SELECT
  *
FROM (
  SELECT
    amplify(${xtimes}, *) as (rowid, label, features)
  FROM
    training
) t
CLUSTER BY rand()
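Note: the effect of the view above can be pictured as "emit every training row several times, then shuffle". The Python below is a rough sketch of that idea under the assumption that amplify simply replicates rows; the helper names, the seed parameter, and the sample rows are inventions of this sketch.

```python
import random


def amplify(rows, xtimes):
    """Yield every input row xtimes times (the replication step)."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def amplified_and_shuffled(rows, xtimes, seed=0):
    """Replicate rows, then shuffle them, mimicking CLUSTER BY rand()."""
    out = list(amplify(rows, xtimes))
    random.Random(seed).shuffle(out)
    return out


# Hypothetical (rowid, label, features) rows.
training = [(1, +1, ["1", "2"]), (2, -1, ["1", "3", "9"])]
training_x3 = amplified_and_shuffled(training, xtimes=3)
```

An online learner that reads training_x3 once sees each example three times in mixed order, which approximates three passes over the data within a single job.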

Page 35: HadoopCon'16, Taipei @myui

How to use Hivemall

[Figure: machine learning workflow. Training takes feature vectors and labels and produces a prediction model; prediction applies the model to feature vectors and outputs labels. Highlighted step: Data preparation]

Page 36: HadoopCon'16, Taipei @myui

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  target float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

Page 37: HadoopCon'16, Taipei @myui

How to use Hivemall

[Figure: machine learning workflow. Training takes feature vectors and labels and produces a prediction model; prediction applies the model to feature vectors and outputs labels. Highlighted step: Feature Engineering]

Page 38: HadoopCon'16, Taipei @myui

How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
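Note: min-max normalization maps a value v in [min, max] to (v - min) / (max - min). The Python below is a minimal sketch of that formula; the handling of a zero-width range and the sample values are assumptions of this sketch, not a statement about how Hivemall's rescale behaves.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0, 1]."""
    if max_value == min_value:
        # Degenerate range; returning 0.5 is a choice made for this sketch.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical raw label values.
labels = [2.0, 4.0, 10.0]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

In the view above, ${min_label} and ${max_label} play the role of lo and hi and would be computed from the training table beforehand.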

Page 39: HadoopCon'16, Taipei @myui

How to use Hivemall

[Figure: machine learning workflow. Training takes feature vectors and labels and produces a prediction model; prediction applies the model to feature vectors and outputs labels. Highlighted step: Training]

Page 40: HadoopCon'16, Taipei @myui

How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature -- shuffle map outputs to reducers by feature
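Note: the query's two stages, per-mapper learning followed by per-feature averaging, can be imitated in plain Python. This is a sketch under stated assumptions: it uses textbook SGD for logistic regression with binary features and 0/1 labels, and the function name train_logress, the learning rate, and the data are hypothetical. It is not Hivemall's logress implementation.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_logress(rows, eta=0.1, epochs=1):
    """Plain SGD for logistic regression over sparse binary features.

    rows: list of (features, label); label is 0 or 1, features is a list of ids.
    Returns the learned model as (feature, weight) rows, one row per feature.
    """
    weights = {}
    for _ in range(epochs):
        for features, label in rows:
            score = sum(weights.get(f, 0.0) for f in features)
            gradient = label - sigmoid(score)
            for f in features:
                weights[f] = weights.get(f, 0.0) + eta * gradient
    return sorted(weights.items())


# "Map" side: each partition is learned independently.
partition_1 = [(["a", "b"], 1), (["a", "c"], 0)]
partition_2 = [(["a", "b"], 1), (["c"], 0)]
emitted = train_logress(partition_1) + train_logress(partition_2)

# "Reduce" side: group the emitted rows by feature and average the weights.
grouped = defaultdict(list)
for feature, weight in emitted:
    grouped[feature].append(weight)
lr_model = {f: sum(ws) / len(ws) for f, ws in grouped.items()}
```

Feature "b" only appears in positive examples and ends up with a positive weight, while "c" only appears in negative examples and ends up negative.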

Page 41: HadoopCon'16, Taipei @myui

How to use Hivemall - Training

Training of the Confidence Weighted (CW) classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)
  FROM
    news20b_train
) t
GROUP BY feature

voted_avg votes on whether to use the negative or the positive weights for the average, e.g. for +0.7, +0.3, +0.2, -0.1, +0.7
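Note: one way to read "vote to use negative or positive weights for avg" is a majority vote on the sign, followed by an average over the winning side only. The Python below sketches that reading. The tie-break and the treatment of zeros are assumptions made here, so it should not be taken as the exact behaviour of Hivemall's voted_avg.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote.

    Ties go to the negative side and zeros are ignored; both choices are
    assumptions of this sketch, not something stated on the slide.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes, one negative.
result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

For the slide's example the positive side wins 4 to 1, so the result is the mean of the four positive weights, about 0.475, and the lone -0.1 is discarded rather than pulling the average down.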

Page 42: HadoopCon'16, Taipei @myui

How to use Hivemall

[Figure: machine learning workflow. Training takes feature vectors and labels and produces a prediction model; prediction applies the model to feature vectors and outputs labels. Highlighted step: Prediction]

Page 43: HadoopCon'16, Taipei @myui

How to use Hivemall - Prediction

CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid

Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model. There is no need to load the entire model into memory.
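Note: the join-based prediction amounts to "for each test row, sum the weights of the features it contains, then apply the sigmoid". The Python below is a small sketch of that computation, assuming the test data has already been exploded into (rowid, feature) pairs; the function name and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, lr_model):
    """Score each test row from its exploded (rowid, feature) pairs.

    lr_model: dict mapping feature -> weight.
    Features missing from the model add nothing to the sum, mirroring the
    NULL weights a LEFT OUTER JOIN produces and that sum() skips. Unlike SQL,
    a row with no known feature gets sigmoid(0) = 0.5 here instead of NULL.
    """
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        sums[rowid] += lr_model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


# Hypothetical model and exploded test rows.
model = {"a": 1.0, "b": 1.0}
rows = [(1, "a"), (1, "b"), (2, "unseen")]
prob = predict(rows, model)
```

In Hive the same lookup is a distributed join, so neither side has to fit in a single machine's memory.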

Page 44: HadoopCon'16, Taipei @myui

Real-time prediction

[Figure: batch training on Hadoop produces a prediction model from feature vectors and labels. The prediction model is exported to an RDBMS, where online prediction maps feature vectors to labels.]

bit.ly/hivemall-rtp
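Note: because the model is just a feature/weight relation, online prediction on an RDBMS can be a keyed lookup plus a sum. The sketch below uses an in-memory SQLite database as a stand-in for the RDBMS; the table layout, feature names, and weights are assumptions for illustration and are not taken from the talk.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the online RDBMS
conn.execute("CREATE TABLE lr_model (feature TEXT PRIMARY KEY, weight REAL)")
# Hypothetical rows exported from the batch-trained model.
conn.executemany(
    "INSERT INTO lr_model VALUES (?, ?)",
    [("age:30s", 0.5), ("gender:f", -0.25), ("device:ios", 1.0)],
)


def predict_online(conn, features):
    """Fetch only the weights for this request's features and score it."""
    placeholders = ",".join("?" for _ in features)
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(weight), 0.0) FROM lr_model "
        f"WHERE feature IN ({placeholders})",
        list(features),
    ).fetchone()
    return 1.0 / (1.0 + math.exp(-total))


prob = predict_online(conn, ["age:30s", "device:ios", "unseen:feature"])
```

Each request touches only a handful of indexed rows, which is what makes low-latency scoring feasible without a Hadoop job.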

Page 45: HadoopCon'16, Taipei @myui

RandomForest in Hivemall

Ensemble of Decision Trees

Page 46: HadoopCon'16, Taipei @myui

Training of RandomForest

Page 47: HadoopCon'16, Taipei @myui

Prediction of RandomForest

Page 49: HadoopCon'16, Taipei @myui

Future of Hivemall

Hivemall will become Apache Hivemall (?) The vote is still in progress, though..

Page 50: HadoopCon'16, Taipei @myui

Apache Incubation status

Page 51: HadoopCon'16, Taipei @myui

Initial committers

• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  • Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  • Hivemall on Apache Pig
  • Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  • Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Page 52: HadoopCon'16, Taipei @myui

Project mentors

Champion
• Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop / Incubator PMC member

Nominated Mentors
• Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member

Page 53: HadoopCon'16, Taipei @myui

Roadmap

• Possibly enter Apache Incubator soon
• Sept to Nov:
  • IP clearance and project/repository site setup
  • Contribution guideline
  • Create a "who uses Hivemall" list
  • More documentation!
• Initial Apache release will be Dec (or late Nov?)

Page 54: HadoopCon'16, Taipei @myui

Coming New Features - already merged in Master

✓ Hivemall on Spark 2.0 w/ Dataframe support
✓ XGBoost support

Please refer to bit.ly/hivemall-xgboost for details.

Page 55: HadoopCon'16, Taipei @myui

Coming New Features - already merged in Master

✓ ChangeFinder
  • Efficient algorithm for finding change points and outliers from time series data

J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.

Page 57: HadoopCon'16, Taipei @myui

Coming New Features - already merged in Master

✓ Various Evaluation Metrics
  • PR #326

Page 58: HadoopCon'16, Taipei @myui

Other new features in progress

• v0.5-beta{1,2} release (Oct-Nov)
  ✓ one-hot encoding
  ✓ Field-aware Factorization Machines
  ✓ Kernelized Passive Aggressive
  ✓ Generalized Linear Model
  ✓ Optimizer framework including ADAM
  ✓ L1/L2 regularization
  ✓ Gradient Tree Boosting
  ✓ Online LDA

Page 59: HadoopCon'16, Taipei @myui

Conclusion and Takeaway

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.

Hivemall's Positioning
• For SQL users who need ML
• For those already using Hive
• Ease of use and scalability in mind

It does not require coding, packaging, compiling, or introducing a new programming language or APIs.

We welcome your contributions to Apache Hivemall :)

Page 60: HadoopCon'16, Taipei @myui

Any feature requests or questions?

#hivemall