Ad click-through rate prediction for KDD Cup 2012 Track 2 using Hive/Pig

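Makoto Yui m.yui@aist.go.jp

Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST)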

Twitter ID: @myui

Slides: http://www.slideshare.net/myui/dsirnlp-myuilt http://goo.gl/Ulf3A

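KDD Cup 2012 Track 2

• The task is to estimate the click-through rate (CTR) of search-engine ads from search logs

– Real data from soso.com, one of China's three major search engines

• Search terms and similar fields are all converted to numeric values, for example by hashing

– Training data (about 10 GB + 2.2 GB, 1.5 billion records)

– Test data (about 1.3 GB, 200 million records)

• The evaluation dataset is about 13.3% the size of the training data

– The submission format is the predicted CTR

• This is more a regression problem than a classification problem (though it can of course be solved as classification)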

Table layout of the training data

• Training table: UserID, AdID, QueryID, Depth, Position, Impression, Click, plus the AdID properties DisplayURL, AdvertiserID, KeywordID, TitleID, DescriptionID

• User table: UserID, Gender, Age

• Query table: QueryID, Tokens

• Keyword table: KeywordID, Tokens

• Title table: TitleID, Tokens

• Description table: DescriptionID, Tokens

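The evaluation table contains the features other than impression and click. Basically all of them are categorical variables → they are decomposed into binary features.

Click = number of positive examples

Impression − Click = number of negative examples

CTR = Click / Impression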
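The relation between the aggregated counts and the labeled examples can be sketched in Python as follows. This is only an illustration under assumptions: the function names `bin_split` and `ctr` are hypothetical, and although the Pig script later in these slides calls a `predictor.ctr.BinSplit` UDF, its actual behavior is not shown in the slides.

```python
def bin_split(clicks, impressions):
    """Expand one aggregated log row into per-impression labels.

    Returns `clicks` positive labels (1) followed by
    `impressions - clicks` negative labels (-1).
    (Hypothetical helper; not the author's UDF.)
    """
    if not 0 <= clicks <= impressions:
        raise ValueError("clicks must be between 0 and impressions")
    return [1] * clicks + [-1] * (impressions - clicks)


def ctr(clicks, impressions):
    """Click-through rate of one aggregated row."""
    return clicks / impressions


print(bin_split(clicks=1, impressions=4))  # [1, -1, -1, -1]
print(ctr(1, 4))                           # 0.25
```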

Categorical representation:

Label  A  B
-1     2  7

Binary feature representation:

Label  A:1  A:2  A:3  B:7  B:8  B:9
 1     1    0    0    0    1    0
-1     0    1    0    0    0    1
 1     0    0    1    1    0    0
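A minimal Python sketch of this decomposition is shown below. The helper names are hypothetical, and the three input rows are read back from the binary table above (A=1/B=8, A=2/B=9, A=3/B=7), so they are an assumption about the original categorical rows.

```python
def to_binary_features(row):
    """Turn {column: categorical value} into sorted "column:value" feature names."""
    return sorted(f"{col}:{val}" for col, val in row.items())


def to_dense(features, vocabulary):
    """Dense 0/1 vector over a fixed vocabulary, for display only."""
    return [1 if name in features else 0 for name in vocabulary]


# Rows inferred from the binary table (an assumption, see the lead-in).
rows = [(1, {"A": 1, "B": 8}), (-1, {"A": 2, "B": 9}), (1, {"A": 3, "B": 7})]
vocabulary = ["A:1", "A:2", "A:3", "B:7", "B:8", "B:9"]

for label, row in rows:
    print(label, to_dense(to_binary_features(row), vocabulary))
```

In practice only the sparse list of feature names would be kept, since the full vocabulary has tens of millions of entries.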

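Occurrence prediction with logistic regression

• A method for predicting the probability that an event occurs

• Compute how strongly each variable influences the outcome (Train)

– Input: Label, Array<feature>

– Output: per-feature weights as Map<feature, float>

– # of features = 54,686,452

• Note that the token tables are not used (Token ID = <token,..,token>)

• Compute the occurrence probability from those influences (Predict)

– P(X) = Pr(Y=1|x1,x2,..,xn)

– We want to derive a function f: X → Y that minimizes the empirical loss

• Use gradient descent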

$$\operatorname*{arg\,min}_{w} \frac{1}{n} \sum_{i=0}^{n} \mathrm{loss}(f(x_i; w), y_i)$$

Here w holds the weight of each feature.
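As a rough illustration of Predict and of the empirical loss, the following Python sketch assumes sparse binary features, labels in {1, -1}, and the standard logistic loss. The slides do not spell out the loss function, so that choice and the function names are assumptions.

```python
import math


def predict(weights, features):
    """P(Y=1 | x) for a sparse binary example: sigmoid of the summed weights."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))


def empirical_loss(weights, examples):
    """Mean logistic loss over (label, features) pairs, labels in {1, -1}."""
    total = 0.0
    for label, features in examples:
        score = sum(weights.get(f, 0.0) for f in features)
        total += math.log(1.0 + math.exp(-label * score))
    return total / len(examples)


examples = [(1, ["A:1", "B:8"]), (-1, ["A:2", "B:9"])]
print(predict({}, ["A:1", "B:8"]))   # 0.5 with an empty (all-zero) model
print(empirical_loss({}, examples))  # log(2) with an empty model
```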

Gradient Descent

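$$w_{t+1} = w_t - \gamma_t \frac{1}{n} \sum_{i=0}^{n} \nabla \mathrm{loss}(f(x_i; w_t), y_i)$$

Here w_{t+1} is the new weight, w_t is the old weight, and γ_t is the learning rate. The weights are updated based on the gradient of the empirical loss.

From Jimmy Lin's "Large-Scale Machine Learning at Twitter": https://speakerdeck.com/u/lintool/p/large-scale-machine-learning-at-twitter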
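One batch update of this rule might look like the Python sketch below. It assumes the logistic loss, sparse binary features, and labels in {1, -1}; the function name and the fixed learning rate are illustrative choices, not taken from the slides.

```python
import math
from collections import defaultdict


def gradient_descent_step(weights, examples, learning_rate):
    """One batch update: w <- w - learning_rate * mean gradient of the logistic loss."""
    grad = defaultdict(float)
    for label, features in examples:
        score = sum(weights.get(f, 0.0) for f in features)
        # d/ds log(1 + exp(-label * s)) = -label / (1 + exp(label * s))
        g = -label / (1.0 + math.exp(label * score))
        for f in features:
            grad[f] += g
    n = len(examples)
    new_weights = dict(weights)
    for f, g in grad.items():
        new_weights[f] = new_weights.get(f, 0.0) - learning_rate * g / n
    return new_weights


examples = [(1, ["A:1", "B:8"]), (-1, ["A:2", "B:9"]), (1, ["A:3", "B:7"])]
w = {}
for _ in range(100):
    w = gradient_descent_step(w, examples, learning_rate=0.5)
print(w["A:1"] > 0, w["A:2"] < 0)
```

Every step touches all examples before any weight changes, which is the property that makes the batch form expensive at this data size.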

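Computing the gradient in parallel

Figure: several mappers feeding a single reducer. The mappers compute the gradient in parallel, and the reducer updates the weights.

• In practice, the weight update also needs the features (xi) that were updated

• w is a Map<feature, weight> with Map.size() = 54,686,452

• Many iterations are needed, which makes this a poor fit for MapReduce because input and output go through the DFS

• The computation in the reducer becomes the bottleneck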

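Stochastic gradient descent

• Gradient Descent

$$w_{t+1} = w_t - \gamma_t \frac{1}{n} \sum_{i=0}^{n} \nabla \mathrm{loss}(f(x_i; w_t), y_i)$$

– Every training instance is needed for each model update (batch learning)

• Stochastic Gradient Descent (SGD)

$$w_{t+1} = w_t - \gamma_t \nabla \mathrm{loss}(f(x; w_t), y)$$

– The weights are updated for each training instance (online learning)

– When processed with Iterative Parameter Mixing, it works surprisingly well in practice and does not need that many iterations

• Split the data and run SGD in parallel on each mapper

• The model parameters are distributed at every iteration/epoch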
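For contrast with the batch form, a per-instance update can be sketched in Python as below. The logistic loss, the {1, -1} labels, the constant learning rate, and the name `sgd_epoch` are assumptions made for illustration.

```python
import math


def sgd_epoch(weights, examples, learning_rate):
    """One pass of SGD for the logistic loss; `weights` is updated in place."""
    for label, features in examples:
        score = sum(weights.get(f, 0.0) for f in features)
        # Gradient of log(1 + exp(-label * score)) with respect to the score.
        g = -label / (1.0 + math.exp(label * score))
        for f in features:
            # The weights change immediately, before the next example is seen.
            weights[f] = weights.get(f, 0.0) - learning_rate * g
    return weights


model = sgd_epoch({}, [(1, ["A:1", "B:8"]), (-1, ["A:2", "B:9"])], learning_rate=0.1)
print(model)
```

Only the features present in the current example are touched, so one update costs time proportional to the length of that example rather than to the size of the model.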

A typical machine-learning data flow

Figure: Training data (Label, array<feature>) → Model data (Map<feature, weight>); Test data (array<feature>) + Model data → predict → Label/Prob

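A typical parallel training data flow

Figure: Training data → weights computed with SGD in parallel, each producing a Map<feature, weight> → reduce takes the average of the weights → Model data

Machine learning is an aggregation problem

Intuitively, this can be implemented as a Hive/Pig UDAF (user-defined aggregation function). It is really better suited to Dremel, which specializes in parallel aggregation, than to M/R.

When iterating, pass in the old model.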
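The reduce-side averaging can be sketched in Python as below. It mirrors an `avg(weight) ... group by feature` aggregation, so a feature is averaged only over the partial models that contain it; that detail and the function name are assumptions for illustration.

```python
from collections import defaultdict


def mix_parameters(partial_models):
    """Average each feature's weight over the partial models that contain it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Two partial models, as if produced by two mappers.
mixed = mix_parameters([{"A:1": 0.2, "B:8": 0.4}, {"A:1": 0.4}])
print(mixed)
```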


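At first I built it straightforwardly as a UDAF that returns a map:

create table model as
select trainLogisticUDAF(features, label [, params]) as weight
from training

On the map side the model fits in memory if the split size is tuned, but at larger scale the reduce side runs out of memory, so this approach does not scale with data volume.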

Think relational

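Figure: Training data → Model data; Test data (array<feature>) + Model data → predict → Label/Prob

Returning the model as a scalar value does not work. Return (feature, weight) as a relation instead. But a UDAF cannot do that → hence a UDTF (User-Defined Table-Generating Function).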

UDTF (parameter-mix)

select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select TrainLogisticSgdUDTF(features,label,..) as (feature,weight)
  from train
) t
group by feature;

How do we get this to do iterative parameter mixing?

The old model has to be passed in, but passing it along with every row is not great…

Mappers are launched according to Hadoop's InputSplitSize setting (map-only).

UDTF (iterative parameter mix)

create table model1sgditor2 as
select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select
    TrainLogisticIterUDTF(t.features, w.wlist, t.label, ..) as (feature, weight)
  from
    training t join feature_weight w on (t.rowid = w.rowid)
) t
group by feature;

What is needed here is the old model for the features of each row. It is enough to pass the equivalent of Map<feature, weight> and the label, so build a table holding the Array<weight> that corresponds to each Array<feature> and pass it in with an inner join.
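The idea of a per-row weight array aligned with the feature array can be illustrated in Python as follows. This is only a sketch: the function name is hypothetical, and using 0.0 for features that the old model has not seen is an assumption the slides do not state.

```python
def attach_old_weights(rows, old_model, default=0.0):
    """For each (rowid, features, label) row, add the old weights aligned with its features."""
    return [
        (rowid, features, [old_model.get(f, default) for f in features], label)
        for rowid, features, label in rows
    ]


old_model = {"A:1": 0.3, "B:9": -0.2}
training = [(1, ["A:1", "B:8"], 1), (2, ["A:2", "B:9"], -1)]
for row in attach_old_weights(training, old_model):
    print(row)
```

Each output row carries only the handful of weights its own features need, so no task has to hold the full Map<feature, weight> in memory.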

An example of the Pig version of the flow

-- weak learning
training_raw = load '$TARGET' as (clicks: int, impression: int, displayid: int, adid: int, advertiserid: int, depth: int, position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int, userid: int, gender: int, age: int);
training_bin = foreach training_raw generate flatten(predictor.ctr.BinSplit(clicks, impression)), displayid, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age;
training_smp = sample training_bin 0.1;
training_rnd = foreach training_smp generate (int)(RANDOM() * 100) as dataid, TOTUPLE(*) as training;
training_dat = group training_rnd by dataid;
model = foreach training_dat generate predictor.ctr.TrainLinear(training_rnd.training.training_smp);
store model into '$MODEL';

-- ensemble learning
model = load '$MODEL' as (mdl: map[]);
model_lmt = limit model 10;
testing_raw = load '$TARGET' as (dataid: int, displayid: int, adid: int, advertiserid: int, depth: int, position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int, userid: int, gender: int, age: int);
testing_with_model = cross model_lmt, testing_raw;
result = foreach testing_with_model generate dataid, predictor.ctr.Pred(mdl, displayid, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age) as ctr;
result_grp = group result by dataid;
result_ens = foreach result_grp generate group as dataid, predictor.ctr.Ensemble(result.ctr);
result_ens_ord = order result_ens by dataid;
result_fin = foreach result_ens_ord generate $1;
store result_fin into '$RESULT';
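The final grouping step of the Pig flow can be sketched in Python as below. The slides do not show what `predictor.ctr.Ensemble` computes, so plain averaging of the per-model CTR predictions is an assumption here, and the function name is hypothetical.

```python
from collections import defaultdict


def ensemble_by_id(predictions):
    """Average the (dataid, ctr) predictions of several models per dataid."""
    grouped = defaultdict(list)
    for dataid, ctr in predictions:
        grouped[dataid].append(ctr)
    # Sorted by dataid, like the `order ... by dataid` step in the Pig script.
    return {dataid: sum(v) / len(v) for dataid, v in sorted(grouped.items())}


# Predictions from two weak models for two test rows.
result = ensemble_by_id([(2, 0.1), (1, 0.2), (1, 0.4), (2, 0.3)])
print(result)
```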

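Summary

• We built something that scales properly with data volume

– An intern built the Pig version

• That version does not use a UDTF; its strategy is to build the model in separate pieces and combine them with ensemble learning

– Online model updates and the like need some extra work, because Hive has no update and they have to be expressed as inserts

– A passive-aggressive version is also planned

• Currently AUC is about 0.75 (the winner, National Taiwan University, reached 0.8)

– On the a9a dataset the accuracy is on par with libsvm, svm-light, liblinear, tinysvm and others (around 0.85)

• If time permits, this will be sent to Hive as a patch

– But documentation, tests, and so on ｘｘｘｘｘ

Looking for joint-research partners who have real data (one project with an ad-delivery company is under way)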
