Spark: User Segmentation in Online Advertising. Sergey Zhemzhitsky, CTO, CleverDATA, for Data Science Week 2015
DATA MINING
User Segmentation in Online Advertising
Spark vs Hadoop
Sergey Zhemzhitsky, CTO, CleverDATA, 28 August 2015
cleverdata.ru | [email protected]
Company Profile
• International market business development since 2012
• One of three leading IT companies in Russia
• 43 branches in Russia and abroad
• 5500+ employees
• 100K projects for 10K customers
• Data management innovative platform (Data Exchange Service)
• Cloud Service, In-house development
• Internet advertising solutions
• Data Management Platforms
• Customer Base Management
• Web Analytics
• Marketing automation
• Big Data, Data Mining, Digital Intelligence, Operational Intelligence, Low Latency and NoSQL, Cloud Computing
Agenda
• The task;
• Hadoop vs. Spark;
• Specifics;
• What's next.
Real Time Bidding (RTB)
[Diagram: publishers, ad networks, SSP, DSP, advertisers; tracking data]
Data Management Platform (DMP)
[Diagram: publishers, SSP, DSP, DMP, advertisers; cookie syncs, access logs, partner's data, 3rd party data, click streams]
Typical data flows
[Diagram: raw data, 3rd party data; Relational Data Store; Raw Data Store & Processing; RealTime Data Store; user profiles; aggregates]
Typical data flows :: RTB
[Diagram: Exchange, SSP; RTB SRV (bid req., bid resp.); pixels :: impressions :: clicks; bid requests; user profiles; raw data, 3rd party data; Relational Data Store; Raw Data Store & Processing; RealTime Data Store; user profiles; aggregates]
1st-party data
[Diagram: Exchange, SSP; RTB SRV (bid req., bid resp.); pixels :: impressions :: clicks; bid requests; user profiles; raw data, 3rd party data; Relational Data Store; Raw Data Store & Processing; RealTime Data Store; user profiles; aggregates]
Task
Find all users who took part in the “Star Wars” advertising campaign
[and] saw one of the banners “Darth Vader” or “Luke Skywalker” within the last 6 days
[and] clicked on that banner
[and] visited the purchase page for Darth Vader's lightsaber
[and] still did not buy anything,
in order to retarget them with a personalized banner offering a 40% discount on the lightsaber.
Task
find all users who have taken part in campaign[s] “Star Wars”
[and] viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s]
[and] clicked banner[s] “Darth Vader's lightsaber”
[and] visited buying area of “Darth Vader's lightsaber”
[and] not visited order confirmed area of “Darth Vader's lightsaber”
Slide annotations: [impression], [click], [tr. pixel], [tr. pixel], [cookies]
id  cookie  event_id                    event_type  campaign_id  timestamp                …
1   c1      “Darth Vader”               impression  “Star Wars”  2015-04-20 14:25:11.462  …
2   c1      “Darth Vader's lightsaber”  click       “Star Wars”  2015-04-21 06:31:12.157  …
3   c1      “Darth Vader's lightsaber”  tr. pixel   “Star Wars”  2015-04-22 18:57:19.628  …
Task
find all users who have taken part in campaign[s] “Star Wars”
viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s]
clicked banner[s] “Darth Vader's lightsaber”
visited buying area of “Darth Vader's lightsaber”
not visited order confirmed area of “Darth Vader's lightsaber”
map: (c1, 0), (c1, 1), (c1, 2), (c1, 3), Ø
reduce: (c1, 0;1;2;3): true(0) and true(1) and true(2) and true(3) and not false(4), result: c1
id  cookie  event_id                    event_type  campaign_id  timestamp                …
1   c1      “Darth Vader”               impression  “Star Wars”  2015-04-20 14:25:11.462  …
2   c1      “Darth Vader's lightsaber”  click       “Star Wars”  2015-04-21 06:31:12.157  …
3   c1      “Darth Vader's lightsaber”  tr. pixel   “Star Wars”  2015-04-22 18:57:19.628  …
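The map and reduce steps on this slide can be sketched in plain Python, without Hadoop or Spark. This is only an illustration under assumptions: the numbering of conditions 0 to 4, the reference time NOW, the user c2, and the event_id used for the "order confirmed" pixel are all hypothetical, since the slides do not spell them out. The map step emits (cookie, condition index) for each condition an event satisfies, and the reduce step evaluates the boolean expression over the collected indexes.

```python
from collections import defaultdict
from datetime import datetime, timedelta

VIEW_BANNERS = {"Darth Vader", "Luke Skywalker"}
SABER = "Darth Vader's lightsaber"
# Hypothetical event_id for the "order confirmed" pixel; the slides do not name it.
SABER_ORDERED = "Darth Vader's lightsaber: order confirmed"


def map_event(event, now):
    """Emit (cookie, condition_index) for every condition the event satisfies."""
    cookie, event_id, event_type, campaign_id, ts = event
    if campaign_id != "Star Wars":
        return  # outside the campaign: nothing is emitted
    yield cookie, 0
    if (event_type == "impression" and event_id in VIEW_BANNERS
            and timedelta(0) <= now - ts <= timedelta(days=6)):
        yield cookie, 1
    if event_type == "click" and event_id == SABER:
        yield cookie, 2
    if event_type == "tr. pixel" and event_id == SABER:
        yield cookie, 3
    if event_type == "tr. pixel" and event_id == SABER_ORDERED:
        yield cookie, 4


def reduce_user(seen):
    """Conditions 0..3 must all be present and condition 4 must be absent."""
    return {0, 1, 2, 3} <= seen and 4 not in seen


def segment(events, now):
    grouped = defaultdict(set)  # stands in for the shuffle: indexes grouped by cookie
    for event in events:
        for cookie, index in map_event(event, now):
            grouped[cookie].add(index)
    return sorted(c for c, seen in grouped.items() if reduce_user(seen))


NOW = datetime(2015, 4, 23)  # assumed "current" time for the 6-day window
EVENTS = [
    ("c1", "Darth Vader", "impression", "Star Wars", datetime(2015, 4, 20, 14, 25, 11)),
    ("c1", SABER, "click", "Star Wars", datetime(2015, 4, 21, 6, 31, 12)),
    ("c1", SABER, "tr. pixel", "Star Wars", datetime(2015, 4, 22, 18, 57, 19)),
    # Hypothetical user c2 completed the order, so condition 4 excludes them.
    ("c2", "Luke Skywalker", "impression", "Star Wars", datetime(2015, 4, 20, 10, 0, 0)),
    ("c2", SABER, "click", "Star Wars", datetime(2015, 4, 20, 10, 1, 0)),
    ("c2", SABER, "tr. pixel", "Star Wars", datetime(2015, 4, 20, 10, 2, 0)),
    ("c2", SABER_ORDERED, "tr. pixel", "Star Wars", datetime(2015, 4, 20, 10, 5, 0)),
]

print(segment(EVENTS, NOW))  # prints ['c1']
```

Keeping the per-event matching separate from the per-user predicate means the same two functions could, in principle, be plugged into either a MapReduce job or a Spark pipeline.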
Materials and tools
Hardware (3 Nodes)
• 12 Core AMD Opteron™ 6338P ~ 2.8 GHz
• 64 GB RAM
• 1 Gbps NICs
Software
• CDH 5.3.1 (Hadoop 2.5.0)
• Spark 1.2.0
Data
• 14.2 GB of raw data
• 61.1 M transactions
• 128 MB block size
MR vs Spark :: Initialization
MR
✓ protected void setup(Context ctx)
✓ o.a.h.c.Configured
✓ distributed cache
Spark
✓ mapRegion
✓ broadcast vars
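A rough illustration of the initialization idea, in plain Python with no Spark dependency: the expensive setup runs once per partition rather than once per record, which is what Mapper.setup() gives a MapReduce task. A function of this shape (iterator in, iterator out) is what Spark's RDD.mapPartitions accepts. The names build_lookup, process_partition, CAMPAIGNS and PARTITIONS are hypothetical, and CAMPAIGNS only plays the role that a broadcast variable's value would play on a cluster.

```python
INIT_CALLS = 0


def build_lookup(campaigns):
    """Stand-in for expensive setup (e.g. loading a dictionary); counts its own calls."""
    global INIT_CALLS
    INIT_CALLS += 1
    return {name: idx for idx, name in enumerate(campaigns)}


def process_partition(records, campaigns):
    """Iterator in, iterator out: initialize once, then stream the records."""
    lookup = build_lookup(campaigns)  # once per partition, like Mapper.setup()
    for cookie, campaign in records:
        yield cookie, lookup.get(campaign, -1)


CAMPAIGNS = ["Star Wars", "Other"]  # hypothetical shared read-only data
PARTITIONS = [
    [("c1", "Star Wars"), ("c2", "Other")],
    [("c3", "Star Wars")],
]

result = [pair
          for part in PARTITIONS
          for pair in process_partition(iter(part), CAMPAIGNS)]
print(result)      # prints [('c1', 0), ('c2', 1), ('c3', 0)]
print(INIT_CALLS)  # prints 2: one initialization per partition, not per record
```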
MR vs Spark :: Parallelism
MR
✓ mapred.reduce.tasks
✓ mapreduce.job.reduces
✓ splittable formats
Spark
✓ spark.default.parallelism
✓ num-executors, executor-cores in YARN
✓ numTasks in groupByKey, reduceByKey, aggregateByKey…
MR vs Spark :: Dependencies
MR
✓ o.a.h.u.Tool
✓ o.a.h.u.ToolRunner
✓ -conf app.conf
✓ -files
✓ -libjars
✓ setUserClassesTakesPrecedence
Spark
✓ --jars
✓ --files
✓ --conf
✓ --driver-java-options
✓ spark.driver.extraJavaOptions
✓ spark.executor.extraJavaOptions
✓ spark.driver.userClassPathFirst
✓ spark.executor.userClassPathFirst
MR vs Spark :: Secondary Sort
MR
✓ setSortComparatorClass
✓ setGroupingComparatorClass
✓ setPartitionerClass
Spark
✓ repartitionAndSortWithinPartitions
✓ mapPartitions
✓ The entire result of processing a partition must be able to fit in memory
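The secondary-sort pattern can be illustrated in plain Python; this is a toy model, not Spark's implementation. Records are partitioned by cookie only, each partition is sorted by (cookie, timestamp), and a per-partition pass then sees every user's events consecutively and in time order. The helper names and the use of crc32 as a partitioner are assumptions made for the example.

```python
from itertools import groupby
from zlib import crc32


def partition_of(cookie, num_partitions):
    """Deterministic partitioner on the cookie only (the grouping part of the key)."""
    return crc32(cookie.encode("utf-8")) % num_partitions


def repartition_and_sort(events, num_partitions):
    """Place each event by cookie, then sort every partition by (cookie, timestamp)."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        partitions[partition_of(event[0], num_partitions)].append(event)
    return [sorted(p, key=lambda e: (e[0], e[1])) for p in partitions]


def event_paths(partition):
    """Stream one sorted partition; consecutive rows with the same cookie form a group."""
    for cookie, rows in groupby(partition, key=lambda e: e[0]):
        yield cookie, [event_type for _, _, event_type in rows]


EVENTS = [
    ("c1", "2015-04-22 18:57:19", "tr. pixel"),
    ("c2", "2015-04-20 10:00:00", "impression"),
    ("c1", "2015-04-20 14:25:11", "impression"),
    ("c1", "2015-04-21 06:31:12", "click"),
]

paths = dict(pair
             for part in repartition_and_sort(EVENTS, 2)
             for pair in event_paths(part))
print(paths["c1"])  # prints ['impression', 'click', 'tr. pixel']
```

Because the partitioner looks at the cookie alone while the sort uses the full (cookie, timestamp) key, no per-user list has to be sorted after grouping.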
MR vs Spark :: Statistics
MR
✓ Counters
Spark
✓ Accumulators: use in actions only
Spark guarantees that an accumulator update is applied exactly once only when it is made inside an action, not inside a transformation.
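A toy model in plain Python of why counting inside a transformation can overcount. LazyMap and ToyAccumulator are hypothetical stand-ins, not Spark classes. The model covers only recomputation of an uncached transformation by two actions, not task retries or speculative execution.

```python
class ToyAccumulator:
    """Minimal add-only counter, standing in for an accumulator."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


class LazyMap:
    """Toy stand-in for an uncached transformation: recomputed on every action."""

    def __init__(self, func, data):
        self.func, self.data = func, data

    def collect(self):  # an "action"
        return [self.func(x) for x in self.data]

    def count(self):  # another "action"; it runs the map again
        return len(self.collect())


seen = ToyAccumulator()


def tag(x):
    seen.add(1)  # update made inside the transformation
    return x * 2


mapped = LazyMap(tag, [1, 2, 3])
mapped.collect()
mapped.count()
print(seen.value)  # prints 6, although there are only 3 records

applied = ToyAccumulator()
for _ in LazyMap(lambda x: x * 2, [1, 2, 3]).collect():
    applied.add(1)  # update made while consuming the action's result
print(applied.value)  # prints 3
```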
MR vs Spark :: Testing
MR
✓ MRUnit
✓ o.a.h.h.MiniDFSCluster
✓ o.a.h.m.MiniMRCluster
✓ o.a.h.y.s.MiniYARNCluster
✓ o.a.h.m.v2.MiniMRYarnCluster
Spark
✓ Local executor
What's next, and why Spark?
• Spark Streaming;
• Micro Batches;
• λ-architecture.
without major surgery
[email protected] :: [email protected]
cleverleaf.co.uk :: cleverdata.ru
1dmp.io :: crawler.1dmp.io
facebook.com/CleverData :: +7 (495) 967-66-50