Big Data Real-Time Analytics in Erlang

Transcript
Page 1: Big Data Real-Time Analytics in Erlang

Big Data Real-Time Analytics

Knut Nesheim · @knutin · GameAnalytics

Page 2: Big Data Real-Time Analytics in Erlang

Topics

‣ What we do

‣ What we want

‣ Our solution

Page 3: Big Data Real-Time Analytics in Erlang

Me

‣ Klarna

‣ Wooga

‣ Game Analytics

‣ github.com/knutin

Page 4: Big Data Real-Time Analytics in Erlang

Analytics SaaS for games

Page 5: Big Data Real-Time Analytics in Erlang

Game Analytics

‣ Startup, venture funded, ~2 years old

‣ HQ in Copenhagen, engineering in Berlin

‣ Need to move fast

‣ Willing to take on technical debt

‣ Be ready for traffic growth

‣ Big games are big, millions of DAU

Page 6: Big Data Real-Time Analytics in Erlang

GA.Event.Business("sheep", "gold", 200);

Page 7: Big Data Real-Time Analytics in Erlang

Metrics

‣ Daily Active Users, Monthly Active Users

‣ Revenue

‣ Histogram of event values
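As a rough illustration of the three metrics above, the sketch below computes them from a list of events in one pass each. The event shape {UserId, EventId, Value} and the module name metrics_sketch are assumptions made for this example, not something shown in the slides, and every event is treated as a business (revenue) event.

```erlang
-module(metrics_sketch).
-export([summarize/1]).

%% Events are {UserId, EventId, Value} tuples (a hypothetical shape).
%% Returns the number of distinct users, the summed value, and a
%% histogram mapping each value to how often it occurred.
summarize(Events) ->
    Users = lists:usort([U || {U, _, _} <- Events]),
    Revenue = lists:sum([V || {_, _, V} <- Events]),
    Histogram = lists:foldl(
        fun({_, _, V}, Acc) ->
            maps:update_with(V, fun(N) -> N + 1 end, 1, Acc)
        end, #{}, Events),
    #{active_users => length(Users),
      revenue => Revenue,
      histogram => Histogram}.
```

Counting distinct users exactly, as done here with lists:usort/1, keeps every user id in memory, which is what the cardinality estimators later in the deck avoid.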

Page 8: Big Data Real-Time Analytics in Erlang

Idea: some real-time metrics

Suitable subset

Page 9: Big Data Real-Time Analytics in Erlang

Scope creep: real-time everything

GA v2.0

Page 10: Big Data Real-Time Analytics in Erlang

MapReduce architecture, v1.0

Page 11: Big Data Real-Time Analytics in Erlang

iOS SDK Android SDK Unity SDK

Collectors

HTTP

S3

MongoDB

Online

MapReduce

Offline

Special stuff

Website

Work queue

Page 12: Big Data Real-Time Analytics in Erlang

Streaming architecture, v2.0

Page 13: Big Data Real-Time Analytics in Erlang

iOS SDK Android SDK Unity SDK

Collectors

S3

HTTP

Stream DynamoDB

Query API

History

Real-time

Website Customers

Page 14: Big Data Real-Time Analytics in Erlang

2013-10-14 2013-10-15

DynamoDB RAM

Today

Page 15: Big Data Real-Time Analytics in Erlang

Streaming

‣ Partition on game

‣ One process per game, autonomous

‣ Prototype very promising
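One way to picture "partition on game, one process per game" is a router that lazily spawns a process for each game id and forwards that game's events to it. This is a minimal sketch under assumptions: the module name game_router, the message shapes, and the per-game state (a plain event counter) are all made up for illustration. It also uses maps, which were not yet available in Erlang when this talk was given.

```erlang
-module(game_router).
-export([start/0, route/3, count/2]).

%% Start a router process that owns a map of GameId -> Pid.
start() ->
    spawn(fun() -> router_loop(#{}) end).

%% Send one event for GameId through the router (asynchronous).
route(Router, GameId, Event) ->
    Router ! {event, GameId, Event},
    ok.

%% Ask how many events the process for GameId has seen so far.
count(Router, GameId) ->
    Router ! {count, GameId, self()},
    receive
        {count, GameId, N} -> N
    after 1000 ->
        timeout
    end.

router_loop(Games) ->
    receive
        {event, GameId, Event} ->
            {Pid, Games1} = lookup_or_spawn(GameId, Games),
            Pid ! {event, Event},
            router_loop(Games1);
        {count, GameId, From} ->
            case maps:find(GameId, Games) of
                {ok, Pid} -> Pid ! {count, GameId, From};
                error -> From ! {count, GameId, 0}
            end,
            router_loop(Games)
    end.

%% Each game gets its own process the first time an event for it arrives.
lookup_or_spawn(GameId, Games) ->
    case maps:find(GameId, Games) of
        {ok, Pid} ->
            {Pid, Games};
        error ->
            Pid = spawn(fun() -> game_loop(0) end),
            {Pid, maps:put(GameId, Pid, Games)}
    end.

%% The autonomous per-game process; its state here is just a counter.
game_loop(N) ->
    receive
        {event, _Event} ->
            game_loop(N + 1);
        {count, GameId, From} ->
            From ! {count, GameId, N},
            game_loop(N)
    end.
```

Because each game's state lives in its own process, a slow or crashing game does not block the others, and games can later be moved to different nodes without changing the routing interface.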

Page 16: Big Data Real-Time Analytics in Erlang

Implementation

Page 17: Big Data Real-Time Analytics in Erlang

Game process

‣ Process files sequentially

‣ Keep running results in RAM

‣ Flush to DB when window closes

‣ Answer real-time queries

‣ Put all in one node
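The bullets above (running results in RAM, flush when the window closes, answer real-time queries) can be sketched as a small piece of functional state that a game process would carry in its loop. Everything named here is hypothetical: the module game_window, the totals it tracks, and the FlushFun callback, which stands in for the database write.

```erlang
-module(game_window).
-export([new/1, add/3, query/1]).

%% State: the current window id (for example a date), running totals
%% kept in RAM, and a flush fun standing in for the database write.
new(FlushFun) ->
    #{window => undefined, revenue => 0, events => 0, flush => FlushFun}.

%% Add one event belonging to Window. If the event opens a new window,
%% the finished window is flushed first and the totals are reset.
add(Window, Amount, #{window := Window} = S) ->
    accumulate(Amount, S);
add(Window, Amount, #{window := undefined} = S) ->
    accumulate(Amount, S#{window := Window});
add(Window, Amount, #{window := Old, flush := Flush} = S) ->
    Flush(Old, query(S)),
    accumulate(Amount, S#{window := Window, revenue := 0, events := 0}).

accumulate(Amount, #{revenue := R, events := E} = S) ->
    S#{revenue := R + Amount, events := E + 1}.

%% Real-time query: answered from RAM without touching the database.
query(#{revenue := R, events := E}) ->
    #{revenue => R, events => E}.
```

In this sketch a window closes only when an event for a later window arrives; a real process would presumably also close windows on a timer.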

Page 18: Big Data Real-Time Analytics in Erlang

“Metric DSL”

Page 19: Big Data Real-Time Analytics in Erlang

HyperLogLog

‣ Estimate cardinality

‣ Clever hash tricks

‣ Millions unique with <1% error in ~40k words

‣ Unions

‣ “hyper”: Erlang HLL++ from Google paper[0]

‣ Will open source Soon (TM)

[0]: “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm”
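To make the "clever hash tricks" and "unions" bullets concrete, here is a minimal sketch of plain HyperLogLog, not the HLL++ variant from the cited paper and not the "hyper" library itself. The module name hll_sketch is made up, registers are stored in a map for brevity rather than compactly, erlang:phash2/2 is used as a convenient 32-bit hash, and the large-range correction is omitted.

```erlang
-module(hll_sketch).
-export([new/1, insert/2, union/2, estimate/1]).

-define(HASH_BITS, 32).

%% New sketch with 2^P registers. Registers live in a map Index -> Rank;
%% an absent key means the register is still 0.
new(P) when P >= 7, P =< 16 ->
    {hll, P, #{}}.

%% The top P hash bits pick a register; the register keeps the largest
%% rank (position of the first 1-bit) seen in the remaining bits.
insert(Term, {hll, P, Regs}) ->
    Hash = erlang:phash2(Term, 1 bsl ?HASH_BITS),
    RestBits = ?HASH_BITS - P,
    Index = Hash bsr RestBits,
    Rest = Hash band ((1 bsl RestBits) - 1),
    Rank = rank(Rest, RestBits, 1),
    {hll, P, Regs#{Index => max(Rank, maps:get(Index, Regs, 0))}}.

%% 1-based position of the leftmost 1-bit in a Bits-wide word,
%% or Bits + 1 if the word is all zeros.
rank(_Rest, 0, Acc) -> Acc;
rank(Rest, Bits, Acc) ->
    case (Rest bsr (Bits - 1)) band 1 of
        1 -> Acc;
        0 -> rank(Rest, Bits - 1, Acc + 1)
    end.

%% Union of two sketches with the same precision: register-wise maximum.
union({hll, P, A}, {hll, P, B}) ->
    Merged = maps:fold(
        fun(I, R, Acc) -> Acc#{I => max(R, maps:get(I, Acc, 0))} end,
        A, B),
    {hll, P, Merged}.

%% Raw HyperLogLog estimate, with linear counting for small cardinalities.
estimate({hll, P, Regs}) ->
    M = 1 bsl P,
    Zeros = M - maps:size(Regs),
    Sum = maps:fold(fun(_I, R, Acc) -> Acc + math:pow(2, -R) end,
                    float(Zeros), Regs),
    Alpha = 0.7213 / (1 + 1.079 / M),
    Raw = Alpha * M * M / Sum,
    if
        Raw =< 2.5 * M, Zeros > 0 -> M * math:log(M / Zeros);
        true -> Raw
    end.
```

The union is what makes the structure useful for time windows: a sketch built per day can be merged into a weekly or monthly one, and the result is identical to a sketch built from all the underlying values.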

Page 20: Big Data Real-Time Analytics in Erlang

Recordinality[0]

‣ Estimate frequency of values

‣ User session count

‣ Keeps a reservoir, clever hash tricks

‣ Will open source Soon (TM)

[0]: “Data Streams as Random Permutations: the Distinct Element Problem”
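The sketch below shows only the core estimator from the cited paper as I understand it: keep the k largest hash values seen, count how often that set changes, and derive a distinct-count estimate from that count. How the slides build value frequencies or session counts on top of the reservoir is not shown in the source, so it is not attempted here. The module name recordinality and the use of erlang:phash2/2 are assumptions.

```erlang
-module(recordinality).
-export([new/1, insert/2, estimate/1]).

%% Sketch: {rec, K, Records, Hashes}. Hashes is an ordered set (sorted
%% list) of at most K hash values; Records counts how many times a new
%% hash entered the set.
new(K) when K > 0 ->
    {rec, K, 0, []}.

insert(Term, {rec, K, R, S} = Sketch) ->
    H = erlang:phash2(Term, 1 bsl 32),
    case ordsets:is_element(H, S) of
        true ->
            Sketch;                                   % already in the set
        false when length(S) < K ->
            {rec, K, R + 1, ordsets:add_element(H, S)};
        false when H > hd(S) ->
            %% H beats the current minimum: evict it and count a record.
            {rec, K, R + 1, ordsets:add_element(H, tl(S))};
        false ->
            Sketch
    end.

%% Fewer than K distinct hashes seen: the set size is the exact count.
%% Otherwise use the estimator K * (1 + 1/K)^(Records - K + 1) - 1.
estimate({rec, K, _R, S}) when length(S) < K ->
    length(S);
estimate({rec, K, R, _S}) ->
    K * math:pow(1 + 1 / K, R - K + 1) - 1.
```

The retained set doubles as a random sample of distinct values, which is presumably the "reservoir" the slide refers to.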

Page 21: Big Data Real-Time Analytics in Erlang

Next step: More parallelization

Page 22: Big Data Real-Time Analytics in Erlang

loop(State) ->
    NewState = process(next_file(), State),
    loop(NewState).

Game #123

Page 23: Big Data Real-Time Analytics in Erlang

Worker #1

Part = process(next_file(), empty()),
game_123 ! Part.

loop(State) ->
    receive
        Part ->
            NewState = merge(Part, State),
            loop(NewState)
    end.

Game #123
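The worker and game snippets above can be turned into something runnable under a few assumptions. Here a "file" is just a list of {Metric, Value} events, processing sums values per metric, and merging adds the partial sums. The function names process/2, merge/2 and empty/0 mirror the slide, but their bodies and the module name partial_merge are invented for this sketch.

```erlang
-module(partial_merge).
-export([run/1, process/2, merge/2, empty/0]).

empty() -> #{}.

%% "Process a file": fold its events into per-metric sums.
process(Events, State) ->
    lists:foldl(
        fun({Metric, Value}, Acc) ->
            maps:update_with(Metric, fun(V) -> V + Value end, Value, Acc)
        end, State, Events).

%% Merge a partial result into the running state by adding the sums.
merge(Part, State) ->
    maps:fold(
        fun(Metric, Value, Acc) ->
            maps:update_with(Metric, fun(V) -> V + Value end, Value, Acc)
        end, State, Part).

%% Spawn one worker per file. Each worker starts from empty() and sends
%% its partial result to the calling process, which plays the game role.
run(Files) ->
    Game = self(),
    Workers = [spawn(fun() -> Game ! {part, self(), process(F, empty())} end)
               || F <- Files],
    collect(Workers, empty()).

collect([], State) ->
    State;
collect([Pid | Rest], State) ->
    receive
        {part, Pid, Part} -> collect(Rest, merge(Part, State))
    end.
```

The design only works if merge/2 gives the same result regardless of the order in which parts arrive, which holds for sums and for mergeable sketches such as HyperLogLog.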

Page 24: Big Data Real-Time Analytics in Erlang

Worker #1

Game #123

Worker #2 Worker #N

Manager

Page 25: Big Data Real-Time Analytics in Erlang

Mapper #1

Reducer #123

Mapper #2 Mapper #N

Scheduler

Page 26: Big Data Real-Time Analytics in Erlang

Scheduler

‣ Take ideas from Hadoop, Riak Pipe

‣ Manage limited resources

‣ Distribute work across nodes

‣ Colocate processing with state storage
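Of the scheduler goals above, "manage limited resources" is the easiest to sketch in a few lines: run a queue of jobs with at most Max of them executing at once. The module bounded_sched is hypothetical and deliberately ignores distribution across nodes and colocation with state, which the slides leave open.

```erlang
-module(bounded_sched).
-export([run/2]).

%% Run every zero-arity fun in Jobs with at most Max running at once.
%% Results are returned in the same order as Jobs.
run(Jobs, Max) when Max > 0 ->
    Indexed = lists:zip(lists:seq(1, length(Jobs)), Jobs),
    Results = loop(Indexed, 0, Max, []),
    [R || {_I, R} <- lists:keysort(1, Results)].

%% Nothing queued and nothing running: done.
loop([], 0, _Max, Acc) ->
    Acc;
%% A slot is free and work is queued: start the next job.
loop([{I, Job} | Rest], Running, Max, Acc) when Running < Max ->
    Parent = self(),
    spawn_link(fun() -> Parent ! {done, I, Job()} end),
    loop(Rest, Running + 1, Max, Acc);
%% All slots busy, or queue empty with jobs still running: wait.
loop(Queue, Running, Max, Acc) ->
    receive
        {done, I, Result} ->
            loop(Queue, Running - 1, Max, [{I, Result} | Acc])
    end.
```

For example, bounded_sched:run([fun() -> X * X end || X <- [1, 2, 3]], 2) runs at most two of the three jobs concurrently and returns their results in job order.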

Page 27: Big Data Real-Time Analytics in Erlang

Implementation

‣ DIY?

‣ riak_core for state processes?

‣ riak_pipe for managing work?

Page 28: Big Data Real-Time Analytics in Erlang

Conclusion

Page 29: Big Data Real-Time Analytics in Erlang

Erlang: The bad parts

‣ Big process state not great

‣ Lots of updates, lots of garbage

Page 30: Big Data Real-Time Analytics in Erlang

Erlang: The good parts

‣ Total allocated RAM >30GB

‣ Per process heap is gold

‣ Easy parallelization, distribution

‣ Looking forward to maps!

Page 31: Big Data Real-Time Analytics in Erlang

Questions?

Knut Nesheim · @knutin · GameAnalytics