Running A Realtime Stats Service On My Sql

39
Running a Realtime Stats Service on MySQL Cybozu Labs, Inc. Kazuho Oku

description

 

Transcript of Running A Realtime Stats Service On My Sql

Page 1: Running A Realtime Stats Service On My Sql

Running a Realtime Stats Service on MySQL

Cybozu Labs, Inc. Kazuho Oku

Page 2: Running A Realtime Stats Service On My Sql

Background

Apr. 23 2009 Running Realtime Stats Service on MySQL 2

Page 3: Running A Realtime Stats Service On My Sql

Who am I?

 Name: Kazuho Oku (奥 一穂)  Original Developer of Palmscape / Xiino

 The oldest web browser for Palm OS

 Working at Cybozu Labs since 2005  Research subsidiary of Cybozu, Inc.  Cybozu is a leading groupware vendor in Japan  My weblog: tinyurl.com/kazuho

Apr. 23 2009 Running Realtime Stats Service on MySQL 3

Page 4: Running A Realtime Stats Service On My Sql

Introduction of Pathtraq

Apr. 23 2009 Running Realtime Stats Service on MySQL 4

Page 5: Running A Realtime Stats Service On My Sql

What is Pathtraq?

 Started in Aug. 2007  Web ranking service

 One of Japan’s largest  ~10,000 users submit access information  ~1,000,000 access infomation per day

 like Alexa, but semi-realtime, and per-page

Apr. 23 2009 Running Realtime Stats Service on MySQL 5

Page 6: Running A Realtime Stats Service On My Sql

What is Pathtraq? (cont'd)

 Automated Social News Service  find what's hot  like Google News + Digg  calculate relevance from access stats

 Search by...  no filtering (all the Internet)  by category  by keyword  by URL (per-domain, etc.)

Apr. 23 2009 Running Realtime Stats Service on MySQL 6

Page 7: Running A Realtime Stats Service On My Sql
Page 8: Running A Realtime Stats Service On My Sql
Page 9: Running A Realtime Stats Service On My Sql

How to Provide Real-time Analysis?

 Data Set (as of Apr. 23 2009)  # of URLs: 147,748,546  # of total accesses: 413,272,527

 Sharding is not a good option  since we need to join the tables and aggregate

 prefix-search by URL, search by keyword, then join with access data table

 core tables should be stored on RAM  not on HDD, due to lots of random access

Apr. 23 2009 Running Realtime Stats Service on MySQL 9

Page 10: Running A Realtime Stats Service On My Sql

Our Decision was to...

 Keep URL and access stats on RAM  compression for size and speed

 Create a new message queue  Limit Pre-computation Load  Create our own cache, with locks

 to minimize database access

 Fulltext-search database on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 10

Page 11: Running A Realtime Stats Service On My Sql

Our Servers

 Main Server  Opteron 2218 x2, 64GB Mem  MySQL, Apache

 Fulltext Search Server  Opteron 240EE, 2GB Mem, Intel SSD  MySQL (w. Tritonn/Senna)

 Helper Servers  for Content Analysis  for Screenshot Generation

Apr. 23 2009 Running Realtime Stats Service on MySQL 11

Page 12: Running A Realtime Stats Service On My Sql

The Long Tail of the Internet

y=C・x-0.44

# of URLs with 1/10 hits: x2.75

Apr. 23 2009 Running Realtime Stats Service on MySQL 12

Page 13: Running A Realtime Stats Service On My Sql

Compressing URLs

Apr. 23 2009 Running Realtime Stats Service on MySQL 13

Page 14: Running A Realtime Stats Service On My Sql

Compressing URLs

 The Challenges:  URLs are too short for gzip, etc.  URLs should be prefix-searchable in compressed

form  How to run like 'http://www.mysql.com/%' on a

compressed URL?

 The Answer:  Static PPM + Range Coder

Apr. 23 2009 Running Realtime Stats Service on MySQL 14

Page 15: Running A Realtime Stats Service On My Sql

Static PPM

 PPM: Prediction by Partial Matching  What is the next character after ".co"?

 The answer is "m"!

 PPM is used by 7-zip, etc.

 Static PPM is PPM with static probabilistic model  Many URLs (or English words) have common

patterns  Suitable for short texts (like URLs)

Apr. 23 2009 Running Realtime Stats Service on MySQL 15

Page 16: Running A Realtime Stats Service On My Sql

Range Coder

 A fast variant of arithmetic compression  similar to huffmann encoding, but better  If probability of next character being "m" was

75%, it will be encoded into 0.42 bit

 Compressed strings preserve the sort order of uncompressed form

Apr. 23 2009 Running Realtime Stats Service on MySQL 16

Page 17: Running A Realtime Stats Service On My Sql

Create Compression Functions

 Build prediction table from stored URLs  Implement range coder

 took an open-source impl. and optimized it  original impl. added some bits unnecessary at the tail  use SSE instructions for faster operation  coderepos.org/share/browser/lang/cplusplus/range_coder

 Link the coder and the table to create MySQL UDFs

Apr. 23 2009 Running Realtime Stats Service on MySQL 17

Page 18: Running A Realtime Stats Service On My Sql

Rewriting the Server Logic

 Change schema url varchar(255) not null # with unique index ↓ urlc varbinary(767) not null # with unique index

 Change prefix-search form url like 'http://example.com/%' ↓ url_compress('http://example.com/')<=urlc and

urlc<url_compress('http://example.com0')

Note: "0" is next character of '/' Apr. 23 2009 Running Realtime Stats Service on MySQL 18

Page 19: Running A Realtime Stats Service On My Sql

Compression Ratio

 Compression ratio: 37%  Size of prediction table: 4MB

 Benchmark of the compression functions  compression: 40MB/sec. (570k URLs/sec.)  decompression: 19.3MB/sec. (280k URLs/sec.)  fast enough since searchable in compressed form

 Prefix-search became faster  shorter indexes lead to faster operation

Apr. 23 2009 Running Realtime Stats Service on MySQL 19

Page 20: Running A Realtime Stats Service On My Sql

Apr. 23 2009 Running Realtime Stats Service on MySQL 20

Re InnoDB Compression

 URL Compression can coexist with InnoDB compression

 though we aren't using InnoDB compression on our production environment

Compression Table Size N/A 100% URL compression 57% InnoDB compression 50% using both 33%

Page 21: Running A Realtime Stats Service On My Sql

Compressing the Stats Table

 Used to have two int columns: at, cnt  it was waste of space, since...

 most cnt values are very small numbers  most accesses to each URL occur on a short period (ex.

the day the blog entry was written)  at field should be part of the indexes

Apr. 23 2009 Running Realtime Stats Service on MySQL 21

at (hours since epoch) cnt (# of hits)

330168 1

330169 2

330173 1

330197 1

Page 22: Running A Realtime Stats Service On My Sql

Compressing the Stats Table (cont'd)

 Merge the rows into a sparse array  example on the prev. page becomes: (offset=330197),1,0(repeated 23 times),1,2,1

 Then compress the array  the example becomes a blob of 8 bytes  originally was 8 bytes x 4 rows with index

 And store the array in a single column  fewer rows lead to smaller table, faster access

Apr. 23 2009 Running Realtime Stats Service on MySQL 22

Page 23: Running A Realtime Stats Service On My Sql

Compressing the Stats Table (cont'd)

 Write MySQL UDFs to access the sparse array

 cnt_add(column,at,cnt) -- adds cnt on given index (at)  cnt_between(column,from,to) -- returns # of hits between given hours  and more...

 We use int[N] arrays for vectorized calc.  especially when creating access charts

Apr. 23 2009 Running Realtime Stats Service on MySQL 23

Page 24: Running A Realtime Stats Service On My Sql

Create a new Message Queue

Apr. 23 2009 Running Realtime Stats Service on MySQL 24

Page 25: Running A Realtime Stats Service On My Sql

Q4M

 A simple, reliable, fast message queue  runs as a pluggable storage engine of MySQL  GPL License; q4m.31tools.com  presented yesterday at MySQL Conference :-p

 slides at tinyurl.com/q4m2009

 Used for relaying messages between our servers

Apr. 23 2009 Running Realtime Stats Service on MySQL 25

Page 26: Running A Realtime Stats Service On My Sql

Limiting Pre-computation Load

Apr. 23 2009 Running Realtime Stats Service on MySQL 26

Page 27: Running A Realtime Stats Service On My Sql

Limit # of CPU-intensive Pre-computations

 Use cron & setlock  setlock is part of daemontools by djb

 setlock  serializes processes by using flock  -n option: use trylock; if locked, do nothing

# use only one CPU core for pre-computation */2 * * * * setlock –n /tmp/tasks.lock precompute_hot_entries 5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data

Apr. 23 2009 Running Realtime Stats Service on MySQL 27

Page 28: Running A Realtime Stats Service On My Sql

Limit # of Disk-intensive Pre-computations

 Divide pre-computation to blocks and sleep depending on the elapsed time

my $LOAD = 0.25;

while (true) { my $start = time(); precompute_block(); sleep(min(time - $start, 0) * (1 - $LOAD) / $LOAD); }

Apr. 23 2009 Running Realtime Stats Service on MySQL 28

Page 29: Running A Realtime Stats Service On My Sql

Creating our own Cache System

Apr. 23 2009 Running Realtime Stats Service on MySQL 29

Page 30: Running A Realtime Stats Service On My Sql

The Problem

 Query cache is flushed on table update  access stats can be (should be) cached for a

certain period

 Memcached has a thundering-herd problem  all clients try to read the database when a

cached-entry expires  critical for us since our queries does joins,

aggregations, and sort operations

Apr. 23 2009 Running Realtime Stats Service on MySQL 30

Page 31: Running A Realtime Stats Service On My Sql

Swifty and KeyedMutex

 Swifty is a mmap-based cache  cached data shared between processes  lock-free on read, flock on write  notifies a single client that the accessed entry is

going to expire within few seconds  notified client can start updating a cache entry

before it expires

 KeyedMutex  a daemon used to block multiple clients issuing

same SQL queries Apr. 23 2009 Running Realtime Stats Service on MySQL 31

Page 32: Running A Realtime Stats Service On My Sql

Swifty and KeyedMutexd (cont'd)

 Source codes are available:  coderepos.org/share/browser/lang/c/swifty  coderepos.org/share/browser/lang/perl/Cache-Swifty  coderepos.org/share/browser/lang/perl/KeyedMutex

Apr. 23 2009 Running Realtime Stats Service on MySQL 32

Page 33: Running A Realtime Stats Service On My Sql

Fulltext-search on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 33

Page 34: Running A Realtime Stats Service On My Sql

Senna / Tritonn

 Senna is a FTS engine popular in Japan  might not work well with European languages

 Tritonn is a replacement of MyISAM FTS  uses Senna as backend  faster than MyISAM FTS

 Wrote patches to support SSD  during our transition from RAM to SSD  patches accepted in Senna 1.1.4 / Tritonn 1.0.12

Apr. 23 2009 Running Realtime Stats Service on MySQL 34

Page 35: Running A Realtime Stats Service On My Sql

FTS: RAM-based vs. SSD-based

 Size of FTS data: ~ 20GB  Downgraded hardware to see if SSD-

based FTS is feasible  Speed became ¼

 but latency of searches are well below one second

Apr. 23 2009 Running Realtime Stats Service on MySQL 35

Old Hardware New Hardware

CPU Opteron 2218 (2.6GHz) x2 Opteron 240 (1.4GHz)

Memory 32GB 2GB

Storage 7200rpm SATA HDD SSD (Intel X25-M)

Page 36: Running A Realtime Stats Service On My Sql

Summary

Apr. 23 2009 Running Realtime Stats Service on MySQL 36

Page 37: Running A Realtime Stats Service On My Sql

Summary

 Use UDFs for optimization  Sometime it is easier to scale UP

 esp. when you can estimate your data growth

 Use SSD for FTS  Baidu (China's leading search engine) uses SSD

 Most of the things introduced are OSS  We plan to open-source our URL compression

table as well

Apr. 23 2009 Running Realtime Stats Service on MySQL 37

Page 38: Running A Realtime Stats Service On My Sql

We are Looking for...

 If you are interested in localizing Pathtraq to your country, please contact us  we do not have resources outside of Japan

 to translate the web interface  to ask people to install our browser extension  to follow local regulations, etc.

Apr. 23 2009 Running Realtime Stats Service on MySQL 38

Page 39: Running A Realtime Stats Service On My Sql

Thank you for listening

tinyurl.com/kazuho

Apr. 23 2009 Running Realtime Stats Service on MySQL 39