Running a Realtime Stats Service on MySQL

39
Running a Realtime Stats Service on MySQL Cybozu Labs, Inc. Kazuho Oku

description

Slides used at Percona Performance Conference. Describes the optimizations / tweeks used on running pathtraq.com, one of Japan's largest web stats service.

Transcript of Running a Realtime Stats Service on MySQL

Page 1: Running a Realtime Stats Service on MySQL

Running a Realtime Stats Service on

MySQL

Running a Realtime Stats Service on

MySQL

Cybozu Labs, Inc.Kazuho Oku

Page 2: Running a Realtime Stats Service on MySQL

Background

Apr. 23 2009 Running Realtime Stats Service on MySQL 2

Page 3: Running a Realtime Stats Service on MySQL

Who am I?Who am I?

Name: Kazuho Oku (奥 一穂 )Original Developer of Palmscape / XiinoThe oldest web browser for Palm OS

Working at Cybozu Labs since 2005Research subsidiary of Cybozu, Inc.Cybozu is a leading groupware vendor in Japan

My weblog: tinyurl.com/kazuhoApr. 23 2009 Running Realtime Stats Service on MySQL 3

Page 4: Running a Realtime Stats Service on MySQL

Introduction of Pathtraq

Apr. 23 2009 Running Realtime Stats Service on MySQL 4

Page 5: Running a Realtime Stats Service on MySQL

What is Pathtraq?What is Pathtraq?

Started in Aug. 2007Web ranking service

One of Japan’s largest〜 10,000 users submit access information〜 1,000,000 access infomation per day

like Alexa, but semi-realtime, and per-page

Apr. 23 2009 Running Realtime Stats Service on MySQL 5

Page 6: Running a Realtime Stats Service on MySQL

What is Pathtraq? (cont'd)What is Pathtraq? (cont'd)

Automated Social News Servicefind what's hotlike Google News + Diggcalculate relevance from access stats

Search by...no filtering (all the Internet)by categoryby keywordby URL (per-domain, etc.)

Apr. 23 2009 Running Realtime Stats Service on MySQL 6

Page 7: Running a Realtime Stats Service on MySQL

QuickTime˛ Ç∆ êLí£ÉvÉçÉOÉâÉÄ

ǙDZÇÃÉsÉNÉ`ÉÉÇ å©ÇÈÇΩÇflÇ…ÇÕïKóvÇ≈Ç∑ÅB

Page 8: Running a Realtime Stats Service on MySQL

QuickTime˛ Ç∆ êLí£ÉvÉçÉOÉâÉÄ

ǙDZÇÃÉsÉNÉ`ÉÉÇ å©ÇÈÇΩÇflÇ…ÇÕïKóvÇ≈Ç∑ÅB

Page 9: Running a Realtime Stats Service on MySQL

How to Provide Real-time Analysis?How to Provide Real-time Analysis?

Data Set (as of Apr. 23 2009)# of URLs: 147,748,546# of total accesses: 413,272,527

Sharding is not a good optionsince we need to join the tables and aggregateprefix-search by URL, search by keyword, then join with access data table

core tables should be stored on RAMnot on HDD, due to lots of random access

Apr. 23 2009 Running Realtime Stats Service on MySQL 9

Page 10: Running a Realtime Stats Service on MySQL

Our Decision was to...Our Decision was to...

Keep URL and access stats on RAMcompression for size and speed

Create a new message queueLimit Pre-computation LoadCreate our own cache, with locks

to minimize database access

Fulltext-search database on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 10

Page 11: Running a Realtime Stats Service on MySQL

Our ServersOur Servers

Main ServerOpteron 2218 x2, 64GB MemMySQL, Apache

Fulltext Search ServerOpteron 240EE, 2GB Mem, Intel SSDMySQL (w. Tritonn/Senna)

Helper Serversfor Content Analysisfor Screenshot Generation

Apr. 23 2009 Running Realtime Stats Service on MySQL 11

Page 12: Running a Realtime Stats Service on MySQL

The Long Tail of the InternetThe Long Tail of the Internet

y=C・ x-0.44

# of URLs with 1/10 hits: x2.75

Apr. 23 2009 Running Realtime Stats Service on MySQL 12

Page 13: Running a Realtime Stats Service on MySQL

Compressing URLs

Apr. 23 2009 Running Realtime Stats Service on MySQL 13

Page 14: Running a Realtime Stats Service on MySQL

Compressing URLsCompressing URLs

The Challenges:URLs are too short for gzip, etc.URLs should be prefix-searchable in compressed formHow to run like 'http://www.mysql.com/%' on a compressed URL?

The Answer:Static PPM + Range Coder

Apr. 23 2009 Running Realtime Stats Service on MySQL 14

Page 15: Running a Realtime Stats Service on MySQL

Static PPMStatic PPM

PPM: Prediction by Partial MatchingWhat is the next character after ".co"?

The answer is "m"!PPM is used by 7-zip, etc.

Static PPM is PPM with static probabilistic modelMany URLs (or English words) have common patterns

Suitable for short texts (like URLs)Apr. 23 2009 Running Realtime Stats Service on MySQL 15

Page 16: Running a Realtime Stats Service on MySQL

Range CoderRange Coder

A fast variant of arithmetic compressionsimilar to huffmann encoding, but better

If probability of next character being "m" was 75%, it will be encoded into 0.42 bit

Compressed strings preserve the sort order of uncompressed form

Apr. 23 2009 Running Realtime Stats Service on MySQL 16

Page 17: Running a Realtime Stats Service on MySQL

Create Compression FunctionsCreate Compression Functions

Build prediction table from stored URLs

Implement range codertook an open-source impl. and optimized itoriginal impl. added some bits unnecessary at the tail

use SSE instructions for faster operationcoderepos.org/share/browser/lang/cplusplus/range_coder

Link the coder and the table to create MySQL UDFsApr. 23 2009 Running Realtime Stats Service on MySQL 17

Page 18: Running a Realtime Stats Service on MySQL

Rewriting the Server LogicRewriting the Server Logic

Change schemaurl varchar(255) not null # with unique index

↓urlc varbinary(767) not null # with unique index

Change prefix-search formurl like 'http://example.com/%' ↓url_compress('http://example.com/')<=urlc and urlc<url_compress('http://example.com0')

Note: "0" is next character of '/'

Apr. 23 2009 Running Realtime Stats Service on MySQL 18

Page 19: Running a Realtime Stats Service on MySQL

Compression RatioCompression Ratio

Compression ratio: 37%Size of prediction table: 4MB

Benchmark of the compression functionscompression: 40MB/sec. (570k URLs/sec.)decompression: 19.3MB/sec. (280k URLs/sec.)

fast enough since searchable in compressed form

Prefix-search became fastershorter indexes lead to faster operationApr. 23 2009 Running Realtime Stats Service on MySQL 19

Page 20: Running a Realtime Stats Service on MySQL

Apr. 23 2009 Running Realtime Stats Service on MySQL 20

Re InnoDB CompressionRe InnoDB Compression

URL Compression can coexist with InnoDB compression

though we aren't using InnoDB compression on our production environment

Compression Table Size

N/A 100%URL compression 57%InnoDB compression

50%

using both 33%

Page 21: Running a Realtime Stats Service on MySQL

Compressing the Stats TableCompressing the Stats Table

Used to have two int columns: at, cntit was waste of space, since...

most cnt values are very small numbersmost accesses to each URL occur on a short period (ex. the day the blog entry was written)

at field should be part of the indexes

Apr. 23 2009 Running Realtime Stats Service on MySQL 21

at (hours since epoch)

cnt (# of hits)

330168 1

330169 2

330173 1

330197 1

Page 22: Running a Realtime Stats Service on MySQL

Compressing the Stats Table (cont'd)Compressing the Stats Table (cont'd)

Merge the rows into a sparse arrayexample on the prev. page becomes:

(offset=330197),1,0(repeated 23 times),1,2,1

Then compress the arraythe example becomes a blob of 8 bytesoriginally was 8 bytes x 4 rows with index

And store the array in a single columnfewer rows lead to smaller table, faster access

Apr. 23 2009 Running Realtime Stats Service on MySQL 22

Page 23: Running a Realtime Stats Service on MySQL

Compressing the Stats Table (cont'd)Compressing the Stats Table (cont'd)

Write MySQL UDFs to access the sparse array

cnt_add(column,at,cnt)-- adds cnt on given index

(at)cnt_between(column,from,to)

-- returns # of hits between given hours

and more...

We use int[N] arrays for vectorized calc.especially when creating access chartsApr. 23 2009 Running Realtime Stats Service on MySQL 23

Page 24: Running a Realtime Stats Service on MySQL

Create a new Message Queue

Apr. 23 2009 Running Realtime Stats Service on MySQL 24

Page 25: Running a Realtime Stats Service on MySQL

Q4MQ4M

A simple, reliable, fast message queueruns as a pluggable storage engine of MySQL

GPL License; q4m.31tools.compresented yesterday at MySQL Conference :-pslides at tinyurl.com/q4m2009

Used for relaying messages between our servers

Apr. 23 2009 Running Realtime Stats Service on MySQL 25

Page 26: Running a Realtime Stats Service on MySQL

Limiting Pre-computation Load

Apr. 23 2009 Running Realtime Stats Service on MySQL 26

Page 27: Running a Realtime Stats Service on MySQL

Limit # of CPU-intensive Pre-computationsLimit # of CPU-intensive Pre-computations

Use cron & setlocksetlock is part of daemontools by djb

setlockserializes processes by using flock-n option: use trylock; if locked, do nothing

# use only one CPU core for pre-computation*/2 * * * * setlock –n /tmp/tasks.lock precompute_hot_entries5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data

Apr. 23 2009 Running Realtime Stats Service on MySQL 27

Page 28: Running a Realtime Stats Service on MySQL

Limit # of Disk-intensive Pre-computationsLimit # of Disk-intensive Pre-computations

Divide pre-computation to blocks and sleep depending on the elapsed time

my $LOAD = 0.25;

while (true) { my $start = time(); precompute_block(); sleep(min(time - $start, 0) * (1 - $LOAD) / $LOAD);} Apr. 23 2009 Running Realtime Stats Service on MySQL 28

Page 29: Running a Realtime Stats Service on MySQL

Creating our own Cache System

Apr. 23 2009 Running Realtime Stats Service on MySQL 29

Page 30: Running a Realtime Stats Service on MySQL

The ProblemThe Problem

Query cache is flushed on table updateaccess stats can be (should be) cached for a certain period

Memcached has a thundering-herd problemall clients try to read the database when a cached-entry expires

critical for us since our queries does joins, aggregations, and sort operations

Apr. 23 2009 Running Realtime Stats Service on MySQL 30

Page 31: Running a Realtime Stats Service on MySQL

Swifty and KeyedMutexSwifty and KeyedMutex

Swifty is a mmap-based cachecached data shared between processeslock-free on read, flock on writenotifies a single client that the accessed entry is going to expire within few seconds

notified client can start updating a cache entry before it expires

KeyedMutexa daemon used to block multiple clients issuing same SQL queriesApr. 23 2009 Running Realtime Stats Service on MySQL 31

Page 32: Running a Realtime Stats Service on MySQL

Swifty and KeyedMutexd (cont'd)Swifty and KeyedMutexd (cont'd)

Source codes are available:coderepos.org/share/browser/lang/c/swiftycoderepos.org/share/browser/lang/perl/Cache-Swifty

coderepos.org/share/browser/lang/perl/KeyedMutex

Apr. 23 2009 Running Realtime Stats Service on MySQL 32

Page 33: Running a Realtime Stats Service on MySQL

Fulltext-search on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 33

Page 34: Running a Realtime Stats Service on MySQL

Senna / TritonnSenna / Tritonn

Senna is a FTS engine popular in Japanmight not work well with European languages

Tritonn is a replacement of MyISAM FTSuses Senna as backendfaster than MyISAM FTS

Wrote patches to support SSDduring our transition from RAM to SSDpatches accepted in Senna 1.1.4 / Tritonn 1.0.12

Apr. 23 2009 Running Realtime Stats Service on MySQL 34

Page 35: Running a Realtime Stats Service on MySQL

FTS: RAM-based vs. SSD-basedFTS: RAM-based vs. SSD-based

Size of FTS data: 〜 20GBDowngraded hardware to see if SSD-based FTS is feasible

Speed became ¼but latency of searches are well below one second

Apr. 23 2009 Running Realtime Stats Service on MySQL 35

Old Hardware New Hardware

CPU Opteron 2218 (2.6GHz) x2

Opteron 240 (1.4GHz)

Memory 32GB 2GB

Storage 7200rpm SATA HDD SSD (Intel X25-M)

Page 36: Running a Realtime Stats Service on MySQL

Summary

Apr. 23 2009 Running Realtime Stats Service on MySQL 36

Page 37: Running a Realtime Stats Service on MySQL

SummarySummary

Use UDFs for optimizationSometime it is easier to scale UP

esp. when you can estimate your data growth

Use SSD for FTSBaidu (China's leading search engine) uses SSD

Most of the things introduced are OSSWe plan to open-source our URL compression table as well

Apr. 23 2009 Running Realtime Stats Service on MySQL 37

Page 38: Running a Realtime Stats Service on MySQL

We are Looking for...We are Looking for...

If you are interested in localizing Pathtraq to your country, please contact uswe do not have resources outside of Japanto translate the web interfaceto ask people to install our browser extension

to follow local regulations, etc.

Apr. 23 2009 Running Realtime Stats Service on MySQL 38

Page 39: Running a Realtime Stats Service on MySQL

Thank you for listening

tinyurl.com/kazuho

Apr. 23 2009 Running Realtime Stats Service on MySQL 39