Running a Realtime Stats Service on MySQL
-
Upload
kazuho-oku -
Category
Technology
-
view
11.108 -
download
1
description
Transcript of Running a Realtime Stats Service on MySQL
Running a Realtime Stats Service on
MySQL
Running a Realtime Stats Service on
MySQL
Cybozu Labs, Inc.Kazuho Oku
Background
Apr. 23 2009 Running Realtime Stats Service on MySQL 2
Who am I?Who am I?
Name: Kazuho Oku (奥 一穂 )Original Developer of Palmscape / XiinoThe oldest web browser for Palm OS
Working at Cybozu Labs since 2005Research subsidiary of Cybozu, Inc.Cybozu is a leading groupware vendor in Japan
My weblog: tinyurl.com/kazuhoApr. 23 2009 Running Realtime Stats Service on MySQL 3
Introduction of Pathtraq
Apr. 23 2009 Running Realtime Stats Service on MySQL 4
What is Pathtraq?What is Pathtraq?
Started in Aug. 2007Web ranking service
One of Japan’s largest〜 10,000 users submit access information〜 1,000,000 access infomation per day
like Alexa, but semi-realtime, and per-page
Apr. 23 2009 Running Realtime Stats Service on MySQL 5
What is Pathtraq? (cont'd)What is Pathtraq? (cont'd)
Automated Social News Servicefind what's hotlike Google News + Diggcalculate relevance from access stats
Search by...no filtering (all the Internet)by categoryby keywordby URL (per-domain, etc.)
Apr. 23 2009 Running Realtime Stats Service on MySQL 6
QuickTime˛ Ç∆ êLí£ÉvÉçÉOÉâÉÄ
ǙDZÇÃÉsÉNÉ`ÉÉÇ å©ÇÈÇΩÇflÇ…ÇÕïKóvÇ≈Ç∑ÅB
QuickTime˛ Ç∆ êLí£ÉvÉçÉOÉâÉÄ
ǙDZÇÃÉsÉNÉ`ÉÉÇ å©ÇÈÇΩÇflÇ…ÇÕïKóvÇ≈Ç∑ÅB
How to Provide Real-time Analysis?How to Provide Real-time Analysis?
Data Set (as of Apr. 23 2009)# of URLs: 147,748,546# of total accesses: 413,272,527
Sharding is not a good optionsince we need to join the tables and aggregateprefix-search by URL, search by keyword, then join with access data table
core tables should be stored on RAMnot on HDD, due to lots of random access
Apr. 23 2009 Running Realtime Stats Service on MySQL 9
Our Decision was to...Our Decision was to...
Keep URL and access stats on RAMcompression for size and speed
Create a new message queueLimit Pre-computation LoadCreate our own cache, with locks
to minimize database access
Fulltext-search database on SSD
Apr. 23 2009 Running Realtime Stats Service on MySQL 10
Our ServersOur Servers
Main ServerOpteron 2218 x2, 64GB MemMySQL, Apache
Fulltext Search ServerOpteron 240EE, 2GB Mem, Intel SSDMySQL (w. Tritonn/Senna)
Helper Serversfor Content Analysisfor Screenshot Generation
Apr. 23 2009 Running Realtime Stats Service on MySQL 11
The Long Tail of the InternetThe Long Tail of the Internet
y=C・ x-0.44
# of URLs with 1/10 hits: x2.75
Apr. 23 2009 Running Realtime Stats Service on MySQL 12
Compressing URLs
Apr. 23 2009 Running Realtime Stats Service on MySQL 13
Compressing URLsCompressing URLs
The Challenges:URLs are too short for gzip, etc.URLs should be prefix-searchable in compressed formHow to run like 'http://www.mysql.com/%' on a compressed URL?
The Answer:Static PPM + Range Coder
Apr. 23 2009 Running Realtime Stats Service on MySQL 14
Static PPMStatic PPM
PPM: Prediction by Partial MatchingWhat is the next character after ".co"?
The answer is "m"!PPM is used by 7-zip, etc.
Static PPM is PPM with static probabilistic modelMany URLs (or English words) have common patterns
Suitable for short texts (like URLs)Apr. 23 2009 Running Realtime Stats Service on MySQL 15
Range CoderRange Coder
A fast variant of arithmetic compressionsimilar to huffmann encoding, but better
If probability of next character being "m" was 75%, it will be encoded into 0.42 bit
Compressed strings preserve the sort order of uncompressed form
Apr. 23 2009 Running Realtime Stats Service on MySQL 16
Create Compression FunctionsCreate Compression Functions
Build prediction table from stored URLs
Implement range codertook an open-source impl. and optimized itoriginal impl. added some bits unnecessary at the tail
use SSE instructions for faster operationcoderepos.org/share/browser/lang/cplusplus/range_coder
Link the coder and the table to create MySQL UDFsApr. 23 2009 Running Realtime Stats Service on MySQL 17
Rewriting the Server LogicRewriting the Server Logic
Change schemaurl varchar(255) not null # with unique index
↓urlc varbinary(767) not null # with unique index
Change prefix-search formurl like 'http://example.com/%' ↓url_compress('http://example.com/')<=urlc and urlc<url_compress('http://example.com0')
Note: "0" is next character of '/'
Apr. 23 2009 Running Realtime Stats Service on MySQL 18
Compression RatioCompression Ratio
Compression ratio: 37%Size of prediction table: 4MB
Benchmark of the compression functionscompression: 40MB/sec. (570k URLs/sec.)decompression: 19.3MB/sec. (280k URLs/sec.)
fast enough since searchable in compressed form
Prefix-search became fastershorter indexes lead to faster operationApr. 23 2009 Running Realtime Stats Service on MySQL 19
Apr. 23 2009 Running Realtime Stats Service on MySQL 20
Re InnoDB CompressionRe InnoDB Compression
URL Compression can coexist with InnoDB compression
though we aren't using InnoDB compression on our production environment
Compression Table Size
N/A 100%URL compression 57%InnoDB compression
50%
using both 33%
Compressing the Stats TableCompressing the Stats Table
Used to have two int columns: at, cntit was waste of space, since...
most cnt values are very small numbersmost accesses to each URL occur on a short period (ex. the day the blog entry was written)
at field should be part of the indexes
Apr. 23 2009 Running Realtime Stats Service on MySQL 21
at (hours since epoch)
cnt (# of hits)
330168 1
330169 2
330173 1
330197 1
Compressing the Stats Table (cont'd)Compressing the Stats Table (cont'd)
Merge the rows into a sparse arrayexample on the prev. page becomes:
(offset=330197),1,0(repeated 23 times),1,2,1
Then compress the arraythe example becomes a blob of 8 bytesoriginally was 8 bytes x 4 rows with index
And store the array in a single columnfewer rows lead to smaller table, faster access
Apr. 23 2009 Running Realtime Stats Service on MySQL 22
Compressing the Stats Table (cont'd)Compressing the Stats Table (cont'd)
Write MySQL UDFs to access the sparse array
cnt_add(column,at,cnt)-- adds cnt on given index
(at)cnt_between(column,from,to)
-- returns # of hits between given hours
and more...
We use int[N] arrays for vectorized calc.especially when creating access chartsApr. 23 2009 Running Realtime Stats Service on MySQL 23
Create a new Message Queue
Apr. 23 2009 Running Realtime Stats Service on MySQL 24
Q4MQ4M
A simple, reliable, fast message queueruns as a pluggable storage engine of MySQL
GPL License; q4m.31tools.compresented yesterday at MySQL Conference :-pslides at tinyurl.com/q4m2009
Used for relaying messages between our servers
Apr. 23 2009 Running Realtime Stats Service on MySQL 25
Limiting Pre-computation Load
Apr. 23 2009 Running Realtime Stats Service on MySQL 26
Limit # of CPU-intensive Pre-computationsLimit # of CPU-intensive Pre-computations
Use cron & setlocksetlock is part of daemontools by djb
setlockserializes processes by using flock-n option: use trylock; if locked, do nothing
# use only one CPU core for pre-computation*/2 * * * * setlock –n /tmp/tasks.lock precompute_hot_entries5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data
Apr. 23 2009 Running Realtime Stats Service on MySQL 27
Limit # of Disk-intensive Pre-computationsLimit # of Disk-intensive Pre-computations
Divide pre-computation to blocks and sleep depending on the elapsed time
my $LOAD = 0.25;
while (true) { my $start = time(); precompute_block(); sleep(min(time - $start, 0) * (1 - $LOAD) / $LOAD);} Apr. 23 2009 Running Realtime Stats Service on MySQL 28
Creating our own Cache System
Apr. 23 2009 Running Realtime Stats Service on MySQL 29
The ProblemThe Problem
Query cache is flushed on table updateaccess stats can be (should be) cached for a certain period
Memcached has a thundering-herd problemall clients try to read the database when a cached-entry expires
critical for us since our queries does joins, aggregations, and sort operations
Apr. 23 2009 Running Realtime Stats Service on MySQL 30
Swifty and KeyedMutexSwifty and KeyedMutex
Swifty is a mmap-based cachecached data shared between processeslock-free on read, flock on writenotifies a single client that the accessed entry is going to expire within few seconds
notified client can start updating a cache entry before it expires
KeyedMutexa daemon used to block multiple clients issuing same SQL queriesApr. 23 2009 Running Realtime Stats Service on MySQL 31
Swifty and KeyedMutexd (cont'd)Swifty and KeyedMutexd (cont'd)
Source codes are available:coderepos.org/share/browser/lang/c/swiftycoderepos.org/share/browser/lang/perl/Cache-Swifty
coderepos.org/share/browser/lang/perl/KeyedMutex
Apr. 23 2009 Running Realtime Stats Service on MySQL 32
Fulltext-search on SSD
Apr. 23 2009 Running Realtime Stats Service on MySQL 33
Senna / TritonnSenna / Tritonn
Senna is a FTS engine popular in Japanmight not work well with European languages
Tritonn is a replacement of MyISAM FTSuses Senna as backendfaster than MyISAM FTS
Wrote patches to support SSDduring our transition from RAM to SSDpatches accepted in Senna 1.1.4 / Tritonn 1.0.12
Apr. 23 2009 Running Realtime Stats Service on MySQL 34
FTS: RAM-based vs. SSD-basedFTS: RAM-based vs. SSD-based
Size of FTS data: 〜 20GBDowngraded hardware to see if SSD-based FTS is feasible
Speed became ¼but latency of searches are well below one second
Apr. 23 2009 Running Realtime Stats Service on MySQL 35
Old Hardware New Hardware
CPU Opteron 2218 (2.6GHz) x2
Opteron 240 (1.4GHz)
Memory 32GB 2GB
Storage 7200rpm SATA HDD SSD (Intel X25-M)
Summary
Apr. 23 2009 Running Realtime Stats Service on MySQL 36
SummarySummary
Use UDFs for optimizationSometime it is easier to scale UP
esp. when you can estimate your data growth
Use SSD for FTSBaidu (China's leading search engine) uses SSD
Most of the things introduced are OSSWe plan to open-source our URL compression table as well
Apr. 23 2009 Running Realtime Stats Service on MySQL 37
We are Looking for...We are Looking for...
If you are interested in localizing Pathtraq to your country, please contact uswe do not have resources outside of Japanto translate the web interfaceto ask people to install our browser extension
to follow local regulations, etc.
Apr. 23 2009 Running Realtime Stats Service on MySQL 38
Thank you for listening
tinyurl.com/kazuho
Apr. 23 2009 Running Realtime Stats Service on MySQL 39