+
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads
Shunsuke Nakamura (Tokyo Institute of Technology, NHN Japan), Kazuyuki Shudo (Tokyo Institute of Technology)
Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)
+Cloud Storage
NoSQL, Key-Value Storage (KVS), Document-Oriented DB, GraphDB
Examples: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Giraph, Oracle Coherence, and others (> 100 products)
Characteristics: “limited functions, massive volume, high performance”
Data access only by primary key
No luxury features such as joins or global transactions
Scalable to much larger data volumes and numbers of nodes
Distributed data stores that process large amounts of data
+Design policies of cloud storages
data model – key/value, multi-dimensional map, document, or graph
performance - write vs. read
latency vs. persistence: latency – memory and disk utilization; persistence – synchronous vs. asynchronous (snapshot)
replication – synchronous vs. asynchronous
consistency between replicas – strong vs. weak
data partitioning – row vs. column
distribution – master/slave vs. decentralized
There are many trade-offs.
+MyCassandra focuses on performance trade-off
data model – key/value vs. multi-dimensional map vs. document vs. graph
performance - write vs. read
latency vs. persistence: latency – memory and disk utilization; persistence – synchronous vs. asynchronous (snapshot)
replication – synchronous vs. asynchronous
consistency between replicas – strong vs. weak
data partitioning – row vs. column
distribution – master/slave vs. decentralized
+Performance trade-off: write-optimized vs. read-optimized
A cloud storage with persistence is designed to optimize either the write or the read workload.
The storage engine determines which workload a cloud storage handles efficiently.

                  write-optimized                read-optimized
Examples          Bigtable, Cassandra, HBase     MySQL, Yahoo! Sherpa
Indexing          Log-Structured Merge Tree      B-Trees [R. Bayer '70]
                  [P. O'Neil '96]
Write to disk     sequential append              random reads and writes
Read from disk    random reads + merge           single random read
Storage engine    Bigtable clone                 MySQL
+Performance trade-off: write-optimized vs. read-optimized (MyCassandra)
[Figure: write latency distribution for a write-heavy workload (Yahoo! Cloud Serving Benchmark, SOCC '10), comparing write-optimized and read-optimized configurations; lower is better.]
+Performance trade-off: write-optimized vs. read-optimized (MyCassandra)
[Figure: read latency distribution for a read-heavy workload (Yahoo! Cloud Serving Benchmark, SOCC '10), comparing write-optimized and read-optimized configurations; lower is better.]
+Research overview
Contribution: A technique to build a cloud storage performing well with both read and write workloads
Steps:
1. MyCassandra: Apache Cassandra extended with pluggable storage engines
2. MyCassandra Cluster: a heterogeneous cluster of nodes with different storage engines
[Figure: step 1, MyCassandra – a read-optimized or write-optimized storage engine is selected per node; step 2, MyCassandra Cluster – the nodes are combined into a read- and write-optimized cluster.]
+Apache Cassandra
Features: Scalability up to hundreds of servers across multiple racks/datacenters
High availability without SPOF by adopting a decentralized architecture
Write-optimized
Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)
Open-sourced by Facebook in 2008
A top-level project of the Apache Software Foundation
[Figure: clustering across multiple racks/DCs (dc1, dc2, dc3); replication strategy based on region.]
+Apache Cassandra
Consistent Hashing (a decentralized algorithm): assigns identifiers to both nodes and data on a circular ID space.
A decentralized cloud storage without SPOF
[Figure: circular ID space with nodes A, F, N, V, Z (A–Z: hash values). hash(key) = Q places the key's values at the next node clockwise from Q, the primary; with the number of replicas set to 3, the following two nodes serve as secondary 1 and secondary 2.]
Roles of each node: proxy serving clients; primary and secondary data nodes
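The ring structure above can be sketched in a few lines of Python. This is a minimal illustration of consistent hashing with successor-based replica placement; MD5 and the node names are stand-ins, not Cassandra's actual partitioner:

```python
import hashlib
from bisect import bisect_right

def ring_id(name: str) -> int:
    # Hash a node name or key onto the circular ID space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hashing ring: a key is stored on the first node
    clockwise from hash(key) (the primary), plus the next replicas-1
    successor nodes (the secondaries)."""
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((ring_id(n), n) for n in nodes)

    def replica_nodes(self, key):
        ids = [i for i, _ in self.ring]
        start = bisect_right(ids, ring_id(key)) % len(self.ring)
        return [self.ring[(start + k) % len(self.ring)][1]
                for k in range(min(self.replicas, len(self.ring)))]
```

Because placement depends only on hashes, any node can act as the proxy and compute the same primary/secondary set for a key.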
+Apache Cassandra
O(1) fast write operation: write each update to disk sequentially
- Fast because of no random disk I/O
- Always writable because of no write lock
Write-optimized storage engine, a Bigtable clone
[Figure: write path. Updates such as <k1, v1>, <k1, v2> are synchronously appended to the CommitLog on disk and applied to the Memtable in memory (merging into <k1, obj (v1+v2)>); the Memtable is asynchronously flushed to immutable SSTables (SSTable 1–3) on disk. Only sequential writes.]
1. Append the update to the CommitLog for persistence
2. Update the Memtable, a map in memory, for quick reading
3. Acknowledge the client
4. Asynchronously flush the Memtable to an SSTable
5. Delete flushed data from the CommitLog and Memtable
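The five steps above can be sketched as a toy model, with plain Python containers standing in for the on-disk CommitLog and SSTables (the flush threshold and class name are illustrative):

```python
import json

class WriteOptimizedStore:
    """Sketch of the Bigtable-style write path described above:
    append to a commit log, update an in-memory Memtable, and flush
    a sorted run (an SSTable) once the Memtable grows large."""
    def __init__(self, flush_threshold=2):
        self.commit_log = []      # stand-in for the on-disk CommitLog
        self.memtable = {}
        self.sstables = []        # list of immutable sorted maps
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append(json.dumps({key: value}))  # 1. sequential append
        self.memtable[key] = value                        # 2. update Memtable
        # 3. the client would be acknowledged here; 4-5. flush (async in reality)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()   # flushed data no longer needs the log
```

Every disk touch on the write path is an append, which is why the operation is fast and never blocks on a lock.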
+Apache Cassandra
Slow read operation: read data from the Memtable and multiple SSTables, and merge them
- Slow because of multiple random disk I/Os
Write-optimized storage engine, a Bigtable clone
[Figure: read path. A read for k1 must check the Memtable in memory and merge candidate values (<k1,obj1>, <k1,obj2>, <k1,obj3>) from multiple SSTables on disk, incurring multiple random I/Os.]
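A minimal sketch of this merge-on-read behavior, using plain dicts for the Memtable and SSTables (a simplification: real SSTables carry per-column timestamps rather than whole-row recency):

```python
def merged_read(key, memtable, sstables):
    """Sketch of the read path above: the freshest value may live in the
    Memtable or in any SSTable, so a read checks the Memtable first and
    then scans SSTables from newest to oldest (one random I/O each)."""
    if key in memtable:
        return memtable[key]
    for sstable in reversed(sstables):   # sstables are ordered oldest-first
        if key in sstable:
            return sstable[key]
    return None
```

The cost grows with the number of SSTables that must be probed, which is exactly why this engine is read-pessimal compared with a B-tree.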
+ Performance of original Cassandra
YCSB results show:
Average: write is 9× as fast as read (write avg. 0.69 ms vs. read avg. 6.16 ms).
99.9th percentile: write is 43.5× as fast as read (write 2.0 ms vs. read 86.9 ms).
Write performance is much higher.
[Figure: latency histograms (number of operations vs. latency in ms) for read and write operations; lower is better.]
+1. Storage Engine Support
[Figure: step 1, MyCassandra – a read-optimized or write-optimized storage engine is selected per node.]
+MyCassandra: A modular cloud storage
Storage engine feature inspired by MySQL: an independent, pluggable component that performs disk I/O
A cloud storage can be made either write-optimized or read-optimized by selecting the storage engine
Keeps Cassandra's original distribution architecture and data model
[Figure: MySQL offers selectable storage engines (InnoDB, MyISAM, Memory, …); MyCassandra combines Cassandra's decentralized architecture (Consistent Hashing, Gossip Protocol) with selectable storage engines (Bigtable, MySQL, Redis, …).]
MyCassandra implementation: Cassandra's original distribution architecture is kept, and a Storage Engine Interface is introduced; each storage engine implements this interface.
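A rough sketch of what such a pluggable interface could look like. The method names are illustrative, not MyCassandra's actual Storage Engine Interface (which is Java), but the shape is the same: the distribution layer talks to any engine through one contract.

```python
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """Hypothetical pluggable storage-engine contract in the spirit of
    MyCassandra's Storage Engine Interface: the node's distribution layer
    calls only these methods, regardless of the backing engine."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class MemoryEngine(StorageEngine):
    """In-memory engine, analogous to the Redis-backed node type."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)
```

A Bigtable-style or MySQL-backed engine would implement the same two methods, so a node's engine can be swapped without touching the consistent-hashing or gossip code.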
Performance of each storage engine
Storage engines:
Bigtable: write-optimized (original Cassandra 0.7.5)
MySQL: read-optimized (MySQL 6.0 with InnoDB, JDBC API, stored procedures)
Redis: in-memory KVS (Redis 2.2.8)
[Figure: latency of each storage engine by workload (relative differences of ×11.79 and ×9.87 shown). Setup: 6 nodes with Crucial SSDs, 6 GB of memory allocated out of 8 GB, 1 KB × 36 million records.]
+2.MyCassandra Cluster
read and write-optimized
2. Heterogeneous cluster of different storage engines
Replicate data on nodes with different storage engines
Route each query to the nodes that process it efficiently:
synchronously to nodes that process it quickly,
asynchronously to nodes that process it slowly
→ Exploit each node's advantage
Furthermore, maintain consistency between replicas to the same degree as the original Cassandra
Quorum Protocol: (write agreements) + (read agreements) > (number of replicas)
⇒ guarantees retrieval of the latest data
Consequence: at least one node must process both read and write queries synchronously and quickly
→ In-memory nodes play this role.
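The quorum condition can be stated directly in code; `n`, `w`, and `r` are the number of replicas and the write/read agreement counts from the formula above:

```python
def quorum_guarantees_latest(n: int, w: int, r: int) -> bool:
    """Quorum protocol condition: with n replicas, w write agreements,
    and r read agreements, every read set overlaps every write set
    (and so sees the latest value) if and only if w + r > n."""
    return w + r > n
```

With the configuration used later in this talk (N = 3, W = R = 2), any two readers and any two writers share at least one node, and the in-memory node is the one expected to answer both quickly.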
Basic idea (W: write-optimized, R: read-optimized, RW: in-memory)
[Figure: a write query is routed synchronously to the W and RW nodes and asynchronously to the R node; a read query is routed synchronously to the R and RW nodes and asynchronously to the W node.]
Cluster design (W: write-optimized, R: read-optimized, RW: in-memory)
[Figure: cluster configuration (N=3); W, R, and RW nodes exchange gossip, and responsible nodes are chosen per key.]
Combine nodes with different storage engines: write-optimized (W), read-optimized (R), in-memory (RW)
Disseminate each node's storage engine type: the type is attached to gossip messages
Place replicas on nodes with different storage engines; the proxy (any node receiving the request) selects the storing nodes:
1. The primary node is determined from the queried key
2. N−1 secondary nodes with different storage engines are chosen
Multiple nodes share a single server for load balancing
[Figure: a proxy (any node) routes a query to the primary and two secondary nodes, each with a different storage engine type.]
Process for a write access (W: write-optimized, R: read-optimized, RW: in-memory)
Quorum parameters: N = 3, W = R = 2
Number of replicas W:RW:R = 1:1:1
[Figure: a client issues a write for a single record to a proxy; the proxy routes it to the nodes storing the record, waits for two ACKs (typically from the W and RW nodes), returns, and completes the write to the R node asynchronously.]
1) A proxy receives a write query from a client and routes it to the nodes storing the record.
2) The proxy waits for ACKs. W and RW nodes usually reply quickly.
3-a) If writing succeeds and the proxy receives the ACKs, it returns a success message.
3-b) If a data node fails to write, the proxy waits for ACKs from the remaining nodes, including R nodes, and then returns a success message.
4) After returning, the proxy asynchronously waits for ACKs from the remaining nodes.
Write latency: max (W, RW)
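This routing policy can be sketched as below; the per-node latencies are illustrative stand-ins (not the paper's measurements), chosen only so that W and RW answer before R:

```python
def route_write(replicas, quorum_w=2):
    """Sketch of the write routing above: the proxy sends the write to all
    replicas but waits synchronously only for the first `quorum_w` ACKs;
    the remaining, slower nodes are handled asynchronously.
    `replicas` is a list of (engine_type, simulated_latency_ms) pairs."""
    acks = sorted(replicas, key=lambda node: node[1])  # fastest ACKs first
    sync, deferred = acks[:quorum_w], acks[quorum_w:]
    latency = sync[-1][1]  # the proxy returns on the quorum_w-th ACK
    return sync, deferred, latency
```

Because the W and RW engines are the fast ones for writes, the synchronous cost is max(W, RW) while the read-optimized node is absorbed off the critical path.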
Process for a read access (W: write-optimized, R: read-optimized, RW: in-memory)
[Figure: a client issues a read for a single record to a proxy; the proxy routes it to the nodes storing the record, waits for replies from the fast R and RW nodes, checks consistency, and returns the result; the remaining nodes are checked asynchronously.]
1) A proxy receives a read query and routes it to the storing nodes.
2) The proxy waits for ACKs. R and RW nodes reply quickly.
3-a) If the returned values are consistent, the proxy returns the value.
3-b) If the values are mismatched, the proxy waits for consistent values from the remaining nodes, including W nodes.
4) After returning, the proxy waits for replies from the remaining nodes. If it notices inconsistent values, it asynchronously updates them to the consistent one (Cassandra's ReadRepair feature does this).
Read latency: max (R, RW)
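The consistency check and repair step can be sketched as follows. The function name and timestamp scheme are illustrative; the repair mechanism it models is Cassandra's ReadRepair as described above:

```python
def read_with_repair(values_by_node):
    """Sketch of the read-side consistency check: given each replica's
    reply as (value, timestamp), return the newest value and the list of
    stale nodes that should be asynchronously overwritten (read repair)."""
    latest = max(values_by_node.values(), key=lambda vt: vt[1])
    stale = [node for node, vt in values_by_node.items() if vt != latest]
    return latest[0], stale
```

In the common case the R and RW replies already match and `stale` is empty; only on a mismatch does the W node's reply matter synchronously.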
Quorum parameters: N = 3, W = R = 2
Number of replicas W:RW:R = 1:1:1
+Performance Evaluation
Targets:
MyCassandra Cluster: 3 different nodes/server × 6 servers
Cassandra: 1 node/server × 6 servers
Quorum parameters: N = 3, W = R = 2
Storage engines: Bigtable (W), MySQL/InnoDB (R), Redis (RW)
Yahoo! Cloud Serving Benchmark (YCSB) [SOCC '10]:
1. Load data (1 KB records, 10 × 100-byte columns) from a YCSB client
2. Warm up
3. Run the benchmark and measure response times from a client
Demonstrate that a heterogeneous cluster performs well with both read- and write-heavy workloads
+YCSB workloads
Workload       Application example   Operation ratio          Record selection
Write-Only     Log                   Read: 0%, Write: 100%    Zipfian
Write-Heavy    Session store         Read: 50%, Write: 50%    Zipfian
Read-Heavy     Photo tagging         Read: 95%, Write: 5%     Zipfian
Read-Only      Cache                 Read: 100%, Write: 0%    Zipfian

Zipfian distribution: the access frequency of each datum is determined by its popularity, not by its freshness.
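YCSB-style Zipfian record selection can be approximated with a weighted choice; the `skew` value of 0.99 mirrors YCSB's commonly cited default Zipfian constant (an assumption here, not stated on the slide):

```python
import random

def zipfian_choice(items, skew=0.99, rng=random):
    """Sketch of Zipfian record selection: the item of rank i (0-based)
    is drawn with probability proportional to 1 / (i + 1)**skew, so
    popular records are requested far more often than tail records,
    regardless of how recently they were written."""
    weights = [1.0 / (i + 1) ** skew for i in range(len(items))]
    return rng.choices(items, weights=weights, k=1)[0]
```

This popularity-based skew is what makes caching effective for the read-heavy workloads and keeps the hot set small for the in-memory nodes.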
[Figure: average write and read latency (ms) of Cassandra vs. MyCassandra Cluster for the Write-Only (write:100%), Write-Heavy (write:50%/read:50%), Read-Heavy (write:5%/read:95%), and Read-Only (read:100%) workloads; lower is better. MyCassandra Cluster's write latency is 42.5%, 59.5%, and 69.5% higher (at most +0.57 ms), while its read latency is 88.8%, 90.4%, and 83.3% lower (at most −26.5 ms), with the largest reduction, 90.4%, in the Read-Only workload. The cluster combining MySQL and Redis engines performs well.]
[Figure: throughput (queries/sec) for 40 clients, Cassandra vs. MyCassandra Cluster, across the Write-Only, Write-Heavy, Read-Heavy, and Read-Only workloads ([write:read] = [100:0], [50:50], [5:95], [0:100]); higher is better. MyCassandra Cluster's throughput relative to Cassandra spans ×0.87, ×2.16, ×4.07, and ×11.00 across the workloads.]
• 11.0 times as high as Cassandra in the Read-Only workload
• Write performance is comparable with Cassandra's
+Conclusion
A cloud storage supporting both write-heavy and read-heavy workloads by combining nodes with different storage engines.
MyCassandra Cluster achieved better throughput than the original Cassandra on read-heavy workloads.
With a read-heavy workload:
Read latency: up to 90.4% lower
Throughput: up to 11.0 times higher
+Related Work
Indexing algorithm whose goals include achieving both write and read performance FD-Tree: Tree Indexing on Flash Disks, VLDB ’10 bLSM: A General Purpose Log Structured Merge Tree, SIGMOD ‘12 Fractal-Tree: It’s implemented in TokuDB (MySQL storage engine)
Modular data stores: MySQL Anvil, SOSP ’09 Cloudy, VLDB ’10 Dynamo, SOSP ’07 Fractured Mirrors:
MyCassandra, SYSTOR ‘12: read vs. write
+Discussion 1. Slightly higher write latency
Cassandra: a write can be acknowledged by any nodes among the N replicas.
MyCassandra Cluster: a write must be acknowledged by the specified W and RW nodes.
However, this cost is well worth the improvement in read performance.
The cause is load balancing.
[Figure: in Cassandra, synchronous write and read operations are distributed equally across nodes; in MyCassandra Cluster, synchronous operations are fixed to particular node types (W and RW for writes, R and RW for reads).]
+Discussion 2. In-memory nodes
Q. Memory overflow?
A. The in-memory node acts as an LRU-like cache. Swapped-out data is recovered from the other, persistent nodes by read repair.
Q. Fault tolerance?
A. 1) Write to an alternative node; when the failed node recovers, inconsistencies are resolved using values from that node.
2) Asynchronous snapshots (a Redis feature)
Q. What if all nodes are in-memory?
A. In that case the cluster's capacity is limited by its memory capacity.
+Open source release