+
MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads
Shunsuke Nakamura (Tokyo Institute of Technology, NHN Japan), Kazuyuki Shudo (Tokyo Institute of Technology)
Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)
+Cloud Storage
NoSQL, Key-Value Storage (KVS), Document-Oriented DB, GraphDB
Examples: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Giraph, Oracle Coherence, and others (> 100 products)
Characteristics: “limited functions, massive volume, high performance”
Data access only by primary key
No luxury features such as joins or global transactions
Scalable to much larger data volumes and numbers of nodes
Distributed data stores that process large amounts of data
+Design policies of cloud storages
data model – key/value, multi-dimensional map, document, or graph
performance - write vs. read
latency vs. persistence: latency – memory and disk utilization; persistence – synchronous vs. asynchronous (snapshot)
replication – synchronous vs. asynchronous
consistency between replicas – strong vs. weak
data partitioning – row vs. column
distribution – master/slave vs. decentralized
There are many trade-offs.
+MyCassandra focuses on performance trade-off
data model – key/value vs. multi-dimensional map vs. document vs. graph
performance - write vs. read
latency vs. persistence: latency – memory and disk utilization; persistence – synchronous vs. asynchronous (snapshot)
replication – synchronous vs. asynchronous
consistency between replicas – strong vs. weak
data partitioning – row vs. column
distribution – master/slave vs. decentralized
+Performance trade-off: write-optimized vs. read-optimized
A cloud storage with persistence is designed to optimize either the write or the read workload.
The storage engine determines which workload a cloud storage handles efficiently.

                  write-optimized                read-optimized
Examples          Bigtable, Cassandra, HBase     MySQL, Yahoo! Sherpa
Indexing          Log-Structured Merge Tree      B-Trees [R. Bayer '70]
                  [P. O'Neil '96]
Write to disk     sequential append              random reads and writes
Read from disk    random reads + merge           single random read
Storage engine    Bigtable clone                 MySQL
+Performance trade-off: write-optimized vs. read-optimized (MyCassandra)
[Figure: write latency distribution for a write-heavy workload (Yahoo! Cloud Serving Benchmark, SOCC '10), comparing write-optimized and read-optimized configurations; lower is better.]
+Performance trade-off: write-optimized vs. read-optimized (MyCassandra)
[Figure: read latency distribution for a read-heavy workload (Yahoo! Cloud Serving Benchmark, SOCC '10), comparing write-optimized and read-optimized configurations; lower is better.]
+Research overview
Contribution: A technique to build a cloud storage performing well with both read and write workloads
Steps:
1. MyCassandra: Apache Cassandra extended with pluggable storage engines
2. MyCassandra Cluster: a heterogeneous cluster of nodes with different storage engines
[Figure: step 1, MyCassandra – a read-optimized or write-optimized storage engine is selected per node; step 2, MyCassandra Cluster – the nodes are combined into a read- and write-optimized cluster.]
+Apache Cassandra
Features: Scalability up to hundreds of servers across multiple racks/datacenters
High availability without SPOF by adopting a decentralized architecture
Write-optimized
Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)
Open-sourced by Facebook in 2008
A top-level project of the Apache Software Foundation
[Figure: clustering across multiple racks/DCs (dc1, dc2, dc3); replication strategy based on region.]
+Apache Cassandra
Consistent Hashing (a decentralized algorithm): assigns identifiers to both nodes and data on a circular ID space.
A decentralized cloud storage without SPOF
[Figure: circular ID space with nodes A, F, N, V, Z (A–Z: hash values). hash(key) = Q places the key's values at the next node clockwise from Q, the primary; with the number of replicas set to 3, the following two nodes serve as secondary 1 and secondary 2.]
Roles of each node: proxy serving clients; primary and secondary data nodes
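The ring structure above can be sketched in a few lines of Python. This is a minimal illustration of consistent hashing with successor-based replica placement; MD5 and the node names are stand-ins, not Cassandra's actual partitioner:

```python
import hashlib
from bisect import bisect_right

def ring_id(name: str) -> int:
    # Hash a node name or key onto the circular ID space.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hashing ring: a key is stored on the first node
    clockwise from hash(key) (the primary), plus the next replicas-1
    successor nodes (the secondaries)."""
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((ring_id(n), n) for n in nodes)

    def replica_nodes(self, key):
        ids = [i for i, _ in self.ring]
        start = bisect_right(ids, ring_id(key)) % len(self.ring)
        return [self.ring[(start + k) % len(self.ring)][1]
                for k in range(min(self.replicas, len(self.ring)))]
```

Because placement depends only on hashes, any node can act as the proxy and compute the same primary/secondary set for a key.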
+Apache Cassandra
O(1) fast write operation: write each update to disk sequentially
- Fast because of no random disk I/O
- Always writable because of no write lock
Write-optimized storage engine, a Bigtable clone
[Figure: write path. Updates such as <k1, v1>, <k1, v2> are synchronously appended to the CommitLog on disk and applied to the Memtable in memory (merging into <k1, obj (v1+v2)>); the Memtable is asynchronously flushed to immutable SSTables (SSTable 1–3) on disk. Only sequential writes.]
1. Append the update to the CommitLog for persistence
2. Update the Memtable, a map in memory, for quick reading
3. Acknowledge the client
4. Asynchronously flush the Memtable to an SSTable
5. Delete flushed data from the CommitLog and Memtable
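The five steps above can be sketched as a toy model, with plain Python containers standing in for the on-disk CommitLog and SSTables (the flush threshold and class name are illustrative):

```python
import json

class WriteOptimizedStore:
    """Sketch of the Bigtable-style write path described above:
    append to a commit log, update an in-memory Memtable, and flush
    a sorted run (an SSTable) once the Memtable grows large."""
    def __init__(self, flush_threshold=2):
        self.commit_log = []      # stand-in for the on-disk CommitLog
        self.memtable = {}
        self.sstables = []        # list of immutable sorted maps
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append(json.dumps({key: value}))  # 1. sequential append
        self.memtable[key] = value                        # 2. update Memtable
        # 3. the client would be acknowledged here; 4-5. flush (async in reality)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()   # flushed data no longer needs the log
```

Every disk touch on the write path is an append, which is why the operation is fast and never blocks on a lock.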
+Apache Cassandra
Slow read operation: read data from the Memtable and multiple SSTables, and merge them
- Slow because of multiple random disk I/Os
Write-optimized storage engine, a Bigtable clone
[Figure: read path. A read for k1 must check the Memtable in memory and merge candidate values (<k1,obj1>, <k1,obj2>, <k1,obj3>) from multiple SSTables on disk, incurring multiple random I/Os.]
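A minimal sketch of this merge-on-read behavior, using plain dicts for the Memtable and SSTables (a simplification: real SSTables carry per-column timestamps rather than whole-row recency):

```python
def merged_read(key, memtable, sstables):
    """Sketch of the read path above: the freshest value may live in the
    Memtable or in any SSTable, so a read checks the Memtable first and
    then scans SSTables from newest to oldest (one random I/O each)."""
    if key in memtable:
        return memtable[key]
    for sstable in reversed(sstables):   # sstables are ordered oldest-first
        if key in sstable:
            return sstable[key]
    return None
```

The cost grows with the number of SSTables that must be probed, which is exactly why this engine is read-pessimal compared with a B-tree.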
+ Performance of original Cassandra
YCSB results show:
Average: write is 9× as fast as read (write avg. 0.69 ms vs. read avg. 6.16 ms).
99.9th percentile: write is 43.5× as fast as read (write 2.0 ms vs. read 86.9 ms).
Write performance is much higher.
[Figure: latency histograms (number of operations vs. latency in ms) for read and write operations; lower is better.]
+1. Storage Engine Support
[Figure: step 1, MyCassandra – a read-optimized or write-optimized storage engine is selected per node.]
+MyCassandra: A modular cloud storage
Storage engine feature inspired by MySQL: an independent, pluggable component that performs disk I/O
A cloud storage can be made either write-optimized or read-optimized by selecting the storage engine
Keeps Cassandra's original distribution architecture and data model
[Figure: MySQL offers selectable storage engines (InnoDB, MyISAM, Memory, …); MyCassandra combines Cassandra's decentralized architecture (Consistent Hashing, Gossip Protocol) with selectable storage engines (Bigtable, MySQL, Redis, …).]
MyCassandra implementation: Cassandra's original distribution architecture is kept, and a Storage Engine Interface is introduced; each storage engine implements this interface.
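A rough sketch of what such a pluggable interface could look like. The method names are illustrative, not MyCassandra's actual Storage Engine Interface (which is Java), but the shape is the same: the distribution layer talks to any engine through one contract.

```python
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """Hypothetical pluggable storage-engine contract in the spirit of
    MyCassandra's Storage Engine Interface: the node's distribution layer
    calls only these methods, regardless of the backing engine."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class MemoryEngine(StorageEngine):
    """In-memory engine, analogous to the Redis-backed node type."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)
```

A Bigtable-style or MySQL-backed engine would implement the same two methods, so a node's engine can be swapped without touching the consistent-hashing or gossip code.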
Performance of each storage engine
Storage engines:
Bigtable: write-optimized (original Cassandra 0.7.5)
MySQL: read-optimized (MySQL 6.0 with InnoDB, JDBC API, stored procedures)
Redis: in-memory KVS (Redis 2.2.8)
[Figure: latency of each storage engine by workload (relative differences of ×11.79 and ×9.87 shown). Setup: 6 nodes with Crucial SSDs, 6 GB of memory allocated out of 8 GB, 1 KB × 36 million records.]
+2.MyCassandra Cluster
read and write-optimized
2. Heterogeneous cluster of different storage engines
Replicate data on nodes with different storage engines
Route each query to the nodes that process it efficiently:
synchronously to nodes that process it quickly,
asynchronously to nodes that process it slowly
→ Exploit each node's advantage
Furthermore, maintain consistency between replicas to the same degree as the original Cassandra
Quorum Protocol: (write agreements) + (read agreements) > (number of replicas)
⇒ guarantees retrieval of the latest data
Consequence: at least one node must process both read and write queries synchronously and quickly
→ In-memory nodes play this role.
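The quorum condition can be stated directly in code; `n`, `w`, and `r` are the number of replicas and the write/read agreement counts from the formula above:

```python
def quorum_guarantees_latest(n: int, w: int, r: int) -> bool:
    """Quorum protocol condition: with n replicas, w write agreements,
    and r read agreements, every read set overlaps every write set
    (and so sees the latest value) if and only if w + r > n."""
    return w + r > n
```

With the configuration used later in this talk (N = 3, W = R = 2), any two readers and any two writers share at least one node, and the in-memory node is the one expected to answer both quickly.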
Basic idea (W: write-optimized, R: read-optimized, RW: in-memory)
[Figure: a write query is routed synchronously to the W and RW nodes and asynchronously to the R node; a read query is routed synchronously to the R and RW nodes and asynchronously to the W node.]
Cluster design (W: write-optimized, R: read-optimized, RW: in-memory)
[Figure: cluster configuration (N=3); W, R, and RW nodes exchange gossip, and responsible nodes are chosen per key.]
Combine nodes with different storage engines: write-optimized (W), read-optimized (R), in-memory (RW)
Disseminate each node's storage engine type: the type is attached to gossip messages
Place replicas on nodes with different storage engines; the proxy (any node receiving the request) selects the storing nodes:
1. The primary node is determined from the queried key
2. N−1 secondary nodes with different storage engines are chosen
Multiple nodes share a single server for load balancing
[Figure: a proxy (any node) routes a query to the primary and two secondary nodes, each with a different storage engine type.]
Process for a write access (W: write-optimized, R: read-optimized, RW: in-memory)
Quorum parameters: N = 3, W = R = 2
Number of replicas W:RW:R = 1:1:1
[Figure: a client issues a write for a single record to a proxy; the proxy routes it to the nodes storing the record, waits for two ACKs (typically from the W and RW nodes), returns, and completes the write to the R node asynchronously.]
1) A proxy receives a write query from a client and routes it to the nodes storing the record.
2) The proxy waits for ACKs. W and RW nodes usually reply quickly.
3-a) If writing succeeds and the proxy receives the ACKs, it returns a success message.
3-b) If a data node fails to write, the proxy waits for ACKs from the remaining nodes, including R nodes, and then returns a success message.
4) After returning, the proxy asynchronously waits for ACKs from the remaining nodes.
Write latency: max (W, RW)
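This routing policy can be sketched as below; the per-node latencies are illustrative stand-ins (not the paper's measurements), chosen only so that W and RW answer before R:

```python
def route_write(replicas, quorum_w=2):
    """Sketch of the write routing above: the proxy sends the write to all
    replicas but waits synchronously only for the first `quorum_w` ACKs;
    the remaining, slower nodes are handled asynchronously.
    `replicas` is a list of (engine_type, simulated_latency_ms) pairs."""
    acks = sorted(replicas, key=lambda node: node[1])  # fastest ACKs first
    sync, deferred = acks[:quorum_w], acks[quorum_w:]
    latency = sync[-1][1]  # the proxy returns on the quorum_w-th ACK
    return sync, deferred, latency
```

Because the W and RW engines are the fast ones for writes, the synchronous cost is max(W, RW) while the read-optimized node is absorbed off the critical path.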
Process for a read access (W: write-optimized, R: read-optimized, RW: in-memory)
[Figure: a client issues a read for a single record to a proxy; the proxy routes it to the nodes storing the record, waits for replies from the fast R and RW nodes, checks consistency, and returns the result; the remaining nodes are checked asynchronously.]
1) A proxy receives a read query and routes it to the storing nodes.
2) The proxy waits for ACKs. R and RW nodes reply quickly.
3-a) If the returned values are consistent, the proxy returns the value.
3-b) If the values are mismatched, the proxy waits for consistent values from the remaining nodes, including W nodes.
4) After returning, the proxy waits for replies from the remaining nodes. If it notices inconsistent values, it asynchronously updates them to the consistent one (Cassandra's ReadRepair feature does this).
Read latency: max (R, RW)
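The consistency check and repair step can be sketched as follows. The function name and timestamp scheme are illustrative; the repair mechanism it models is Cassandra's ReadRepair as described above:

```python
def read_with_repair(values_by_node):
    """Sketch of the read-side consistency check: given each replica's
    reply as (value, timestamp), return the newest value and the list of
    stale nodes that should be asynchronously overwritten (read repair)."""
    latest = max(values_by_node.values(), key=lambda vt: vt[1])
    stale = [node for node, vt in values_by_node.items() if vt != latest]
    return latest[0], stale
```

In the common case the R and RW replies already match and `stale` is empty; only on a mismatch does the W node's reply matter synchronously.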
Quorum parameters: N = 3, W = R = 2
Number of replicas W:RW:R = 1:1:1
+Performance Evaluation
Targets:
MyCassandra Cluster: 3 different nodes/server × 6 servers
Cassandra: 1 node/server × 6 servers
Quorum parameters: N = 3, W = R = 2
Storage engines: Bigtable (W), MySQL/InnoDB (R), Redis (RW)
Yahoo! Cloud Serving Benchmark (YCSB) [SOCC '10]:
1. Load data (1 KB records, 10 × 100-byte columns) from a YCSB client
2. Warm up
3. Run the benchmark and measure response times from a client
Demonstrate that a heterogeneous cluster performs well with both read- and write-heavy workloads
+YCSB workloads
Workload       Application example   Operation ratio          Record selection
Write-Only     Log                   Read: 0%, Write: 100%    Zipfian
Write-Heavy    Session store         Read: 50%, Write: 50%    Zipfian
Read-Heavy     Photo tagging         Read: 95%, Write: 5%     Zipfian
Read-Only      Cache                 Read: 100%, Write: 0%    Zipfian

Zipfian distribution: the access frequency of each datum is determined by its popularity, not by its freshness.
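YCSB-style Zipfian record selection can be approximated with a weighted choice; the `skew` value of 0.99 mirrors YCSB's commonly cited default Zipfian constant (an assumption here, not stated on the slide):

```python
import random

def zipfian_choice(items, skew=0.99, rng=random):
    """Sketch of Zipfian record selection: the item of rank i (0-based)
    is drawn with probability proportional to 1 / (i + 1)**skew, so
    popular records are requested far more often than tail records,
    regardless of how recently they were written."""
    weights = [1.0 / (i + 1) ** skew for i in range(len(items))]
    return rng.choices(items, weights=weights, k=1)[0]
```

This popularity-based skew is what makes caching effective for the read-heavy workloads and keeps the hot set small for the in-memory nodes.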
[Figure: average write and read latency (ms) of Cassandra vs. MyCassandra Cluster for the Write-Only (write:100%), Write-Heavy (write:50%/read:50%), Read-Heavy (write:5%/read:95%), and Read-Only (read:100%) workloads; lower is better. MyCassandra Cluster's write latency is 42.5%, 59.5%, and 69.5% higher (at most +0.57 ms), while its read latency is 88.8%, 90.4%, and 83.3% lower (at most −26.5 ms), with the largest reduction, 90.4%, in the Read-Only workload. The cluster combining MySQL and Redis engines performs well.]
[Figure: throughput (queries/sec) for 40 clients, Cassandra vs. MyCassandra Cluster, across the Write-Only, Write-Heavy, Read-Heavy, and Read-Only workloads ([write:read] = [100:0], [50:50], [5:95], [0:100]); higher is better. MyCassandra Cluster's throughput relative to Cassandra spans ×0.87, ×2.16, ×4.07, and ×11.00 across the workloads.]
• 11.0 times as high as Cassandra in the Read-Only workload
• Write performance is comparable with Cassandra's
+Conclusion
A cloud storage supporting both write-heavy and read-heavy workloads by combining nodes with different storage engines.
MyCassandra Cluster achieved better throughput than the original Cassandra on read-heavy workloads.
With a read-heavy workload:
Read latency: up to 90.4% lower
Throughput: up to 11.0 times higher
+Related Work
Indexing algorithm whose goals include achieving both write and read performance FD-Tree: Tree Indexing on Flash Disks, VLDB ’10 bLSM: A General Purpose Log Structured Merge Tree, SIGMOD ‘12 Fractal-Tree: It’s implemented in TokuDB (MySQL storage engine)
Modular data stores: MySQL Anvil, SOSP ’09 Cloudy, VLDB ’10 Dynamo, SOSP ’07 Fractured Mirrors:
MyCassandra, SYSTOR ‘12: read vs. write
+Discussion 1. Slightly higher write latency
Cassandra: a write can be acknowledged by any nodes among the N replicas.
MyCassandra Cluster: a write must be acknowledged by the specified W and RW nodes.
However, this cost is well worth the improvement in read performance.
The cause is load balancing.
[Figure: in Cassandra, synchronous write and read operations are distributed equally across nodes; in MyCassandra Cluster, synchronous operations are fixed to particular node types (W and RW for writes, R and RW for reads).]
+Discussion 2. In-memory nodes
Q. Memory overflow?
A. The in-memory node acts as an LRU-like cache. Swapped-out data is recovered from the other, persistent nodes by read repair.
Q. Fault tolerance?
A. 1) Write to an alternative node; when the failed node recovers, inconsistencies are resolved using values from that node.
2) Asynchronous snapshots (a Redis feature)
Q. What if all nodes are in-memory?
A. In that case the cluster's capacity is limited by its memory capacity.
+Open source release