Post on 17-Feb-2017
1
Cassandra 2.2 and 3.0 new features
DuyHai DOAN Apache Cassandra Technical Evangelist
#VoxxedBerlin @doanduyhai
Datastax
2
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 450+ employees
• Headquarter in San Francisco Bay area
• EU headquarter in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
Materialized Views (MV)• Why ? • Detailed Impl • Gotchas
Why Materialized Views ?• Relieve the pain of manual denormalization
CREATE TABLE user( id int PRIMARY KEY, country text, …);CREATE TABLE user_by_country( country text, id int, …, PRIMARY KEY(country, id));
4
CREATE TABLE user_by_country ( country text, id int, firstname text, lastname text, PRIMARY KEY(country, id));
Materialzed View In ActionCREATE MATERIALIZED VIEW user_by_country AS SELECT country, id, firstname, lastnameFROM user WHERE country IS NOT NULL AND id IS NOT NULLPRIMARY KEY(country, id)
5
Materialzed View SyntaxCREATE MATERIALIZED VIEW [IF NOT EXISTS] keyspace_name.view_name
AS SELECT column1, column2, ...
FROM keyspace_name.table_name
WHERE column1 IS NOT NULL AND column2 IS NOT NULL ...
PRIMARY KEY(column1, column2, ...)
Must select all primary key columns of base table
• IS NOT NULL condition for now • more complex conditions in future
• at least all primary key columns of base table (ordering can be different)
• maximum 1 column NOT pk from base table
6
Materialized Views Demo 7
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
① • send mutation to all replicas• waiting for ack(s) with CL
8
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
② Acquire local lock on base table partition
9
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
③ Local read to fetch current values
SELECT * FROM user WHERE id=1
10
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
④ Create BatchLog with
• DELETE FROM user_by_country WHERE country = ‘old_value’
• INSERT INTO user_by_country(country, id, …) VALUES(‘FR’, 1, ...)
11
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
⑤ Execute async BatchLog
to paired view replica with CL = ONE
12
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
⑥ Apply base table updade locallySET COUNTRY=‘FR’
13
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
⑦ Release local lock
14
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
⑧ Return ack to coordinator
15
Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
⑨ If CL ack(s)
received, ack client
16
MV Failure Cases: concurrent updates
Read base row (country=‘UK’)• DELETE FROM mv WHERE
country=‘UK’• INSERT INTO mv …(country)
VALUES(‘US’)• Send async BatchLog • Apply update country=‘US’
1) UPDATE … SET country=‘US’ 2) UPDATE … SET country=‘FR’
Read base row (country=‘UK’)• DELETE FROM mv WHERE
country=‘UK’• INSERT INTO mv …(country)
VALUES(‘FR’)• Send async BatchLog • Apply update country=‘FR’
t0
t1
t2
Without local
lock
17
MV Failure Cases: concurrent updates
Read base row (country=‘UK’)• DELETE FROM mv WHERE
country=‘UK’• INSERT INTO mv …(country)
VALUES(‘US’)• Send async BatchLog • Apply update country=‘US’
1) UPDATE … SET country=‘US’ 2) UPDATE … SET country=‘FR’
Read base row (country=‘UK’)• DELETE FROM mv WHERE
country=‘UK’• INSERT INTO mv …(country)
VALUES(‘FR’)• Send async BatchLog • Apply update country=‘FR’
t0
t1
t2
Without local
lock
18
INSERT INTO mv …(country) VALUES(‘US’)INSERT INTO mv …(country) VALUES(‘FR’)
MV Failure Cases: concurrent updates
Read base row (country=‘UK’)• DELETE FROM mv WHERE
country=‘UK’• INSERT INTO mv …(country)
VALUES(‘US’)• Send async BatchLog • Apply update country=‘US’
1) UPDATE … SET country=‘US’ 2) UPDATE … SET country=‘FR’
Read base row (country=‘US’)• DELETE FROM mv WHERE
country=‘US’• INSERT INTO mv …(country)
VALUES(‘FR’)• Send async BatchLog • Apply update country=‘FR’
With local lo
ck
🔒
🔓 🔒
🔓 19
MV Failure Cases: failed updates to MV
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
⑤ Execute async BatchLog
to paired view replica with CL = ONE
✘
MV replica d
own
20
MV Failure Cases: failed updates to MV
C*
C*
C*
C*
C* C*
C* C*
UPDATE user SET country=‘FR’
WHERE id=1
BatchLog replay
MV replica u
p
21
Materialized View Performance• Write performance
• local lock• local read-before-write for MV à update contention on partition (most of perf hits) • local batchlog for MV• ☞ you only pay this price once whatever number of MV • for each base table update: mv_count x 2 (DELETE + INSERT) extra mutations
22
Materialized View Performance• Write performance vs manual denormalization
• MV better because no client-server network traffic for read-before-write • MV better because less network traffic for multiple views (client-side BATCH)
• Makes developer life easier à priceless
23
Materialized View Performance• Read performance vs secondary index
• MV better because single node read (secondary index can hit many nodes)• MV better because single read path (secondary index = read index + read data)
24
Materialized Views Consistency• Consistency level
• CL honoured for base table, ONE for MV + local batchlog
• Weaker consistency guarantees for MV than for base table.
• Exemple, write at QUORUM • guarantee that QUORUM replicas of base tables have received write• guarantee that QUORUM of MV replicas will eventually receive DELETE + INSERT
25
Materialized Views Gotchas• Beware of hot spots !!!
• MV user_by_gender 😱
26
Q & A
! "
27
User Define Functions (UDF)• Why ? • Detailed Impl • UDAs • Gotchas
Rationale• Push computation server-side
• save network bandwidth (1000 nodes!)• simplify client-side code• provide standard & useful function (sum, avg …)• accelerate analytics use-case (pre-aggregation for Spark)
29
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALL ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE language AS $$ // source code here$$;
30
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE language AS $$ // source code here$$;
An UDF is keyspace-wide
31
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE languageAS $$ // source code here$$;
Param name to refer to in the code Type = CQL3 type
32
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE language // jAS $$ // source code here$$;
Always called Null-check mandatory in code
33
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT RETURN returnTypeLANGUAGE language // javAS $$ // source code here$$;
If any input is null, code block is skipped and return null
34
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnType LANGUAGE languageAS $$ // source code here$$;
CQL types • primitives (boolean, int, …) • collections (list, set, map) • tuples • UDT
35
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE languageAS $$ // source code here$$; JVM supported languages
• Java, Scala • Javascript (slow) • Groovy, Jython, JRuby • Clojure ( JSR 223 impl issue)
36
How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE languageAS $$ // source code here$$;
37
UDF Demo
38
UDA• Real use-case for UDF
• Aggregation server-side à huge network bandwidth saving
• Provide similar behavior for Group By, Sum, Avg etc …
39
How to create an UDA ?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;
Only type, no param name
State type
Initial state type
40
How to create an UDA ?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;
Accumulator function. Signature: accumulatorFunction(stateType, type1, type2, …)
RETURNS stateType
41
How to create an UDA ?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;
Optional final function. Signature: finalFunction(stateType)
42
How to create an UDA ?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;
UDA return type ? If finalFunction • return type of finalFunction Else • return stateType
43
UDA Demo
44
Gotchas
C* C*
C*
C*
UDA ①
② & ③
⑤ ② & ③
② & ③
45
Gotchas
C* C*
C*
C*
UDA ①
② & ③
⑤ ② & ③
② & ③
46
Why do not apply UDF/UDA on replica node ?
Gotchas
C* C*
C*
C*
UDA ①
② & ③
④ • apply accumulatorFunction• apply finalFunction
⑤ ② & ③
② & ③
1. Because of eventual consistency
2. UDF/UDA applied AFTER last-write-win logic
47
Gotchas
48
• UDA in Cassandra is not distributed !
• Execute UDA on a large number of rows (106 for ex.) • single fat partition• multiple partitions• full table scan
• à Increase client-side timeout• default Java driver timeout = 12 secs• JAVA-1033 JIRA for per-request timeout setting
Cassandra UDA or Apache Spark ?
49
Consistency Level
Single/Multiple Partition(s)
Recommended Approach
ONE Single partition UDA with token-aware driver because node local
ONE Multiple partitions Apache Spark because distributed reads
> ONE Single partition UDA because data-locality lost with Spark
> ONE Multiple partitions Apache Spark definitely
Cassandra UDA or Apache Spark ?
50
Consistency Level
Single/Multiple Partition(s)
Recommended Approach
ONE Single partition UDA with token-aware driver because node local
ONE Multiple partitions Apache Spark because distributed reads
> ONE Single partition UDA because data-locality lost with Spark
> ONE Multiple partitions Apache Spark definitely
Q & A
! "
51
New Storage Engine• Data structure • Disk space usage
Pre 3.0 data structure
Map<byte[ ], SortedMap<byte[ ], Cell>>
53
CREATE TABLE sensor_data( sensor_id uuid, date timestamp, sensor_type text, sensor_value double, PRIMARY KEY(sensor_id, date));
Pre 3.0 on disk layout
54
RowKey: de305d54-75b4-431b-adb2-eb6b9e546014 => (column=2015-04-27 10:00:00+0100:, value=, timestamp=1430128800) => (column=2015-04-27 10:00:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128800) => (column=2015-04-27 10:00:00+0100:sensor_value, value=23.48, timestamp=1430128800) => (column=2015-04-27 10:01:00+0100:, value=, timestamp=1430128860) => (column=2015-04-27 10:01:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128860) => (column=2015-04-27 10:01:00+0100:sensor_value, value=24.08, timestamp=1430128860)
Clustering values are repeated for each normal column
Full timestamp storage
Cassandra 3.0 data structure
Map<byte[ ], SortedMap<ClusteringColumn, Row>>
55
CREATE TABLE sensor_data( sensor_id uuid, date timestamp, sensor_type text, sensor_value double, PRIMARY KEY(sensor_id, date));
Cassandra 3.0 on disk layout
56
PartitionKey: de305d54-75b4-431b-adb2-eb6b9e546014 => clusteringColumn:2015-04-27 10:00:00+0100 => row_timestamp=1430128800 => (column_value=‘Temperature’, delta_encoded_timestamp=+0) => (column_value=23.48, delta_encoded_timestamp=+0) => clusteringColumn:2015-04-27 10:01:00+0100 => row_timestamp=1430128860 => (column_value=‘Temperature’, delta_encoded_timestamp=+0) => (column_value=24.08, delta_encoded_timestamp=+0)
Delta-encoded timestamp vs row timestamp
Gains
57
• No clustering value repetition
• Column labels are stored only once in meta data
• Delta encoding of timestamp, 8 bytes saved each time
• Less disk space used
Benchmarks
58
CREATE TABLE events ( id uuid, date timeuuid, prop1 int, prop2 text, prop3 float, PRIMARY KEY(id, date));
106 rows
Small string
Benchmarks
59
CREATE TABLE largetext( key int, prop1 int, prop2 text,PRIMARY KEY(id));
106 rows
Large string (1000)
Benchmarks
60
CREATE TABLE largeclustering( key int, clust text, prop1 int, prop2 set<float>,PRIMARY KEY(id, clust));
106 rows Medium string (100)
50 items
Benchmarks
61
CREATE TABLE events ( id uuid, date timeuuid, prop1 int, prop2 text, prop3 float, PRIMARY KEY(id, date)) WITH COMPACT STORAGE ;
Q & A
! "
62
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You
63