A middleware for storing massive RDF graphs into NoSQL
Workshop de Teses e Dissertações em Banco de Dados (WTDBD)
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
PhD Candidate
Prof. Dr. Ronaldo dos Santos Mello
Advisor
Agenda
● Initial information
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
● Rendezvous
○ Storing: Fragmentation, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
Initial information
● Enrollment date: 3/2015
● Expected completion date: 2/2019
● Requirements completed:
○ SAD, EQD, Proficiências (proficiency exams), Estágio Docência (teaching internship)
Introduction: Motivation
● Since the Semantic Web proposal in 2001, many advances have been introduced by the W3C
● RDF and SPARQL are now widespread:
○ Best Buy:
■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
○ Globo.com:
■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
Background: NoSQL
● No SQL interface
● No ACID transactions
● Very scalable
● Schemaless
https://db-engines.com/en/ranking
State of the Art - Triplestores

Triplestore | Frag. | Replication | Partitioning | Model | In-memory | Workload
Hexastore (2008) | No | No | No | Native | No | No
SW-Store (2009) | No | No | Vertical | SQL | No | No
CumulusRDF (2011) | No | No | Vertical | Columnar (Cassandra) | No | No
SPOVC (2012) | No | No | Horizontal | Columnar (MonetDB) | No | No
WARP (2013) | Yes | N-hop replication on partition boundary | Hash | Native | No | Typical queries
Rainbow (2015) | No | No | Hash | Polyglot | K/V cache | Static
ScalaRDF (2016) | No | Next-hop | Hash | Polyglot | K/V cache | No
Rendezvous | Yes | N-hop replication on fragment and partition boundary | V and H | Polyglot | K/V and local cache | Typical queries
Rendezvous
● Triplestore implemented as a middleware for storing massive RDF graphs into multiple NoSQL databases
● Novel data partitioning approach
● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● Caching structure that accelerates query response
Workload awareness
Given the graph:

[Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and predicates p1–p11]

If the following queries are issued:

SELECT ?x WHERE {
B p2 C .
C p3 ?x
}

SELECT ?x WHERE {
F p6 G .
F p9 L .
F p8 ?x
}

then, when a new triple is to be stored, the Dataset Characterizer holds:

Star-shaped (indexed by the subject/object):
F → {F p6 G, F p9 L, F p8 ?}

Chain-shaped (indexed by the predicate):
p3 → {B p2 C, C p3 ?}
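The Dataset Characterizer above can be sketched as two small indexes: one from subjects/objects to star patterns and one from predicates to chain patterns. This is a minimal Python illustration; the class and method names (DatasetCharacterizer, record_star, tends_to_star, and so on) are assumptions for exposition, not Rendezvous internals.

```python
from collections import defaultdict

class DatasetCharacterizer:
    """Records which query shapes each subject/object or predicate appears in."""
    def __init__(self):
        self.star = defaultdict(list)   # subject/object -> star patterns seen
        self.chain = defaultdict(list)  # predicate -> chain patterns seen

    def record_star(self, term, patterns):
        self.star[term].append(patterns)

    def record_chain(self, predicate, patterns):
        self.chain[predicate].append(patterns)

    def tends_to_star(self, term):
        return term in self.star

    def tends_to_chain(self, predicate):
        return predicate in self.chain

# The two example queries from this slide:
dc = DatasetCharacterizer()
dc.record_star("F", [("F", "p6", "G"), ("F", "p9", "L"), ("F", "p8", "?x")])
dc.record_chain("p3", [("B", "p2", "C"), ("C", "p3", "?x")])
```

When a new triple arrives, these lookups decide whether it is expanded as a star fragment (around its subject/object) or as a chain fragment (around its predicate), as the next slides show.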
Star Fragmentation (n-hop expansion)
Given the graph and this state of the Dataset Characterizer:

[Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and predicates p1–p11]

Chain-shaped: p3 → {B p2 C, C p3 ?}
Star-shaped: F → {F p6 G, F p9 L, F p8 ?}

When the triple F p10 C is to be stored: F tends to appear in star queries with diameter 1, so we expand F p10 C to a 1-hop fragment around F before storing it.

[Figure: the resulting 1-hop fragment around F, covering predicates p5–p10 and nodes B, C, F, G, H, I, L]
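The 1-hop expansion can be sketched as collecting every triple within n hops of the fragment's center. A minimal Python sketch; the graph literal approximates the slide's figure, and expand_star is an assumed helper name, not part of Rendezvous:

```python
def expand_star(triples, center, hops=1):
    """Collect all triples within `hops` edges of `center` (an n-hop star fragment)."""
    fragment = set()
    frontier = {center}
    for _ in range(hops):
        nxt = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                fragment.add((s, p, o))
                nxt.update((s, o))
        frontier = nxt
    return fragment

# Approximation of the slide's graph around F:
graph = [("A", "p1", "B"), ("B", "p2", "C"), ("B", "p5", "F"),
         ("F", "p6", "G"), ("F", "p7", "I"), ("F", "p8", "H"),
         ("F", "p9", "L"), ("F", "p10", "C")]
frag = expand_star(graph, "F")  # every triple touching F, including F p10 C
```

Raising `hops` grows the fragment (and the replication) exponentially, which is the trade-off the evaluation section revisits.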
Star Fragmentation (mapping)
With the expanded fragment, we translate it to a JSON document and store it in the document database:

{
subject: F,
p5: { object: B },
p6: G,
p7: I,
p8: H,
p9: L,
p10: C
}
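The fragment-to-document translation above can be sketched in a few lines of Python. The convention assumed here (outgoing edges become predicate → object entries; the incoming edge becomes a nested {object: ...} entry, mirroring the slide's p5 field) is an interpretation of the figure, not a documented Rendezvous rule:

```python
def fragment_to_document(fragment, center):
    """Map a star fragment to one JSON-like document keyed by predicate."""
    doc = {"subject": center}
    for s, p, o in fragment:
        if s == center:
            doc[p] = o                  # outgoing edge: predicate -> object
        else:
            doc[p] = {"object": s}      # incoming edge, nested as in the slide
    return doc

frag = {("B", "p5", "F"), ("F", "p6", "G"), ("F", "p7", "I"),
        ("F", "p8", "H"), ("F", "p9", "L"), ("F", "p10", "C")}
doc = fragment_to_document(frag, "F")
```

The resulting dict is exactly what one would hand to a document store (e.g. pymongo's `collection.insert_one(doc)`).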
Chain Fragmentation (n-hop expansion)
Given the graph and this state of the Dataset Characterizer:

[Figure: example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M and predicates p1–p11]

Chain-shaped: p3 → {B p2 C, C p3 ?}
Star-shaped: F → {F p6 G, F p9 L, F p8 ?}

When the triple C p3 G is to be stored: p3 tends to appear in chain queries with max-diameter 1, so we expand C p3 G to a 1-hop fragment before storing it.

[Figure: the resulting 1-hop fragment with triples B p2 C, C p3 D, C p3 G, and F p6 G]
Chain Fragmentation (mapping)
With the expanded fragment, we translate it to a set of columnar tables and store them in the columnar database:

Table p2: (Subj: B, Obj: C)
Table p3: (Subj: C, Obj: D), (Subj: C, Obj: G)
Table p6: (Subj: F, Obj: G)
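The chain mapping is simply a grouping of the fragment's triples by predicate, one (Subj, Obj) table per predicate. A minimal Python sketch (in the actual system these become tables in the columnar store, e.g. Cassandra):

```python
from collections import defaultdict

def fragment_to_tables(fragment):
    """Map a chain fragment to per-predicate tables of (Subj, Obj) rows."""
    tables = defaultdict(list)
    for s, p, o in fragment:
        tables[p].append((s, o))
    return dict(tables)

# The expanded fragment from this slide:
chain = [("B", "p2", "C"), ("C", "p3", "D"), ("C", "p3", "G"), ("F", "p6", "G")]
tables = fragment_to_tables(chain)
```

Keying tables by predicate means a chain pattern like `?y p2 ?z . ?z p3 ?w` becomes a join between the small p2 and p3 tables instead of a scan over all triples.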
Partitioning
[Figure: example RDF graph split into partitions P1, P2, and P3]

If a graph is larger than a single server's capacity, the Rendezvous DBA can create multiple partitions. Each NoSQL server (columnar or document database) can hold one or more partitions, and each partition lives on exactly one server.
Partitioning (boundary replication)
Dictionary (fragment hash):
(F p10 C), size 2 → {P1, P2}
(C p3 D), size 2 → {P3}
(L p12 H), size 1 → {P2}

Partition contents:
P1: A p1 B; F p10 C; ...
P2: F p10 C; L p12 H; ...
P3: C p3 D; ...

[Figure: the graph with partitions P1, P2, and P3 mapped to columnar and document databases]

If a triple lies on the boundary between two partitions, it is replicated in both partitions. The size of this boundary is defined by the DBA. Rendezvous manages the partitions by recording them in the dictionary.
Partitioning (data placement)
[Figure: the same dictionary, partition contents, and partitioned graph shown on the previous slide]

The fragment hash supports data placement: based on the triple and the size of its fragment, Rendezvous finds the best partition in which to store the triple.
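The slides do not spell out how the "best" partition is chosen, so the sketch below assumes a simple least-loaded policy among the candidate partitions returned by the dictionary; both the policy and the function name place_fragment are illustrative assumptions:

```python
def place_fragment(size, candidates, load):
    """Pick a partition for a fragment of `size` triples.

    Assumption: 'best' = currently least-loaded candidate; the real
    Rendezvous placement policy may weigh other factors.
    """
    best = min(candidates, key=lambda p: load[p])
    load[best] += size  # account for the fragment we just placed
    return best

# Candidate partitions for (F p10 C) come from the dictionary: {P1, P2}.
load = {"P1": 10, "P2": 4, "P3": 7}
chosen = place_fragment(2, ["P1", "P2"], load)
```

Whatever the policy, the key point is that placement decisions operate on whole fragments (using their size from the dictionary), not on individual triples, so fragments are never split across servers.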
Rendezvous: Querying
● Query evaluation
● Update workload
● Query decomposition
● Caching
Query evaluation

Given the graph:

[Figure: example RDF graph with nodes A–M, predicates p1–p11, and partitions P1, P2, P3]

If the following query is issued:

Q: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}

1. Rendezvous searches the query for:
1.1. Simple queries
1.2. Star queries
1.3. Chain queries
2. It then updates the Dataset Characterizer

Star:
Qs: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H
}

Chain:
Qc: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}
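A first cut at this evaluation step is to group the query's triple patterns by subject: a subject shared by several patterns forms a star subquery, and the remaining patterns form the chain. This is a simplified sketch of the idea (real chain detection would also verify the object-to-subject links); the function name decompose is an assumption:

```python
from collections import defaultdict

def decompose(bgp):
    """Split a basic graph pattern into star groups and leftover chain patterns.

    Star = all patterns sharing one subject; chain = the rest, in query order.
    """
    by_subject = defaultdict(list)
    for s, p, o in bgp:
        by_subject[s].append((s, p, o))
    stars = {s: pats for s, pats in by_subject.items() if len(pats) > 1}
    chains = [t for pats in by_subject.values() if len(pats) == 1 for t in pats]
    return stars, chains

# Query Q from this slide, as triple patterns:
Q = [("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H"),
     ("?x", "p1", "?y"), ("?y", "p2", "?z"), ("?z", "p3", "?w")]
stars, chains = decompose(Q)
```

Here `stars` recovers Qs (the three patterns on ?w) and `chains` recovers the p1-p2-p3 patterns of Qc, joined afterwards on the shared variable ?w.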
Query decomposition

Given the graph:

[Figure: example RDF graph with nodes A–M, predicates p1–p11, and partitions P1, P2, P3]

Star:
Qs: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H
}

Chain:
Q2c: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w
}

Document database (star subquery):
D: db.partition2.find({ p6: "G", p7: "I", p8: "H" })

Columnar database (chain subquery), partition 1:
Cp1: SELECT S1, O1 FROM p1
Cp2: SELECT S2, O2 FROM p2 WHERE Obj = S1

Columnar database (chain subquery), partition 3:
Cp3: SELECT S3, O3 FROM p3 WHERE Obj = S2

Rendezvous finds the right partition using the dictionary and translates the SPARQL query into the final queries processed by the NoSQL databases.
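Because star fragments are stored as predicate-keyed documents ({subject: F, p6: G, ...}), translating a star subquery amounts to building a filter document from its bound objects, exactly the dict one would pass to pymongo's `collection.find`. A hedged sketch (star_to_filter is an assumed helper; the exact Rendezvous translation may differ):

```python
def star_to_filter(star_patterns):
    """Translate a star subquery into a document-store filter.

    Bound objects become equality conditions on the predicate fields;
    patterns whose object is a variable add no condition here.
    """
    return {p: o for _, p, o in star_patterns if not o.startswith("?")}

# Qs from this slide:
filt = star_to_filter([("?w", "p6", "G"), ("?w", "p7", "I"), ("?w", "p8", "H")])
# usable as: db.partition2.find(filt)
```

The matching documents' `subject` fields are the bindings for ?w, which are then joined with the chain subquery's results.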
Caching (two-level cache)

Given the graph:

[Figure: example RDF graph with nodes A–M, predicates p1–p11, and partitions P1, P2, P3]

After the last query was issued:

Q: SELECT ?x WHERE {
?w p6 G .
?w p7 I .
?w p8 H .
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 ?w
}

Near cache (in-memory tree map):
A:p1:B → {A:p1:B, B:p2:C}
B:p2:C → {B:p2:C, C:p3:D}

Remote cache (key/value NoSQL database):
...
A:p1:B → {A:p1:B, B:p2:C}
B:p2:C → {B:p2:C, C:p3:D}
...
B:p5:F → {B:p5:F, F:p9:D}

Normally, the near cache is smaller than the remote cache.
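The two levels can be sketched as a small bounded in-memory map in front of a larger remote key/value store. In the sketch below a plain dict stands in for the remote store (Redis in the actual setup), an LRU-bounded OrderedDict stands in for the tree map, and the eviction policy is an assumption, since the slides do not specify one:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Small near cache (in-memory, LRU-bounded) over a larger remote K/V store."""
    def __init__(self, near_capacity, remote):
        self.near = OrderedDict()
        self.capacity = near_capacity
        self.remote = remote            # stand-in for e.g. a Redis client

    def get(self, key):
        if key in self.near:
            self.near.move_to_end(key)  # refresh LRU position
            return self.near[key]
        if key in self.remote:          # remote hit: promote into near cache
            self.put(key, self.remote[key])
            return self.remote[key]
        return None                     # full miss: go to the NoSQL backends

    def put(self, key, value):
        self.near[key] = value
        self.near.move_to_end(key)
        self.remote[key] = value
        if len(self.near) > self.capacity:   # keep the near cache smaller
            self.near.popitem(last=False)

# State matching this slide: the remote cache already holds B:p5:F.
remote = {"B:p5:F": ["B:p5:F", "F:p9:D"]}
cache = TwoLevelCache(2, remote)
cache.put("A:p1:B", ["A:p1:B", "B:p2:C"])
cache.put("B:p2:C", ["B:p2:C", "C:p3:D"])
hit = cache.get("B:p5:F")  # near miss, remote hit, promoted to near
```

Keys are triples (`S:p:O`) and values are the fragments they were answered with, so a cache hit returns whole join partners, not single triples.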
Caching (querying)
Given the graph:

[Figure: example RDF graph with nodes A–M, predicates p1–p11, and partitions P1, P2, P3]

If the following query is issued:

Q: SELECT ?x WHERE {
?x p1 ?y .
?y p2 ?z .
?z p3 ?w .
?y p5 F
}

Near cache (in-memory tree map):
A:p1:B → {A:p1:B, B:p2:C}
B:p2:C → {B:p2:C, C:p3:D}

Remote cache (key/value NoSQL database):
...
A:p1:B → {A:p1:B, B:p2:C}
B:p2:C → {B:p2:C, C:p3:D}
...
B:p5:F → {B:p5:F, F:p9:D}

This query can be answered using only triples from the cache.
Evaluation
● LUBM: an ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries covering a variety of properties
● Generated dataset with 4000 universities (around 100 GB, containing around 500 million triples)
● 12 queries with joins; all have at least one subject-subject join, and six also have at least one subject-object join
● Apache Jena 3.2.0 on Java 1.8, with Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
● Amazon m3.xlarge spot instances with 7.5 GB of memory and 1 x 32 SSD storage
Evaluation: Rendezvous performance
The larger the number of hops (i.e., the more replication), the larger, exponentially, the dataset size and the loading time. However, since joins are avoided, the query response time decreases.
Conclusions
● Rendezvous contributes to:
○ The graph partitioning problem, via fragments
○ Better query response time, through n-hop and partition-boundary replication
○ Better query response time, via two-level caching
○ Scalable RDF storage, provided by NoSQL databases (polyglot persistence)
● About the preliminary evaluation:
○ Fragments are scalable
Current state
● Accepted papers
○ SBBD 2017: “Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive
RDF Graphs stored in NoSQL Databases”
● Submissions
○ SAC 2018: “Large scale RDF storage using multiple NoSQL databases”
○ TKDE: “Persistence of RDF Data into NoSQL Databases: A Survey and a Unified Reference
Architecture”
○ VLDB 2018: TBD
Schedule (in semesters, 2/2017 to 1/2019)
● VLDB and TKDE
● Query mapping formalization
● Domain-based experiments
● Thesis writing
● PhD presentation
Thank you!
Simpósio Brasileiro de Banco de Dados (SBBD)
Uberlândia, October 2017
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
State of the Art - NoSQL Triplestores
RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.