TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data

Martin Theobald
Max Planck Institute for Computer Science / Stanford University
Joint work with Ralf Schenkel and Gerhard Weikum
[Example document: a heterogeneous XML article with sec, par, bib, item, inproc, and title elements, containing the title "Current Approaches to XML Data Management" and text fragments such as "Native XML data base systems can store schemaless data ...", "Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files ...", "XML-QL: A Query Language for XML.", "Native XML Data Bases.", "Proc. Query Languages Workshop, W3C, 1998.", and "XML queries with an expressive power similar to that of Datalog ..."]

Example NEXI-style query:
//article[.//bib[about(.//item, "W3C")]]
  //sec[about(.//, "XML retrieval")]
    //par[about(.//, "native XML databases")]
[Second example document: an article with sec, par, bib, item, title, and url elements, containing the titles "The XML Files", "The Ontology Game", and "The Dirty Little Secret", the URL "w3c.org/xml", and text fragments such as "What does XML add for retrieval? It adds formal ways ...", "Sophisticated technologies developed by smart people.", and "There, I've said it - the 'O' word. If anyone is thinking along ontology lines, I would like to break some old news ..."]
Ranking · Vagueness · Pruning

Goal: Efficiently retrieve the best (top-k) results of a similarity query
• Extend existing threshold algorithms for inverted lists [Güntzer, Balke & Kießling, VLDB '00; Fagin, PODS '01] to XML data and XPath-like full-text search
• Support non-schematic, heterogeneous data sources
• Efficiently support IR-style vague search
• Use a combined inverted index for content & structure
• Avoid full index scans; postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant index structures
XML-IR: History and Related Work (timeline 1995 to 2005)

• IR on structured docs (SGML), ~1995: OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)
• Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto)
• XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery 1.0 (W3C), XPath 2.0 & XQuery 1.0 Full-Text (W3C), TeXQuery (AT&T Labs), NEXI (INEX Benchmark)
• IR on XML: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)
• Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
[TopX system architecture]
• Indexing time: an Indexer/Crawler populates a DBMS with inverted lists over a unified text & XML schema, plus index metadata (selectivities, histograms, correlations) and an ontology/large thesaurus (WordNet, OpenCyc, etc.)
• Query processing time: the TopX Query Processor runs scan threads over the index (sorted/sequential accesses, SA), schedules random accesses (RA), probes auxiliary predicates, and maintains a candidate cache, a candidate queue, and the top-k queue
• Frontends: Web interface, Web service, API
• Core components: (1) top-k XPath processing, (2) probabilistic index access scheduling, (3) probabilistic candidate pruning, (4) dynamic query expansion
Outline:
1. Top-k XPath processing
2. Probabilistic index access scheduling
3. Probabilistic candidate pruning
4. Dynamic query expansion
5. Experiments: TREC & INEX benchmarks
Data Model
• XML trees (no XLinks or ID/IDref attributes)
• Pre-/postorder node labels
• Redundant full-content text nodes (with stemming, no stopwords)

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>
[Figure: element tree of the example article with (pre, post) labels: article (1, 6); title (2, 1), abs (3, 2), sec (4, 5); title (5, 3), par (6, 4). Each element carries its redundant full-content text, e.g. article1: "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data", sec4: "native xml data base native xml data base system store schemaless data". Full-content term frequencies: ftf("xml", article1) = 4, ftf("xml", sec4) = 2.]
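As an illustration (not the TopX indexer), the pre-/postorder labels and full-content term frequencies of the example can be computed in a few lines of Python. The tokenizer here is a deliberate simplification: it lowercases and splits on word characters, with no stemming or stopword removal, which suffices for the term "xml".

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    # Simplified tokenizer: lowercase word tokens; the real system
    # additionally applies stemming and stopword removal.
    return re.findall(r"[a-z0-9]+", text.lower())

def annotate(root):
    """Assign (pre, post) labels in document order and collect each
    element's full-content token list (its own text plus all text of
    its descendants)."""
    pre_counter, post_counter, info = [0], [0], {}

    def visit(elem):
        pre_counter[0] += 1
        pre = pre_counter[0]
        tokens = tokenize(elem.text or "")
        for child in elem:
            tokens += visit(child)
            tokens += tokenize(child.tail or "")
        post_counter[0] += 1
        info[elem] = {"pre": pre, "post": post_counter[0], "tokens": tokens}
        return tokens

    visit(root)
    return info

def ftf(term, elem, info):
    # Full-content term frequency: occurrences of `term` anywhere
    # in the element's subtree text.
    return info[elem]["tokens"].count(term)

doc = """<article><title>XML Data Management</title>
<abs>XML management systems vary widely in their expressive power.</abs>
<sec><title>Native XML Data Bases.</title>
<par>Native XML data base systems can store schemaless data.</par></sec></article>"""

root = ET.fromstring(doc)
info = annotate(root)
sec = root.find("sec")
# article gets labels (1, 6); ftf("xml", article) = 4, ftf("xml", sec) = 2
```

The redundant full-content text is what makes ftf a per-element statistic: every element "sees" all terms in its subtree, so scores can be computed for any element granularity without joins at query time.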
Scoring Model [INEX '06/'07]

XML-specific extension to Okapi BM25 (originating from probabilistic IR on unstructured text):
• ftf instead of tf
• ef instead of df
• Element-type-specific length normalization
• Tunable parameters k1 and b
Example: bib["transactions"] vs. par["transactions"] are scored against different element-type statistics.
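A sketch of the substitutions above in an illustrative BM25-style formula; the constants and the exact smoothing used by TopX differ in details and are given in the INEX papers, so this is only a hedged approximation of the idea.

```python
import math

def bm25_element_score(ftf, ef, n_elements, avg_len, elem_len, k1=1.25, b=0.75):
    """Illustrative BM25-style score for one (tag, term) condition on
    one element. Substitutions relative to document-level BM25:
      tf -> ftf: full-content term frequency within the element
      df -> ef : number of elements of this tag type containing the term
      length normalization uses element-type-specific statistics
        (elem_len and avg_len computed per tag, e.g. over all <par>
        elements, so bib["transactions"] and par["transactions"] are
        normalized differently).
    k1 and b are the usual tunable BM25 parameters."""
    idf = math.log((n_elements - ef + 0.5) / (ef + 0.5) + 1.0)
    norm = k1 * ((1 - b) + b * elem_len / avg_len)
    return idf * (ftf * (k1 + 1)) / (ftf + norm)

# Two occurrences in a shorter-than-average element outscore a single
# occurrence in an element of average length (same tag statistics):
s1 = bm25_element_score(ftf=2, ef=100, n_elements=10_000, avg_len=50, elem_len=30)
s2 = bm25_element_score(ftf=1, ef=100, n_elements=10_000, avg_len=50, elem_len=50)
```

Because avg_len is a per-tag statistic, a long element type such as sec is not penalized against a short one such as title; that is the point of element-type-specific length normalization.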
A naive "merge-then-sort" approach requires between O(m·n) and O(m·n²) runtime and O(m·n) access cost (for m query conditions over lists of length n).
Fagin's NRA [PODS '01] at a Glance

Corpus: d1, ..., dn; query: q = (t1, t2, t3)

Inverted index (each list sorted by descending score), k = 1:
t1: d78:0.9, d23:0.8, d10:0.8, d1:0.7, d88:0.2, ...
t2: d64:0.8, d23:0.6, d10:0.6, d13:0.2, d78:0.1, ...
t3: d10:0.7, d78:0.5, d64:0.4, d99:0.2, d34:0.1, ...
(e.g. s(t1, d10) = 0.8, s(t2, d10) = 0.6, s(t3, d10) = 0.7)

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3:
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0
Find the top-k documents that maximize s(t1, dj) + s(t2, dj) + ... + s(tm, dj)
(non-conjunctive, "andish" evaluation)

1. NRA(q, L):
2.   scan all lists Li (i = 1..m) in parallel & consider doc d at position posi
3.     E(d) := E(d) ∪ {i};
4.     highi := s(ti, d);
5.     worstscore(d) := ∑ { s(ti, d) | i ∈ E(d) };
6.     bestscore(d) := worstscore(d) + ∑ { highi | i ∉ E(d) };
7.     if worstscore(d) > min-k then
8.       add d to top-k;
9.       min-k := min { worstscore(d') | d' ∈ top-k };
10.    else if bestscore(d) > min-k then
11.      candidates := candidates ∪ {d};
12.    if max { bestscore(d') | d' ∈ candidates } ≤ min-k then
13.      return top-k;  // STOP!
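For illustration (this is a sketch, not the TopX implementation), the NRA loop above can be run in Python over the example lists. The function scans all lists round-robin by depth and stops as soon as no candidate's best-score can beat the current min-k; on the slide's data the top-1 answer d10 becomes certain at scan depth 3.

```python
def nra(lists, k):
    """Sketch of Fagin's NRA: round-robin sorted access over the
    inverted lists; maintain [worstscore, bestscore] intervals and
    stop when no candidate's bestscore exceeds min-k."""
    seen = {}                                 # doc -> {term: score}
    high = {t: l[0][1] for t, l in lists.items()}
    top, worst = [], {}
    for depth in range(max(len(l) for l in lists.values())):
        for t, l in lists.items():            # one sorted access per list
            if depth < len(l):
                doc, score = l[depth]
                seen.setdefault(doc, {})[t] = score
                high[t] = score               # current high_i per list
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[t] for t in lists if t not in seen[d])
                for d in seen}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = worst[top[-1]] if len(top) == k else 0.0
        if len(top) == k and all(best[d] <= min_k for d in seen if d not in top):
            break                             # early termination: STOP!
    return [(d, worst[d]) for d in top]

# The example from the slides: d10 wins with total score 2.1.
lists = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d13", 0.2), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
}
top1 = nra(lists, k=1)
```

Note that the sketch recomputes all bounds per depth for clarity; a production implementation maintains them incrementally, as TopX does.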
Inverted Block-Index for Content & Structure

• Mostly sorted (= sequential) access to large element blocks on disk
• Group elements in descending order of (max-score, docid)
• Block-scan all elements per doc for a given (tag, term) key
• Stored as inverted files or database tables
• Two B+-tree indexes over the full range of attributes (IOTs in Oracle)

sec["xml"]:
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

title["native"]:
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    14   10    0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

par["retrieval"]:
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

(Each list is consumed by sorted accesses (SA); random accesses (RA) fetch missing entries.)

Example query:
//sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]
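A small sketch of how one (tag, term) list is organized into blocks, under the ordering stated above; the tuple layout mirrors the tables but this is not the actual TopX storage code.

```python
from collections import defaultdict

def build_block_index(entries):
    """Illustrative block layout for one (tag, term) key: entries are
    (eid, docid, score, pre, post) tuples. All elements of a document
    form one block; blocks are ordered by descending (max-score, docid)
    so that sorted access sees the most promising documents first and
    can block-scan all elements of a document in one go."""
    blocks = defaultdict(list)
    for e in entries:
        blocks[e[1]].append(e)                       # group by docid
    def block_key(item):
        docid, elems = item
        return (-max(e[2] for e in elems), docid)    # max-score desc, docid asc
    return [elems for _, elems in sorted(blocks.items(), key=block_key)]

# sec["xml"] entries from the table above:
sec_xml = [(46, 2, 0.9, 2, 15), (9, 2, 0.5, 10, 8),
           (171, 5, 0.85, 1, 20), (84, 3, 0.1, 1, 12)]
blocks = build_block_index(sec_xml)
# Block order: doc 2 (max-score 0.9, two elements), doc 5, doc 3.
```

Grouping per document is what lets the engine evaluate structural joins locally: once a document's block is in memory, all its candidate elements for this condition are available without further disk accesses.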
Navigational Element Index

Tag "sec":
eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

• Additional index for tag paths
• RAs on a B+-tree index using (docid, tag) as key
• Few & judiciously scheduled "expensive predicate" probes
• Schema-oblivious indexing & querying
• Non-schematic XML data (no DTD required)
• Supports full NEXI syntax & all 13 XPath axes (+ level)
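With pre-/postorder labels, structural predicates such as the descendant axis reduce to two integer comparisons; a minimal sketch (not TopX's actual join code), using the doc-2 entries from the example lists:

```python
def is_descendant(anc, desc):
    """Descendant-axis test on (pre, post) labels: desc lies in anc's
    subtree iff its preorder rank is larger and its postorder rank is
    smaller than anc's."""
    return desc["pre"] > anc["pre"] and desc["post"] < anc["post"]

# From the sec["xml"] and par["retrieval"] lists for doc 2:
sec_elem = {"eid": 46, "pre": 2, "post": 15}   # <sec> element in doc 2
par_elem = {"eid": 28, "pre": 8, "post": 14}   # <par> element in doc 2
# The <par> element is a descendant of the <sec> element, satisfying
# the //sec...//par structure of the example query.
```

The other XPath axes (child, following, preceding, etc.) can be expressed with similar comparisons over (pre, post) plus the level information mentioned above.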
TopX Query Processing Example

//sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]

[Animation: the index lists sec["xml"], title["native"], and par["retrieval"] from the tables above are consumed by sorted accesses (SA); random accesses (RA) resolve the structural conditions. Each candidate keeps a [worst, best] score interval, e.g. element 46 in doc 2 starts at [0.9, 2.9] and element 3 in doc 1 at [1.0, 2.83]; a "pseudo-doc" summarizes all entirely unseen documents, its best-score shrinking from 2.9 over 2.8, 2.75, 2.65, 2.45, 1.7, 1.4 down to 1.35. As scanning proceeds, worst-scores grow, best-scores shrink, and the min-2 threshold rises: 0.0 → 0.5 → 0.9 → 1.0 → 1.6; candidates whose best-score falls below min-2 are dropped from the candidate queue. Final top-2 results: doc 2 (elements 46, 28, 51; worst = 2.2) and doc 5 (elements 171, 182; worst = 1.6).]
Index Access Scheduling [VLDB '06]

SA scheduling:
• Look-ahead Δi through precomputed score histograms
• Knapsack-based optimization of the expected score reduction

RA scheduling:
• 2-phase probing: schedule RAs "late & last", i.e., clean up the candidate queue toward the end of the run

Extended probabilistic cost model for integrating SA & RA scheduling.

[Figure: score look-aheads over three inverted block-index lists, e.g. Δ1,3 = 0.8 (scores 1.0, 0.9, 0.9, 0.2) vs. Δ3,3 = 0.2 (scores 1.0, 0.9, 0.8, 0.8): batches of sorted accesses are directed to the lists where the expected score reduction is largest.]
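The knapsack flavor of the SA batch selection can be sketched as follows. This is a toy model under assumed inputs: lookahead_gain[i][s] stands for the expected reduction of highi when scanning s more steps in list i (which TopX estimates from precomputed score histograms), each step costs one sorted access, and the actual TopX benefit and cost formulas are more elaborate.

```python
def schedule_sa_batches(lookahead_gain, budget):
    """Toy knapsack-style SA scheduling: pick per-list batch sizes that
    maximize the total expected score reduction within a budget of
    sorted-access steps. lookahead_gain[i][s] = expected reduction of
    high_i after s more steps in list i (s = 0 means no steps, gain 0)."""
    # Dynamic program over lists: dp[c] = best total gain using cost c.
    dp = [0.0] * (budget + 1)
    choice = [[0] * (budget + 1)]        # choice[i][c]: steps for list i
    for gains in lookahead_gain:
        new_dp, new_choice = list(dp), [0] * (budget + 1)
        for c in range(budget + 1):
            for s, g in enumerate(gains):            # s steps cost s
                if s <= c and dp[c - s] + g > new_dp[c]:
                    new_dp[c], new_choice[c] = dp[c - s] + g, s
        dp = new_dp
        choice.append(new_choice)
    # Backtrack the chosen batch size per list.
    batches, c = [], budget
    for i in range(len(lookahead_gain), 0, -1):
        s = choice[i][c]
        batches.append(s)
        c -= s
    return list(reversed(batches)), dp[budget]

# Mirroring the figure: list 1's score drops sharply 3 steps ahead
# (Δ = 0.8), so the whole budget of 3 steps goes to list 1.
batches, gain = schedule_sa_batches(
    [[0.0, 0.1, 0.1, 0.8],   # list 1: big drop at look-ahead 3
     [0.0, 0.1, 0.3, 0.4],   # list 2
     [0.0, 0.1, 0.2, 0.2]],  # list 3
    budget=3)
```

The same dynamic program extends to a joint SA/RA model once RA probes are given their own cost cR/cS and benefit terms, which is what the extended probabilistic cost model does.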
Probabilistic Candidate Pruning [VLDB '04]

Indexing time:
• Precompute score distributions (histograms) per index list, e.g. for title["native"] (max-scores 0.9, 0.8, 0.5, ...) and par["retrieval"] (max-scores 1.0, 0.8, 0.75, ...), optionally refined by sampling

Query processing time:
• Convolutions of the score distributions (assuming independence) estimate the probability that a candidate's unseen score mass exceeds the gap δ(d) = min-k - worstscore(d)
• Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε
• With probabilistic guarantees for precision & recall
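A sketch of the convolution step with toy histograms (the bucket granularity and the exact estimator are assumptions here; TopX derives the histograms from the index lists at indexing time):

```python
from itertools import product

def convolve(h1, h2):
    """Convolution of two discrete score histograms, each given as
    {score_bucket: probability}; assumes the score sources are
    independent, as the pruning model does."""
    out = {}
    for (s1, p1), (s2, p2) in product(h1.items(), h2.items()):
        out[s1 + s2] = out.get(s1 + s2, 0.0) + p1 * p2
    return out

def prob_enters_topk(worstscore, mink, histograms):
    """Estimate P[worstscore(d) + unseen score mass > min-k] by
    convolving the score histograms of the lists d has not been
    seen in yet; d is pruned if this probability falls below ε."""
    dist = {0.0: 1.0}
    for h in histograms:
        dist = convolve(dist, h)
    delta = mink - worstscore          # score mass d still needs
    return sum(p for s, p in dist.items() if s > delta)

# Toy histograms for two unseen lists (scores quantized to one decimal):
h_title = {0.0: 0.5, 0.5: 0.3, 0.9: 0.2}
h_par = {0.0: 0.6, 0.8: 0.4}
p = prob_enters_topk(worstscore=0.7, mink=1.8, histograms=[h_title, h_par])
# With ε = 0.3 this candidate (p = 0.2) would be dropped.
```

Setting ε = 0 recovers the conservative NRA behavior; larger ε trades a bounded loss in precision/recall for earlier pruning, which is exactly the knob evaluated in the experiments.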
Dynamic Query Expansion [SIGIR '05]

• Incremental merging of inverted lists for expansion terms ti,1 ... ti,m, in descending order of s(tij, d)
• Best-match score aggregation
• Specialized expansion operators:
  - Incremental Merge operator
  - Nested Top-k operator (efficient phrase matching)
  - Boolean (but ranked) retrieval mode
  - Supports any sorted inverted index for text, structured records & XML

Example (TREC Robust topic #363): Top-k(transport, tunnel, ~disaster), where ~disaster expands to {disaster, accident, fire} via thesaurus lookups / relevance feedback; the expansion lists (e.g. transport: d66, d93, d95, ..., d101; tunnel: d95, d17, d11, ..., d99) are scanned by SA, and the Incremental Merge operator emits one merged list for ~disaster: d42, d11, d92, d37, ...

Incremental Merge operator:
~t = {t1, t2, t3} with similarities from large-corpus term correlations:
sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
t1: d78:0.9, d23:0.8, d10:0.8, d1:0.4, d88:0.3, ...
t2: d64:0.8, d23:0.8, d10:0.7, d12:0.2, d78:0.1, ...
t3: d11:0.9, d78:0.9, d64:0.7, d99:0.7, d34:0.6, ...
Merged output, sorted by descending sim(t, tij) · s(tij, d) (initial high-scores 0.9, 0.72, 0.45):
d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, ...

Meta histograms seamlessly integrate Incremental Merge operators into probabilistic scheduling and candidate pruning.
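A sketch of the Incremental Merge operator under assumed interfaces: each expansion list is read by sorted access only as far as needed, and a priority queue emits entries in descending order of sim(t, tij) · s(tij, d). The lists and similarities are the ones from the slide.

```python
import heapq

def incremental_merge(expansion_lists):
    """Lazily merge expansion lists into one stream sorted by
    descending weighted score sim(t, ti) * s(ti, d). expansion_lists
    is a list of (sim, entries) pairs; each entries list is sorted by
    descending score. Because the lists are sorted, only the current
    head of each list needs to sit in the queue, so no list is read
    deeper than the merged output requires."""
    heap = []
    for i, (sim, entries) in enumerate(expansion_lists):
        if entries:
            doc, score = entries[0]
            heapq.heappush(heap, (-sim * score, i, 0, doc))
    while heap:
        neg, i, pos, doc = heapq.heappop(heap)
        yield doc, -neg
        sim, entries = expansion_lists[i]
        if pos + 1 < len(entries):            # advance list i by one SA
            nxt_doc, nxt_score = entries[pos + 1]
            heapq.heappush(heap, (-sim * nxt_score, i, pos + 1, nxt_doc))

# ~t = {t1, t2, t3} with sim(t, ti) = 1.0, 0.9, 0.5 (list prefixes only):
lists = [
    (1.0, [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4)]),
    (0.9, [("d64", 0.8), ("d23", 0.8), ("d10", 0.7)]),
    (0.5, [("d11", 0.9), ("d78", 0.9), ("d64", 0.7)]),
]
merged = list(incremental_merge(lists))
# Emits d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, ...
```

Since the merged stream is itself a descending-score list, it plugs into the top-k operators exactly like a plain inverted list, which is what makes the expansion "dynamic": no materialized union of all expansion lists is needed.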
TREC Terabyte Benchmark '05/'06
• Extensive crawl over the .gov domain (2004): 25 million documents, 426 GB of text data
• 50 ad-hoc-style keyword queries, e.g. "reintroduction of gray wolves", "Massachusetts textile mills"
• Primary cost metrics: Cost = #SA + (cR/cS) · #RA, and wall-clock runtime

[Figures: cost comparison of scheduling strategies [VLDB '06]; wall-clock runtimes [VLDB '06 / TREC '06]]
INEX Benchmark '06/'07
• New XMLified Wikipedia corpus: 660,000 documents with 130,000,000 elements, 6.6 GB of XML data
• 125 NEXI queries, each as a content-only (CO) and a content-and-structure (CAS) formulation
  CO: +"state machine" figure Mealy Moore
  CAS: //article[about(., "state machine")]//figure[about(., Mealy) or about(., Moore)]
• Primary cost metric: Cost = #SA + (cR/cS) · #RA
TopX vs. Full-Merge

[Figure: cost (in millions of index accesses) over k = 10, 20, 50, 100, 500, 1,000 for CAS and CO full-merge vs. CAS and CO TopX with ε = 0.0 and ε = 0.1]

Significant cost savings for large ranges of k; CAS is cheaper than CO!
Static vs. Dynamic Expansions
• Query expansions with up to m = 292 keywords & phrases
• Balanced amount of sorted vs. random disk accesses
• Adaptive scheduling wrt. the cR/cS cost ratio
• Dynamic expansions outperform static expansions & full-merge in both efficiency & effectiveness

[Figure: #SA and #RA (in millions) for CAS full-merge vs. CAS TopX with static vs. dynamic expansion]
Efficiency vs. Effectiveness

[Figure: relative precision and relative cost (0.0 to 1.0) as functions of the pruning parameter ε, for CO and CAS runs]

• Very good precision/runtime ratio for probabilistic pruning
• Official INEX '06 results: retrieval effectiveness at ranks 3 to 5 out of ~60 submitted runs
Conclusions & Outlook
• Scalable XML-IR and vague search
• Mature system; reference engine for INEX topic development & interactive tracks
• Efficient and versatile Java prototype for text, XML, and structured data (Oracle backend)
• Very efficient prototype reimplementation for text data in C++ (over its own file structures); the C++ version for XML is currently in production at MPI
• More features: graph top-k, proximity search, an XQuery subset, ...