TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data

Martin Theobald (Max Planck Institute for Computer Science / Stanford University)
Joint work with Ralf Schenkel and Gerhard Weikum

Transcript of the slides:

Page 1:

Martin Theobald
Max Planck Institute for Computer Science / Stanford University
Joint work with Ralf Schenkel, Gerhard Weikum

TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data

Page 2:

“Native XML data base systems can store schemaless data ... ”

“Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files … ”

“XML-QL: A Query Language for XML.”

“Native XML Data Bases.”

“Proc. Query Languages Workshop, W3C, 1998.”

“XML queries with an expressive power similar to that of Datalog …”

[Figure: sample XML document tree with article, sec, par, bib, title, item, and url elements; one title reads “Current Approaches to XML Data Management”]

//article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]

“What does XML add for retrieval? It adds formal ways …” (“w3c.org/xml”)

[Figure: second sample document with titles “The XML Files”, “The Ontology Game”, “The Dirty Little Secret”; one par reads “Sophisticated technologies developed by smart people.”; another: “There, I've said it - the ‘O’ word. If anyone is thinking along ontology lines, I would like to break some old news …”]

RANKING, VAGUENESS, PRUNING

Page 3:

Goal: efficiently retrieve the best (top-k) results of a similarity query

• Extend existing threshold algorithms for inverted lists [Güntzer, Balke & Kießling, VLDB ’00; Fagin, PODS ’01] to XML data and XPath-like full-text search
• Non-schematic, heterogeneous data sources
• Efficiently support IR-style vague search
• Combined inverted index for content & structure
• Avoid full index scans; postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant index structures

Page 4:

XML-IR: History and Related Work

[Timeline figure, 1995-2005]

IR on structured docs (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)

Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto)

XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery 1.0 (W3C), XPath 2.0 & XQuery 1.0 Full-Text (W3C), TeXQuery (AT&T Labs), NEXI (INEX Benchmark)

IR on XML: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)

Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...

Page 5:

[Architecture figure: the TopX Query Processor runs Scan Threads with Sorted Access (SA) over a DBMS / inverted lists on a unified text & XML schema, plus Random Accesses (RA) for auxiliary predicates; a Candidate Queue and Candidate Cache feed the Top-k Queue; numbered components: (1) Top-k XPath Processing, (2) Probabilistic Index Access Scheduling, (3) Probabilistic Candidate Pruning, (4) Dynamic Query Expansion against an ontology / large thesaurus (WordNet, OpenCyc, etc.); at indexing time, an Indexer/Crawler builds the index and the Index Metadata (selectivities, histograms, correlations); frontends: Web Interface, Web Service, API]

Page 6:

Outline:
1. Top-k XPath Processing
2. Probabilistic Index Access Scheduling
3. Probabilistic Candidate Pruning
4. Dynamic Query Expansion
5. Experiments: TREC & INEX Benchmarks

Page 7:

Data Model

• XML trees (no XLinks or ID/IDref attributes)
• Pre-/postorder node labels
• Redundant full-content text nodes (w/ stemming, no stopwords)

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

[Figure: the document tree with (pre, post) labels article(1,6), title(2,1), abs(3,2), sec(4,5), title(5,3), par(6,4); each node stores a redundant full-content string, e.g. the article's is “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data” and the sec's is “native xml data base native xml data base system store schemaless data”]

ftf(“xml”, article1) = 4
ftf(“xml”, sec4) = 2

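The labeling and the full-content term frequency ftf from the example above can be sketched as follows; this is a minimal illustration that skips the stemming and stopword removal used by the real indexer:

```python
import xml.etree.ElementTree as ET

DOC = ("<article><title>XML Data Management</title>"
       "<abs>XML management systems vary widely in their expressive power.</abs>"
       "<sec><title>Native XML Data Bases.</title>"
       "<par>Native XML data base systems can store schemaless data.</par>"
       "</sec></article>")

def label_tree(root):
    """DFS assigns pre-/postorder labels; each node also gets its
    redundant full-content string (concatenated subtree text)."""
    nodes, pre_ctr, post_ctr = [], [0], [0]

    def visit(node):
        pre_ctr[0] += 1
        pre = pre_ctr[0]
        parts = [node.text or ""]
        for child in node:
            parts.append(visit(child))
            parts.append(child.tail or "")
        post_ctr[0] += 1
        full = " ".join(" ".join(parts).lower().split())
        nodes.append((node.tag, pre, post_ctr[0], full))
        return " ".join(parts)

    visit(root)
    return nodes

def ftf(term, full_content):
    # full-content term frequency: occurrences of term in the subtree text
    return full_content.split().count(term)
```

Running this on the sample document reproduces the labels article(1,6), sec(4,5), par(6,4) and the frequencies ftf("xml", article1) = 4 and ftf("xml", sec4) = 2 from the slide.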

Page 8:

Scoring Model [INEX ’06/’07]

XML-specific extension to Okapi BM25 (originating from probabilistic IR on unstructured text):
• ftf instead of tf
• ef instead of df
• Element-type-specific length normalization
• Tunable parameters k1 and b

Example: bib[“transactions”] vs. par[“transactions”] are scored against different element-type statistics.
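A sketch of such an element-level BM25 score; the exact TopX formula may differ in details, but the idea is that ftf replaces tf, element frequency ef replaces document frequency df, and the collection statistics (`num_elems`, `avg_length`) are taken per element type, so bib[“transactions”] and par[“transactions”] are normalized against different populations of elements:

```python
import math

def element_bm25(ftf, ef, num_elems, length, avg_length, k1=1.2, b=0.75):
    """Okapi BM25 adapted to XML elements (sketch):
    ftf  - full-content term frequency of the term in this element
    ef   - number of elements of this type containing the term
    num_elems, avg_length - statistics over elements of the same tag
    length - full-content length of this element."""
    idf = math.log((num_elems - ef + 0.5) / (ef + 0.5))
    K = k1 * ((1 - b) + b * length / avg_length)
    return (k1 + 1) * ftf / (K + ftf) * idf
```

Higher ftf raises the score with saturation, and longer-than-average elements are penalized, exactly as in text BM25, only at element granularity.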

Page 9:

Fagin’s NRA [PODS ’01] at a Glance

Naive “merge-then-sort” approach: between O(m·n) and O(m·n²) runtime and O(m·n) access cost

Inverted index over corpus d1, …, dn; query q = (t1, t2, t3); e.g. s(t1, d10) = 0.8, s(t2, d10) = 0.6, s(t3, d10) = 0.7

[Figure: one score-sorted inverted list per query term, e.g.
t1: d78:0.9, d23:0.8, d10:0.8, d1:0.7, d88:0.2, d13:0.2, …
t2: d64:0.8, d23:0.6, d10:0.6, …
t3: d10:0.7, d78:0.5, d64:0.4, d99:0.2, d34:0.1, …]

Scanning all lists in parallel, each seen document carries a [worst-score, best-score] interval (here k = 1):

Scan depth 1:  d78 [0.9, 2.4], d64 [0.8, 2.4], d10 [0.7, 2.4]
Scan depth 2:  d78 [1.4, 2.0], d23 [1.4, 1.9], d64 [0.8, 2.1], d10 [0.7, 2.1]
Scan depth 3:  d10 [2.1, 2.1], d78 [1.4, 2.0], d23 [1.4, 1.8], d64 [1.2, 2.0]

Goal: find the top-k documents that maximize s(t1, dj) + s(t2, dj) + ... + s(tm, dj), using non-conjunctive (“andish”) evaluation.

NRA(q, L):
  scan all lists Li (i = 1..m) in parallel & consider doc d at position i:
    E(d) := E(d) ∪ {i}
    high_i := s(ti, d)
    worstscore(d) := ∑_{i ∈ E(d)} s(ti, d)
    bestscore(d) := worstscore(d) + ∑_{i ∉ E(d)} high_i
    if worstscore(d) > min-k then
      add d to top-k
      min-k := min { worstscore(d') | d' ∈ top-k }
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d}
    if max { bestscore(d') | d' ∈ candidates } ≤ min-k then
      return top-k   // STOP!
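The pseudocode can be turned into a compact, runnable sketch (in-memory toy lists, round-robin scanning; ties and exhausted lists are handled only minimally):

```python
from collections import defaultdict

def nra(lists, k):
    """Fagin's NRA: scan m score-sorted lists round-robin, maintain
    [worstscore, bestscore] bounds for every seen doc, and stop once no
    candidate outside the current top-k can still overtake it."""
    m = len(lists)
    seen = defaultdict(dict)                 # doc -> {list index: score}
    high = [lst[0][1] for lst in lists]      # last score read per list
    worst, topk = {}, []
    for depth in range(max(len(lst) for lst in lists)):
        for i in range(m):
            if depth < len(lists[i]):
                doc, score = lists[i][depth]
                seen[doc][i] = score
                high[i] = score
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: w + sum(high[i] for i in range(m) if i not in seen[d])
                for d, w in worst.items()}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(topk) == k:
            min_k = worst[topk[-1]]
            if all(best[d] <= min_k for d in worst if d not in topk):
                break                        # threshold test: safe to stop
    return [(d, worst[d]) for d in topk]
```

On the slide's toy lists with k = 1, the algorithm terminates with d10 and score 2.1, matching the scan-depth-3 snapshot.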

Page 10:

Inverted Block-Index for Content & Structure

• Mostly sorted (= sequential) access to large element blocks on disk
• Group elements in descending order of (maxscore, docid)
• Block-scan all elements per doc for a given (tag, term) key
• Stored as inverted files or database tables
• Two B+-tree indexes over the full range of attributes (IOTs in Oracle)

sec[“xml”] (SA):
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

title[“native”] (SA + RA):
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    14   10    0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

par[“retrieval”] (SA + RA):
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

//sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]

Page 11:

Navigational Element Index

• Additional index for tag paths
• RAs on a B+-tree index using (docid, tag) as key
• Few & judiciously scheduled “expensive predicate” probes
• Schema-oblivious indexing & querying
• Non-schematic XML data (no DTD required)
• Supports full NEXI syntax & all 13 XPath axes (+ level)

Navigational index for tag “sec” (RA):
eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12
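The pre/post labels fetched by such RAs decide XPath axes with two comparisons per element pair; a minimal sketch, using label values taken from the slide's index lists for doc 2:

```python
def descendant(a, b):
    """b lies in a's subtree iff a.pre < b.pre and a.post > b.post."""
    return a["pre"] < b["pre"] and a["post"] > b["post"]

def following(a, b):
    """b lies in the 'following' quadrant of a in the pre/post plane."""
    return a["pre"] < b["pre"] and a["post"] < b["post"]

# Labels from the slide's lists (doc 2): sec eid 46, par eid 28, title eid 51
sec46 = {"pre": 2, "post": 15}
par28 = {"pre": 8, "post": 14}
title51 = {"pre": 4, "post": 12}
```

In the pre/post plane, the four quadrants around a node correspond exactly to the ancestor, descendant, preceding, and following axes, which is what makes these structural joins cheap.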

[Figure: the title[“native”] and par[“retrieval”] index lists from the previous slide, accessed via SAs plus scheduled RAs]

//sec[about(.//title, “native”)]//par[about(.//, “retrieval”)]

Page 12:

TopX Query Processing Example

//sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]

[Figure: the index lists for sec[“xml”], title[“native”], and par[“retrieval”] are scanned in parallel; per document, connected element candidates carry [worstscore, bestscore] intervals that tighten as more blocks are read (e.g. doc2: worst=0.9 / best=2.9 after the first block, finally worst=2.2 / best=2.2 for elements 46, 28, 51); a virtual pseudo-doc bounds the best score any still-unseen document can achieve (best=2.9 shrinking to 1.35); the min-2 threshold rises 0.0 → 0.5 → 0.9 → 1.0 → 1.6; final top-2 results: doc2 (elements 46, 28, 51, worst=2.2) and doc5 (elements 171, 182, worst=1.6)]

Page 13:

Outline recap: next is 2. Probabilistic Index Access Scheduling

Page 14:

Index Access Scheduling [VLDB ’06]

SA scheduling:
• Look-ahead Δi through precomputed score histograms
• Knapsack-based optimization of the expected score reduction

RA scheduling:
• 2-phase probing: schedule RAs “late & last”, i.e., clean up the candidate queue toward the end of the index scans
• Extended probabilistic cost model for integrating SA & RA scheduling

[Figure: inverted block index with per-list score columns; histogram look-aheads such as Δ1,3 = 0.8 and Δ3,3 = 0.2 estimate how far each high_i bound drops after scanning the next blocks]
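The knapsack-based SA step can be sketched as a small dynamic program; the per-batch expected score reductions (the hypothetical `reductions` table below) would come from the precomputed histograms:

```python
def schedule_sa(reductions, budget):
    """Knapsack-style SA batch scheduling (sketch). reductions[name][b-1]
    is the expected drop of that list's high_i bound after scanning b
    more blocks (read off the precomputed score histograms). Choose batch
    sizes with total cost <= budget that maximize the total expected
    drop, i.e. the expected tightening of candidates' bestscores."""
    dp = {0: (0.0, {})}            # spent budget -> (gain, allocation)
    for name, red in reductions.items():
        ndp = {}
        for spent, (gain, alloc) in dp.items():
            for b in range(len(red) + 1):
                cost = spent + b
                if cost > budget:
                    break
                g = gain + (red[b - 1] if b else 0.0)
                if cost not in ndp or g > ndp[cost][0]:
                    ndp[cost] = (g, {**alloc, name: b})
        dp = ndp
    return max(dp.values(), key=lambda e: e[0])[1]
```

With a budget of two block reads and look-aheads like those in the figure, the scheduler spreads the reads over the lists with the steepest expected score drops rather than scanning one list deeply.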

Page 15:

Outline recap: next is 3. Probabilistic Candidate Pruning

Page 16:

Probabilistic Candidate Pruning [VLDB ’04]

Indexing time:
• Per-list score histograms f1, f2, … (bounded by high1, high2, …), built by sampling, e.g. for title[“native”] and par[“retrieval”]

Query processing time:
• Convolutions of the score distributions (assuming independence) yield the distribution of a candidate's missing score mass δ(d)
• Probabilistic candidate pruning: drop d from the candidate queue if
  P[d gets into the final top-k] = P[worstscore(d) + δ(d) > min-k] < ε
• With probabilistic guarantees for precision & recall
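The convolution step can be sketched with bucketized histograms; the bucket semantics here (bucket i stands for score i · width, a coarse lower-edge approximation) are an assumption of this sketch:

```python
def convolve(h1, h2):
    """Distribution of the sum of two independent bucketized scores."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out

def prob_in_topk(worstscore, missing_hists, width, min_k):
    """Estimate P[worstscore(d) + delta(d) > min-k], where delta(d) is
    the sum of d's missing per-list scores, each given as a histogram
    with bucket width `width`. Drop d if this falls below epsilon."""
    delta = [1.0]
    for h in missing_hists:
        delta = convolve(delta, h)
    return sum(p for i, p in enumerate(delta)
               if worstscore + i * width > min_k)
```

For example, a candidate with one missing score that is 0 or 1 with equal probability has a 50% chance of beating a min-k of 0.5; with two such missing scores, the chance of exceeding min-k = 1.5 is only 25%.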

Page 17:

Outline recap: next is 4. Dynamic Query Expansion

Page 18:

Dynamic Query Expansion [SIGIR ’05]

• Incremental merging of the inverted lists for the expansion terms ti,1 … ti,m, in descending order of s(ti,j, d)
• Best-match score aggregation
• Specialized expansion operators: Incremental Merge operator; Nested Top-k operator (efficient phrase matching)
• Boolean (but ranked) retrieval mode
• Supports any sorted inverted index for text, structured records & XML
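A compact sketch of the Incremental Merge operator using a max-heap over one cursor per expansion list; the toy lists and similarities below follow the talk's example (sim = 1.0, 0.9, 0.5):

```python
import heapq

def incremental_merge(lists, sims):
    """Lazily merge score-sorted inverted lists for the expansion terms
    into one stream ordered by descending sim(t, ti) * s(ti, d), pulling
    from each list only on demand."""
    heap = []                      # (-combined score, list index, pos, doc)
    for i, lst in enumerate(lists):
        if lst:
            doc, s = lst[0]
            heapq.heappush(heap, (-sims[i] * s, i, 0, doc))
    while heap:
        neg, i, pos, doc = heapq.heappop(heap)
        yield doc, -neg
        if pos + 1 < len(lists[i]):
            doc2, s2 = lists[i][pos + 1]
            heapq.heappush(heap, (-sims[i] * s2, i, pos + 1, doc2))
```

Because the merged stream is itself sorted by descending combined score, it plugs into the same SA machinery (and histograms) as an ordinary inverted list.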

[Figure: TREC Robust topic #363 “transport, tunnel, ~disaster”: the ~disaster expansion incrementally merges the SA streams for disaster (d42, d11, d92, …), accident (d21, d78, d10, d11, …), fire (d1, d37, d42, d32, …), … into one stream d42, d11, d92, d37, …, which feeds a Top-k(transport, tunnel, ~disaster) operator together with the transport (d66, d93, d95, …, d101) and tunnel (d95, d17, d11, …, d99) lists]

Page 19:

Incremental Merge Operator

~t = { t1, t2, t3 } with sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
(expansion terms & similarities from thesaurus lookups / relevance feedback over large-corpus term correlations; initial high-scores from index-list metadata, e.g. histograms)

[Figure: the three SA streams are merged by descending sim(t, ti) · s(ti, d); e.g. t1: d78:0.9, d23:0.8, d10:0.8, d1:0.4, d88:0.3, …; t2: d64:0.8, d23:0.8, d10:0.7, …; t3: d11:0.9, d78:0.9, d64:0.7, d99:0.7, d34:0.6, …; merged output: d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, …]

Meta histograms seamlessly integrate Incremental Merge operators into probabilistic scheduling and candidate pruning.

Page 20:

Outline recap: next is 5. Experiments: TREC & INEX Benchmarks

Page 21:

TREC Terabyte Benchmark ’05/’06

• Extensive crawl of the .gov domain (2004): 25 million documents, 426 GB of text data
• 50 ad-hoc-style keyword queries, e.g. “reintroduction of gray wolves”, “Massachusetts textile mills”
• Primary cost metrics: Cost = #SA + (cR/cS) · #RA, and wall-clock runtime

Page 22:

TREC Terabyte: cost comparison of scheduling strategies [VLDB ’06]

Page 23:

TREC Terabyte: wall-clock runtimes [VLDB ’06 / TREC ’06]

Page 24:

INEX Benchmark ’06/’07

• New XMLified Wikipedia corpus: 660,000 documents with 130,000,000 elements, 6.6 GB of XML data
• 125 NEXI queries, each in a content-only (CO) and a content-and-structure (CAS) formulation
  CO: +“state machine” figure Mealy Moore
  CAS: //article[about(., “state machine”)]//figure[about(., Mealy) or about(., Moore)]
• Primary cost metric: Cost = #SA + (cR/cS) · #RA

Page 25:

TopX vs. Full-Merge

[Chart: cost in millions of accesses over k ∈ {10, 20, 50, 100, 500, 1,000} for six runs: CAS/CO × Full-Merge, TopX ε=0.0, TopX ε=0.1]

• Significant cost savings for large ranges of k
• CAS cheaper than CO!

Page 26:

Static vs. Dynamic Expansions

• Query expansions with up to m = 292 keywords & phrases
• Balanced amount of sorted vs. random disk accesses
• Adaptive scheduling w.r.t. the cR/cS cost ratio
• Dynamic expansions outperform static expansions & full-merge in both efficiency & effectiveness

[Chart: #SA and #RA in millions for CAS Full-Merge vs. CAS TopX Static vs. CAS TopX Dynamic]

Page 27:

Efficiency vs. Effectiveness

[Chart: relative precision and relative cost (y-axis 0.0-1.0) as functions of the pruning threshold ε (x-axis 0.0-1.0), for CAS and CO queries]

• Very good precision/runtime ratio for probabilistic pruning

Page 28:

Official INEX ’06 Results: retrieval effectiveness (ranks 3-5 out of ~60 submitted runs)

Page 29:

Conclusions & Outlook

• Scalable XML-IR and vague search
• Mature system; reference engine for INEX topic development & interactive tracks
• Efficient and versatile Java prototype for text, XML, and structured data (Oracle backend)
• Very efficient reimplementation for text data in C++ (over its own file structures); the C++ version for XML is currently in production at MPI
• More features: graph top-k, proximity search, an XQuery subset, …