Post on 18-Jan-2017
Introduction | Lucene | MTAS | Tokenizer FoLiA | Search using CQL | Results
Multi Tier Annotation Search (MTAS)
Matthijs Brouwer
Meertens Institute
December 8, 2015
Matthijs Brouwer Multi Tier Annotation Search
1 Introduction
2 Lucene
3 MTAS
4 Tokenizer FoLiA
5 Search using CQL
6 Results
Provide Search on Combination of Text and Metadata
Example data
Author          Eduard Douwes Dekker
Place of birth  Amsterdam
Date of birth   1820, March 2
Pseudonym       Multatuli
Title           Max Havelaar
Published       1860
Text            Ik ben makelaar in koffie, en woon op de Lauriergracht no 37 . . .
Solution based on Apache Solr
Reverse Index
Apache Solr (based on Apache Lucene)
Index on both Text and Metadata
Advantages
Search
Facets
Scalable
Custom plugin (join)
Actively developed
Search Text
'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
We can search for
"Makelaar"
"Makelaar in koffie"
"Makel.* in koffie"
Annotations
'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
text      lemma     pos/features
Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
ben       zijn      WW(pv,tgw,ev)
makelaar  makelaar  N(soort,ev,basis,zijd,stan)
in        in        VZ(init)
koffie    koffie    N(soort,ev,basis,zijd,stan)
,         ,         LET()
. . .     . . .     . . .
FoLiA
<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" confidence="0.999791" head="VNW">
          <feat class="pers" subset="vwtype"/>
          <feat class="pron" subset="pdtype"/>
          <feat class="nomin" subset="naamval"/>
          <feat class="vol" subset="status"/>
          <feat class="1" subset="persoon"/>
          <feat class="ev" subset="getal"/>
        </pos>
        <morphology>
          <morpheme>
            <t offset="0">ik</t>
          </morpheme>
        </morphology>
        <lemma class="ik"/>
      </w>
      . . .
Required functionality
Extend current Solr solution
Search on annotations like pos, lemma, features, . . .
Search on sentences, paragraphs, chapters, . . .
Search on entities and chunks
Search on dependencies
Statistics, grouping, facets, . . .
Important
Maintaining functionality and scalability
Upgradeable to new releases of Solr/Lucene
Tokenization
Something about Lucene internals
Focus on text
Tokenization
Text is split up into tokens
  value, e.g. "koffie"
  position, e.g. 4
  offset, e.g. 19-24
  payload, e.g. 1.000
'Ik ben makelaar in koffie, en woon op de Lauriergracht no 37.'
Reverse Index
Tokenstream used to construct Reverse Index
text      document  position  offset  payload
ben       0         1         3-5     0.500
de        0         9         38-39   0.200
en        0         6         27-28   0.250
in        0         3         16-17   0.350
koffie    0         4         19-24   0.900
makelaar  0         2         7-14    0.800
. . .     . . .     . . .     . . .   . . .
This enables fast search, since the locations of matching terms can be found very quickly.
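A minimal sketch of how such an index could be built for the example sentence. The regex tokenizer, the lowercasing, and the name `build_index` are assumptions made for illustration and are not Lucene's own analyzer; payloads are left out.

```python
import re
from collections import defaultdict

SENTENCE = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."


def tokenize(text):
    """Yield (value, position, start offset, end offset) for each token.

    Words and punctuation marks each take one position; the end offset
    is inclusive, as in the table above.
    """
    for position, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        yield match.group().lower(), position, match.start(), match.end() - 1


def build_index(documents):
    """Map each term to its postings: (document, position, start, end)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for value, position, start, end in tokenize(text):
            index[value].append((doc_id, position, start, end))
    return index


index = build_index([SENTENCE])
print(index["koffie"])  # [(0, 4, 19, 24)]
```

Looking up a term is a single dictionary access, which is the property the reverse index relies on.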
Limitations
Limitations of this approach
Heavily based on grouping by document
  Collecting statistics
  Grouping results
Not possible to include
  Structural information: sentences, paragraphs, . . .
  Annotations: pos, lemmas, . . .
  Relations: dependencies, chunking, . . .
No real forward index
  Finding all tokens for a given position
Alternatives
Alternative solutions
Graph Database
  Experiments with Neo4j: problems with scalability and performance
  Too general, doesn't use the sequential nature of textual data
BlackLab
  Based on Lucene, no integration with Solr
  Different fields for each annotation layer
Extension provided by MTAS
Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations
Use the payload to encode additional information on each token
Construct forward indexes by extending the Lucene Codec
Implementation
Extension based on the Lucene Library
Provide query handlers for extended data structures
Provide Solr Plugin using the MTAS extension
Prefixes
Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations
text        document  position
lemma:de    0         9
lemma:zijn  0         1
. . .       . . .     . . .
pos:LID     0         9
pos:WW      0         1
. . .       . . .     . . .
t:ben       0         1
t:de        0         9
. . .       . . .     . . .
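The prefix scheme can be sketched in a few lines of Python. This is only an illustration: the `ANNOTATED` data holds the two positions shown in the table, and the helper names are hypothetical, not part of MTAS.

```python
from collections import defaultdict

# (text, lemma, pos) per position; positions 1 and 9 are taken from the
# table above, everything else about this structure is an assumption.
ANNOTATED = {1: ("ben", "zijn", "WW"), 9: ("de", "de", "LID")}


def prefixed_tokens(annotated):
    """Emit one prefixed token per annotation layer, all on the same position."""
    for position, (text, lemma, pos) in sorted(annotated.items()):
        yield f"t:{text}", position
        yield f"lemma:{lemma}", position
        yield f"pos:{pos}", position


def build_index(doc_id, annotated):
    """Collect postings (document, position) per prefixed term, sorted by term."""
    index = defaultdict(list)
    for term, position in prefixed_tokens(annotated):
        index[term].append((doc_id, position))
    return dict(sorted(index.items()))


index = build_index(0, ANNOTATED)
for term, postings in index.items():
    print(term, postings)
```

Because the layer is part of the term itself, a single index field can hold all annotation layers, and a query on `pos:LID` never collides with a query on `t:de`.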
Payload
Use the payload to encode additional information on each token
mtas id      integer identifying token within a document
position     type of position: single, range or set
             additional information for range or set
offset       start and end offset
real offset  start and end real offset
parent       reference to another token by its mtas id
payload      original payload
Forward Indexes
Construct forward indexes by extending the Lucene Codec
Position         Given the position within the document, return references to all objects on that position.
Parent Id        Given the mtas id, return references to all objects referring to this mtas id as parent.
Object Id        Given the id, return a reference to the object.
Prefix/Position  Given prefix and position, return the value.
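The four lookups can be illustrated with an in-memory sketch. The `Token` and `ForwardIndex` classes, their field names, and the sample parent links are assumptions for illustration; MTAS implements these indexes inside a Lucene codec, and only single and range positions are modelled here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    mtas_id: int
    prefix: str
    value: str
    start: int  # first position covered
    end: int    # last position covered (equal to start for a single position)
    parent: Optional[int] = None


class ForwardIndex:
    def __init__(self, tokens):
        self.by_id = {}
        self.by_position = defaultdict(list)
        self.by_parent = defaultdict(list)
        for tok in tokens:
            self.by_id[tok.mtas_id] = tok
            for position in range(tok.start, tok.end + 1):
                self.by_position[position].append(tok)
            if tok.parent is not None:
                self.by_parent[tok.parent].append(tok)

    def at_position(self, position):
        """Position: all objects on the given position."""
        return list(self.by_position.get(position, []))

    def children(self, mtas_id):
        """Parent Id: all objects referring to mtas_id as parent."""
        return list(self.by_parent.get(mtas_id, []))

    def get(self, mtas_id):
        """Object Id: the object with the given id."""
        return self.by_id[mtas_id]

    def value(self, prefix, position):
        """Prefix/Position: the value for a prefix on a position, or None."""
        for tok in self.by_position.get(position, []):
            if tok.prefix == prefix:
                return tok.value
        return None


tokens = [
    Token(0, "s", "", 0, 1),                 # sentence spanning positions 0-1
    Token(1, "t", "Ik", 0, 0, parent=0),
    Token(2, "lemma", "ik", 0, 0, parent=1),
    Token(3, "t", "ben", 1, 1, parent=0),
    Token(4, "lemma", "zijn", 1, 1, parent=3),
    Token(5, "pos", "WW", 1, 1, parent=3),
]
forward = ForwardIndex(tokens)
print(forward.value("lemma", 1))
print([tok.mtas_id for tok in forward.children(3)])
```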
Usage new structure
The additions make it possible to quickly retrieve the required information for queries and results based on the annotated text.
To take advantage of these additions to the Lucene structure, we need
Tokenizer mapping the original annotated data (FoLiA) onto the new structure
Query handlers, and query language: CQL
Tokenizer FoLiA
Several elements can be distinguished:
Words : <w/>
Annotations on Words : <pos/>, <t/>, <lemma/>
Groups of Words : <p/>, <s/>, <div/>
Annotations on Groups : <lang/>
References : <wref/>
Relations : <entity/>
The configurable FoLiA tokenizer makes it possible to define these items and map them onto the new index structure.
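A rough sketch of such a mapping, using Python's standard `xml.etree.ElementTree`. The sample document is a simplified, namespace-free FoLiA-like fragment, and the hard-coded mapping rules (words to `t:`, `pos:` and `lemma:` tokens, `<s/>` and `<p/>` to position ranges) are assumptions for illustration; the real tokenizer is configurable and is not reproduced here.

```python
import xml.etree.ElementTree as ET

FOLIA = """<text xml:id="untitled.text">
  <p xml:id="untitled.p.1">
    <s xml:id="untitled.p.1.s.1">
      <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
        <t>Ik</t>
        <pos class="VNW(pers,pron,nomin,vol,1,ev)" head="VNW"/>
        <lemma class="ik"/>
      </w>
      <w xml:id="untitled.p.1.s.1.w.2" class="WORD">
        <t>ben</t>
        <pos class="WW(pv,tgw,ev)" head="WW"/>
        <lemma class="zijn"/>
      </w>
    </s>
  </p>
</text>"""


def tokenize(xml_text):
    """Return (value, start position, end position) triples."""
    tokens = []
    position = 0

    def visit(elem):
        nonlocal position
        if elem.tag == "w":
            # A word and its annotations all share one position.
            tokens.append(("t:" + elem.findtext("t"), position, position))
            tokens.append(("pos:" + elem.find("pos").get("head"), position, position))
            tokens.append(("lemma:" + elem.find("lemma").get("class"), position, position))
            position += 1
        elif elem.tag in ("p", "s"):
            # A group of words covers the range of positions of its words.
            start = position
            for child in elem:
                visit(child)
            if position > start:
                tokens.append((elem.tag, start, position - 1))
        else:
            for child in elem:
                visit(child)

    visit(ET.fromstring(xml_text))
    return tokens


tokens = tokenize(FOLIA)
for token in tokens:
    print(token)
```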
Search using CQL
For new MTAS data structure
Query handlers provided
Support Corpus Query Language (CQL)
Makes it possible to define conditions on annotations
Confusion about the exact interpretation and implementation
Search using CQL
the  big  green  shiny  apple
LID  ADJ  ADJ    ADJ    N
Ambiguities illustrated by examples
[pos = "LID" | word = "the"] (1)
[word = "b.*" | word = ".*g"] (2)
[pos = "ADJ"]{2} (3)
[pos = "ADJ"]? [pos = "N"] (4)
Search using CQL
Within MTAS
Results should be considered as equal if and only if the positions of both results exactly match.
Differs from the default query interpretation of Lucene and the CQL interpretation as used in other applications
No options to refer to parts of the matched pattern to e.g. sort, group or collect statistics
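The position-based notion of equality can be tried out on the phrase "the big green shiny apple" with a toy matcher. This is a sketch under assumptions, not the MTAS query handler: queries are hand-built lists of (condition, min, max) steps instead of parsed CQL, and results are collected in a set of (start, end) positions so that two matches on the same positions count once.

```python
import re

TOKENS = [
    {"word": "the", "pos": "LID"},
    {"word": "big", "pos": "ADJ"},
    {"word": "green", "pos": "ADJ"},
    {"word": "shiny", "pos": "ADJ"},
    {"word": "apple", "pos": "N"},
]


def cond(layer, pattern):
    """Condition on one annotation layer; the whole value must match."""
    rx = re.compile(pattern)
    return lambda tok: rx.fullmatch(tok[layer]) is not None


def any_of(*conds):
    """Disjunction of conditions on the same token."""
    return lambda tok: any(c(tok) for c in conds)


def ends(tokens, start, steps):
    """Return every end position (exclusive) reachable from start."""
    if not steps:
        return {start}
    (test, lo, hi), rest = steps[0], steps[1:]
    found = set()
    pos, count = start, 0
    while True:
        if count >= lo:
            found |= ends(tokens, pos, rest)
        if count == hi or pos >= len(tokens) or not test(tokens[pos]):
            break
        pos += 1
        count += 1
    return found


def search(tokens, steps):
    """All distinct (start, end) position pairs, end inclusive."""
    return sorted(
        {(s, e - 1) for s in range(len(tokens)) for e in ends(tokens, s, steps) if e > s}
    )


ADJ, N = cond("pos", "ADJ"), cond("pos", "N")
QUERIES = {
    1: [(any_of(cond("pos", "LID"), cond("word", "the")), 1, 1)],
    2: [(any_of(cond("word", "b.*"), cond("word", ".*g")), 1, 1)],
    3: [(ADJ, 2, 2)],
    4: [(ADJ, 0, 1), (N, 1, 1)],
}
for number, steps in QUERIES.items():
    print(number, search(TOKENS, steps))
# 1 [(0, 0)]
# 2 [(1, 1)]
# 3 [(1, 2), (2, 3)]
# 4 [(3, 4), (4, 4)]
```

In examples (1) and (2) a token satisfying both alternatives yields one result, in (3) the overlapping pairs are both reported, and in (4) "shiny apple" and "apple" are two results because their positions differ.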
Size indexes
Collection  # FoLiA  Zipped Size  Index  Positions
DBNL T  9,465  29GB  198GB  677,476,310
DBNL DT  131,177  95GB  395,530,191
SONAR  2,063,880  22GB  127GB  504,393,711
Search on combined indexes using Solr sharding
# FoLiA      2,204,522
# Positions  1,577,400,212
# Sentences  92,584,655
There are approximately 10 tokens on each position.
Performance
Virtual Machine, Ubuntu, 8 cores, 48GB (40GB Solr)
Computing stats (sum, mean, median, standard deviation, etc.) on full set of 2,204,522 documents and 1,577,400,212 positions.
CQL                         Time        Hits         Docs
[t = "de"]                  3,023 ms    57,531,353   1,801,583
[t = "de" & pos = "LID"]    7,877 ms    56,704,921   1,799,499
[t = "de" & !pos = "LID"]   3,105 ms    826,432      132,722
<s> [t = "De"]              11,568 ms   6,085,643    1,090,127
[pos = "N"]                 6,200 ms    259,942,340  2,189,750
[pos = "ADJ"] [pos = "N"]   42,977 ms   45,366,603   1,821,716
[pos = "ADJ"]? [pos = "N"]  207,795 ms  305,308,943  2,189,750
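The hit counts in the table above are internally consistent, which can be checked with a few lines of arithmetic. The dictionary below only restates the reported numbers; the reading that an optional adjective produces one result per distinct span is an interpretation of these counts, not a statement from the measurements themselves.

```python
# Hit counts as reported in the table above.
hits = {
    '[t = "de"]': 57_531_353,
    '[t = "de" & pos = "LID"]': 56_704_921,
    '[t = "de" & !pos = "LID"]': 826_432,
    '[pos = "N"]': 259_942_340,
    '[pos = "ADJ"] [pos = "N"]': 45_366_603,
    '[pos = "ADJ"]? [pos = "N"]': 305_308_943,
}

# Every "de" is either tagged LID or not, so the two subsets add up.
de_split = hits['[t = "de" & pos = "LID"]'] + hits['[t = "de" & !pos = "LID"]']

# With an optional adjective, each noun counts once on its own and once
# more for every adjective-noun span ending in it.
optional_adj = hits['[pos = "N"]'] + hits['[pos = "ADJ"] [pos = "N"]']

print(de_split == hits['[t = "de"]'])
print(optional_adj == hits['[pos = "ADJ"]? [pos = "N"]'])
```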
TODO
Group results
Facets
Performance
. . .
The end