MTAS Henny Brugman

Multi Tier Annotation Search (MTAS)

Matthijs Brouwer

Meertens Institute

December 8, 2015

1 Introduction

2 Lucene

3 MTAS

4 Tokenizer FoLiA

5 Search using CQL

6 Results

Provide Search on Combination of Text and Metadata

Example data

Author: Eduard Douwes Dekker
Place of birth: Amsterdam
Date of birth: 1820, March 2
Pseudonym: Multatuli
Title: Max Havelaar
Published: 1860

Text: Ik ben makelaar in koffie, en woon op de Lauriergracht no 37 ...

Solution based on Apache Solr

Reverse Index

Apache Solr (based on Apache Lucene)

Index on both Text and Metadata

Advantages

Search

Facets

Scalable

Custom plugin (join)

Actively developed

Search Text

"Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."
(English: "I am a coffee broker, and live at Lauriergracht no 37.")

We can search for

"Makelaar"

"Makelaar in koffie"

"Makel.* in koffie"

Annotations

text      lemma     pos/features
Ik        ik        VNW(pers,pron,nomin,vol,1,ev)
ben       zijn      WW(pv,tgw,ev)
makelaar  makelaar  N(soort,ev,basis,zijd,stan)
in        in        VZ(init)
koffie    koffie    N(soort,ev,basis,zijd,stan)
,         ,         LET()
...       ...       ...

</pos><morphology><morpheme><t offset="0">ik</t>

</morpheme></morphology><lemma class="ik"/>

Required functionality

Extend current Solr solution

Search on annotations like pos, lemma, features, . . .

Search on sentences, paragraphs, chapters, . . .

Search on entities and chunks

Search on dependencies

Statistics, grouping, facets, . . .

Important

Maintaining functionality and scalability

Upgradeable to new releases of Solr/Lucene

Tokenization

Something about Lucene internals

Focus on text

Tokenization

Text is split up into tokens

value, e.g. "koffie"
position, e.g. 4
offset, e.g. 19-24
payload, e.g. 1.000
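
As a rough illustration of these token attributes, the following Python sketch splits the example sentence into tokens and records value, position and character offsets. The regular expression, the inclusive end offset and the constant payload are assumptions made for this illustration; Lucene's own analyzers are configured differently.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    value: str
    position: int   # index of the token in the token stream
    start: int      # offset of the first character
    end: int        # offset of the last character (inclusive)
    payload: float  # placeholder weight, fixed here for illustration


def tokenize(text: str) -> list[Token]:
    """Split text into word tokens and single punctuation tokens."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        tokens.append(
            Token(match.group(), position, match.start(), match.end() - 1, 1.0)
        )
    return tokens


text = "Ik ben makelaar in koffie, en woon op de Lauriergracht no 37."
tokens = tokenize(text)
for token in tokens[:6]:
    print(token)
```

With this splitting rule the comma counts as a token of its own, so "koffie" ends up at position 4 with offset 19-24, matching the example values on the slide.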

Reverse Index

The token stream is used to construct the reverse index

text      document  position  offset  payload
ben       0         1         3-5     0.500
de        0         9         38-39   0.200
en        0         6         27-28   0.250
in        0         3         16-17   0.350
koffie    0         4         19-24   0.900
makelaar  0         2         7-14    0.800
...       ...       ...       ...     ...

This enables fast search, since the locations of matching terms can be found very quickly.
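
A minimal Python sketch of the idea, assuming a plain dictionary as the index and leaving out the payload: each term maps to the places where it occurs, so a term or phrase lookup never has to scan the text itself. The function names are hypothetical and do not correspond to Lucene's API.

```python
from collections import defaultdict

# (document, position, start offset, end offset, value), as in the table
token_stream = [
    (0, 1, 3, 5, "ben"),
    (0, 2, 7, 14, "makelaar"),
    (0, 3, 16, 17, "in"),
    (0, 4, 19, 24, "koffie"),
    (0, 6, 27, 28, "en"),
    (0, 9, 38, 39, "de"),
]


def build_index(stream):
    """Map each term to the list of places where it occurs."""
    index = defaultdict(list)
    for doc, position, start, end, value in stream:
        index[value].append((doc, position, start, end))
    return index


def phrase_positions(index, terms):
    """Return (doc, position) pairs where the terms occur consecutively."""
    first = {(doc, pos) for doc, pos, _, _ in index.get(terms[0], [])}
    hits = []
    for doc, pos in sorted(first):
        if all(
            any(d == doc and p == pos + i for d, p, _, _ in index.get(term, []))
            for i, term in enumerate(terms)
        ):
            hits.append((doc, pos))
    return hits


index = build_index(token_stream)
print(index["koffie"])
print(phrase_positions(index, ["makelaar", "in", "koffie"]))
```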

Limitations

Limitations of this approach

Heavily based on grouping by document
  Collecting statistics
  Grouping results

Not possible to include
  Structural information: sentences, paragraphs, ...
  Annotations: pos, lemmas, ...
  Relations: dependencies, chunking, ...

No real forward index
  Finding all tokens for a given position

Alternatives

Alternative solutions

Graph Database
  Experiments with Neo4j: problems with scalability and performance
  Too general, doesn't use the sequential nature of textual data

BlackLab
  Based on Lucene, no integration with Solr
  Different fields for each annotation layer

Extension provided by MTAS

Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations

Use the payload to encode additional information on each token

Construct forward indexes by extending the Lucene Codec

Implementation

Extension based on the Lucene Library

Provide query handlers for extended data structures

Provide Solr Plugin using the MTAS extension

Prefixes

Store multiple tokens on the same position, and use prefixes to distinguish between different layers of annotations

text        document  position
lemma:de    0         9
lemma:zijn  0         1
...         ...       ...
pos:LID     0         9
pos:WW      0         1
...         ...       ...
t:ben       0         1
t:de        0         9
...         ...       ...
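
The prefix scheme can be sketched in a few lines of Python. The dictionary-based index and the `lookup` helper are assumptions for illustration only; the point is that text, lemma and part of speech become separate terms that share one position.

```python
from collections import defaultdict

# One entry per position: (text, lemma, part of speech) for document 0.
annotated = {
    1: ("ben", "zijn", "WW"),
    9: ("de", "de", "LID"),
}

index = defaultdict(list)
for position, (text, lemma, pos) in annotated.items():
    # Three tokens share the same position; the prefix names the layer.
    index[f"t:{text}"].append((0, position))
    index[f"lemma:{lemma}"].append((0, position))
    index[f"pos:{pos}"].append((0, position))


def lookup(layer: str, value: str):
    """Return (document, position) pairs for a value in one annotation layer."""
    return index.get(f"{layer}:{value}", [])


print(lookup("lemma", "zijn"))
print(lookup("t", "de"), lookup("lemma", "de"))
```

Because the prefix is part of the term, the text "de" and the lemma "de" stay distinguishable even though they sit on the same position.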

Payload

Use the payload to encode additional information on each token

mtas id      integer identifying token within a document
position     type of position: single, range or set;
             additional information for range or set
offset       start and end offset
real offset  start and end real offset
parent       reference to another token by its mtas id
payload      original payload
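
To make the fields concrete, here is a hypothetical in-memory representation in Python. The class name, the field types and the example values are assumptions; the slides do not specify how MTAS actually encodes this information in the payload bytes.

```python
from dataclasses import dataclass
from typing import Optional, Union

# A position is a single index, an inclusive (start, end) range, or a set.
Position = Union[int, tuple[int, int], frozenset[int]]


@dataclass(frozen=True)
class MtasToken:
    mtas_id: int                      # identifies the token within a document
    value: str                        # prefixed value, e.g. "pos:WW"
    position: Position
    offset: Optional[tuple[int, int]] = None
    real_offset: Optional[tuple[int, int]] = None
    parent: Optional[int] = None      # mtas id of another token
    payload: Optional[bytes] = None   # original payload

    def positions(self) -> list[int]:
        """Expand the position into the sorted list of covered indexes."""
        if isinstance(self.position, int):
            return [self.position]
        if isinstance(self.position, tuple):
            start, end = self.position
            return list(range(start, end + 1))
        return sorted(self.position)


word = MtasToken(mtas_id=2, value="t:ben", position=1, offset=(3, 5))
sentence = MtasToken(mtas_id=0, value="s", position=(0, 13))
print(word.positions(), sentence.positions()[-1])
```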

Forward Indexes

Construct forward indexes by extending the Lucene Codec

Position         Given the position within the document, return references to all objects on that position.
Parent Id        Given the mtas id, return references to all objects referring to this mtas id as parent
Object Id        Given the id, return a reference to the object
Prefix/Position  Given prefix and position, return the value
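
The four lookups can be pictured as four maps built from the same list of objects. The sketch below uses Python dictionaries and made-up objects; it assumes at most one value per prefix and position and says nothing about how the extended Lucene codec stores these indexes on disk.

```python
from collections import defaultdict

# Hypothetical objects: (mtas id, prefix, value, positions, parent id).
objects = [
    (0, "s", "", range(0, 3), None),
    (1, "t", "Ik", range(0, 1), 0),
    (2, "pos", "VNW", range(0, 1), 1),
    (3, "t", "ben", range(1, 2), 0),
    (4, "lemma", "zijn", range(1, 2), 3),
]

by_id = {}                        # object id -> object
by_position = defaultdict(list)   # position -> object ids
by_parent = defaultdict(list)     # parent id -> object ids
by_prefix_position = {}           # (prefix, position) -> value

for obj in objects:
    mtas_id, prefix, value, positions, parent = obj
    by_id[mtas_id] = obj
    if parent is not None:
        by_parent[parent].append(mtas_id)
    for position in positions:
        by_position[position].append(mtas_id)
        by_prefix_position[(prefix, position)] = value

print(by_position[1])
print(by_parent[0])
print(by_prefix_position[("lemma", 1)])
```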

Usage of the new structure

The additions make it possible to quickly retrieve the required information for queries and results based on the annotated text.

To take advantage of these additions to the Lucene structure, we need

A tokenizer mapping the original annotated data (FoLiA) onto the new structure

Query handlers, and query language: CQL

</pos><morphology><morpheme><t offset="0">ik</t>

</morpheme></morphology><lemma class="ik"/>

Tokenizer FoLiA

Several elements can be distinguished:

Words : <w/>

Annotations on Words : <pos/>, <t/>, <lemma/>

Groups of Words : <p/>, <s/>, <div/>

Annotations on Groups : <lang/>

References : <wref/>

Relations : <entity/>

The configurable FoLiA tokenizer makes it possible to define these items and map them onto the new index structure.
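
A simplified sketch of such a mapping in Python, using the standard `xml.etree.ElementTree` module. The fragment is namespace-free and much smaller than real FoLiA, and the `mapping` table is a hypothetical stand-in for the tokenizer configuration, whose actual format is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment in the style of FoLiA.
fragment = """
<s>
  <w><t>Ik</t><pos class="VNW"/><lemma class="ik"/></w>
  <w><t>ben</t><pos class="WW"/><lemma class="zijn"/></w>
</s>
"""

# Hypothetical mapping: element name -> (prefix, where the value comes from).
mapping = {"t": ("t", "text"), "pos": ("pos", "class"), "lemma": ("lemma", "class")}


def tokenize(xml_text):
    """Yield (prefixed value, position) pairs, one position per <w/>."""
    root = ET.fromstring(xml_text)
    for position, word in enumerate(root.iter("w")):
        for child in word:
            if child.tag not in mapping:
                continue
            prefix, source = mapping[child.tag]
            value = child.text if source == "text" else child.get("class")
            yield f"{prefix}:{value}", position


tokens = list(tokenize(fragment))
print(tokens)
```

Groups of words such as `<s/>` would additionally need a range position covering all their words; that part is left out here.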

Search using CQL

For new MTAS data structure

Query handlers provided

Support Corpus Query Language (CQL)

Makes it possible to define conditions on annotations

Confusion about the exact interpretation and implementation

Search using CQL

the  big  green  shiny  apple
LID  ADJ  ADJ    ADJ    N

Ambiguities illustrated by examples

[pos="LID" | word="the"]    (1)

[word="b.*" | word=".*g"]   (2)

[pos="ADJ"]{2}              (3)

[pos="ADJ"]? [pos="N"]      (4)

Search using CQL

Within MTAS

Results should be considered equal if and only if the positions of both results match exactly.

Differs from the default query interpretation of Lucene and the CQL interpretation as used in other applications

No options to refer to parts of the matched pattern, e.g. to sort, group or collect statistics
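
The position-based rule can be illustrated with a toy matcher in Python, applied to the phrase "the big green shiny apple" tagged LID ADJ ADJ ADJ N. This is an illustration of the rule only, not the MTAS query implementation: the query representation as (condition, min, max) triples and all function names are assumptions. Hits are collected in a set of (start, end) spans, so two matches on exactly the same positions count once.

```python
import re

words = ["the", "big", "green", "shiny", "apple"]
tags = ["LID", "ADJ", "ADJ", "ADJ", "N"]
tokens = [{"word": w, "pos": p} for w, p in zip(words, tags)]


def cond(layer, pattern):
    """Condition on one annotation layer; the pattern must match fully."""
    return lambda token: re.fullmatch(pattern, token[layer]) is not None


def either(*conditions):
    """Disjunction of conditions within a single token."""
    return lambda token: any(c(token) for c in conditions)


def search(query):
    """Return the set of (start, end) spans matched by the query.

    query is a list of (condition, min_repeat, max_repeat) elements.
    """
    def extend(start, pos, rest):
        if not rest:
            if pos > start:  # ignore empty matches
                yield (start, pos - 1)
            return
        condition, low, high = rest[0]
        count = 0
        if low == 0:  # the element may be skipped entirely
            yield from extend(start, pos, rest[1:])
        while (
            count < high
            and pos + count < len(tokens)
            and condition(tokens[pos + count])
        ):
            count += 1
            if count >= low:
                yield from extend(start, pos + count, rest[1:])

    return {
        span
        for start in range(len(tokens))
        for span in extend(start, start, query)
    }


q1 = [(either(cond("pos", "LID"), cond("word", "the")), 1, 1)]
q3 = [(cond("pos", "ADJ"), 2, 2)]
q4 = [(cond("pos", "ADJ"), 0, 1), (cond("pos", "N"), 1, 1)]

print(sorted(search(q1)))
print(sorted(search(q3)))
print(sorted(search(q4)))

# Counting each alternative of query (1) separately would give two hits.
naive = len(search([(cond("pos", "LID"), 1, 1)])) + len(
    search([(cond("word", "the"), 1, 1)])
)
print(naive)
```

Under this rule query (1) yields a single hit on "the", query (3) yields the two overlapping spans "big green" and "green shiny", and query (4) yields "shiny apple" and "apple" as two distinct results because their positions differ.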

Size indexes

Collection  # FoLiA    Zipped  Size index  Positions
DBNL T      9,465      29GB    198GB       677,476,310
DBNL DT     131,177            95GB        395,530,191
SONAR       2,063,880  22GB    127GB       504,393,711

Search on combined indexes using Solr sharding

# FoLiA      2,204,522
# Positions  1,577,400,212
# Sentences  92,584,655

There are approximately 10 tokens on each position.

Performance

Virtual Machine, Ubuntu, 8 cores, 48GB (40GB Solr)

Computing stats (sum, mean, median, standard deviation, etc.) on the full set of 2,204,522 documents and 1,577,400,212 positions.

CQL                      Time        Hits         Docs
[t="de"]                 3,023 ms    57,531,353   1,801,583
[t="de" & pos="LID"]     7,877 ms    56,704,921   1,799,499
[t="de" & !pos="LID"]    3,105 ms    826,432      132,722
<s> [t="De"]             11,568 ms   6,085,643    1,090,127
[pos="N"]                6,200 ms    259,942,340  2,189,750
[pos="ADJ"] [pos="N"]    42,977 ms   45,366,603   1,821,716
[pos="ADJ"]? [pos="N"]   207,795 ms  305,308,943  2,189,750

Group results

Facets

Performance

The end
