Experiences with UIMA from a User’s Perspective

Dietmar Rösner,

Manuela Kunze,

Hany Mahgoub

University of Magdeburg Knowledge Based Systems and Document Processing

Rösner, Kunze, Mahgoub: Experiences with UIMA from a User’s Perspective 2

Overview

• Introduction

• GATE

• UIMA

• Conclusion

Introduction

"IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies."

• November 2005: version 1.2.3 of UIMA is available


Introduction

really?

Introduction

• similarity/comparison of GATE and UIMA

– frameworks

– results are documents + annotations

– pipeline processing

• steps:

– task definition

– one corpus

Evaluation Topics/Points

• ease of getting acquainted with the system?

– quality of documentation: completeness, clarity, up-to-dateness, …?

– tutorials, use cases, …?

• processing and linguistic resources?

– lexica, Gazetteer lists, tools

• tools for resource maintenance and extension?

– quality: self-explanatory, robust, comfortable

• speed of processing?

• single docs vs. large corpora?

• limitations, suggestions for improvement?

• support for im-/export of a variety of document formats?

Task of the Experiment

• process a corpus of web sites

– to detect and extract information relevant for tourists
  • opening times of museums, prices of hotels, …

• corpus:

– 30 tourism web sites about Egypt

– an additional 20 web sites about Washington, New York, London

• output:

– Prolog facts for a reasoner

– questions:
  • Which museum is open now?
  • …


Excerpts from the Corpus

• The Egyptian Museum is open the hours: 9am-5pm daily

• The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm

• Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)

• 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri

• …


GATE: General Architecture for Text Engineering

• a suite of tools for language processing and information extraction

• rule-based modular IE system (ANNIE)

• language and domain-independent processing resources

• open and extensible architecture

• aims to provide uniform access to various linguistic and ontological resources

• http://gate.ac.uk/

GATE: General Architecture for Text Engineering

• a software infrastructure for NLP researchers; based on three main elements:

– an architecture
  • describing the components composing a language processing system

– a framework
  • could be used as a basis for building such systems

– a graphical development environment
  • a set of tools and components for language engineers

GATE: General Architecture for Text Engineering

• GATE is distributed with an IE system called ANNIE

– relies on finite state algorithms and the Java Annotation Pattern Engine (JAPE) language

– comprises a set of core Processing Resources (PRs):
  • Tokeniser
  • Gazetteers
  • POS tagger
  • Sentence Splitter
  • Semantic Tagger (JAPE transducer)
  • Orthomatcher (orthographic coreference)
  • …


GATE: ANNIE

[Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]

GATE Application

• several Processing Resources: Tokenizer, Hash Gazetteer (with new/extended Gazetteer lists), JAPE Transducer

[Figure: the Processing Resources applied to the sample text "... * The Military Museum* Summer: 8am-5:30pm; Winter: 9pm-5pm …"]

• ANNIE English Tokenizer

• Gazetteer lists: names of museums, fragments of times and restrictions

• JAPE Transducer: JAPE rules to annotate
  • interval of times and restrictions
  • museum

Museum information in JAPE

Rule: egyptmuseums
(
  ({SpaceToken})
  ({Token.kind == word})
  ({SpaceToken})
  {Lookup.majorType == org_base}   // from gazetteer lists
  ({SpaceToken})?
  (({Token.kind == punctuation}) | ({Token.kind == word}) | ({SpaceToken}))*
  ({timeinfo})                     // annotation by JAPE transducer
):museum
-->
  :museum.sight = {rule = "egyptmuseums"}

timeinfo, defined by JAPE rules, detects patterns like:

• 9am-5pm, 6pm-9pm

• 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm

• 5:00PM-7:00PM, 10:00am-5:00pm

• …
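The time patterns listed above can also be approximated outside JAPE. The following is a minimal sketch in Python, not the JAPE rules used in the experiment (those are not shown on the slides); the names `TIME`, `TIME_INTERVAL` and `find_time_intervals` are hypothetical, and the pattern assumes a 12-hour clock with an am/pm suffix and a plain hyphen between the two times.

```python
import re

# One clock time such as "9am", "5:30pm" or "10:00AM" (12-hour clock assumed).
TIME = r"(?:1[0-2]|0?[1-9])(?::[0-5][0-9])?\s?[aApP][mM]"

# A time interval: two clock times joined by a hyphen.
TIME_INTERVAL = re.compile(rf"({TIME})\s?-\s?({TIME})")


def find_time_intervals(text):
    """Return a (start, end) pair of strings for every interval found in text."""
    return [(m.group(1), m.group(2)) for m in TIME_INTERVAL.finditer(text)]


if __name__ == "__main__":
    sample = "Summer: 8am-5:30pm; winter: 8am-4:30pm"
    for start, end in find_time_intervals(sample):
        print(start, "to", end)
```

Unlike the JAPE rule, this sketch works on raw text rather than on Token and Lookup annotations, so it cannot take gazetteer entries into account.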

GATE: Presentation of Results

[Screenshot: type and location of every extracted annotation on the document; labelled areas: Annotations, Museums Information]

GATE: Results

• information annotated in the documents:

– names of museums, hotels

– names of tourist places in Egypt

– times, time intervals

– time restrictions

– prices, intervals of prices (hotel prices and museum prices)

– names of pharaohs, queens

GATE: Evaluation

documentation?

- good
- illustrative examples (tutorial), but not enough, especially about JAPE rules
- can be used without knowledge of Java programming
- but experience with Java programming is an advantage when using Java in JAPE rules

GATE: Evaluation

processing and linguistic resources?

- many processing resources available (ANNIE)
  - tokenisers
  - POS taggers
  - parsers
  - gazetteers
  - sentence splitter
  - …
- additional PRs:
  - gazetteer collector
  - PRs for Machine Learning
  - various exporters
  - annotation set transfer etc.

GATE: Evaluation

tools for resource maintenance and extension?

- editor for gazetteer lists
- corpus manager
- text editor and debugger for JAPE rules

GATE: Evaluation

speed of processing?

- there is no measurement of processing time in the GATE tool

GATE: Evaluation

single docs vs. large corpora?

- corpus pipeline vs. document pipeline

GATE: Evaluation

limitations, suggestions for improvement?

- no limitations:
  - everything is possible, and it is not necessary to implement it all yourself
- for getting started:
  - processing and linguistic resources are available within the distribution

GATE: Evaluation

im-/export of document formats?

- import:
  - supports a variety of document formats: HTML, RTF, email, SGML and plain text
  - in all cases the format is analysed and converted into a single unified model of annotation
- export:
  - documents, corpora and annotations in databases of various sorts
- required: Java application (CREOLE)


UIMA: Unstructured Information Management Architecture

• a software architecture for developing and deploying unstructured information management (UIM) applications

• UIM application: a software system that

– analyses large volumes of unstructured information to
  • discover,
  • organize, and
  • deliver relevant knowledge to the end user

• software architecture which specifies

– component interfaces, data representations, …

• http://www.research.ibm.com/UIMA/

UIMA: Unstructured Information Management Architecture

• Collection Reader: … interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.

• Analysis Engine: … takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.

• CAS Initializer: … may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.

• CAS Consumers: … consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.

CAS: Common Analysis Structure

CPM: Collection Processing Manager

[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
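The division of labour between Collection Reader, Analysis Engine and CAS Consumer described above can be illustrated with a toy pipeline. This is only a conceptual sketch in Python and deliberately does not use the UIMA SDK: `Cas`, `collection_reader`, `time_annotator`, `covered_text_consumer` and `run_pipeline` are hypothetical names, and the real CAS is a typed feature-structure store rather than a list of tuples.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a CAS: the document text plus annotations on it."""
    text: str
    annotations: list = field(default_factory=list)  # (type, begin, end) tuples


def collection_reader(documents):
    """Reader role: yield one fresh Cas per document in the collection."""
    for text in documents:
        yield Cas(text)


def time_annotator(cas):
    """Analysis role: mark simple clock times such as '9am' or '5:30pm'."""
    for m in re.finditer(r"\d{1,2}(?::\d{2})?[ap]m", cas.text):
        cas.annotations.append(("Time", m.start(), m.end()))
    return cas


def covered_text_consumer(cas):
    """Consumer role: reduce the enriched Cas to (type, covered text) pairs."""
    return [(kind, cas.text[begin:end]) for kind, begin, end in cas.annotations]


def run_pipeline(documents, engines, consumer):
    """Pass every Cas through the engines in order, then hand it to the consumer."""
    results = []
    for cas in collection_reader(documents):
        for engine in engines:
            cas = engine(cas)
        results.append(consumer(cas))
    return results


if __name__ == "__main__":
    print(run_pipeline(["Open 9am-5pm daily"], [time_annotator], covered_text_consumer))
```

Each engine only reads and extends the shared structure, which is what allows engines to be chained or nested into aggregates.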

UIMA: Unstructured Information Management Architecture

• Analysis Engine (AE):

– a component that analyzes artifacts (e.g. documents) and infers information about them

– consists of two parts:
  • Java classes (typically packaged as one or more JAR files) and
  • AE descriptors (one or more XML files), containing
    – the configuration settings for the Analysis Engine as well as
    – a description of the AE’s input and output requirements

[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]

UIMA Application

• several annotators (like a pipeline)

[Figure: the annotators applied to the sample text "... *Fraunces Tavern Museum* 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm; …"]

• annotators: museum pattern, time pattern, interval of times, restrictions, museum information

• techniques named in the figure: regular expressions (for three of the annotators), a window covering two time intervals and a restriction, a window covering a museum and opening hours

Prolog facts:

museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
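Facts of the shape shown above can be produced by a small export step. The sketch below is an assumption about how such an export might look, written in Python rather than as a UIMA CAS Consumer; the function name `museum_open_facts` and its parameters are hypothetical, and only the `museumopen/3` fact format with ISO 8601 timestamps is taken from the slide.

```python
from datetime import date, datetime, time, timedelta


def museum_open_facts(name, first_day, days, opens, closes):
    """Build one museumopen/3 fact per day, with ISO 8601 timestamps."""
    # Assumed escaping for a quoted Prolog atom: double backslashes and quotes.
    atom = name.replace("\\", "\\\\").replace("'", "''")
    facts = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        start = datetime.combine(day, opens).isoformat()
        end = datetime.combine(day, closes).isoformat()
        facts.append(f"museumopen('{atom}', '{start}', '{end}').")
    return facts


if __name__ == "__main__":
    for fact in museum_open_facts("Fraunces Tavern Museum", date(2005, 12, 1), 3,
                                  time(12, 0), time(17, 0)):
        print(fact)
```

The sketch emits a fact for every consecutive day; applying a restriction such as "Tuesday-Friday" would need an additional weekday filter, which is left out here.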

UIMA: Results

• information annotated in the documents:

– names of museums, hotels

– times, time intervals

– time restrictions

– prices, intervals of prices (hotel prices)

– keywords for museum category

– names of pharaohs (annotated with a correction of misspellings)

• hotel and museum information is exported into Prolog facts and into a short textual summary

– templates filled with the detected information
  • hotels: Price information about Cosmopolitan Hotel : $157
  • museums:

*** *Fraunces Tavern Museum* ***

Open from 12:00:00 to 17:00:00;

Restriction: Tuesday-Friday
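The slides do not say how misspelled pharaoh names were corrected. One simple way to approximate such a step is fuzzy matching against a list of known names, sketched here in Python with the standard `difflib` module. The reference list `KNOWN_PHARAOHS`, the function name `correct_name` and the similarity cutoff of 0.8 are all assumptions made for illustration.

```python
import difflib

# Hypothetical reference list; the lexicon used in the experiment is not given.
KNOWN_PHARAOHS = ["Tutankhamun", "Ramses", "Hatshepsut"]


def correct_name(candidate, known=KNOWN_PHARAOHS, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(correct_name("Tutankamun"))
    print(correct_name("Cairo"))
```

A higher cutoff reduces false corrections at the cost of missing heavily garbled names; the comparison is case-sensitive as written.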

UIMA: Evaluation

documentation?

- good
- illustrative examples (tutorial)
- completeness: some topics are described only very briefly
- prior knowledge of Java and Eclipse is helpful

UIMA: Evaluation

processing and linguistic resources?

- annotators only from the tutorial
  - sentence annotation
  - word annotation
  - date/time annotators
  - examples for using regular expressions etc.
- external resources can be integrated:
  - lexical resources as external resources (text files)
  - existing processing resources
    - implementation of an interface is necessary

UIMA: Evaluation

tools for resource maintenance and extension?

- specific Eclipse component editors or
- simple text editors

UIMA: Evaluation

speed of processing?

- faster than GATE?
- in CPE: detailed information about processing time for each module

UIMA: Evaluation

single docs vs. large corpora?

- Collection Reader
  - document(s) from a directory
- adapt extensions into preprocessing (CAS Initializer)
  - e.g., extraction of text fragments from an HTML document

UIMA: Evaluation

limitations, suggestions for improvement?

• no limitations:
  – all is possible, but implementation or interfacing by user
• wish:
  – more processing and linguistic resources within the distribution

UIMA: Evaluation

im-/export of document formats?

- import: CAS Initializer
- export: CAS Consumer
  - transform annotations into any other format
  - export of
    - document + annotations
    - only annotations
- required: Java application


Conclusion

• intended use

– GATE: academic/scientific application
  • tools available
  • comfortable GUI

– UIMA: more commercial
  • plain framework
  • simplified definition of (complex) result structures
  • simplified pre- and postprocessing of annotations

• in sum: incommensurable


Conclusion

• both are extensible

• no final judgement about: use GATE or UIMA

– depends on
  • your task
    – task description
    – expected results
    – which processing resources are necessary
  • your preferences for interface
    – prefer the Eclipse environment (or other Java editors)
    – prefer a comfortable GUI

• or use both


Conclusion

• found in the UIMA Forum:

I see UIMA and GATE as complementary rather than competitive, and each can gain from the strengths of the other.

GATE was originally developed as a research tool, and has features suited to rapid prototyping of text processing code, like JAPE (a language for defining finite-state transducers over annotations on a document).

UIMA is more targetted at robust deployment of applications, with strong typing of feature structures and better support for distributed processing.

We're currently working on writing a translation layer to allow UIMA analysis components to be used in GATE and vice-versa. It's not in a releasable state just yet, but we hope to release something in the near future. Keep your eye on http://gate.ac.uk/ for details.

Ian Roberts (GATE developer)