Experiences with UIMA from a User’s Perspective Dietmar Rösner, Manuela Kunze, Hany Mahgoub...

Experiences with UIMA from a User’s Perspective

Dietmar Rösner,

Manuela Kunze,

Hany Mahgoub

University of Magdeburg Knowledge Based Systems and Document Processing


Overview

• Introduction

• GATE

• UIMA

• Conclusion


Introduction

"IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for

creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating

them with search technologies."

• November 2005; Version 1.2.3 of UIMA is available


Introduction

really?


Introduction

• similarity/comparison of GATE and UIMA
  – frameworks
  – results are documents + annotations
  – pipeline processing

• steps:
  – task definition
  – one corpus


Evaluation Topics/Points

• ease of getting acquainted with system?:

– quality of documentation: completeness, clarity, up-to-date, …?

– tutorials, use cases, …?

• processing and linguistic resources?

– lexica, Gazetteer lists, tools

• tools for resource maintenance and extension?

– quality: self-explanatory, robust, comfortable

• speed of processing?

• single docs vs. large corpora?

• limitations, suggestions for improvement?

• support for im-/export of a variety of document formats?


Task of the Experiment

• process a corpus of websites
  – to detect and extract information relevant for tourists
    • opening times of museums, prices of hotels, …

• corpus:
  – 30 tourism web sites of Egypt
  – additional 20 web sites of Washington, New York, London

• output:
  – Prolog facts for a reasoner
  – questions:
    • Which museum is now open?
    • …


Excerpts from the Corpus

• The Egyptian Museum is open the hours: 9am-5pm daily

• The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm

• Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)

• 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri

• …


Overview

• Introduction

• GATE

• UIMA

• Conclusion


GATE: General Architecture for Text Engineering

• a suite of tools for language processing and information extraction

• rule-based modular IE system (ANNIE)

• language and domain-independent processing resources

• open and extensible architecture

• aims to provide uniform access to various linguistic and ontological resources

• http://gate.ac.uk/


• a software infrastructure for NLP researchers; based on three main elements:
  – an architecture
    • describing the components composing a language processing system
  – a framework
    • could be used as a basis for building such systems
  – a graphical development environment
    • a set of tools and components for language engineers



• GATE is distributed with an IE system called ANNIE
  – relies on finite state algorithms and the Java Annotation Pattern Engine (JAPE) language
  – comprises a set of core Processing Resources (PRs):
    • Tokeniser
    • Gazetteers
    • POS tagger
    • Sentence Splitter
    • Semantic Tagger (JAPE transducer)
    • Orthomatcher (orthographic coreference)
    • …



GATE: ANNIE

[Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]


GATE Application

• several Processing Resources: Tokenizer, Hash Gazetteer (with new/extended Gazetteer lists), JAPE Transducer

• pipeline: ANNIE English Tokenizer → Gazetteer lists → JAPE Transducer
  – Gazetteer lists: names of museums, fragments of times and restrictions
  – JAPE rules: to annotate
    • intervals of times and restrictions
    • museums

• example input: ... *The Military Museum* Summer: 8am-5:30pm; Winter: 9pm-5pm …
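The gazetteer step can be pictured as a plain list lookup that attaches a major type to every match. The following Python sketch is only an illustration of that idea, not GATE's Hash Gazetteer; the `Lookup` class, the `gazetteer_lookup` function and the `season` list are assumptions made here (only `org_base` appears in the JAPE rule).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lookup:
    """Hypothetical stand-in for a gazetteer annotation."""
    start: int
    end: int
    major_type: str


def gazetteer_lookup(text, lists):
    """Annotate every occurrence of a list entry with the list's major type."""
    found = []
    for major_type, entries in lists.items():
        for entry in entries:
            pos = text.find(entry)
            while pos != -1:
                found.append(Lookup(pos, pos + len(entry), major_type))
                pos = text.find(entry, pos + 1)
    return sorted(found, key=lambda a: (a.start, a.end))


text = "The Military Museum is open the hours: Summer: 8am-5:30pm"
lists = {"org_base": ["Museum"], "season": ["Summer", "Winter"]}  # assumed lists
lookups = gazetteer_lookup(text, lists)
```

Each `Lookup` carries character offsets, so a later rule can test for `major_type == "org_base"` next to a time pattern, much as the JAPE rule does with `Lookup.majorType`.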


Museum information in JAPE

Rule: egyptmuseums
(
  ({SpaceToken})
  ({Token.kind == word})
  ({SpaceToken})
  {Lookup.majorType == org_base}   // from gazetteer lists
  ({SpaceToken})?
  (({Token.kind == punctuation}) | ({Token.kind == word}) | ({SpaceToken}))*
  ({timeinfo})                     // annotation by JAPE transducer
):museum
-->
:museum.sight = {rule = "egyptmuseums"}

timeinfo, defined by JAPE rules, detects patterns like:
• 9am-5pm, 6pm-9pm
• 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm
• 5:00PM-7:00PM, 10:00am-5:00pm
• …
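Outside JAPE, the time patterns listed above could be approximated with an ordinary regular expression. The Python sketch below is not the authors' rule set; the pattern and the helper names `to_24h` and `find_intervals` are assumptions chosen for illustration.

```python
import re

# Assumed pattern: hour, optional ":minutes", then an am/pm marker.
TIME = r"(\d{1,2})(?::(\d{2}))?\s*([ap]m)"
INTERVAL = re.compile(TIME + r"\s*-\s*" + TIME, re.IGNORECASE)


def to_24h(hour, minutes, meridiem):
    """Convert a 12-hour clock reading to an 'HH:MM' string."""
    h = int(hour) % 12
    if meridiem.lower() == "pm":
        h += 12
    return f"{h:02d}:{int(minutes or 0):02d}"


def find_intervals(text):
    """Return (start, end) pairs in 24-hour notation for each interval found."""
    return [
        (to_24h(*m.groups()[:3]), to_24h(*m.groups()[3:]))
        for m in INTERVAL.finditer(text)
    ]


intervals = find_intervals("Summer: 8am-5:30pm; winter: 8am-4:30pm")
```

Normalising to 24-hour notation at this point makes it easier to produce timestamps for later export.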


GATE: Presentation of Results

• type and location of every extracted annotation in the document

[Screenshot labels: Annotations, Museums Information]


GATE: Results

• information annotated in the documents:
  – names of museums, hotels
  – names of tourist places in Egypt
  – times, time intervals
  – time restrictions
  – prices, intervals of prices (hotel prices and museum prices)
  – names of pharaohs, queens


GATE: Evaluation

documentation?

- good
- illustrative examples (tutorial), but not enough, especially about JAPE rules
- can be used without knowledge of Java programming
- but experience with Java programming is an advantage for using Java in JAPE rules


GATE: Evaluation

processing and linguistic resources?

- many processing resources available (ANNIE)
  - tokenisers
  - POS taggers
  - parsers
  - gazetteers
  - sentence splitter
  - …
- additional PRs:
  - gazetteer collector
  - PRs for Machine Learning
  - various exporters
  - annotation set transfer etc.


GATE: Evaluation

tools for resource maintenance and extension?

- editor for gazetteer lists
- corpus manager
- text editor and debugger for JAPE rules


GATE: Evaluation

speed of processing?

- there is no measurement of processing time in the GATE tool


GATE: Evaluation

single docs vs. large corpora?

- corpus pipeline vs. document pipeline


GATE: Evaluation

limitations, suggestions for improvement?

- no limitations:
  - everything is possible, but it is not necessary to implement everything yourself
- for getting started:
  - processing and linguistic resources are available within the distribution


GATE: Evaluation

im-/export of document formats?

- import:
  - supports a variety of document formats: HTML, RTF, email, SGML and plain text
  - in all cases the format is analysed and converted into a single unified model of annotation
- export:
  - documents, corpora and annotations in databases of various sorts
- required: Java application (CREOLE)


Overview

• Introduction

• GATE

• UIMA

• Conclusion


UIMA: Unstructured Information Management Architecture

• a software architecture for developing and deploying unstructured information management (UIM) applications

• UIM application: a software system
  – analyses large volumes of unstructured information to
    • discover,
    • organize, and
    • deliver relevant knowledge to the end user

• software architecture which specifies
  – component interfaces, data representations, …

• http://www.research.ibm.com/UIMA/



… interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.

… takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.

… may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.

… consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.

CAS: Common Analysis Structure
CPM: Collection Processing Manager

[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
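The data flow described above (reader, analysis engines, consumer, all exchanging a CAS) can be mimicked in a few lines. This Python sketch is only an analogy and does not use or reproduce the UIMA API; `Cas`, `collection_reader`, `word_annotator`, `run_pipeline` and `index_consumer` are names invented here.

```python
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Stand-in for a CAS: the document text plus annotations added so far."""
    text: str
    annotations: list = field(default_factory=list)


def collection_reader(documents):
    """Yield one CAS per document in the collection."""
    for text in documents:
        yield Cas(text)


def word_annotator(cas):
    """Toy analysis engine: add (type, start, end) for each word."""
    pos = 0
    for word in cas.text.split():
        start = cas.text.index(word, pos)
        pos = start + len(word)
        cas.annotations.append(("word", start, pos))
    return cas


def run_pipeline(documents, engines, consumer):
    """Pass each CAS through the engines in order, then to the consumer."""
    for cas in collection_reader(documents):
        for engine in engines:
            cas = engine(cas)
        consumer(cas)


index = {}


def index_consumer(cas):
    """Toy CAS consumer: build a word -> documents index."""
    for _, start, end in cas.annotations:
        index.setdefault(cas.text[start:end], []).append(cas.text)


run_pipeline(["open daily", "open Friday"], [word_annotator], index_consumer)
```

The point of the shape is that engines only read and enrich the shared structure, so they can be reordered or composed without knowing about each other.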


• Analysis Engine (AE):
  – a component that analyzes artifacts (e.g. documents) and infers information about them
  – consists of two parts:
    • Java classes (typically packaged as one or more JAR files) and
    • AE descriptors (one or more XML files), containing
      – the configuration settings for the Analysis Engine as well as
      – a description of the AE’s input and output requirements

[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]


UIMA Application

• several annotators (like a pipeline):
  – museum pattern
  – time pattern
  – interval of times
  – restrictions
  – museum information

• techniques used by the annotators:
  – regular expressions
  – window covering two time intervals and a restriction
  – window covering a museum and opening hours

• example input:
  ... *Fraunces Tavern Museum*
  54 Pearl St. - 1-212-425-1778
  Tuesday-Friday, 12pm–5pm; …

• output, Prolog facts:
  museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
  museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
  museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
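Facts of the shape shown above could be produced by a small formatting step at the end of the pipeline. The Python sketch below only illustrates that formatting; it is not the exporter used in the experiment, and `prolog_atom` and `museumopen_fact` are hypothetical helpers. It does not decide which dates to emit.

```python
from datetime import date, datetime, time


def prolog_atom(text):
    """Quote a string as a Prolog atom, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def museumopen_fact(name, day, opens, closes):
    """Format one museumopen/3 fact with ISO 8601 timestamps."""
    start = datetime.combine(day, opens).isoformat()
    end = datetime.combine(day, closes).isoformat()
    return f"museumopen({prolog_atom(name)}, {prolog_atom(start)}, {prolog_atom(end)})."


fact = museumopen_fact(
    "Fraunces Tavern Museum", date(2005, 12, 1), time(12), time(17)
)
```

Quoting through one helper keeps names that contain an apostrophe from breaking the generated facts.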


UIMA: Results

• information annotated in the documents:
  – names of museums, hotels
  – times, time intervals
  – time restrictions
  – prices, intervals of prices (hotel prices)
  – keywords for museum category
  – names of pharaohs (annotated with a correction of misspellings)

• hotel and museum information is exported into Prolog facts and into a short textual summary
  – templates filled with the detected information
    • hotels: Price information about Cosmopolitan Hotel : $157
    • museums:
      *** *Fraunces Tavern Museum* ***
      Open from 12:00:00 to 17:00:00;
      Restriction: Tuesday-Friday
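Filling a template with detected values, as in the museum summary above, might look like the following Python sketch. The template text copies the example; the field names and the `museum_summary` function are assumptions, not the authors' implementation.

```python
# Template modelled on the museum summary shown on the slide.
SUMMARY_TEMPLATE = (
    "*** *{name}* ***\n"
    "Open from {opens} to {closes};\n"
    "Restriction: {restriction}"
)


def museum_summary(record):
    """Fill the summary template from a dict of detected values."""
    return SUMMARY_TEMPLATE.format(**record)


summary = museum_summary({
    "name": "Fraunces Tavern Museum",
    "opens": "12:00:00",
    "closes": "17:00:00",
    "restriction": "Tuesday-Friday",
})
```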


UIMA: Evaluation

documentation?

- good
- illustrative examples (tutorial)
- completeness: some topics are described only very briefly
- prior knowledge about Java and Eclipse is helpful


UIMA: Evaluation

processing and linguistic resources?

- annotators only from the tutorial:
  - sentence annotation
  - word annotation
  - date/time annotators
  - examples for using regular expressions etc.
- external resources can be integrated:
  - lexical resources as external resources (text files)
  - existing processing resources: implementation of an interface is necessary


UIMA: Evaluation

tools for resource maintenance and extension?

- specific Eclipse component editors, or
- simple text editors


UIMA: Evaluation

speed of processing?

- faster than GATE?
- in CPE: detailed information about processing time for each module
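Per-module timing of the kind reported above can be added to any pipeline by wrapping each stage. The Python sketch below shows one way to do that; it is unrelated to how the UIMA CPE or GATE measure time internally, and `run_timed` and the two toy stages are assumptions for illustration.

```python
import time


def run_timed(stages, document):
    """Run (name, function) stages in order and record seconds spent in each."""
    timings = {}
    data = document
    for name, stage in stages:
        started = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - started
    return data, timings


# Two toy stages standing in for real annotators.
stages = [("lowercase", str.lower), ("tokenize", str.split)]
result, timings = run_timed(stages, "Open DAILY 9am-5pm")
```

Summing `timings` over a corpus would give the per-module totals needed to compare two pipelines on the same documents.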


UIMA: Evaluation

single docs vs. large corpora?

- Collection Reader:
  - document(s) from a directory
- adapt extensions in the preprocessing step (CAS Initializer)
  - e.g., extraction of text fragments from an HTML document
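The preprocessing step mentioned here, extracting text from HTML while keeping paragraph boundaries, can be sketched with Python's standard `html.parser`. This is an illustration of the idea only, not a UIMA CAS Initializer; `ParagraphExtractor` and `detag` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text outside tags and the (start, end) offsets of each <p>."""

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments seen so far
        self.length = 0        # length of the de-tagged text so far
        self.paragraphs = []   # offsets into the de-tagged text
        self._open = None      # start offset of the open <p>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.close_paragraph()
            self._open = self.length

    def handle_endtag(self, tag):
        if tag == "p":
            self.close_paragraph()

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)

    def close_paragraph(self):
        if self._open is not None:
            self.paragraphs.append((self._open, self.length))
            self._open = None


def detag(html):
    """Return the de-tagged text and a list of paragraph offsets."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.close_paragraph()  # handle a <p> that was never closed
    return "".join(parser.parts), parser.paragraphs


text, paragraphs = detag("<body><p>Open 9am-5pm daily</p><p>Closed Friday</p></body>")
```

Because the offsets refer to the de-tagged text, later annotators can work on plain text and still know where each paragraph began.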


UIMA: Evaluation

limitations, suggestions for improvement?

• no limitations:
  – everything is possible, but implementation or interfacing must be done by the user

• wish:
  – more processing and linguistic resources within the distribution


UIMA: Evaluation

im-/export of document formats?

- import: CAS Initializer
- export: CAS Consumer
  - transform annotations into any other format
  - export of
    - document + annotations
    - only annotations
- required: Java application


Overview

• Introduction

• GATE

• UIMA

• Conclusion


Conclusion

• intended use
  – GATE: academic/scientific application
    • tools available
    • comfortable GUI
  – UIMA: more commercial
    • plain framework
    • simplified definition of (complex) result structures
    • simplified pre- and postprocessing of annotations

• in sum: incommensurable


Conclusion

• both are extensible

• no final judgement about whether to use GATE or UIMA
  – depends on
    • your task
      – task description
      – expected results
      – which processing resources are necessary
    • your preferences for the interface
      – prefer the Eclipse environment (or other Java editors)
      – prefer a comfortable GUI

• or use both


Conclusion

• found in the UIMA Forum:

I see UIMA and GATE as complementary rather than competitive, and each can gain from the strengths of the other.

GATE was originally developed as a research tool, and has features suited to rapid prototyping of text processing code, like JAPE (a language for defining finite-state transducers over annotations on a document).

UIMA is more targetted at robust deployment of applications, with strong typing of feature structures and better support for distributed processing.

We're currently working on writing a translation layer to allow UIMA analysis components to be used in GATE and vice-versa. It's not in a releasable state just yet, but we hope to release something in the near future. Keep your eye on http://gate.ac.uk/ for details.

Ian Roberts (GATE developer)
