Empirical Methods in Information Extraction - Claire Cardie, Natural Language Processing Lab, 한 경 수...
Empirical Methods in Information Extraction[Cardie97]
Contents
Introduction
The Architecture of an Information Extraction System
The Role of Corpus-Based Language Learning Algorithms
Learning Extraction Patterns
Coreference Resolution and Template Generation
Future Directions
Introduction(1/2)
Information Extraction System
  inherently domain specific
  takes as input an unrestricted text and summarizes the text with respect to a prespecified topic or domain of interest. (Figure 1)
  skims a text to find relevant sections and then focuses only on these sections.
Applications
  analyzing…
    terrorist activities, business joint ventures, medical patient records, …
  building…
    KB from web pages, job listing DB from newsgroups / web sites / advertisements, weather forecast DB from web pages, ...
MUC performance evaluation
  recall = (# correct slot-fillers in output template) / (# slot-fillers in answer key)
  precision = (# correct slot-fillers in output template) / (# slot-fillers in output template)
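The recall and precision definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes each slot filler is a (slot, value) pair and that only exact matches count as correct; the function name `score` and the sample template are hypothetical, and the sketch is not the official MUC scorer.

```python
def score(output_fillers, key_fillers):
    """Return (recall, precision) for one template.

    Assumption: a filler is correct only if the same (slot, value) pair
    appears in the answer key.
    """
    output = set(output_fillers)
    key = set(key_fillers)
    correct = len(output & key)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(output) if output else 0.0
    return recall, precision

# Hypothetical natural-disaster template, for illustration only.
answer_key = {("event", "tornado"), ("location", "Dallas"), ("damage", "mobile homes")}
system_output = {("event", "tornado"), ("location", "Texas")}

recall, precision = score(system_output, answer_key)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here one of the three key fillers is found (recall 1/3) and one of the two output fillers is correct (precision 1/2).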
Introduction(2/2)
Problems in today’s IE systems
  accuracy
    the errors of an automated IE system are … due to its relatively shallow understanding of the input text
    errors are difficult to track down and to correct
  portability
    domain-specific nature
    manually modifying and adding domain-specific linguistic knowledge to an existing NLP system is slow and error-prone.
We will see that empirical methods for IE are corpus-based, machine learning algorithms.
The Architecture of an IE System(1/2)
Approaches to IE in the early days
  traditional NLP techniques vs. keyword matching techniques
Standard architecture for IE systems (Figure 2)
  tokenization and tagging
    tag each word with respect to POS and possibly semantic class
  sentence analysis
    one or more stages of syntactic analysis
    identify…
      noun/verb groups, prepositional phrases, subjects, objects, conjunctions, …
      semantic entities relevant to the extraction topic
    the system need only perform partial parsing
      looks for fragments of text that can be reliably recognized
      the ambiguity resolution decisions can be postponed
The Architecture of an IE System(2/2)
Standard architecture for IE systems (continued)
  extraction
    the first entirely domain-specific component
    identifies domain-specific relations among relevant entities in the text
  merging
    coreference resolution, or anaphora resolution
      determines whether each entity refers to an existing entity or whether it is new
      determines the implicit subjects of all verb phrases
    discourse-level inference
  template generation
    determines the number of distinct events in the text
    maps the individually extracted pieces of information onto each event
    produces output templates
    the best place to apply domain-specific constraints
    some slots require set fills, or require normalization of their fillers.
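The five stages of the standard architecture can be sketched as a chain of functions over a shared state. This is a minimal Python sketch under loud assumptions: every stage body is a toy stand-in (capitalization-based tagging, string-identity merging, a single "location" slot), and all function and key names are hypothetical rather than taken from any real IE system.

```python
def tokenize_and_tag(state):
    # Toy tagger (assumption): a capitalized word that does not start a
    # sentence is tagged as a proper noun, everything else as "O".
    words = state["text"].replace(".", " . ").split()
    tokens, sentence_start = [], True
    for w in words:
        tag = "NNP" if w[0].isupper() and not sentence_start else "O"
        tokens.append((w, tag))
        sentence_start = (w == ".")
    state["tokens"] = tokens
    return state

def sentence_analysis(state):
    # Partial parsing: keep only fragments we can recognize reliably,
    # here maximal runs of proper nouns.
    groups, current = [], []
    for word, tag in state["tokens"]:
        if tag == "NNP":
            current.append(word)
        elif current:
            groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    state["noun_groups"] = groups
    return state

def extraction(state):
    # First domain-specific stage (toy rule): every noun group is a location.
    state["extracted"] = [("location", g) for g in state["noun_groups"]]
    return state

def merging(state):
    # Toy coreference: identical strings are treated as the same entity.
    merged = []
    for item in state["extracted"]:
        if item not in merged:
            merged.append(item)
    state["entities"] = merged
    return state

def template_generation(state):
    # Map the merged entities onto the slots of a single output template.
    template = {}
    for slot, filler in state["entities"]:
        template.setdefault(slot, []).append(filler)
    state["template"] = template
    return state

PIPELINE = [tokenize_and_tag, sentence_analysis, extraction, merging, template_generation]

def run(text):
    state = {"text": text}
    for stage in PIPELINE:
        state = stage(state)
    return state["template"]

template = run("A tornado hit Dallas. The storm left Dallas at noon.")
print(template)
```

Keeping each stage behind the same function signature is what would let a learned component replace a hand-coded one without touching the rest of the chain.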
The Role of Corpus-Based Language Learning Algorithms(1/3)
Q: How have researchers used empirical methods in NLP to improve the accuracy and portability of IE systems?
A: Corpus-based language learning algorithms have been used to improve individual components of the IE system.
For language tasks that are domain-independent and syntactic
  annotated corpora already exist
  POS tagging, partial parsing, WSD
  the importance of WSD for the IE task remains unclear.
NL learning techniques are more difficult to apply to subsequent stages of IE.
  learning extraction patterns, coreference resolution, template generation
The Role of Corpus-Based Language Learning Algorithms(2/3)
The problems of applying empirical methods
  no corpora annotated with the appropriate semantic & domain-specific supervisory information
    corpus for IE = <text, output template>
    the output templates …
      say nothing about which occurrence of the string is responsible for the extraction
      provide no direct means for learning patterns to extract symbols not necessarily appearing anywhere in the text (set fills)
  the semantic & domain-specific language-processing skills require the output of earlier levels of analysis (tagging & partial parsing).
    this complicates the generation of training examples
    whenever the behavior of these earlier modules changes,
      new training examples must be generated
      the learning algorithms for later stages must be retrained
    learning algorithms must deal with noise caused by errors from earlier components
    new algorithms need to be developed
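The two weaknesses of a <text, output template> corpus can be seen in a small Python sketch. It assumes a template is a plain mapping from slot name to filler string and that candidate training mentions are found by exact string search; the text, the slot names, and the function `candidate_mentions` are hypothetical.

```python
import re

def candidate_mentions(text, template):
    """Map each slot to the character offsets where its filler string occurs."""
    found = {}
    for slot, filler in template.items():
        found[slot] = [m.start() for m in re.finditer(re.escape(filler), text)]
    return found

text = ("A tornado struck Dallas on Monday. "
        "Officials in Dallas reported damage to forty homes.")
# "SEVERE" stands in for a set fill: a symbol chosen from a fixed list
# that need not appear anywhere in the text.
template = {"location": "Dallas", "damage_level": "SEVERE"}

mentions = candidate_mentions(text, template)
for slot, offsets in mentions.items():
    print(slot, offsets)
```

The "location" slot matches two occurrences, and the template gives no hint as to which one licensed the extraction; the set fill matches nothing, so string search alone yields no training example for it.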
The Role of Corpus-Based Language Learning Algorithms(3/3)
Data-driven nature of corpus-based approaches
  accuracy
    when the training data is derived from the same type of texts that the IE system is to process, the acquired language skills are automatically tuned to that corpus, increasing the accuracy of the system.
  portability
    because each NLU skill is learned automatically rather than being manually coded, that skill can be moved quickly from one IE system to another by retraining the appropriate component.
Learning Extraction Patterns(1/5)
The role for empirical methods in the Extraction phase
  knowledge acquisition: to automate the acquisition of good extraction patterns
AutoSlog[Riloff 1993]
  learns extraction patterns in the form of domain-specific concept node definitions for use with the CIRCUS parser. (Figure 3)
  learns concept node definitions via a one-shot learning algorithm
  background knowledge
    a small set of general linguistic patterns (approximately 13)
  requires a human feedback loop, which filters bad extraction patterns
  accuracy: 98%, portability: 5 hours
  critical step towards building IE systems that are trainable entirely by end-users
(Figure 4)
Learning Extraction Patterns(2/5)
Given: a noun phrase to be extracted
1. Find the sentence from which the noun phrase originated.
2. Present the sentence to the partial parser for processing.
3. Apply the linguistic patterns in order.
4. When a pattern applies, generate a concept node definition from the matched constituents, their context, the concept type provided in the annotation for the target noun phrase, and the predefined semantic class for the filler.
<active-voice-verb> followed by <target-np>=<direct object>
Concept = <<concept> of <target-np>>
Trigger = “<<verb> of <active-voice-verb>>”
Position = direct-object
Constraints = ((<<semantic class> of <concept>>))
Enabling Conditions = ((active-voice))
AutoSlog’s Learning Algorithm
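Steps 3 and 4 of the algorithm, together with the single pattern shown on this slide, can be sketched in Python as follows. The sketch assumes the partial parser has already produced a clause as a dictionary with `verb`, `voice`, and `direct_object` keys; that representation, the class `ConceptNode`, and the example sentence fragment are all hypothetical, and only one of the roughly 13 linguistic patterns is implemented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConceptNode:
    concept: str
    trigger: str
    position: str
    constraints: tuple
    enabling_conditions: tuple

def active_verb_direct_object(clause, target_np, concept, semantic_class):
    """Pattern: <active-voice-verb> followed by <target-np> = <direct object>."""
    if clause["voice"] != "active" or clause["direct_object"] != target_np:
        return None
    return ConceptNode(
        concept=concept,                      # concept type from the annotation
        trigger=clause["verb"],               # the active-voice verb
        position="direct-object",
        constraints=(semantic_class,),        # predefined semantic class of the filler
        enabling_conditions=("active-voice",),
    )

# Applied in order; a real pattern list would be longer.
PATTERNS = [active_verb_direct_object]

def learn_concept_node(clause, target_np, concept, semantic_class):
    for pattern in PATTERNS:
        node = pattern(clause, target_np, concept, semantic_class)
        if node is not None:
            return node      # one-shot: the first matching pattern wins
    return None

# Hypothetical parsed clause for "the twister destroyed two mobile homes".
clause = {"subject": "the twister", "verb": "destroyed",
          "voice": "active", "direct_object": "two mobile homes"}
node = learn_concept_node(clause, "two mobile homes", "damage", "physical-object")
print(node)
```

Because a definition is proposed from a single annotated phrase, overly general triggers are possible, which is where the human filtering step comes in.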
Learning Extraction Patterns(3/5)
PALKA[Kim & Moldovan 1995]
  background knowledge
    concept hierarchy
      a set of keywords that can be used to trigger each pattern
      comprises a set of generic semantic case frame definitions for each type of information to be extracted
    semantic class lexicon
CRYSTAL[Soderland 1995]
  triggers comprise a much more detailed specification of linguistic context
  employs a covering algorithm
  medical diagnosis domain
    precision: 50-80%, recall: 45-75%
Learning Extraction Patterns(4/5)
1. Begin by generating the most specific concept node possible for every phrase to be extracted in the training texts.
2. For each concept node C
2.1. Find the most similar concept node C’.
2.2. Relax the constraints of each just enough to unify C and C’.
2.3. Test the new extraction pattern P against the training corpus.
If (error rate < threshold)
then Add P; Replace C and C’
else stop.
CRYSTAL’s Learning Algorithm
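The specific-to-general loop above can be sketched in Python. This is a simplified illustration and not CRYSTAL itself: it assumes every training instance is a dictionary over the same small set of features, represents a concept node as one set of allowed values per feature, measures similarity by counting differing constraints, and compares each definition only with those not yet processed. All names and the toy data are hypothetical.

```python
def most_specific(instance):
    """Step 1: a definition that matches exactly one training instance."""
    return {f: frozenset([v]) for f, v in instance.items()}

def covers(definition, instance):
    return all(instance[f] in allowed for f, allowed in definition.items())

def distance(a, b):
    """Number of features whose constraints differ (a crude similarity)."""
    return sum(1 for f in a if a[f] != b[f])

def unify(a, b):
    """Step 2.2: relax each constraint just enough to cover both."""
    return {f: a[f] | b[f] for f in a}

def subsumes(general, specific):
    return all(specific[f] <= general[f] for f in general)

def error_rate(definition, positives, negatives):
    hits = sum(covers(definition, x) for x in positives)
    misses = sum(covers(definition, x) for x in negatives)
    return misses / (hits + misses) if hits + misses else 0.0

def crystal(positives, negatives, threshold):
    pending = []
    for instance in positives:
        definition = most_specific(instance)
        if definition not in pending:
            pending.append(definition)
    learned = []
    while pending:                                   # step 2: each definition C
        current = pending.pop(0)
        while pending:
            nearest = min(pending, key=lambda d: distance(current, d))  # 2.1
            candidate = unify(current, nearest)                         # 2.2
            if error_rate(candidate, positives, negatives) >= threshold:  # 2.3
                break                                # too many errors: stop
            current = candidate                      # P replaces C and C'
            pending = [d for d in pending if not subsumes(candidate, d)]
        learned.append(current)
    return learned

positives = [{"verb": "destroyed", "obj_class": "building"},
             {"verb": "damaged", "obj_class": "building"},
             {"verb": "destroyed", "obj_class": "vehicle"}]
negatives = [{"verb": "visited", "obj_class": "building"},
             {"verb": "damaged", "obj_class": "vehicle"}]

rules = crystal(positives, negatives, threshold=0.2)
for rule in rules:
    print({f: sorted(v) for f, v in rule.items()})
```

With these toy instances, merging the two building definitions introduces no errors, but widening further to vehicles would also cover a negative instance and raise the error rate to 0.25, so generalization stops there and the vehicle definition stays specific.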
Learning Extraction Patterns(5/5)
Comparison
  AutoSlog
    general to specific
    human feedback
  PALKA
    generalization & specialization
    automated feedback
    requires more background knowledge
  CRYSTAL
    specific to general (covering algorithm)
    automated feedback
    requires more background knowledge
Research issues
  handling set fills
  type of the extracted information
  evaluation
  determining which method for learning extraction patterns will give the best results in a new extraction domain
Coreference Resolution and Template Generation(1/3)
Discourse processing is a major weakness of existing IE systems
  generating good heuristics is challenging
  assume as input fully parsed sentences
  must take into account the accumulated errors
  must be able to handle the myriad forms of coreference across different domains
Coreference problem as a classification task (Figure 5)
  given two phrases and the context in which they occur,
  classify the phrases with respect to whether or not they refer to the same object
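Casting coreference as classification means turning every pair of phrases into one labeled instance. The Python sketch below shows one way to build such instances, assuming each phrase comes with a hand-assigned entity id; the four features and all field names are hypothetical stand-ins for whatever feature set a real system would use, and the classifier itself (for example a decision tree learner) is left out.

```python
from itertools import combinations

def pair_features(a, b):
    """Features describing a pair of phrases and their context."""
    return {
        "same_string": a["text"].lower() == b["text"].lower(),
        "same_head": a["head"].lower() == b["head"].lower(),
        "second_is_pronoun": b["is_pronoun"],
        "sentence_distance": b["sentence"] - a["sentence"],
    }

def training_instances(phrases):
    """One (features, label) pair per phrase pair; the label is True
    when both phrases were annotated with the same entity id."""
    return [(pair_features(a, b), a["entity"] == b["entity"])
            for a, b in combinations(phrases, 2)]

# Hypothetical annotated phrases from a joint-venture text.
phrases = [
    {"text": "Acme Corp.", "head": "Corp.", "is_pronoun": False, "sentence": 0, "entity": 1},
    {"text": "the company", "head": "company", "is_pronoun": False, "sentence": 1, "entity": 1},
    {"text": "Widget Inc.", "head": "Inc.", "is_pronoun": False, "sentence": 1, "entity": 2},
]

instances = training_instances(phrases)
for features, label in instances:
    print(label, features)
```

Three phrases yield three pairs, only one of which is coreferent; the imbalance grows quickly with document length, which is one reason the choice of features matters.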
Coreference Resolution and Template Generation(2/3)
MLR[Aone & Bennett 1995]
  uses the C4.5 decision tree induction system
  tested on the Japanese corpus for business joint ventures
  uses an automatically generated data set
  66 domain-independent features
  evaluated using data sets derived from 250 texts
    recall: 67-70%, precision: 83-88%
RESOLVE[McCarthy & Lehnert 1995]
  uses the C4.5 decision tree induction system
  tested on the English corpus for business joint ventures (MUC-5)
  uses a manually generated, noise-free data set
  includes domain-specific features
  evaluated using data sets derived from 50 texts
    recall: 80-85%, precision: 87-92%
Coreference Resolution and Template Generation(3/3)
The results for coreference resolution are promising
  possible to develop automatically trainable coreference systems that can compete favorably with manually designed systems
  specially designed learning algorithms need not be developed
  symbolic ML techniques offer a mechanism for evaluating the usefulness of different knowledge sources
Still, much research remains to be done
  additional types of anaphors
  using a variety of feature sets
  the role of domain-specific information for coreference resolution
  the relative effect of errors from the preceding phases of text analysis
Trainable systems that tackle Merging & Template Generation
  TTG[Dolan 1991], Wrap-Up[Soderland & Lehnert 1994]
    generate a series of decision trees
Future Directions
Unsupervised learning algorithms
  a means for sidestepping the lack of large, annotated corpora
Techniques that allow end-users to quickly train IE systems
  through interaction with the system over time
  without intervention by NLP system developers
IE System in the Domain of Natural Disasters
Concept Node for Extracting “Damage” Information
Concept Node Definition: domain-specific semantic case frame (one slot per frame)
Concept: the type of concept to be recognized
Trigger: the word that activates the pattern
Position: the syntactic position where the concept is expected to be found
Constraint: selectional restrictions that apply to any potential instance of the concept
Enabling Conditions: constraints on the linguistic context of the triggering word that must be satisfied before the pattern is activated
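How the five fields work together at extraction time can be sketched in Python. The sketch assumes a parsed clause is a dictionary holding the verb, a set of linguistic conditions, and one entry per syntactic position, and that semantic classes come from a small lookup table; the field values, the lexicon, and the function `apply_concept_node` are hypothetical illustrations, not the CIRCUS parser's actual representation.

```python
# Hypothetical semantic class lexicon.
SEMANTIC_CLASS = {"mobile homes": "physical-object", "residents": "human"}

# A concept node for "damage" information, with illustrative field values.
damage_node = {
    "concept": "damage",
    "trigger": "destroyed",
    "position": "direct-object",
    "constraints": {"physical-object"},
    "enabling_conditions": {"active-voice"},
}

def apply_concept_node(node, clause):
    """Return (concept, filler) if the node fires on this clause, else None."""
    if clause["verb"] != node["trigger"]:
        return None                                   # trigger word absent
    if not node["enabling_conditions"] <= clause["conditions"]:
        return None                                   # linguistic context not satisfied
    filler = clause.get(node["position"])
    if filler is None:
        return None                                   # nothing in the expected position
    if SEMANTIC_CLASS.get(filler) not in node["constraints"]:
        return None                                   # selectional restriction violated
    return (node["concept"], filler)

clause = {"verb": "destroyed", "conditions": {"active-voice"},
          "subject": "the twister", "direct-object": "mobile homes"}
extracted = apply_concept_node(damage_node, clause)
print(extracted)
```

Swapping the direct object for "residents" makes the constraint check fail, so nothing is extracted even though the trigger and enabling conditions match.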