Elizabeth D. Liddy
Center for Natural Language Processing
School of Information Studies
Syracuse University

Developing & Evaluating Metadata for Improved Information Access
Background
• Breaking the Metadata Generation Bottleneck
  – 1st NSDL project (2000-2002)
  – Adapted Natural Language Processing technology for automatic metadata generation
  – 15 Dublin Core + 8 GEM education elements
• Project had a modest evaluation study
  – Results suggested that automatically generated metadata was qualitatively nearly equal to manually generated metadata
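As a rough illustration of what a generated record with Dublin Core elements might look like, here is a minimal sketch using Python's xml.etree.ElementTree. The element values are hypothetical and this is not the actual MetaExtract output format.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def build_dc_record(fields):
    """Build a minimal Dublin Core record from a dict of element -> value."""
    record = ET.Element("record")
    for name, value in fields.items():
        elem = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        elem.text = value
    return ET.tostring(record, encoding="unicode")

# Hypothetical element values, for illustration only.
xml = build_dc_record({
    "title": "Photosynthesis Lesson Plan",
    "subject": "Biology",
    "language": "en",
})
print(xml)
```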
Types of Features
• Linguistic
  – Root forms of words
  – Part-of-speech tags
  – Phrases (Noun, Verb, Proper Noun)
  – Categories (Person, Geographic, Organization)
  – Concepts (sense-disambiguated words / phrases)
  – Semantic Relations
  – Events
• Non-linguistic
  – Length of document
  – HTML and XML tags
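A toy sketch of how a few of these features might be computed. A real system like the one described would use full NLP (morphological analysis, POS tagging, named-entity recognition); the naive suffix stripping and capitalization heuristics below are illustrative assumptions only.

```python
import re

def extract_features(text):
    """Toy illustration of a few linguistic + non-linguistic features."""
    tokens = re.findall(r"[A-Za-z]+", text)
    # Crude "root forms": strip a few common suffixes. A real system
    # would use morphological analysis rather than this heuristic.
    roots = [re.sub(r"(ing|ed|es|s)$", "", t.lower()) for t in tokens]
    # Candidate proper-noun phrases: runs of capitalized words.
    proper = re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+|[A-Z][a-z]+", text)
    return {
        "roots": roots,
        "proper_noun_candidates": proper,
        "doc_length": len(tokens),   # non-linguistic feature
    }

feats = extract_features("Syracuse University developed metadata tools.")
```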
NLP-Based Metadata Generation

[Architecture diagram: MetaExtract. An HTML document passes through an HTML Converter and a PreProcessor (tf/idf keywords), then an eQuery Extraction Module and a Metadata Retrieval Module; an Output Gathering Program produces an XML document with metadata. Extracted elements include Title, Description, Keywords, Creator, Publisher, Date, Catalog Date, Rights, Format, Language, Resource Type, Grade/Level, Duration, Pedagogy, Audience, Standard, Relation, Essential Resources, and Cataloger.]
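The PreProcessor's tf/idf keyword stage can be sketched in a few lines. This is a generic tf-idf ranking, not the project's actual implementation, and the sample documents are invented.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of one document by tf-idf against a small corpus."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    tf = Counter(tokenized[doc_index])
    scores = {t: (c / len(tokenized[doc_index])) * math.log(n / df[t])
              for t, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = [
    "photosynthesis lesson plants light energy",
    "algebra lesson equations variables",
    "plants cells biology lesson",
]
print(tfidf_keywords(docs, 0))
```

Terms appearing in every document (like "lesson") get an idf of zero and drop out of the keyword list, which is the behavior wanted from a keyword extractor.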
MetaTest Research Questions
• Do we need metadata for information access?
  – Why?
• How much metadata do we need?
  – For what purposes?
• Which elements do we need?
  – For which digital library tasks?
• How is metadata utilized by information-seekers?
  – When browsing / searching / previewing?
• Can automatically generated metadata perform as well as manually assigned metadata?
  – For browsing / searching / previewing?
Three Types of Evaluation of Metadata
1. Human expert qualitative review
2. Eye-tracking in searching & browsing tasks
3. Quantitative information retrieval experiment with 3 conditions:
   1. Automatically assigned metadata
   2. Manually assigned metadata
   3. Full-text indexing
Evaluation Methodology
1. System automatically meta-tagged a Digital Library collection that had already been manually tagged.
2. Solicited subject pool of teachers via listservs.
3. Had users qualitatively evaluate metadata tags.
4. Conducted searching & browsing experiments.
5. Monitored with eye-tracking & post-search interviews.
6. Observed relative utility of each metadata element for both tasks.
7. Are now preparing for an IR experiment to compare 2 types of metadata generation + full-text indexing.
Who Were the Respondents?

Type of Educator
  Elementary Teacher 6%
  Middle School Teacher 6%
  High School Teacher 66%
  Higher Education Teacher 6%
  Instructional Designer 3%
  School Media 3%
  Other 11%

Experience with Lesson Plans
  <1 Year 6%
  1-3 Years 29%
  3-9 Years 29%
  10+ Years 37%

Subject Taught
  Science 69%
  Math 6%
  Engineering 3%
  Combination 11%
  Other 11%
Metadata Element Coverage
• For 35 lesson plans & learning activities from the GEM Gateway
• Metadata elements present on automatic vs. manually generated records

[Bar chart: number of educational resources (0-40) on which each metadata element is present, manually vs. automatically generated. Elements: Title, Subject/Keywords, Description, Grade Level, Relations, Pedagogy-Method, Duration, Materials, Pedagogy-Grouping, Pedagogy-Process, Pedagogy-Assess.]
Qualitative Statistical Analysis
• 35 subjects evaluated 7 resources + metadata records
• 234 total cases
• Ordinal-level data measuring metadata quality
  – Unsure, Very Poorly, Poorly, Well, Very Well
• Mann-Whitney Test of Independent Pairs
  – Non-parametric test
  – Accepts ordinal data
  – Does not require normal distribution, homogeneity of variance, or same sample size
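The Mann-Whitney U statistic itself is simple to compute by hand. A minimal pure-Python sketch with tie handling follows; the p-value lookup is omitted for brevity, and in practice one would use a statistics package.

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic with average ranks for ties.

    Suitable for ordinal data; makes no normality assumption.
    """
    # Rank the pooled samples, remembering each value's origin.
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg = (i + j + 1) / 2          # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[combined[k][1]] = avg
        i = j
    r1 = sum(ranks[:len(x)])           # rank sum of the first sample
    u1 = r1 - len(x) * (len(x) + 1) / 2
    return min(u1, len(x) * len(y) - u1)
```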
Medians of Metadata Element Quality

(Each cell: Median Score / Inter-Quartile Range / Mean Rank)

Element              Manual Quality        Automatic Quality
Title                3 / 2-4 / 132         3 / 1-4 / 105
Description          3 / 3-4 / 122         3 / 2-4 / 113
Grade                3 / 2-4 / 73          3 / 3-4 / 80
Keyword              3 / 3-4 / 127         3 / 2-4 / 99
Duration             3 / 2.75-4 / 29       3 / 2-3.25 / 25
Material             3.5 / 3-4 / 49        3 / 2-4 / 39
Pedagogy Method      3 / 0.5-3 / 30        3 / 2-4 / 33
Pedagogy Process     --                    3 / 1-3 / 53
Pedagogy Assessment  --                    3 / 2-3.5 / 9
Pedagogy Group       3 / 1.5-3 / 14        3 / 2.5-4 / 18

No statistical difference for 8 of 10 elements.
Minimally statistically significant better manual metadata for Title and Keyword elements.
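Each cell's median and inter-quartile range can be computed with Python's statistics module, assuming the ordinal ratings are coded numerically (Very Poorly=1 through Very Well=4); the ratings below are hypothetical, not the study's data.

```python
import statistics

# Hypothetical ordinal ratings for one element, coded
# Very Poorly=1, Poorly=2, Well=3, Very Well=4.
ratings = [2, 3, 3, 3, 4, 4, 3, 2]

median = statistics.median(ratings)
# "inclusive" treats the sample itself as the population, a common
# convention for small rating sets.
q1, _, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
print(median, f"{q1}-{q3}")
```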
Eye Tracking in Digital Libraries
• How do users of Digital Libraries use and process metadata?
  – Test on three conditions:
    • Records with descriptions
    • Records with metadata
    • Records with both descriptions and metadata
What the Eyes Can Tell Us
• Indices of ocular behavior are used to infer cognitive processing, e.g.,
  – Attention
  – Decision making
  – Knowledge organization
  – Encoding and access
• The longer an individual fixates on an area, the more difficult or complex that information is to process.
• The first few fixations indicate areas of particular importance or informativeness.
User Study: Data Collection
User wears an eye-tracking device while browsing or searching STEM educational resources
The eye fixations (stops) and saccades (gaze paths) are recorded.
Fixations enable a person to gather information. No information can be acquired during saccades.
The colors represent different intervals of time (from green through red).
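Separating fixations from saccades in a recorded gaze stream is commonly done with a displacement (velocity) threshold. A minimal sketch of that idea follows; the gaze coordinates and the 10-pixel threshold are made up, and the study's actual eye-tracking software is not specified.

```python
def classify_samples(points, threshold=10.0):
    """Velocity-threshold style classification: a sample whose displacement
    from the previous sample is below `threshold` pixels belongs to a
    fixation; larger jumps are saccades.
    """
    labels = ["fixation"]  # first sample has no velocity; assume fixation
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        labels.append("fixation" if dist < threshold else "saccade")
    return labels

# Hypothetical gaze samples: three near the top left, then a jump.
gaze = [(100, 100), (102, 101), (103, 99), (300, 250), (301, 251)]
print(classify_samples(gaze))
```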
Methods
• Pre-exposure search attempt– 3 trials of entering search terms using free
text, modifiers, boolean expressions etc.• Exposure to test stimuli
– Information in 1 of 3 formats • Metadata only• Description only• Metadata and Description
– Eye tracking during exposure• Post- exposure search & follow-up interview
Scanpath of Metadata Only Condition
Graphically Understanding the Data
Contour map shows the aggregate of eye fixations.
Peak fixation areas are Description element, with some interest in URL and subject elements.
Note dominance of upper left side.
LookZone shows amount of time spent in each zone of record.
User spent 27 seconds, or 54% of viewing time, looking at the Description metadata element.
Very little time was spent on other elements.
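The per-zone dwell times behind a LookZone display reduce to summing fixation durations inside each zone's rectangle. A sketch follows, with hypothetical zone coordinates and fixations chosen so the Description zone receives 27 of 50 seconds (54%), mirroring the example above.

```python
def zone_percentages(fixations, zones):
    """Sum fixation durations per rectangular zone and convert to
    percentages of total viewing time.

    fixations: list of (x, y, duration_sec)
    zones: dict name -> (x_min, y_min, x_max, y_max)
    """
    totals = {name: 0.0 for name in zones}
    grand = 0.0
    for x, y, dur in fixations:
        grand += dur
        for name, (x0, y0, x1, y1) in zones.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                totals[name] += dur
                break
    return {name: 100 * t / grand for name, t in totals.items()}

# Hypothetical record layout and fixations, for illustration only.
zones = {"description": (0, 100, 600, 300), "title": (0, 0, 600, 99)}
fix = [(50, 150, 27.0), (50, 50, 13.0), (700, 400, 10.0)]
print(zone_percentages(fix, zones))
```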
Preliminary Findings: Eye Tracking
• Narrative resources are viewed in linear order, but metadata is not.
• Titles and sources are the most-viewed metadata.
• First few sentences in resource are read more carefully; the rest is skimmed.
• Before selecting a resource, users re-visit the record for confirmation.
• Subjects focus on narrative descriptions when both descriptions & metadata are on same page.
Preliminary Findings: Interview Data
• 65% changed their initial search terms after exposure to test stimuli.
• 20% indicated they would use their chosen document for the intended purpose.
• 60% said they learned something from retrieved document that helped them restructure their next search.
• 100% indicated they use Google when searching for lecture / lesson information.
• Less than half of the participants knew what metadata was.
Preliminary Findings: Search Attempts
• On post-exposure search attempts, mean number of search terms increased by 25% for those in the combined condition.
• Number of search terms decreased for both of the other conditions.
• Men used more search terms on their first query attempts, while women used more on their 2nd query attempts.
• Men were more likely to use modifiers and full text queries, while women tended to use more Boolean expressions.
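Extracting the query features counted above (number of search terms, Boolean operators, phrase modifiers) is straightforward; a simplified sketch follows. The study's actual query-logging instrumentation is unknown, so this is an assumption about one plausible way to do it.

```python
def query_features(query):
    """Count search terms and flag Boolean operators / phrase modifiers."""
    boolean_ops = {"AND", "OR", "NOT"}
    tokens = query.split()
    ops = [t for t in tokens if t.upper() in boolean_ops]
    terms = [t for t in tokens if t.upper() not in boolean_ops]
    return {
        "n_terms": len(terms),
        "uses_boolean": bool(ops),
        "uses_quotes": '"' in query,  # quoted-phrase modifier
    }

f = query_features('photosynthesis AND "lesson plan"')
```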
Upcoming Retrieval Experiment
• Real Users – STEM Teachers
  – Queries
  – Relevance Assessments
• Information retrieval experiment with 3, possibly 4 conditions:
  1. Automatically assigned metadata
  2. Manually assigned metadata
  3. Full-text indexing
  4. Fielded searches
Concluding Thoughts
• Provocative findings
  – Need replication on other document types
• Digital Library is a networked structure, but this is not captured in the linear world of metadata
  – Rich annotation by users is a type of metadata that is not currently captured but could be captured automatically
• Consider information extraction technologies
  – Entities, Relations, and Events
• Metadata can be useful in multiple ways
  – Not just for discovery
  – Have not experimented with management aspects of use
• Results of retrieval experiment will be key to understanding need for metadata for access