Transcript of "Developing & Evaluating Metadata for Improved Information Access," a 23-slide presentation by Elizabeth D. Liddy, Center for Natural Language Processing, School of Information Studies, Syracuse University.

Page 1

Elizabeth D. Liddy

Center for Natural Language Processing
School of Information Studies

Syracuse University

Developing & Evaluating Metadata for Improved Information Access

Page 2

Background

• Breaking the Metadata Generation Bottleneck
  – 1st NSDL project (2000-2002)
  – Adapted Natural Language Processing technology for automatic metadata generation
  – 15 Dublin Core + 8 GEM education elements

• Project had a modest evaluation study
  – Results suggested that automatically generated metadata was qualitatively nearly equal to manually generated metadata
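As a concrete illustration of the kind of record such a system emits, here is a minimal sketch of building a Dublin Core-style XML record with the Python standard library; the resource and all element values are hypothetical, not output from the project's system.

```python
# A minimal sketch of emitting a Dublin Core-style metadata record as XML.
# The element values are hypothetical, not the project's actual output.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def build_record(fields):
    """Wrap field name/value pairs in namespaced Dublin Core elements."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record

record = build_record({
    "title": "Photosynthesis Lesson Plan",  # hypothetical resource
    "subject": "Biology",
    "description": "A hands-on lesson on photosynthesis.",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
```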

Page 3

NLP-Based Metadata Generation

Types of Features:
• Linguistic
  – Root forms of words
  – Part-of-speech tags
  – Phrases (Noun, Verb, Proper Noun)
  – Categories (Person, Geographic, Organization)
  – Concepts (sense-disambiguated words / phrases)
  – Semantic Relations
  – Events
• Non-linguistic
  – Length of document
  – HTML and XML tags

Page 4

MetaExtract

[Figure: the MetaExtract processing pipeline. An HTML document passes through an HTML converter and configuration step, a preprocessor (tf/idf keyword scoring), the eQuery extraction module, a metadata retrieval module, and an output gathering program, producing an XML document with metadata. A cataloger reviews the potential keyword data. Generated elements include: Title, Description, Subject/Keywords, Creator, Publisher, Date, Catalog Date, Rights, Format, Language, Resource Type, Grade/Level, Duration, Pedagogy, Audience, Standard, Essential Resources, and Relation.]
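The tf/idf keyword scoring that the preprocessor relies on can be sketched as follows; the toy corpus, whitespace-free token lists, and top-3 cutoff are illustrative assumptions, not the project's actual implementation, which would also tokenize, stem, and filter stopwords.

```python
# A sketch of tf-idf keyword ranking over a tiny hypothetical corpus.
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_n=3):
    """Rank the tokens of `doc` by tf-idf against `corpus` (a list of
    token lists that includes `doc` itself)."""
    tf = Counter(doc)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)  # document frequency >= 1
        scores[term] = (count / len(doc)) * math.log(len(corpus) / df)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

corpus = [
    ["photosynthesis", "plants", "light", "energy", "plants"],
    ["plants", "water", "soil"],
    ["light", "waves", "energy"],
]
keywords = tfidf_keywords(corpus[0], corpus)
```

Terms that occur in only one document (like "photosynthesis" here) get the highest idf, so they rise to the top of the keyword list.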

Page 5

MetaTest Research Questions

• Do we need metadata for information access?
  – Why?
• How much metadata do we need?
  – For what purposes?
• Which elements do we need?
  – For which digital library tasks?
• How is metadata utilized by information-seekers?
  – When browsing / searching / previewing?
• Can automatically generated metadata perform as well as manually assigned metadata?
  – For browsing / searching / previewing?

Page 6

Three Types of Evaluation of Metadata

1. Human expert qualitative review
2. Eye-tracking in searching & browsing tasks
3. Quantitative information retrieval experiment with 3 conditions:
   1. Automatically assigned metadata
   2. Manually assigned metadata
   3. Full-text indexing

Page 7

Evaluation Methodology

1. System automatically meta-tagged a Digital Library collection that had already been manually tagged.
2. Solicited subject pool of teachers via listservs.
3. Had users qualitatively evaluate metadata tags.
4. Conducted searching & browsing experiments.
5. Monitored with eye-tracking & post-search interviews.
6. Observed relative utility of each metadata element for both tasks.
7. Are now preparing for an IR experiment to compare 2 types of metadata generation + full-text indexing.

Page 8

Who Were the Respondents?

Type of Educator:
  Elementary Teacher 6%
  Middle School Teacher 6%
  High School Teacher 66%
  Higher Education Teacher 6%
  Instructional Designer 3%
  School Media 3%
  Other 11%

Experience with Lesson Plans:
  <1 Year 6%
  1-3 Years 29%
  3-9 Years 29%
  10+ Years 37%

Subject Taught:
  Science 69%
  Math 6%
  Engineering 3%
  Combination 11%
  Other 11%

Page 9

[Image-only slide; no text captured in the transcript.]

Page 10

Metadata Element Coverage
• For 35 lesson plans & learning activities from the GEM Gateway
• Metadata elements present on automatic vs. manually generated records

[Bar chart: number of educational resources (0-40) in which each metadata element is present, for manually vs. automatically generated records. Elements compared: Title, Subject/Keywords, Description, Grade Level, Relations, Pedagogy-Method, Duration, Materials, Pedagogy-Grouping, Pedagogy-Process, Pedagogy-Assess.]

Page 11

Qualitative Statistical Analysis

• 35 subjects evaluated 7 resources + metadata records
• 234 total cases
• Ordinal-level data measuring metadata quality
  – Unsure, Very Poorly, Poorly, Well, Very Well
• Mann-Whitney Test of Independent Pairs
  – Non-parametric test
  – Accepts ordinal data
  – Does not require normal distribution, homogeneity of variance, or same sample size
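The statistic behind this test can be sketched in a few lines of pure Python: pool the two samples, assign tie-aware average ranks, then compute U1 = R1 - n1(n1+1)/2. The ratings below are hypothetical, and the sketch computes only the U statistic, not the p-value the study would also report.

```python
# A pure-Python sketch of the Mann-Whitney U statistic (tie-aware ranks).
def mann_whitney_u(a, b):
    pooled = sorted(a + b)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1  # advance past the run of tied values
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n1, n2 = len(a), len(b)
    r1 = sum(ranks[x] for x in a)  # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    return min(u1, n1 * n2 - u1)  # report the smaller of U1, U2

manual = [3, 3, 4, 2, 3]  # hypothetical ordinal quality ratings (1-4)
auto = [3, 2, 3, 3, 2]
u = mann_whitney_u(manual, auto)
```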

Page 12

Medians of Metadata Element Quality

Element              Manual (Median / IQR / Mean Rank)   Automatic (Median / IQR / Mean Rank)
Title                3 / 2-4 / 132                       3 / 1-4 / 105
Description          3 / 3-4 / 122                       3 / 2-4 / 113
Grade                3 / 2-4 / 73                        3 / 3-4 / 80
Keyword              3 / 3-4 / 127                       3 / 2-4 / 99
Duration             3 / 2.75-4 / 29                     3 / 2-3.25 / 25
Material             3.5 / 3-4 / 49                      3 / 2-4 / 39
Pedagogy Method      3 / 0.5-3 / 30                      3 / 2-4 / 33
Pedagogy Process     --                                  3 / 1-3 / 53
Pedagogy Assessment  --                                  3 / 2-3.5 / 9
Pedagogy Group       3 / 1.5-3 / 14                      3 / 2.5-4 / 18

(Values are Median Score / Inter-Quartile Range / Mean Rank.)

No statistical difference for 8 of 10 elements.

Minimally statistically significant better manual metadata for Title and Keyword elements.
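The summary statistics reported in the table (median and inter-quartile range) can be computed with the standard library as in this sketch; the rating sample is hypothetical, not the study's raw data.

```python
# A sketch of computing a median and inter-quartile range for ordinal ratings.
import statistics

ratings = [2, 3, 3, 3, 4, 4, 3, 2, 3, 4]  # hypothetical ordinal scores, 1-4 scale
median = statistics.median(ratings)
# "inclusive" interpolates quartiles treating the sample as the whole population
q1, _, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
iqr = (q1, q3)  # reported in the table in the form "2-4"
```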

Page 13

Eye Tracking in Digital Libraries

• How do users of Digital Libraries use and process metadata?
  – Test on three conditions:
    • Records with descriptions
    • Records with metadata
    • Records with both descriptions and metadata

Page 14

What the Eyes Can Tell Us

• Indices of ocular behavior are used to infer cognitive processing, e.g.:
  – Attention
  – Decision making
  – Knowledge organization
  – Encoding and access

• The longer an individual fixates on an area, the more difficult or complex that information is to process.

• The first few fixations indicate areas of particular importance or informativeness.

Page 15

User Study: Data Collection

User wears an eye-tracking device while browsing or searching STEM educational resources

The eye fixations (stops) and saccades (gaze paths) are recorded.

Fixations enable a person to gather information. No information can be acquired during saccades.

The colors represent different intervals of time (from green through red).
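One common way to separate fixations from saccades in the recorded gaze data is a simple velocity threshold; the rule below is an illustrative assumption, not necessarily the algorithm used by the study's eye tracker, and the coordinates and 50-pixel threshold are hypothetical.

```python
# A minimal sketch of velocity-threshold classification of gaze samples.
import math

def classify_samples(points, threshold=50.0):
    """Label each gaze sample 'fixation' or 'saccade' by the distance
    moved since the previous sample."""
    labels = ["fixation"]  # first sample has no predecessor; assume fixation
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)  # pixels moved per sample interval
        labels.append("saccade" if dist > threshold else "fixation")
    return labels

gaze = [(100, 100), (102, 101), (103, 99), (300, 250), (301, 251)]
labels = classify_samples(gaze)
```

The large jump between the third and fourth samples is labeled a saccade; the small movements before and after it form two fixation clusters.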

Page 16

Methods

• Pre-exposure search attempt
  – 3 trials of entering search terms using free text, modifiers, Boolean expressions, etc.
• Exposure to test stimuli
  – Information in 1 of 3 formats:
    • Metadata only
    • Description only
    • Metadata and Description
  – Eye tracking during exposure
• Post-exposure search & follow-up interview

Page 17

Scanpath of Metadata Only Condition

Page 18

Graphically Understanding the Data

Contour map shows the aggregate of eye fixations.

Peak fixation areas are Description element, with some interest in URL and subject elements.

Note dominance of upper left side.

LookZone shows amount of time spent in each zone of record.

User spent 27 seconds, or 54% of the time, looking at the Description metadata element.

Very little time was spent on other elements.

Page 19

Preliminary Findings: Eye Tracking

• Narrative resources are viewed in linear order, but metadata is not.

• Titles and sources are the most-viewed metadata.

• First few sentences in resource are read more carefully; the rest is skimmed.

• Before selecting a resource, users re-visit the record for confirmation.

• Subjects focus on narrative descriptions when both descriptions & metadata are on same page.

Page 20

Preliminary Findings: Interview Data

• 65% changed their initial search terms after exposure to test stimuli.

• 20% indicated they would use their chosen document for the intended purpose.

• 60% said they learned something from retrieved document that helped them restructure their next search.

• 100% indicated they use Google when searching for lecture / lesson information.

• Less than half of the participants knew what metadata was.

Page 21

Preliminary Findings: Search Attempts

• On post-exposure search attempts, mean number of search terms increased by 25% for those in the combined condition.

• Number of search terms decreased for both of the other conditions.

• Men used more search terms on their first query attempts, while women used more on their 2nd query attempts.

• Men were more likely to use modifiers and full text queries, while women tended to use more Boolean expressions.

Page 22

Upcoming Retrieval Experiment

• Real Users – STEM Teachers
  – Queries
  – Relevance Assessments
• Information retrieval experiment with 3, possibly 4 conditions:
  1. Automatically assigned metadata
  2. Manually assigned metadata
  3. Full-text indexing
  4. Fielded searches
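The contrast between the full-text and fielded conditions can be sketched with a toy in-memory search; the records, fields, and substring-matching rule are illustrative assumptions, not the experiment's retrieval system.

```python
# A toy sketch: full-text matching over a whole record vs. a fielded
# search restricted to one metadata element. Records are hypothetical.
records = [
    {"title": "Photosynthesis Lab", "description": "Plants convert light to energy."},
    {"title": "Light and Waves", "description": "Physics of light propagation."},
]

def full_text_search(query, records):
    q = query.lower()
    # match the term anywhere in any field of the record
    return [r for r in records if any(q in v.lower() for v in r.values())]

def fielded_search(query, field, records):
    q = query.lower()
    # match the term only inside the named metadata field
    return [r for r in records if q in r[field].lower()]

full = full_text_search("light", records)            # matches both records
fielded = fielded_search("light", "title", records)  # matches only one
```

The same query retrieves different result sets under the two conditions, which is exactly the difference the experiment is designed to measure.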

Page 23

Concluding Thoughts

• Provocative findings
  – Need replication on other document types
• Digital Library is a networked structure, but this is not captured in the linear world of metadata
  – Rich annotation by users is a type of metadata that is not currently captured but could be captured automatically
• Consider information extraction technologies
  – Entities, Relations, and Events
• Metadata can be useful in multiple ways
  – Not just for discovery
  – Have not experimented with management aspects of use
• Results of retrieval experiment will be key to understanding need for metadata for access