E-MELD Working Group for Corpus Management and Metadata Preliminary report June 20, 2006.


E-MELD Working Group for Corpus Management and Metadata

Preliminary report

June 20, 2006

Participants

Heidi Johnson (chair)

Michael Appleby (liaison)

Gary Simons

Joseph Grimes

Shauna Eggars

Alison Alvarez

Anthony Aristar

Charles Warner

Topics visited today

Transmittal of data to archives

Metadata:

– Standards

– Simple tools

– How to get people to do it

Transmittal to archives

This means "electronic transmittal of digital data & metadata"

Group consensus: it's the way of the future and should be encouraged

The hows & whens depend on the institution in which the archive is embedded; this is not really an E-MELD issue

Transmittal to archives II

Gary S. notes that institutional repositories using DSpace are growing rapidly:

Mainly for archiving/disseminating e.g. preprints of articles

There may be a role OLAC can play in developing standards & tools for supporting multimedia language data in DSpace repositories

Transmittal to archives III

Johnson wishes it noted that this should be a low-priority issue because:

DSpace isn't handy yet for most linguists;

Uploading/downloading 100+ MB audio files is too slow for most field linguists & speakers;

Mailing stuff on DVDs & CDs works great and requires no development, so people can just go ahead and do it now with no excuses

Metadata: the big picture

The OAIster meta-catalog now has over 5 million records;

It only lists things with URLs (not things in archives that require logins);

OAI is just basic Dublin Core: very dumbed-down metadata;

OLAC's role continues to be providing the linguist's-eye view on this vast ocean of data.

Metadata - Standards

We have a standard for metadata: the OLAC schema.

This should be considered bottom-line, basic, required catalog information for all linguistic data.

Deeper schemas like IMDI and specialized subfield schemas are encouraged.

So, how do we get people to do it?

Johnson's complaint:

Putting up a page of XML code and saying "this is what metadata looks like" is guaranteed to drive the average field linguist away. It frustrates them and gives them an excuse to do nothing.

"XML is an interchange format, not an authoring format." Gary S.

Metadata - Solutions

1. All tools for producing and manipulating linguistic data should include sections for creating and maintaining (at least) OLAC metadata.

2. Tools that do this need to be more widely disseminated (e.g. WordCorr.)

3. New tools being developed must pay attention to standards so that they are interoperable with other tools.

A very simple solution

We need templates for metadata catalogs for all the popular tools: Excel, Word, FileMaker Pro.

These templates should be downloadable from the School.

Archives then need transformer scripts/tools to convert these templates into XML.

Archives can help maintain & distribute the templates & transformers, but we are all underfunded so we need help from all you Perl wizards out there!

More about the simple solution

Templates for popular tools and transformer scripts don't require much development effort, but they will result in a HUGE improvement in the amount of metadata that is collected and transferred with data to archives.

From nothing to something in seconds flat!
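To make the template-plus-transformer idea concrete, here is a minimal sketch of a transformer, written in Python rather than Perl. It assumes the spreadsheet template has been exported as CSV. The column headings, the `catalog` and `record` wrapper elements, and the sample rows are all hypothetical, chosen for illustration only; the output uses Dublin Core element names but is not validated against the OLAC schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Dublin Core element namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from template column headings to Dublin Core elements.
COLUMN_TO_DC = {
    "Title": "title",
    "Contributor": "contributor",
    "Date": "date",
    "Language": "language",
    "Description": "description",
}


def rows_to_xml(csv_text):
    """Turn each CSV row into a <record> of Dublin Core elements."""
    root = ET.Element("catalog")  # hypothetical wrapper element
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for column, dc_name in COLUMN_TO_DC.items():
            value = (row.get(column) or "").strip()
            if value:  # skip cells the linguist left blank
                element = ET.SubElement(record, f"{{{DC_NS}}}{dc_name}")
                element.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample rows; "und" is the ISO 639 code for "undetermined".
SAMPLE = """Title,Contributor,Date,Language,Description
Story of the flood,Speaker A,2006-06-20,und,Narrative recorded at home
Word list 1,Speaker B,2006-06-21,und,
"""

xml_out = rows_to_xml(SAMPLE)
print(xml_out)
```

The linguist only ever sees the spreadsheet; the XML is produced at the archive's end, which is the point of treating XML as an interchange format.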

A once-and-future standard?

Recording audio/video headers with the basic metadata.

All recordings from the 60's have them; few recordings from the 90's do. What happened?

This must be part of all field methods classes and taught on projects.

We should put some good examples - both audio & video - up at the School.

Metadata details for EL data

More demographic info about speakers:

Age, sex, family & social position & role, occupation/economic status/education level where relevant, native language, other languages, place of origin; keep the speaker's name anonymous or use a nickname

Note: the IMDI schema includes these fields.

How do we get people to keep catalog information?

"Never underestimate people's ability to get out of doing something." Alison

1. Make their grade or degree depend on it.

2. Funder pressure: require all data collected w/grant to be archived.

3. Nag them relentlessly.

4. Suite of simple tools & templates that makes it easy and possibly even fun.

Tony's Extremely Excellent Idea

Departments have control over standards for dissertations: require that dissertations include in an appendix all the metadata for all the data on which the diss is based.

Make it an LSA resolution: who could object?

Gary S.: so now the tools should say "Which appendix is this catalog going into?"

Editorial pressure

Linguistic publications could demand (prefer) citation of properly archived data, so linguists would archive & cite in order to publish.

Step 1: a standard format for citing audio/video linguistic data as well as texts;

Step 2: widely disseminating this standard;

Step 3: increasingly requiring citation.

Summary

We have a perfectly good metadata standard (in fact, we have two.)

What we need is to get people to use them!! We need:

– Plain language lists of required elements and examples available at the School;

– Simple templates & transformers for use with popular tools;

– An arsenal of "persuaders" (publication, diss reqs, funder pressures, shame & guilt, etc.)

The publication value chain

Gary S. notes that we are moving towards institutional repositories.

This means that libraries are taking on more of the publishers' role in selecting, (re)formatting, & disseminating.

So departments may lean on authors to publish via institutional repositories, rather than or in addition to traditional publications

The perfect corpus management tool (PCMT)

The ideal tool helps people manage and integrate their workflow:

– project initiation

– collection of primary data

– products of analysis

– publications based on the corpus

Features of the PCMT - I

1. designed from the start to support internationalization

2. has remarks fields at every level & especially wherever the primary input is from a controlled vocabulary (e.g. genre)

3. is easy to use, for linguists & speakers

Features of the PCMT - II

4. generates identifiers that relate related materials. You can use these identifiers to label media and as file names.

5. keeps track of relationships among materials

6. allows you to treat materials as unrelated if you want, so you can e.g. archive all your recordings and then later add texts as they are done
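As an illustration of features 4 and 5, the sketch below builds identifiers that are safe to use as media labels and file names, and that show which materials belong together. The identifier layout (project code, year, session, item, optional kind suffix) and the names `make_id` and `base_of` are assumptions made for this example, not a proposed standard.

```python
import re

# Hypothetical layout: PROJECT + YEAR - Ssession - Iitem [- KIND]
ID_PATTERN = re.compile(r"^[A-Z]{2,5}\d{4}-S\d{2}-I\d{2}(-[A-Z]+)?$")


def make_id(project, year, session, item, kind=""):
    """Build an identifier using only letters, digits and hyphens."""
    base = f"{project.upper()}{year:04d}-S{session:02d}-I{item:02d}"
    ident = f"{base}-{kind.upper()}" if kind else base
    if not ID_PATTERN.match(ident):
        raise ValueError(f"not a valid identifier: {ident}")
    return ident


def base_of(ident):
    """Strip the optional kind suffix, leaving the primary item's identifier."""
    return "-".join(ident.split("-")[:3])


recording = make_id("fld", 2006, 3, 2)          # a recording
transcript = make_id("fld", 2006, 3, 2, "txt")  # a text derived from it

print(recording + ".wav")
print(transcript + ".txt")
print(base_of(transcript) == recording)
```

Because a derived text carries the recording's identifier as its prefix, the relationship survives even when the files are handled as unrelated items, for example when recordings are archived first and texts are added later.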

Features of the PCMT - III

7. works offline, but lets you connect and upload your metadata to an archive or other repository

8. lets you connect to the Ethnologue to look up language codes

9. divides metadata into sub-packages (objects) so that users don't have to re-enter any info (e.g. for the language)

10. metadata objects are easy to share across tools

Features of the PCMT - IV

11. lets you copy info from one md object to another so you don't have to type the same info twice. Ex: all your consultants have the same socio-economic situation; all your recordings were elicited; this series of questionnaires was collected by the same person...
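A minimal sketch of feature 11, assuming metadata objects are simple records: a new object is made by copying an existing one and changing only the fields that differ. The class `RecordingInfo`, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordingInfo:
    label: str
    date: str
    place: str
    collection_method: str
    collector: str


# Entered once, by hand (made-up values).
first = RecordingInfo(
    label="R01",
    date="2006-06-20",
    place="Village X",
    collection_method="elicited",
    collector="Field Linguist",
)

# Copy everything, then overwrite only what changed.
second = replace(first, label="R02", date="2006-06-21")

print(second)
```

Here `replace` returns a new object and leaves the original untouched, so a typo in the copy cannot corrupt the record it was copied from.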

Features of the PCMT - V

12. objects are modelled graphically so you can e.g. drag this contributor who speaks this language into this project

13. lets you customize so you can nickname objects for quick reference

14. supports versioning of materials

15. cross-platform, Unicode, outputs XML, etc.

Features of the PCMT - VI

10. objects:

1. project info

2. language info

3. contributor info

4. equipment info

5. data info: objects for recordings, texts, databases, spreadsheets, etc etc etc
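The object types listed above can be sketched as small records that refer to one another, so that language or project information is entered once and reused (features 9 and 10). The class and field names below are hypothetical, and the values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectInfo:
    name: str
    funder: str = ""


@dataclass
class LanguageInfo:
    code: str
    variety_notes: str = ""


@dataclass
class ContributorInfo:
    nickname: str
    native_language: LanguageInfo


@dataclass
class EquipmentInfo:
    description: str


@dataclass
class RecordingData:
    label: str
    project: ProjectInfo
    language: LanguageInfo
    equipment: EquipmentInfo
    contributors: List[ContributorInfo] = field(default_factory=list)


# Entered once at the start of the project.
project = ProjectInfo(name="Example project")
language = LanguageInfo(code="und")  # "und" = undetermined
recorder = EquipmentInfo(description="digital audio recorder")
speaker = ContributorInfo(nickname="Speaker A", native_language=language)

# Each recording refers to the shared objects instead of repeating them.
rec1 = RecordingData("R01", project, language, recorder, [speaker])
rec2 = RecordingData("R02", project, language, recorder, [speaker])

# A later correction to the language object is seen by every recording.
language.variety_notes = "notes about the variety go here"
print(rec1.language.variety_notes == rec2.language.variety_notes)
```

Sharing by reference is one possible design; a tool could equally store an identifier for each object and look it up, which makes it easier to exchange objects across tools.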

Example

At the start of your project, you enter:

– project data: your name & contact info, director, funder, project period, description

– language data: code, notes about the variety, speaker population, etc.

– equipment: enter all the specs as you take things out of the box

Example, cont.

In the field:

Enter contributor data at the start of a session with each new consultant

Enter recording data each time you record:

– tool gives you an identifier/label

– date, place, context, conditions

– reference contributors

– reference equipment

– reference language

Example, cont.

Back at the university:

Review/revise previously entered info

Transmit recordings and metadata to archive

Enter info for analytical products:

– reference recording objects

– reference contributors (new ones, like yourself)

– create format objects for software, platform, fonts, etc.