Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana's Metadata
Juliane Stiller¹, Péter Király²
¹ Berlin School of Library and Information Science, Humboldt-Universität zu Berlin
² Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
ISI 2017, March 14, 2017
Languages by eltpics
Agenda
1. Multilinguality in Europeana
2. Multilingual Score for Metadata
3. Implementation
4. Discussion & Future Work
Europeana - Facts
○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc.
○ Text, images, video, audio, sounds, 3D
○ Over 54 million objects
○ > 50 languages
http://statistics.europeana.eu/europeana
Thumbnail
Metadata
Link to Provider
Metadata Multilinguality
+ 40 other languages ...
The Multilingual Problem
○ Mona Lisa: 456 results
○ La Gioconda: 365 results
○ La Joconde: 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
Metadata Enrichment
Quantify the Multilinguality of Data to
○ Take measures to improve multilinguality in data
○ Establish a sense of the multilingual reach of Europeana
○ Distribution of languages
○ Devise strategies for underrepresented languages
Multilingual Score for Metadata
Multilingual saturation of metadata
Text without language annotation (dc.subject: Germany)
Text with language annotation (dc.subject: Germany@en)
Text with several language annotations (dc.subject: Germany@en, Deutschland@de)
Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
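The `text@lang` notation used on this slide can be split into a text part and a language tag before any scoring happens. The following is a minimal sketch, not the project's own parser: the function name `split_literal` and the accepted tag shape (letters plus optional subtags such as `pt-BR`) are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed tag shape: 2-8 letters, optionally followed by "-subtag" parts.
_TAGGED = re.compile(r"(.*)@([A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*)", re.DOTALL)


def split_literal(value: str) -> Tuple[str, Optional[str]]:
    """Split 'Germany@en' into ('Germany', 'en'); untagged text gets None."""
    match = _TAGGED.fullmatch(value)
    if match:
        return match.group(1), match.group(2).lower()
    return value, None


for raw in ["Germany", "Germany@en", "Deutschland@de"]:
    print(split_literal(raw))
```

A plain string that happens to end in `@` plus letters (an e-mail address without a top-level domain, say) would be misread as tagged, so real data would need the tag taken from the RDF literal itself rather than from string matching.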
Calculation

Field content                                            Score
Missing field                                            NA
Text string without language tag (language not known)    0
Text string with language tag (language known)           1
Text string with 2-3 different language tags             2
Text string with 4-9 different language tags             2.3
Text string with more than 10 different language tags    2.6
Link to (multilingual) vocabulary                        3
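The scoring levels on this slide can be expressed as a small function. This is a sketch under stated assumptions rather than the framework's actual code: the name `saturation_score`, the `(text, language)` tuple representation, and the rule that a vocabulary link takes precedence over literals are all illustrative choices. The slide does not say how exactly ten language tags are scored; the sketch places that case in the top band.

```python
from typing import List, Optional, Tuple

Literal = Tuple[str, Optional[str]]  # (text, language tag or None)


def saturation_score(literals: Optional[List[Literal]],
                     links_to_vocabulary: bool = False) -> Optional[float]:
    """Score one field; None stands for the slide's NA (missing field)."""
    if links_to_vocabulary:
        return 3.0  # assumption: a vocabulary link outranks any literal
    if not literals:
        return None
    distinct = len({lang for _, lang in literals if lang})
    if distinct == 0:
        return 0.0
    if distinct == 1:
        return 1.0
    if distinct <= 3:
        return 2.0
    if distinct <= 9:
        return 2.3
    # The slide says "more than 10"; counting exactly 10 here is an assumption.
    return 2.6


print(saturation_score([("Germany", None)]))
print(saturation_score([("Germany", "en"), ("Deutschland", "de")]))
print(saturation_score(None, links_to_vocabulary=True))
```

Keeping NA as `None` instead of 0 lets later aggregation decide whether missing fields should lower a collection's average or be left out of it.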
Example score
Text without language annotation (dc.subject: Germany): 0
Text with language annotation (dc.subject: Germany@en): 1
Text with several language annotations (dc.subject: Germany@en, Deutschland@de): 2
Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): 3
Aggregation of property dc:subject
The Wittgenstein Archives at the University of Bergen: high saturation
National Library Portugal: low saturation
http://144.76.218.178/europeana-qa/saturation.php?collectionId=all&field=proxy_dc_subject&type=average
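Aggregating a property means summarising the scores of its individual instances. The sketch below uses the key names `sum`, `average` and `normalized` that appear in the API output shown in the Visualization part of this deck. Two points are assumptions, not statements from the slides: NA instances are skipped, and the normalisation is taken to be 1 - 1/(1 + average), a formula chosen here because it reproduces the `normalized` value in that output up to rounding.

```python
from typing import Dict, List, Optional


def aggregate(scores: List[Optional[float]]) -> Dict[str, float]:
    """Summarise the per-instance scores of one property."""
    known = [s for s in scores if s is not None]  # assumption: NA is skipped
    if not known:
        raise ValueError("no scored instances")
    total = sum(known)
    average = total / len(known)
    # Assumed normalisation: maps averages in [0, inf) onto [0, 1).
    normalized = 1.0 - 1.0 / (1.0 + average)
    return {"sum": total, "average": average, "normalized": normalized}


# Hypothetical instance list: eleven scores that are consistent with the
# sum and average in the deck's API output (the output elides the middle ones).
scores = [2.0] * 9 + [2.4, 0.0]
print({key: round(value, 4) for key, value in aggregate(scores).items()})
```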
Good examples

Descriptive fields
dc:title
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en
dc:description
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en

Subject headings
Place/skos:prefLabel
"Brandenburger Tor"@de
"Brandenburg Gate"@en
"Grenzübergang Potsdamer Platz"@de
"Postdamer Platz border crossing"@en
"Reichstag"@de
"Reichstag building"@en
Implementation
source codes: http://pkiraly.github.io/about/#source-codes
data source: http://hdl.handle.net/21.11101/0000-0001-781F-7 (Europeana snapshot, December 2015)
Data processing workflow

Stage                 Tools                                   Output
ingestion             OAI-PMH, Europeana API, Hadoop, NoSQL   json
measuring             Spark, Hadoop, Java, Apache Solr        csv
statistical analysis  Spark, R                                json, png
web interface         PHP, D3.js, highchart.js, NoSQL         html, svg
Visualization
APIs, abstraction, reusing
"Place/skos:altLabel": {
  "instances": [
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    {"TRANSLATION": 2.0},
    ...
    {"TRANSLATION": 2.40},
    {"STRING": 0.0}
  ],
  "score": {
    "sum": 20.40,
    "average": 1.85454545,
    "normalized": 0.649681
  }
}
Discussion & Future Work
extension I. recalculation
The new metrics
★ Distinct languages per object
★ Language tags per object
★ Literals per language
★ Number of multilingual properties (a.k.a. fields)
★ Number of multilingual statements (a.k.a. field instances)
★ Average number of languages per property with language
★ Average number of languages per proxy
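Several of these metrics can be derived from one pass over an object's literals. The sketch below is illustrative only: the record layout (property name mapped to `(text, language)` tuples), the function name `language_metrics` and the result keys are hypothetical, and a "property with language" is assumed to mean a property that carries at least one language tag. The sample values are taken from the examples earlier in the deck.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Literal = Tuple[str, Optional[str]]  # (text, language tag or None)


def language_metrics(record: Dict[str, List[Literal]]) -> Dict[str, object]:
    """Compute a few of the listed metrics for one object."""
    per_language: Counter = Counter()
    languages_per_property: Dict[str, int] = {}
    for prop, literals in record.items():
        tags = [lang for _, lang in literals if lang]
        per_language.update(tags)
        if tags:  # assumption: one tag is enough to count the property
            languages_per_property[prop] = len(set(tags))
    n_props = len(languages_per_property)
    average = sum(languages_per_property.values()) / n_props if n_props else 0.0
    return {
        "distinct_languages": len(per_language),
        "language_tags": sum(per_language.values()),
        "literals_per_language": dict(per_language),
        "properties_with_language": n_props,
        "avg_languages_per_property_with_language": average,
    }


record = {
    "dc:title": [("Die Mauer muß weg!", "de"),
                 ("Die Mauer muß weg! (The Wall must go!)", "en")],
    "dc:subject": [("Germany", None)],
}
print(language_metrics(record))
```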
extension II. record views
ex:providerProxy
    dc:subject "special relativity"@en ;
    dc:creator <http://vocab.getty.edu/ulan/500240971> ;
    dc:type <http://udcdata.info/001684> .

ex:europeanaProxy
    dc:subject <http://dbpedia.org/resource/Physics> .

standard vocabulary:
<http://vocab.getty.edu/ulan/500240971> skos:prefLabel "Einstein, Albert"@de .

standard vocabulary:
<http://dbpedia.org/resource/Physics> skos:prefLabel "Physics"@en .

non-standard vocabulary:
<http://udcdata.info/001684> skos:prefLabel "Books in general"@en .
extension II. record views
source             field       link      value                    views
ex:providerProxy   dc:subject  literal   "special relativity"@en  ① ② ③ ④
ex:providerProxy   dc:creator  standard  "Einstein, Albert"@de    ① ② ③ ④
ex:providerProxy   dc:type     non-std   "Books in general"@en    ② ④
ex:europeanaProxy  dc:subject  standard  "Physics"@en             ③ ④

① data provider's proxy and standard enrichments
② data provider's proxy and enrichments
③ all proxies and standard enrichments
④ all proxies and enrichments
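The four record views differ along two switches: whether the Europeana proxy is included, and whether non-standard vocabularies count. A minimal sketch of that reading follows. The `Statement` type, the `VIEWS` table and the function names are hypothetical, and the short source labels `provider` and `europeana` stand in for `ex:providerProxy` and `ex:europeanaProxy`.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    source: str  # "provider" or "europeana" proxy
    field: str
    link: str    # "literal", "standard" or "non-std"
    value: str


# View number -> (include all proxies?, include non-standard vocabularies?)
VIEWS = {1: (False, False), 2: (False, True), 3: (True, False), 4: (True, True)}


def visible(statement: Statement, view: int) -> bool:
    """Return True if the statement is part of the given record view."""
    all_proxies, all_vocabularies = VIEWS[view]
    if not all_proxies and statement.source != "provider":
        return False
    if not all_vocabularies and statement.link == "non-std":
        return False
    return True


def views_of(statement: Statement) -> List[int]:
    return [view for view in sorted(VIEWS) if visible(statement, view)]


STATEMENTS = [
    Statement("provider", "dc:subject", "literal", '"special relativity"@en'),
    Statement("provider", "dc:creator", "standard", '"Einstein, Albert"@de'),
    Statement("provider", "dc:type", "non-std", '"Books in general"@en'),
    Statement("europeana", "dc:subject", "standard", '"Physics"@en'),
]
for statement in STATEMENTS:
    print(statement.source, statement.field, views_of(statement))
```

Under these two switches the four sample statements fall into the same views as on the slide, which is the only check this sketch claims to pass.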
Questions
○[email protected]@gwdg.de
○ Metadata Quality Assurance Framework: http://144.76.218.178/europeana-qa
○ Europeana Data Quality Committee: http://pro.europeana.eu/page/data-quality-committee
Appendix
Europeana data structure in 30 sec
provider proxy
Europeana proxy
Agent
Concept
Place
Timespan
descriptive fields
subject headings
semantic web