Post on 23-Jan-2015
data.cnr.it and the Semantic Scout
CNR Semantic Technology Lab
ISTC - SI
Aldo Gangemi, Alberto Salvati, Enrico Daga, Gianluca Troiani
Thanks to Claudio Baldassarre (UN-FAO) and Alfio Gliozzo (IBM-Watson)
http://stlab.istc.cnr.it
http://data.cnr.it
http://bit.ly/semanticscout
data.cnr.it
Enhanced SPARQL endpoint
Ontologies
Sample class from ontology
The Semantic Scout
• A framework for search, presentation, and analysis of entities and their associated knowledge
• Employs Semantic Web (SW), Linked Open Data (LOD), natural language processing (NLP), and information retrieval (IR)
• Scientific work goes back to 2006, first presented at ISWC2007
• An evolving prototype for requirements of the EU IP IKS: semantic search, hybrid IR/SW identity management, automatic document classification (against DBpedia)
• 2009 requirements from the technology transfer office of CNR for the NetwOrK initiative
The CNR
• CNR is the largest research institution in Italy
– about 8000 permanent researchers (+14000)
– 7 departments focused on the main scientific research areas
– 108 institutes spread all over Italy
• Subdivided into research units, labs, etc.
The CNR data sources
• Curricula (DB)
• Frameworks, Programmes, Workpackages (DB)
• Departments (DB)
• Institutes, Central admin, Publications (DB)
• Permanent employees (DB)
• Other research employees, Externally funded projects (DB)
• Accounting, Contracts, Invoicing (DB)
• Administration documentation (File System)
Data categories: Organizational data, Personnel-related data, Activity-related data, Financial data
Only partly as open data!
The CNR tasks
• Strategic objective: matching the research demand to the research supply
• Requirements
– Semantic interoperability between heterogeneous data sources
– Expert finding based on competence
– Monitoring funding and evolution of different research areas and units
– Browsing and reporting capabilities
Architecture
Methods for data conversion, extraction, inference, integration, linking, publishing, and searching
Figures
• CNR Ontology: 28 modules, 120 classes, 300 relations, 1200 axioms
• CNR Data: >200K entities, ≈3M facts (about 2M inferred or extracted), ≈240 datasets
Sources and lifting
• Situation usually not as clean as using a unique CMS for most organizational tasks
• DB (e.g. SQL Server) + a lot of textual records + HTML Web Site + textual corpus + linked open data
• DB + interaction schemata (XML templates and HTML scraping, needed because of schemata degradation and user perspective evolution)
Ontology design
• Starting from XML templates as module/pattern drafts
• Reengineering XML and scraped templates
• Reengineering DB schemata (system engineer involved)
• Obtained modular, pattern-based, task-based ontology
• Textual DB records with identity: precondition for hybridizing IR and SW (see later)
• Alignments to FOAF, SIOC, SKOS, WordNet ontologies
• Used patterns: situation, place, transitive reduction
The CNR ontology
Data design
• Triplifiers based on SQL rules (automatic scripting on JDBC drivers not enough because of legacy degradation of physical schemata)
– Cf. also: Semion reengineering tool
• Inferences: OWL (Pellet, HermiT), SPARQL CONSTRUCT
• Extraction tool: Semiosearch, categorizer over Wikipedia categories
– Next: deep parsing approach (facts, relations, entities)
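A rule-based triplifier of the kind mentioned on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the CNR implementation: the `researcher` table, its columns, the `http://example.org/cnr/` namespace, and the `worksAt` property are all hypothetical, and an in-memory SQLite database stands in for the legacy SQL Server sources.

```python
import sqlite3

BASE = "http://example.org/cnr/"  # hypothetical namespace, not the real data.cnr.it URIs
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def researcher_rule(row):
    """Hand-written rule: one legacy row becomes several triples."""
    subject = f"<{BASE}researcher/{row['id']}>"
    yield subject, f"<{RDF_TYPE}>", f"<{FOAF}Person>"
    # Clean up the legacy value; literal escaping is omitted in this sketch.
    yield subject, f"<{FOAF}name>", '"%s"' % row["full_name"].strip()
    if row["inst_code"] is not None:  # degraded schemata often leave columns empty
        yield subject, f"<{BASE}ontology/worksAt>", f"<{BASE}institute/{row['inst_code']}>"


# Each rule pairs an SQL query with the function that maps its rows to triples.
RULES = [("SELECT id, full_name, inst_code FROM researcher ORDER BY id", researcher_rule)]


def triplify(conn):
    """Run every rule against the database and collect the resulting triples."""
    conn.row_factory = sqlite3.Row  # access columns by name inside the rules
    triples = []
    for sql, rule in RULES:
        for row in conn.execute(sql):
            triples.extend(rule(row))
    return triples


def demo_connection():
    """Build a tiny in-memory database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE researcher (id INTEGER, full_name TEXT, inst_code TEXT)")
    conn.executemany(
        "INSERT INTO researcher VALUES (?, ?, ?)",
        [(1, " Ada Example ", "ISTC"), (2, "Bob Sample", None)],
    )
    return conn


if __name__ == "__main__":
    for s, p, o in triplify(demo_connection()):
        print(s, p, o, ".")
```

Writing the rules by hand, rather than deriving them from the physical schema, is what lets each rule compensate for missing or messy columns.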
Publishing and hybridizing
• Publishing OWL-RDF datasets
– linked data approach (persistent URIs, triple stores for RDF dataset management, linking to common vocabularies: FOAF, DBpedia, Geonames, Bibo, ...)
– OWL ontologies for dataset generation, querying, inference (new enriched datasets)
• Subgraph extraction through SNA
• Virtual semantic corpus
– IRW to distinguish information and non-information resources
– SPARQL rules to generate virtual texts associated with entities
• Indexing
– Lucene+LSA indexing of semantic corpus
– “Semantic” Lucene extension to produce tight coupling of virtual texts with entities
– Multilinguality
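The idea of a virtual semantic corpus, one generated text per entity that is then indexed for keyword search, can be sketched as below. This is a simplified stand-in under stated assumptions: plain Python replaces the SPARQL rules, a toy inverted index replaces Lucene+LSA, and every identifier and label in `TRIPLES` is invented.

```python
from collections import defaultdict

# Toy triples; all identifiers and labels are invented for illustration.
TRIPLES = [
    ("ex:person/1", "rdfs:label", "Ada Example"),
    ("ex:person/1", "ex:memberOf", "ex:project/7"),
    ("ex:project/7", "rdfs:label", "Reputation in social networks"),
    ("ex:person/2", "rdfs:label", "Bob Sample"),
    ("ex:person/2", "ex:memberOf", "ex:project/9"),
    ("ex:project/9", "rdfs:label", "Marine sediment chemistry"),
]


def virtual_texts(triples):
    """Build one text per entity from its own label and its neighbours' labels."""
    label_of = {s: o for s, p, o in triples if p == "rdfs:label"}
    parts = defaultdict(list)
    for s, p, o in triples:
        if p == "rdfs:label":
            parts[s].append(o)
        elif o in label_of:  # object is another labelled entity
            parts[s].append(label_of[o])
    return {entity: " ".join(words) for entity, words in parts.items()}


def build_index(texts):
    """Inverted index from lowercase token to the entities whose text contains it."""
    index = defaultdict(set)
    for entity, text in texts.items():
        for token in text.lower().split():
            index[token].add(entity)
    return index


def search(index, query):
    """Return the entities whose virtual text contains every query token."""
    hits = [index.get(token, set()) for token in query.lower().split()]
    return sorted(set.intersection(*hits)) if hits else []


if __name__ == "__main__":
    index = build_index(virtual_texts(TRIPLES))
    print(search(index, "reputation"))
```

Because each text is keyed by an entity identifier, a keyword hit leads straight back to a node in the graph, which is the coupling between IR and SW that the slide describes.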
Consuming
• SPARQL endpoint, with interface enhancement
• Keyword-based search
– Semantic browsing with SPARQL-based AJAX DHTML, RDF relation browser, or XML-based relation browser
• Category-based search
– Keyword-based result focusing
Expert finding: Task-based testing
• It is based on the ability to materialize on demand a contextual network of relevant information.
• It is performed with a combination of tools in the toolkit to:
– Identify the main topics of research
– Recursively search the CNR data cloud
Identifying the main topics of research: project description
• “Reputation is a social knowledge, on which a number of social decisions are accomplished. Regulating society from the morning of mankind becomes more crucial with the pace of development of ICT technologies, dramatically enlarging the range of interaction and generating new types of aggregation. Despite its critical role, reputation generation, transmission and use are unclear. The project aims to an interdisciplinary theory of reputation and to modeling the interplay between direct evaluations and meta-evaluations in three types of decisions, epistemic (whether to form a given evaluation), strategic (whether and how interact with target), and memetic (whether and which evaluation to transmit).”
– Project About: Social Knowledge for e-Governance.
– Topics can be manually annotated, or automatically induced, e.g.: ethics, sociology, collaboration, social network, reputation
Identifying the main topics of research: text categorization
• Query: “ethics, sociology, collaboration, social network, reputation”
Search the CNR data cloud: identify an entry point
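• “Commessa” (programme): “Il Circuito dell’Integrazione: Mente, Relazioni e Reti Sociali. Simulazione Sociale e Strumenti di Governance” (in English: “The Integration Circuit: Mind, Relations and Social Networks. Social Simulation and Governance Tools”)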
Search the CNR data cloud: identify key people
• Ing. Jordi Sabater: Cognitive Science;
• Dott. Mario Paolucci: Sociology, Psychology;
• Gennaro di Tosto: Artificial Intelligence;
• Walter Quattrociocchi: Interdisciplinary Fields;
• Giuseppe Castaldi: Ethics;
• Aldo Gangemi: Semantic Web, Knowledge representation.
Expert Finding: Results
• The description of the “eRep” project was adopted as a gold standard to evaluate the results when testing the Semantic Scout.
• 6 out of 10 CNR researchers were correctly retrieved, plus a project member affiliated with another institution.
– Project Coordinator: Dott. Mario Paolucci
– External Member: Jordi Sabater Mir
Functional evaluation of Semantic Scout (example)
• Expert finding accuracy
– All 6 retrieved people scored among the first 10 in the result from the search engine.
• Benefit of integrated data cloud
– The user judged an “activity” to be relevant to his goal and used it as an entry point to the CNR network of resources.
Functional evaluation of Semantic Scout
• Accessibility and Interaction
– Multiple user interfaces give users a level of interaction adapted to each specific type of required information
• Completeness of retrieval
– 4 people were not included in our result set.
– Antonietta Di Salvatore: scored below the first 10 people in the list (+1);
– Giulia Andrighetto: was not listed among the people relevant to the query, but belongs to the social network of Dr. Rosaria Conte (+1).
– Marco Capenni and Stefano Picascia: have a technician profile, hence they are neither reported among the people relevant to the search query nor belong to the network of any of the other researchers.
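The retrieval figures above can be restated as recall against the gold standard. The sketch below is a worked calculation only: the identifiers `r1` to `r10` are placeholders for the ten CNR project members, and reading the two “(+1)” annotations as hits that would count under a relaxed criterion (a longer result list, or one hop in the social network) is an assumption, not something the slides state.

```python
def recall(retrieved, gold):
    """Fraction of gold-standard people that appear among the retrieved ones."""
    gold = set(gold)
    return len(gold & set(retrieved)) / len(gold)


# Placeholder identifiers standing in for the ten CNR project members.
gold = [f"r{i}" for i in range(1, 11)]

strict_hits = gold[:6]   # the 6 people retrieved within the first 10 results
relaxed_hits = gold[:8]  # assumption: the two "(+1)" cases are also counted

print(recall(strict_hits, gold))   # 0.6
print(recall(relaxed_hits, gold))  # 0.8
```

Under this reading, the two people with a technician profile are the only members that neither the ranked list nor the social network would surface.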
Ongoing work
• More data linking (e.g. DBLP, Georeferencing)
• Synchronization with data sources
• More interaction paradigms
• Privacy issues interlaced with hierarchical and idiosyncratic practices
Conclusions
• Hybridizing several semantic and retrieval technologies provides added value to a research organization
• Scalability works for CNR figures
• Interaction is a core selling point
• Try it at http://bit.ly/semanticscout
• @data_cnr_it, @semanticscout, @aldogangemi