Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)


Transcript of Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Page 1: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Question Answering over Linked Data

Challenges, Approaches & Trends

André Freitas and Christina Unger

Tutorial @ ESWC 2014

Page 2: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Understand the changes in the database landscape in the direction of more heterogeneous data scenarios, especially the Semantic Web.

Understand how Question Answering (QA) fits into this new scenario.

Give you the fundamental pointers to develop your own QA system from the state of the art.

Focus on coverage.

2

Goal of this Talk

Page 3: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Motivation & Context
Big Data, Linked Data & QA
Challenges for QA over LD
The Anatomy of a QA System
QA over Linked Data (Case Studies)
Evaluation of QA over Linked Data
Do-it-yourself (DIY): Core Resources
Trends
Take-away Message

4

Outline

Page 4: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

5

Motivation & Context

Page 5: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Humans come with built-in natural language communication capabilities.

A very natural way for humans to communicate information needs.

The archetypal AI system.

6

Why Question Answering?

Page 6: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

A research field on its own.

Empirical bias: Focus on the development and evaluation of approaches and systems to answer questions over a knowledge base.

Multidisciplinary:
◦ Natural Language Processing
◦ Information Retrieval
◦ Knowledge Representation
◦ Databases
◦ Linguistics
◦ Artificial Intelligence
◦ Software Engineering
◦ ...

7

What is Question Answering?

Page 7: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

8

What is Question Answering?

QA System

Knowledge Bases

Question: Who is the daughter of Bill Clinton married to?

Answer: Marc Mezvinsky

Page 8: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

From the QA expert perspective:
◦ QA depends on mastering different semantic computing techniques.

QA as a ‘Semantic Pentathlon’

9

Page 9: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Keyword Search:
◦ User still carries the major effort in interpreting the data.
◦ Satisfying information needs may depend on multiple search operations.
◦ Answer-driven information access.
◦ Input: keyword search
  Typically the specification of simpler information needs.
◦ Output: documents, structured data.

QA:
◦ Delegates more ‘interpretation effort’ to the machines.
◦ Query-driven information access.
◦ Input: natural language query
  Specification of complex information needs.
◦ Output: direct answer.

10

QA vs IR

Page 10: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Structured Queries:
◦ A priori user effort in understanding the schemas behind databases.
◦ Effort in mastering the syntax of a query language.
◦ Satisfying information needs may depend on multiple querying operations.
◦ Input: structured query
◦ Output: data records, aggregations, etc.

QA:
◦ Delegates more ‘semantic interpretation effort’ to the machine.
◦ Input: natural language query
◦ Output: direct natural language answer

11

QA vs Database Querying

Page 11: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Keyword search:
◦ Simple information needs.
◦ Predictable search behavior.
◦ Vocabulary redundancy (large document collections, Web).

Structured queries:
◦ Demand for absolute precision/recall guarantees.
◦ Small & centralized schemas.
◦ More data volume/less semantic heterogeneity.

QA:
◦ Specification of complex information needs.
◦ Heterogeneous and schema-less data.
◦ More automated semantic interpretation.

12

When to use?

Page 12: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

13

QA: Vision

Page 13: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

14

QA: Reality (FB Graph Search)

Page 14: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

15

QA: Reality (Watson)

Page 15: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

16

QA: Reality (Siri)

Page 16: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

QA: Reality (Google Knowledge Graph)

Page 17: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

QA is usually associated with the delegation of more of the ‘interpretation effort’ to the machines.

QA supports the specification of more complex information needs.

QA, Information Retrieval and Databases are complementary perspectives.

18

To Summarize

Page 18: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

19

Big Data, Linked Data

& QA

Page 19: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Big Data: More complete data-based picture of the world.

20

Big Data

Page 20: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Heterogeneous, complex and large-scale databases. Very large and dynamic “schemas”.

21

From Rigid Schemas to Schema-less

circa 2000: 10s-100s of attributes

circa 2013: 1,000s-1,000,000s of attributes

Page 21: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Multiple perspectives (conceptualizations) of reality. Ambiguity, vagueness, inconsistency.

22

Multiple Interpretations

Page 22: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Franklin et al. (2005): From Databases to Dataspaces.

Helland (2011): If You Have Too Much Data, then “Good Enough” Is Good Enough.

Fundamental trends:
◦ Co-existence of heterogeneous data.
◦ Semantic best-effort behavior.
◦ Pay-as-you-go data integration.
◦ Co-existing query/search services.

23

Big Data & Dataspaces

Page 23: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

24

Trend 1: Data size

Gantz & Reinsel, The Digital Universe in 2020 (2012).

Page 24: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

25

Trend 2: Connectedness

Eifrem, A NOSQL Overview And The Benefits Of Graph Database (2009).

Page 25: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Decentralization of content generation.

26

Trend 3

Eifrem, A NOSQL Overview And The Benefits Of Graph Database (2009)

Page 26: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Volume Velocity Variety

27

Big Data

Variety: the most interesting but usually neglected dimension.

Page 27: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

The Semantic Web Vision: Software which is able to understand meaning (intelligent, flexible).

QA as a way to support this vision.

QA & Semantic Web

28

Page 28: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

From which university did the wife of Barack Obama graduate?

With Linked Data we are still in the DB world

QA & Linked Data

29

Page 29: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

With Linked Data we are still in the DB world

(but slightly worse)

QA & Linked Data

30

Page 30: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Addresses practical problems of data accessibility in a data heterogeneity scenario.

A fundamental part of the original Semantic Web vision.

31

QA & Linked Data

Page 31: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

32

Challenges for QA over Linked Data

Page 32: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

From questions to answers

33

Page 33: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Example: What is the currency of the Czech Republic?

SELECT DISTINCT ?uri WHERE { res:Czech_Republic dbo:currency ?uri . }

Main challenges:

The gap between natural language and linked data

Mapping natural language expressions to vocabulary elements (accounting for lexical and structural differences).

Handling meaning variations (e.g. ambiguous or vague expressions, anaphoric expressions).

34

Page 34: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

URIs are language-independent identifiers.

Their only actual connection to natural language is by the labels that are attached to them.

dbo:spouse rdfs:label “spouse”@en , “echtgenoot”@nl .

Labels, however, do not capture lexical variation:

wife of
husband of
married to
...

Mapping natural language expressions to vocabulary elements

35

Page 35: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Mapping natural language expressions to vocabulary elements

Which Greek cities have more than 1 million inhabitants?

SELECT DISTINCT ?uri WHERE {
  ?uri rdf:type dbo:City .
  ?uri dbo:country res:Greece .
  ?uri dbo:populationTotal ?p .
  FILTER (?p > 1000000)
}

36

Page 36: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Often the conceptual granularity of language does not coincide with that of the data schema.

Mapping natural language expressions to vocabulary elements

When did Germany join the EU?

SELECT DISTINCT ?date WHERE { res:Germany dbp:accessioneudate ?date .}

Who are the grandchildren of Bruce Lee?

SELECT DISTINCT ?uri WHERE { res:Bruce_Lee dbo:child ?c . ?c dbo:child ?uri .}

37

Page 37: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

In addition, there are expressions with a fixed, dataset-independent meaning.

Mapping natural language expressions to vocabulary elements

Who produced the most films?

SELECT DISTINCT ?uri WHERE {
  ?x rdf:type dbo:Film .
  ?x dbo:producer ?uri .
}
ORDER BY DESC(COUNT(?x))
OFFSET 0 LIMIT 1

38

Page 38: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Different datasets usually follow different schemas, and thus provide different ways of answering an information need.


Meaning variation: Ambiguity

39

Page 39: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

The meaning of expressions like the verbs to be, to have, and prepositions of, with, etc. strongly depends on the linguistic context.

Meaning variation: Semantically light expressions

Which museum has the most paintings?

?museum dbo:exhibits ?painting .

Which country has the most caves?

?cave dbo:location ?country .

40

Page 40: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

The number of non-English actors on the web is growing substantially.
◦ Accessing data.
◦ Creating and publishing data.

Semantic Web: In principle very well suited for multilinguality, as URIs are language-independent. But adding multilingual labels is not common practice (less than a quarter of the RDF literals have language tags, and most of those tags are in English).

Multilinguality

41

Page 41: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Requirement: Completeness and accuracy (Wrong answers are worse than no answers)

In the context of the Semantic Web: QA systems need to deal with heterogeneous and imperfect data.
◦ Datasets are often incomplete.
◦ Different datasets sometimes contain duplicate information, often using different vocabularies even when talking about the same things.
◦ Datasets can also contain conflicting information and inconsistencies.

Data quality and data heterogeneity

42

Page 42: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Data is distributed among a large collection of interconnected datasets.

Distributed and linked datasets

Example: What are the side effects of drugs used for the treatment of Tuberculosis?

SELECT DISTINCT ?x
WHERE {
  disease:1154 diseasome:possibleDrug ?d1 .
  ?d1 a drugbank:drugs .
  ?d1 owl:sameAs ?d2 .
  ?d2 sider:sideEffect ?x .
}

43

Page 43: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Requirement: Real-time answers, i.e. low processing time.

In the context of the Semantic Web:

Datasets are huge.
◦ There are a lot of distributed datasets that might be relevant for answering the question.
◦ Reported performance of current QA systems amounts to ~20-30 seconds per question (on one dataset).

Performance and scalability

44

Page 44: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Bridge the gap between natural languages and data.

Deal with incomplete, noisy and heterogeneous datasets.

Scale to a large number of huge datasets.

Use distributed and interlinked datasets.

Integrate structured and unstructured data.

Low maintainability costs (easily adaptable to new datasets and domains).

Summary: Challenges

45

Page 45: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

The Anatomy of a QA System

46

Page 46: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Natural Language Interfaces (NLI)
◦ Input: Natural language queries
◦ Output: Either processed or unprocessed answers
  Processed: Direct answers (QA).
  Unprocessed: Database records, text snippets, documents.

47

NLI & QAs


Page 47: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Categorization of questions and answers. Important for:

◦ Understanding the challenges before attacking the problem.

◦ Scoping the QA system.

Based on:
◦ Chin-Yew Lin: Question Answering.
◦ Farah Benamara: Question Answering Systems: State of the Art and Future Directions.

48

Basic Concepts & Taxonomy

Page 48: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

The part of the question that says what is being asked:
◦ Wh-words: who, what, which, when, where, why, and how
◦ Wh-words + nouns, adjectives or adverbs: “which party …”, “which actress …”, “how long …”, “how tall …”.

49

Terminology: Question Phrase

Page 49: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Useful for distinguishing different processing strategies
◦ FACTOID:
  PREDICATIVE QUESTIONS:
  “Who was the first man in space?”
  “What is the highest mountain in Korea?”
  “How far is Earth from Mars?”
  “When did the Jurassic Period end?”
  “Where is Taj Mahal?”
  LIST: “Give me all cities in Germany.”
  SUPERLATIVE: “What is the highest mountain?”
  YES-NO: “Was Margaret Thatcher a chemist?”

50

Terminology: Question Type

Page 50: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Useful for distinguishing different processing strategies
◦ OPINION: “What do most Americans think of gun control?”
◦ CAUSE & EFFECT: “What is the most frequent cause for lung cancer?”
◦ PROCESS: “How do I make a cheese cake?”
◦ EXPLANATION & JUSTIFICATION: “Why did the revenue of IBM drop?”
◦ ASSOCIATION QUESTION: “What is the connection between Barack Obama and Indonesia?”
◦ EVALUATIVE OR COMPARATIVE QUESTIONS: “What is the difference between impressionism and expressionism?”

51

Terminology: Question Type

Page 51: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

The class of object sought by the question:
◦ Abbreviation
◦ Entity: event, color, animal, plant, ... 
◦ Description, Explanation & Justification: definition, manner, reason, ... (“How, why …”)
◦ Human: group, individual, ... (“Who …”)
◦ Location: city, country, mountain, ... (“Where …”)
◦ Numeric: count, distance, size, ... (“How many, how far, how long …”)
◦ Temporal: date, time, ... (from “When …”)

52

Terminology: Answer Type

Page 52: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Question focus is the property or entity that is being sought by the question:
◦ “In which city was Barack Obama born?”
◦ “What is the population of Galway?”

Question topic: What the question is generally about:
◦ “What is the height of Mount Everest?” (geography, mountains)
◦ “Which organ is affected by the Meniere’s disease?” (medicine)

53

Terminology: Question Focus & Topic

Page 53: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Structure level:
◦ Structured data (databases).
◦ Semi-structured data (e.g. XML).
◦ Unstructured data.

Data source:
◦ Single dataset (centralized).
◦ Enumerated list of multiple, distributed datasets.
◦ Web-scale.

54

Terminology: Data Source Type

Page 54: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Domain Scope:
◦ Open Domain
◦ Domain specific

Data Type:
◦ Text
◦ Image
◦ Sound
◦ Video

Multi-modal QA

55

Terminology: Domain Type

Page 55: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Long answers
◦ Definition/justification based.

Short answers
◦ Phrases.

Exact answers
◦ Named entities, numbers, aggregate, yes/no.

56

Terminology: Answer Format

Page 56: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Relevance: The degree to which the answer addresses the user’s information needs.

Correctness: The degree to which the answer is factually correct.

Conciseness: The answer should not contain irrelevant information.

Completeness: The answer should be complete.

Simplicity: The answer should be easy to interpret.

Justification: Sufficient context should be provided to support the data consumer in determining the correctness of the answer.

Answer Quality Criteria

57

Page 57: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Right: The answer is correct and complete.

Inexact: The answer is incomplete or incorrect.

Unsupported: The answer does not have appropriate evidence/justification.

Wrong: The answer is not appropriate for the question.

Answer Assessment

58

Page 58: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Simple Extraction: Direct extraction of snippets from the original document(s) / data records.

Combination: Combines excerpts from multiple sentences, documents / multiple data records, databases.

Summarization: Synthesis from large texts / data collections.

Operational/functional: Depends on the application of functional operators.

Reasoning: Depends on the application of an inference process over the original data.

59

Answer Processing

Page 59: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Semantic Tractability (Popescu et al., 2003): Vocabulary overlap/distance between the query and the answer.

Answer Locality (Webber et al., 2003): Whether answer fragments are distributed across different document fragments / documents or datasets/dataset records.

Derivability (Webber et al., 2003): Depends on whether the answer is explicit or implicit; level of reasoning dependency.

Semantic Complexity: Level of ambiguity and discourse/data heterogeneity.

60

Complexity of the QA Task

Page 60: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

61

High-level Components

Page 61: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Data pre-processing: Pre-processes the database data (includes indexing, data cleaning, feature extraction).

Question Analysis: Performs syntactic analysis and detects/extracts the core features of the question (NER, answer type, etc).

Data Matching: Matches terms in the question to entities in the data.

Query Construction: Generates structured query candidates considering the question-data mappings and the syntactic constraints in the query and in the database.

Scoring: Data matching and the query construction components output several candidates that need to be scored and ranked according to certain criteria.

Answer Retrieval & Extraction: Executes the query and extracts the natural language answer from the result set.

62

Main Components

Page 62: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

64

QA over Linked Data (Case Studies)

Page 63: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

ORAKEL & Pythia (Cimiano et al., 2007; Unger & Cimiano, 2011)
◦ Ontology-specific question answering

TBSL (Unger et al., 2012)
◦ Template-based question answering

AquaLog & PowerAqua (Lopez et al. 2006)
◦ Querying on a Semantic Web scale

Treo (Freitas et al. 2011, 2014)
◦ Schema-agnostic querying using distributional semantics

IBM Watson (Ferrucci et al., 2010)
◦ Large-scale evidence-based model for QA

65

QA Systems (Case Studies)

Page 64: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

QuestIO & FREyA (Damljanovic et al. 2010)
QAKiS (Cabrio et al. 2012)
Yahya et al., 2013

66

QA Systems (Not covered)

Page 65: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

67

ORAKEL (Cimiano et al., 2007) & Pythia (Unger & Cimiano, 2011)
Ontology-specific question answering

Page 66: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Key contributions:
◦ Using ontologies to interpret user questions
◦ Relies on a deep linguistic analysis that returns semantic representations aligned to the ontology vocabulary and structure

Evaluation: Geobase

68

ORAKEL (Cimiano et al. 2007) &Pythia (Unger & Cimiano 2011)

Page 67: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

69

ORAKEL (Cimiano et al. 2007) &Pythia (Unger & Cimiano 2011)

Ontology-based QA:
◦ Ontologies play a central role in interpreting user questions
◦ Output is a meaning representation that is aligned to the ontology underlying the dataset that is queried
◦ Ontological knowledge is used for drawing inferences, e.g. for resolving ambiguities

Grammar-based QA:
◦ Relies on linguistic grammars that assign a syntactic and semantic representation to lexical units
◦ Advantage: can deal with questions of arbitrary complexity
◦ Drawback: brittleness (fails if a question cannot be parsed because expressions or constructs are not covered by the grammar)

Page 68: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

70

Grammars

Ontology-independent entries
◦ mostly function words: quantifiers (some, every, two), wh-words (who, when, where, which, how many), negation (not)
◦ manually specified and re-usable for all domains

Ontology-specific entries

◦ content words and phrases corresponding to concepts and properties in the ontology

◦ automatically generated from an ontology lexicon

Page 69: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Aim: capture rich and structured linguistic information about how ontology elements are lexicalized in a particular language

Ontology Lexicon

lemon (Lexicon Model for Ontologies)
http://lemon-model.net

◦ meta-model for describing ontology lexica with RDF

◦ declarative (abstracting from specific syntactic and semantic theories)

◦ separation of lexicon and ontology

71

Page 70: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Semantics by reference:
◦ The meaning of lexical entries is specified by pointing to elements in the ontology.

Ontology Lexicon

72

Page 71: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Architecture

73

Page 72: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Which cities have more than three universities?

Example

SELECT DISTINCT ?x WHERE {
  ?x rdf:type dbo:City .
  ?y rdf:type dbo:University .
  ?y dbo:city ?x .
}
GROUP BY ?x
HAVING (COUNT(?y) > 3)

74

Page 73: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

75

TBSL (Unger et al., 2012)
Template-based question answering

Page 74: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Key contributions:
◦ Constructs a query template that directly mirrors the linguistic structure of the question
◦ Instantiates the template by matching natural language expressions with ontology concepts

Evaluation: QALD 2012

76

TBSL (Unger et al., 2012)

Page 75: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

In order to understand a user question, we need to understand:

Motivation

The words (dataset-specific):
Abraham Lincoln → res:Abraham_Lincoln
died in → dbo:deathPlace

The semantic structure (dataset-independent):
who → SELECT ?x WHERE { … }
the most N → ORDER BY DESC(COUNT(?N)) LIMIT 1
more than i N → HAVING COUNT(?N) > i

77

Page 76: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Goal: An approach that combines both an analysis of the semantic structure and a mapping of words to URIs.

Two-step approach:
◦ 1. Template generation
  Parse the question to produce a SPARQL template that directly mirrors the structure of the question, including filters and aggregation operations.
◦ 2. Template instantiation
  Instantiate the SPARQL template by matching natural language expressions with ontology concepts, using statistical entity identification and predicate detection.

Template-based QA

78

Page 77: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Example: Who produced the most films?

SPARQL template:

SELECT DISTINCT ?x WHERE {
  ?y rdf:type ?c .
  ?y ?p ?x .
}
ORDER BY DESC(COUNT(?y))
OFFSET 0 LIMIT 1

?c CLASS [films]
?p PROPERTY [produced]

Instantiations:

?c = <http://dbpedia.org/ontology/Film>
?p = <http://dbpedia.org/ontology/producer>

79
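To make the instantiation step concrete, here is a minimal Python sketch (illustrative only, not TBSL's actual code): a template with typed slots is filled with the highest-ranked URI candidate for each slot; the template string and the bindings mirror the example above.

TEMPLATE = """SELECT DISTINCT ?x WHERE {
  ?y rdf:type ?c .
  ?y ?p ?x .
}
ORDER BY DESC(COUNT(?y))
OFFSET 0 LIMIT 1"""

# Candidate URIs per slot, e.g. produced by entity/predicate matching.
bindings = {
    "?c": "<http://dbpedia.org/ontology/Film>",
    "?p": "<http://dbpedia.org/ontology/producer>",
}

def instantiate(template, slots):
    # Replace each slot variable with the URI selected for it.
    for var, uri in slots.items():
        template = template.replace(var, uri)
    return template

print(instantiate(TEMPLATE, bindings))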

Page 78: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Architecture

80

Page 79: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 1: Template generation (linguistic processing)

81

Page 80: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

1. The natural language question is tagged with part-of-speech information.

2. Based on POS tags, grammar entries are built on the fly.
◦ Grammar entries are pairs of:
  tree structures (Lexicalized Tree Adjoining Grammar)
  semantic representations (ext. Discourse Representation Structures)

3. These lexical entries, together with domain-independent lexical entries, are used for parsing the question (cf. Pythia).

4. The resulting semantic representation is translated into a SPARQL template.

Step 1: Template generation (linguistic processing)

82

Page 81: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Domain-independent: who, the most
Domain-dependent: produced/VBD, films/NNS

Example: Who produced the most films?

SPARQL template 1:

SELECT DISTINCT ?x WHERE {
  ?x ?p ?y .
  ?y rdf:type ?c .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

?c CLASS [films]
?p PROPERTY [produced]

SPARQL template 2:

SELECT DISTINCT ?x WHERE {
  ?x ?p ?y .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

?p PROPERTY [films]

83

Page 82: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 2: Template instantiation (entity identification and predicate detection)

84

Page 83: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

1. For resources and classes, a generic approach to entity detection is applied:

◦ Identify synonyms of the label using WordNet.
◦ Retrieve entities with a label similar to the slot label based on string similarities (trigram, Levenshtein and substring similarity).

2. For property labels, the label is additionally compared to natural language expressions stored in the BOA pattern library.

3. The highest ranking entities are returned as candidates for filling the query slots.

Step 2: Template instantiation (entity identification and predicate detection)

85
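As a concrete illustration of the string similarities in step 1, a minimal sketch of trigram (Dice) similarity in Python (an illustration, not TBSL's implementation; Levenshtein and substring similarity would be combined analogously):

def trigrams(s):
    s = "  " + s.lower() + " "  # pad so that short strings still yield trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    # Dice coefficient over character trigrams.
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))

# Compare a slot label against the labels of candidate URIs:
print(trigram_similarity("produced", "producer"))       # relatively high
print(trigram_similarity("produced", "wine produced"))  # lower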

Page 84: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Example: Who produced the most films?

?c CLASS [films]

<http://dbpedia.org/ontology/Film>
<http://dbpedia.org/ontology/FilmFestival>
...

?p PROPERTY [produced]

<http://dbpedia.org/ontology/producer>
<http://dbpedia.org/property/producer>
<http://dbpedia.org/ontology/wineProduced>

86

Page 85: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 3: Query ranking and selection

87

Page 86: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

1. Every entity receives a score considering string similarity and prominence.

2. The score of a query is then computed as the average of the scores of the entities used to fill its slots.

3. In addition, type checks are performed:
◦ For all triples ?x rdf:type <class>, all query triples ?x p e and e p ?x are checked w.r.t. whether the domain/range of p is consistent with <class>.

4. Of the remaining queries, the one with highest score that returns a result is chosen.

Step 3: Query ranking and selection

88
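A minimal sketch of this scoring scheme (illustrative; the weighting parameter alpha below is a hypothetical choice, not a value from the paper):

def entity_score(similarity, prominence, alpha=0.7):
    # alpha is a hypothetical weight balancing string similarity and prominence.
    return alpha * similarity + (1 - alpha) * prominence

def query_score(slot_scores):
    # Step 2: the query score is the average over the scores of its slot fillers.
    return sum(slot_scores) / len(slot_scores)

film = entity_score(similarity=0.9, prominence=0.8)       # e.g. dbo:Film
producer = entity_score(similarity=0.85, prominence=0.6)  # e.g. dbo:producer
print(query_score([film, producer]))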

Page 87: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Example: Who produced the most films?

SELECT DISTINCT ?x WHERE {
  ?x <http://dbpedia.org/ontology/producer> ?y .
  ?y rdf:type <http://dbpedia.org/ontology/Film> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

Score: 0.76

SELECT DISTINCT ?x WHERE {
  ?x <http://dbpedia.org/ontology/producer> ?y .
  ?y rdf:type <http://dbpedia.org/ontology/FilmFestival> .
}
ORDER BY DESC(COUNT(?y)) LIMIT 1

Score: 0.60

89

Page 88: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

The created template structure does not always coincide with how the data is actually modelled.

Considering all possibilities of how the data could be modelled leads to a large number of templates (and even more queries) for a single question.

Main drawback

90

Page 89: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

91

AquaLog & PowerAqua (Lopez et al. 2006)
Querying on a Semantic Web scale

Page 90: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Key contributions:
◦ Pioneering work on QA over Semantic Web data.
◦ Semantic similarity mapping.

Terminological Matching:
◦ WordNet-based
◦ Ontology-based
◦ String similarity
◦ Sense-based similarity matcher

Evaluation: QALD (2011). Extends the AquaLog system.

92

PowerAqua (Lopez et al. 2006)

Page 91: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

93

WordNet

Page 92: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Two words are strongly similar if any of the following holds:
◦ 1. They have a synset in common (e.g. “human” and “person”).
◦ 2. A word is a hypernym/hyponym in the taxonomy of the other word.
◦ 3. There exists an allowable “is-a” path connecting a synset associated with each word.
◦ 4. If any of the previous cases is true and the definition (gloss) of one of the synsets of the word (or its direct hypernyms/hyponyms) includes the other word as one of its synonyms, the words are considered highly similar.

94

Sense-based similarity matcher

Lopez et al. 2006
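Conditions 1 and 2 can be sketched with NLTK's WordNet interface (a sketch that assumes nltk and its WordNet corpus are installed; it is not AquaLog/PowerAqua's actual implementation):

from nltk.corpus import wordnet as wn

def strongly_similar(w1, w2):
    s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
    # Condition 1: the two words share a synset.
    if s1 & s2:
        return True
    # Condition 2: a synset of one word is a direct hypernym/hyponym
    # of a synset of the other word.
    for syn in s1:
        if (set(syn.hypernyms()) | set(syn.hyponyms())) & s2:
            return True
    return False

print(strongly_similar("human", "person"))  # the example pair from the slide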

Page 93: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

95

PowerAqua (Lopez et al. 2006)

Page 94: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

96

Treo (Freitas et al. 2011, 2014)
Schema-agnostic querying using distributional semantics

Page 95: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Key contributions:
◦ Distributional semantic relatedness matching model.
◦ Compositional-distributional model for QA.

Terminological Matching:
◦ Explicit Semantic Analysis (ESA)
◦ String similarity + node cardinality

Evaluation: QALD (2011)

97

Treo (Freitas et al. 2011, 2014)

Page 96: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Treo (Irish): Direction

98

Page 97: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Simple Queries (Video)

99

Page 98: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

More Complex Queries (Video)

100

Page 99: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Vocabulary Search (Video)

101

Page 100: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Treo Answers Jeopardy Queries (Video)

102

Page 101: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Semantics for a Complex World

• Most semantic models have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions.

• If these idealizations are removed, it is not clear at all that modern semantics can give a full account of all but the simplest models/statements.

Sahlgren, 2013

(Diagram: Formal World vs. Real World.)

103

Baroni et al. 2013

Page 102: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantics

“Words occurring in similar (linguistic) contexts are semantically related.”

If we can equate meaning with context, we can simply record the contexts in which a word occurs in a collection of texts (a corpus).

This can then be used as a surrogate of its semantic representation.

104

Page 103: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantic Model

(Diagram: words such as “child”, “husband” and “spouse” represented as vectors over context dimensions c1 … cn; each component is a function of the number of times the word occurs in that context, e.g. 0.7 and 0.5. “Commonsense is here.”)

105

Page 104: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Semantic Relatedness

(Diagram: semantic relatedness as the angle θ between word vectors such as “child”, “husband” and “spouse” in the context space c1 … cn. “Works as a semantic ranking function.”)

106
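The idea can be made concrete with a toy distributional model (a minimal sketch: the corpus below is invented for illustration, and Treo actually uses Explicit Semantic Analysis over Wikipedia rather than raw co-occurrence counts). Words become co-occurrence vectors, and relatedness is the cosine of the angle θ between them:

import math
from collections import Counter

corpus = [
    "the daughter is the child of her parents",
    "the husband is the spouse in a marriage",
    "the spouse and the child live together",
]

def vector(word):
    # Co-occurrence counts of `word` with the other words in its sentences.
    v = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            v.update(t for t in tokens if t != word)
    return v

def relatedness(w1, w2):
    # Cosine of the angle between the two co-occurrence vectors.
    v1, v2 = vector(w1), vector(w2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Used as a semantic ranking function over candidate dataset terms:
for term in ["child", "spouse"]:
    print(term, relatedness("daughter", term))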

Page 105: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Approach Overview

(Diagram: Query → Query Analysis → Query Features → Query Planner → Query Plan; core semantic approximation & composition operations run over the Ƭ-Space, which combines the database with commonsense knowledge derived from large-scale unstructured data via distributional semantics.)

107

Page 106: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Approach Overview

(Diagram: the same pipeline instantiated with Wikipedia as the unstructured corpus, Explicit Semantic Analysis as the distributional model, and RDF as the database.)

108

Page 107: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Ƭ-Space


109

Page 108: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Ƭ-Space: The vector space is segmented by the instances.

110

Page 109: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Core Operations

111

Page 110: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Core Operations

Search & Composition Operations


112

Page 111: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Search and Composition Operations

Instance search
◦ Proper nouns
◦ String similarity + node cardinality

Class (unary predicate) search
◦ Nouns, adjectives and adverbs
◦ String similarity + distributional semantic relatedness

Property (binary predicate) search
◦ Nouns, adjectives, verbs and adverbs
◦ Distributional semantic relatedness

Navigation

Extensional expansion
◦ Expands the instances associated with a class.

Operator application
◦ Aggregations, conditionals, ordering, position

Disjunction & Conjunction

Disambiguation dialog (instance, predicate)

113

Page 112: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Core Principles

Minimize the impact of Ambiguity, Vagueness, Synonymy.

Address the simplest matchings first (heuristics).

Semantic Relatedness as a primitive operation.

Distributional semantics as commonsense knowledge.

114

Page 113: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Transform natural language queries into triple patterns.

“Who is the daughter of Bill Clinton married to?”

Query Pre-Processing (Question Analysis)

115

Page 114: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 1: POS Tagging
◦ Who/WP
◦ is/VBZ
◦ the/DT
◦ daughter/NN
◦ of/IN
◦ Bill/NNP
◦ Clinton/NNP
◦ married/VBN
◦ to/TO
◦ ?/.

Query Pre-Processing (Question Analysis)

116
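The same step can be reproduced with off-the-shelf tools, for instance NLTK (a sketch assuming the NLTK tokenizer and tagger models are installed; Treo's actual tagger may differ):

import nltk

question = "Who is the daughter of Bill Clinton married to?"
tokens = nltk.word_tokenize(question)
print(nltk.pos_tag(tokens))
# e.g. [('Who', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('daughter', 'NN'), ...]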

Page 115: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 2: Core Entity Recognition
◦ Rule-based: POS tags + TF/IDF

Who is the daughter of Bill Clinton married to? (“Bill Clinton”: PROBABLY AN INSTANCE)

Query Pre-Processing (Question Analysis)

117

Page 116: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 3: Determine answer type
◦ Rule-based.

Who is the daughter of Bill Clinton married to? (PERSON)

Query Pre-Processing (Question Analysis)

118

Page 117: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 4: Dependency parsing
◦ dep(married-8, Who-1)
◦ auxpass(married-8, is-2)
◦ det(daughter-4, the-3)
◦ nsubjpass(married-8, daughter-4)
◦ prep(daughter-4, of-5)
◦ nn(Clinton-7, Bill-6)
◦ pobj(of-5, Clinton-7)
◦ root(ROOT-0, married-8)
◦ xcomp(married-8, to-9)

Query Pre-Processing (Question Analysis)

119

Page 118: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 5: Determine Partial Ordered Dependency Structure (PODS)
◦ Rule-based.
  Remove stop words.
  Merge words into entities.
  Reorder the structure from the core entity position.

Query Pre-Processing (Question Analysis)

(Diagram: PODS “Bill Clinton → daughter → married to”, with “Bill Clinton” marked as INSTANCE and “Person” as the ANSWER TYPE / QUESTION FOCUS.)

120

Page 119: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Step 5: Determine Partial Ordered Dependency Structure (PODS)
◦ Rule-based.
  Remove stop words.
  Merge words into entities.
  Reorder the structure from the core entity position.

Query Pre-Processing (Question Analysis)

(Diagram: the resulting query features: “Bill Clinton” (INSTANCE) → “daughter” (PREDICATE) → “married to” (PREDICATE), with answer type “Person”.)

121

Page 120: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Map query features into a query plan. A query plan contains a sequence of:
◦ Search operations.
◦ Navigation operations.

Query Planning

Query features: Bill Clinton (INSTANCE) → daughter (PREDICATE) → married to (PREDICATE)

Query Plan:
(1) INSTANCE SEARCH (Bill Clinton)
(2) DISAMBIGUATE ENTITY TYPE
(3) GENERATE ENTITY FACETS
(4) p1 <- SEARCH RELATED PREDICATE (Bill Clinton, daughter)
(5) e1 <- GET ASSOCIATED ENTITIES (Bill Clinton, p1)
(6) p2 <- SEARCH RELATED PREDICATE (e1, married to)
(7) e2 <- GET ASSOCIATED ENTITIES (e1, p2)
(8) POST PROCESS (Bill Clinton, e1, p1, e2, p2)

122
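A minimal, self-contained sketch of how such a plan executes over a toy graph (the data, the relatedness scores and the helper functions below are illustrative stand-ins; the real operations run over an indexed Ƭ-Space with ESA-based relatedness):

graph = {  # (subject, predicate) -> objects
    (":Bill_Clinton", ":child"): [":Chelsea_Clinton"],
    (":Bill_Clinton", ":religion"): [":Baptists"],
    (":Chelsea_Clinton", ":spouse"): [":Marc_Mezvinsky"],
}

sem_rel = {  # stub for the distributional relatedness scores
    ("daughter", ":child"): 0.054, ("daughter", ":religion"): 0.004,
    ("married to", ":spouse"): 0.062,
}

def search_related_predicate(entity, query_term):
    # Pick the predicate of `entity` most related to the query term.
    predicates = [p for (s, p) in graph if s == entity]
    return max(predicates, key=lambda p: sem_rel.get((query_term, p), 0.0))

def get_associated_entities(entity, predicate):
    return graph[(entity, predicate)]

pivot = ":Bill_Clinton"                            # (1) instance search
p1 = search_related_predicate(pivot, "daughter")   # (4)
e1 = get_associated_entities(pivot, p1)[0]         # (5) :Chelsea_Clinton
p2 = search_related_predicate(e1, "married to")    # (6)
print(get_associated_entities(e1, p2))             # (7) [':Marc_Mezvinsky']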

Page 121: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Core Entity Search

Query: Bill Clinton → daughter → married to (Person)

Linked Data: :Bill_Clinton

Entity Search

123

Page 122: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantic Search

Query: Bill Clinton → daughter → married to (Person)

Linked Data: :Bill_Clinton (PIVOT ENTITY), with associated triples:
:child → :Chelsea_Clinton
:religion → :Baptists
:almaMater → :Yale_Law_School
...

124

Page 123: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantic Search

Query: Bill Clinton → daughter → married to (Person)

Which properties are semantically related to ‘daughter’?

Linked Data: :Bill_Clinton (PIVOT ENTITY), with associated triples:
:child → :Chelsea_Clinton (sem_rel(daughter, child) = 0.054)
:religion → :Baptists (sem_rel(daughter, religion) = 0.004)
:almaMater → :Yale_Law_School (sem_rel(daughter, alma mater) = 0.001)
...

125

Page 124: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantic Search

Query: Bill Clinton → daughter → married to (Person)

Linked Data: :Bill_Clinton → :child → :Chelsea_Clinton

126

Page 125: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Computation of a measure of “semantic proximity” between two terms.

Allows a semantic approximate matching between query terms and dataset terms.

It supports a commonsense reasoning-like behavior based on the knowledge embedded in the corpus.

Distributional Semantic Relatedness

127

Page 126: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Use distributional semantics to semantically match query terms to predicates and classes.

Distributional principle: Words that co-occur together tend to have related meanings.
◦ Allows the creation of a comprehensive semantic model from unstructured text.
◦ Based on statistical patterns over large amounts of text.
◦ No human annotations.

Distributional semantics can be used to compute a semantic relatedness measure between two words.

Distributional Semantic Search

128

Page 127: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantic Search

Query: Bill Clinton → daughter → married to (Person)

Linked Data: :Bill_Clinton → :child → :Chelsea_Clinton (the new PIVOT ENTITY)

129

Page 128: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantic Search

Query: Bill Clinton → daughter → married to (Person)

Linked Data: :Bill_Clinton → :child → :Chelsea_Clinton → :spouse → :Marc_Mezvinsky

130

Page 129: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Architecture

133

Page 130: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Index Structure

134

Page 131: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Results

135

Page 132: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Post-Processing

136

Page 133: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

What is the highest mountain?

Second Query Example

PODS (query features): mountain (CLASS) → highest (OPERATOR)

137

Page 134: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Entity Search

Query: mountain → highest

Linked Data: :Mountain (PIVOT ENTITY), with :typeOf links to its instances

138

Page 135: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Extensional Expansion

Query: mountain → highest

Linked Data: :Mountain (PIVOT ENTITY), expanded via :typeOf to its instances: :Everest, :K2, ...

139

Page 136: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Distributional Semantic Matching

Query: mountain → highest

Linked Data: instances :Everest, :K2, ... (via :typeOf from the PIVOT ENTITY :Mountain), with candidate properties :elevation, :location, :deathPlaceOf, ...

140

Page 137: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Get all numerical values

Query: mountain → highest

Linked Data: :Everest → :elevation → 8848 m ; :K2 → :elevation → 8611 m ; ...

141

Page 138: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Apply the operator’s functional definition (SORT, TOP_MOST)

Query: mountain → highest

Linked Data: :Everest → :elevation → 8848 m ; :K2 → :elevation → 8611 m ; ...

142
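A minimal sketch of the operator application (illustrative): collect the numeric values retrieved for the instances and apply the operator's functional definition, with SORT/TOP_MOST collapsing to a maximum.

elevations = {":Everest": 8848, ":K2": 8611}  # values retrieved via :elevation

def top_most(values):
    # SORT by value and take the top element.
    return max(values, key=values.get)

print(top_most(elevations))  # :Everest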

Page 139: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Results

143

Page 140: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Semantic approximation in databases (as in any IR system): semantic best-effort.

Need some level of user disambiguation, refinement and feedback.

As we move in the direction of semantic systems, we should expect the need for principled dialog mechanisms (as in human communication).

Pull the user interaction back into the system.

From Exact to Approximate

144

Page 141: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

From Exact to Approximate

145

Page 142: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

User Feedback

146

Page 143: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

147

IBM Watson (Ferrucci et al., 2010)
Large-scale evidence-based model for QA

Page 144: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Key contributions:
◦ Evidence-based QA system.
◦ Complex and high-performance QA pipeline.
◦ The system uses more than 50 scoring components that produce scores ranging from probabilities and counts to categorical features.
◦ Major cultural impact:
  Before: QA as an AI vision, an academic exercise.
  After: QA as an attainable software architecture in the short term.

Evaluation: Jeopardy! Challenge

148

IBM Watson

Page 145: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Lexical Answer Type

Ferrucci et al. 2010

149

Page 146: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Question analysis: includes shallow and deep parsing, extraction of logical forms, semantic role labelling, coreference resolution, relation extraction, named entity recognition, among others.

Question decomposition: decomposition of the question into separate phrases, which will generate constraints that need to be satisfied by evidence from the data.

Query processing

150

Page 147: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Query processing

Ferrucci et al. 2010

(Diagram: Question → Question & Topic Analysis → Question Decomposition.)

151

Page 148: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Hypothesis generation:
◦ Primary search
  Document and passage retrieval; SPARQL queries are used over triple stores.
◦ Candidate answer generation (maximizing recall).
  Information extraction techniques are applied to the search results to generate candidate answers.

Soft filtering: Application of lightweight (less resource-intensive) scoring algorithms to a larger set of initial candidates to prune the list of candidates before the more intensive scoring components.

Hypothesis Generation

152

Page 149: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Hypothesis Generation

Ferrucci et al. 2010

(Diagram: Question → Question & Topic Analysis → Question Decomposition → Hypothesis Generation, with Primary Search over the Answer Sources followed by Candidate Answer Generation → Hypothesis and Evidence Scoring.)

153

Page 150: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Hypothesis and evidence scoring:
◦ Supporting evidence retrieval
  Seeks additional evidence for each candidate answer from the data sources, while the deep evidence scoring step determines the degree of certainty that the retrieved evidence supports the candidate answers.
◦ Deep evidence scoring
  Scores are then combined into an overall evidence profile which groups individual features into aggregate evidence dimensions.

Evidence Scoring

154

Page 151: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Evidence Scoring

Ferrucci et al. 2010

(Diagram: the pipeline extended with Hypothesis and Evidence Scoring: Evidence Retrieval over the Evidence Sources feeds Deep Evidence Scoring and Answer Scoring.)

155

Page 152: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Answer merging: a step that merges answer candidates (hypotheses) with different surface forms but related content, combining their scores.

Ranking and confidence estimation: ranks the hypotheses and estimates their confidence based on the scores, using machine learning approaches over a training set. Multiple trained models cover different question types.

Answer Generation

156
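A minimal sketch of the merging step (illustrative: the candidates are invented, and Watson's actual merging and confidence estimation rely on trained models per question type):

from collections import defaultdict

candidates = [  # (surface form, canonical form, score)
    ("J.F.K.", "John F. Kennedy", 0.41),
    ("John F. Kennedy", "John F. Kennedy", 0.38),
    ("Lincoln", "Abraham Lincoln", 0.12),
]

merged = defaultdict(float)
for surface, canonical, score in candidates:
    merged[canonical] += score  # combine scores of equivalent hypotheses

ranked = sorted(merged.items(), key=lambda kv: -kv[1])
print(ranked[0])  # e.g. ('John F. Kennedy', 0.79)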

Page 153: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Answer Generation (Ferrucci et al. 2010; Farrell, 2011)

(Diagram: the full DeepQA pipeline: Question & Topic Analysis, Question Decomposition, Hypothesis Generation (Primary Search, Candidate Answer Generation over the Answer Sources), Hypothesis and Evidence Scoring (Evidence Retrieval, Deep Evidence Scoring over the Evidence Sources, Answer Scoring), Synthesis, and Final Confidence Merging & Ranking with learned models that help combine and weigh the evidence, producing the Answer & Confidence.)

157

www.medbiq.org/sites/default/files/presentations/2011/Farrell.pp

Page 154: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Architecture

(Diagram: one question yields multiple interpretations, 100s of possible answers, 1,000s of pieces of evidence, and 100,000s of scores from many simultaneous text analysis algorithms over 100s of sources, flowing through Question & Topic Analysis, Question Decomposition, Hypothesis Generation, Hypothesis and Evidence Scoring, Synthesis, and Final Confidence Merging & Ranking to the Answer & Confidence.)

Ferrucci et al. 2010, Farrell, 2011

158 www.medbiq.org/sites/default/files/presentations/2011/Farrell.pp

UIMA for interoperability

UIMA-AS for scale-out and speed

Page 155: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Evidence Profile

Ferrucci et al. 2010

159

Page 156: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

160

Evaluation of QA over Linked Data

Page 157: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Test Collection
◦ Questions
◦ Datasets
◦ Answers (gold standard)

Evaluation Measures

Evaluation Campaigns

161

Page 158: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Measures how complete the answer set is: the fraction of relevant instances that are retrieved.

Which are the Jovian planets in the Solar System?
◦ Returned answers:
  Mercury
  Jupiter
  Saturn

Recall

Gold standard:
– Jupiter
– Saturn
– Neptune
– Uranus

Recall = 2/4 = 0.5

162

Page 159: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Measures how accurate the answer set is: the fraction of retrieved instances that are relevant.

Which are the Jovian planets in the Solar System?
◦ Returned answers:
  Mercury
  Jupiter
  Saturn

Precision

Gold standard:
– Jupiter
– Saturn
– Neptune
– Uranus

Precision = 2/3 = 0.67

163

Page 160: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Measures the ranking quality. The reciprocal rank of a query is 1/r, where r is the rank at which a system returns the first relevant result.

Mean Reciprocal Rank (MRR)

Which are the Jovian planets in the Solar System?
Returned answers:
– Mercury
– Jupiter
– Saturn

Gold standard:
– Jupiter
– Saturn
– Neptune
– Uranus

rr = 1/2 = 0.5

164
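The three measures can be computed directly from their definitions; a short Python sketch for the running example:

returned = ["Mercury", "Jupiter", "Saturn"]
gold = {"Jupiter", "Saturn", "Neptune", "Uranus"}

relevant_returned = [a for a in returned if a in gold]

recall = len(relevant_returned) / len(gold)         # 2/4 = 0.5
precision = len(relevant_returned) / len(returned)  # 2/3 = 0.67
# Reciprocal rank: 1/r for the first relevant result (Jupiter at rank 2).
rr = next((1 / (i + 1) for i, a in enumerate(returned) if a in gold), 0.0)

print(recall, precision, rr)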

Page 161: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Query execution time
Indexing time
Index size
Dataset adaptation effort (indexing time)
Semantic enrichment/disambiguation
◦ # of operations/time

Additional measures

Page 162: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Question Answering over Linked Data (QALD-CLEF)
INEX Linked Data Track
BioASQ
SemSearch

Test Collections

168

Page 163: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

QALD
http://www.sc.cit-ec.uni-bielefeld.de/qald/

169

Page 164: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

QALD is a series of evaluation campaigns on question answering over linked data.
◦ QALD-1 (ESWC 2011)
◦ QALD-2, as part of the workshop Interacting with Linked Data (ESWC 2012)
◦ QALD-3 (CLEF 2013)
◦ QALD-4 (CLEF 2014)

It is aimed at all kinds of systems that mediate between a user, expressing his or her information need in natural language, and semantic data.

QALD

170

Page 165: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

QALD-4 is part of the Question Answering track at CLEF 2014:

http://nlp.uned.es/clef-qa/

Tasks:
◦ 1. Multilingual question answering over DBpedia
◦ 2. Biomedical question answering on interlinked data
◦ 3. Hybrid question answering

QALD-4

171

Page 166: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Task: Given a natural language question or keywords, either retrieve the correct answer(s) from a given RDF repository, or provide a SPARQL query that retrieves these answer(s).

◦ Dataset: DBpedia 3.9 (with multilingual labels)
◦ Questions: 200 training + 50 test
◦ Seven languages: English, Spanish, German, Italian, French, Dutch, Romanian

Task 1: Multilingual question answering

172

Page 167: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Example

<question id = "36" answertype = "resource" aggregation = "false" onlydbo = "true" >

Through which countries does the Yenisei river flow?
Durch welche Länder fließt der Yenisei?
¿Por qué países fluye el río Yenisei?
...

PREFIX res: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?uri WHERE {
  res:Yenisei_River dbo:country ?uri .
}

173

Page 168: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Datasets: SIDER, Diseasome, Drugbank
Questions: 25 training + 25 test
Require integration of information from different datasets.

Example: What are the side effects of drugs used for Tuberculosis?

Task 2: Biomedical question answering on interlinked data

SELECT DISTINCT ?x WHERE {
  disease:1154 diseasome:possibleDrug ?v2 .
  ?v2 a drugbank:drugs .
  ?v3 owl:sameAs ?v2 .
  ?v3 sider:sideEffect ?x .
}

174

Page 169: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Dataset: DBpedia 3.9 (with English abstracts)
Questions: 25 training + 10 test
Require both structured data and free text from the abstract to be answered.

Example: Give me the currencies of all G8 countries.

Task 3: Hybrid question answering

SELECT DISTINCT ?uri WHERE {
  ?x text:"member of" text:"G8" .
  ?x dbo:currency ?uri .
}

175

Page 170: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Focuses on the combination of textual and structured data.

Datasets:
◦ English Wikipedia (MediaWiki XML format)
◦ DBpedia 3.8 & YAGO2 (RDF)
◦ Links among the Wikipedia, DBpedia 3.8, and YAGO2 URIs

Tasks:
◦ Ad-hoc Task: return a ranked list of results in response to a search topic formulated as a keyword query (144 search topics).
◦ Jeopardy Task: investigate retrieval techniques over a set of natural-language Jeopardy clues (105 search topics: 74 from 2012 + 31 from 2013).

INEX Linked Data Track (2013)

https://inex.mmci.uni-saarland.de/tracks/lod/

181

Page 171: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

INEX Linked Data Track (2013)

182

Page 172: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Focuses on entity search over Linked Datasets.

Datasets:
◦ Sample of Linked Data crawled from publicly available sources (based on the Billion Triple Challenge 2009).

Tasks:
◦ Entity Search: queries that refer to one particular entity; a small sample of the Yahoo! search query log.
◦ List Search: queries asking for objects that match particular criteria; these queries have been hand-written by the organizing committee.

SemSearch Challenge

http://semsearch.yahoo.com/datasets.php#

183

Page 173: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

List Search queries:
◦ republics of the former Yugoslavia
◦ ten ancient Greek city kingdoms of Cyprus
◦ the four of the companions of the prophet
◦ Japanese-born players who have played in MLB
◦ where the British monarch is also head of state
◦ nations where Portuguese is an official language
◦ bishops who sat in the House of Lords
◦ Apollo astronauts who walked on the Moon

SemSearch Challenge

184

Page 174: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Entity Search queries:
◦ 1978 cj5 jeep
◦ employment agencies w. 14th street
◦ nyc zip code
◦ waterville Maine
◦ LOS ANGELES CALIFORNIA
◦ ibm
◦ KARL BENZ
◦ MIT

SemSearch Challenge

185

Page 175: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Baseline

Balog & Neumayer, A Test Collection for Entity Search in DBpedia (2013).

186

Page 176: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Datasets:
◦ PubMed documents

Tasks:
◦ 1a: Large-Scale Online Biomedical Semantic Indexing: automatic annotation of PubMed documents; training data is provided.
◦ 1b: Introductory Biomedical Semantic QA: 300 questions and related material (concepts, triples and gold answers).

BioASQ

187

Page 177: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Metrics, Statistics, Tests (Tetsuya Sakai, IR):
◦ http://www.promise-noe.eu/documents/10156/26e7f254-1feb-4169-9204-1c53cc1fd2d7

Building Test Collections (IR Evaluation, Ian Soboroff):
◦ http://www.promise-noe.eu/documents/10156/951b6dfb-a404-46ce-b3bd-4bbe6b290bfd

Going Deeper

188

Page 178: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

189

Do-it-yourself (DIY): Core Resources

Page 179: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

DBpedia
◦ http://dbpedia.org/

YAGO
◦ http://www.mpi-inf.mpg.de/yago-naga/yago/

Freebase
◦ http://www.freebase.com/

Wikipedia dumps
◦ http://dumps.wikimedia.org/

ConceptNet
◦ http://conceptnet5.media.mit.edu/

Common Crawl
◦ http://commoncrawl.org/

Where to use:
◦ As a commonsense KB or as a data source

Datasets

190

Page 180: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

High domain coverage:
◦ ~95% of Jeopardy! answers
◦ ~98% of TREC answers

Wikipedia is entity-centric.
Curated link structure.

Complementary tools:
◦ Wikipedia Miner

Where to use:
◦ Construction of distributional semantic models
◦ As a commonsense KB

Wikipedia

191

Page 181: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

WordNet
◦ http://wordnet.princeton.edu/

Wiktionary
◦ http://www.wiktionary.org/
◦ API: https://www.mediawiki.org/wiki/API:Main_page

FrameNet
◦ https://framenet.icsi.berkeley.edu/fndrupal/

VerbNet
◦ http://verbs.colorado.edu/~mpalmer/projects/verbnet.html

English lexicon for DBpedia 3.8 (in the lemon format)
◦ http://lemon-model.net/lexica/dbpedia_en/

PATTY (collection of semantically-typed relational patterns)
◦ http://www.mpi-inf.mpg.de/yago-naga/patty/

BabelNet
◦ http://babelnet.org/

Where to use:
◦ Query expansion (see the WordNet sketch below)
◦ Semantic similarity
◦ Semantic relatedness
◦ Word sense disambiguation

Lexical Resources

192
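As a small illustration of query expansion with one of these resources, here is a sketch using WordNet through NLTK (assumes `pip install nltk` and a downloaded WordNet corpus; the helper name is ours):

# Expand a question term with WordNet synonyms and hypernyms
# (run nltk.download('wordnet') once beforehand).
from nltk.corpus import wordnet as wn

def expand(term):
    expansions = set()
    for synset in wn.synsets(term):
        expansions.update(l.name().replace("_", " ") for l in synset.lemmas())
        for hyper in synset.hypernyms():
            expansions.update(l.name().replace("_", " ") for l in hyper.lemmas())
    return expansions - {term}

print(expand("river"))  # e.g. {'stream', 'watercourse', ...}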

Page 182: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Lucene & Solr
◦ http://lucene.apache.org/

Terrier
◦ http://terrier.org/

Where to use:
◦ Answer retrieval
◦ Scoring
◦ Query-data matching (see the Solr sketch below)

Indexing & Search Engines

193
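A minimal sketch of how such an index is typically used for query-data matching, via Solr's HTTP API (assumes a local Solr instance with a core named 'labels' holding 'label' and 'uri' fields; the core name and schema are illustrative):

# Look up KB entities whose labels match question terms via a local Solr core.
import requests

def search_labels(query, rows=5):
    resp = requests.get(
        "http://localhost:8983/solr/labels/select",
        params={"q": "label:(%s)" % query, "wt": "json", "rows": rows},
    )
    return resp.json()["response"]["docs"]

for doc in search_labels("Yenisei river"):
    print(doc.get("uri"), doc.get("label"))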

Page 183: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

GATE (General Architecture for Text Engineering)
◦ http://gate.ac.uk/

NLTK (Natural Language Toolkit)
◦ http://nltk.org/

Stanford NLP
◦ http://www-nlp.stanford.edu/software/index.shtml

LingPipe
◦ http://alias-i.com/lingpipe/index.html

Where to use:
◦ Question analysis (see the NLTK sketch below)

Text Processing Tools

194
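As a taste of question analysis with one of these toolkits, a minimal NLTK sketch (assumes the 'punkt' and 'averaged_perceptron_tagger' models have been downloaded):

# Tokenize and POS-tag a question; content words are the usual anchors
# for the later query-data matching step.
import nltk

question = "Through which countries does the Yenisei river flow?"
tagged = nltk.pos_tag(nltk.word_tokenize(question))
content = [word for word, tag in tagged if tag.startswith(("NN", "VB"))]
print(content)  # e.g. ['countries', 'does', 'Yenisei', 'river', 'flow']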

Page 184: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

MALT
◦ http://www.maltparser.org/
◦ Languages (pre-trained): English, French, Swedish

Stanford parser
◦ http://nlp.stanford.edu/software/lex-parser.shtml
◦ Languages: English, German, Chinese, and others

CHAOS
◦ http://art.uniroma2.it/external/chaosproject/
◦ Languages: English, Italian

C&C Parser
◦ http://svn.ask.it.usyd.edu.au/trac/candc

Where to use:
◦ Question analysis (see the dependency-parse sketch below)

Parsers

195
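For example, a dependency parse of a question can be obtained with the Stanford tools through NLTK's CoreNLP client (a wrapper that postdates this tutorial; it assumes a CoreNLP server running on localhost:9000):

# Dependency-parse a question; the triples expose relations such as the
# verb's subject and prepositional attachments used in query construction.
from nltk.parse.corenlp import CoreNLPDependencyParser

parser = CoreNLPDependencyParser(url="http://localhost:9000")
parse, = parser.raw_parse("Who is the daughter of Bill Clinton married to?")
for governor, relation, dependent in parse.triples():
    print(governor, relation, dependent)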

Page 185: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

NERD (Named Entity Recognition and Disambiguation)
◦ http://nerd.eurecom.fr/

Stanford Named Entity Recognizer
◦ http://nlp.stanford.edu/software/CRF-NER.shtml

FOX (Federated Knowledge Extraction Framework)
◦ http://fox.aksw.org

DBpedia Spotlight
◦ http://spotlight.dbpedia.org

Where to use:
◦ Question analysis
◦ Query-data matching (see the Spotlight sketch below)

Named Entity Recognition/Linking

196
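As an illustration, entity linking with DBpedia Spotlight's REST interface can be scripted in a few lines (endpoint path and parameters follow the service's public documentation at the time; the URL may have moved since):

# Annotate a question with DBpedia entities via the Spotlight REST API.
import requests

resp = requests.get(
    "http://spotlight.dbpedia.org/rest/annotate",
    params={"text": "Who is the daughter of Bill Clinton married to?",
            "confidence": 0.4},
    headers={"Accept": "application/json"},
)
for resource in resp.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])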

Page 186: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Wikipedia Miner
◦ http://wikipedia-miner.cms.waikato.ac.nz/

WS4J (Java API for several semantic relatedness algorithms)
◦ https://code.google.com/p/ws4j/

SecondString (string matching)
◦ http://secondstring.sourceforge.net

EasyESA (distributional semantic relatedness)
◦ http://treo.deri.ie/easyESA

Sspace (distributional semantics framework)
◦ https://github.com/fozziethebeat/S-Space

Where to use:
◦ Query-data matching (see the sketch below)
◦ Semantic relatedness & similarity
◦ Word sense disambiguation

String similarity and semantic relatedness

197
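For query-data matching, even the standard library gives a usable first cut; the sketch below ranks KB predicate labels by surface similarity to a question phrase (real systems combine this with the semantic relatedness tools above, since surface similarity alone misses cases like "married to" vs. "spouse"):

# Rank candidate predicate labels by string similarity to a question phrase.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

predicates = ["spouse", "birthPlace", "country", "currency"]
phrase = "place of birth"
print(max(predicates, key=lambda p: similarity(phrase, p)))  # birthPlace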

Page 187: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

DIRT
◦ Paraphrase Collection: http://aclweb.org/aclwiki/index.php?title=DIRT_Paraphrase_Collection
◦ Demo: http://demo.patrickpantel.com/demos/lexsem/paraphrase.htm

PPDB (The Paraphrase Database)
◦ http://www.cis.upenn.edu/~ccb/ppdb/

Where to use:
◦ Query-data matching (see the PPDB sketch below)

Text Entailment

198
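One common use is paraphrase-based expansion of relation phrases; the sketch below loads a PPDB package into a lookup table (it assumes the plain-text release whose records are '|||'-separated, with the phrase and its paraphrase in the second and third fields; check the release notes for the exact layout, and the file name here is hypothetical):

# Build a phrase -> paraphrases table from a local PPDB file.
from collections import defaultdict

def load_ppdb(path):
    table = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [field.strip() for field in line.split("|||")]
            if len(fields) >= 3:
                table[fields[1]].add(fields[2])
    return table

paraphrases = load_ppdb("ppdb-1.0-s-lexical")  # hypothetical local copy
print(paraphrases.get("married to", set()))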

Page 188: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Apache UIMA
◦ http://uima.apache.org/

Open Advancement of Question Answering Systems (OAQA)
◦ http://oaqa.github.io/

OKBQA
◦ http://www.okbqa.org/documentation
◦ https://github.com/okbqa

Where to use:
◦ Component integration (see the pipeline sketch below)

NLP Pipelines

199
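To show what component integration means in practice, here is a deliberately simplified, hypothetical pipeline interface in Python; it only mimics the component-based spirit of UIMA/OAQA, whose real APIs are Java-based and considerably richer:

# Chain QA stages (question analysis, query-data matching, ...) over a
# shared state dictionary; each stage reads and enriches the state.
def pipeline(stages):
    def run(state):
        for stage in stages:
            state = stage(state)
        return state
    return run

def analyze(state):
    state["entities"] = ["Bill Clinton"]        # placeholder analysis
    return state

def match(state):
    state["predicates"] = ["spouse", "child"]   # placeholder matching
    return state

qa = pipeline([analyze, match])
print(qa({"question": "Who is the daughter of Bill Clinton married to?"}))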

Page 189: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

200

Trends

Page 190: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Querying distributed linked data
Integration of structured and unstructured data
User interaction and context mechanisms
Integration of reasoning (deductive, inductive, counterfactual, abductive, ...) into QA approaches and test collections
Measuring confidence and answer uncertainty
Multilinguality
Machine learning
Reproducibility in QA research

Research Topics/Opportunities

201

Page 191: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Big/Linked Data demand new, principled semantic approaches to cope with the scale and heterogeneity of the data.

Part of the Semantic Web/AI vision can be addressed today with a multi-disciplinary perspective:
◦ Linked Data, IR and NLP

The multidisciplinarity of the QA problem can serve to show what semantic computing has achieved and what can be transported to other types of information systems.

Challenges are moving from the construction of basic QA systems to more sophisticated semantic functionalities.

Very active research area.

Take-away Message

202

Page 192: Question Answering over Linked Data: Challenges, Approaches & Trends (Tutorial @ ESWC 2014)

Unger, C., Freitas, A., Cimiano, P.: An Introduction to Question Answering over Linked Data. Reasoning Web Summer School, 2014.

Introductory Reference

203
