Information Communication Theory Kentaro Inui ( 乾 健太郎 ) Naoaki Okazaki ( 岡崎 直観 )...

Post on 12-Jan-2016


Information Communication Theory

Kentaro Inui ( 乾 健太郎 )

Naoaki Okazaki ( 岡崎 直観 )

2011-10-04 Information Communication Theory ( 情報伝達学 ) 1

(情報伝達学)

Course Plan
• Part I ( Okazaki )
  • 10/04: Introduction
  • 10/11: Classification
  • 10/18: Part-of-speech tagging
  • 10/25: Syntactic parsing
  • 11/01: Statistical parsing
• Part II ( Inui )
  • 11/08: Features and unification
  • 11/15: Representation of meaning
  • 11/22: Computational semantics
  • 11/29: Computational lexical semantics
  • 12/06: (no class)
• Part III ( Inui, Okazaki, TAs )
  • 12/13, 12/20, 2013/01/10, 01/17, 01/24
  • Programming exercises and project from Natural Language Processing with Python ( by Steven Bird )
  • Lectures given at the large computer lab ( 計算機大演習室 ), 1st floor of the New Student Laboratory Building for Information Engineering ( 情報新棟 )

Course Format

• Text ( optional )
  • Jurafsky, Daniel and Martin, James H. Speech and Language Processing. Prentice-Hall, 2009 ( 2nd Edition )
    • ~ ¥6,000 available at amazon.co.jp
  • Bird, Steven et al. Natural Language Processing with Python. O'Reilly, 2009
    • Japanese translation: 萩原 正人,中山 敬広,水野 貴明 (translators), 『入門 自然言語処理』, O'Reilly Japan, 2010
• Grading
  • Exercises ( given in lectures ) : 40%
  • Final report ( programming project )

Handouts

• If necessary, please print the handout yourself and bring it to class
  • Alternatively, browse it on your laptop
• Handouts will be available (before dawn) at:
  • http://www.cl.ecei.tohoku.ac.jp/index.php?InformationCommunicationTheory
  • Username: nlp2012
  • Password: chukougishitsu

Contact Information

• Office hours:
  • Tue, 1:00-2:30pm or by appointment
• Office:
  • Room 305 ( 108 after Nov ), Electrical Engineering and Applied Physics Research Building No.3 (電気系 3号館)
• Contact:
  • inui@ecei.tohoku.ac.jp @inuikentaro
  • okazaki@ecei.tohoku.ac.jp @chokkanorg

Introduction

Naoaki Okazaki

okazaki@ecei.tohoku.ac.jp

http://www.chokkan.org/

http://twitter.com/#!/chokkanorg

#nlptohoku

http://www.chokkan.org/lectures/2012nlp/p/01.pdf


Natural Language Processing (NLP)
• Giving computers the ability to process human language
• As old as the idea of computers themselves!
• Implementations and implications of the exciting idea
• The long-awaited dream (that has not come true yet)

[Figure: Doraemon, C-3PO (Star Wars), and Atom (Astro Boy)]

What needs to be done to understand languages as humans do?

Part I: Knowledge (disciplines)

Lexical semantics ( 語彙意味論 )

Meaning of words

How much Chinese silk was exported to Western Europe by the end of the 18th century?

[Figure: compass rose (N, S, E, W)]

Compositional semantics ( 合成意味論 )

Meaning of constituents

How much Chinese silk was exported to Western Europe by the end of the 18th century?

[Figure: timeline from 1700 to 1800 labeled "The 18th Century", with "the end of" marked]

Compositional??? (with adjectives)

• former girl friend
• black hole
• white towel
• white wine

!?

Morphology ( 形態論 )

Study of word formation (breaking words down into morphemes)

How much Chinese silk was exported to Western Europe by the end of the 18th century?

• Inflection ( 屈折 )
  • is – was – being – been
  • export – exports – exporting – exported – exported
• Derivation ( 派生 )
  • China – Chinese
  • West – Western
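As a rough illustration of inflection, the sketch below maps the inflected forms listed above back to a base form. The `IRREGULAR` table, the suffix rules, and the function name `lemmatize` are assumptions made for this example; a real morphological analyzer covers far more cases (this one would wrongly strip the final "s" from a noun like "class").

```python
# Toy lemmatizer: an illustrative assumption, not a real morphological analyzer.
IRREGULAR = {"is": "be", "was": "be", "being": "be", "been": "be"}

# (suffix, replacement) pairs tried in order for regular inflection.
SUFFIX_RULES = [("ing", ""), ("ed", ""), ("s", "")]


def lemmatize(word):
    """Return a guessed base form for an inflected English verb."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three letters to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


forms = ["export", "exports", "exporting", "exported", "is", "was", "been"]
print([lemmatize(f) for f in forms])
```

Irregular forms need a lookup table because no suffix rule relates "was" to "be", while the regular forms of "export" are handled by stripping a suffix.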

Syntax ( 統語論,文法 )

Principles and rules for constructing phrases and sentences

• Part-of-speech (POS): Lecture #3
  • Categorization of words, e.g., nouns, verbs, adjectives, adverbs
• Constituency: Lectures #4 and #5
  • Grouping words that may behave as a single unit or phrase
  • e.g., noun phrase, verb phrase, prepositional phrase
• Grammatical relations: Lecture #5
  • Relationship between words/constituents

Syntactic tagging and parsing

• Assign a structure to an input sentence

Economic news had little effect on financial markets .

• POS tagging: JJ NN VBD JJ NN IN JJ NNS
• Constituent parsing: [Figure: phrase-structure tree with NP, PP, VP, PU, and S nodes]
• Dependency parsing: [Figure: dependency arcs labeled nmod, sbj, obj, pc, and p]

Nivre and Kübler (2006)
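To make "assigning a structure" concrete, here is a minimal sketch of how a tagged sentence and a dependency analysis might be stored. The tokens, POS tags, and label names come from the slide; the exact head-dependent arcs are an assumed reconstruction (the figure itself is not available), and the helper name `dependents` is hypothetical.

```python
# Tokens and POS tags are taken from the slide; "." is tagged "." here
# because the slide's tag row omits the final punctuation tag (assumption).
tokens = ["Economic", "news", "had", "little", "effect",
          "on", "financial", "markets", "."]
tags = ["JJ", "NN", "VBD", "JJ", "NN", "IN", "JJ", "NNS", "."]

# Dependency arcs as (head index, label), one per token; 0 means the root.
# The label set comes from the slide, but the exact arcs are an assumed
# reconstruction of the lost figure.
arcs = [(2, "nmod"), (3, "sbj"), (0, "root"), (5, "nmod"), (3, "obj"),
        (5, "nmod"), (8, "nmod"), (6, "pc"), (3, "p")]


def dependents(head):
    """Return (word, label) pairs for tokens governed by a 1-based head index."""
    return [(tokens[i], label)
            for i, (h, label) in enumerate(arcs) if h == head]


print(list(zip(tokens, tags)))
print(dependents(3))  # dependents of "had"
```

Storing one head index per token is a common compact encoding of a dependency tree: every word except the root has exactly one head.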

Semantic role ( 意味役割 )

How much Chinese silk was exported to Western Europe by the end of the 18th century?
• "by the end of the 18th century": TEMPORAL

How much Chinese silk was exported to Western Europe by southern merchants?
• "by southern merchants": AGENT

[Figure: timeline from 1700 to 1800 labeled "The 18th Century"]

Coreference ( 共参照 )

U: Where is The Green Hornet playing in Mountain View?

S: The Green Hornet is playing at the Century 16 theatre.

U: When is it playing there?

S: It’s playing at 2pm, 5pm, and 8pm.

U: I’d like 1 adult and 2 children for the first show.

How much would that cost?

• What does “it” refer to?
• What does “the first show” refer to?
• What does “that” refer to?

We can guess these easily!


How words like that or pronouns like it refer to previous parts of the discourse

Pragmatics ( 語用論 )

Actions that speakers intend by their use of text

• Bob: Are you coming to the party?
• Jane: I’m afraid I can’t.

• Bob: Are you coming to the party?
• Jane: You know, I’m really busy.

• Bob: Could you pass me the sugar?
• Jane: Yes. Here you are.

Discourse ( 談話 )


http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf

Coherent structured groups of text

Various knowledge about languages
• Morphology ( 形態論 ): meaningful components within words
• Syntax ( 文法 ): structural relationships between words
• Semantics ( 意味論 ): meanings of words, phrases, sentences
• Discourse ( 談話 ): relationships across/beyond different sentences or statements; contextual processing
• Pragmatics ( 語用論 ): relationship of meaning to the goals and intentions of speakers; how we use languages to communicate
• World knowledge ( 世界知識 ): facts of the world; common sense

What needs to be done to understand languages as humans do?

Part II: Ambiguity

Ambiguity

• We may build multiple, alternative linguistic structures and interpretations for a single input
  • I made her duck (see more examples later)
• Disambiguation (or resolution): to decide which linguistic/semantic structure/interpretation is the most appropriate (in the context)

Part-of-speech tagging and ambiguity

Time flies like an arrow .

• NN VBZ IN DT NN . (Time passes as swiftly as an arrow.)
• VB NNS IN DT NN . (Measure the speed of flies as you would measure an arrow.)
• NN NNS VBP DT NN . ("Time flies", a kind of fly, are fond of an arrow.)
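The ambiguity above arises because several words allow more than one tag. The following sketch enumerates every tag sequence permitted by a small hand-made lexicon; the lexicon is an assumption that covers only the readings shown on the slide, and `candidate_taggings` is a hypothetical helper name.

```python
from itertools import product

# Possible tags per word: a hand-made lexicon assumed for this example,
# covering only the readings shown on the slide.
LEXICON = {
    "time": ["NN", "VB"],
    "flies": ["VBZ", "NNS"],
    "like": ["IN", "VBP"],
    "an": ["DT"],
    "arrow": ["NN"],
}


def candidate_taggings(words):
    """Enumerate every tag sequence the lexicon allows for the words."""
    return list(product(*(LEXICON[w.lower()] for w in words)))


sentence = "Time flies like an arrow".split()
taggings = candidate_taggings(sentence)
print(len(taggings))
for tagging in taggings:
    print(" ".join(tagging))
```

With two choices for each of three words, the lexicon allows eight sequences, of which the slide lists three as meaningful readings. A POS tagger has to pick the most plausible one in context.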

Attachment ambiguity

• I saw the girl on the hill with a telescope.

[Figures, slides 1/3 to 3/3: alternative readings of the sentence, two per slide]
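How quickly attachment choices multiply can be checked by counting parse trees. The sketch below runs a CKY-style dynamic program over a toy grammar; the grammar, the lexicon, and the function name `count_parses` are assumptions made for this example and are not taken from the lecture.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for this example).
BINARY_RULES = [
    ("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
    ("NP", "NP", "PP"), ("NP", "Det", "N"), ("PP", "P", "NP"),
]
LEXICON = {
    "I": "NP", "saw": "V", "the": "Det", "a": "Det", "girl": "N",
    "hill": "N", "telescope": "N", "on": "P", "with": "P",
}


def count_parses(words, root="S"):
    """Count parse trees with a CKY-style dynamic program."""
    n = len(words)
    # chart[(i, j)][X] = number of ways category X derives words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        chart[(i, i + 1)][LEXICON[word]] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += (
                        chart[(i, k)][left] * chart[(k, j)][right]
                    )
    return chart[(0, n)][root]


sentence = "I saw the girl on the hill with a telescope".split()
print(count_parses(sentence))
```

Under this toy grammar the sentence has five parses: each prepositional phrase can attach to the verb phrase or to a preceding noun phrase, as long as the attachments do not cross. A different grammar would give a different count.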

Coordination ambiguity

• Put [[the insects in the box] and [the bowl on the table]]

• Put the insects in [[the box] and [the bowl on the table]]

Semantic ambiguity
• Syntactic structure alone is insufficient to represent the meaning
• Distinction between syntax and semantics
  • Colorless green ideas sleep furiously (Chomsky, 1957)
• Opposite
  • John bought a book from Mary vs Mary sold a book to John
• Lexical ambiguity
  • I went to the bank… (of the river) or (to get some money)
• Quantifier
  • Every man loves a woman
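The quantifier example, "Every man loves a woman", has two readings that differ only in quantifier scope. They can be written in first-order logic as follows; the predicate names man, woman, and loves are chosen here for illustration.

```latex
% Reading 1: each man loves some (possibly different) woman.
\forall x\, \bigl( \mathrm{man}(x) \rightarrow \exists y\, ( \mathrm{woman}(y) \land \mathrm{loves}(x, y) ) \bigr)

% Reading 2: there is one particular woman whom every man loves.
\exists y\, \bigl( \mathrm{woman}(y) \land \forall x\, ( \mathrm{man}(x) \rightarrow \mathrm{loves}(x, y) ) \bigr)
```

Both readings share the same syntactic structure, which is why syntax alone cannot distinguish them.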


The state of the art of Natural Language Processing

Commercial world

• A lot of exciting stuff is going on…

Machine translation (Google)

Watson (IBM)
• Question answering system built on IBM’s DeepQA technology
• On 14-16 February 2011, Watson beat two human competitors: the biggest all-time money winner on Jeopardy! and the record holder for the longest championship streak
• Hardware
  • 2880 processor cores (3.5 GHz POWER7 eight-core processors)
  • 16 TB RAM in total
• Software
  • Written in Java and C++
  • Using the Apache Hadoop framework for distributed computing
• Data
  • 200M pages (about 1M books) of structured and unstructured content
  • Consuming 4 TB of disk storage
  • Encyclopedias, dictionaries, thesauri, newswire articles, literary works

Source: http://en.wikipedia.org/wiki/Watson_(computer)

Jeopardy!

• American quiz show featuring
  • history, literature, the arts, pop culture, science, sports, geography, wordplay, etc.
• Six categories are announced, each with five trivia clues
• A correct response adds the dollar value
• An incorrect response or a failure to respond within a five-second time limit deducts the dollar value

Source: http://en.wikipedia.org/wiki/Jeopardy!

Final Jeopardy! and the Future of Watson

• Watch the video (08:58):
  • http://www.youtube.com/watch?v=Wq0XnBYC3nQ

Science behind an answer

• Watch the very nice video (06:42):
  • http://www.youtube.com/watch?v=DywO4zksfXw

Science behind an answer
• Step 1: Question analysis
  • What type of question is being asked?
  • What is the question asking for?
• Step 2: Hypothesis generation
  • Search millions of documents for possible answers
• Step 3: Hypothesis and evidence scoring
  • Collect positive and negative evidence for each answer
  • Score evidence based on everything from source material reliability to whether time and locations appear correct
  • Parallelized evidence scoring for each possible answer
• Step 4: Final merging and ranking
  • Learn the importance of each piece of evidence by practicing games
  • Yield the final ranking of possible answers
  • Decide whether Watson answers the question or not based on the confidence
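Steps 3 and 4 can be pictured with a toy sketch: each candidate answer carries evidence scores, a weighted combination turns them into a confidence, and the system answers only when the best confidence is high enough. This is not IBM's implementation. The feature names, weights, threshold, logistic combination, and candidate data are all made up for illustration.

```python
import math

# Hypothetical feature weights; in the slide's terms these would be
# learned "by practicing games". The values here are made up.
WEIGHTS = {"source_reliability": 1.5, "time_match": 1.0, "location_match": 0.5}
BIAS = -1.5
THRESHOLD = 0.5  # answer only when confidence exceeds this (assumed value)


def confidence(evidence):
    """Combine evidence scores into a confidence in (0, 1) via a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in evidence.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_and_decide(candidates):
    """Rank candidate answers and decide whether to answer at all."""
    ranked = sorted(
        ((answer, confidence(ev)) for answer, ev in candidates.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    best_answer, best_conf = ranked[0]
    return ranked, (best_answer if best_conf > THRESHOLD else None)


# Made-up evidence scores in [0, 1] for two candidate answers.
candidates = {
    "candidate_a": {"source_reliability": 0.4, "time_match": 0.2, "location_match": 0.0},
    "candidate_b": {"source_reliability": 0.9, "time_match": 0.8, "location_match": 1.0},
}
ranked, decision = rank_and_decide(candidates)
print(ranked)
print(decision)
```

Returning `None` when no candidate clears the threshold mirrors the last bullet: declining to answer can be better than losing the clue's dollar value on a low-confidence guess.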

A shame (of NLP)

• Japanese translation of the book, “Einstein: His Life and Universe,” published on 23 June 2011
• Chapter 13 was translated by computers, not by humans!
  • How this happened: http://www.amazon.co.jp/review/R29GQAF5DUOAEW/ref=cm_cr_rdp_perm
• It is very rare for a machine-translated book to be published
• Revised version was published on 17 Aug 2011

Imagine the original sentence
• Published Japanese text (machine translation): ボルンの妻のヘートヴィヒに最大限にしてください。(そのヘートヴィヒは,彼の家族に関する彼の処理,今や説教された頃,彼が「自分がそのかなり不幸な回答に駆り立てられるのを許容していないべきでない」と自由に彼に叱った)。以上は,彼が目立つべきであり,彼女が言ったのを「科学の人里離れている寺」に尊敬します。
• Original English: Max Born's wife, Hedwig, who had freely scolded Einstein about his treatment of his family, now lectured, “[You should] not have allowed yourself to be goaded into that rather unfortunate reply.” He should show more respect, she said, for “the secluded temple of science.” (P286)

Passing exams for the University of Tokyo

Writing short science fiction

Goal of this course

• Give an overview of the issues and technologies for natural language understanding
  • What is possible/easy? What is impossible/difficult?
  • Why is this achieved or not achieved by the current technology?
• Provide fundamental theories and techniques for natural language processing
  • Some techniques are useful for other research fields
• Practice programming with real NLP tasks
  • You will be an experienced engineer!

Course plan

1. 4 Oct: Introduction

2. 11 Oct: Classification
  • Spam filtering, linear classifier, feature extraction, perceptron, logistic regression, evaluation (precision, recall, F1)

3. 18 Oct: Part-of-speech tagging

4. 25 Oct: Syntactic parsing

5. 1 Nov: Statistical parsing
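The evaluation measures named for the 11 Oct lecture (precision, recall, F1) can be sketched as below for a spam filter. The function name `precision_recall_f1` and the six example labels are assumptions made for illustration, not course material.

```python
def precision_recall_f1(gold, predicted, positive="spam"):
    """Compute precision, recall, and F1 for one class from label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for six messages (illustrative data only).
gold = ["spam", "spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham"]
print(precision_recall_f1(gold, predicted))
```

Precision is the fraction of messages flagged as spam that really are spam, recall is the fraction of true spam that was flagged, and F1 is their harmonic mean.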
