A Preliminary Study of Syntactic Generalization Using MDL
Mike Dowman
2007-06-07
What should Syntactic Theory Explain?
• Which sentences are grammatical and which are not
or
• How to transform observed sentences into a grammar
E → learning → I

Children transform observed sentences (E)
into psychological knowledge of language (I)
Poverty of the Stimulus
• Evidence available to children is utterances produced by other speakers
• No direct cues to sentence structure
• Or to word categories
So children need prior knowledge of possible structures: Universal Grammar (UG)
How should we study syntax?
Linguists’ Approach:
• Choose some sentences
• Decide on grammaticality of each one
• Make a grammar that accounts for which of these sentences are grammatical and which are not

Informant → sentences → Linguist → grammar
Computational Linguists’ Approach (Unsupervised Learning)
• Take a corpus
• Extract as much information from the corpus as accurately as possible
or
• Learn a grammar that describes the corpus as accurately as possible

corpus → grammar, lexical items, language model, etc.
Which approach gives more insight into language?
Linguists tend to aim for high precision
• But only produce very limited and arbitrary coverage
Computational linguists tend to obtain much better coverage
• But don’t account for any body of data completely correctly
• And tend to learn only simpler kinds of structure
Approaches seem to be largely complementary
Which approach gives more insight into the human mind?
The huge size and complexity of languages is one of their key distinctive properties
The linguists’ approach doesn’t account for this
So should we apply our algorithms to large corpora of naturally occurring data?
This won’t directly address the kind of issue that linguists focus on
Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear can we just assume they’re not grammatical?
No – we only see a tiny proportion of possible, grammatical sentences
People generalize from examples they have seen to form new utterances
‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)
Minimum Description Length (MDL)
MDL may be able to solve the poverty of the stimulus problem
Prefers the grammar that results in the simplest overall description of data
• So prefers simple grammars
• And grammars that result in simple descriptions of the data

“Simplest” means specifiable using the least amount of information
[Figure: a sequence of diagrams building up the same picture. The observed sentences form a small region inside the space of possible sentences. Three candidate grammars are drawn around them: a simple but non-constraining grammar covering most of the space, a complex but constraining grammar hugging the observed sentences, and a grammar that is a good fit to the data, lying between the two.]
MDL and Bayes’ Rule
P(h|d) ∝ P(h) P(d|h)
• h is a hypothesis (= grammar)
• d is some data (= sentences)
The probability of a grammar given some data is proportional to its a priori probability times how likely the observed sentences would be if that grammar were correct
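Taking negative logs turns this product into a sum of code lengths, which is exactly the MDL connection. A minimal numeric sketch (the probabilities are toy values, not from the talk):

```python
import math

# -log2 P(h|d) = -log2 P(h) - log2 P(d|h) + const, so the most probable
# grammar is the one with the shortest overall description.

def description_length(p_h, p_d_given_h):
    """Grammar bits plus data bits for one hypothesis."""
    return -math.log2(p_h) - math.log2(p_d_given_h)

# Simple grammar: cheap to state, but fits the data loosely.
simple = description_length(p_h=2**-10, p_d_given_h=2**-50)    # 60.0 bits
# Complex grammar: expensive to state, but fits the data tightly.
complex_ = description_length(p_h=2**-40, p_d_given_h=2**-15)  # 55.0 bits

print(simple, complex_)  # MDL (and Bayes) prefer the 55-bit hypothesis
```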
MDL and Prior (Innate?) Bias
• MDL solves the difficult problem of deciding prior probability for each grammar
• But MDL is still subjective – the prior bias is just hidden in the formalism chosen to represent grammars, and in the encoding scheme
Why it has to be MDL
Many machine learning techniques have been applied in computational linguistics
MDL is very rarely used
Only modest success at learning grammatical structure from corpora
So why MDL?
Maximum Likelihood
Maximum likelihood can be seen as a special case of MDL in which the a priori probability of all hypotheses P(h) is equal
But the hypothesis that only the observed sentences are grammatical will result in the maximum likelihood
So ML can only be applied if there are restrictions on how well the estimated parameters can fit the data
The degree of generality of the grammars is set externally, not determined by the Maximum Likelihood principle
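A toy calculation makes this concrete (assumed numbers; the generalizing grammar is penalised purely for spreading probability over an unseen sentence):

```python
import math

# Tiny numeric illustration: maximum likelihood alone favours memorising
# the observed sentences over a grammar that generalizes.

observed = ["john screamed", "john died", "mary screamed"]

# "Memorising" hypothesis: uniform over exactly the observed sentences.
memorise = {s: 1 / 3 for s in observed}
# Generalising grammar: 2 NPs x 2 VPs = 4 equally likely sentences.
generalise = {s: 1 / 4 for s in observed + ["mary died"]}

def log_likelihood(model):
    return sum(math.log2(model[s]) for s in observed)

print(log_likelihood(memorise))    # ≈ -4.75: memorising wins on likelihood
print(log_likelihood(generalise))  # -6.0: ML alone rejects the generalization
```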
Maximum Entropy
Make the grammar as unrestrictive as possible
But constraints must be used to prevent a grammar just allowing any combination of words to be a grammatical sentence
Again the degree of generality of grammars is determined externally
Neither Maximum Likelihood nor Maximum Entropy provide a principle that can decide when to make generalizations
Learning Phrase Structure Grammars
• Binary or non-branching rules:
  S B C
  B E
  C tomato
• All derivations start from special symbol S
Encoding Grammars
Grammars can be coded as lists of rules, each a triple of symbols
• A null symbol in the 3rd position indicates a non-branching rule
• The first symbol is the rule’s left-hand side, the second and third its right-hand side
S, B, C, B, E, null, C, tomato, null
Statistical Encoding of Grammars
• First we encode the frequency of each symbol
• Then encode each symbol using the frequency information
S, B, C, B, E, null, C, tomato, null
I(S) = -log(1/9)
I(null) = -log(2/9)
Uncommon symbols have a higher coding length than common ones
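This coding scheme can be sketched in a few lines (illustrative code, not the model’s implementation), applied to the nine-symbol grammar string above:

```python
import math
from collections import Counter

# Frequency-based coding: each symbol's code length is -log2 of its
# relative frequency, so common symbols get shorter codes.

symbols = ["S", "B", "C", "B", "E", "null", "C", "tomato", "null"]
freq = Counter(symbols)          # S:1, B:2, C:2, E:1, null:2, tomato:1
total = sum(freq.values())       # 9

def info(sym):
    """Code length in bits: -log2 of the symbol's relative frequency."""
    return -math.log2(freq[sym] / total)

print(f"I(S) = {info('S'):.2f} bits")        # -log2(1/9) ≈ 3.17
print(f"I(null) = {info('null'):.2f} bits")  # -log2(2/9) ≈ 2.17
print(f"Total grammar code length = {sum(info(s) for s in symbols):.1f} bits")
```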
1 S NP VP
2 NP john
3 NP mary
4 VP screamed
5 VP died
Data encoding: 1, 2, 4, 1, 2, 5, 1, 3, 4
There is a restricted range of choices at each stage of the derivation
Fewer choices = higher probability
Encoding Data
Data:
John screamed
John died
Mary screamed

If we record the frequency of each rule, this information can help us make a more efficient encoding
• 1 S NP VP (3)
• 2 NP john (2)
• 3 NP mary (1)
• 4 VP screamed (2)
• 5 VP died (1)

Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3 …
Statistical Encoding of Data
Total frequency for NP = 3
Total frequency for VP = 3
Total frequency for S = 3
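The data-encoding idea above can be sketched directly (illustrative, not the model’s code): each rule use costs -log2 of its frequency relative to all rules sharing the same left-hand side.

```python
import math

rules = {  # rule number -> (left-hand side, frequency)
    1: ("S", 3), 2: ("NP", 2), 3: ("NP", 1), 4: ("VP", 2), 5: ("VP", 1),
}
lhs_total = {}
for lhs, f in rules.values():
    lhs_total[lhs] = lhs_total.get(lhs, 0) + f   # S: 3, NP: 3, VP: 3

def rule_bits(r):
    """Cost of one rule use: -log2(rule freq / total freq of its LHS)."""
    lhs, f = rules[r]
    return -math.log2(f / lhs_total[lhs])

# John screamed / John died / Mary screamed
derivation = [1, 2, 4, 1, 2, 5, 1, 3, 4]
data_bits = sum(rule_bits(r) for r in derivation)
print(f"Data code length = {data_bits:.2f} bits")  # rule 1 costs 0 bits (P = 3/3)
```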
Encoding in My Model
1010100111010100101101010001100111100011010110
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Rule frequencies: Rule 1: 3, Rule 2: 2, Rule 3: 1, Rule 4: 2, Rule 5: 1

The decoder turns the bit string, using the symbol and rule frequencies, into:

Grammar:
1 S NP VP
2 NP john
3 NP mary
4 VP screamed
5 VP died

Data:
John screamed
John died
Mary screamed

Number of bits decoded = evaluation
Creating Candidate Grammars
• Start with simple grammar that allows all sentences
• Make simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at data coding length
• Second stage: look at overall evaluation
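The loop above can be sketched as a greedy search on a toy problem (the annealing schedule and the two-stage evaluation are omitted; the toy state and `evaluate` stand in for the real grammar and the MDL metric):

```python
import random

# Greedy version of the search: propose a small random change and keep
# it only if the evaluation (description length) improves.

def search(state, evaluate, propose, iterations=10_000, seed=0):
    rng = random.Random(seed)
    best, best_score = state, evaluate(state)
    for _ in range(iterations):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score < best_score:          # fewer bits = better description
            best, best_score = candidate, score
    return best, best_score

# Toy demo: the "grammar" is a list of integers, the "evaluation" is the
# sum of absolute values, and a change nudges one element by +/-1.
def perturb(g, rng):
    i = rng.randrange(len(g))
    return [x + rng.choice([-1, 1]) if j == i else x for j, x in enumerate(g)]

best, score = search([5, -3, 7], lambda g: sum(abs(x) for x in g), perturb)
print(best, score)
```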
Example: English

Training data:
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary

Learned Grammar:
S NP VP
VP ran
VP screamed
VP Vt NP
VP Vs S
Vt hit
Vt kicked
Vs thinks
Vs hopes
NP John
NP Ethel
NP Mary
NP Noam
Evaluations

[Figure: bar chart of evaluation in bits (0 to 450) for the initial grammar and the learned grammar, broken down into grammar, data, and overall evaluation.]
Real Language Data
Can the MDL metric also learn grammars from corpora of unrestricted natural language?
If it could, we’d largely have finished syntax
But search space is way too big
Need to simplify the task in some way
→ Only learn verb subcategorization classes
Switchboard Corpus

( (S
    (CC and)
    (PRN
      (, ,)
      (S
        (NP-SBJ (PRP you) )
        (VP (VBP know) ))
      (, ,) )
    (NP-SBJ-1 (PRP she) )
    (VP (VBD spent)
      (NP
        (NP (CD nine) (NNS months) )
        (PP (IN out)
          (PP (IN of)
            (NP (DT the) (NN year) ))))
      (S-ADV
        (NP-SBJ (-NONE- *-1) )
        (ADVP (RB just) )
        (VP (VBG visiting)
          (NP (PRP$ her) (NNS children) ))))
    (. .) (-DFL- E_S) ))
Extracted Information:
Verb: spent
Subcategorization frame: * NP S
Extracted Data
Only verbs tagged as VBD (past tense) extracted
Modifiers to basic labels ignored
21,759 training instances
704 different verbs
706 distinct subcategorization frames
25 different types of constituent appeared alongside the verbs (e.g. S, SBAR, NP, ADVP)
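The extraction step can be sketched roughly as follows (my reconstruction, not the original code; all helper names are made up, function tags such as -SBJ are stripped, and traces like the * in the frame above are not handled):

```python
import re

def parse(text):
    """Turn '(VP (VBD spent) ...)' into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    def walk(i):
        node, i = [tokens[i + 1]], i + 2   # skip '(' and take the label
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                node.append(child)
            else:
                node.append(tokens[i])
                i += 1
        return node, i + 1                 # skip ')'
    return walk(0)[0]

def strip_label(label):
    """NP-SBJ-1 -> NP (drop function tags and coindexation)."""
    return label.split("-")[0]

def subcat_frames(node, frames):
    """Collect (verb, frame) pairs for each VBD and its sister labels."""
    if isinstance(node, list):
        kids = node[1:]
        for i, kid in enumerate(kids):
            if isinstance(kid, list) and kid[0] == "VBD":
                frame = [strip_label(s[0]) for s in kids[i + 1:]
                         if isinstance(s, list) and s[0] not in (",", ".")]
                frames.append((kid[1], frame))
            subcat_frames(kid, frames)
    return frames

tree = parse("(VP (VBD spent) (NP (CD nine) (NNS months))"
             " (S-ADV (VP (VBG visiting))))")
print(subcat_frames(tree, []))  # [('spent', ['NP', 'S'])]
```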
Verb Class Grammars

S Class1 Subcat1
S Class1 Subcat2
S Class2 Subcat2
Class1 grew
Class1 ended
Class2 do

grew and ended can appear with subcats 1 and 2
do can appear only with subcat 2
Grouping together verbs with similar subcategorizations should improve the evaluation
A New Search Mechanism
We need a search mechanism that will only produce candidate grammars of the right form
• Start with all verbs in one class
• Move a randomly chosen verb to a new class (P=0.5) or a different class (P=0.5)
• Empty verb classes are deleted
• Redundant rules are removed
A New Search Mechanism (2)
Annealing search:
• After no changes are accepted for 2,000 iterations, switch to merging phase
• Merge two randomly selected classes
• After no changes accepted for 2,000 iterations, switch back to moving phase
• Stop after no changes accepted for 20,000 iterations
• Multiple runs were conducted and the grammar with the overall lowest evaluation selected
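The moving phase can be sketched as a runnable toy (assumed details throughout: the MDL metric is replaced by a stand-in cost that penalises classes mixing verbs with different subcat sets, and the merge phase, the 2,000/20,000-iteration schedule, and multiple runs are all omitted):

```python
import random

def evaluate(classes, subcats):
    # Stand-in cost: 1 per class (big grammars are expensive), plus the
    # squared number of distinct subcat sets inside each class (mixed
    # classes describe the data badly).
    return len(classes) + sum(
        len({frozenset(subcats[v]) for v in cls}) ** 2 for cls in classes)

def class_search(subcats, moves=2000, seed=0):
    rng = random.Random(seed)
    classes = [set(subcats)]                      # start: one class
    best = evaluate(classes, subcats)
    for _ in range(moves):
        cand = [set(c) for c in classes]
        cls = rng.choice(cand)
        verb = rng.choice(sorted(cls))
        cls.remove(verb)
        if rng.random() < 0.5:                    # new class ...
            cand.append({verb})
        else:                                     # ... or an existing one
            rng.choice(cand).add(verb)
        cand = [c for c in cand if c]             # empty classes deleted
        score = evaluate(cand, subcats)
        if score < best:
            classes, best = cand, score
    return classes

subcats = {"grew": {"PRT"}, "ended": {"PRT"}, "did": {"NP"}, "said": {"S"}}
print(class_search(subcats))
```

On this toy input the search groups grew and ended together, since they share a subcat set, and leaves did and said in classes of their own.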
Grammar Evaluations
                      Best learned   Each verb in a    One verb
                      grammar        separate class    class
Grammar                 37,885.5       111,036.5        29,915.1
Data                   207,312.4       187,026.7       220,520.4
Overall Evaluation     245,198.0       298,063.3       250,435.5
Learned Classes

Class 1: thought, vowed, prayed, decided, adjusted, wondered, wished, allowed, knew, suggested, claimed, believed, remarked, resented, detailed, misunderstood, assumed, competed, snowballed, smoked, said, struggled, determined, noted, understood, foresaw, expected, discovered, realized, negotiated, suspected, indicated
Usually take an S or SBAR complement (SBAR usually contains that or who etc. followed by an S)

Class 2: enjoyed, canceled, liked, had, finished, traded, sold, ruined, needed, watched, loved, included, received, converted, rented, bred, deterred, increased, encouraged, made, swapped, shot, offered, spent, impressed, discussed, missed, carried, injured, presented, surprised…
Usually take an NP argument (often in conjunction with other arguments)

Class 3: did
did only

Class 4: all other verbs
Miscellaneous

Class 5: used, named, tried, considered, tended, refused, wanted, managed, let, forced, began, appeared
Typically take an S argument (but never just an SBAR)

Class 6: wound, grew, ended, closed, backed
Usually take a particle
Did MDL make appropriate generalizations?
The learned verb classes are clearly linguistically coherent
But they don’t account for exactly which verbs can appear with which subcats
Linguists have proposed far more fine-grained classes
Data available for learning was limited (subcats had no internal structure, Penn Treebank labels may not be sufficient)
But linguists can’t explain which verbs appear with which subcats either
Is there a correct learning mechanism?
The learned grammar will only reflect the I-language I3 if the learning mechanism makes the same kind of generalizations from the E-language E2 as do people
I1 → produce → E1 → learn → I2 → produce → E2 → learn → I3

Computer model: E2 → learn → grammar
Why do we think MDL is likely to be a good learning mechanism?
MDL is a general purpose learning principle
But language is the product of the learning mechanisms of other people
So is there any reason to suppose that a general purpose learning mechanism would be good at learning language?
Should a general purpose learning principle like MDL be able to learn language?

When language first arose, did people recruit a pre-existing learning mechanism to the task of learning language?
If so, the LAD should learn in a fairly generic and sensible way
But have we evolved mechanisms specific to language learning?
If a mechanism is specific to language learning, it should work just as well if the generalizations are idiosyncratic as if they are sensible
Conclusions
• MDL (and only MDL) can determine when to make linguistic generalizations and when not to
• Lack of negative evidence is not a particular problem when using MDL
• The same MDL metric can be used both on small sets of example sentences and on unrestricted corpora