A Preliminary Study of Syntactic Generalization Using MDL
Mike Dowman
2007-06-07
What should Syntactic Theory Explain?
• Which sentences are grammatical and which are not
or
• How to transform observed sentences into a grammar
E → learning → I

Children transform observed sentences (E)
into psychological knowledge of language (I)
Poverty of the Stimulus
• Evidence available to children is utterances produced by other speakers
• No direct cues to sentence structure
• Or to word categories
So children need prior knowledge of possible structures: Universal Grammar (UG)
How should we study syntax?
Linguists’ Approach:
• Choose some sentences
• Decide on grammaticality of each one
• Make a grammar that accounts for which of these sentences are grammatical and which are not

Informant → sentences → Linguist → grammar
Computational Linguists’ Approach (Unsupervised Learning)
• Take a corpus
• Extract as much information from the corpus as accurately as possible
or
• Learn a grammar that describes the corpus as accurately as possible

corpus → grammar, lexical items, language model, etc.
Which approach gives more insight into language?
Linguists tend to aim for high precision
• But only produce very limited and arbitrary coverage
Computational linguists tend to obtain much better coverage
• But don’t account for any body of data completely correctly
• And tend to learn only simpler kinds of structure
Approaches seem to be largely complementary
Which approach gives more insight into the human mind?
The huge size and complexity of languages is one of their key distinctive properties
The linguists’ approach doesn’t account for this
So should we apply our algorithms to large corpora of naturally occurring data?
This won’t directly address the kind of issue that linguists focus on
Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear can we just assume they’re not grammatical?
No – we only see a tiny proportion of possible, grammatical sentences
People generalize from examples they have seen to form new utterances
‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)
Minimum Description Length (MDL)
MDL may be able to solve the poverty of the stimulus problem
Prefers the grammar that results in the simplest overall description of data
• So prefers simple grammars
• And grammars that result in simple descriptions of the data

“Simplest” means specifiable using the least amount of information
[Figure: a sequence of diagrams building up the same picture. The observed sentences form a small region inside the space of possible sentences. Three candidate grammars are drawn around them: a simple but non-constraining grammar covering most of the space, a complex but constraining grammar hugging the observed sentences, and a grammar that is a good fit to the data, lying between the two.]
MDL and Bayes’ Rule
P(h|d) ∝ P(h) P(d|h)
• h is a hypothesis (= grammar)
• d is some data (= sentences)
The probability of a grammar given some data is proportional to its a priori probability times how likely the observed sentences would be if that grammar were correct
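Taking negative logs turns this product into a sum of code lengths, which is exactly the MDL connection. A minimal numeric sketch (the probabilities are toy values, not from the talk):

```python
import math

# -log2 P(h|d) = -log2 P(h) - log2 P(d|h) + const, so the most probable
# grammar is the one with the shortest overall description.

def description_length(p_h, p_d_given_h):
    """Grammar bits plus data bits for one hypothesis."""
    return -math.log2(p_h) - math.log2(p_d_given_h)

# Simple grammar: cheap to state, but fits the data loosely.
simple = description_length(p_h=2**-10, p_d_given_h=2**-50)    # 60.0 bits
# Complex grammar: expensive to state, but fits the data tightly.
complex_ = description_length(p_h=2**-40, p_d_given_h=2**-15)  # 55.0 bits

print(simple, complex_)  # MDL (and Bayes) prefer the 55-bit hypothesis
```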
MDL and Prior (Innate?) Bias
• MDL solves the difficult problem of deciding prior probability for each grammar
• But MDL is still subjective – the prior bias is just hidden in the formalism chosen to represent grammars, and in the encoding scheme
Why it has to be MDL
Many machine learning techniques have been applied in computational linguistics
MDL is very rarely used
Only modest success at learning grammatical structure from corpora
So why MDL?
Maximum Likelihood
Maximum likelihood can be seen as a special case of MDL in which the a priori probability of all hypotheses P(h) is equal
But the hypothesis that only the observed sentences are grammatical will result in the maximum likelihood
So ML can only be applied if there are restrictions on how well the estimated parameters can fit the data
The degree of generality of the grammars is set externally, not determined by the Maximum Likelihood principle
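A toy calculation makes this concrete (assumed numbers; the generalizing grammar is penalised purely for spreading probability over an unseen sentence):

```python
import math

# Tiny numeric illustration: maximum likelihood alone favours memorising
# the observed sentences over a grammar that generalizes.

observed = ["john screamed", "john died", "mary screamed"]

# "Memorising" hypothesis: uniform over exactly the observed sentences.
memorise = {s: 1 / 3 for s in observed}
# Generalising grammar: 2 NPs x 2 VPs = 4 equally likely sentences.
generalise = {s: 1 / 4 for s in observed + ["mary died"]}

def log_likelihood(model):
    return sum(math.log2(model[s]) for s in observed)

print(log_likelihood(memorise))    # ≈ -4.75: memorising wins on likelihood
print(log_likelihood(generalise))  # -6.0: ML alone rejects the generalization
```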
Maximum Entropy
Make the grammar as unrestrictive as possible
But constraints must be used to prevent a grammar just allowing any combination of words to be a grammatical sentence
Again the degree of generality of grammars is determined externally
Neither Maximum Likelihood nor Maximum Entropy provide a principle that can decide when to make generalizations
Learning Phrase Structure Grammars
• Binary or non-branching rules:
  S B C
  B E
  C tomato
• All derivations start from special symbol S
Encoding Grammars
Grammars can be coded as lists of rules, each a triple of symbols
• A null symbol in the 3rd position indicates a non-branching rule
• The first symbol is the rule’s left-hand side, the second and third its right-hand side
S, B, C, B, E, null, C, tomato, null
Statistical Encoding of Grammars
• First we encode the frequency of each symbol
• Then encode each symbol using the frequency information
S, B, C, B, E, null, C, tomato, null
I(S) = -log(1/9)
I(null) = -log(2/9)
Uncommon symbols have a higher coding length than common ones
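This coding scheme can be sketched in a few lines (illustrative code, not the model’s implementation), applied to the nine-symbol grammar string above:

```python
import math
from collections import Counter

# Frequency-based coding: each symbol's code length is -log2 of its
# relative frequency, so common symbols get shorter codes.

symbols = ["S", "B", "C", "B", "E", "null", "C", "tomato", "null"]
freq = Counter(symbols)          # S:1, B:2, C:2, E:1, null:2, tomato:1
total = sum(freq.values())       # 9

def info(sym):
    """Code length in bits: -log2 of the symbol's relative frequency."""
    return -math.log2(freq[sym] / total)

print(f"I(S) = {info('S'):.2f} bits")        # -log2(1/9) ≈ 3.17
print(f"I(null) = {info('null'):.2f} bits")  # -log2(2/9) ≈ 2.17
print(f"Total grammar code length = {sum(info(s) for s in symbols):.1f} bits")
```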
1 S NP VP
2 NP john
3 NP mary
4 VP screamed
5 VP died
Data encoding: 1, 2, 4, 1, 2, 5, 1, 3, 4
There is a restricted range of choices at each stage of the derivation
Fewer choices = higher probability
Encoding Data
Data:
John screamed
John died
Mary screamed

If we record the frequency of each rule, this information can help us make a more efficient encoding
• 1 S NP VP (3)
• 2 NP john (2)
• 3 NP mary (1)
• 4 VP screamed (2)
• 5 VP died (1)

Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3 …
Statistical Encoding of Data
Total frequency for NP = 3
Total frequency for VP = 3
Total frequency for S = 3
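The data-encoding idea above can be sketched directly (illustrative, not the model’s code): each rule use costs -log2 of its frequency relative to all rules sharing the same left-hand side.

```python
import math

rules = {  # rule number -> (left-hand side, frequency)
    1: ("S", 3), 2: ("NP", 2), 3: ("NP", 1), 4: ("VP", 2), 5: ("VP", 1),
}
lhs_total = {}
for lhs, f in rules.values():
    lhs_total[lhs] = lhs_total.get(lhs, 0) + f   # S: 3, NP: 3, VP: 3

def rule_bits(r):
    """Cost of one rule use: -log2(rule freq / total freq of its LHS)."""
    lhs, f = rules[r]
    return -math.log2(f / lhs_total[lhs])

# John screamed / John died / Mary screamed
derivation = [1, 2, 4, 1, 2, 5, 1, 3, 4]
data_bits = sum(rule_bits(r) for r in derivation)
print(f"Data code length = {data_bits:.2f} bits")  # rule 1 costs 0 bits (P = 3/3)
```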
Encoding in My Model
1010100111010100101101010001100111100011010110
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Rule frequencies: Rule 1: 3, Rule 2: 2, Rule 3: 1, Rule 4: 2, Rule 5: 1

The decoder turns the bit string, using the symbol and rule frequencies, into:

Grammar:
1 S NP VP
2 NP john
3 NP mary
4 VP screamed
5 VP died

Data:
John screamed
John died
Mary screamed

Number of bits decoded = evaluation
Creating Candidate Grammars
• Start with simple grammar that allows all sentences
• Make simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at data coding length
• Second stage: look at overall evaluation
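The loop above can be sketched as a greedy search on a toy problem (the annealing schedule and the two-stage evaluation are omitted; the toy state and `evaluate` stand in for the real grammar and the MDL metric):

```python
import random

# Greedy version of the search: propose a small random change and keep
# it only if the evaluation (description length) improves.

def search(state, evaluate, propose, iterations=10_000, seed=0):
    rng = random.Random(seed)
    best, best_score = state, evaluate(state)
    for _ in range(iterations):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score < best_score:          # fewer bits = better description
            best, best_score = candidate, score
    return best, best_score

# Toy demo: the "grammar" is a list of integers, the "evaluation" is the
# sum of absolute values, and a change nudges one element by +/-1.
def perturb(g, rng):
    i = rng.randrange(len(g))
    return [x + rng.choice([-1, 1]) if j == i else x for j, x in enumerate(g)]

best, score = search([5, -3, 7], lambda g: sum(abs(x) for x in g), perturb)
print(best, score)
```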
Example: English

Training data:
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary

Learned Grammar:
S NP VP
VP ran
VP screamed
VP Vt NP
VP Vs S
Vt hit
Vt kicked
Vs thinks
Vs hopes
NP John
NP Ethel
NP Mary
NP Noam
Evaluations

[Figure: bar chart of evaluation in bits (0 to 450) for the initial grammar and the learned grammar, broken down into grammar, data, and overall evaluation.]
Real Language Data
Can the MDL metric also learn grammars from corpora of unrestricted natural language?
If it could, we’d largely have finished syntax
But search space is way too big
Need to simplify the task in some way
→ Only learn verb subcategorization classes
Switchboard Corpus

( (S
    (CC and)
    (PRN
      (, ,)
      (S
        (NP-SBJ (PRP you) )
        (VP (VBP know) ))
      (, ,) )
    (NP-SBJ-1 (PRP she) )
    (VP (VBD spent)
      (NP
        (NP (CD nine) (NNS months) )
        (PP (IN out)
          (PP (IN of)
            (NP (DT the) (NN year) ))))
      (S-ADV
        (NP-SBJ (-NONE- *-1) )
        (ADVP (RB just) )
        (VP (VBG visiting)
          (NP (PRP$ her) (NNS children) ))))
    (. .) (-DFL- E_S) ))
Extracted Information:
Verb: spent
Subcategorization frame: * NP S
Extracted Data
Only verbs tagged as VBD (past tense) extracted
Modifiers to basic labels ignored
21,759 training instances
704 different verbs
706 distinct subcategorization frames
25 different types of constituent appeared alongside the verbs (e.g. S, SBAR, NP, ADVP)
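The extraction step can be sketched roughly as follows (my reconstruction, not the original code; all helper names are made up, function tags such as -SBJ are stripped, and traces like the * in the frame above are not handled):

```python
import re

def parse(text):
    """Turn '(VP (VBD spent) ...)' into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    def walk(i):
        node, i = [tokens[i + 1]], i + 2   # skip '(' and take the label
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                node.append(child)
            else:
                node.append(tokens[i])
                i += 1
        return node, i + 1                 # skip ')'
    return walk(0)[0]

def strip_label(label):
    """NP-SBJ-1 -> NP (drop function tags and coindexation)."""
    return label.split("-")[0]

def subcat_frames(node, frames):
    """Collect (verb, frame) pairs for each VBD and its sister labels."""
    if isinstance(node, list):
        kids = node[1:]
        for i, kid in enumerate(kids):
            if isinstance(kid, list) and kid[0] == "VBD":
                frame = [strip_label(s[0]) for s in kids[i + 1:]
                         if isinstance(s, list) and s[0] not in (",", ".")]
                frames.append((kid[1], frame))
            subcat_frames(kid, frames)
    return frames

tree = parse("(VP (VBD spent) (NP (CD nine) (NNS months))"
             " (S-ADV (VP (VBG visiting))))")
print(subcat_frames(tree, []))  # [('spent', ['NP', 'S'])]
```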
Verb Class Grammars

S Class1 Subcat1
S Class1 Subcat2
S Class2 Subcat2
Class1 grew
Class1 ended
Class2 do

grew and ended can appear with subcats 1 and 2
do can appear only with subcat 2
Grouping together verbs with similar subcategorizations should improve the evaluation
A New Search Mechanism
We need a search mechanism that will only produce candidate grammars of the right form
• Start with all verbs in one class
• Move a randomly chosen verb to a new class (P=0.5) or a different class (P=0.5)
• Empty verb classes are deleted
• Redundant rules are removed
A New Search Mechanism (2)
Annealing search:
• After no changes are accepted for 2,000 iterations, switch to merging phase
• Merge two randomly selected classes
• After no changes accepted for 2,000 iterations, switch back to moving phase
• Stop after no changes accepted for 20,000 iterations
• Multiple runs were conducted and the grammar with the overall lowest evaluation selected
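The moving phase can be sketched as a runnable toy (assumed details throughout: the MDL metric is replaced by a stand-in cost that penalises classes mixing verbs with different subcat sets, and the merge phase, the 2,000/20,000-iteration schedule, and multiple runs are all omitted):

```python
import random

def evaluate(classes, subcats):
    # Stand-in cost: 1 per class (big grammars are expensive), plus the
    # squared number of distinct subcat sets inside each class (mixed
    # classes describe the data badly).
    return len(classes) + sum(
        len({frozenset(subcats[v]) for v in cls}) ** 2 for cls in classes)

def class_search(subcats, moves=2000, seed=0):
    rng = random.Random(seed)
    classes = [set(subcats)]                      # start: one class
    best = evaluate(classes, subcats)
    for _ in range(moves):
        cand = [set(c) for c in classes]
        cls = rng.choice(cand)
        verb = rng.choice(sorted(cls))
        cls.remove(verb)
        if rng.random() < 0.5:                    # new class ...
            cand.append({verb})
        else:                                     # ... or an existing one
            rng.choice(cand).add(verb)
        cand = [c for c in cand if c]             # empty classes deleted
        score = evaluate(cand, subcats)
        if score < best:
            classes, best = cand, score
    return classes

subcats = {"grew": {"PRT"}, "ended": {"PRT"}, "did": {"NP"}, "said": {"S"}}
print(class_search(subcats))
```

On this toy input the search groups grew and ended together, since they share a subcat set, and leaves did and said in classes of their own.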
Grammar Evaluations
                      Best learned   Each verb in a    One verb
                      grammar        separate class    class
Grammar                 37,885.5       111,036.5        29,915.1
Data                   207,312.4       187,026.7       220,520.4
Overall Evaluation     245,198.0       298,063.3       250,435.5
Learned Classes

Class 1: thought, vowed, prayed, decided, adjusted, wondered, wished, allowed, knew, suggested, claimed, believed, remarked, resented, detailed, misunderstood, assumed, competed, snowballed, smoked, said, struggled, determined, noted, understood, foresaw, expected, discovered, realized, negotiated, suspected, indicated
Usually take an S or SBAR complement (SBAR usually contains that or who etc. followed by an S)

Class 2: enjoyed, canceled, liked, had, finished, traded, sold, ruined, needed, watched, loved, included, received, converted, rented, bred, deterred, increased, encouraged, made, swapped, shot, offered, spent, impressed, discussed, missed, carried, injured, presented, surprised…
Usually take an NP argument (often in conjunction with other arguments)

Class 3: did
did only

Class 4: all other verbs
Miscellaneous

Class 5: used, named, tried, considered, tended, refused, wanted, managed, let, forced, began, appeared
Typically take an S argument (but never just an SBAR)

Class 6: wound, grew, ended, closed, backed
Usually take a particle
Did MDL make appropriate generalizations?
The learned verb classes are clearly linguistically coherent
But they don’t account for exactly which verbs can appear with which subcats
Linguists have proposed far more fine-grained classes
Data available for learning was limited (subcats had no internal structure, Penn Treebank labels may not be sufficient)
But linguists can’t explain which verbs appear with which subcats either
Is there a correct learning mechanism?
The learned grammar will only reflect the I-language I3 if the learning mechanism makes the same kind of generalizations from the E-language E2 as do people
I1 → produce → E1 → learn → I2 → produce → E2 → learn → I3

Computer model: E2 → learn → grammar
Why do we think MDL is likely to be a good learning mechanism?
MDL is a general purpose learning principle
But language is the product of the learning mechanisms of other people
So is there any reason to suppose that a general purpose learning mechanism would be good at learning language?
Should a general purpose learning principle like MDL be able to learn language?

When language first arose, did people recruit a pre-existing learning mechanism to the task of learning language?
If so, the LAD should learn in a fairly generic and sensible way
But have we evolved mechanisms specific to language learning?
If a mechanism is specific to language learning, it should work just as well if the generalizations are idiosyncratic as if they are sensible
Conclusions
• MDL (and only MDL) can determine when to make linguistic generalizations and when not to
• Lack of negative evidence is not a particular problem when using MDL
• The same MDL metric can be used both on small sets of example sentences and on unrestricted corpora