Transcript of Edinburgh MT lecture 6: Decoding
Decoding
We want to solve this problem:
$e^* = \arg\max_e p(e|f)$
Q: how many English sentences are there?
北 风 呼啸 。
the strong north wind .
Suppose we have 5 translations per word, and suppose every word has fertility 1. Then we have $5^n \cdot n!$ translations (15,000 for this example).
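A quick sanity check of that count in Python (the example sentence has n = 4 words):

    from math import factorial

    # Number of candidate translations of an n-word source sentence,
    # assuming 5 translations per word and fertility 1 (each source word
    # produces exactly one target word, in any order).
    def num_translations(n, translations_per_word=5):
        return translations_per_word ** n * factorial(n)

    print(num_translations(4))  # 5^4 * 4! = 15000, the figure above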
Given a sentence pair and an alignment, we can easily calculate $p(\text{English}, \text{alignment} \mid \text{Chinese})$.
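A minimal sketch of that calculation, assuming a noisy-channel-style factorization (bigram language model times word-translation probabilities). The tables, the helper name score, and every probability value are invented for illustration:

    # Noisy-channel style score, proportional to p(English, alignment | Chinese)
    # under Bayes' rule. All probability values below are made up.
    p_trans = {  # p(chinese_word | english_word)
        ("北", "north"): 0.7,
        ("风", "wind"): 0.8,
        ("呼啸", "strong"): 0.3,
        ("。", "."): 0.9,
    }
    p_bigram = {  # p(word | previous_word)
        ("<s>", "the"): 0.2,
        ("the", "strong"): 0.1,
        ("strong", "north"): 0.05,
        ("north", "wind"): 0.4,
        ("wind", "."): 0.3,
    }

    def score(english, chinese, alignment):
        # Language model: product of bigram probabilities.
        p = 1.0
        for prev, word in zip(["<s>"] + english, english):
            p *= p_bigram[(prev, word)]
        # Channel model: one translation probability per aligned word;
        # unaligned English words (like "the") contribute only LM mass.
        for e_pos, f_pos in alignment.items():
            p *= p_trans[(chinese[f_pos], english[e_pos])]
        return p

    e = ["the", "strong", "north", "wind", "."]
    f = ["北", "风", "呼啸", "。"]
    a = {1: 2, 2: 0, 3: 1, 4: 3}  # strong↔呼啸, north↔北, wind↔风, .↔。
    print(score(e, f, a))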
Can we decode without enumerating all translations?
Key Idea
There are $5^n \cdot n!$ target sentences, but there are only $O(5n)$ ways to start them. We track which source words have already been translated with a coverage vector.
Starting hypotheses and their scores:
north: p(north | START) · p(北 | north)
northern: p(northern | START) · p(北 | northern)
strong: p(strong | START) · p(呼啸 | strong)
Extending the hypothesis "north":
wind: p(wind | north) · p(风 | wind)
strong: p(strong | north) · p(呼啸 | strong)
Key Idea
Dynamic Programming
Each edge is labelled with a weight and a word (or words), e.g. "north, 0.014". These lattices are weighted finite-state automata.
Amount of work: $O(5n \cdot 2^n)$. Bad, but much better than $5^n \cdot n!$.
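A minimal sketch of that dynamic program, with a (coverage bitmask, last English word) state so a bigram language model can score each extension. The tables and every probability are invented for illustration:

    def viterbi_decode(f_words, options, p_bigram):
        """Exact Viterbi DP over (coverage bitmask, last word) states.

        options:  source word -> list of (english_word, p_trans)
        p_bigram: (prev_word, word) -> probability (missing pairs = 0)

        There are 2^n coverage vectors and each hypothesis extends by
        translating one uncovered source word: the O(5n * 2^n) above.
        """
        n = len(f_words)
        best = {(0, "<s>"): (1.0, None)}  # state -> (prob, backpointer)
        for covered in range(n):
            for (cov, last), (p, _) in list(best.items()):
                if bin(cov).count("1") != covered:
                    continue
                for i in range(n):
                    if cov & (1 << i):
                        continue  # source word i already translated
                    for e_word, p_t in options[f_words[i]]:
                        cand = p * p_bigram.get((last, e_word), 0.0) * p_t
                        state = (cov | (1 << i), e_word)
                        if cand > best.get(state, (0.0, None))[0]:
                            best[state] = (cand, (cov, last))
        full = (1 << n) - 1  # all source words covered
        goal = max((s for s in best if s[0] == full), key=lambda s: best[s][0])
        words, state = [], goal
        while state[1] != "<s>":  # follow backpointers to read the words
            words.append(state[1])
            state = best[state][1]
        return list(reversed(words)), best[goal][0]

    # Toy tables for the running example; every number is made up.
    options = {"北": [("north", 0.7), ("northern", 0.2)],
               "风": [("wind", 0.8)],
               "呼啸": [("strong", 0.3), ("howling", 0.4)],
               "。": [(".", 0.9)]}
    p_bigram = {("<s>", "north"): 0.1, ("<s>", "northern"): 0.05,
                ("north", "wind"): 0.4, ("northern", "wind"): 0.3,
                ("wind", "howling"): 0.1, ("wind", "strong"): 0.05,
                ("howling", "."): 0.3, ("strong", "."): 0.3}
    print(viterbi_decode(["北", "风", "呼啸", "。"], options, p_bigram))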
Weighted languages
•The lattice describing the set of all possible translations is a weighted finite state automaton.
•So is the language model.
•Since regular languages are closed under intersection, we can intersect the two machines and run a shortest-path graph algorithm on the result.
•Taking their intersection is equivalent to computing the probability under Bayes’ rule.
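A minimal sketch of the product construction behind weighted intersection, assuming both machines are epsilon-free and encoded as plain dicts mapping a state to its (word, weight, next_state) arcs. This is an illustration, not the toolkit a real system would use (e.g. OpenFst):

    def intersect(wfsa_a, wfsa_b, start_a, start_b):
        """Product automaton of two epsilon-free WFSAs.

        A product state is a pair (state_a, state_b); an arc exists only
        when both machines can read the same word, and its weight is the
        product of the two arc weights.
        """
        product, agenda, seen = {}, [(start_a, start_b)], set()
        while agenda:
            qa, qb = agenda.pop()
            if (qa, qb) in seen:
                continue
            seen.add((qa, qb))
            arcs = []
            for word_a, w_a, ra in wfsa_a.get(qa, []):
                for word_b, w_b, rb in wfsa_b.get(qb, []):
                    if word_a == word_b:
                        arcs.append((word_a, w_a * w_b, (ra, rb)))
                        agenda.append((ra, rb))
            product[(qa, qb)] = arcs
        return product

    # Tiny example: a two-arc lattice intersected with a bigram LM
    # fragment (all states and weights invented).
    lattice = {0: [("north", 0.7, 1), ("northern", 0.2, 1)],
               1: [("wind", 0.8, 2)]}
    lm = {"<s>": [("north", 0.1, "north"), ("northern", 0.05, "northern")],
          "north": [("wind", 0.4, "wind")],
          "northern": [("wind", 0.3, "wind")]}
    print(intersect(lattice, lm, 0, "<s>"))

Running a Viterbi shortest-path search over the product graph (in the (max, ×) semiring) then reads off the best translation.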
Wait a second!
We want to solve this problem:
$e^* = \arg\max_e p(e|f) = \arg\max_e \sum_a p(e,a|f)$
But now we're solving this problem:
$e^* = \arg\max_e \max_a p(e,a|f)$
This is often called the Viterbi approximation.
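A toy illustration, with invented numbers, of why the two objectives can disagree: a translation whose probability is spread over many alignments can win the true sum but lose the Viterbi max:

    # Each list holds p(e, a | f) for the alignments of one candidate
    # translation e. All numbers are made up.
    alignment_probs = {
        "e1": [0.20, 0.01],        # one strong alignment
        "e2": [0.12, 0.12, 0.12],  # probability spread across alignments
    }
    true_best = max(alignment_probs, key=lambda e: sum(alignment_probs[e]))
    viterbi_best = max(alignment_probs, key=lambda e: max(alignment_probs[e]))
    print(true_best)     # e2: total probability 0.36
    print(viterbi_best)  # e1: single best alignment 0.20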
How expensive is the exact sum over alignments? We can compute it by weighted determinization.
Nondeterministic: $O(5n \cdot 2^n)$
Deterministic: $O(2^{5n \cdot 2^n})$
I made the simplest machine translation model I could think of, and it blew up in my face.
OK, let's stick with the Viterbi approximation. But $O(5n \cdot 2^n)$ is still far too much work.
Can we do better?
北 风 呼啸 。
north wind the strong .
Each arc is weighted by a translation probability + a bigram probability.
Objective: find the shortest path that visits each word once.
Now relabel the words as cities (London, Paris, NY, Tokyo): the objective is exactly the same.
Probably not: this is the traveling salesman problem. Even the Viterbi approximation is too hard to solve!
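A small sketch of that correspondence, with invented distances: turn each inter-city distance into a bigram score exp(-distance), and the highest-probability ordering of the "words" is exactly the shortest tour. Exact search is then a max over all n! orders:

    from itertools import permutations
    from math import exp

    # Invented inter-city distances for the four "words" on the slide.
    dist = {("London", "Paris"): 2, ("Paris", "London"): 2,
            ("London", "NY"): 6, ("NY", "London"): 6,
            ("London", "Tokyo"): 10, ("Tokyo", "London"): 10,
            ("Paris", "NY"): 7, ("NY", "Paris"): 7,
            ("Paris", "Tokyo"): 9, ("Tokyo", "Paris"): 9,
            ("NY", "Tokyo"): 11, ("Tokyo", "NY"): 11}

    def p_bigram(prev, word):
        # Bigram score that decays with distance, so the most probable
        # word order is the shortest tour.
        return exp(-dist[(prev, word)])

    def tour_prob(order):
        p = 1.0
        for prev, word in zip(order, order[1:]):
            p *= p_bigram(prev, word)
        return p

    # Exact decoding = a max over all n! visit orders: a TSP.
    cities = ["London", "Paris", "NY", "Tokyo"]
    print(max(permutations(cities), key=tour_prob))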
Approximation: Distortion Limits
虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。 ("Although the north wind howls, the sky remains very clear.")
Partial hypothesis: the sky
Without a limit, the number of vertices is $O(2^n)$.
Distortion window $d = 4$: outside the window to the left, every word is covered; outside the window to the right, every word is uncovered.
Number of vertices: $O(n \cdot 2^d)$.
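A small sketch of where the $O(n \cdot 2^d)$ figure comes from under this window model (fully covered left of the window, fully uncovered right of it, free inside):

    # Counting coverage vectors allowed by a distortion window: all words
    # left of the window are covered, all words right of it are uncovered,
    # and only the d words inside the window vary freely.
    def windowed_coverages(n, d):
        states = set()
        for start in range(n - d + 1):   # window position
            for bits in range(2 ** d):   # free pattern inside the window
                cov = (1 << start) - 1   # covered prefix left of window
                cov |= bits << start     # window interior
                states.add(cov)
        return states

    n, d = 11, 4  # the 11-word sentence above with a d = 4 window
    print(len(windowed_coverages(n, d)))  # grows like n * 2^d ...
    print(2 ** n)                         # ... versus 2^n = 2048 without the limit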
Summary
•We need every possible trick to make decoding fast.
•Viterbi approximation: from worse to bad.
•Dynamic programming: exact but too slow.
•NP-Completeness means exact solutions are unlikely.
•Heuristic approximations: stack decoding (see the sketch below), distortion limits.
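To close, a minimal stack-decoding sketch in the same toy setup as the Viterbi sketch above (hypothetical inputs, no hypothesis recombination): hypotheses are binned by the number of source words they cover, and each bin is pruned to a beam before being extended:

    # Stack decoding: stacks[k] holds hypotheses covering k source words;
    # each stack is pruned to the best `beam` hypotheses before extension.
    def stack_decode(f_words, options, p_bigram, beam=10):
        n = len(f_words)
        stacks = [[] for _ in range(n + 1)]
        stacks[0] = [(1.0, 0, "<s>", [])]  # (prob, coverage, last, words)
        for covered in range(n):
            stacks[covered].sort(reverse=True)
            for p, cov, last, words in stacks[covered][:beam]:  # prune
                for i in range(n):
                    if cov & (1 << i):
                        continue  # source word i already translated
                    for e_word, p_t in options[f_words[i]]:
                        cand = p * p_bigram.get((last, e_word), 0.0) * p_t
                        if cand > 0.0:
                            stacks[covered + 1].append(
                                (cand, cov | (1 << i), e_word, words + [e_word]))
        if not stacks[n]:
            return None
        p, _, _, words = max(stacks[n])
        return words, p

    # Reusing the toy tables from the Viterbi sketch above:
    # print(stack_decode(["北", "风", "呼啸", "。"], options, p_bigram, beam=2))

Pruning makes the search approximate: a hypothesis that scores badly early is discarded even if it would have led to the best complete translation. Real decoders also recombine hypotheses that share a coverage vector and last word.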