Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
Martin Körner
Oberseminar
25.07.2013
Content
Introduction
Language Models
Generalized Language Models
Smoothing
Progress
Summary
Introduction: Motivation
Next word prediction: what is the next word a user will type?
Use cases for next word prediction:
• Augmentative and Alternative Communication (AAC)
• Small keyboards (smartphones)
Introduction to next word prediction
How do we predict words?
1. Rationalist approach
• Manually encoding information about language
• Suitable for “toy” problems only
2. Empiricist approach
• Statistical, pattern recognition, and machine learning methods applied to corpora
• Result: language models
Language models in general
Language model: how likely is a sentence $s$?
It defines a probability distribution $P(s)$.
Calculate $P(s)$ by multiplying conditional probabilities.
Example:
$P(\text{If you're going to San Francisco , be sure} \dots) = P(\text{you're} \mid \text{If}) \cdot P(\text{going} \mid \text{If you're}) \cdot P(\text{to} \mid \text{If you're going}) \cdot P(\text{San} \mid \text{If you're going to}) \cdot P(\text{Francisco} \mid \text{If you're going to San}) \cdot \dots$
Estimating these long conditional probabilities empirically would fail: most long word sequences never appear in a corpus.
Conditional probabilities simplified
Markov assumption [JM80]:
Only the last $n-1$ words are relevant for a prediction.
Example with $n=5$:
$P(\text{sure} \mid \text{If you're going to San Francisco , be}) \approx P(\text{sure} \mid \text{San Francisco , be})$
Note that the comma counts as a word.
Definitions and Markov assumption
n-gram: sequence of length $n$ together with a count
E.g., the 5-gram “If you're going to San” with count 4
Sequence notation:
$w_1^{i-1} := w_1 w_2 \dots w_{i-1}$
Markov assumption formalized:
$P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$
where the context $w_{i-n+1}^{i-1}$ consists of the last $n-1$ words.
Formalizing next word prediction
Instead of $P(s)$, only one conditional probability $P(w_i \mid w_{i-n+1}^{i-1})$ is needed.
• Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$ (in both cases the context consists of $n-1$ words)
$\mathrm{NWP}(w_1^{n-1}) = \operatorname{argmax}_{w_n \in W} P(w_n \mid w_1^{n-1})$
where $W$ is the set of all words in the corpus and $P(w_n \mid w_1^{n-1})$ is the conditional probability under the Markov assumption.
How do we calculate the probability $P(w_n \mid w_1^{n-1})$?
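As a sketch, the prediction itself is just an argmax over the vocabulary; the probability model `prob` is a placeholder for any estimate of $P(w_n \mid w_1^{n-1})$, such as the smoothed models introduced below:

```python
# A minimal sketch of next word prediction as an argmax over the
# vocabulary W. `prob` is a placeholder for any model of
# P(w_n | w_1^{n-1}), e.g. one of the smoothed models below.

def next_word_prediction(context, vocabulary, prob):
    """Return the word w in `vocabulary` that maximizes prob(w, context)."""
    return max(vocabulary, key=lambda w: prob(w, context))
```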
How to calculate $P(w_n \mid w_1^{n-1})$
The easiest way: maximum likelihood estimation:
$P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \dfrac{c(w_1^n)}{c(w_1^{n-1})}$
Example:
$P(\text{San} \mid \text{If you're going to}) = \dfrac{c(\text{If you're going to San})}{c(\text{If you're going to})}$
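A minimal sketch of this estimate, assuming the n-gram counts live in a plain dictionary keyed by word tuples (corpus reading and counting are not shown):

```python
# A minimal sketch of the maximum likelihood estimate
# P_ML(w_n | w_1^{n-1}) = c(w_1^n) / c(w_1^{n-1}),
# assuming counts are stored in a dictionary keyed by word tuples.

def p_ml(word, context, counts):
    """Maximum likelihood probability of `word` given `context`."""
    context = tuple(context)
    numerator = counts.get(context + (word,), 0)
    denominator = counts.get(context, 0)
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical counts for the example above:
counts = {("If", "you're", "going", "to"): 12,
          ("If", "you're", "going", "to", "San"): 4}
print(p_ml("San", ("If", "you're", "going", "to"), counts))  # 4/12
```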
Introduction: Generalized Language Models (GLMs)
Main idea:
Insert wildcard words ($*$) into sequences.
Example: instead of $P(\text{San} \mid \text{If you're going to})$:
• $P(\text{San} \mid \text{If} * * *)$
• $P(\text{San} \mid \text{If} * * \text{to})$
• $P(\text{San} \mid \text{If} * \text{going} *)$
• $P(\text{San} \mid \text{If} * \text{going to})$
• $P(\text{San} \mid \text{If you're} * *)$
• …
Separate the different types of GLMs based on:
1. Sequence length
2. Number of wildcard words
E.g., $P(\text{San} \mid \text{If} * \text{going} *)$ has sequence length 5 and 2 wildcard words.
Aggregate the results.
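To illustrate the enumeration, here is a small sketch that generates the wildcard variants of a context; keeping the first context word fixed (as in the examples above) is an assumption of this sketch, not a fixed rule:

```python
from itertools import combinations

# A sketch that enumerates generalized contexts by replacing subsets
# of the context words with the wildcard "*". Keeping the first word
# fixed matches the examples above (an assumption of this sketch).

def wildcard_patterns(context):
    """Yield all generalizations of `context` with at least one wildcard."""
    positions = range(1, len(context))  # first word stays fixed
    for k in range(1, len(context)):
        for subset in combinations(positions, k):
            yield tuple("*" if i in subset else w
                        for i, w in enumerate(context))

for pattern in wildcard_patterns(("If", "you're", "going", "to")):
    print(pattern)  # ('If', '*', 'going', 'to'), ..., ('If', '*', '*', '*')
```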
Why Generalized Language Models?
Data sparsity of n-grams:
“If you're going to San” is seen less often than, for example, “If ∗ ∗ to San”.
Question: does that really improve the prediction?
Result of the evaluation: yes …
… but we should also use smoothing, as for regular language models.
Smoothing
Problem: unseen sequences
We try to estimate probabilities for unseen sequences.
In return, probabilities of seen sequences need to be reduced (the probability mass must still sum to one).
Two approaches:
1. Backoff smoothing
2. Interpolation smoothing
Backoff smoothing
If a sequence is unseen, use a shorter sequence.
E.g., if $P(\text{San} \mid \text{going to}) = 0$, use $P(\text{San} \mid \text{to})$.
$P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0 \end{cases}$
Here $\tau$ is the higher order probability, $\gamma$ a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ the lower order probability (recursive).
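A sketch of the recursion, with `tau` and `gamma` left abstract since their concrete definitions depend on the chosen backoff method (e.g. Katz):

```python
# A sketch of the backoff recursion. `tau` (higher order probability)
# and `gamma` (backoff weight) are abstract callables here; their
# concrete form depends on the method (e.g. Katz).

def p_back(word, context, counts, tau, gamma):
    if not context:                              # base case: unigram
        return tau(word, ())
    if counts.get(tuple(context) + (word,), 0) > 0:
        return tau(word, context)                # sequence was seen
    return gamma(context) * p_back(word, context[1:], counts, tau, gamma)
```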
Interpolated smoothing
Always use the shorter sequence in the calculation:
$P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$
Again, $\tau$ is the higher order probability, $\gamma$ a weight, and $P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$ the lower order probability (recursive).
Interpolation seems to work better than backoff smoothing.
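The recursion, parallel to the backoff sketch above, again with `tau` and `gamma` as abstract placeholders:

```python
# A sketch of the interpolation recursion: unlike backoff, the lower
# order term is always added, not only for unseen sequences.

def p_inter(word, context, tau, gamma):
    if not context:                  # base case: unigram estimate
        return tau(word, ())
    return tau(word, context) + gamma(context) * p_inter(word, context[1:], tau, gamma)
```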
Kneser-Ney smoothing [KN95] intro
An interpolated smoothing method.
Idea: improve the lower order calculation.
Example: the word “visiting” is unseen in the corpus, so
$P(\text{Francisco} \mid \text{visiting}) = 0$; normal interpolation gives $0 + \gamma \cdot P(\text{Francisco})$
$P(\text{San} \mid \text{visiting}) = 0$; normal interpolation gives $0 + \gamma \cdot P(\text{San})$
Result: “Francisco” is as likely as “San” at that position.
Is that correct? What is the difference between “Francisco” and “San”?
Answer: the number of different contexts. “Francisco” almost always follows “San”, while “San” completes many different contexts.
Kneser-Ney smoothing idea
For the lower order calculation, don't use the count $c(w_n)$. Instead, use the number of different bigrams the word completes:
$N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^{n}) > 0 \}|$
Or in general:
$N_{1+}(\bullet\, w_{i+1}^{n}) := |\{ w_i : c(w_i^{n}) > 0 \}|$
In addition:
$N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) := \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
$N_{1+}(w_i^{n-1}\, \bullet) := |\{ w_n : c(w_i^{n}) > 0 \}|$
where $c(\cdot)$ denotes the count.
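The bigram continuation counts $N_{1+}(\bullet\, w)$ can be computed in one pass over the bigram counts; a sketch assuming the counts sit in a dictionary keyed by word pairs:

```python
from collections import defaultdict

# A sketch computing N_{1+}(• w): the number of distinct words that
# precede w in a seen bigram, derived from a dictionary of bigram
# counts keyed by (w1, w2) tuples.

def continuation_counts(bigram_counts):
    predecessors = defaultdict(set)
    for (w1, w2), count in bigram_counts.items():
        if count > 0:
            predecessors[w2].add(w1)
    return {w: len(preds) for w, preds in predecessors.items()}

bigram_counts = {("San", "Francisco"): 50, ("to", "San"): 7, ("in", "San"): 3}
print(continuation_counts(bigram_counts))  # {'Francisco': 1, 'San': 2}
```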
Kneser-Ney smoothing equation (highest order)
Highest order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{c(w_i^n) - D, 0\}}{c(w_i^{n-1})} + \dfrac{D}{c(w_i^{n-1})} \, N_{1+}(w_i^{n-1}\, \bullet) \cdot P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
The first term divides the count, discounted by the discount value $D$ (with $0 \le D \le 1$; the $\max$ assures a positive value), by the total count. The second term is the lower order weight times the lower order probability $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ (recursion).
Kneser-Ney smoothing equation (lower order)
Lower order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^n) - D, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} \, N_{1+}(w_i^{n-1}\, \bullet) \cdot P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
Lowest order calculation:
$P_{\mathrm{KN}}(w_n) = \dfrac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$
Compared to the highest order: the continuation count replaces the count, and the total continuation count replaces the total count; the discount value, the $\max$ (assuring a positive value), the lower order weight, and the recursive lower order probability stay the same.
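Putting the pieces together for the bigram case, a minimal sketch with a fixed discount $D$ and counts in plain dictionaries (vocabulary handling and efficiency are ignored):

```python
# A minimal sketch of interpolated Kneser-Ney for bigrams with a
# fixed discount D. The N_{1+} quantities are derived from the seen
# bigram types; dictionaries stand in for real count tables.

def p_kn_bigram(word, prev, unigram_counts, bigram_counts, D=0.75):
    seen = [pair for pair, c in bigram_counts.items() if c > 0]
    c_context = unigram_counts.get(prev, 0)
    if c_context == 0:                      # unseen context: back off
        return p_kn_unigram(word, seen)
    c_bigram = bigram_counts.get((prev, word), 0)
    n_follow = len({w2 for (w1, w2) in seen if w1 == prev})  # N1+(prev •)
    weight = (D / c_context) * n_follow     # lower order weight
    return max(c_bigram - D, 0) / c_context + weight * p_kn_unigram(word, seen)

def p_kn_unigram(word, seen_bigrams):
    # Lowest order: N1+(• w) / N1+(• •).
    n_cont = len({w1 for (w1, w2) in seen_bigrams if w2 == word})
    return n_cont / len(seen_bigrams)
```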
Modified Kneser-Ney smoothing [CG98]
Uses different discount values for different absolute counts.
Lower order calculation:
$P_{\mathrm{MKN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^n) - D(c(w_i^n)), 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D_1 N_1(w_i^{n-1}\, \bullet) + D_2 N_2(w_i^{n-1}\, \bullet) + D_{3+} N_{3+}(w_i^{n-1}\, \bullet)}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} \cdot P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
It has been the state of the art for 15 years now!
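[CG98] estimate the discounts $D_1$, $D_2$, $D_{3+}$ in closed form from the counts-of-counts $n_1, \dots, n_4$ (how many n-grams occur exactly once, twice, three, or four times); a sketch of those estimates, assuming $n_1, n_2, n_3 > 0$:

```python
# A sketch of the closed-form discount estimates from [CG98],
# computed from the counts-of-counts n1..n4. Assumes n1..n3 > 0
# (otherwise the divisions below are undefined).

def modified_kn_discounts(counts):
    n = [0] * 5
    for c in counts.values():
        if 1 <= c <= 4:
            n[c] += 1
    Y = n[1] / (n[1] + 2 * n[2])
    D1 = 1 - 2 * Y * n[2] / n[1]
    D2 = 2 - 3 * Y * n[3] / n[2]
    D3_plus = 3 - 4 * Y * n[4] / n[3]
    return D1, D2, D3_plus
```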
Smoothing of GLMs
We can use all of these smoothing techniques on GLMs as well!
Only a small modification is needed when shortening a sequence:
E.g., for $P(\text{San} \mid \text{If} * \text{going} *)$, the lower order sequence
– would normally be $P(\text{San} \mid * \text{going} *)$
– instead, use $P(\text{San} \mid \text{going} *)$, i.e., drop leading wildcards (see the sketch below).
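A sketch of that rule; representing patterns as word tuples with "*" entries is an assumption of this sketch:

```python
# A sketch of the lower order context for a GLM pattern: remove the
# first word as usual, then strip wildcards that end up at the front.
# Tuples with "*" entries are this sketch's representation.

def glm_lower_order(context):
    shorter = list(context[1:])
    while shorter and shorter[0] == "*":
        shorter.pop(0)
    return tuple(shorter)

print(glm_lower_order(("If", "*", "going", "*")))  # ('going', '*')
```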
Progress
Done so far:
• Extracting text from the XML files
• Building GLMs
• Kneser-Ney and modified Kneser-Ney smoothing
• Indexing with MySQL
To do:
• Finish the evaluation program
• Run the evaluation
• Analyze the results
Summary
Data Sets:
• More data
• Better data
Language Models:
• n-grams
• Generalized Language Models
Smoothing:
• Katz
• Good-Turing
• Witten-Bell
• Kneser-Ney
• …
Thank you for your attention!
Questions?
Sources
Images:
Wheelchair Joystick (motivation slide): http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
Smartphone Keyboard (motivation slide): https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
References:
[CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
[JM80]: Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
[KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.