Discriminative Training Approaches for Continuous Speech Recognition
Jen-Wei Kuo, Shih-Hung Liu, Berlin Chen, Hsin-min Wang
Speaker: 郭人瑋
Main reference: D. Povey, “Discriminative Training for Large Vocabulary Speech Recognition,” Ph.D. dissertation, Peterhouse, University of Cambridge, July 2004
Statistical Speech Recognition
• In this presentation, the language model is assumed to be given in advance, while the acoustic model needs to be estimated
• HMMs (hidden Markov models) are widely adopted for acoustic modeling

$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} p(O \mid W)\, P(W)$$

[Block diagram: Speech → Feature Extraction → Linguistic Decoding, combining the Acoustic Match p(O|W) and the Language Model P(W) → Recognized Sentence]
Expected Risk
• Let 𝒲 = {W_1, W_2, W_3, ...} be a finite set of the possible word sequences for a given observation utterance O_r
– Assume that the true word sequence W_r is also in 𝒲
• Let α(O_r → W_j) be the action of classifying a given observation sequence O_r to a word sequence W_j
– Let l(W_j, W_n) be the loss incurred when we take such an action (and the true word sequence is just W_n)
• Therefore, the (expected) risk for a specific action α(O_r → W_j) is

$$R(W_j \mid O_r) = \sum_{W_n \in \mathcal{W}} l(W_j, W_n)\, P(W_n \mid O_r)$$
[Duda et al. 2000]
Decoding: Minimum Expected Risk (1/2)
• In speech recognition, we can take the action with the minimum (expected) risk:

$$\hat{W} = \arg\min_{W_j} R(W_j \mid O_r) = \arg\min_{W_j} \sum_{W_n \in \mathcal{W}} l(W_j, W_n)\, P(W_n \mid O_r)$$

• If the zero-one loss function is adopted (string-level error)

$$l(W_j, W_n) = \begin{cases} 0, & \text{if } W_j = W_n \\ 1, & \text{otherwise} \end{cases}$$

– Then

$$\arg\min_{W_j} R(W_j \mid O_r) = \arg\min_{W_j} \sum_{W_n \neq W_j} P(W_n \mid O_r) = \arg\min_{W_j} \left( 1 - P(W_j \mid O_r) \right)$$
Decoding: Minimum Expected Risk (2/2)
– Thus,

$$\arg\min_{W_j} R(W_j \mid O_r) = \arg\max_{W_j} P(W_j \mid O_r)$$

• Select the word sequence with the maximum posterior probability (MAP decoding); a small numeric sketch follows below
• The string editing or Levenshtein distance can also be adopted as the loss function
– Take individual word errors into consideration
– E.g., Minimum Bayes Risk (MBR) search/decoding [V. Goel et al. 2004], Word Error Minimization [Mangu et al. 2000]
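As a quick illustration (not from the original slides), a minimal Python sketch with made-up posteriors confirms that minimizing the expected risk under the zero-one loss selects the same hypothesis as MAP decoding:

```python
# Hypothetical posteriors P(W_j | O_r) for three candidate word sequences.
posteriors = {"w1": 0.5, "w2": 0.3, "w3": 0.2}

def expected_risk(w_j, posteriors):
    # R(W_j | O_r) = sum_n l(W_j, W_n) P(W_n | O_r) with zero-one loss
    return sum(p for w_n, p in posteriors.items() if w_n != w_j)

min_risk = min(posteriors, key=lambda w: expected_risk(w, posteriors))
map_hyp = max(posteriors, key=posteriors.get)
assert min_risk == map_hyp  # both select "w1"
```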
Training: Minimum Overall Expected Risk (1/2)
• In training, we should minimize the overall (expected) loss of the actions α(O_r → W_r) over the training utterances
– W_r is the true word sequence of O_r
– The integral extends over the whole observation sequence space:

$$R_{overall} = \int R\!\left(\alpha(O_r) \mid O_r\right) p(O_r)\, dO_r$$

• However, when only a limited number of training observation sequences O_r, r = 1, ..., R, are available, the overall risk can be approximated by

$$R_{overall} \approx \sum_{r=1}^{R} R\!\left(\alpha(O_r) \mid O_r\right) p(O_r) = \sum_{r=1}^{R} \sum_{W_n \in \mathcal{W}} l(W_n, W_r)\, P(W_n \mid O_r)\, p(O_r)$$
7
Training: Minimum Overall Expected Risk (2/2)
• Assume to be uniform – The overall risk can be further expressed as
• If zero-one loss function is adopted
– Then
)( rOp
R
r Wrnrnoverall
rn
OWPWWlR1
)|(),(W
rn
rnrn WW
WWWWl
if 1
if 0),(
r rrrWWW
rnOverall OWPOWPRrn
rn
1,W
Training: Minimum Error Rate
• Minimum Error Rate (MER) estimation (λ denotes the acoustic model parameters)
– MER is equivalent to MAP:

$$\min_{\lambda} R_{overall} = \min_{\lambda} \sum_{r=1}^{R} \left( 1 - P_{\lambda}(W_r \mid O_r) \right) \;\Longleftrightarrow\; \max_{\lambda} \sum_{r=1}^{R} P_{\lambda}(W_r \mid O_r)$$
Training: Maximum Likelihood (1/2)
• The objective function of Maximum Likelihood (ML) estimation can be obtained if Jensen's inequality is further applied
– Since log x ≤ x − 1, we have 1 − x ≤ −log x, and therefore

$$R_{overall} = \sum_{r=1}^{R} \left( 1 - P_{\lambda}(W_r \mid O_r) \right) \;\leq\; -\sum_{r=1}^{R} \log P_{\lambda}(W_r \mid O_r)$$

– Minimizing this upper bound amounts to maximizing the overall log-posterior of all training utterances:

$$\arg\max_{\lambda} \sum_{r} \log P_{\lambda}(W_r \mid O_r) = \arg\max_{\lambda} \sum_{r} \log \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}{p(O_r)} = \arg\max_{\lambda} \sum_{r} \log p_{\lambda}(O_r \mid W_r)$$

– The last step holds when P(W_r) is assumed uniform and p(O_r) is treated as independent of λ, giving the ML objective

$$F_{ML}(\lambda) = \sum_{r} \log p_{\lambda}(O_r \mid W_r)$$
[Schlüter 2000]
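As a quick sanity check of the bound used above (a minimal sketch; the probability values are arbitrary):

```python
import math

# 1 - P <= -log P follows from log x <= x - 1; spot-check a few values.
for p in (0.9, 0.5, 0.1):
    assert 1.0 - p <= -math.log(p)
```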
Training: Maximum Likelihood (2/2)
• On the other hand, discriminative training approaches attempt to optimize the correctness of the model set by formulating an objective function that in some way penalizes model parameters that are liable to confuse correct and incorrect answers
• MLE can thus be viewed as a derivation from the overall log-posterior
Training: Maximum Mutual Information (1/3)
• The objective function can be defined as the sum of the pointwise mutual information of all training utterances and their associated true word sequences:

$$F_{MMI}(\lambda) = \sum_{r} \mathrm{MI}_{\lambda}(O_r, W_r) = \sum_{r} \log \frac{p_{\lambda}(O_r, W_r)}{p_{\lambda}(O_r)\, P(W_r)} = \sum_{r} \log \frac{p_{\lambda}(O_r \mid W_r)}{\sum_{W_k \in \mathcal{W}} p_{\lambda}(O_r \mid W_k)\, P(W_k)}$$

• The maximum mutual information (MMI) estimation tries to find a new parameter set λ that maximizes the above objective function
[Bahl et al. 1986]
Training: Maximum Mutual Information (2/3)
• An alternative derivation is based on the overall expected risk criterion
– It is equivalent to the maximization of the overall log-posterior of all training utterances (a numeric sketch follows below):

$$\arg\max_{\lambda} \sum_{r} \log P_{\lambda}(W_r \mid O_r) = \arg\max_{\lambda} \sum_{r} \log \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}{p_{\lambda}(O_r)} = \arg\max_{\lambda} \sum_{r} \log \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}{\sum_{W_k \in \mathcal{W}} p_{\lambda}(O_r \mid W_k)\, P(W_k)}$$

$$F_{MMI}(\lambda) = \sum_{r} \log \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}{\sum_{W_k \in \mathcal{W}} p_{\lambda}(O_r \mid W_k)\, P(W_k)}$$

– The language model P(W) is independent of λ, and the set 𝒲 includes the true word sequence W_r
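A minimal sketch of the per-utterance MMI objective, assuming hypothetical log acoustic and language-model scores for a reference and two competing hypotheses:

```python
import math

def mmi_utterance(log_acoustic, log_lm, ref):
    """log p(O_r|W_r)P(W_r) - log sum_k p(O_r|W_k)P(W_k), in the log domain."""
    joint = {w: log_acoustic[w] + log_lm[w] for w in log_acoustic}
    m = max(joint.values())                                  # log-sum-exp trick
    log_denom = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return joint[ref] - log_denom

log_ac = {"ref": -100.0, "hyp1": -101.0, "hyp2": -103.0}     # hypothetical scores
log_lm = {"ref": -5.0, "hyp1": -4.0, "hyp2": -6.0}
print(mmi_utterance(log_ac, log_lm, "ref"))                  # log-posterior of ref
```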
Training: Maximum Mutual Information (3/3)
• When we maximize the MMIE objective function
– Not only can the probability of the true word sequence (the numerator, like the MLE objective function) be increased, but the probabilities of the other possible word sequences (the denominator) can also be decreased
– Thus, MMIE attempts to make the correct hypothesis more probable, while at the same time making incorrect hypotheses less probable
• MMIE also can be considered as a derivation from the overall log-posterior
Training: Minimum Classification Error (1/2)
• The misclassification measure is defined as

$$d_r(O_r) = -\log p_{\lambda}(O_r \mid W_r) + \log \left[ \frac{1}{C} \sum_{W_k \in \mathcal{W},\, W_k \neq W_r} p_{\lambda}(O_r \mid W_k) \right], \qquad C = \lvert \mathcal{W} \rvert - 1$$

• Minimization of the overall misclassification measure is similar to MMIE when the language model is assumed uniformly distributed:

$$-d_r(O_r) = \log \frac{p_{\lambda}(O_r \mid W_r)}{\frac{1}{C} \sum_{W_k \neq W_r} p_{\lambda}(O_r \mid W_k)}$$
[Chou 2000]
Training: Minimum Classification Error (2/2)
• Embed a sigmoid (loss) function to smooth the misclassification measure (a numeric sketch follows below):

$$l\!\left(d_r(O_r)\right) = \frac{1}{1 + e^{-\alpha\, d_r(O_r) + \theta}}$$

• Let α = 1 and θ = −log C; then we have

$$l\!\left(d_r(O_r)\right) = \frac{\sum_{W_k \neq W_r} p_{\lambda}(O_r \mid W_k)}{\sum_{W_k \in \mathcal{W}} p_{\lambda}(O_r \mid W_k)} = 1 - P_{\lambda}(W_r \mid O_r)$$

• Minimization of the overall loss directly minimizes the (classification) error rate, so MCE can be regarded as a derivation from MER
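A minimal sketch of the smoothed MCE loss for one utterance, assuming hypothetical log-likelihoods and the plain sigmoid (slope 1, offset 0):

```python
import math

def mce_loss(log_p, ref):
    """Sigmoid-smoothed MCE loss for one utterance (alpha = 1, theta = 0)."""
    # d_r = -log p(O|W_r) + log( (1/C) sum_{k != r} p(O|W_k) ), C = |W| - 1
    comp = [math.exp(v) for w, v in log_p.items() if w != ref]
    d = -log_p[ref] + math.log(sum(comp) / len(comp))
    return 1.0 / (1.0 + math.exp(-d))

log_p = {"ref": -10.0, "hyp1": -11.0, "hyp2": -12.0}  # hypothetical log-likelihoods
print(mce_loss(log_p, "ref"))  # small when the reference dominates its competitors
```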
Training: Minimum Phone Error
• The objective function of Minimum Phone Error (MPE) is directly derived from the overall expected risk criterion
– Replace the loss function l(W_i, W_r) with the so-called accuracy function A(W_i, W_r)
• MPE tries to maximize the expected (phone or word) accuracy of all possible word sequences (generated by the recognizer) regarding the training utterances (a numeric sketch follows below):

$$F_{MPE}(\lambda) = \sum_{r} \sum_{W_i \in \mathcal{W}} P_{\lambda}(W_i \mid O_r)\, A(W_i, W_r) = \sum_{r} \frac{\sum_{W_i \in \mathcal{W}} p_{\lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)}{\sum_{W_k \in \mathcal{W}} p_{\lambda}(O_r \mid W_k)\, P(W_k)}$$
[Povey 2004]
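A minimal sketch of the per-utterance MPE objective, using a hypothetical N-best list (with made-up joint scores and accuracies) in place of a lattice:

```python
import math

def mpe_utterance(log_joint, accuracy):
    """Expected accuracy sum_i P(W_i|O_r) A(W_i, W_r) over an N-best list."""
    m = max(log_joint.values())
    post = {w: math.exp(v - m) for w, v in log_joint.items()}
    z = sum(post.values())
    return sum(post[w] / z * accuracy[w] for w in post)

log_joint = {"ref": -50.0, "hyp1": -51.0, "hyp2": -52.5}   # log p(O|W)P(W), hypothetical
accuracy = {"ref": 7.0, "hyp1": 5.0, "hyp2": 3.0}          # A(W_i, W_r), phone counts
print(mpe_utterance(log_joint, accuracy))
```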
Objective Function Optimization
• The objective function has a “latent variable” problem, such that it cannot be directly optimized; iterative optimization is needed
• Gradient-based approaches
– E.g., MCE
• Expectation Maximization (EM) with a strong-sense auxiliary function
– E.g., MLE
• Weak-sense auxiliary function
– E.g., MMIE, MPE
Three Steps for EM
• Step 1. Draw a lower bound
– Use Jensen's inequality
• Step 2. Find the best lower bound (auxiliary function)
– Let the lower bound touch the objective function at the current guess
• Step 3. Maximize the auxiliary function
– Obtain the new guess
– Go to Step 2 until convergence
[Minka 1998]
Step 1. Draw a lower bound (1/3)
[Figure: the objective function F(θ) and the current guess θ_old]

Step 1. Draw a lower bound (2/3)
[Figure: a lower bound function g(θ) ≤ F(θ) drawn beneath the objective function F(θ)]
Step 1. Draw a lower bound (3/3)
• Apply Jensen's inequality:

$$\log P(O \mid \theta) = \log \sum_{S} P(O, S \mid \theta) = \log \sum_{S} p(S)\, \frac{P(O, S \mid \theta)}{p(S)} \;\geq\; \sum_{S} p(S) \log \frac{P(O, S \mid \theta)}{p(S)}$$

– p(S): an arbitrary probability distribution, so there are infinitely many lower bounds
– The right-hand side is a lower bound function of F(θ) = log P(O | θ)
Step 2. Find the best lower bound (1/4)
[Figure: the objective function F(θ) and a lower bound function g(θ, θ_old)]
Step 2. Find the best lower bound (2/4)
– Let the lower bound touch the objective function at the current guess θ_old
– We want to maximize the lower bound w.r.t. p(S) at θ = θ_old
– The best p(S) is

$$\hat{p}(S) = \arg\max_{p(S)} \sum_{S} p(S) \log \frac{P(O, S \mid \theta_{old})}{p(S)}$$
Step 2. Find the best lower bound (3/4)
• Introduce a Lagrange multiplier λ to enforce the constraint Σ_S p(S) = 1:

$$\frac{\partial}{\partial p(S)} \left[ \sum_{S} p(S) \log \frac{P(O, S \mid \theta)}{p(S)} + \lambda \left( \sum_{S} p(S) - 1 \right) \right] = \log P(O, S \mid \theta) - \log p(S) - 1 + \lambda$$

• Setting the derivative to zero gives

$$p(S) = P(O, S \mid \theta)\, e^{\lambda - 1}, \qquad \sum_{S} P(O, S \mid \theta)\, e^{\lambda - 1} = 1 \;\Rightarrow\; e^{1 - \lambda} = \sum_{S} P(O, S \mid \theta)$$

$$\hat{p}(S) = \frac{P(O, S \mid \theta)}{\sum_{S'} P(O, S' \mid \theta)} = P(S \mid O, \theta)$$
Step 2. Find the best lower bound (4/4)
• Define the auxiliary function for F(θ) (a numeric sketch follows below):

$$g(\theta, \theta_{old}) = \sum_{S} P(S \mid O, \theta_{old}) \log \frac{P(O, S \mid \theta)}{P(S \mid O, \theta_{old})} = \underbrace{\sum_{S} P(S \mid O, \theta_{old}) \log P(O, S \mid \theta)}_{Q \text{ function}} \; - \sum_{S} P(S \mid O, \theta_{old}) \log P(S \mid O, \theta_{old})$$

• We can also check that the bound touches the objective at θ = θ_old:

$$g(\theta_{old}, \theta_{old}) = \sum_{S} P(S \mid O, \theta_{old}) \log \frac{P(O, S \mid \theta_{old})}{P(S \mid O, \theta_{old})} = \sum_{S} P(S \mid O, \theta_{old}) \log P(O \mid \theta_{old}) = F(\theta_{old})$$
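A numeric sketch of Steps 1 and 2 on a toy two-state problem (all component likelihoods hypothetical): with p(S) = P(S|O, θ_old), the Jensen lower bound touches log P(O|θ) exactly at θ_old and stays below it elsewhere:

```python
import math

def joint(theta):
    # P(O, S | theta) for hidden S in {0, 1}: mixing weight theta and
    # hypothetical component likelihoods 0.9 and 0.2 for the observed O.
    return [0.9 * theta, 0.2 * (1.0 - theta)]

def objective(theta):
    return math.log(sum(joint(theta)))          # F(theta) = log P(O | theta)

def lower_bound(theta, theta_old):
    j_old = joint(theta_old)
    post = [j / sum(j_old) for j in j_old]      # best p(S) = P(S | O, theta_old)
    return sum(p * math.log(j / p) for p, j in zip(post, joint(theta)))

theta_old = 0.3
assert abs(lower_bound(theta_old, theta_old) - objective(theta_old)) < 1e-12
assert lower_bound(0.7, theta_old) <= objective(0.7)   # Jensen: a true lower bound
```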
Step 3. Maximize the auxiliary function (1/3)
[Figure: maximizing the auxiliary function g(θ, θ_old) yields the new guess θ_new]

Step 3. Maximize the auxiliary function (2/3)
[Figure: F(θ_new) ≥ F(θ_old), since the bound touches F at θ_old and is maximized at θ_new]

Step 3. Maximize the auxiliary function (3/3)
[Figure: the new guess θ_new becomes the current guess for the next iteration]
Step 2. Find the best lower bound
[Figure: a new auxiliary function g(θ, θ_old) is fitted at the new guess]

Step 3. Maximize the auxiliary function
[Figure: F(θ_new) ≥ F(θ_old) holds at each iteration, so the objective never decreases]
Strong-sense Auxiliary Function
• g(θ, θ_old) is said to be a strong-sense auxiliary function for F(θ) around θ_old iff

$$g(\theta, \theta_{old}) - g(\theta_{old}, \theta_{old}) \;\leq\; F(\theta) - F(\theta_{old})$$
[Povey et al. 2003]
Weak-sense Auxiliary Function (1/5)
• g(θ, θ_old) is said to be a weak-sense auxiliary function for F(θ) around θ_old iff

$$\left. \frac{\partial g(\theta, \theta_{old})}{\partial \theta} \right|_{\theta = \theta_{old}} = \left. \frac{\partial F(\theta)}{\partial \theta} \right|_{\theta = \theta_{old}}$$
Weak-sense Auxiliary Function (2/5)
[Figure: g(θ, θ_old) and F(θ) have the same differential around θ_old]

Weak-sense Auxiliary Function (3/5)
[Figure: maximizing g(θ, θ_old) yields the new guess θ_new]

Weak-sense Auxiliary Function (4/5)
[Figure: F(θ_new) ≥ F(θ_old) may not hold for a weak-sense auxiliary function]
Weak-sense Auxiliary Function (5/5)
• g_sm(θ, θ_old) is said to be a smooth function around θ_old iff it has a maximum there:

$$g_{sm}(\theta, \theta_{old}) \;\leq\; g_{sm}(\theta_{old}, \theta_{old})$$

• Adding a smooth function to a weak-sense auxiliary function for F(θ) can
– Speed up convergence
– Provide a more stable estimate
Smooth Function (1/2)
[Figure: the objective function F(θ) and a smooth function g_sm(θ, θ_old) with its maximum at θ_old]

Smooth Function (2/2)
[Figure: g(θ, θ_old) + g_sm(θ, θ_old) is also a weak-sense auxiliary function for F(θ); maximizing it gives a more stable θ_new]
MPE: Discrimination
• The MPE objective function is less sensitive to portions of the training data that are poorly transcribed
• A (word) lattice structure 𝒲_r^lat can be used here to approximate the set 𝒲 of all possible word sequences of each training utterance O_r
– Training statistics can be efficiently computed via such a structure
MPE: Auxiliary Function (1/2)
• The weak-sense auxiliary function for MPE model updating can be defined as

$$H_{MPE}(\lambda, \lambda_{old}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \gamma_q^{MPE} \log p_{\lambda}(O_r \mid q), \qquad \gamma_q^{MPE} = \left. \frac{\partial F_{MPE}(\lambda)}{\partial \log p_{\lambda}(O_r \mid q)} \right|_{\lambda = \lambda_{old}}$$

– γ_q^MPE is a scalar value (a constant) calculated for each phone arc q, and can be either positive or negative (because of the accuracy function)
– In general, if F depends on the parameters only through terms g_a, then H = Σ_a (∂F/∂g_a)|_old · g_a is a weak-sense auxiliary function for F around the old parameters; this is the construction used here
– The auxiliary function can also be decomposed as

$$H_{MPE}(\lambda, \lambda_{old}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \max\!\left(0, \gamma_q^{MPE}\right) \log p_{\lambda}(O_r \mid q) \; - \; \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \max\!\left(0, -\gamma_q^{MPE}\right) \log p_{\lambda}(O_r \mid q)$$

– The first sum collects arcs with positive contributions (the so-called numerator), the second the arcs with negative contributions (the so-called denominator)
• The terms log p_λ(O_r | q) still have the “latent variable” problem (the hidden state/mixture sequences within each arc)
MPE: Auxiliary Function (2/2)
• The auxiliary function can be modified by considering the normal (strong-sense) auxiliary function for log p_λ(O_r | q):

$$H_{MPE}(\lambda, \lambda_{old}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \gamma_q^{MPE}\; Q_{ML}(\lambda, \lambda_{old}; O_r, q)$$

– The smoothing term is not added yet here
• The key quantity (statistics value) required in MPE training is the differential, which can be termed as

$$\gamma_q^{MPE} = \frac{\partial F_{MPE}(\lambda)}{\partial \log p_{\lambda}(O_r \mid q)}$$
MPE: Statistics Accumulation (1/2)
• The objective function can be expressed as (for a specific phone arc q)

$$F_{MPE}(\lambda) = \sum_{r=1}^{R} \frac{a_r}{b_r}, \qquad a_r = \sum_{W_i \in \mathcal{W}_r^{lat}} p_{\lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r), \qquad b_r = \sum_{W_k \in \mathcal{W}_r^{lat}} p_{\lambda}(O_r \mid W_k)\, P(W_k)$$

• The differential can be expressed as

$$\frac{\partial F_{MPE}}{\partial \log p_{\lambda}(O_r \mid q)} = \frac{\dfrac{\partial a_r}{\partial \log p_{\lambda}(O_r \mid q)}\, b_r \; - \; a_r\, \dfrac{\partial b_r}{\partial \log p_{\lambda}(O_r \mid q)}}{b_r^{\,2}}$$

– The derivatives of a_r and b_r restrict the sums to word sequences passing through the arc q, i.e., W_i ∈ 𝒲_r^lat with q ∈ W_i
MPE: Statistics Accumulation (2/2)
• Carrying out the differentiation (using ∂p(O_r|W_i)/∂log p(O_r|q) = p(O_r|W_i) if q ∈ W_i) yields (a numeric sketch follows below)

$$\gamma_q^{MPE} = \frac{\partial F_{MPE}}{\partial \log p_{\lambda}(O_r \mid q)} = \gamma_q^{r} \left( c_q^{r} - c_{avg}^{r} \right)$$

– The likelihood (posterior) of the arc q:

$$\gamma_q^{r} = \frac{\sum_{W_i \ni q} p_{\lambda}(O_r \mid W_i)\, P(W_i)}{\sum_{W_k} p_{\lambda}(O_r \mid W_k)\, P(W_k)}$$

– The average accuracy of the sentences passing through the arc q:

$$c_q^{r} = \frac{\sum_{W_i \ni q} p_{\lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)}{\sum_{W_i \ni q} p_{\lambda}(O_r \mid W_i)\, P(W_i)}$$

– The average accuracy of all the sentences in the word graph:

$$c_{avg}^{r} = \frac{\sum_{W_i} p_{\lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)}{\sum_{W_k} p_{\lambda}(O_r \mid W_k)\, P(W_k)}$$
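A minimal sketch of γ_q^MPE computed by brute force from an explicit enumeration of (hypothetical) lattice paths; each path carries a joint score p(O|W)P(W) and an accuracy A(W, W_r):

```python
paths = [
    {"arcs": {"q1", "q2"}, "score": 0.5, "acc": 2.0},
    {"arcs": {"q1", "q3"}, "score": 0.3, "acc": 1.0},
    {"arcs": {"q4", "q3"}, "score": 0.2, "acc": 0.0},
]

def gamma_mpe(q):
    total = sum(p["score"] for p in paths)
    c_avg = sum(p["score"] * p["acc"] for p in paths) / total
    thru = [p for p in paths if q in p["arcs"]]
    gamma_q = sum(p["score"] for p in thru) / total          # arc posterior
    c_q = sum(p["score"] * p["acc"] for p in thru) / sum(p["score"] for p in thru)
    return gamma_q * (c_q - c_avg)

print(gamma_mpe("q1"))  # positive: paths through q1 are better than average
```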
MPE: Accuracy Function (1/4)
• c_q^r and c_avg^r can be calculated approximately using the word graph and the Forward-Backward algorithm
• Note that the exact accuracy function is expressed as the sum of phone-level accuracies over all phones q of a word sequence, e.g.

$$A(W_i, W_r) = \sum_{q \in W_i} A(q), \qquad A(q) = \begin{cases} 1, & \text{correct phone} \\ 0, & \text{substitution} \\ -1, & \text{insertion} \end{cases}$$

– However, such accuracy requires a full alignment between the true word sequence and all possible word sequences, which is computationally expensive
MPE: Accuracy Function (2/4)
• An approximated phone accuracy is defined as (a small sketch follows below)

$$A(q) = \max_{z} \begin{cases} -1 + 2\, e(q, z), & \text{if } z \text{ and } q \text{ are the same phone} \\ -1 + e(q, z), & \text{if they are different phones} \end{cases}$$

– e(q, z): the ratio of the portion of z that is overlapped by q
1. Assume the true word sequence has no pronunciation variation
2. Phone accuracy can be obtained by a simple local search
3. Context-independent phones can be used for accuracy calculation
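A minimal sketch of this approximation; arcs are (label, start, end) triples, and all times are hypothetical:

```python
def overlap_ratio(q, z):
    # e(q, z) = |overlap of q and z| / |z|
    ov = max(0.0, min(q[2], z[2]) - max(q[1], z[1]))
    return ov / (z[2] - z[1])

def phone_accuracy(q, reference):
    # A(q) = max_z { -1 + 2 e(q,z) if same phone, -1 + e(q,z) otherwise }
    def score(z):
        e = overlap_ratio(q, z)
        return -1.0 + (2.0 * e if q[0] == z[0] else e)
    return max(score(z) for z in reference)

ref = [("sil", 0.0, 0.2), ("w", 0.2, 0.5), ("eh", 0.5, 0.8)]
print(phone_accuracy(("w", 0.25, 0.5), ref))  # ~0.67: same phone, mostly overlapping
```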
MPE: Accuracy Function (3/4)
• Forward-Backward algorithm for statistics calculation
– Use the “phone graph” as the vehicle

Forward:
    for each phone arc q with start time 0:
        α_q = p(O | q);  ψ_q = A(q)
    end
    for t = 1 to T−1:
        for each phone arc q with start time t:
            α_q = 0
            for each arc r with end time t−1 that can connect to q:
                α_q += α_r · p(q | r)
            end
            ψ_q = [ Σ_r α_r · p(q | r) · ψ_r ] / [ Σ_r α_r · p(q | r) ] + A(q)
            α_q = α_q · p(O | q)
        end
    end
MPE: Accuracy Function (4/4)
Backward:
    for each phone arc q with end time T−1:
        β_q = 1;  φ_q = 0
    end
    for t = T−2 down to 0:
        for each phone arc q with end time t:
            β_q = 0
            for each arc r with start time t+1 that q can connect to:
                β_q += p(r | q) · p(O | r) · β_r
            end
            φ_q = [ Σ_r p(r | q) · p(O | r) · β_r · (φ_r + A(r)) ] / β_q
        end
    end

Combination (a runnable sketch follows below):

$$c_{avg} = \frac{\sum_{q:\, e_q = T-1} \alpha_q\, \psi_q}{\sum_{q:\, e_q = T-1} \alpha_q}$$

    for each phone arc q:
        γ_q = α_q · β_q / Σ_{q': e_{q'} = T−1} α_{q'}
        c_q = ψ_q + φ_q
        γ_q^MPE = γ_q · (c_q − c_avg)
    end
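A runnable sketch of these recursions on a tiny phone graph; the arcs, times, likelihoods, and accuracies are all hypothetical, and transition probabilities are absorbed as uniform:

```python
# Each arc: start frame s, end frame e, segment likelihood lik = p(O|q),
# accuracy A(q), and the list of predecessor arc ids.
arcs = {
    "a": dict(s=0, e=2, lik=0.6, acc=1.0, prev=[]),
    "b": dict(s=0, e=2, lik=0.4, acc=0.0, prev=[]),
    "c": dict(s=3, e=5, lik=0.7, acc=1.0, prev=["a", "b"]),
    "d": dict(s=3, e=5, lik=0.3, acc=-1.0, prev=["a", "b"]),
}
T = 6
succ = {q: [r for r in arcs if q in arcs[r]["prev"]] for q in arcs}

alpha, psi = {}, {}                      # forward prob / forward avg accuracy
for q in sorted(arcs, key=lambda q: arcs[q]["s"]):
    a = arcs[q]
    if not a["prev"]:
        alpha[q], psi[q] = a["lik"], a["acc"]
    else:
        inc = sum(alpha[r] for r in a["prev"])
        psi[q] = sum(alpha[r] * psi[r] for r in a["prev"]) / inc + a["acc"]
        alpha[q] = inc * a["lik"]

beta, phi = {}, {}                       # backward prob / backward avg accuracy
for q in sorted(arcs, key=lambda q: -arcs[q]["e"]):
    if not succ[q]:
        beta[q], phi[q] = 1.0, 0.0
    else:
        beta[q] = sum(arcs[r]["lik"] * beta[r] for r in succ[q])
        phi[q] = sum(arcs[r]["lik"] * beta[r] * (phi[r] + arcs[r]["acc"])
                     for r in succ[q]) / beta[q]

finals = [q for q in arcs if arcs[q]["e"] == T - 1]
z = sum(alpha[q] for q in finals)
c_avg = sum(alpha[q] * psi[q] for q in finals) / z
for q in arcs:
    gamma = alpha[q] * beta[q] / z       # arc posterior
    c_q = psi[q] + phi[q]                # avg accuracy of paths through q
    print(q, gamma * (c_q - c_avg))      # gamma_q^MPE
```

On this toy graph the results agree with brute-force enumeration of the four paths, which is an easy way to test such an implementation.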
MPE: Smoothing Function
• The smoothing function can be defined as

$$H_{sm}(\lambda, \lambda_{old}) = -\sum_{m} \frac{D_m}{2} \left[ \log \lvert \Sigma_m \rvert + \mathrm{tr}\!\left( \Sigma_m^{-1} \Sigma_m^{old} \right) + \left( \mu_m^{old} - \mu_m \right)^{T} \Sigma_m^{-1} \left( \mu_m^{old} - \mu_m \right) \right]$$

– The old model parameters (μ_m^old, Σ_m^old) are used here as the hyper-parameters
– It has a maximum value at (μ_m, Σ_m) = (μ_m^old, Σ_m^old):

$$\frac{\partial H_{sm}}{\partial \mu_m} = D_m \Sigma_m^{-1} \left( \mu_m^{old} - \mu_m \right) = 0 \quad \text{at } \mu_m = \mu_m^{old}$$

$$\frac{\partial H_{sm}}{\partial \Sigma_m^{-1}} = \frac{D_m}{2} \left[ \Sigma_m - \Sigma_m^{old} - \left( \mu_m^{old} - \mu_m \right)\left( \mu_m^{old} - \mu_m \right)^{T} \right] = 0 \quad \text{at } (\mu_m, \Sigma_m) = (\mu_m^{old}, \Sigma_m^{old})$$
MPE: Final Auxiliary Function (1/2)
• Starting from the weak-sense auxiliary function

$$H_{MPE}(\lambda, \lambda_{old}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \gamma_q^{MPE} \log p_{\lambda}(O_r \mid q)$$

each term log p_λ(O_r | q) is replaced by its strong-sense (EM) auxiliary function over the frames t = s_q, ..., e_q of the arc:

$$\sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \gamma_q^{MPE} \sum_{t=s_q}^{e_q} \sum_{m} \gamma_q^{m}(t) \log N\!\left( o_r(t); \mu_m, \Sigma_m \right)$$

• With the smoothing function involved, the result is still a weak-sense auxiliary function
MPE: Final Auxiliary Function (2/2)
• Summary of the chain of auxiliary functions:

$$F_{MPE}(\lambda) = \sum_{r} \frac{\sum_{W_i \in \mathcal{W}_r^{lat}} p_{\lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)}{\sum_{W_k \in \mathcal{W}_r^{lat}} p_{\lambda}(O_r \mid W_k)\, P(W_k)}$$

– Weak-sense auxiliary function: $H_{MPE}(\lambda, \lambda_{old}) = \sum_{r} \sum_{q \in \mathcal{W}_r^{lat}} \gamma_q^{MPE} \log p_{\lambda}(O_r \mid q)$
– Strong-sense auxiliary function for each arc term: $\sum_{r} \sum_{q} \gamma_q^{MPE} \sum_{t=s_q}^{e_q} \sum_{m} \gamma_q^{m}(t) \log N\!\left( o_r(t); \mu_m, \Sigma_m \right)$
– Adding the smooth function keeps the whole a weak-sense auxiliary function:

$$G_{MPE}(\lambda, \lambda_{old}) = \sum_{r} \sum_{q} \gamma_q^{MPE} \sum_{t=s_q}^{e_q} \sum_{m} \gamma_q^{m}(t) \log N\!\left( o_r(t); \mu_m, \Sigma_m \right) - \sum_{m} \frac{D_m}{2} \left[ \log \lvert \Sigma_m \rvert + \mathrm{tr}\!\left( \Sigma_m^{-1} \Sigma_m^{old} \right) + \left( \mu_m^{old} - \mu_m \right)^{T} \Sigma_m^{-1} \left( \mu_m^{old} - \mu_m \right) \right]$$
MPE: Model Update (1/2)
• Based on the final auxiliary function, we have the following update formulas (m: mixture component; d: feature dimension; diagonal covariance matrices); a numeric sketch follows below:

$$\hat{\mu}_{md} = \frac{\left\{ \theta_{md}^{num}(O) - \theta_{md}^{den}(O) \right\} + D_{md}\, \mu_{md}}{\left\{ \gamma_{m}^{num} - \gamma_{m}^{den} \right\} + D_{md}}$$

$$\hat{\sigma}_{md}^{2} = \frac{\left\{ \theta_{md}^{num}(O^2) - \theta_{md}^{den}(O^2) \right\} + D_{md} \left( \sigma_{md}^{2} + \mu_{md}^{2} \right)}{\left\{ \gamma_{m}^{num} - \gamma_{m}^{den} \right\} + D_{md}} - \hat{\mu}_{md}^{2}$$

• In the full-covariance form, the second-order statistics use the correlation matrix $\sum_{i=1}^{T} o(i)\, o(i)^{T}$, and the update is

$$\hat{\Sigma}_{m} = \frac{\left\{ \theta_{m}^{num}(O O^T) - \theta_{m}^{den}(O O^T) \right\} + D_{m} \left( \Sigma_{m} + \mu_{m} \mu_{m}^{T} \right)}{\left\{ \gamma_{m}^{num} - \gamma_{m}^{den} \right\} + D_{m}} - \hat{\mu}_{m} \hat{\mu}_{m}^{T}$$
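A minimal sketch of the update for one Gaussian component and one dimension; the statistics dictionaries (keys g, o, o2 for occupancy, first-order, and second-order sums) and all values are hypothetical:

```python
def mpe_update(num, den, mu, var, D):
    """Extended Baum-Welch style mean/variance update for one dimension."""
    denom = num["g"] - den["g"] + D
    mu_new = (num["o"] - den["o"] + D * mu) / denom
    var_new = (num["o2"] - den["o2"] + D * (var + mu**2)) / denom - mu_new**2
    return mu_new, var_new

num = {"g": 8.0, "o": 4.0, "o2": 6.0}   # numerator statistics (hypothetical)
den = {"g": 5.0, "o": 1.5, "o2": 3.0}   # denominator statistics (hypothetical)
print(mpe_update(num, den, mu=0.4, var=1.0, D=10.0))  # D large enough: var > 0
```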
MPE: Model Update (2/2)
• Two sets of statistics (numerator, denominator) are accumulated respectively:

$$\gamma_{m}^{num} = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \max\!\left(0, \gamma_q^{MPE}\right) \sum_{t=s_q}^{e_q} \gamma_{qm}(t)$$

$$\theta_{m}^{num}(O) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \max\!\left(0, \gamma_q^{MPE}\right) \sum_{t=s_q}^{e_q} \gamma_{qm}(t)\, o(t)$$

$$\theta_{m}^{num}(O^2) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{lat}} \max\!\left(0, \gamma_q^{MPE}\right) \sum_{t=s_q}^{e_q} \gamma_{qm}(t)\, o(t)^2$$

• The denominator statistics γ_m^den, θ_m^den(O), θ_m^den(O²) are accumulated in the same way, with $\max\!\left(0, -\gamma_q^{MPE}\right)$ in place of $\max\!\left(0, \gamma_q^{MPE}\right)$
MPE: Setting Constants (1/2)
• The mean and variance update formulas rely on the proper setting of the smoothing constant D_md (or D_m)
– If D_md is too large, the step size is small and convergence is slow
– If D_md is too small, the algorithm may become unstable
– D_md also needs to make all updated variances positive
• Requiring $\hat{\sigma}_{md}^{2} > 0$ in the variance update formula leads to a quadratic condition in D_md:

$$A\, D_{md}^{2} + B\, D_{md} + C > 0$$

– Writing θ_md(·) = θ_md^num(·) − θ_md^den(·) and γ_m = γ_m^num − γ_m^den, expanding the update gives

$$A = \sigma_{md}^{2}, \qquad B = \theta_{md}(O^2) + \gamma_m \left( \sigma_{md}^{2} + \mu_{md}^{2} \right) - 2\, \theta_{md}(O)\, \mu_{md}, \qquad C = \theta_{md}(O^2)\, \gamma_m - \theta_{md}(O)^{2}$$
MPE: Setting Constants (2/2)
• Previous work [Povey 2004] used a value of D_md that was twice the minimum positive value needed to ensure all variance updates were positive, i.e., twice the larger root of the quadratic (a small sketch follows below):

$$D_{md} = 2 \times \frac{-B + \sqrt{B^{2} - 4AC}}{2A}$$
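A minimal sketch of this rule; the function name and the coefficient values are hypothetical (in practice A, B, C come from the accumulated statistics of each dimension):

```python
import math

def set_constant(A, B, C, scale=2.0):
    """Twice the larger root of A D^2 + B D + C = 0, or 0 if no real root."""
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return 0.0                 # positivity constraint never binds
    root = (-B + math.sqrt(disc)) / (2.0 * A)
    return scale * max(root, 0.0)

print(set_constant(A=1.0, B=-3.0, C=-4.0))  # roots -1 and 4  ->  D = 8
```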
MPE: I-Smoothing
• I-smoothing increases the weight of the numerator counts depending on the amount of data available for each Gaussian
• This is done by adjusting the numerator terms (γ_m^num, θ_m^num(O), θ_m^num(O²)) in the update formulas with τ_m points of the ML moments:

$$\gamma_{m}^{num\prime} = \gamma_{m}^{num} + \tau_m$$

$$\theta_{m}^{num\prime}(O) = \theta_{m}^{num}(O) + \tau_m\, \mu_{m}^{ML}$$

$$\theta_{m}^{num\prime}(O^2) = \theta_{m}^{num}(O^2) + \tau_m \left( \left( \sigma_{m}^{ML} \right)^{2} + \left( \mu_{m}^{ML} \right)^{2} \right)$$

– This emphasizes positive contributions (arcs with higher accuracy)
– τ_m can be set empirically (e.g., τ_m = 100)
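A minimal sketch of the adjustment, reusing the hypothetical g/o/o2 statistics layout from the update sketch above (the ML moments passed in are also hypothetical):

```python
def i_smooth(num, mu_ml, var_ml, tau=100.0):
    """Blend tau ML-like pseudo-counts into the numerator statistics."""
    return {
        "g": num["g"] + tau,
        "o": num["o"] + tau * mu_ml,
        "o2": num["o2"] + tau * (var_ml + mu_ml**2),
    }

num = {"g": 8.0, "o": 4.0, "o2": 6.0}
print(i_smooth(num, mu_ml=0.5, var_ml=1.0, tau=100.0))
```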
Preliminary Experimental Results (1/3)
• Experiment conducted on the MATBN (TV broadcast news) corpus (field reporters)
– Training: 34,672 utterances (25.5 hrs)
– Testing: 292 utterances (1.45 hrs, outside testing)
– Metric: Chinese character error rate (CER)

CER (%)          ML_itr10   ML_itr10 + MPE_itr10   ML_itr150   ML_itr150 + MPE_itr10
MFCC (CMS)       29.48      26.52                  28.72       24.44
MFCC (CN)        26.60      25.05                  26.96       24.54
HLDA+MLLT (CN)   23.64      20.92                  23.78       20.79

– 12.83% and 11.51% relative improvements (MPE over the corresponding ML baselines)
Preliminary Experimental Results (2/3)
• Another experiment conducted on the MATBN (TV broadcast news) corpus (field reporters)
– Discriminative feature: HLDA+MLLT (CN)
– Discriminative training: MPE
– Total 34,672 utterances (24.5 hrs), 10-fold cross validation

[Bar chart, CER (%): MFCC_CN ML 21.43; HLDA+MLLT_CN ML 18.43; HLDA+MLLT_CN MPE 15.8]

– A relative improvement of 26.2% is finally achieved (HLDA+MLLT (CN) + MPE)
Preliminary Experimental Results (3/3)
• Corpus segmentation conducted on the TIMIT corpus
– 50 phone categories
– Training: 4,546 utterances (3.8 hrs)
– Testing: 1,646 utterances (1.4 hrs)
• Minimum Error Length (MEL) criterion over candidate segmentations S_i of each utterance:

$$F_{MEL}(\lambda) = \sum_{r} \sum_{S_i \in \mathcal{S}_r} \frac{p_{\lambda}(O_r \mid S_i)}{\sum_{S_j \in \mathcal{S}_r} p_{\lambda}(O_r \mid S_j)}\; \mathrm{ErrLen}\!\left(S_i^{Seg}, S_r^{Seg}\right)$$

[Figure: frame error rate (%) vs. training iteration (0 = ML, iterations 1-10 = MEL) for the training set, test set, and training-set criterion; y-axis from 7.00 to 17.00]
[Figure: phone lattice (sil, w, ih, eh, dh, ey, ...) generated by forced alignment]
Conclusions & Future Work
• MPE/MWE (or MMI) based discriminative training approaches have shown effectiveness in Chinese continuous speech recognition
• Future work:
– Joint training of feature transformation, acoustic models, and language models
– Unsupervised training
– More in-depth investigation and analysis
– Exploration of variant accuracy/error functions
References
• D. Povey, P. C. Woodland, M. J. F. Gales, “Discriminative MAP for Acoustic Model Adaptation,” in Proc. ICASSP 2003
• R. Schlüter, W. Macherey, B. Müller, H. Ney, “Comparison of Discriminative Training Criteria and Optimization Methods for Speech Recognition,” Speech Communication 34, 2001
• K. Vertanen, “An Overview of Discriminative Training for Speech Recognition”
• V. Goel, S. Kumar, W. Byrne, “Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition,” IEEE Transactions on Speech and Audio Processing, May 2004
• J. W. Kuo, “An Initial Study on Minimum Phone Error Discriminative Learning of Acoustic Models for Mandarin Large Vocabulary Continuous Speech Recognition,” Master Thesis, NTNU, 2005
• J. W. Kuo, B. Chen, “Minimum Word Error Based Discriminative Training of Language Models,” in Proc. Eurospeech 2005
• L. Mangu, E. Brill, A. Stolcke, “Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks,” Computer Speech and Language 14, 2000
• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, Second Edition. New York: John Wiley & Sons, 2000
• R. Schlüter, “Investigations on Discriminative Training Criteria,” Ph.D. dissertation, RWTH Aachen University of Technology, September 2000
• L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, “Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition,” in Proc. ICASSP 1986
• W. Chou, “Discriminant-Function-Based Minimum Recognition Error Rate Pattern-Recognition Approach to Speech Recognition,” Proceedings of the IEEE, Vol. 88, No. 8, 2000
• T. Minka, “Expectation-Maximization as Lower Bound Maximization,” http://research.microsoft.com/~minka/papers/em.html, 1998
Thank You !
Appendix: MMI vs. MPE
[Figure: CER (%) vs. number of training iterations (0-10) for ML, MMI, and MPE; ML stays at 27.88, MMI reaches 25.62, MPE reaches 24.51]
• CER relative reduction: MMI 8%, MPE 12%