Post on 01-Jan-2016
An Initial Study on English Continuous Speech Recognition
Bayes Theory: p(O)
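The Bayes formulation this slide points to is usually written as below. This is the standard textbook decision rule, not a formula recovered from the slide: only p(O) survives in the source, and the symbols W (word sequence) and O (acoustic observation sequence) are assumed notation.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{p(O \mid W)\, P(W)}{p(O)}
        = \arg\max_{W} p(O \mid W)\, P(W)
```

Because p(O) does not depend on W, it can be dropped from the maximization; p(O | W) is then supplied by the acoustic model and P(W) by the language model.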
1. BBN
2. IBM (T.J. Watson)
3.
4.
5. Dragon Systems
6. LIMSI-CNRS
7. SRI
8. AT&T
9. MsState ISIP
10. (Microsoft)
() 20023 (International Computer Science Institute, ICSI) (DARPA) EARS (Effective Affordable Reusable Speech-to-text Program) (Rich Transcription) RT03, RT04 (Linguistic Data Consortium, LDC) Switchboard, Switchboard Cellular, Callhome; EARS, LDC (Fisher Collection)
()
BBN, IBM, CU

| 2004 BBN/LIMSI | IBM 2004 | 2004 CU-HTK |
|---|---|---|
| 20 RT | 10 RT | 10 RT |
| RT04 | RT04 | RT03 |
| 13.5% | 15.2% | 17% |
| 2,300 () | 2,100 () | 2,180 () |
| VTLN (); PLP + CMS; HLDA+MLLT | VTLN; PLP + CMVN + LDA; fMPE + LDA+MLLT | VTLN; HLDA + CMVN |
| 1. ML-SI (+HLDA): I. STM, II. SCTM, III. Cross-word SCTM; 2. ML-HLDA-SAT (+MLLT) | 1. SI.DC.PLP; 2. SA.FC.fMPE; 3. SA.DC.fMPE+MPE | MPE + Triphone, Quinphone |
()
BBN, IBM, CU

| 2004 BBN/LIMSI | IBM 2004 | 2004 CU-HTK |
|---|---|---|
| Witten-Bell + Interpolated LM | Kneser-Ney + Interpolated LM | Kneser-Ney + Good-Turing + Interpolated LM |
| 1. ML-SI: I. Triphone + Bigram, II. Within-word Quinphone + Trigram, III. Cross-word Quinphone + Fourgram; 2. ML-HLDA-SAT; 3. Regression Classes | 1. SI.DC.PLP: Quinphone + Fourgram; 2. SA.FC.fMPE: Quinphone + Fourgram; 3. SA.DC.fMPE+MPE: Septaphone + Fourgram | 1. Triphone + Fourgram; 2. Quinphone + Fourgram; 3. Lattice MLLR |
40 6(silence) sil (pause)sp
(): Festlex CMU, 105,626

begin   b ih g ih n
coffee  k aa f iy
hello   hh ax l ow
yes     y eh s

("begin" nil (((b ih g) 0) ((ih n) 1)))
("coffee" nil (((k aa f) 1) ((iy) 0)))
("hello" nil (((hh ax l) 0) ((ow) 1)))
("yes" nil (((y eh s) 1)))

Festlex CMU
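The two notations on this slide appear to be the same four entries in Festlex form and in a flat word-plus-phones form. A minimal sketch of that conversion follows, assuming the entry layout shown on the slide; the function name `festlex_to_flat` and the regular expressions are hypothetical and not part of Festival or any lexicon toolkit.

```python
import re

# One syllable group such as "((b ih g) 0)": a phone list, then a stress digit.
SYLLABLE = re.compile(r"\(\(([^()]+)\)\s*(\d)\)")
# The headword is the first double-quoted string in the entry.
WORD = re.compile(r'\("([^"]+)"')


def festlex_to_flat(entry):
    """Turn one Festlex-style entry into (word, [phones]), dropping stress marks."""
    word = WORD.search(entry).group(1)
    phones = []
    for syllable, _stress in SYLLABLE.findall(entry):
        phones.extend(syllable.split())
    return word, phones


# The four entries shown on the slide.
entries = [
    '("begin" nil (((b ih g) 0) ((ih n) 1)))',
    '("coffee" nil (((k aa f) 1) ((iy) 0)))',
    '("hello" nil (((hh ax l) 0) ((ow) 1)))',
    '("yes" nil (((y eh s) 1)))',
]

for entry in entries:
    word, phones = festlex_to_flat(entry)
    print(word, " ".join(phones))
```

The sketch discards syllable boundaries and stress, which matches the flat pronunciations listed on the slide.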
ax: (mean) (Covariance Matrix) () 2 1 (39)
(Context dependence)
()
1. (40)
2. 40*40*40 = 64,000 () (Data Sparseness)
3. (State) (Tying) (Tree-based Clustering) 1: (Root)
3. () 2: (Decision Tree):
3. ()
4.
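The context-dependency steps above can be illustrated with a short sketch: it reproduces the 40*40*40 count and expands a phone sequence into triphone labels. The `left-center+right` label format and the use of `sil` as the boundary context are assumptions for illustration; the slides do not state which notation was used, and `to_triphones` is a hypothetical helper.

```python
NUM_PHONES = 40  # phone inventory size given on the slide


def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-center+right context labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Every (left, center, right) combination is a distinct model in principle,
# which is why most of them are seen rarely or never in training data.
print(NUM_PHONES ** 3)

# "yes" from the lexicon slide: y eh s
print(to_triphones(["y", "eh", "s"]))
```

State tying with a decision tree then shares parameters among triphones whose contexts answer the tree's questions the same way, so that rarely seen contexts borrow statistics from similar ones.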
Viterbi
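The slide names Viterbi decoding without further detail. Below is a minimal sketch on a toy two-state HMM with discrete observations; the state names, probabilities, and the `viterbi` and `logs` helpers are all hypothetical, and a real recognizer would use Gaussian-mixture emissions over feature vectors instead.

```python
import math


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(value) for key, value in table.items()}


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Toy model (made-up numbers, for illustration only).
states = ["A", "B"]
log_start = logs({"A": 0.6, "B": 0.4})
log_trans = {"A": logs({"A": 0.7, "B": 0.3}), "B": logs({"A": 0.4, "B": 0.6})}
log_emit = {"A": logs({"x": 0.9, "y": 0.1}), "B": logs({"x": 0.2, "y": 0.8})}

log_prob, path = viterbi(["x", "x", "y"], states, log_start, log_trans, log_emit)
print(path, math.exp(log_prob))
```

Working in log-probabilities avoids numerical underflow on long utterances, which is the usual reason decoders add scores rather than multiply probabilities.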
(Tree-Copy Search): Bigram; (Word Graph Rescoring): Trigram
: (Channel Effects) (CMS)
(CMVN) :
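The two normalizations named above (CMS and CMVN) can be sketched as follows. This assumes per-utterance statistics and feature frames stored as plain lists of floats; the function names `cms` and `cmvn` and the small `eps` guard are hypothetical choices, not taken from the slides.

```python
import math


def cms(frames):
    """Cepstral mean subtraction: remove the per-dimension mean over the utterance."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def cmvn(frames, eps=1e-10):
    """Cepstral mean and variance normalization: zero mean, unit variance per dimension."""
    centered = cms(frames)
    n = len(centered)
    dims = len(centered[0])
    stds = [
        math.sqrt(sum(frame[d] ** 2 for frame in centered) / n)
        for d in range(dims)
    ]
    # eps guards against division by zero for a constant dimension.
    return [[frame[d] / (stds[d] + eps) for d in range(dims)] for frame in centered]


# Three toy 2-dimensional frames (made-up values).
frames = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
print(cms(frames))
print(cmvn(frames))
```

A stationary channel adds a constant offset in the cepstral domain, so subtracting the utterance mean removes it; CMVN additionally rescales each dimension.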
(LDA)(HLDA) (MLLT)
(LDA) HMM (B) (W)
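The LDA objective hinted at by the (B) and (W) labels is commonly written as below. This is the textbook criterion, offered as an assumption about what the slide showed: B and W are taken to be the between-class and within-class scatter matrices with HMM states as the classes, and the symbol θ for the projection matrix is assumed notation.

```latex
\hat{\theta} = \arg\max_{\theta}
  \frac{\left| \theta^{\top} B \, \theta \right|}
       {\left| \theta^{\top} W \, \theta \right|}
```

Under this criterion the columns of θ are the leading eigenvectors of W^{-1} B, so the projection keeps the directions in which the classes are best separated relative to their internal spread.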
(HLDA)
(MLLT)
,,
() (MFCC) (MFCC+CMS) (MFCC+CMVN) (LDA+MLLT+CMVN) (HLDA+MLLT+CMVN)
:(Count Merging)(Model Interpolation)
():
(Count Merging): Data level, CA, CB
(Model Interpolation): Model level
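The difference between the two adaptation schemes can be sketched on unigram counts. This is a minimal illustration under assumptions: maximum-likelihood unigrams without smoothing, toy counts, and hypothetical names (`mle`, `count_merging`, `model_interpolation`, `weight_b`, `lam`); the experiments on the later slides use n-gram models built with a toolkit that the slides do not name.

```python
from collections import Counter


def mle(counts):
    """Maximum-likelihood unigram probabilities from raw counts."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def count_merging(counts_a, counts_b, weight_b=1.0):
    """Data level: add the counts (C_A + weight_b * C_B), then estimate one model."""
    merged = Counter()
    for word, count in counts_a.items():
        merged[word] += count
    for word, count in counts_b.items():
        merged[word] += weight_b * count
    return mle(merged)


def model_interpolation(counts_a, counts_b, lam):
    """Model level: estimate two models, then mix them as (1 - lam) * P_A + lam * P_B."""
    p_a, p_b = mle(counts_a), mle(counts_b)
    vocab = set(p_a) | set(p_b)
    return {
        word: (1.0 - lam) * p_a.get(word, 0.0) + lam * p_b.get(word, 0.0)
        for word in vocab
    }


# Toy counts: corpus A is large and general, corpus B is small and in-domain.
counts_a = {"the": 6, "news": 2}
counts_b = {"the": 1, "news": 1}

print(count_merging(counts_a, counts_b))                # equal weighting
print(count_merging(counts_a, counts_b, weight_b=3.0))  # in-domain counts weighted up
print(model_interpolation(counts_a, counts_b, lam=0.5))
```

With count merging the larger corpus dominates unless the smaller one is weighted up, whereas model interpolation fixes each model's influence through the weight regardless of corpus size.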
HMMHMM1128
HMM
(Confusion Matrix)(Normalized) () ()(Likelihood) (Substitution) : w iy w eh : w w aw ae
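A normalized confusion matrix of the kind named on this slide can be built from aligned phone pairs as sketched below. The alignment pairs are toy data, the helper names `confusion_probs` and `confusable` are hypothetical, and reading each row as P(recognized | reference) is an assumption, since the slide does not say in which direction the matrix is normalized.

```python
from collections import defaultdict


def confusion_probs(aligned_pairs):
    """Row-normalize counts of (reference, recognized) pairs into probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for ref, hyp in aligned_pairs:
        counts[ref][hyp] += 1
    probs = {}
    for ref, row in counts.items():
        total = sum(row.values())
        probs[ref] = {hyp: count / total for hyp, count in row.items()}
    return probs


def confusable(probs, threshold):
    """List substitutions (ref != hyp) whose probability reaches the threshold."""
    return sorted(
        (ref, hyp, p)
        for ref, row in probs.items()
        for hyp, p in row.items()
        if ref != hyp and p >= threshold
    )


# Toy alignment output (made-up counts, not the experiment's data).
pairs = (
    [("iy", "iy")] * 8 + [("iy", "ih")] * 2
    + [("z", "z")] * 5 + [("z", "s")] * 3
    + [("w", "w")] * 9 + [("w", "aw")] * 1
)

probs = confusion_probs(pairs)
print(confusable(probs, threshold=0.2))
```

Filtering by a threshold keeps only substitutions frequent enough to be worth modelling, which is the role a cut-off such as 0.2 would play.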
(EAT) (16 kHz); (VOA) (16 kHz); (BNC) (102M); 90% / 10%
()
EAT
1. grandpa
2. for instance
3. six five seven seven four five seven
4. Green Mountain Energy

VOA
1. their workshops were long ago damaged
2. an internet message taking responsibility for their deaths
3. it is one of those things that i dreaded the entire time
EAT
VOA
|  | (hr) |  |
|---|---|---|
| 5,340 | 3.33 | 30,637 |
| 500 | 0.56 | 4,373 |

(): 5,178

|  | (hr) |  |
|---|---|---|
| 20,000 | 7.02 | 53,922 |
| 1,000 | 0.65 | 2,781 |

(): 2,370
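VOA. Feature: MFCC_CMS; Language Model: BNC+VOA (1:1)

| No. |  | Mixtures | TC (%) | WG (%) |
|---|---|---|---|---|
| 1 | *1 | 76,073 | 46.72 | 54.10 |
| 2 | *2 | 145,318 | 46.51 | 53.01 |
| 3 | *3 | 217,744 | 45.62 | 52.94 |
| 4 | *4 | 290,505 | 44.91 | 50.86 |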
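() EAT. Feature: MFCC_CMS; Language Model: EAT. 40.55%, 49.53%, 4

| No. |  | Mixtures | TC (%) | WG (%) |
|---|---|---|---|---|
| 1 |  | 125,375 | 30.12 | 40.55 |
| 2 | *1 | 143,735 | 36.41 | 49.53 |
| 3 | *4 | 549,953 | 36.45 | 49.35 |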
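VOA (Count Merging). Feature: MFCC_CMS; Mixtures: 76,073 () BNC, VOA, BNC, BNC

| No. | BNC | VOA | Language model | TC (%) | WG (%) |
|---|---|---|---|---|---|
| 1 | 1 | 0 | BNC | 45.90 | 51.43 |
| 2 | 0 | 1 | VOA | 47.70 | 49.46 |
| 3 | 1 | 1 | BNC+VOA | 46.72 | 54.10 |
| 4 | 1 | 50 | BNC+VOA*50 | 46.28 | 53.78 |
| 5 | 1 | 100 | BNC+VOA*100 | 46.31 | 53.65 |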
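() EAT (Count Merging). Feature: HLDA+MLLT+CMVN; Mixtures: 26,548. EAT, BNC, EAT, BNC

| No. | BNC | EAT | Language model | TC (%) | WG (%) |
|---|---|---|---|---|---|
| 1 | 1 | 0 | BNC | 32.21 | 28.83 |
| 2 | 0 | 1 | EAT | 45.22 | 52.01 |
| 3 | 1 | 1 | BNC+EAT | 32.35 | 33.57 |
| 4 | 1 | 100 | BNC+EAT*100 | 36.92 | 39.86 |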
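() VOA (Model Interpolation)

| Weight | (%) |
|---|---|
| 0.00 | 51.43 |
| 0.05 | 52.85 |
| 0.10 | 52.55 |
| 0.15 | 52.94 |
| 0.20 | 52.80 |
| 0.25 | 52.57 |
| 0.30 | 52.28 |
| 0.35 | 52.14 |
| 0.40 | 52.05 |
| 0.45 | 52.16 |
| 0.50 | 52.23 |
| 0.55 | 52.09 |
| 0.60 | 51.86 |
| 0.65 | 51.70 |
| 0.70 | 51.43 |
| 0.75 | 51.15 |
| 0.80 | 50.97 |
| 0.85 | 50.81 |
| 0.90 | 50.29 |
| 0.95 | 49.85 |
| 1.00 | 48.48 |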
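VOA. Language Model: BNC+VOA (1:1) *1

| No. | Feature | Mixtures | TC (%) | WG (%) |
|---|---|---|---|---|
| 1 | MFCC | 78,412 | 45.25 | 52.05 |
| 2 | MFCC_CMS | 76,073 | 46.72 | 54.10 |
| 3 | MFCC_CMVN | 73,083 | 45.83 | 51.64 |
| 4 | LDA+MLLT_CMVN | 70,672 | 51.54 | 59.89 |
| 5 | HLDA+MLLT_CMVN | 71,627 | 49.23 | 54.42 |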
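() EAT. Language Model: EAT *1. MFCC, MFCC_CMS, MFCC_CMVN; EAT (Channel Effects)

| No. | Feature | Mixtures | TC (%) | WG (%) |
|---|---|---|---|---|
| 1 | MFCC | 145,319 | 29.69 | 40.04 |
| 2 | MFCC_CMS | 143,735 | 36.41 | 49.53 |
| 3 | MFCC_CMVN | 138,713 | 33.93 | 47.02 |
| 4 | LDA+MLLT_CMVN | 138,289 | 47.30 | 59.53 |
| 5 | HLDA+MLLT_CMVN | 141,333 | 46.48 | 59.71 |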
ex.0~1 viterbi
(Supervised Training), (Lightly Supervised Training), (Unsupervised Training)
How are you / How are you
(True Transcription)
()
51.73, 58.20, 57.84
EAT
HLDA+MLLT+CMVN

|  | (hr) |  |
|---|---|---|
| 20,000 | 7.02 | 53,922 |
| 42,960 | 33.4 | 108,323 |
| 1,000 | 0.65 | 2,781 |

(): 4,229
EAT
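| No. | Model | Mixtures | TC (%) | WG (%) |
|---|---|---|---|---|
| 1 | HMM(1) | 141,333 | 50.14 | 57.84 |
| 2 | HMM(3) | 221,820 | 49.78 | 51.73 |
| 3 | HMM(4) | 191,314 | 50.86 | 58.20 |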
()EAT
| No. | Model | Mixtures | TC (%) | WG (%) |
|---|---|---|---|---|
| 1 | HMM(1) | 141,333 | 50.14 | 57.84 |
| 2 | HMM(2) | 216,318 | 56.29 | 64.74 |
()EAT0.2
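| Phone | Confused phone | Probability |
|---|---|---|
| z | s | 0.38 |
| sh | s | 0.38 |
| jh | r | 0.33 |
| jh | t | 0.33 |
| zh | ax | 0.33 |
| zh | l | 0.33 |
| zh | sh | 0.33 |
| aw | l | 0.30 |
| ng | n | 0.29 |
| d | t | 0.27 |
| aw | aa | 0.25 |
| ay | ax | 0.25 |
| ay | t | 0.25 |
| k | t | 0.23 |
| uh | ax | 0.23 |
| m | n | 0.23 |
| ao | ow | 0.23 |
| ch | n | 0.22 |
| th | s | 0.22 |
| b | f | 0.21 |
| l | r | 0.20 |
| iy | ih | 0.20 |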
() () M EAT(General)EAT = *
MNAMN10120.510150.512160.41021400.4:::
()EAT
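| No. |  |  | TC (%) | WG (%) | TC (%) | WG (%) |
|---|---|---|---|---|---|---|
|  |  |  | 50.61 | 58.05 | 50.61 | 58.05 |
| 1 | 0.80 | 0 | 45.87 | 52.73 | 46.87 | 55.28 |
| 2 | 0.97 | 0 | 49.60 | 56.79 | 49.86 | 57.87 |
| 3 | 0.97 | 0.1 | 51.08 | 58.20 | 50.93 | 58.23 |
| 4 | 0.97 | 0.3 | 50.86 | 57.87 | 51.15 | 58.52 |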
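VOA, EAT

| No. | VOA | EAT |
|---|---|---|
| 1 | LDA+MLLT+CMVN | HLDA+MLLT+CMVN |
| 2 | 3.33 (5340) | 40.42 (62906) |
| 3 | 0.56 (500) | 0.65 (1000) |
| 4 | 5,178 | 4,229 |
| 5 | 70,672 () | 216,310 () |
| 6 | 4,373 | 8,850 |
| 7 | BNC+VOA (1:1) | EAT |
| 8 | 59.89% | 65.71% |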
(Minimum Phone Error, MPE) EAT