英文連續語音辨識之初步研究 An Initial Study on English Continuous Speech Recognition

Transcript
  • An Initial Study on English Continuous Speech Recognition

  • Bayes Theory: $\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \dfrac{p(O \mid W)\,P(W)}{p(O)}$

  • 1. BBN; 2. IBM (T.J. Watson); 3.; 4.; 5. Dragon Systems; 6. LIMSI-CNRS; 7. SRI; 8. AT&T; 9. MsState ISIP; 10. (Microsoft)

  • 20023: (International Computer Science Institute, ICSI); (DARPA) EARS (Effective, Affordable, Reusable Speech-to-Text program); (Rich Transcription) RT03, RT04; (Linguistic Data Consortium, LDC): Switchboard, Switchboard Cellular, Callhome; EARS, LDC (Fisher Collection)

  • ()

    2004 BBN/LIMSI: 20xRT; RT04; 13.5%; 2,300; VTLN, PLP + CMS, HLDA+MLLT; 1. ML-SI (+HLDA): I. STM, II. SCTM, III. Cross-word SCTM; 2. ML-HLDA-SAT (+MLLT)
    2004 IBM: 10xRT; RT04; 15.2%; 2,100; VTLN, PLP + CMVN + LDA, fMPE + LDA+MLLT; 1. SI.DC.PLP; 2. SA.FC.fMPE; 3. SA.DC.fMPE+MPE
    2004 CU-HTK: 10xRT; RT03; 17%; 2,180; VTLN, HLDA + CMVN; MPE + Triphone, Quinphone

  • ()

    2004 BBN/LIMSI: Witten-Bell + Interpolated LM; 1. ML-SI: I. Triphone + Bigram, II. Within-word Quinphone + Trigram, III. Cross-word Quinphone + Fourgram; 2. ML-HLDA-SAT; 3. Regression Classes
    2004 IBM: Kneser-Ney + Interpolated LM; 1. SI.DC.PLP: Quinphone + Fourgram; 2. SA.FC.fMPE: Quinphone + Fourgram; 3. SA.DC.fMPE+MPE: Septaphone + Fourgram
    2004 CU-HTK: Kneser-Ney + Good-Turing + Interpolated LM; 1. Triphone + Fourgram; 2. Quinphone + Fourgram; 3. Lattice MLLR

  • 40 6(silence) sil (pause)sp

  • (): Festlex CMU, 105,626
    begin   b ih g ih n
    coffee  k aa f iy
    hello   hh ax l ow
    yes     y eh s
    ("begin" nil (((b ih g) 0) ((ih n) 1)))
    ("coffee" nil (((k aa f) 1) ((iy) 0)))
    ("hello" nil (((hh ax l) 0) ((ow) 1)))
    ("yes" nil (((y eh s) 1)))
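
    The lexicon entries on this slide are nested lists of the form (word part-of-speech syllables), where each syllable pairs a phone list with a stress mark. A minimal sketch of turning such an entry into the flat phone string shown next to each word could look like the following. The function names `parse_festlex_entry` and `flatten_phones` are illustrative assumptions, not part of Festival or of the original system.

```python
import re


def parse_festlex_entry(line):
    """Parse one entry such as ("begin" nil (((b ih g) 0) ((ih n) 1))).

    Returns (word, [(phones, stress), ...]). Hypothetical helper for illustration.
    """
    # Tokens are quoted strings, single parentheses, or bare symbols.
    tokens = re.findall(r'"[^"]*"|[()]|[^\s()]+', line)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing parenthesis
            return items
        return tok.strip('"')

    word, _part_of_speech, syllables = read()
    return word, [(phones, int(stress)) for phones, stress in syllables]


def flatten_phones(syllables):
    """Drop syllable boundaries and stress marks, keeping only the phone sequence."""
    return [p for phones, _stress in syllables for p in phones]


word, syllables = parse_festlex_entry('("begin" nil (((b ih g) 0) ((ih n) 1)))')
print(word, " ".join(flatten_phones(syllables)))  # begin b ih g ih n
```

    The flattened form is what an HMM-based recognizer typically needs as its pronunciation dictionary; the stress marks are discarded here only to keep the sketch short.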

  • ax: (mean) (Covariance Matrix) ()2 1 (39)

  • (Context dependence)

  • ()

  • 1. (40)

  • 2.40*40*40 =64000() (Data Sparseness)

  • 3. (State)(Tying)(Tree-based Clustering) 1 :(Root)

  • 3. () 2 : (Decision Tree) :

  • 3. ()

  • 4.
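
    The count in step 2 follows from pairing every phone with every possible left and right neighbour. The sketch below expands a phone string into context-dependent units and reproduces the 40*40*40 figure. The `left-centre+right` naming and the use of `sil` as the boundary context are assumptions made for illustration; the slides do not state the notation that was used.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into triphones written as left-centre+right."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{l}-{c}+{r}" for l, c, r in zip(padded, padded[1:], padded[2:])]


n_phones = 40
print(n_phones ** 3)  # 64000 possible triphones

print(to_triphones(["b", "ih", "g", "ih", "n"]))
# ['sil-b+ih', 'b-ih+g', 'ih-g+ih', 'g-ih+n', 'ih-n+sil']
```

    A five-phone word contributes only five triphone tokens, which is why most of the 64,000 possible units are rarely or never seen in a few hours of training speech and why states are tied in step 3.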

  • Viterbi

    (Tree-Copy Search)Bigram(Word Graph Rescoring)Trigram
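
    The Viterbi recursion underlying the first search pass can be sketched on a toy model. This is a generic textbook version in the log domain, not the tree-copy search itself, and the two-state model and all of its probabilities are made-up values chosen only so that the result can be checked by hand.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log-probability."""
    # delta[s]: best log-probability of any path that ends in state s.
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            delta[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: (logs(v) if isinstance(v, dict) else math.log(v)) for k, v in table.items()}


# Hypothetical toy model: two states, two observation symbols.
states = ["sil", "speech"]
log_start = logs({"sil": 0.8, "speech": 0.2})
log_trans = logs({"sil": {"sil": 0.7, "speech": 0.3},
                  "speech": {"sil": 0.2, "speech": 0.8}})
log_emit = logs({"sil": {"lo": 0.9, "hi": 0.1},
                 "speech": {"lo": 0.2, "hi": 0.8}})

path, score = viterbi(["lo", "hi", "hi", "lo"], states, log_start, log_trans, log_emit)
print(path)  # ['sil', 'speech', 'speech', 'sil']
```

    In a two-pass setup like the one named on the slide, a cheaper language model (Bigram) guides this first search and the surviving hypotheses are rescored with a stronger one (Trigram).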

  • : (Channel Effects) (CMS)

    (CMVN) :

    (LDA)(HLDA) (MLLT)
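
    CMS and CMVN can be illustrated with a few lines of code. The sketch below assumes per-utterance normalization over a list of feature frames; whether the original system normalized per utterance or per speaker is not stated on the slide, and the function names are illustrative.

```python
import math


def cms(frames):
    """Cepstral mean subtraction: remove the per-dimension mean over the utterance."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def cmvn(frames, eps=1e-10):
    """Cepstral mean and variance normalization: zero mean, unit variance per dimension."""
    centered = cms(frames)
    n, dim = len(frames), len(frames[0])
    std = [math.sqrt(sum(f[d] ** 2 for f in centered) / n) for d in range(dim)]
    return [[f[d] / max(std[d], eps) for d in range(dim)] for f in centered]


clean = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
# A stationary channel adds a constant offset in the cepstral domain.
shifted = [[x + 5.0, y - 2.0] for x, y in clean]

print(cms(clean))    # [[-2.0, -4.0], [0.0, 0.0], [2.0, 4.0]]
print(cms(shifted))  # same values: the constant offset is removed
```

    Because a fixed channel shows up as an additive constant in the cepstrum, subtracting the mean cancels it; CMVN additionally rescales each dimension.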

  • (LDA) HMM (B) (W)

    (HLDA)

    (MLLT)

    ,,
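
    The role of the between-class (B) and within-class (W) scatter in LDA can be seen in its simplest case, the two-class Fisher discriminant in two dimensions, where the best projection direction is proportional to W^{-1}(m_a - m_b). This is a deliberately reduced sketch: the recognizer would use many HMM-state classes and a full eigen-decomposition, and HLDA and MLLT are not covered. The data points are invented for illustration.

```python
def mean2(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter2(points, m):
    """2x2 scatter matrix of the points around the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Direction w = W^{-1} (m_a - m_b), with W the summed within-class scatter."""
    ma, mb = mean2(class_a), mean2(class_b)
    sa, sb = scatter2(class_a, ma), scatter2(class_b, mb)
    w = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det], [-w[1][0] / det, w[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Hypothetical data: both classes vary much more along y than along x.
class_a = [(0, 0), (1, 0), (0, 4), (1, 4)]
class_b = [(2, 2), (3, 2), (2, 6), (3, 6)]
w = fisher_direction(class_a, class_b)
print(w)  # [-1.0, -0.0625]
```

    Although the class means differ equally in x and y, the direction puts far less weight on y, where the within-class spread is large. That is the trade-off between B and W that the criterion encodes.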

  • () (MFCC) (MFCC+CMS) (MFCC+CMVN) (LDA+MLLT+CMVN) (HLDA+MLLT+CMVN)

  • :(Count Merging)(Model Interpolation)

  • (): (Count Merging): Data level, CA, CB; (Model Interpolation): Model level
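
    The difference between the two adaptation schemes can be sketched with unigram counts. This is a simplified illustration that assumes unsmoothed maximum-likelihood estimates; the experiments in the slides use n-gram models, and the weights, corpora, and function names below are made up.

```python
from collections import Counter


def ml_probs(counts):
    """Maximum-likelihood estimate: relative frequency of each word."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def count_merging(counts_a, counts_b, weight_a=1.0, weight_b=1.0):
    """Data level: add weighted counts first, then estimate one model."""
    merged = Counter()
    for w, c in counts_a.items():
        merged[w] += weight_a * c
    for w, c in counts_b.items():
        merged[w] += weight_b * c
    return ml_probs(merged)


def model_interpolation(counts_a, counts_b, lam):
    """Model level: estimate two models, then mix their probabilities."""
    pa, pb = ml_probs(counts_a), ml_probs(counts_b)
    vocab = set(pa) | set(pb)
    return {w: lam * pa.get(w, 0.0) + (1.0 - lam) * pb.get(w, 0.0) for w in vocab}


# Hypothetical counts: A is a large general corpus, B a small in-domain one.
counts_a = {"the": 8, "news": 2}
counts_b = {"the": 1, "voice": 1}

print(count_merging(counts_a, counts_b)["the"])              # 0.75
print(count_merging(counts_a, counts_b, 1.0, 5.0)["voice"])  # 0.25
print(model_interpolation(counts_a, counts_b, 0.5)["voice"]) # 0.25
```

    With equal weights the larger corpus dominates the merged counts, which is why the in-domain counts are multiplied by a factor such as 50 or 100 in the count-merging experiments. In this unsmoothed unigram case, weighting B by 5 equalizes the corpus sizes and gives the same result as interpolation with a weight of 0.5.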

  • HMMHMM1128

    HMM

  • (Confusion Matrix)(Normalized) () ()(Likelihood) (Substitution) : w iy w eh : w w aw ae
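
    A normalized phone confusion matrix can be built from aligned pairs of reference and recognized phones. The sketch below assumes that each row is normalized by the number of occurrences of the reference phone; the slide does not say which normalization was used, and the aligned pairs are invented.

```python
from collections import defaultdict


def confusion_matrix(pairs):
    """Row-normalized confusion matrix from (reference, recognized) phone pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for ref, hyp in pairs:
        counts[ref][hyp] += 1
    return {ref: {hyp: c / sum(row.values()) for hyp, c in row.items()}
            for ref, row in counts.items()}


def confusable_pairs(matrix, threshold):
    """Substitutions (ref != hyp) whose normalized rate reaches the threshold."""
    found = [(ref, hyp, p)
             for ref, row in matrix.items()
             for hyp, p in row.items()
             if ref != hyp and p >= threshold]
    return sorted(found, key=lambda t: -t[2])


# Hypothetical alignment output.
pairs = ([("z", "z")] * 5 + [("z", "s")] * 3 +
         [("iy", "iy")] * 4 + [("iy", "ih")] * 1 +
         [("w", "w")] * 9 + [("w", "aw")] * 1)

matrix = confusion_matrix(pairs)
print(confusable_pairs(matrix, 0.2))  # [('z', 's', 0.375), ('iy', 'ih', 0.2)]
```

    Pairs above the threshold are the substitutions that occur often enough to be worth modelling explicitly; rarer ones such as w/aw in this toy data are ignored.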

  • (EAT) (16 kHz); (VOA) (16 kHz); (BNC) (102M); 90% / 10%

  • ()

    EAT
    1. grandpa
    2. for instance
    3. six five seven seven four five seven
    4. Green Mountain Energy

    VOA
    1. their workshops were long ago damaged
    2. an internet message taking responsibility for their deaths
    3. it is one of those things that i dreaded the entire time

  • EAT

    VOA

    VOA   5,340     3.33 (hr)   30,637
          500       0.56 (hr)   4,373
          ()        5,178

    EAT   20,000    7.02 (hr)   53,922
          1,000     0.65 (hr)   2,781
          ()        2,370

  • VOA; Feature: MFCC_CMS; Language Model: BNC+VOA (1:1)

    No.  ()   Mixtures   TC (%)  WG (%)
    1    *1   76,073     46.72   54.10
    2    *2   145,318    46.51   53.01
    3    *3   217,744    45.62   52.94
    4    *4   290,505    44.91   50.86

  • () EAT; Feature: MFCC_CMS; Language Model: EAT; 40.55%, 49.53%; 4

    No.  ()   Mixtures   TC (%)  WG (%)
    1125,375             30.12   40.55
    2    *1   143,735    36.41   49.53
    3    *4   549,953    36.45   49.35

  • VOA (Count Merging); Feature: MFCC_CMS; Mixtures: 76,073

    No.  BNC  VOA  Language Model  TC (%)  WG (%)
    1    1    0    BNC             45.90   51.43
    2    0    1    VOA             47.70   49.46
    3    1    1    BNC+VOA         46.72   54.10
    4    1    50   BNC+VOA*50      46.28   53.78
    5    1    100  BNC+VOA*100     46.31   53.65

  • () EAT (Count Merging); Feature: HLDA+MLLT+CMVN; Mixtures: 26,548

    No.  BNC  EAT  Language Model  TC (%)  WG (%)
    1    1    0    BNC             32.21   28.83
    2    0    1    EAT             45.22   52.01
    3    1    1    BNC+EAT         32.35   33.57
    4    1    100  BNC+EAT*100     36.92   39.86

  • ()VOA(Model Interpolation)

    Weight  (%)      Weight  (%)
    0.00    51.43    0.55    52.09
    0.05    52.85    0.60    51.86
    0.10    52.55    0.65    51.70
    0.15    52.94    0.70    51.43
    0.20    52.80    0.75    51.15
    0.25    52.57    0.80    50.97
    0.30    52.28    0.85    50.81
    0.35    52.14    0.90    50.29
    0.40    52.05    0.95    49.85
    0.45    52.16    1.00    48.48
    0.50    52.23    -       -

  • VOA; Language Model: BNC+VOA (1:1); *1

    No.  Feature           Mixtures  TC (%)  WG (%)
    1    MFCC              78,412    45.25   52.05
    2    MFCC_CMS          76,073    46.72   54.10
    3    MFCC_CMVN         73,083    45.83   51.64
    4    LDA+MLLT_CMVN     70,672    51.54   59.89
    5    HLDA+MLLT_CMVN    71,627    49.23   54.42

  • () EAT; Language Model: EAT; *1; MFCC, MFCC_CMS, MFCC_CMVN; EAT (Channel Effects)

    No.  Feature           Mixtures  TC (%)  WG (%)
    1    MFCC              145,319   29.69   40.04
    2    MFCC_CMS          143,735   36.41   49.53
    3    MFCC_CMVN         138,713   33.93   47.02
    4    LDA+MLLT_CMVN     138,289   47.30   59.53
    5    HLDA+MLLT_CMVN    141,333   46.48   59.71

  • ex.0~1 viterbi

  • (Supervised Training) (Lightly Supervised Training) (Unsupervised Training)How are youHow are you

  • (True Transcription)

  • ()

    51.73, 58.20, 57.84

  • EAT

    HLDA+MLLT+CMVN
    20,000    7.02 (hr)   53,922
    42,960    33.4 (hr)   108,323
    1,000     0.65 (hr)   2,781
    ()        4,229

  • EAT

    No.  Model    Mixtures  TC (%)  WG (%)
    1    HMM(1)   141,333   50.14   57.84
    2    HMM(3)   221,820   49.78   51.73
    3    HMM(4)   191,314   50.86   58.20

  • ()EAT

    No.  Model    Mixtures  TC (%)  WG (%)
    1    HMM(1)   141,333   50.14   57.84
    2    HMM(2)   216,318   56.29   64.74

  • ()EAT0.2

    z   s   0.38      ay  ax  0.25
    sh  s   0.38      ay  t   0.25
    jh  r   0.33      k   t   0.23
    jh  t   0.33      uh  ax  0.23
    zh  ax  0.33      m   n   0.23
    zh  l   0.33      ao  ow  0.23
    zh  sh  0.33      ch  n   0.22
    aw  l   0.30      th  s   0.22
    ng  n   0.29      b   f   0.21
    d   t   0.27      l   r   0.20
    aw  aa  0.25      iy  ih  0.20

  • () () M EAT(General)EAT = *

    MNAMN10120.510150.512160.41021400.4:::

  • ()EAT

                    (%) ()            (%) ()
    No.  ()    ()   TC      WG        TC      WG
    -    -     -    50.61   58.05     50.61   58.05
    1    0.80  0    45.87   52.73     46.87   55.28
    2    0.97  0    49.60   56.79     49.86   57.87
    3    0.97  0.1  51.08   58.20     50.93   58.23
    4    0.97  0.3  50.86   57.87     51.15   58.52

  • VOAEAT

    No.  VOA              EAT
    1    LDA+MLLT+CMVN    HLDA+MLLT+CMVN
    2    3.33 (5340)      40.42 (62906)
    3    0.56 (500)       0.65 (1000)
    4    5,178            4,229
    5    70,672 ()        216,310 ()
    6    4,373            8,850
    7    BNC+VOA (1:1)    EAT
    8    59.89 %          65.71 %

  • (Minimum Phone Error, MPE)EAT