SpeechRecognitionHUNG-YI LEE ๆๅฎๆฏ
LAS: ๅฐฑๆฏ seq2seq
CTC: decoder ๆฏ linear classifier ็ seq2seq
RNA: ่ผธๅ ฅไธๅๆฑ่ฅฟๅฐฑ่ฆ่ผธๅบไธๅๆฑ่ฅฟ็ seq2seq
RNN-T: ่ผธๅ ฅไธๅๆฑ่ฅฟๅฏไปฅ่ผธๅบๅคๅๆฑ่ฅฟ็ seq2seq
Neural Transducer: ๆฏๆฌก่ผธๅ ฅไธๅ window ็ RNN-T
MoCha: window ็งปๅไผธ็ธฎ่ชๅฆ็ Neural Transducer
Last Time
Two Points of Views
Source of image: ๆ็ณๅฑฑ่ๅธซใๆธไฝ่ช้ณ่็ๆฆ่ซใ
Seq-to-seq HMM
Hidden Markov Model (HMM)
SpeechRecognition
speech text
X Y
Yโ = ๐๐๐maxY
๐ Y|๐
= ๐๐๐maxY
๐ ๐|Y ๐ Y
๐ ๐
= ๐๐๐maxY
๐ ๐|Y ๐ Y
๐ ๐|Y : Acoustic Model
๐ Y : Language Model
HMM
Decode
HMM
hh w aa t d uw y uw th ih ng k
what do you think
t-d+uw1 t-d+uw2 t-d+uw3
โฆโฆ t-d+uw d-uw+y uw-y+uw y-uw+th โฆโฆ
d-uw+y1 d-uw+y2 d-uw+y3
Phoneme:
Tri-phone:
State:
๐ ๐|Y ๐ ๐|๐
A token sequence Y corresponds to a sequence of states S
HMM๐ ๐|Y ๐ ๐|๐
A sentence Y corresponds to a sequence of states S
๐ ๐ ๐
Start End
๐ ๐
HMM๐ ๐|Y ๐ ๐|๐
A sentence Y corresponds to a sequence of states S
Transition Probability
๐ ๐|๐๐ ๐
Emission Probability
t-d+uw1
d-uw+y3
P(x|โt-d+uw1โ)
P(x|โd-uw+y3โ)
Gaussian Mixture Model (GMM)
Probability from one state to another
HMM โ Emission Probability
โข Too many states โฆโฆ
P(x|โt-d+uw3โ)P(x|โd-uw+y3โ)
Tied-state
Same Addresspointer pointer
็ตๆฅตๅๆ : Subspace GMM [Povey, et al., ICASSPโ10]
(Geoffrey Hinton also published deep learning for ASR in the same conference)
[Mohamed , et al., ICASSPโ10]
P๐ ๐|๐ =?
transition
๐ ๐|๐
๐ ๐emission(GMM)
โโ๐๐๐๐๐(๐)
๐ ๐|โ
alignment
which state generates which vector
โ1 = ๐๐๐๐๐๐
๐ ๐ ๐
โ2 = ๐๐๐๐๐๐
โ = ๐๐๐๐๐๐
๐ ๐|โ1
๐ ๐|โ2
โ = ๐๐๐๐๐
๐ ๐ ๐ ๐ ๐ ๐๐ ๐|๐ ๐ ๐|๐ ๐ ๐|๐ ๐ ๐|๐ ๐ ๐|๐
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐ ๐ฅ1|๐ ๐ ๐ฅ2|๐ ๐ ๐ฅ3|๐ ๐ ๐ฅ4|๐ ๐ ๐ฅ5|๐ ๐ ๐ฅ6|๐
How to use Deep Learning?
Method 1: Tandem
โฆโฆ โฆโฆ
๐ฅ๐
Size of output layer= No. of states DNN
โฆโฆ
โฆโฆ
Last hidden layer or bottleneck layer are also possible.
New acoustic feature for HMM
๐(๐|๐ฅ๐) ๐(๐|๐ฅ๐) ๐(๐|๐ฅ๐)
State classifier
Method 2: DNN-HMM Hybrid
DNN
๐ ๐|๐ฅโฆโฆ
๐ฅ
๐ ๐ฅ|๐
๐ ๐ฅ|๐ =๐ ๐ฅ, ๐
๐ ๐=๐ ๐|๐ฅ ๐ ๐ฅ
๐ ๐Count from training data
CNN, LSTM โฆ
DNN output
How to train a state classifier?
Train HMM-GMM model
state sequence: { a , b , c }
Acoustic features:
Utterance + Label(without alignment)
Utterance + Label(aligned)
state sequence:
Acoustic features:
a a a b b c c
How to train a state classifier?
Train HMM-GMM model
Utterance + Label(without alignment)
Utterance + Label(aligned)
DNN1
How to train a state classifier?
Utterance + Label(without alignment)
Utterance + Label(aligned)
DNN2DNN1realignment
Human Parity!
โข ๅพฎ่ป่ช้ณ่พจ่ญๆ่ก็ช็ ด้ๅคง้็จ็ข๏ผๅฐ่ฉฑ่พจ่ญ่ฝๅ้ไบบ้กๆฐดๆบ๏ผ(2016.10)
โข https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216
โข IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)
โข http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/
Machine 5.9% v.s. Human 5.9%
Machine 5.5% v.s. Human 5.1%
[Yu, et al., INTERSPEECHโ16]
[Saon, et al., INTERSPEECHโ17]
Very Deep
[Yu, et al., INTERSPEECHโ16]
Back to End-to-end
LAS
๐ ๐|X =?
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
๐ฅ4๐ฅ3๐ฅ2๐ฅ1
โข LAS directly computes ๐ ๐|X
๐ ๐
๐ ๐|X = ๐ ๐|๐ ๐ ๐|๐, ๐ โฆ๐ง0
๐0
๐ง1
Size V
๐1
๐ง1 ๐ง2
a
๐ ๐ ๐ ๐
๐2
๐ง3
b
๐ ๐ธ๐๐
Beam Search
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Decoding:
Training:
CTC, RNN-T
๐ ๐|X =?
๐ฅ4๐ฅ3๐ฅ2๐ฅ1
โข LAS directly computes ๐ ๐|X
๐ ๐
๐ ๐|X = ๐ ๐|๐ ๐ ๐|๐, ๐ โฆ
โข CTC and RNN-T need alignment
Encoder
โ4โ3โ2โ1
โ = ๐ ๐ ๐ ๐๐ โ|๐
๐ ๐ ๐ ๐๐ ๐ ๐ ๐
P Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
โ ๐ ๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐Beam Search
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Decoding:
Training:
HMM, CTC, RNN-T
P๐ ๐|๐ =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
P๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
HMM, CTC, RNN-T
๐ ๐|๐ =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
All the alignments
HMM
CTC
RNN-T
LAS
ไฝ ๅๅจๅฟไป้บผโบ
c a t c c a a a t c a a a a t
c ๐ a a t t
add ๐
duplicate to length ๐
๐ c a ๐ t ๐
add ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐
๐ = 6
SpeechRecognition
speech text
c a t
๐ = 3
โฆ
duplicatec a t
to length ๐
โฆ
c a t โฆโฆ
๐ โ โ ๐๐๐๐๐(๐)
HMM c a t c c a a a t c a a a a tduplicate to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
nexttoken
duplicate
For n = 1 to ๐
output the n-th token ๐ก๐ times
constraint: ๐ก1 + ๐ก2 +โฏ๐ก๐ = ๐, ๐ก๐ > 0Trellis Graph
c c c
a
a
t
aa
t
t t
HMM c a t c c a a a t c a a a a tduplicate to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
nexttoken
duplicate
For n = 1 to ๐
output the n-th token ๐ก๐ times
constraint: ๐ก1 + ๐ก2 +โฏ๐ก๐ = ๐, ๐ก๐ > 0Trellis Graph
c c c c cc
For n = 1 to ๐
output the n-th token ๐ก๐ times
constraint:
output โ๐โ ๐๐ times
๐ก๐ > 0 ๐๐ โฅ 0
output โ๐โ ๐0 times
๐ก1 + ๐ก2 +โฏ๐ก๐ +
๐0 + ๐1 +โฏ๐๐ = ๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
next token
duplicate
insert ๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
(๐ can be skipped)
c
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
duplicate ๐
next token
cannot skip any token
๐
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
next token
duplicate
insert ๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
duplicate
next token
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
cc
a
t
๐
a
๐
t
๐๐
c
๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
c
a
t
๐๐
c c ca
t
c
๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
s
๐
e
๐
e
๐
next token
duplicate
insert ๐
โฆ ee โฆ โ e
Exception: when the next token is the same token
c
๐
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
For n = 1 to ๐
output the n-th token 1 times
constraint:
output โ๐โ ๐๐ times
output โ๐โ ๐0 times
๐0 + ๐1 +โฏ๐๐ = ๐
c a t
Put some ๐ (option)
Put some ๐ (option)
Put some ๐ (option)
Put some ๐
at least once
๐๐ > 0
๐๐ โฅ 0 for n = 1 to ๐ โ 1
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
output token
Insert ๐
๐
๐
c
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
output token
Insert ๐
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
๐c
๐ ๐๐
๐ก๐
๐ก๐ ๐
c
a
๐ ๐ ๐ ๐ ๐ ๐
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
output token
Insert ๐
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
๐
๐ ๐ ๐ ๐ ๐
๐ก
c
a
๐
c a tStart End
c a tStart End
๐ ๐ ๐ ๐
c a tStart End
๐ ๐ ๐ ๐
HMM
CTC
RNN-T
not gen not gen not gen
1. Enumerate all the possible alignments
HMM, CTC, RNN-T
๐ ๐|Y =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
This part is challenging.
Score Computation
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
output token
Insert ๐
๐c
๐ ๐๐
๐๐ก๐ ๐
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐
๐ โ|๐
= ๐ ๐|๐
ร ๐ ๐|๐, ๐
ร ๐ ๐|๐, ๐๐ โฆโฆ
๐
โ1 โ2 โ2 โ3 โ4 โ4 โ5 โ5 โ6
๐1,0
๐
๐2,0
๐
๐2,1
๐
๐3,1
๐
๐4,1
๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐0 ๐0 ๐1 ๐1 ๐1 ๐2 ๐2 ๐3๐3
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐
Score Computation
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐1,0 ๐
๐2,0 ๐
๐2,1 ๐ ๐3,1 ๐๐4,1 ๐
๐4,2 ๐๐5,2 ๐ก
๐5,3 ๐ ๐6,3 ๐
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐
๐ โ|๐
Score Computation
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
c<B
OS>
a
๐ 0๐ 1
๐ 2
Because ๐ is not considered!
๐4,2 ๐
๐4,2 ๐๐4,2
๐
โ1 โ2 โ2 โ3 โ4 โ4 โ5 โ5 โ6
๐1,0
๐
๐2,0
๐
๐2,1
๐
๐3,1
๐
๐4,1
๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐0 ๐0 ๐1 ๐1 ๐1 ๐2 ๐2 ๐3๐3
๐
โ4 โ5 โ5 โ6
๐ ๐ ๐ ๐ ๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐2 ๐2 ๐3๐3๐ ๐๐ ๐ ๐
๐๐ ๐ ๐๐
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ผ4,2
๐ผ4,2 = ๐ผ4,1๐4,1 ๐ + ๐ผ3,2๐3,2 ๐
๐ผ4,1
๐ผ3,2
generate โaโ
read ๐ฅ4
generate โ๐ โ,
๐ผ๐,๐:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens
๐ผ๐,๐
โโ๐๐๐๐๐ ๐
๐ โ|๐
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ผ4,2 = ๐ผ4,1๐4,1 ๐ + ๐ผ3,2๐3,2 ๐
๐ผ๐,๐:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens
You can compute summation of the scores of all the alignments.
1. Enumerate all the possible alignments
HMM, CTC, RNN-T
P๐ ๐|Y =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
P๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
Training
๐ c ๐ ๐ a ๐ t ๐ ๐
๐1,0 ๐ ๐2,0 ๐ ๐2,1 ๐ ๐3,1 ๐ ๐4,1 ๐ ๐4,2 ๐ ๐5,2 ๐ก ๐5,3 ๐ ๐6,3 ๐
๐โ = ๐๐๐max๐
๐๐๐๐ ๐|๐
๐ ๐|๐ =
โ
๐ โ|๐
๐๐ ๐|๐
๐๐=?
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐4,1 ๐
๐3,2 ๐
๐1,0 ๐ ๐2,0 ๐ ๐2,1 ๐ ๐3,1 ๐ ๐4,1 ๐ ๐4,2 ๐ ๐5,2 ๐ก ๐5,3 ๐ ๐6,3 ๐
๐ ๐|๐ =
โ
๐ โ|๐
Each arrow is a component in ๐ ๐|๐ =
โ
๐ โ|๐
Training
๐ c ๐ ๐ a ๐ t ๐ ๐
๐1,0 ๐ ๐2,0 ๐ ๐2,1 ๐ ๐3,1 ๐ ๐4,1 ๐ ๐4,2 ๐ ๐5,2 ๐ก ๐5,3 ๐ ๐6,3 ๐
๐๐ ๐|๐
๐๐=?
๐โ = ๐๐๐max๐
๐๐๐๐ ๐|๐
๐ ๐|๐ =
โ
๐ โ|๐
๐
๐4,1 ๐
๐3,2 ๐
โฆโฆ๐ ๐|๐
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐
๐4,1 ๐
๐3,2 ๐
โฆโฆ๐ ๐|๐
๐๐ ๐|๐
๐๐=?
Each arrow is a component
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ
๐4,1 ๐
๐3,2 ๐
๐1,0 ๐
โ1 โ2 โ2 โ3 โ4
๐1,0
๐2,0 ๐
๐2,0
๐2,1 ๐
๐2,1
๐3,1 ๐
๐3,1
๐4,1 ๐
๐4,1
c<BOS>
๐0 ๐1
๐0 ๐0 ๐1 ๐1 ๐1
๐๐4,1 ๐
๐๐=?
Backpropagation (through time)
To encoder
๐ ๐|๐ =
โ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐ +
โ ๐ค๐๐กโ๐๐ข๐ก ๐4,1 ๐
๐ โ|๐
=
โ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐
๐4,1 ๐
๐๐ ๐|๐
๐๐=?
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ
๐4,1 ๐ ร ๐๐กโ๐๐
๐๐ ๐|๐
๐๐4,1 ๐
=1
๐4,1 ๐
โ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐
=
โ ๐ค๐๐กโ ๐4,1 ๐
๐๐กโ๐๐
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ฝ4,2
๐ฝ4,2 = ๐ฝ4,3๐4,2 ๐ก + ๐ฝ5,2๐4,2 ๐
generate โtโ
read ๐ฅ5
generate โ๐โ
๐ฝ๐,๐:the summation of the score of all the alignmentsstaring from i-th acoustic features and j-th tokens
๐ฝ5,2
๐ฝ4,3
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ฝ4,2
๐ผ4,1
๐4,1 ๐
๐๐ ๐|๐
๐๐4,1 ๐=
1
๐4,1 ๐
๐ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐ ๐ผ4,1 ๐4,1 ๐ ๐ฝ4,2
๐๐ ๐|๐
๐๐=?
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ๐ผ4,1๐ฝ4,2
1. Enumerate all the possible alignments
HMM, CTC, RNN-T
P๐ ๐|Y =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
P๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
Yโ = ๐๐๐maxY
๐๐๐P Y|๐
= ๐๐๐maxY
๐๐๐
โโ๐๐๐๐๐ ๐
๐ โ|๐
โ ๐๐๐maxY
maxโโ๐๐๐๐๐ ๐
๐๐๐๐ โ|๐
maxโโ๐๐๐๐๐ ๐
๐ โ|๐็ๆณ
โโ = ๐๐๐ maxโ
๐๐๐๐ โ|๐
Yโ = ๐๐๐๐๐โ1 โโ
Decoding
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐ โฆ
๐ โ|๐ = ๐ โ1|๐ ๐ โ2|๐, โ1 ๐ โ3|๐, โ1, โ2 โฆ
โ1 โ2 โ3
็พๅฏฆ
โโ = ๐๐๐ maxโ
๐๐๐๐ โ|๐
๐
โ1 โ2 โ2 โ3 โ4 โ4 โ5 โ5 โ6
๐1,0
๐
๐2,0
๐
๐2,1
๐
๐3,1
๐
๐4,1
๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐0 ๐0 ๐1 ๐1 ๐1 ๐2 ๐2 ๐3๐3
max
โโ
Beam Search for better
approximation
Summary
LAS CTC RNN-T
Decoder dependent independent dependent
Alignment not explicit (soft alignment)
Yes Yes
Training just train it sum over alignment
sum over alignment
On-line No Yes Yes
Reference
โข [Yu, et al., INTERSPEECHโ16] Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke , Guoli Ye, Jinyu Li , Geoffrey Zweig, Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention, INTERSPEECH, 2016
โข [Saon, et al., INTERSPEECHโ17] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, BhuvanaRamabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, English Conversational Telephone Speech Recognition by Humans and Machines, INTERSPEECH, 2017
โข [Povey, et al., ICASSPโ10] Daniel Povey, Lukas Burget, Mohit Agarwal, Pinar Akyazi, Kai Feng, Arnab Ghoshal, Ondrej Glembek, Nagendra Kumar Goel, Martin Karafiat, Ariya Rastrow, Richard C. Rose, Petr Schwarz, Samuel Thomas, Subspace Gaussian Mixture Models for speech recognition, ICASSP, 2010
โข [Mohamed , et al., ICASSPโ10] Abdel-rahman Mohamed and Geoffrey Hinton, Phone recognition using Restricted Boltzmann Machines, ICASSP, 2010
Top Related