Towards Machine Comprehension of Spoken Content


Transcript of Towards Machine Comprehension of Spoken Content

Page 1: Towards Machine Comprehension of Spoken Content (on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-6_General Speak…)

Can machines learn human language by themselves, without being taught? (機器能不能無師自通學習人類語言)

Hung-yi Lee

Page 2

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf

Page 3

Abstractive Summarization

• Machines can now do abstractive summarization (writing summaries in their own words).

[Diagram: the training data consists of documents paired with their titles (Title 1, Title 2, Title 3); the machine then generates titles in its own words, without hand-crafted rules.]

Page 4

Abstractive Summarization

• Input: transcriptions of audio from automatic speech recognition (ASR); output: summary

[Diagram: an RNN encoder reads through the input words w1 … w4, producing hidden states h1 … h4; an RNN generator then outputs the summary words z1, z2, …]

We need lots of labelled training data (supervised).
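As a rough sketch of this encoder/generator pipeline (not the talk's actual model — the vanilla-RNN weights, sizes, and word ids below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 12          # hidden size and toy vocabulary size (arbitrary)
W_xh = rng.normal(scale=0.1, size=(V, H))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
W_hy = rng.normal(scale=0.1, size=(H, V))   # hidden-to-output weights

def encode(word_ids):
    """RNN encoder: read through the input words w1..wT, return states h1..hT."""
    h = np.zeros(H)
    states = []
    for w in word_ids:
        h = np.tanh(W_xh[w] + h @ W_hh)
        states.append(h)
    return states

def generate(h, max_len=5):
    """Greedy RNN generator: emit the most likely word id at each step."""
    out = []
    for _ in range(max_len):
        w = int(np.argmax(h @ W_hy))
        out.append(w)
        h = np.tanh(W_xh[w] + h @ W_hh)
    return out

states = encode([3, 1, 4, 1])    # transcription words w1..w4 (fake ids)
summary = generate(states[-1])   # summary words z1, z2, ...
```

With supervised training, the generator's outputs would be fit to reference summaries — which is exactly why lots of labelled document-summary pairs are needed.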

Page 5

Unsupervised Abstractive Summarization

• Machines can now do abstractive summarization with seq2seq models (writing summaries in their own words)

[Diagram: a seq2seq model maps each training document to a summary (summary 1, summary 2, summary 3).]

Page 6

Unsupervised Abstractive Summarization

[Diagram: a generator G (seq2seq) maps a document to a word sequence — the candidate summary; a reconstructor R (seq2seq) maps that word sequence back to the document.]

Only a large collection of documents is needed to train the model.

This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation.

Problem: the latent word sequence is not readable …

Page 7

Unsupervised Abstractive Summarization

[Diagram: as before, a generator G (seq2seq) maps the document to a word sequence (the candidate summary), and a reconstructor R (seq2seq) maps it back to the document. A discriminator D, trained with human-written summaries, judges whether the word sequence is real or not.]

The generator learns to make the discriminator consider its output real, so the latent word sequence becomes a readable summary.

The REINFORCE algorithm is used (the sampled word sequence is discrete, so the model cannot be trained end-to-end by backpropagation).
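A minimal illustration of the REINFORCE idea on a fabricated one-parameter problem — a Bernoulli "policy" rewarded for emitting action 1, not the talk's summarization model. The point is that the gradient of the expected reward is estimated through log-probabilities, so no backpropagation through the discrete sample is needed:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Policy: p(a=1) = sigmoid(theta). Reward: 1 for action 1, else 0.
theta = 0.0
for _ in range(500):
    p = sigmoid(theta)
    a = rng.random() < p                  # sample a discrete action
    reward = 1.0 if a else 0.0
    grad_logp = (1.0 - p) if a else -p    # d log p(a) / d theta for a Bernoulli
    theta += 0.1 * reward * grad_logp     # ascend the estimated gradient of E[R]

p_final = sigmoid(theta)                  # probability of the rewarded action
```

Ascending `reward * grad_logp` pushes the policy toward the rewarded discrete choice, which is how the generator above can be trained against the discriminator's score.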

Page 8

Unsupervised Abstractive Summarization

• Document: "Australia today signed bilateral anti-doping agreements with 13 countries, aimed at strengthening out-of-competition drug testing and sharing research results …"

• Summary:

• Human: "Australia signs anti-doping agreements with 13 countries"

• Unsupervised: "Australia strengthens out-of-competition drug testing"

• Document: "The Republic of China Olympic Committee today received an invitation to the 1992 Winter Olympics; as chairman 張豐緒 is currently on a goodwill visit to Central and South America, it has not yet been decided whether to send a team …"

• Summary:

• Human: "Invited by letter to the 1992 Winter Olympics"

• Unsupervised: "Olympic committee receives Winter Olympics invitation letter"

(Thanks to 王耀賢 for providing the experimental results.)

Page 9

Unsupervised Abstractive Summarization

• Document: "Local media reported on the 27th that two provinces on Indonesia's Sumatra island have seen days of torrential rain; flooding and landslides had killed at least 60 people and left more than 100 missing as of the 26th …"

• Summary:

• Human: "Indonesian floods kill 60"

• Unsupervised: 印尼門洪水泛濫導致塌雨 (garbled — roughly "Indonesia 門 flooding causes collapse-rain")

• Document: "Hefei, Anhui Province recently set new rules for leading cadres visiting the grassroots: travel light with a small entourage; no welcome-and-send-off receptions and no layers of accompanying officials …"

• Summary:

• Human: "Hefei rules that cadres' grassroots visits be kept simple"

• Unsupervised: 合肥領導幹部下基層做搞迎來送往規定:一律簡 (garbled — roughly "Hefei leading cadres grassroots do welcome-send-off rules: all simple")

(Thanks to 王耀賢 for providing the experimental results.)

Page 10

Semi-supervised Learning

[Figure: ROUGE-1 score (y-axis, 25 to 34) versus number of document-summary pairs used (x-axis: 0, 10k, 500k), comparing WGAN, Reinforce, and fully supervised training; the supervised model uses matched data (3.8M pairs). Unpublished result.]

Page 11

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

Page 12

Sequence-to-sequence

Encoder Decoder

Input sentence c

output sentence x

Training data:

A: How are you ?

B: I’m good.

……

……

How are you ?

I’m good.

Seq2seq

Outputs: "Not bad" / "I'm John"

Training criterion: maximize likelihood — but the response humans judge better is not necessarily the one to which the model assigns higher likelihood.

Page 13

Reinforcement Learning

• The machine obtains feedback from the user.

• The chat-bot learns to maximize the expected reward.

[Dialogue examples: "How are you?" → "Bye bye ☺" receives reward -10; "Hello" → "Hi ☺" receives reward 3.]

Page 14

AlphaGo-style training!

• Let two agents talk to each other

[Two dialogues between the agents: (1) "How old are you?" → "See you." → "See you." → "See you." (degenerate); (2) "How old are you?" → "I am 16." → "I thought you were 12." → "What makes you think so?"]

Still need humans to provide reward

https://arxiv.org/pdf/1606.01541.pdf

Page 15

Generative Adversarial Network (GAN)

• Let two agents talk to each other

[Diagram: the two agents produce a dialogue ("How old are you?" → "See you." → "See you." → "See you."); a discriminator D, trained on human dialogues, judges whether the dialogue is real or not. This is a conditional GAN.]

https://arxiv.org/pdf/1701.06547.pdf

Page 16

Personalized Chat-bot

• General chat-bots generate plain responses.

• Humans talk in different styles and with different sentiments to different people in different conditions.

• We want the chat-bot's responses to be controllable, so that chat-bots can be personalized in the future.

• Here we focus only on generating positive responses.

Input: "How was your day today?"
Ordinary chat-bot: "It is terrible today."
Optimistic chat-bot: "It is wonderful today."

[Chih-Wei Lee, et al., ICASSP, 2018]

Page 17

Approaches

Type 1. System Modification: the chat-bot's encoder/decoder (En/De) parameters are modified so that the response sentence itself is positive.
(1. Persona-based approach; 2. Reinforcement learning)

Type 2. Output Transformation: the chat-bot does not have to change; a transformation module converts its response sentence into a positive response.
(3. Plug & Play; 4. Cycle GAN)

Page 18

Approaches

• 1. Persona-Based Model [Jiwei Li, et al., ACL, 2016]

Training: for the input "How is today" with reference response "Today is awesome", an off-the-shelf sentiment classifier scores how positive the reference is — here 0.9 — and that score conditions the model.

Page 19

Approaches

• 1. Persona-Based Model [Jiwei Li, et al., ACL, 2016]

Training: for the input "How is today" with reference response "Today is bad", the sentiment classifier scores the reference 0.1.
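A toy stand-in for this conditioning idea (everything here is fabricated: the "response" is reduced to a single valence number, and a one-weight linear model plays the role of the chat-bot). Feeding the sentiment score of the reference as an extra input during training turns the score into a control knob at test time:

```python
import numpy as np

# Fabricated training pairs: (sentiment score of reference, valence of response).
train = [(0.9, +1.0),   # reference "Today is awesome" scored 0.9 -> positive
         (0.1, -1.0)]   # reference "Today is bad"     scored 0.1 -> negative

w, b = 0.0, 0.0
for _ in range(2000):                 # plain SGD on squared error
    for s, y in train:
        err = w * s + b - y
        w -= 0.1 * err * s
        b -= 0.1 * err

def response_valence(score):
    """Predicted valence of the generated response given the control score."""
    return w * score + b
```

At test time, setting the score high (as on the next slide) steers the model toward positive responses.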

Page 20

Approaches

• 1. Persona-Based Model

Testing: input "I love you".
With the score set to 1.0 → response: "I love you, too."
With the score set to 0.0 → response: "I am not ready to start a relationship."

Page 21

Approaches

2. Reinforcement Learning

For the input "How is today", the response "Today is bad" is scored 0.1 by the sentiment classifier. A positive response earns a positive reward, and the network parameters are updated to maximize it.

Page 22

Approaches

3. Plug & Play (output transformation: the chat-bot does not change; its response sentence is transformed into a positive response)

[Diagram: a VRAE encoder maps a sentence (the chat-bot's response) to a latent code, and a VRAE decoder maps the code back to a sentence. To transform a response, its code is modified into a new code such that the sentiment classifier's score is as large as possible while the new code stays as close as possible to the original; decoding the new code gives the positive response.]

VRAE = Variational Recurrent Auto-encoder
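The plug-&-play step can be sketched on made-up numbers: nudge a latent code `z` by gradient ascent to raise a (here, linear and invented) sentiment score while a quadratic penalty keeps it close to the original code:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6
w = rng.normal(size=dim)     # stand-in sentiment-classifier weights (score = w @ z)
z0 = rng.normal(size=dim)    # latent code of the chat-bot's original response

def objective(z, lam=1.0):
    """Score as large as possible, code as close as possible to z0."""
    return w @ z - lam * np.sum((z - z0) ** 2)

z = z0.copy()
for _ in range(200):                 # gradient ascent on the code
    grad = w - 2.0 * (z - z0)
    z += 0.05 * grad
# z now scores higher than z0 while staying near it (fixed point: z0 + w/2);
# in the real model, decoding z would give the positive response.
```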

Page 23

Approaches

4. Cycle GAN (output transformation: the response sentence is transformed into a positive response)

One domain contains positive sentences ("It is good.", "It's a good day.", "I love you."), the other negative sentences ("It is bad.", "It's a bad day.", "I don't love you.") — e.g. collected from different speakers. No paired data is needed.

Page 24

Cycle GAN

[Diagram: generators G_X→Y and G_Y→X translate between the two domains; cycle consistency forces G_Y→X(G_X→Y(x)) to be as close as possible to x, and likewise in the other direction. Discriminators D_X and D_Y each output a scalar indicating whether a sentence belongs to the corresponding domain.]
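The cycle-consistency term can be shown on a deliberately tiny example (the adversarial losses of the full model are omitted, and the "translators" are just scalars): training only the round-trip error drives G_Y→X ∘ G_X→Y toward the identity.

```python
# Toy "domain translators": G_XY(x) = a*x, G_YX(y) = b*y.
# Minimizing the cycle loss (G_YX(G_XY(x)) - x)^2 for x = 1 drives a*b -> 1.
a, b = 0.5, 0.5
lr = 0.1
for _ in range(200):
    resid = a * b - 1.0            # round-trip error
    a -= lr * 2.0 * resid * b      # d/da (a*b - 1)^2
    b -= lr * 2.0 * resid * a      # d/db (a*b - 1)^2
round_trip = a * b                 # converges to ~1: the cycle reconstructs x
```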

Page 25

Cycle GAN

[Diagram: cycle examples — "It is bad." → "It is good." → back to "It is bad." (as close as possible to the input); "I love you." → "I hate you." → back to "I love you." The discriminators judge whether a sentence is a positive or a negative sentence.]

Page 26

Cycle GAN

[Same diagram as the previous page.]

Sentences are discrete — so the generators operate on word embeddings instead [Lee, et al., ICASSP, 2018].

Page 27

Cycle GAN

• Negative sentence to positive sentence:

it's a crappy day → it's a great day
i wish you could be here → you could be here
it's not a good idea → it's good idea
i miss you → i love you
i don't love you → i love you
i can't do that → i can do that
i feel so sad → i happy
it's a bad day → it's a good day
it's a dummy day → it's a great day
sorry for doing such a horrible thing → thanks for doing a great thing
my doggy is sick → my doggy is my doggy
my little doggy is sick → my little doggy is my little doggy

Page 28

Human Evaluation (objective evaluation shows the same trend)

Sentiment Grammar Coherence

Seq2seq (Baseline) 0.16 1.00 0.55

Persona-based 0.71 0.75 0.24

Reinforcement 0.70 0.93 0.35

Plug & Play 0.48 0.43 0.15

Cycle GAN 0.63 0.91 0.44

Sentiment: is the chat-bot's response positive? (Human annotators gave scores from 0 to 5, normalized to the range 0 to 1.)

Grammar: is the chat-bot's response grammatically correct?

Coherence: does the chat-bot output a suitable response?

Page 29

Human Evaluation (table repeated from the previous page; objective evaluation shows the same trend)

⚫ Modifying the chat-bot's parameters leads to more obviously positive responses.

⚫ Reinforcement learning and cycle GAN make fewer grammatical errors.

⚫ Cycle GAN has better semantic coherence.

Page 30

Example 1

• Input: yeah, i’m just less interested in the how-to and more in the what and why of the things we build

• Seq2seq: it’s not a good thing .

• Persona: I’m interested in this .

• RL: it’s a good one

• Plug and play: it’s not a wrong thing .

• Cycle GAN: it’s a good thing .

Page 31

Example 2

• Input: always learning and self improving is a huge turn on for me so believe me i understand

• Seq2seq: i'm not sure if i can see you in a relationship

• Persona: and you're gonna have a word with your mother .

• RL: i love you so much

• Plug and play: i'm not sure if i can see you a a special

• Cycle GAN: i'm sure if i can see you in a relationship.

Page 32

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

Page 33

Voice Conversion

Page 34

In the past — parallel data: Speaker A and Speaker B both say "How are you?" and "Good morning", and conversion is learned from the paired utterances.

Today — non-parallel data: Speaker A says 天氣真好 ("the weather is nice") and 再見囉 ("goodbye"), while Speaker B says "How are you?" and "Good morning".

Speakers A and B are talking about completely different things.

Page 35

Voice Conversion

• Multi-target VC [Chou et al., arXiv 2018]

[Diagram, Stage 1: an encoder Enc maps the input speech x to enc(x); a classifier C is trained adversarially on enc(x) so that speaker information is removed; a decoder Dec, conditioned on a speaker vector y, produces dec(enc(x), y) — or dec(enc(x), y′) for another speaker y′. Stage 2: a generator G refines the converted output, and a combined discriminator-and-classifier D+C judges fake/real (F/R) and speaker identity (ID) against real data.]

Page 36

Voice Conversion (Multi-target VC)

• Subjective evaluations

[Figure: preference test results.]

1. The proposed method uses non-parallel data.
2. The two-stage multi-target VC approach outperforms using stage 1 only.
3. The multi-target VC approach is comparable to Cycle-GAN-VC in naturalness and similarity.

Page 37

[Audio demo: conversion examples between Speaker A and Speaker B.]

(Thanks to 周儒杰 for providing the experimental results.)

Page 38

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

Page 39

Audio Word2Vector

[Diagram: a model maps each word-level audio segment to a vector.]

The model is learned from lots of audio without annotation.

Page 40

Audio Word2Vector (v1)

• The audio segments corresponding to words with similar pronunciations are close to each other.

[Embedding space: the segments for "ever" cluster together, those for "never" together, and "dog"/"dogs" together.]

In the following discussion, assume that we already have the segmentation.

Page 41

Query-by-example Spoken Term Detection

A user speaks a query (e.g. "Trump"). The system computes the similarity between the spoken query and the audio files at the acoustic level, and finds where the query term (here "Trump") occurs in the spoken content.

Page 42

Query-by-example Spoken Term Detection

Off-line: the audio archive is divided into variable-length audio segments, and each segment is mapped to a vector by Audio Word to Vector.

On-line: the spoken query is mapped to a vector by the same model; vector similarity gives the search result.

Much faster than DTW
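The on-line step reduces to nearest-neighbor search over precomputed vectors — one similarity pass instead of a DTW alignment against every segment. A sketch with fabricated embeddings (in the talk these would come from audio word2vec):

```python
import numpy as np

rng = np.random.default_rng(0)
archive = rng.normal(size=(100, 16))   # off-line: 100 segment embeddings (fake)
query_idx = 42
# On-line: the spoken query, embedded by the same model (here: a noisy copy).
query = archive[query_idx] + 0.01 * rng.normal(size=16)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(query, seg) for seg in archive])
best = int(np.argmax(scores))          # segment where the query term occurs
```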

Page 43

Sequence-to-sequence Auto-encoder

[Diagram: an RNN encoder reads the acoustic features x1 x2 x3 x4 of an audio segment and compresses them into a single vector — the vector we want.]

We use a sequence-to-sequence auto-encoder here; the training is unsupervised.

Page 44

Sequence-to-sequence Auto-encoder

[Diagram: the RNN encoder reads the acoustic features x1 x2 x3 x4 of the audio segment; the RNN decoder then outputs y1 y2 y3 y4, trained to reconstruct the input acoustic features. The RNN encoder and decoder are jointly trained.]

Page 45

Original seq2seq auto-encoder: an RNN encoder maps the input segment to a vector, and an RNN decoder reconstructs the segment. The vector mixes together phonetic information, speaker information, etc.

Feature Disentangle: instead, a phonetic encoder produces z and a speaker encoder produces e; the RNN decoder reconstructs the input segment from (z, e).

Page 46

Feature Disentangle

[Diagram: the phonetic encoder (z) and speaker encoder (e) feed the RNN decoder, which reconstructs the input segment.]

Training the speaker encoder: for two segments x_i and x_j from the same speaker, the embeddings e_i and e_j are made as close as possible; for segments from different speakers, their distance must be larger than a threshold. (Assume the speaker ID of each segment is known.)
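This same-speaker / different-speaker objective has the shape of a contrastive loss; the exact form below (squared distance vs. squared hinge) is an assumption for illustration, not necessarily the talk's formula:

```python
import numpy as np

def speaker_loss(e_i, e_j, same_speaker, threshold=1.0):
    """Contrastive-style criterion for the speaker encoder (toy sketch):
    pull same-speaker embeddings together, push different-speaker
    embeddings at least `threshold` apart."""
    d = np.linalg.norm(e_i - e_j)
    if same_speaker:
        return d ** 2                      # as close as possible
    return max(0.0, threshold - d) ** 2    # no penalty once far enough

a = np.array([0.0, 0.0])
b = np.array([0.1, 0.0])   # near a
c = np.array([2.0, 0.0])   # far from a
```

Note the asymmetry: different-speaker pairs stop contributing once they pass the threshold, so the embedding space is not pushed apart without bound.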

Page 47

Feature Disentangle

[Diagram: as before, the phonetic encoder (z) and speaker encoder (e) feed the RNN decoder, which reconstructs the input segment.]

Training the phonetic encoder: a speaker classifier takes the phonetic embeddings z_i and z_j of two segments and scores whether they come from the same speaker or different speakers. The phonetic encoder learns to confuse the speaker classifier, so that z carries no speaker information. (Inspired by domain adversarial training.)
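Domain adversarial training is usually implemented with a gradient reversal layer; a bare-bones sketch of that trick (autograd frameworks implement it as a custom op):

```python
import numpy as np

# Gradient reversal layer: forward pass is the identity, backward pass flips
# the gradient and scales it by lam. Placed between the phonetic encoder and
# the speaker classifier, minimizing the classifier's loss then *maximizes*
# it with respect to the encoder -- i.e., the encoder learns to confuse it.
def grad_reverse_forward(z):
    return z

def grad_reverse_backward(grad_output, lam=1.0):
    return -lam * grad_output

g = np.array([0.5, -2.0])   # example upstream gradient
```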

Page 48

[Visualization: audio segments of two different speakers in the disentangled embedding space.]

Page 49

Joint Learning of Segmentation and Seq2seq Auto-encoder

• At each time step, the RNN encoder determines whether it is right before a segment boundary.

• If it is, the encoder outputs a vector — an embedding for the audio segment x that just ended.

Where to segment is determined automatically.

Page 50

Joint Learning of Segmentation and Seq2seq Auto-encoder

[Diagram: the RNN encoder outputs an embedding for each detected segment x; the RNN decoder reconstructs the input utterance from these embeddings.]

Page 51

Joint Learning of Segmentation and Seq2seq Auto-encoder

• The learning criterion of the RNN encoder and decoder is the weighted sum of two terms:

• 1. minimizing the reconstruction error;

• 2. minimizing the number of segments, that is, the number of output embeddings.

The second term is necessary: if we only minimized reconstruction error, the RNN encoder would output an embedding at every time step.
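The trade-off can be made concrete with invented numbers (the weight `lam` and the error values are arbitrary): segmenting at every frame reconstructs perfectly but pays a large segment penalty, so a coarser segmentation wins under the combined criterion.

```python
# Weighted criterion: reconstruction error plus a penalty on segment count.
def criterion(recon_error, num_segments, lam=0.5):
    return recon_error + lam * num_segments

every_frame = criterion(recon_error=0.0, num_segments=100)  # degenerate case
coarse = criterion(recon_error=5.0, num_segments=8)         # word-like segments
```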

Page 52

Joint Learning of Segmentation and Seq2seq Auto-encoder

• The learning criterion of the RNN encoder and decoder is the weighted sum of two terms:

• 1. minimizing the reconstruction error;

• 2. minimizing the number of segments, that is, the number of output embeddings.

The decision of whether to output an embedding is discrete, so the whole network is not differentiable; the REINFORCE algorithm is used for training.

Page 53

Audio Word2Vector (v2)

• Audio word to vector with semantics

[Embedding space: semantically related words cluster together — "flower"/"tree", "dog"/"cat"/"cats", "walk"/"walked"/"run".]

Page 54

Audio Skip-gram

Skip-gram

[Diagram: from the one-hot vector of the current word w_t, a linear layer produces the semantic embedding, and another linear layer predicts the context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.]

Page 55

Audio Skip-gram

[Diagram: the one-hot input is replaced with the phonetic encoder's output for the audio segment w_t; a 2-hidden-layer network produces the semantic embedding, and another 2-hidden-layer network predicts the context segments w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.]

Page 56

Experiments

• Spearman's rank correlation scores of audio word2vec and text word2vec [Chen, et al., arXiv, 2018]

Word Pair Set   V1     V2
MEN             0.38   0.43
Mturk           0.37   0.47
RG65            0.12   0.16
RW              0.70   0.71
SimLex999       0.22   0.27
WS353           0.50   0.52
WS353R          0.47   0.53
WS353S          0.53   0.50

V1: phonetic information. V2: semantic information. Training corpus: LibriSpeech.
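Spearman's rank correlation is just Pearson correlation computed on ranks; a from-scratch version (no tie handling, and the similarity scores below are fabricated) shows the computation behind the table:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation for tie-free 1-D arrays."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)   # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

human = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # fabricated human similarity scores
model = np.array([0.1, 0.3, 0.2, 0.8, 0.9])   # fabricated embedding similarities
rho = spearman(human, model)                  # one swapped pair -> 0.9
```

Because only ranks matter, the metric rewards getting the ordering of word-pair similarities right, not their absolute values.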

Page 57

[Diagram: Text Word2Vec, trained on a large collection of text ("… I eat an apple. You eat an orange. …"), gives an embedding space where "I"/"you", "apple"/"orange", and "dog"/"cat"/"pig" cluster. Audio Word2Vec (v2) is trained on a large collection of audio ("I eat an apple", "you eat an orange").]

Could the two spaces be aligned by unsupervised conditional generation — that is, unsupervised speech recognition?

Page 58

Audio Word2Vec (v2)

[Diagram: a conditional generator maps vectors from the audio word2vec space into the text word2vec space (e.g. recovering the word "dog").]

Results versus the number of labeled pairs used:

Labeled Pairs   Top 1   Top 10   Top 100
0               0.00    0.00     0.02
1K              0.04    0.14     0.50
2K              0.11    0.45     0.76
5K              0.18    0.61     0.86

Page 59

Unsupervised Speech Recognition

[Diagram: phone-level acoustic pattern discovery turns utterances into discovered pattern sequences (p1 p3 p2, p1 p4 p3 p5 p5, p1 p5 p4 p3, p1 p2 p3 p4, …). Phoneme sequences from text (AY L AH V Y UW; G UH D B AY; HH AW AA R Y UW; T AY W AA N; AY M F AY N) form the target domain. A GAN learns the mapping between the two, e.g. "AY" = p1.]

[Liu, et al., arXiv, 2018]

Page 60

Unsupervised Speech Recognition

• Phoneme recognition (audio: TIMIT; text: WMT), using oracle phoneme boundaries.

[Figure: phoneme recognition accuracy of supervised training versus unsupervised training with WGAN-GP and with Gumbel-softmax.]

[Liu, et al., arXiv, 2018]

Page 61

Conclusion

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Towards Unsupervised Learning by GAN