Towards Machine Comprehension of Spoken Content


Transcript of Towards Machine Comprehension of Spoken Content

Page 1: Towards Machine Comprehension of Spoken Content (on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-6_General Speak…)

Can machines learn human language by themselves, without being taught? (機器能不能無師自通學習人類語言)

Hung-yi Lee

Page 2

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf

Page 3

Abstractive Summarization

• Machines can now do abstractive summarization (writing summaries in their own words).

[Diagram: the training data consists of documents paired with their titles (Title 1, Title 2, Title 3); the machine then generates titles in its own words, without hand-crafted rules.]

Page 4

Abstractive Summarization

• Input: transcriptions of audio from automatic speech recognition (ASR); output: summary

[Diagram: an RNN encoder reads through the input words w1 … w4, producing hidden states h1 … h4; an RNN generator then outputs the summary words z1, z2, …]

We need lots of labelled training data (supervised).
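As a rough sketch of this encoder/generator pipeline (not the talk's actual model — the vanilla-RNN weights, sizes, and word ids below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 12          # hidden size and toy vocabulary size (arbitrary)
W_xh = rng.normal(scale=0.1, size=(V, H))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
W_hy = rng.normal(scale=0.1, size=(H, V))   # hidden-to-output weights

def encode(word_ids):
    """RNN encoder: read through the input words w1..wT, return states h1..hT."""
    h = np.zeros(H)
    states = []
    for w in word_ids:
        h = np.tanh(W_xh[w] + h @ W_hh)
        states.append(h)
    return states

def generate(h, max_len=5):
    """Greedy RNN generator: emit the most likely word id at each step."""
    out = []
    for _ in range(max_len):
        w = int(np.argmax(h @ W_hy))
        out.append(w)
        h = np.tanh(W_xh[w] + h @ W_hh)
    return out

states = encode([3, 1, 4, 1])    # transcription words w1..w4 (fake ids)
summary = generate(states[-1])   # summary words z1, z2, ...
```

With supervised training, the generator's outputs would be fit to reference summaries — which is exactly why lots of labelled document-summary pairs are needed.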

Page 5

Unsupervised Abstractive Summarization

• Machines can now do abstractive summarization with seq2seq models (writing summaries in their own words)

[Diagram: a seq2seq model maps each training document to a summary (summary 1, summary 2, summary 3).]

Page 6

Unsupervised Abstractive Summarization

[Diagram: a generator G (seq2seq) maps a document to a word sequence — the candidate summary; a reconstructor R (seq2seq) maps that word sequence back to the document.]

Only a large collection of documents is needed to train the model.

This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation.

Problem: the latent word sequence is not readable …

Page 7

Unsupervised Abstractive Summarization

[Diagram: as before, a generator G (seq2seq) maps the document to a word sequence (the candidate summary), and a reconstructor R (seq2seq) maps it back to the document. A discriminator D, trained with human-written summaries, judges whether the word sequence is real or not.]

The generator learns to make the discriminator consider its output real, so the latent word sequence becomes a readable summary.

The REINFORCE algorithm is used (the sampled word sequence is discrete, so the model cannot be trained end-to-end by backpropagation).
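A minimal illustration of the REINFORCE idea on a fabricated one-parameter problem — a Bernoulli "policy" rewarded for emitting action 1, not the talk's summarization model. The point is that the gradient of the expected reward is estimated through log-probabilities, so no backpropagation through the discrete sample is needed:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Policy: p(a=1) = sigmoid(theta). Reward: 1 for action 1, else 0.
theta = 0.0
for _ in range(500):
    p = sigmoid(theta)
    a = rng.random() < p                  # sample a discrete action
    reward = 1.0 if a else 0.0
    grad_logp = (1.0 - p) if a else -p    # d log p(a) / d theta for a Bernoulli
    theta += 0.1 * reward * grad_logp     # ascend the estimated gradient of E[R]

p_final = sigmoid(theta)                  # probability of the rewarded action
```

Ascending `reward * grad_logp` pushes the policy toward the rewarded discrete choice, which is how the generator above can be trained against the discriminator's score.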

Page 8

Unsupervised Abstractive Summarization

• Document: "Australia today signed bilateral anti-doping agreements with 13 countries, aimed at strengthening out-of-competition drug testing and sharing research results …"

• Summary:

• Human: "Australia signs anti-doping agreements with 13 countries"

• Unsupervised: "Australia strengthens out-of-competition drug testing"

• Document: "The Republic of China Olympic Committee today received an invitation to the 1992 Winter Olympics; as chairman 張豐緒 is currently on a goodwill visit to Central and South America, it has not yet been decided whether to send a team …"

• Summary:

• Human: "Invited by letter to the 1992 Winter Olympics"

• Unsupervised: "Olympic committee receives Winter Olympics invitation letter"

(Thanks to 王耀賢 for providing the experimental results.)

Page 9

Unsupervised Abstractive Summarization

• Document: "Local media reported on the 27th that two provinces on Indonesia's Sumatra island have seen days of torrential rain; flooding and landslides had killed at least 60 people and left more than 100 missing as of the 26th …"

• Summary:

• Human: "Indonesian floods kill 60"

• Unsupervised: 印尼門洪水泛濫導致塌雨 (garbled — roughly "Indonesia 門 flooding causes collapse-rain")

• Document: "Hefei, Anhui Province recently set new rules for leading cadres visiting the grassroots: travel light with a small entourage; no welcome-and-send-off receptions and no layers of accompanying officials …"

• Summary:

• Human: "Hefei rules that cadres' grassroots visits be kept simple"

• Unsupervised: 合肥領導幹部下基層做搞迎來送往規定:一律簡 (garbled — roughly "Hefei leading cadres grassroots do welcome-send-off rules: all simple")

(Thanks to 王耀賢 for providing the experimental results.)

Page 10

Semi-supervised Learning

[Figure: ROUGE-1 score (y-axis, 25 to 34) versus number of document-summary pairs used (x-axis: 0, 10k, 500k), comparing WGAN, Reinforce, and fully supervised training; the supervised model uses matched data (3.8M pairs). Unpublished result.]

Page 11

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

Page 12

Sequence-to-sequence

Encoder Decoder

Input sentence c

output sentence x

Training data:

A: How are you ?

B: I’m good.

……

……

How are you ?

I’m good.

Seq2seq

Outputs: "Not bad" / "I'm John"

Training criterion: maximize likelihood — but the response humans judge better is not necessarily the one to which the model assigns higher likelihood.

Page 13

Reinforcement Learning

• The machine obtains feedback from the user.

• The chat-bot learns to maximize the expected reward.

[Dialogue examples: "How are you?" → "Bye bye ☺" receives reward -10; "Hello" → "Hi ☺" receives reward 3.]

Page 14

AlphaGo-style training!

• Let two agents talk to each other

[Two dialogues between the agents: (1) "How old are you?" → "See you." → "See you." → "See you." (degenerate); (2) "How old are you?" → "I am 16." → "I thought you were 12." → "What makes you think so?"]

Still need humans to provide reward

https://arxiv.org/pdf/1606.01541.pdf

Page 15

Generative Adversarial Network (GAN)

• Let two agents talk to each other

[Diagram: the two agents produce a dialogue ("How old are you?" → "See you." → "See you." → "See you."); a discriminator D, trained on human dialogues, judges whether the dialogue is real or not. This is a conditional GAN.]

https://arxiv.org/pdf/1701.06547.pdf

Page 16

Personalized Chat-bot

• General chat-bots generate plain responses.

• Humans talk in different styles and with different sentiments to different people in different conditions.

• We want the chat-bot's responses to be controllable, so that chat-bots can be personalized in the future.

• Here we focus only on generating positive responses.

Input: "How was your day today?"
Ordinary chat-bot: "It is terrible today."
Optimistic chat-bot: "It is wonderful today."

[Chih-Wei Lee, et al., ICASSP, 2018]

Page 17

Approaches

Type 1. System Modification: the chat-bot's encoder/decoder (En/De) parameters are modified so that the response sentence itself is positive.
(1. Persona-based approach; 2. Reinforcement learning)

Type 2. Output Transformation: the chat-bot does not have to change; a transformation module converts its response sentence into a positive response.
(3. Plug & Play; 4. Cycle GAN)

Page 18

Approaches

• 1. Persona-Based Model [Jiwei Li, et al., ACL, 2016]

Training: for the input "How is today" with reference response "Today is awesome", an off-the-shelf sentiment classifier scores how positive the reference is — here 0.9 — and that score conditions the model.

Page 19

Approaches

• 1. Persona-Based Model [Jiwei Li, et al., ACL, 2016]

Training: for the input "How is today" with reference response "Today is bad", the sentiment classifier scores the reference 0.1.
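A toy stand-in for this conditioning idea (everything here is fabricated: the "response" is reduced to a single valence number, and a one-weight linear model plays the role of the chat-bot). Feeding the sentiment score of the reference as an extra input during training turns the score into a control knob at test time:

```python
import numpy as np

# Fabricated training pairs: (sentiment score of reference, valence of response).
train = [(0.9, +1.0),   # reference "Today is awesome" scored 0.9 -> positive
         (0.1, -1.0)]   # reference "Today is bad"     scored 0.1 -> negative

w, b = 0.0, 0.0
for _ in range(2000):                 # plain SGD on squared error
    for s, y in train:
        err = w * s + b - y
        w -= 0.1 * err * s
        b -= 0.1 * err

def response_valence(score):
    """Predicted valence of the generated response given the control score."""
    return w * score + b
```

At test time, setting the score high (as on the next slide) steers the model toward positive responses.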

Page 20

Approaches

• 1. Persona-Based Model

Testing: input "I love you".
With the score set to 1.0 → response: "I love you, too."
With the score set to 0.0 → response: "I am not ready to start a relationship."

Page 21

Approaches

2. Reinforcement Learning

For the input "How is today", the response "Today is bad" is scored 0.1 by the sentiment classifier. A positive response earns a positive reward, and the network parameters are updated to maximize it.

Page 22

Approaches

3. Plug & Play (output transformation: the chat-bot does not change; its response sentence is transformed into a positive response)

[Diagram: a VRAE encoder maps a sentence (the chat-bot's response) to a latent code, and a VRAE decoder maps the code back to a sentence. To transform a response, its code is modified into a new code such that the sentiment classifier's score is as large as possible while the new code stays as close as possible to the original; decoding the new code gives the positive response.]

VRAE = Variational Recurrent Auto-encoder
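The plug-&-play step can be sketched on made-up numbers: nudge a latent code `z` by gradient ascent to raise a (here, linear and invented) sentiment score while a quadratic penalty keeps it close to the original code:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6
w = rng.normal(size=dim)     # stand-in sentiment-classifier weights (score = w @ z)
z0 = rng.normal(size=dim)    # latent code of the chat-bot's original response

def objective(z, lam=1.0):
    """Score as large as possible, code as close as possible to z0."""
    return w @ z - lam * np.sum((z - z0) ** 2)

z = z0.copy()
for _ in range(200):                 # gradient ascent on the code
    grad = w - 2.0 * (z - z0)
    z += 0.05 * grad
# z now scores higher than z0 while staying near it (fixed point: z0 + w/2);
# in the real model, decoding z would give the positive response.
```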

Page 23

Approaches

4. Cycle GAN (output transformation: the response sentence is transformed into a positive response)

One domain contains positive sentences ("It is good.", "It's a good day.", "I love you."), the other negative sentences ("It is bad.", "It's a bad day.", "I don't love you.") — e.g. collected from different speakers. No paired data is needed.

Page 24

Cycle GAN

[Diagram: generators G_X→Y and G_Y→X translate between the two domains; cycle consistency forces G_Y→X(G_X→Y(x)) to be as close as possible to x, and likewise in the other direction. Discriminators D_X and D_Y each output a scalar indicating whether a sentence belongs to the corresponding domain.]
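The cycle-consistency term can be shown on a deliberately tiny example (the adversarial losses of the full model are omitted, and the "translators" are just scalars): training only the round-trip error drives G_Y→X ∘ G_X→Y toward the identity.

```python
# Toy "domain translators": G_XY(x) = a*x, G_YX(y) = b*y.
# Minimizing the cycle loss (G_YX(G_XY(x)) - x)^2 for x = 1 drives a*b -> 1.
a, b = 0.5, 0.5
lr = 0.1
for _ in range(200):
    resid = a * b - 1.0            # round-trip error
    a -= lr * 2.0 * resid * b      # d/da (a*b - 1)^2
    b -= lr * 2.0 * resid * a      # d/db (a*b - 1)^2
round_trip = a * b                 # converges to ~1: the cycle reconstructs x
```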

Page 25

Cycle GAN

[Diagram: cycle examples — "It is bad." → "It is good." → back to "It is bad." (as close as possible to the input); "I love you." → "I hate you." → back to "I love you." The discriminators judge whether a sentence is a positive or a negative sentence.]

Page 26

Cycle GAN

[Same diagram as the previous page.]

Sentences are discrete — so the generators operate on word embeddings instead [Lee, et al., ICASSP, 2018].

Page 27

Cycle GAN

• Negative sentence to positive sentence:

it's a crappy day → it's a great day
i wish you could be here → you could be here
it's not a good idea → it's good idea
i miss you → i love you
i don't love you → i love you
i can't do that → i can do that
i feel so sad → i happy
it's a bad day → it's a good day
it's a dummy day → it's a great day
sorry for doing such a horrible thing → thanks for doing a great thing
my doggy is sick → my doggy is my doggy
my little doggy is sick → my little doggy is my little doggy

Page 28

Human Evaluation (objective evaluation shows the same trend)

Sentiment Grammar Coherence

Seq2seq (Baseline) 0.16 1.00 0.55

Persona-based 0.71 0.75 0.24

Reinforcement 0.70 0.93 0.35

Plug & Play 0.48 0.43 0.15

Cycle GAN 0.63 0.91 0.44

Sentiment: is the chat-bot's response positive? (Human annotators gave scores from 0 to 5, normalized to the range 0 to 1.)

Grammar: is the chat-bot's response grammatically correct?

Coherence: does the chat-bot output a suitable response?

Page 29

Human Evaluation (table repeated from the previous page; objective evaluation shows the same trend)

⚫ Modifying the chat-bot's parameters leads to more obviously positive responses.

⚫ Reinforcement learning and cycle GAN make fewer grammatical errors.

⚫ Cycle GAN has better semantic coherence.

Page 30

Example 1

• Input: yeah, i’m just less interested in the how-to and more in the what and why of the things we build

• Seq2seq: it’s not a good thing .

• Persona: I’m interested in this .

• RL: it’s a good one

• Plug and play: it’s not a wrong thing .

• Cycle GAN: it’s a good thing .

Page 31

Example 2

• Input: always learning and self improving is a huge turn on for me so believe me i understand

• Seq2seq: i'm not sure if i can see you in a relationship

• Persona: and you're gonna have a word with your mother .

• RL: i love you so much

• Plug and play: i'm not sure if i can see you a a special

• Cycle GAN: i'm sure if i can see you in a relationship.

Page 32

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

Page 33

Voice Conversion

Page 34

In the past — parallel data: Speaker A and Speaker B both say "How are you?" and "Good morning", and conversion is learned from the paired utterances.

Today — non-parallel data: Speaker A says 天氣真好 ("the weather is nice") and 再見囉 ("goodbye"), while Speaker B says "How are you?" and "Good morning".

Speakers A and B are talking about completely different things.

Page 35

Voice Conversion

• Multi-target VC [Chou et al., arXiv 2018]

[Diagram, Stage 1: an encoder Enc maps the input speech x to enc(x); a classifier C is trained adversarially on enc(x) so that speaker information is removed; a decoder Dec, conditioned on a speaker vector y, produces dec(enc(x), y) — or dec(enc(x), y′) for another speaker y′. Stage 2: a generator G refines the converted output, and a combined discriminator-and-classifier D+C judges fake/real (F/R) and speaker identity (ID) against real data.]

Page 36

Voice Conversion (Multi-target VC)

• Subjective evaluations

[Figure: preference test results.]

1. The proposed method uses non-parallel data.
2. The two-stage multi-target VC approach outperforms using stage 1 only.
3. The multi-target VC approach is comparable to Cycle-GAN-VC in naturalness and similarity.

Page 37

[Audio demo: conversion examples between Speaker A and Speaker B.]

(Thanks to 周儒杰 for providing the experimental results.)

Page 38

Unsupervised Learning

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Generative Adversarial Network (GAN)

Page 39

Audio Word2Vector

[Diagram: a model maps each word-level audio segment to a vector.]

The model is learned from lots of audio without annotation.

Page 40

Audio Word2Vector (v1)

• The audio segments corresponding to words with similar pronunciations are close to each other.

[Embedding space: the segments for "ever" cluster together, those for "never" together, and "dog"/"dogs" together.]

In the following discussion, assume that we already have the segmentation.

Page 41

Query-by-example Spoken Term Detection

A user speaks a query (e.g. "Trump"). The system computes the similarity between the spoken query and the audio files at the acoustic level, and finds where the query term (here "Trump") occurs in the spoken content.

Page 42

Query-by-example Spoken Term Detection

Off-line: the audio archive is divided into variable-length audio segments, and each segment is mapped to a vector by Audio Word to Vector.

On-line: the spoken query is mapped to a vector by the same model; vector similarity gives the search result.

Much faster than DTW
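The on-line step reduces to nearest-neighbor search over precomputed vectors — one similarity pass instead of a DTW alignment against every segment. A sketch with fabricated embeddings (in the talk these would come from audio word2vec):

```python
import numpy as np

rng = np.random.default_rng(0)
archive = rng.normal(size=(100, 16))   # off-line: 100 segment embeddings (fake)
query_idx = 42
# On-line: the spoken query, embedded by the same model (here: a noisy copy).
query = archive[query_idx] + 0.01 * rng.normal(size=16)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(query, seg) for seg in archive])
best = int(np.argmax(scores))          # segment where the query term occurs
```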

Page 43

Sequence-to-sequence Auto-encoder

[Diagram: an RNN encoder reads the acoustic features x1 x2 x3 x4 of an audio segment and compresses them into a single vector — the vector we want.]

We use a sequence-to-sequence auto-encoder here; the training is unsupervised.

Page 44

Sequence-to-sequence Auto-encoder

[Diagram: the RNN encoder reads the acoustic features x1 x2 x3 x4 of the audio segment; the RNN decoder then outputs y1 y2 y3 y4, trained to reconstruct the input acoustic features. The RNN encoder and decoder are jointly trained.]

Page 45

Original seq2seq auto-encoder: an RNN encoder maps the input segment to a vector, and an RNN decoder reconstructs the segment. The vector mixes together phonetic information, speaker information, etc.

Feature Disentangle: instead, a phonetic encoder produces z and a speaker encoder produces e; the RNN decoder reconstructs the input segment from (z, e).

Page 46

Feature Disentangle

[Diagram: the phonetic encoder (z) and speaker encoder (e) feed the RNN decoder, which reconstructs the input segment.]

Training the speaker encoder: for two segments x_i and x_j from the same speaker, the embeddings e_i and e_j are made as close as possible; for segments from different speakers, their distance must be larger than a threshold. (Assume the speaker ID of each segment is known.)
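This same-speaker / different-speaker objective has the shape of a contrastive loss; the exact form below (squared distance vs. squared hinge) is an assumption for illustration, not necessarily the talk's formula:

```python
import numpy as np

def speaker_loss(e_i, e_j, same_speaker, threshold=1.0):
    """Contrastive-style criterion for the speaker encoder (toy sketch):
    pull same-speaker embeddings together, push different-speaker
    embeddings at least `threshold` apart."""
    d = np.linalg.norm(e_i - e_j)
    if same_speaker:
        return d ** 2                      # as close as possible
    return max(0.0, threshold - d) ** 2    # no penalty once far enough

a = np.array([0.0, 0.0])
b = np.array([0.1, 0.0])   # near a
c = np.array([2.0, 0.0])   # far from a
```

Note the asymmetry: different-speaker pairs stop contributing once they pass the threshold, so the embedding space is not pushed apart without bound.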

Page 47

Feature Disentangle

[Diagram: as before, the phonetic encoder (z) and speaker encoder (e) feed the RNN decoder, which reconstructs the input segment.]

Training the phonetic encoder: a speaker classifier takes the phonetic embeddings z_i and z_j of two segments and scores whether they come from the same speaker or different speakers. The phonetic encoder learns to confuse the speaker classifier, so that z carries no speaker information. (Inspired by domain adversarial training.)
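Domain adversarial training is usually implemented with a gradient reversal layer; a bare-bones sketch of that trick (autograd frameworks implement it as a custom op):

```python
import numpy as np

# Gradient reversal layer: forward pass is the identity, backward pass flips
# the gradient and scales it by lam. Placed between the phonetic encoder and
# the speaker classifier, minimizing the classifier's loss then *maximizes*
# it with respect to the encoder -- i.e., the encoder learns to confuse it.
def grad_reverse_forward(z):
    return z

def grad_reverse_backward(grad_output, lam=1.0):
    return -lam * grad_output

g = np.array([0.5, -2.0])   # example upstream gradient
```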

Page 48

[Visualization: audio segments of two different speakers in the disentangled embedding space.]

Page 49

Joint Learning of Segmentation and Seq2seq Auto-encoder

• At each time step, the RNN encoder determines whether it is right before a segment boundary.

• If it is, the encoder outputs a vector — an embedding for the audio segment x that just ended.

Where to segment is determined automatically.

Page 50

Joint Learning of Segmentation and Seq2seq Auto-encoder

[Diagram: the RNN encoder outputs an embedding for each detected segment x; the RNN decoder reconstructs the input utterance from these embeddings.]

Page 51

Joint Learning of Segmentation and Seq2seq Auto-encoder

• The learning criterion of the RNN encoder and decoder is the weighted sum of two terms:

• 1. minimizing the reconstruction error;

• 2. minimizing the number of segments, that is, the number of output embeddings.

The second term is necessary: if we only minimized reconstruction error, the RNN encoder would output an embedding at every time step.
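The trade-off can be made concrete with invented numbers (the weight `lam` and the error values are arbitrary): segmenting at every frame reconstructs perfectly but pays a large segment penalty, so a coarser segmentation wins under the combined criterion.

```python
# Weighted criterion: reconstruction error plus a penalty on segment count.
def criterion(recon_error, num_segments, lam=0.5):
    return recon_error + lam * num_segments

every_frame = criterion(recon_error=0.0, num_segments=100)  # degenerate case
coarse = criterion(recon_error=5.0, num_segments=8)         # word-like segments
```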

Page 52

Joint Learning of Segmentation and Seq2seq Auto-encoder

• The learning criterion of the RNN encoder and decoder is the weighted sum of two terms:

• 1. minimizing the reconstruction error;

• 2. minimizing the number of segments, that is, the number of output embeddings.

The decision of whether to output an embedding is discrete, so the whole network is not differentiable; the REINFORCE algorithm is used for training.

Page 53

Audio Word2Vector (v2)

• Audio word to vector with semantics

[Embedding space: semantically related words cluster together — "flower"/"tree", "dog"/"cat"/"cats", "walk"/"walked"/"run".]

Page 54

Audio Skip-gram

Skip-gram

[Diagram: from the one-hot vector of the current word w_t, a linear layer produces the semantic embedding, and another linear layer predicts the context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.]

Page 55

Audio Skip-gram

[Diagram: the one-hot input is replaced with the phonetic encoder's output for the audio segment w_t; a 2-hidden-layer network produces the semantic embedding, and another 2-hidden-layer network predicts the context segments w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.]

Page 56

Experiments

• Spearman's rank correlation scores of audio word2vec and text word2vec [Chen, et al., arXiv, 2018]

Word Pair Set   V1     V2
MEN             0.38   0.43
Mturk           0.37   0.47
RG65            0.12   0.16
RW              0.70   0.71
SimLex999       0.22   0.27
WS353           0.50   0.52
WS353R          0.47   0.53
WS353S          0.53   0.50

V1: phonetic information. V2: semantic information. Training corpus: LibriSpeech.
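Spearman's rank correlation is just Pearson correlation computed on ranks; a from-scratch version (no tie handling, and the similarity scores below are fabricated) shows the computation behind the table:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation for tie-free 1-D arrays."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)   # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

human = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # fabricated human similarity scores
model = np.array([0.1, 0.3, 0.2, 0.8, 0.9])   # fabricated embedding similarities
rho = spearman(human, model)                  # one swapped pair -> 0.9
```

Because only ranks matter, the metric rewards getting the ordering of word-pair similarities right, not their absolute values.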

Page 57

[Diagram: Text Word2Vec, trained on a large collection of text ("… I eat an apple. You eat an orange. …"), gives an embedding space where "I"/"you", "apple"/"orange", and "dog"/"cat"/"pig" cluster. Audio Word2Vec (v2) is trained on a large collection of audio ("I eat an apple", "you eat an orange").]

Could the two spaces be aligned by unsupervised conditional generation — that is, unsupervised speech recognition?

Page 58

Audio Word2Vec (v2)

[Diagram: a conditional generator maps vectors from the audio word2vec space into the text word2vec space (e.g. recovering the word "dog").]

Results versus the number of labeled pairs used:

Labeled Pairs   Top 1   Top 10   Top 100
0               0.00    0.00     0.02
1K              0.04    0.14     0.50
2K              0.11    0.45     0.76
5K              0.18    0.61     0.86

Page 59

Unsupervised Speech Recognition

[Diagram: phone-level acoustic pattern discovery turns utterances into discovered pattern sequences (p1 p3 p2, p1 p4 p3 p5 p5, p1 p5 p4 p3, p1 p2 p3 p4, …). Phoneme sequences from text (AY L AH V Y UW; G UH D B AY; HH AW AA R Y UW; T AY W AA N; AY M F AY N) form the target domain. A GAN learns the mapping between the two, e.g. "AY" = p1.]

[Liu, et al., arXiv, 2018]

Page 60

Unsupervised Speech Recognition

• Phoneme recognition (audio: TIMIT; text: WMT), using oracle phoneme boundaries.

[Figure: phoneme recognition accuracy of supervised training versus unsupervised training with WGAN-GP and with Gumbel-softmax.]

[Liu, et al., arXiv, 2018]

Page 61

Conclusion

Abstractive Summarization

Chat-bot

Voice Conversion

Audio word2vec & Speech Recognition

Towards Unsupervised Learning by GAN