Ph.D defence (Shinnosuke Takamichi)

46
2015©Shinnosuke TAKAMICHI 12/22/2015 Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis (高音質な統計的パラメトリック音声合成のための 音響モデリング法と音声パラメータ生成法) Nara Institute of Science and Technology Shinnosuke Takamichi Ph.D defense

Transcript of Ph.D defence (Shinnosuke Takamichi)

Page 1: Ph.D defence (Shinnosuke Takamichi)

2015©Shinnosuke TAKAMICHI

12/22/2015

Acoustic modeling and speech parameter generation

for high-quality statistical parametric speech synthesis (高音質な統計的パラメトリック音声合成のための

音響モデリング法と音声パラメータ生成法)

Nara Institute of Science and Technology

Shinnosuke Takamichi

Ph.D defense

Page 2: Ph.D defence (Shinnosuke Takamichi)

/46

Research target

2

Speech

Page 3: Ph.D defence (Shinnosuke Takamichi)

/46

Speech synthesis and its benefits

Speech synthesis: a method to synthesize speech by a computer

– Text-To-Speech (TTS) [Sagisaka et al., 1988.]

– Voice Conversion (VC) [Stylianou et al., 1988.]

What is required?

– Flexible control of voice beyond ability of one human

– High-quality speech generation like human

3

Text TTS

VC

Page 4: Ph.D defence (Shinnosuke Takamichi)

/46

Statistical parametric speech synthesis

Statistical parametric speech synthesis [Zen et al., 2009.]

– Statistical modeling of relationship between input/output

– Better flexibility than unit selection synthesis [Iwahashi et al., 1993.]

HMM-based TTS & GMM-based VC* [Tokuda et al., 2013.] [Toda et

al., 2007.]

– Mathematical support of the flexibility

– Application from other research areas

– But…

4 *HMM: Hidden Markov Model, GMM: Gaussian Mixture Model

Page 5: Ph.D defence (Shinnosuke Takamichi)

/46

Natural speech vs. synthetic speech

in speech quality

5

Natural speech

spoken by human

Synthetic speech

of HMM-based TTS & GMM-based VC

Why?

Page 6: Ph.D defence (Shinnosuke Takamichi)

/46

Problem definition and rest of this talk

6

Text

Parameteri-

zation error

Insufficient

modeling

Over-

smoothing

Parameteri-

zation error

Text

analysis

Speech

analysis

Speech

parameter

generation

Waveform

synthesis

Acoustic

Modeling

Approaches

in this thesis

Modeling of individual

speech segment

Chapter 3

Modulation spectrum

for over-smoothing

Chapter 4 Chapter 5

Chapter 2

Page 7: Ph.D defence (Shinnosuke Takamichi)

/46

Speech synthesis

Analysis Generation Synthesis Modeling

Modeling of individual

speech segment

Modulation spectrum

for over-smoothing

Chapter 4 Chapter 5

Text

Chapter 2

Chapter 3

Page 8: Ph.D defence (Shinnosuke Takamichi)

/46

2 approaches to speech synthesis

Unit selection synthesis [Iwahashi et al., 1993.]

– High quality but low flexibility

8

Pre-recorded speech database Synthetic speech

Segment selection

Text Text

analysis

Speech

analysis

Param.

Gen.

Wave.

synthesis

Acoustic

modeling

Statistical parametric speech synthesis [Zen et al., 2009.]

– High flexibility but low quality

Page 9: Ph.D defence (Shinnosuke Takamichi)

/46

Text/speech analysis

and waveform synthesis

Text analysis (e.g., [Sagisaka et al., 1990.])

Speech analysis (e.g., [Kawahara et al., 1999.])

9 ana gen syn model

j i

あ ら ゆ る 現 実 を・・・ Sentence

Accent phrase

a ts r a y u r u g e n u o Phoneme

Low High

Pow

er

Frequency

Fourier transform

& pow.

Envelope = spectral parameters

Periodicity in detail = Pitch (F0)

Page 10: Ph.D defence (Shinnosuke Takamichi)

/46

Acoustic modeling in HMM-based TTS

10

𝝀 = argmax 𝑃 𝒀|𝑿, 𝝀

ML training of HMM parameter sets 𝝀

ana gen syn model

“Hello”

Speech

analysis

Text

analysis

Context labels

𝒀 Speech features

Time

sil-h+e h-e+l e-l+o

HMM 𝝀

Context-tied

Gaussian distribution

𝑁 ⋅; 𝝁, 𝚺

e-l+o a-l+o

o-l+o

𝑿

[Zen et al., 2007.]

Page 11: Ph.D defence (Shinnosuke Takamichi)

/46

Acoustic modeling in GMM-based VC

ML training of GMM parameter sets 𝝀

11

𝝀 = argmax 𝑃 𝒀𝑡 , 𝑿𝑡|𝝀

𝑿 Speech features

𝒀 Speech features

Speech

analysis

Speech

analysis

ana gen syn model

GMM 𝝀

𝑿𝑡

𝒀𝑡

𝑁 ⋅; 𝝁, 𝚺 𝑿𝑡

𝒀𝑡

Joint vector

at time 𝑡

[Stylianou et al., 1988.]

Page 12: Ph.D defence (Shinnosuke Takamichi)

/46

Probability to generate features

in HMM-based TTS

12

Text

analysis

HMM parameter sets 𝝀

“Hello”

𝑿

ana gen syn model

Probability to generate the synthetic speech features 𝒀

𝑃 𝒀|𝑿, 𝒒 , 𝝀 = 𝑁 𝒀;𝑬𝒒 , 𝑫𝒒

“h”

“o”

𝝁1

𝝁2

𝝁𝑇

𝝁𝑡

𝑬𝒒

𝜮1−1

𝜮2−1

𝜮𝑇−1

𝜮𝑡−1

𝑫𝒒 −1

Mean vector Covariance matrix

𝒒

[Tokuda et al., 2000.]

Page 13: Ph.D defence (Shinnosuke Takamichi)

/46

Probability to generate features

in GMM-based VC

13

Probability to generate the synthetic speech features 𝒀

𝑃 𝒀|𝑿, 𝒒 , 𝝀 = 𝑁 𝒀;𝑬𝒒 , 𝑫𝒒

Speech

analysis

GMM parameter sets 𝝀

𝑿

𝝁1

𝝁2

𝝁𝑇

𝝁𝑡

𝑬𝒒

𝜮1−1

𝜮2−1

𝜮𝑇−1

𝜮𝑡−1

𝑫𝒒 −1

Mean vector Covariance matrix

𝒒

[Toda et al., 2007.]

ana gen syn model

Page 14: Ph.D defence (Shinnosuke Takamichi)

/46

Speech parameter generation

ML generation of synthetic speech parameters 𝒚 𝒒

– Computationally-efficient generation (solved in a closed form)

14

𝒚 𝒒 = argmax 𝑃 𝒀|𝑿, 𝒒 , 𝝀 = argmax 𝑃 𝒚, Δ𝒚|𝑿, 𝒒 , 𝝀

Time

Sta

tic 𝒚

Tem

pora

l

de

lta

Δ𝒚

𝒚 𝒒

Δ𝒚 𝒒

Mean and variance

[Tokuda et al., 2000.]

ana gen syn model

Page 15: Ph.D defence (Shinnosuke Takamichi)

/46

Statistical sample-based speech synthesis

Analysis Generation Synthesis Modeling

Modeling of individual

speech segment

Modulation spectrum

for over-smoothing

Chapter 3 Chapter 4 Chapter 5

Text

Chapter 2

Page 16: Ph.D defence (Shinnosuke Takamichi)

/46

Quality degradation

by acoustic modeling

16

Averaging across input features

Context-tied Gaussian in HMM-based TTS

→ Robust to the unseen context

→ Loses info. of individual speech parameters.

𝑁 ⋅; 𝝁, 𝚺

e-l+o a-l+o

o-l+o

Proposed approach

– Models individual speech parameters while keeping robustness.

– Select one model in parameter generation.

– → Able to alleviate the quality degradation caused by averaging

Page 17: Ph.D defence (Shinnosuke Takamichi)

/46

Acoustic modeling

of the proposed method

17

From the tied model to Rich context-GMM (R-GMM)

𝑁 ⋅; 𝝁, 𝚺

e-l+o a-l+o

o-l+o

Rich context models [Yan et al., 2009.] Less-averaged models having robustness

R-GMM Model that is formed as the same as the

conventional tied model

Update the mean while tying the covariance.

Gathers them with the same mixture weights.

Page 18: Ph.D defence (Shinnosuke Takamichi)

/46

Speech parameter generation

from R-GMMs

18

ML generation of synthetic speech parameters 𝒚 𝒒

– Iterative generation with the explicit model selection*

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝒎 ,𝑿 𝑃 𝒎 |𝒚, Δ𝒚, 𝑿

Mean

± variance

Time Time

Sta

tic fe

atu

re 𝒚

𝒎

Tied model R-GMM

∗ 𝝀 (HMM/GMM parameter sets) is omitted.

Page 19: Ph.D defence (Shinnosuke Takamichi)

/46

Discussion

Initialization of the parameter generation (Sec. 3.5)

– Uses speech parameters from the over-trained statistics.

– → Avoids averaging by initialization, and alleviating over-training by

parameter generation.

Comparison to unit selection synthesis (Sec. 2.2)

– The model selection corresponds to the waveform segment selection.

– → Integrates unit selection in the statistical modeling.

Comparison to conventional hybrid methods

– Able to apply voice controlling methods, e.g., [Yamagishi et al., 2007.].

– → Better flexibility than [Yan et al., 2009.][Ling et al., 2007.] (Sec. 2.8)

19

Page 20: Ph.D defence (Shinnosuke Takamichi)

/46

Subjective evaluation

(preference test on speech quality)

20

Pre

fere

nce s

core

0.0

0.2

0.4

0.6

0.8

1.0 HMM-based TTS

Spectrum H H R R T

F0 H R H R T

H/G: HMM/GMM (= tied model), R: R-GMM, T: Target (``R’’ using reference)

95% conf. interval

GMM-based VC

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

G R T

Page 21: Ph.D defence (Shinnosuke Takamichi)

/46

Modulation spectrum-based post-filter

Analysis Generation Synthesis Modeling

Modeling of individual

speech segment

Modulation spectrum

for over-smoothing

Chapter 4 Chapter 5

Text

Chapter 2

Chapter 3

Page 22: Ph.D defence (Shinnosuke Takamichi)

/46

Over-smoothing

in parameter generation

22

Time Natural speech parameters

Time

Synthetic speech parameters

Speech

parameter

generation

Acoustic

modeling

Page 23: Ph.D defence (Shinnosuke Takamichi)

/46

Revisits speech parameter generation (Sec. 2.6)

ML generation of synthetic speech parameters 𝒚 𝒒 *

23

Time

Sp

ectr

al p

ara

me

ter

Natural

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿

𝑿: input features

𝝀 (HMM/GMM parameter sets) is omitted.

[Tokuda et al., 2000.]

HMM

Page 24: Ph.D defence (Shinnosuke Takamichi)

/46

Global Variance (GV) and

parameter generation w/ GV

ML generation with GV constraint

24

Time

Natural

HMM HMM+GV

Sp

ectr

al p

ara

me

ter

𝒗(𝒚)

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿 𝑃 𝒗 𝒚𝜔

𝒗 𝒚 : GV (= 2nd moment), 𝜔: weight of the GV term

[Toda et al., 2007.]

Page 25: Ph.D defence (Shinnosuke Takamichi)

Something is still different between them...

→ What is it?

Page 26: Ph.D defence (Shinnosuke Takamichi)

/46

Modulation Spectrum (MS) definition

MS: power spectrum of the sequence

– Represents temporal fluctuation. [Atlas et al., 2003.]

– Segment features in speech recognition [Thomas et al., 2009.]

– Captures speech intelligibility. [Drullman et al., 1994.]

26

2nd

moment

DFT

& pow.

GV (scalar)

MS (vector)

Time

Sp

ee

ch

pa

ram

ete

r

DFT: Discrete Fourier Transform

Page 27: Ph.D defence (Shinnosuke Takamichi)

/46

HMM

natural

Modulation frequency

Mo

du

latio

n s

pe

ctr

um

Speech quality will be improved by filling this gap!

Example of the MS

HMM+GV

27

Page 28: Ph.D defence (Shinnosuke Takamichi)

/46

Post-filtering process

28

Training data

Speech param. MS Statistics

(Gaussian)

HMMs

Filtering

Training

Synthesis

Post-filtering in the MS domain

– Linear conversion (interpolation) using 2 Gaussian distributions

Page 29: Ph.D defence (Shinnosuke Takamichi)

/46

Filtered speech parameter sequence

29

Time

HMM

HMM+GV

natural

HMM → post-filter

Sp

ectr

al p

ara

me

ter

Generate fluctuating speech parameters by the post-filtering!

Page 30: Ph.D defence (Shinnosuke Takamichi)

/46

Discussion 1: What is the MS?

30

Speech

parameter GV (temporal power)

Freq. 1

Freq. 2

Freq. 𝐷s

+

+ …

=

MS (frequency power)

Sum of MSs = GV

Fourier transform

Time

Page 31: Ph.D defence (Shinnosuke Takamichi)

/46

Discussion 2

Why post-filter?

– Independent on the original speech synthesis process

– → High portability and high quality

Further application

– For spectrum, F0 (non-continuous), duration (unactual param.)

– Segment-level filter (faster process)

Advantages compared to the conventional post-filters

– Automatic design/tuning [Eyben et al., 2014.][Yoshimura et al., 1999.]

31

Page 32: Ph.D defence (Shinnosuke Takamichi)

/46

Subjective evaluation

(preference test on speech quality)

32

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Spectrum

in HMM-based TTS

Spectrum

in GMM-based VC

HMM GMM

+GV

HMM

+GV

post-filtering

Page 33: Ph.D defence (Shinnosuke Takamichi)

/46

Speech synthesis integrating modulation spectrum

Analysis Generation Synthesis Modeling

Modeling of individual

speech segment

Modulation spectrum

for over-smoothing

Chapter 5

Text

Chapter 2

Chapter 3 Chapter 4

Page 34: Ph.D defence (Shinnosuke Takamichi)

/46

Problems of the MS-based post-filter

MS-based post-filter

– External process for MS emphasis

– → Causes over-emphasis ignoring speech synthesis criteria.

– → Difficult to utilize flexibility that HMM/GMMs have

34

Approaches: Joint optimization using HMM/GMMs and MS

– Integrate MS statistics as the one of the acoustic models.

– Speech parameter generation with MS … high-quality

– Acoustic model training with MS … high-quality and fast

Page 35: Ph.D defence (Shinnosuke Takamichi)

/46

Speech parameter generation

considering the MS

35

ML generation with MS constraint

𝒚 𝒒 = argmax 𝑃 𝒚, Δ𝒚|𝑿 𝑃 𝒔 𝒚𝜔

𝒔 𝒚 : MS (= power spectrum), 𝜔: weight of the MS term

𝑬𝒒 𝑫𝒒

𝑃 𝒚, Δ𝒚|𝑿 = 𝑁 𝒚, Δ𝒚 ; 𝑬𝒒 , 𝑫𝒒 𝑃 𝒔 𝒚 = 𝑁 𝒔 𝒚 ; 𝝁s, 𝚺𝐬

Modulation freq. M

S

Natural

Quadratic function of 𝒚

Page 36: Ph.D defence (Shinnosuke Takamichi)

/46

Discussion

(comparison to MS-based post-filter)

Initialization

– Basic ML generation (``HMM’’) → MS-based post-filter

– → Part optimization by initialization, and joint optimization by iteration

36

HMM → post-filter

HMM

Time

Spectr

al para

mete

r

HMM+MS

Page 37: Ph.D defence (Shinnosuke Takamichi)

/46

Effect in the MS

37

HMM

HMM+GV

natural HMM+MS

Modulation frequency

Mo

du

latio

n s

pe

ctr

um

Fills the gap by the proposed generation algorithm!

Page 38: Ph.D defence (Shinnosuke Takamichi)

/46

Effect in the GV

38

Index of speech parameters

Log G

V

HMM HMM+GV

natural

Recovers the GV w/o considering the GV!

HMM+MS

Page 39: Ph.D defence (Shinnosuke Takamichi)

/46

Subjective evaluation

(preference test on speech quality)

39

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0 GMM-based VC

HMM

+MS

HMM

+GV

GMM

+MS

GMM

+GV

HMM-based TTS

*+GV Parameter generation w/ GV (Sec. 2.9)

*+MS Parameter generation w/ MS

Page 40: Ph.D defence (Shinnosuke Takamichi)

/46

Problems of parameter generation

and MS-constrained training

40

Speech parameter generation considering the MS

– Iterative process in synthesis

– → Computationally-inefficient speech synthesis

𝝀 = argmax 𝑃 𝒚|𝑿 𝑃 𝒔 𝒚𝜔

𝑃 𝒚|𝑿 = 𝑁 𝒚; 𝒚 𝒒 , 𝜮 : Trajectory likelihood (Sec. 2.8)

𝑃 𝒔 𝒚 = 𝑁 𝒔 𝒚 ; 𝒔 𝒚 𝒒 , 𝜮𝐬 : MS likelihood

Minimizes difference between 𝒚 and 𝒚 𝒒 .

Minimizes difference between 𝒔 𝒚 and 𝒔 𝒚 𝒒 .

Acoustic model training constrained with MS

– Train HMMs/GMMs 𝝀 to generate param. 𝒚 𝒒 having natural MS

Page 41: Ph.D defence (Shinnosuke Takamichi)

/46

Trained HMM parameters

41

Basic training (Sec. 2.4-5)

Trajectory training (Sec. 2.8)

Time

De

lta

fe

atu

re

Updates HMM/GMM param. to generate fluctuating param.!

MS-constrained training

Page 42: Ph.D defence (Shinnosuke Takamichi)

/46

Discussion

Computational efficiency in parameter generation

– Basic generation algorithm (Sec. 2.6) can be used without MS.

– → Not only high-quality but also computationally-efficient

Which is better in quality, proposed param. gen. or training?

– Structures of HMMs/GMMs have limitation for recovering MS.

– → The parameter generation considering the MS is better.

42

Portability Quality Computation time

Post-filter Best! (no depend-ency on models)

Better Better (120 ms)

Param. gen. Better Best!(optimization in synthesis)

Worse (1 min~)

Training Worse Better Best! (5 ms)

Page 43: Ph.D defence (Shinnosuke Takamichi)

/46

Subjective evaluation

(preference test on speech quality)

43

Pre

fere

nce

sco

re

0.0

0.2

0.4

0.6

0.8

1.0

Pre

fere

nce

sco

re 0.0

0.2

0.4

0.6

0.8

1.0 HMM-based TTS GMM-based VC

HMM TRJ GV MS-

TRJ

GMM TRJ GV MS-

TRJ

HMM/GMM Basic HMM/GMM training (Sec. 2.4-5)

TRJ Trajectory HMM training (Sec. 2.8)

GV GV-constrained training (Sec. 2.9)

MS-TRJ MS-constrained trajectory training

Page 44: Ph.D defence (Shinnosuke Takamichi)

/46

Conclusion

Page 45: Ph.D defence (Shinnosuke Takamichi)

/46

Conclusion

Problem in this thesis

– Quality degradation in synthetic speech, which is caused by

parameterization error, insufficient modeling, and over-smoothing

Chapter 3: statistical parametric speech synthesis

– Addresses the insufficiency in the acoustic modeling.

– Models the individual speech parameter with rich context models.

Chapter 4 & 5: approaches using Modulation Spectrum (MS)

– Addresses the over-smoothing in the parameter generation.

– 1. MS-based post-filter: high portability

– 2. Parameter generation w/ MS: highest quality

– 3. MS-constrained training: computationally-efficient generation

45

Page 46: Ph.D defence (Shinnosuke Takamichi)

/46

Future work

Improvements of rich context modeling

– Quality degradation even if the best models are selected. (Sec. A.5)

Theoretical analysis of MS

– Why is the speech quality improved by the MS?

MS for DNN-based speech synthesis

– More flexible structures to integrate the MS

GPU implementation of the proposed methods

– Rich-context-model selection & param. generation with the MS

46