The past, present and future of singing synthesis

Kanru Hua (華侃如) June 19, 2016 The Past, Present and Future of Singing Voice Modeling


Page 1: The past, present and future of singing synthesis

Kanru Hua (華侃如), June 19, 2016

The Past, Present and Future of Singing Voice Modeling

Page 2: The past, present and future of singing synthesis

Motivation

“You are making too many assumptions, this thing won’t work on real speech signal.”

— Jont B. Allen

● What’s wrong with current and past research in this area?

● What’s our next step?

Page 3: The past, present and future of singing synthesis

What’s in a Speech/Singing Synthesizer

Text / Music Score → Parameter Generator → Vocoder → Speech Audio

● Parameter Generator: generates pitch, duration, spectrum, ... from the input

● Vocoder: generates the waveform from the parameters
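The two stages on this page can be sketched in a few lines of Python. This is only a toy illustration, not any real synthesizer: the function names, the frame rate, the MIDI-note score format and the fixed 1/k harmonic amplitudes (standing in for a spectrum) are all assumptions made for the example.

```python
import math

def generate_parameters(score, frame_rate=200):
    """Toy parameter generator: (MIDI note, seconds) pairs -> per-frame f0 track in Hz."""
    f0_track = []
    for midi_note, seconds in score:
        # Equal-tempered pitch: MIDI note 69 is A4 = 440 Hz.
        f0 = 440.0 * 2.0 ** ((midi_note - 69) / 12.0)
        f0_track.extend([f0] * int(round(seconds * frame_rate)))
    return f0_track

def vocode(f0_track, frame_rate=200, sample_rate=8000, n_harmonics=5):
    """Toy vocoder: render the f0 track as a sum of harmonics with 1/k amplitudes."""
    hop = sample_rate // frame_rate  # samples per parameter frame
    phase = 0.0
    samples = []
    for f0 in f0_track:
        for _ in range(hop):
            # Accumulate phase so the waveform stays continuous across frames.
            phase += 2.0 * math.pi * f0 / sample_rate
            samples.append(sum(math.sin(k * phase) / k
                               for k in range(1, n_harmonics + 1)))
    return samples

score = [(69, 0.5), (72, 0.5)]          # hypothetical score: A4 then C5
f0_track = generate_parameters(score)   # stage 1: score -> parameters
audio = vocode(f0_track)                # stage 2: parameters -> waveform
```

The point of the split is that the vocoder never sees the score; it only consumes the parameter track, so either stage can be replaced independently.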

Page 4: The past, present and future of singing synthesis

Part 1: History of Speech Analysis/Synthesis

(http://clas.mq.edu.au/speech/synthesis/history_synthesis/)

Page 5: The past, present and future of singing synthesis

History of Math & Acoustics

Timeline (1600 to 2000):

● 1600s: Newton. Laws of force and motion, foundations of calculus.

● 1700s: Bernoulli, Euler, d'Alembert. Wave equation, complex numbers.

● 1800s: Gauss, Fourier, Laplace, Riemann, Cauchy, Kirchhoff, Heaviside. Fourier/Laplace transform, analog circuits & electromagnetism.

● 1900s: Filtering theory, digital systems, sampling theory, ...

(http://www2.ling.su.se/staff/hartmut/kemplne.htm)

Page 6: The past, present and future of singing synthesis

History of Math & Acoustics

Frequency Response

Page 7: The past, present and future of singing synthesis

Source-Filter Model

Lung → Vocal Folds → Vocal Tract → Lip

(Plot axes: t, f, f)

Signal Generator (Source) → Filter 1 → Filter 2

Signal Generator → Filter 0 → Filter 1 → Filter 2
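A minimal sketch of the "signal generator followed by filters" chain, assuming an impulse train as the source, a single two-pole resonator as Filter 1 (vocal tract) and a first difference as Filter 2 (lip). The specific filters, the 700 Hz resonance and the 100 Hz bandwidth are illustrative choices, not values taken from the slides.

```python
import math

def impulse_train(f0, sample_rate, n_samples):
    """Source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, freq, bandwidth, sample_rate):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / sample_rate)   # pole radius
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sample_rate)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for v in x:
        out = v + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def first_difference(x):
    """y[n] = x[n] - x[n-1], used here as a simple stand-in for the lip filter."""
    y, prev = [], 0.0
    for v in x:
        y.append(v - prev)
        prev = v
    return y

sample_rate = 8000
source = impulse_train(100.0, sample_rate, 800)              # signal generator
tract_out = resonator(source, 700.0, 100.0, sample_rate)     # Filter 1
speech = first_difference(tract_out)                         # Filter 2

# Both filters are linear and time-invariant, so swapping their order
# gives the same output up to rounding error.
swapped = resonator(first_difference(source), 700.0, 100.0, sample_rate)
```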

Page 8: The past, present and future of singing synthesis

20th Century, the Dawn of Speech Processing

Cooley and Tukey (1965): Fast Fourier Transform

Oppenheim (1969): one of the earliest digital implementations of speech analysis/synthesis

[Diagram labels: Input; Analysis; Pitch (source); Cepstrum (vocal tract filter); Synthesis; Spectrum; Output]
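The property that makes cepstral (homomorphic) analysis work is that convolution in time becomes addition in the cepstrum, so source and vocal tract contributions can in principle be pulled apart. The sketch below checks this numerically with a naive DFT. The two short sequences are toy values chosen so that neither spectrum has a zero; the names `excitation` and `tract` are illustrative only.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum (assumes no spectral zeros)."""
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def circular_convolve(a, b):
    """Circular convolution, which multiplies the two DFTs bin by bin."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

n = 16
excitation = [1.0, 0.5] + [0.0] * (n - 2)        # toy "source" sequence
tract = [1.0, -0.3, 0.1] + [0.0] * (n - 3)       # toy "vocal tract" sequence
signal = circular_convolve(excitation, tract)

c_signal = real_cepstrum(signal)
c_sum = [a + b for a, b in zip(real_cepstrum(excitation), real_cepstrum(tract))]
max_error = max(abs(a - b) for a, b in zip(c_signal, c_sum))
```

Here `max_error` is at rounding-error level: the cepstrum of the convolved signal equals the sum of the two individual cepstra.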

Page 9: The past, present and future of singing synthesis

Family Tree of Speech A/S Algorithms

Models and algorithms:

● Source-Filter Model
● Homomorphic Filtering (Oppenheim, 1969)
● Linear Prediction (Atal & Schroeder, 1967)
● LSP/LSF (Itakura, 1975)
● CELP (Atal & Schroeder, 1983)
● MGC/MLSA (Imai, et al., 1983)
● Phase Vocoder (Flanagan et al., 1966)
● PSOLA (?, 1985)
● Sinusoidal Model (McAulay & Quatieri, 1986)
● SMS (Serra, 1989)
● Harmonic+Noise (Stylianou, 1993)
● Sine+Noise+Transient (Levine & Smith, 1998)
● STRAIGHT (Kawahara, 1998)
● NBVPM (Bonada, 2004)
● TANDEM-STRAIGHT (Kawahara & Morise, 2007)
● WBVPM (Bonada, 2008)
● Quasi-Harmonic Model (Pantazis, et al., 2008)
● WORLD1 (Morise, 2009)
● WORLD2 (Morise, 2013)
● Vocaine (Agiomyrgiannakis, 2015)

Software and applications:

● Autotune
● Melodyne
● Sinsy
● NiaoNiao
● tn_fnds
● CeVIO
● Vocaloid
● Vocaloid 2+
● Rocaloid 3
● RUCE (Rocaloid 4)
● Chiptune

Page 10: The past, present and future of singing synthesis

Part 2: What’s Wrong

Page 11: The past, present and future of singing synthesis

Quasi-static Assumption

Algorithms affected:

● Homomorphic Filtering
● PSOLA
● Linear Prediction & CELP & MLSA
● Sinusoidal Model
● Harmonic+Noise Model
● SMS & NBVPM
● WORLD & STRAIGHT (slightly)

Page 12: The past, present and future of singing synthesis

Mis-represented Aperiodic Component

Popular belief:

1. Speech = periodic signal + aperiodic signal (breathing noise)
2. Aperiodic signal is filtered white noise

[Figure labels: Aperiodic; Periodic (Friction)]
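For reference, the popular belief that this page questions can be written down directly: a periodic part plus white noise passed through a fixed filter. The sketch below is only an illustration of that belief, not a recommended model. The harmonic amplitudes, the one-pole filter, the gain and the seed are arbitrary assumptions.

```python
import math
import random

def periodic_part(f0, sample_rate, n_samples, n_harmonics=5):
    """Periodic component: a sum of harmonics of f0 with 1/k amplitudes."""
    return [sum(math.sin(2.0 * math.pi * k * f0 * n / sample_rate) / k
                for k in range(1, n_harmonics + 1))
            for n in range(n_samples)]

def filtered_white_noise(n_samples, pole=0.9, gain=0.05, seed=0):
    """Aperiodic component under the popular belief: white Gaussian noise
    through a fixed one-pole filter, y[n] = pole*y[n-1] + gain*w[n]."""
    rng = random.Random(seed)
    y, prev = [], 0.0
    for _ in range(n_samples):
        prev = pole * prev + gain * rng.gauss(0.0, 1.0)
        y.append(prev)
    return y

sample_rate = 8000
periodic = periodic_part(200.0, sample_rate, 400)
aperiodic = filtered_white_noise(400)
speech = [p + a for p, a in zip(periodic, aperiodic)]  # belief 1: simple sum
```

Note that in this formulation the noise filter is fixed, so the statistics of the aperiodic part do not change over time.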

Page 13: The past, present and future of singing synthesis

Mis-represented Aperiodic Component

Algorithms affected:

● (Quasi-)Harmonic+Noise Model
● SMS & Sines+Noise+Transients Model
● WORLD & (TANDEM-)STRAIGHT
● Algorithms that do not model aperiodic component
  ○ Phase vocoder, CELP, MLSA, ...

Page 14: The past, present and future of singing synthesis

Over-simplified Source-Filter Model

[Diagram 1 labels: Oscillator; Tract Filter; Lip Filter]

[Diagram 2 labels: Oscillator; Source Filter; Tract Filter]

Assumption: the source filter is independent of pitch.

Equivalent assumption: “When my pitch is higher by 12 semitones, my vocal folds still oscillate at the same speed.”

Affected algorithms: all of those listed on page 11
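The arithmetic behind the quoted assumption, assuming equal temperament: 12 semitones is one octave, so the fundamental frequency doubles and each glottal cycle lasts half as long. The 220 Hz starting pitch is an arbitrary example value.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a pitch shift given in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

f0 = 220.0                                # example starting pitch in Hz
shifted_f0 = f0 * semitone_ratio(12)      # one octave up
period_ms = 1000.0 / f0                   # duration of one glottal cycle
shifted_period_ms = 1000.0 / shifted_f0   # half as long after the shift
```

A source filter that ignores pitch keeps the same pulse shape in absolute time even though the cycle it has to fit into has halved, which is the mismatch the quote points at.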

Page 15: The past, present and future of singing synthesis

Part 3: Future: How to Fix & the Low Level Speech Model

Page 16: The past, present and future of singing synthesis

“Neoclassical” Approaches to Speech Modeling

[Diagram labels: Input; Inverse; Source; Tract; Lip. Plot axes: t, f, f]

● Linear Prediction (Atal & Schroeder, 1967)
● ARX (Wen, et al., 1995)
● ARX-LF (Vincent, et al., 2005)
● LF Model (Liljencrants, Fant and Lin, 1985)
● OVE Synthesizer (Fant, 1953)

Page 17: The past, present and future of singing synthesis

“Neoclassical” Approaches to Speech Modeling

Degottex (2013): similar idea, but in frequency domain

Hua (2016, in progress): more robust under poor recording conditions; less sensitive to processed input.

Page 18: The past, present and future of singing synthesis

The Low Level Speech Model (new version)

Level 0 (Signal Level):

● Input Signal
● Pitch
● Harmonic Model
● Noise Model: Spectrum; Channel 1 Energy, Channel 2 Energy, Channel 3 Energy, ... (one Harmonic Model per channel)
● Output Signal

Level 1 (Acoustic Level):

● Glottal/Source Information (LF Model)
● Vocal Tract Filter
● Lip Filter

An acoustically meaningful speech model

Page 19: The past, present and future of singing synthesis

Inverse Analysis of Speech

Original

Glottal Flow (Source Signal)

Page 20: The past, present and future of singing synthesis

Pitch Shifting powered by LLSM

Original

50% Pitch

200% Pitch

Page 21: The past, present and future of singing synthesis

Pitch Shifting powered by LLSM

Original

50% Pitch

200% Pitch

Instants of vocal fold closure were revealed

Page 22: The past, present and future of singing synthesis

References

● A. V. Oppenheim, "Speech Analysis-Synthesis System Based on Homomorphic Filtering". JASA (1969): Vol. 45, No. 2.

● Degottex, Gilles, et al. "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis." Speech Communication 55.2 (2013): 278-294.

● H. K. Dunn, "The calculation of vowel resonances, and an electrical vocal tract", Journal of the Acoustical Society of America, 1950, vol. 22, p. 740-753.

● Pantazis, Yannis, and Yannis Stylianou. "Improving the modeling of the noise part in the harmonic plus noise model of speech." IEEE International Conference on Acoustics, Speech and Signal Processing (2008).