Pitch Tracking

Post on 11-Jan-2016


Jyh-Shing Roger Jang (張智星 )

MIR Lab, Dept of CSIE

National Taiwan University

jang@mirlab.org

http://mirlab.org/jang

Pitch

Definition of pitch
  Fundamental frequency (FF, in Hz): the reciprocal of the fundamental period in a quasi-periodic waveform.
  Pitch (in semitones): obtained from the fundamental frequency through a log-based transformation (detailed later).

Characteristics of pitch
  Noise and unvoiced sounds do not have pitch.

Pitch Tracking

Pitch tracking (PT): the process of computing the pitch vector of a given audio segment.

Sample applications
  Query by singing/humming
  Tone recognition for Mandarin
  Intonation scoring for English
  Prosody analysis for speech synthesis
  Pitch scaling and duration modification

Typical Steps for Pitch Tracking

Pre-processing
  Filtering
  Excitation extraction

Main processing
  Frame blocking
  PDF (periodicity detection function) computation
  Pitch candidates via max picking over the PDF

Post-processing
  Unreliable pitch removal via volume/clarity thresholding
  Pitch refinement via parabolic interpolation
  Pitch smoothing via median filters, etc.

Frame Blocking

Sample rate = 16 kHz
Frame size = 512 samples
Frame duration = 512/16000 = 0.032 s = 32 ms
Overlap = 192 samples
Hop size = frame size - overlap = 512 - 192 = 320 samples
Frame rate = 16000/320 = 50 frames/s = pitch rate
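The frame-blocking arithmetic above can be checked with a short Python sketch. The function names `frame_params` and `block_frames` are hypothetical, and dropping an incomplete final frame is an assumption, not something the slides specify.

```python
def frame_params(fs, frame_size, overlap):
    """Return (frame duration in seconds, hop size in samples, frame rate in frames/s)."""
    hop = frame_size - overlap
    return frame_size / fs, hop, fs / hop


def block_frames(signal, frame_size, overlap):
    """Split a sequence into overlapping frames; an incomplete last frame is dropped (assumption)."""
    hop = frame_size - overlap
    last_start = len(signal) - frame_size
    return [signal[i:i + frame_size] for i in range(0, last_start + 1, hop)]


duration, hop, rate = frame_params(fs=16000, frame_size=512, overlap=192)
print(duration, hop, rate)  # 0.032 320 50.0

# One second of dummy samples at 16 kHz, used only to count the frames.
frames = block_frames(list(range(16000)), frame_size=512, overlap=192)
print(len(frames), frames[1][0])  # 49 320
```

Each frame yields one pitch value, so the frame rate is also the rate of the resulting pitch vector.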

[Figure: a waveform and a zoomed-in view marking one frame and its overlap with the adjacent frame.]

Periodicity Detection Functions

A PDF (periodicity detection function) is used to detect the period of a waveform.

Two categories of PDF
  Time domain
    ACF (autocorrelation function)
    NSDF (normalized squared difference function)
    AMDF (average magnitude difference function)
  Frequency domain
    Harmonic product spectrum
    Cepstrum

ACF: Auto-correlation Function

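Original frame: s(t)

Shifted frame: s(t-τ)

With τ = 30, acf(30) is the inner product of the overlapping part of the two frames.

$acf(\tau) = \sum_{t=\tau}^{n-1} s(t)\, s(t-\tau)$

The frame is zero-indexed, [s(0), s(1), …, s(n-1)], so the overlap pairs s(τ), …, s(n-1) of the original frame with s(0), …, s(n-1-τ) of the shifted frame.

[Figure: the original frame and the frame shifted by τ = 30, with the pitch period marked.]

To play safe, the frame size needs to cover at least two fundamental periods!

Quiz candidate!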

ACF: Formula 1

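Assume a frame is represented by s(t), t = 0 to n-1.

ACF formula (shift to the right):

$acf(\tau) = \sum_{t=\tau}^{n-1} s(t)\, s(t-\tau)$

[Figure: s(t) and s(t-τ) plotted against t, with s(t-τ) shifted to the right.]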

ACF: Formula 2

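Assume a frame is represented by s(t), t = 0 to n-1.

ACF formula (shift to the left):

$acf(\tau) = \sum_{t=0}^{n-1-\tau} s(t)\, s(t+\tau)$

[Figure: s(t) and s(t+τ) plotted against t, with s(t+τ) shifted to the left.]

This formula is the same as the previous one: replacing t with t + τ in the right-shift sum gives exactly the left-shift sum.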
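The equivalence of the two ACF formulas can be checked numerically. The following Python sketch uses a hypothetical sine-wave frame with a period of 20 samples; the function names are illustrative only.

```python
import math


def acf_shift_right(s, tau):
    """Formula 1: sum of s(t) * s(t - tau) for t = tau .. n-1."""
    return sum(s[t] * s[t - tau] for t in range(tau, len(s)))


def acf_shift_left(s, tau):
    """Formula 2: sum of s(t) * s(t + tau) for t = 0 .. n-1-tau."""
    return sum(s[t] * s[t + tau] for t in range(len(s) - tau))


# Hypothetical test frame: 100 samples of a sine wave with a 20-sample period.
frame = [math.sin(2 * math.pi * t / 20) for t in range(100)]

for tau in (0, 10, 20, 30):
    # Both columns agree; the value is large at tau = 20 (one period)
    # and negative at tau = 10 and tau = 30 (half-period offsets).
    print(tau, acf_shift_right(frame, tau), acf_shift_left(frame, tau))
```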

Example of ACF

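sunday.wav
  Sample rate = 16 kHz
  Frame size = 512 (starting from point 9000)

Fundamental frequency
  The max of the ACF occurs at index 131.
  Supposing zero-based indexing, the lag is 131 samples, so FF = 16000/131 ≈ 122.137 Hz.
  (The value 123.077 Hz equals 16000/130, i.e. a lag of 130 samples, which is what index 131 means under MATLAB's one-based indexing.)

frame2acf01.m

[Figure: ACF curve with index 0 and index 131 marked.]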

Locating the Pitch Point

If the range of human FF is [40, 1000] Hz, then the interval for locating the fundamental period (FP, in samples) is

$40 \le \frac{f_s}{FP} \le 1000 \;\Rightarrow\; \frac{f_s}{1000} \le FP \le \frac{f_s}{40}$

where $f_s$ is the sample rate.

frame2acfPitchPoint01.m

[Figure: the input frame, and the ACF vector (method = 1) showing the original ACF, the truncated ACF, and the ACF pitch point, with index 0 and index FP marked.]

Quiz candidate!
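A minimal Python sketch of max picking restricted to the interval fs/1000 ≤ FP ≤ fs/40 follows. It is not the code of frame2acfPitchPoint01.m; the names `acf` and `pitch_point` and the synthetic 125 Hz frame are assumptions made for illustration.

```python
import math


def acf(s):
    """Plain ACF vector: acf[tau] = sum of s(t) * s(t - tau) for t = tau .. n-1."""
    n = len(s)
    return [sum(s[t] * s[t - tau] for t in range(tau, n)) for tau in range(n)]


def pitch_point(pdf, fs, f_min=40.0, f_max=1000.0):
    """Index of the max of a PDF inside the lag range fs/f_max .. fs/f_min."""
    lo = math.ceil(fs / f_max)
    hi = min(math.floor(fs / f_min), len(pdf) - 1)
    return max(range(lo, hi + 1), key=lambda tau: pdf[tau])


fs = 16000
# Synthetic frame: a 125 Hz sine, i.e. a fundamental period of 128 samples.
frame = [math.sin(2 * math.pi * 125 * t / fs) for t in range(512)]
fp = pitch_point(acf(frame), fs)
print(fp, fs / fp)  # 128 125.0
```

Without the lower bound, the global max of the ACF would always sit at lag 0, which is why the search starts at fs/1000.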

Locating the Fundamental Period (II)

The assumed human pitch range can fail.
  Pitch too high
    Vitas (local short clip)
    Whistling
  Pitch too low
    Low-pitch singing/humming requires a large frame size

Example of ACF Based PT

Specs
  Sample rate = 11025 Hz
  Frame size = 353 points ≈ 32 ms
  Overlap = 0
  Frame rate ≈ 31.25 frames/s

Playback
  Original singing
  Pitch by ACF

wave2pitchByAcf01.m

[Figure: waveform, volume, and pitch (semitone) versus time (second); original pitch in blue and volume-thresholded pitch in red.]

Example of ACF Based PT (II)

Specs
  The previous script is converted into a function, pitchTrackingSimple.m, for easy access.

ptByAcf01.m

[Figure: waveform, volume, and pitch (semitone) versus time (second); original pitch in blue and volume-thresholded pitch in red.]

Demo of ACF-based PT

Real-time display of ACF for pitch tracking: goPtByAcf.mdl under the SAP toolbox

Real-time pitch tracking for mic input: goPtByAcf2.mdl under the SAP toolbox

ACF Variants to Avoid Tapering

Normalized version (method=2), frame2acf02.m

$acf(\tau) = \frac{\sum_{t=\tau}^{n-1} s(t)\, s(t-\tau)}{n-\tau}$

Half-frame shifting (method=3), frame2acf03.m

$acf(\tau) = \sum_{t=0}^{n/2} s(t)\, s(t+\tau)$
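The effect of the two variants on tapering can be seen in a small Python sketch. It is an illustration under assumptions: the half-frame version here sums over the first n/2 samples with 0 ≤ τ ≤ n/2, which may differ in boundary details from frame2acf03.m, and the test frame is a synthetic sine wave.

```python
import math


def acf_plain(s, tau):
    """method=1: plain ACF, which tapers off as the overlap n - tau shrinks."""
    return sum(s[t] * s[t - tau] for t in range(tau, len(s)))


def acf_normalized(s, tau):
    """method=2: divide by the overlap length n - tau."""
    return acf_plain(s, tau) / (len(s) - tau)


def acf_half_frame(s, tau):
    """method=3 (assumed form): shift only the first half of the frame, 0 <= tau <= n/2."""
    half = len(s) // 2
    return sum(s[t] * s[t + tau] for t in range(half))


# Synthetic frame: 128 samples of a sine wave with a 32-sample period.
frame = [math.sin(2 * math.pi * t / 32) for t in range(128)]

for tau in (32, 64):
    # The plain ACF peak drops from about 48 to about 32 between the first and
    # second period, while both variants keep the two peaks at the same height.
    print(tau, acf_plain(frame, tau), acf_normalized(frame, tau), acf_half_frame(frame, tau))
```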

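NSDF: ACF Variant with Normalized Range

NSDF: normalized squared difference function

Formula:

$nsdf(\tau) = \frac{2\sum_{t=\tau}^{n-1} s(t)\, s(t-\tau)}{\sum_{t=\tau}^{n-1} s^2(t) + \sum_{t=\tau}^{n-1} s^2(t-\tau)}$

A variant of ACF within the range [-1, 1], based on the inequality:

$-(x_i^2 + y_i^2) \le 2 x_i y_i \le x_i^2 + y_i^2 \;\Rightarrow\; -1 \le \frac{2\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2} \le 1$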
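A short Python sketch of the NSDF, checking that its values stay within [-1, 1]. The function name `nsdf`, the zero returned for an all-zero overlap, and the sine-wave frame are assumptions for illustration.

```python
import math


def nsdf(s, tau):
    """2 * sum(s(t) s(t-tau)) / sum(s(t)^2 + s(t-tau)^2) over the overlap t = tau .. n-1."""
    n = len(s)
    num = 2.0 * sum(s[t] * s[t - tau] for t in range(tau, n))
    den = sum(s[t] ** 2 + s[t - tau] ** 2 for t in range(tau, n))
    return num / den if den > 0 else 0.0  # silent overlap: report 0 (assumption)


# Synthetic frame: 128 samples of a sine wave with a 32-sample period.
frame = [math.sin(2 * math.pi * t / 32) for t in range(128)]

values = [nsdf(frame, tau) for tau in range(len(frame))]
print(min(values), max(values))              # stays inside [-1, 1]
print(nsdf(frame, 32), nsdf(frame, 16))      # close to 1 at one period, close to -1 at half a period
```

Because the peak height is bounded by 1, it can be compared against a fixed threshold, which is what makes it usable as a clarity measure.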

NSDF Example

frame2nsdf01.m

Clarity: height of the pitch point

AMDF: Average Magnitude Difference Function

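Original frame: s(t)

Shifted frame: s(t-τ)

With τ = 30, amdf(30) is the sum of the absolute differences over the overlapping part of the two frames.

$amdf(\tau) = \sum_{t=\tau}^{n-1} \left| s(t) - s(t-\tau) \right|$

[Figure: the original frame and the frame shifted by τ = 30, with the pitch period marked.]

Quiz candidate!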

Comparison between ACF & AMDF

Formulas

ACF: $acf(\tau) = \sum_{t=\tau}^{n-1} s(t)\, s(t-\tau)$

AMDF: $amdf(\tau) = \sum_{t=\tau}^{n-1} \left| s(t) - s(t-\tau) \right|$

Two major advantages of AMDF over ACF
  AMDF requires less computing power.
  AMDF is less likely to overflow.

Quiz candidate!
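The two functions can be compared side by side in Python. This sketch assumes a synthetic zero-mean sawtooth frame and a hypothetical lag search range of 8 to 64; note that the pitch point is a peak of the ACF but a valley of the AMDF.

```python
def acf(s, tau):
    """Sum of products over the overlap: one multiplication per sample."""
    return sum(s[t] * s[t - tau] for t in range(tau, len(s)))


def amdf(s, tau):
    """Sum of absolute differences over the overlap: only subtraction and abs."""
    return sum(abs(s[t] - s[t - tau]) for t in range(tau, len(s)))


# Synthetic frame: a zero-mean sawtooth with a 32-sample period.
frame = [(t % 32) - 15.5 for t in range(128)]

lags = range(8, 65)  # hypothetical search range
acf_point = max(lags, key=lambda tau: acf(frame, tau))    # ACF: pick the max
amdf_point = min(lags, key=lambda tau: amdf(frame, tau))  # AMDF: pick the min
print(acf_point, amdf_point)  # 32 32
```

Since the AMDF also reaches zero at lag 64 for this frame, `min` returns the first of the tied valleys; real signals need a more careful rule, which is one reason the AMDF pitch point can be harder to determine.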

Example of AMDF

sunday.wav
  Sample rate = 16 kHz
  Frame size = 512 (starting from point 9000)

Fundamental frequency
  The pitch point occurs at index 131, which is harder to determine.

frame2amdf01.m

[Figure: AMDF curve with index 0 and index 131 marked.]

Example of AMDF to Pitch

sunday.wav
  Sample rate = 16 kHz
  Frame size = 512 (starting from point 9000)

Fundamental frequency
  The pitch point occurs at index 131, which is determined correctly.
  With zero-based indexing the lag is 131 samples, so FF = 16000/131 ≈ 122.137 Hz.
  (The value 123.077 Hz equals 16000/130, i.e. a lag of 130 samples, which is what index 131 means under MATLAB's one-based indexing.)

frame2amdf4pt01.m

[Figure: the input frame, and the AMDF vector (method = 1) showing the original AMDF4PT, the truncated AMDF4PT, and the AMDF pitch point, with index 0 and index 131 marked.]

Example of AMDF Based PT

Specs
  Sample rate = 11025 Hz
  Frame size = 353 points ≈ 32 ms
  Overlap = 0
  Frame rate ≈ 31.25 frames/s

Playback
  Original singing
  Pitch by AMDF

ptByAmdf01.m

[Figure: waveform, volume, and pitch (semitone) versus time (second); original pitch in blue and volume-thresholded pitch in red.]

AMDF: Variations to Avoid Tapering

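Normalized version (method=2), frame2amdf02.m

$amdf(\tau) = \frac{\sum_{t=\tau}^{n-1} \left| s(t) - s(t-\tau) \right|}{n-\tau}$

Half-frame shifting (method=3), frame2amdf03.m

$amdf(\tau) = \sum_{t=0}^{n/2} \left| s(t) - s(t+\tau) \right|$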

Combining ACF and AMDF

[Figure: a frame together with its ACF, AMDF, and ACF/AMDF curves.]

Audio Features in Time Domain

Audio features presented in the time domain

Intensity

Fundamental period

Timbre: Waveform within an FP

Audio Features in Frequency Domain

Energy: sum of the power spectrum
Pitch: distance between harmonics
Timbre: smoothed spectrum

[Figure: a spectrum annotated with the energy, the pitch frequency, the first formant F1, and the second formant F2.]

About DFT & FFT

Terminology
  DFT: discrete Fourier transform
  FFT: fast Fourier transform, an efficient method for computing the DFT

More about DFT

Harmonic Product Spectrum (HPS)

Procedure

1. Compute the power spectrum of a frame.

2. Eliminate its trend, obtained from 20th-order polynomial fitting. (This removes the formants.)

3. Apply exponential weighting to suppress high-frequency harmonics.

4. Downsample and add to enhance the harmonics at the fundamental frequency.

5. Find the max as the pitch point.
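Steps 4 and 5 can be sketched in Python as follows. This illustration skips the trend removal and exponential weighting of steps 2 and 3, uses a hand-made spectrum rather than one computed from audio, and the name `harmonic_sum` and the choice of four harmonics are assumptions.

```python
def harmonic_sum(spectrum, num_harmonics=4):
    """Add the spectrum downsampled by factors 1 .. num_harmonics.

    Entry k of the result is spectrum[k] + spectrum[2k] + ... + spectrum[num_harmonics * k],
    so the harmonics of a fundamental at bin k all pile up at bin k.
    """
    length = (len(spectrum) - 1) // num_harmonics + 1
    return [sum(spectrum[r * k] for r in range(1, num_harmonics + 1)) for k in range(length)]


# Hand-made spectrum: fundamental at bin 10 with harmonics at bins 20, 30, 40.
# The second harmonic is stronger than the fundamental, so plain max picking fails.
spectrum = [0.0] * 64
spectrum[10], spectrum[20], spectrum[30], spectrum[40] = 1.0, 3.0, 2.0, 1.0

raw_peak = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
hps = harmonic_sum(spectrum)
hps_peak = max(range(1, len(hps)), key=lambda k: hps[k])  # skip the DC bin
print(raw_peak, hps_peak)  # 20 10
```

Adding downsampled copies corresponds to multiplying them when the spectrum is on a log scale, which is where the "product" in the name comes from.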

“Down Sample and Add” in HPS

Example of HPS

frame2hps01.m

[Figure: the frame (versus samples), followed by four panels versus frequency (Hz): power spectrum and its trend; trend-subtracted power spectrum and its tapering version; down-sampled versions of the power spectrum; harmonic product spectrum.]

Example of PT by HPS

ptByHps01.m

[Figure: waveform, volume, and pitch (semitone) versus time (second); blue: original pitch, black: volume-thresholded pitch.]

PT by Cepstrum

Formula for cepstrum

$cepstrum = \mathrm{ifft}\left(\log\left|\mathrm{fft}(frame)\right|\right)$

Procedure for PT by cepstrum

1. Compute the power spectrum of a frame.

2. Eliminate the trend of the power spectrum if necessary.

3. Take the inverse FFT of the (symmetric) power spectrum. (The result is real, why?)

4. Find the position of the max to compute the pitch.
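A self-contained Python sketch of pitch tracking by cepstrum follows. To avoid external libraries it uses a naive O(n²) DFT instead of an FFT, works on the log-magnitude spectrum, and skips trend removal. The decaying pulse-train frame, the 8 kHz sample rate, and all function names are assumptions for illustration.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse); fine for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame):
    """Inverse DFT of the log-magnitude spectrum; the imaginary part is only rounding noise."""
    log_mag = [math.log(abs(v)) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]


fs = 8000
n, period = 256, 32
# Synthetic frame: a decaying pulse train with one pulse every 32 samples (250 Hz at 8 kHz).
frame = [0.5 ** (t // period) if t % period == 0 else 0.0 for t in range(n)]

ceps = real_cepstrum(frame)
lo = math.ceil(fs / 1000)                  # shortest period considered (1000 Hz)
hi = min(math.floor(fs / 40), n // 2)      # longest period considered (40 Hz)
q = max(range(lo, hi + 1), key=lambda k: ceps[k])
print(q, fs / q)  # 32 250.0
```

The log-magnitude spectrum of a real frame is symmetric, which is why its inverse transform is real.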

PT by Cepstrum: How It Works?

Close to sinusoids!

This should be a single pulse only!

Example of Cepstrum

frame2ceps01.m

[Figure: three panels showing the frame, its power spectrum, and its cepstrum.]

Example of PT by Cepstrum

ptByCeps01.m

[Figure: waveform, volume, and pitch (semitone) versus time (second); blue: original pitch, black: volume-thresholded pitch.]

Two Parts of PT

PT has two parts
  Voicing detection: decide if a frame has a melody pitch or not
  Pitch estimation: estimate the most likely melody pitch of a frame

These two parts can be performed in any order.
Performance evaluation of PT depends on these two parts.

Performance Evaluation of PT

Several criteria for PT performance evaluation
  Raw pitch accuracy
    Probability of a correct pitch value (to within ±¼ tone, i.e. ±0.5 semitone) over the voiced frames
  Raw chroma accuracy
    Probability that the chroma (i.e. the note name) is correct over the voiced frames
  Overall accuracy
    Probability of a correct pitch value (via pitch estimation) and a correct pitched decision (via voicing detection) over all frames
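The three criteria can be computed as in the Python sketch below. It assumes pitch vectors in semitones with 0 marking an unvoiced frame, and it counts a voiced reference frame as wrong when the estimate is unvoiced; these conventions and the function name `evaluate` are assumptions, and evaluation toolkits may define the edge cases differently.

```python
def evaluate(ref, est, tol=0.5):
    """Return (raw pitch accuracy, raw chroma accuracy, overall accuracy)."""
    voiced = [i for i, r in enumerate(ref) if r > 0]

    def pitch_ok(i):
        return est[i] > 0 and abs(est[i] - ref[i]) <= tol

    def chroma_ok(i):
        if est[i] <= 0:
            return False
        d = abs(est[i] - ref[i]) % 12          # ignore whole-octave errors
        return min(d, 12 - d) <= tol

    raw_pitch = sum(pitch_ok(i) for i in voiced) / len(voiced)
    raw_chroma = sum(chroma_ok(i) for i in voiced) / len(voiced)
    # Over all frames: voiced frames need a correct pitch, unvoiced frames a correct "no pitch".
    correct = sum(pitch_ok(i) if ref[i] > 0 else est[i] <= 0 for i in range(len(ref)))
    return raw_pitch, raw_chroma, correct / len(ref)


ref = [60, 62, 64, 0, 0, 65]      # made-up reference pitch (semitones, 0 = unvoiced)
est = [60.2, 74, 64, 0, 59, 0]    # made-up estimate: one octave error, one false alarm, one miss
print(evaluate(ref, est))  # (0.5, 0.75, 0.5)
```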

Preprocessing for Pitch Tracking

Some commonly used preprocessing for the audio signals before pitch tracking
  Pre-filtering the signals
  Clipping the signals
  SIFT method for the signals

Preprocessing: Pre-filtering

Observation
  Range of human pitch: [40, 1000] Hz

Idea
  Low-pass the signals with a cutoff frequency between 800 and 1000 Hz

Characteristics
  The effect is yet to be verified

Preprocessing: Clipping

Observation
  Small signals near zero are likely to cause pitch tracking errors

Idea
  Clip the signals

Characteristics
  Saves computation on embedded systems
  Overall effect is yet to be verified

Preprocessing: SIFT

Observation
  Channel effect is likely to cause pitch tracking errors

Idea of SIFT (simple inverse filter tracking)
  Identify the excitation via LPC
  Use the excitation for the PDF

Characteristics
  Overall effect is yet to be verified

Example of SIFT

siftAcf01.m

[Figure: three panels versus frame index: original signal vs. LPC estimate; residual signal when order = 20; normalized ACF curves on the original frame and on the excitation.]

Example of PT based on SIFT & ACF

ptBySiftAcf01.m

[Figure: waveform, volume, and pitch (semitone) versus time (second); blue: original pitch, black: volume-thresholded pitch.]

Postprocessing for Pitch Tracking

Some commonly used postprocessing for pitch tracking
  Smoothing to remove abruptly changing pitch
  Interpolation to increase pitch precision

Postprocessing: Smoothing

Smoothing by a median filter

ptWithMedianFilter01.m

[Figure: waveform, volume, and pitch (semitone) versus time (second); blue: original pitch, black: volume-thresholded pitch.]
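Median smoothing of a pitch vector can be sketched in Python as below. The filter order of 3 and the shrinking window at the two ends are assumptions; ptWithMedianFilter01.m may handle the boundaries differently.

```python
from statistics import median


def median_filter(pitch, order=3):
    """Replace each value by the median of a window of `order` frames centred on it.

    Near the two ends the window is simply cut short (an assumed boundary rule).
    """
    half = order // 2
    n = len(pitch)
    return [median(pitch[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


# Made-up pitch vector (semitones) with one octave-jump outlier at index 3.
pitch = [60, 60, 60, 72, 60, 60, 61, 61, 61]
smoothed = median_filter(pitch)
print(smoothed)
```

The isolated jump to 72 is removed, while the genuine step from 60 to 61 is kept, which a moving average would blur.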

Postprocessing: Interpolation

Idea
  Use the pitch point and its neighbors to identify the max position

ptWithParabolicFit01.m

[Figure: original pitch and fine-tuned pitch with parabolic fit (top); pitch difference between the two (bottom).]
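The refinement can be sketched in Python: fit a parabola through the pitch point and its two neighbors and take the vertex as the fractional peak position. The function name `parabolic_peak` and the made-up PDF values are assumptions for illustration.

```python
import math


def parabolic_peak(pdf, k):
    """Fractional position of the peak of the parabola through (k-1, k, k+1)."""
    y0, y1, y2 = pdf[k - 1], pdf[k], pdf[k + 1]
    denom = y0 - 2 * y1 + y2
    if denom == 0:          # the three points are collinear: keep the integer position
        return float(k)
    return k + 0.5 * (y0 - y2) / denom


# Made-up PDF that is an exact parabola peaking at 10.3, sampled at integer lags.
pdf = [-(t - 10.3) ** 2 for t in range(20)]
k = max(range(1, len(pdf) - 1), key=lambda t: pdf[t])   # integer pitch point
refined = parabolic_peak(pdf, k)
print(k, refined)

fs = 16000
print(fs / k, fs / refined)   # pitch before and after refinement, in Hz
```

With integer lags alone, neighboring pitch values at this lag would be 1600 Hz and about 1455 Hz apart, so the fractional lag removes a large quantization error.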


UPDUDP (1/4)

UPDUDP: Unbroken Pitch Determination Using DP

Goal: to take pitch smoothness into consideration

$cost(\mathbf{p}, \theta, m) = \sum_{i=1}^{n} amdf_i(p_i) + \theta \sum_{i=1}^{n-1} \left| p_i - p_{i+1} \right|^m$

$\mathbf{p} = [p_1, \ldots, p_i, \ldots, p_n]$: a given path in the AMDF matrix
$n$: number of frames
$\theta$: transition penalty
$m$: exponent of the transition difference

Jiang-Chun Chen, J.-S. Roger Jang, "TRUES: Tone Recognition Using Extended Segments", ACM Transactions on Asian Language Information Processing, No. 10, Vol. 7, Aug 2008.

UPDUDP (2/4)

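Optimum-value function D(i, j): the minimum cost starting from frame 1 to position (i, j)

Recurrent formula:

$D(i, j) = amdf_i(j) + \min_{k \in [8, 160]} \left\{ D(i-1, k) + \theta \left| k - j \right|^2 \right\}, \quad i \in [2, n],\ j \in [8, 160]$

Initial conditions:

$D(1, j) = amdf_1(j), \quad j \in [8, 160]$

Optimum cost:

$\min_{j \in [8, 160]} D(n, j)$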
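The recurrence can be sketched in Python as a small dynamic-programming routine with back pointers. It is an illustration, not the authors' implementation: column j of the matrix here is a 0-based candidate index (in the slides it would correspond to lags 8 to 160), the matrix values are made up, and the name `updudp` is only borrowed as a label.

```python
def updudp(amdf_matrix, theta, m=2):
    """Return (optimum cost, path) for a list of per-frame AMDF rows.

    amdf_matrix[i][j] is the AMDF of frame i at candidate index j;
    theta is the transition penalty and m the exponent of the transition difference.
    """
    n, width = len(amdf_matrix), len(amdf_matrix[0])
    D = [list(amdf_matrix[0])]                      # initial condition: D(1, j) = amdf_1(j)
    back = []
    for i in range(1, n):
        row, ptr = [], []
        for j in range(width):
            # Best predecessor k for position (i, j).
            k = min(range(width), key=lambda k: D[i - 1][k] + theta * abs(k - j) ** m)
            row.append(amdf_matrix[i][j] + D[i - 1][k] + theta * abs(k - j) ** m)
            ptr.append(k)
        D.append(row)
        back.append(ptr)
    j = min(range(width), key=lambda j: D[-1][j])   # optimum cost: min over j of D(n, j)
    cost, path = D[-1][j], [j]
    for ptr in reversed(back):                      # trace the back pointers
        j = ptr[j]
        path.append(j)
    return cost, path[::-1]


# Made-up AMDF rows: frame 2 has a spurious valley at index 0.
matrix = [[5, 1, 4, 6],
          [0, 2, 5, 6],
          [6, 1, 5, 7]]
print(updudp(matrix, theta=0))  # (2, [1, 0, 1]): per-frame minima, pitch jumps
print(updudp(matrix, theta=2))  # (4, [1, 1, 1]): the penalty keeps the path smooth
```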

Example of UPDUDP

A typical example (via AMDF)

Robustness of UPDUDP

Insensitivity to the transition penalty θ

[Figure: waveform (top) and pitch in semitones (bottom) versus time (seconds) for the syllables xi, lu, chan, sheng, chang, with pitch curves for θ = 0, 2000, 4000, …, 20000.]

Another Example of UPDUDP

Example of MATLAB code using UPDUDP (via ACF)

% Read the wave file into an object
waveFile='arina_short.wav';
wObj=waveFile2obj(waveFile);
% Set the pitch-tracking options from the sample rate and bit resolution
ptOpt=ptOptSet(wObj.fs, wObj.nbits, 1);
% Run pitch tracking
pitch=pitchTracking(wObj, ptOpt, 1);

Result

[Figure: five panels versus time (sec): waveform of arina_short.wav; PF matrix (white dots: DP path, black dots: pitch after all kinds of thresholding/smoothing); computed pitch (semitone); volume; clarity.]

Frequency to Semitone Conversion

Semitone: a musical scale based on A440

$semitone = 69 + 12 \log_2\left(\frac{freq}{440}\right)$

Reasonable pitch range: E2 to C6, i.e. 82 Hz to 1047 Hz
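The conversion and its inverse can be written in a few lines of Python. The function names are hypothetical; the formula is the one given above, with A440 mapped to semitone 69.

```python
import math


def freq_to_semitone(freq):
    """semitone = 69 + 12 * log2(freq / 440)."""
    return 69 + 12 * math.log2(freq / 440)


def semitone_to_freq(semitone):
    """Inverse mapping: freq = 440 * 2 ** ((semitone - 69) / 12)."""
    return 440 * 2 ** ((semitone - 69) / 12)


print(freq_to_semitone(440), freq_to_semitone(880))  # 69.0 81.0
# The ends of the range quoted above, rounded to the nearest semitone.
print(round(freq_to_semitone(82)), round(freq_to_semitone(1047)))  # 40 84
```

One octave (a doubling of frequency) always spans 12 semitones, which is why pitch errors are usually measured on this scale rather than in Hz.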

Unreliable Pitch Removal (1/2)

Pitch removal via volume thresholding

[Figure: waveform of 小毛驢.wav, volume, and pitch versus time (sec).]

Unreliable Pitch Removal (2/2)

Pitch removal via volume/clarity thresholding

[Figure: waveform of 小毛驢.wav, volume, clarity, and pitch versus time (sec).]
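Thresholding can be sketched in Python as follows. The volume definition (sum of absolute values after removing the frame mean), the threshold values, and the sample vectors are all assumptions for illustration; clarity is taken as a per-frame value between 0 and 1, such as the height of the NSDF pitch point.

```python
def frame_volume(frame):
    """Volume as the sum of absolute values after removing the frame mean (assumed definition)."""
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)


def remove_unreliable_pitch(pitch, volume, clarity, volume_th, clarity_th):
    """Set the pitch to 0 (no pitch) where volume or clarity falls below its threshold."""
    return [p if v >= volume_th and c >= clarity_th else 0
            for p, v, c in zip(pitch, volume, clarity)]


# Made-up per-frame values: frame 2 is too quiet, frame 4 is too unclear.
pitch = [60, 61, 45, 62, 70]
volume = [900, 950, 40, 870, 600]
clarity = [0.9, 0.85, 0.8, 0.9, 0.3]
print(remove_unreliable_pitch(pitch, volume, clarity, volume_th=100, clarity_th=0.5))
# [60, 61, 0, 62, 0]
```

Volume alone would have kept the last frame, which is loud but not periodic; the clarity test is what removes it.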

Rest Handling

[Figure: three pitch-vector plots versus frame index, titled "Original PV", "useRest=1", and "useRest=0".]

Original pitch vectors with rests.
Rests are removed. Good for DTW.
Rests are replaced by the previous nonzero pitch. Good for LS.
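The two ways of handling rests can be sketched in Python. A rest is assumed to be stored as 0 in the pitch vector, leading rests are left as 0 because there is no previous pitch to copy, and the function names are hypothetical (they are not tied to either value of `useRest`).

```python
def remove_rests(pv):
    """Drop all rest frames, which shortens the pitch vector."""
    return [p for p in pv if p > 0]


def fill_rests(pv):
    """Replace each rest by the previous nonzero pitch, keeping the vector length."""
    out, prev = [], 0
    for p in pv:
        if p > 0:
            prev = p
        out.append(prev)
    return out


pv = [0, 60, 60, 0, 0, 62, 0, 64]   # made-up pitch vector, 0 = rest
print(remove_rests(pv))  # [60, 60, 62, 64]
print(fill_rests(pv))    # [0, 60, 60, 60, 60, 62, 62, 64]
```

Filling keeps the timing of the notes intact, while removal discards it, which matches the different needs of the two comparison methods named above.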

Typical Result of Pitch Tracking

Pitch tracking via autocorrelation for 茉莉花 (Jasmine) (audio)

Comparison of pitch vectors
  Yellow line: target pitch vector

Other Pitch Related Demos

Pitch scaling
  pitchShiftDemo/project1.exe
  pitchShift-multirate/multirate.m