Speech Assessment: Methods and Applications for Spoken Language Learning

(Methods, Applications, and Sharing of Speech Scoring)

J.-S. Roger Jang ( 張智星 )

[email protected]

http://www.cs.nthu.edu.tw/~jang

Multimedia Information Retrieval Lab

CS Dept, Tsing Hua Univ, Taiwan

Outline

• Introduction to speech assessment
• Methods
• Using learning to rank for speech assessment
• Demos
• Conclusions

Intro. to Speech Assessment

Goal
• Evaluate a person's utterance based on acoustic features, for language learning

Also known as
• Pronunciation scoring
• CAPT (computer-assisted pronunciation training)

Four Aspects of Language Learning

Media \ Skills   Receptive skill       Productive skill
                 (easier for CALL)     (harder for CALL)
Speech           Listening (聽)        Speaking (說)  <- SA!
Text             Reading (讀)          Writing (寫)

Speech Assessment

Characteristics of ideal SA
• Assessment levels: as detailed as possible
  • Syllables, words, sentences, paragraphs
• Assessment criteria: as many as possible
  • Timbre, tone, energy, rhythm, co-articulation, …
• Feedback: as specific as possible
  • High-level corrections and suggestions

Basic Assessment Criteria

• Timbre (咬字/音色)
  • Based on acoustic models
• Tone (音調/音高)
  • Based on tone recognition (for tonal languages)
  • Based on pitch similarity with the target utterance
• Rhythm (韻律/音長)
  • Based on duration comparison with the target utterance
• Energy (強度/音量)
  • Based on energy comparison with the target utterance

Additional Assessment Criteria

English
• Stress (重音)
  • Levels (word or sentence)
• Intonation (整句音調)
  • Declarative sentence
  • Interrogative sentence
• Co-articulation (連音)
  • A red apple.
  • Did you call me?
  • Won't you go?
  • Raise your hand.

Mandarin
• Tone (聲調)
• Retroflex (捲舌音)
• Co-articulation (連音)
• Erhua (兒化音, r-colored finals)

Others
• Pause

Types of SA

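Types of SA (ordered by difficulty)
• Type 1: with target text, with target utterance
• Type 2: with target text, without target utterance
• Type 3: without target text, with target utterance
• Type 4: without target text, without target utterance

We are focusing on types 1 and 2.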

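Type 1: With Target Text, With Target Utterance

Method:
• Based on a speech recognition engine, perform forced alignment (FA) between the speech and the text, then score according to the similarity of each phone.

Scoring:
• Timbre: compared against the acoustic models of the speech recognition engine
• Tone, rhythm, energy: compared against the target utterance

Characteristics:
• Because FA is highly accurate, it is easier to obtain consistent scores.

Examples:
• myET (艾爾實驗室): www.myet.com
• Saybot (說寶堂): www.saybot.com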

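Type 2: With Target Text, Without Target Utterance

Method:
• Based on a speech recognition engine, perform forced alignment (FA) between the speech and the text, then score according to the similarity of each phone.

Scoring:
• Timbre: compared against the acoustic models of the speech recognition engine
• Tone: for Mandarin, the standard tones can be derived from the text, and tone recognition and scoring are then performed on the speech. No similar method exists for English.
• Rhythm, energy: cannot be compared

Characteristics:
• Because FA is highly accurate, it is easier to obtain consistent scores.
• Teaching material is easier to prepare.
• Rhythm and volume cannot be scored.

Example:
• Speak & Score from 階梯英文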

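Type 3: Without Target Text, With Target Utterance

Method:
• Based on a speech recognition engine, perform free syllable decoding (FSD) on the speech, then score according to the similarity of the resulting syllable strings.

Scoring:
• Timbre: compared against the syllable string of the target utterance
• Tone, rhythm, energy: compared using the syllables produced by FSD

Characteristics:
• Because the recognition rate of FSD is only 60% to 70%, consistent scores are harder to obtain.
• DTW can be used directly for the comparison instead, but scoring consistency is then lower because timbre differs from speaker to speaker.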

Our Approach

Basic approach to timbre assessment
• Lexicon net construction (usually a sausage net)
• Forced alignment to identify phone boundaries
• Phone scoring based on several criteria, such as ranking, histograms, posterior probabilities, etc.
• Weighted average to get syllable/sentence scores
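The final step, combining phone scores into a syllable or sentence score, can be sketched as below. This is a minimal illustration: the slides do not say which weights are used, so weighting by phone duration in frames is an assumption, and the numbers are hypothetical.

```python
def weighted_average(scores, weights):
    """Combine phone-level scores into one syllable- or sentence-level score."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equally long")
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical phone scores for one syllable, weighted by duration in frames
# (the choice of duration as the weight is an assumption, not from the slides).
phone_scores = [50, 100, 100]
phone_durations = [8, 8, 8]
syllable_score = weighted_average(phone_scores, phone_durations)
```

With equal weights this reduces to the plain mean; unequal durations would let long phones dominate the syllable score.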

Lexicon Net Construction

• Lexicon net for “what are you allergic to?”
• Sausage net with all possible (and correct) multiple pronunciations
• Optional sil between words

Lexicon Net with Confusing Phones

Common errors for Japanese learners of Chinese
• ㄖ → ㄌ  Example: 天氣熱 → 天氣樂
• ㄑ → ㄐ  Example: 打哈欠 → 打哈見
• ㄘ → ㄗ  Example: 一次旅行 → 一字旅行
• ㄢ → ㄤ  Example: 晚安 → 晚ㄤ

Rule-based approach to creating confusing syllables (phonological rules!)
• Rule 1: re → le
• Rule 2: qi → ji
• Rule 3: ci → zi
• Rule 4: an → ang

Example: 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)

Example of Japanese Learners Speaking Chinese

去年夏天熱死了 Example 1 Example 2

晚安 Example 1 Example 2

坐下來、慢慢吃 Example 1

他不住的打哈欠 Example 1

一次旅行 Example 1

起風 Example 1

休息 Example 1

Lexicon Net with Confusing Phones

• Lexicon net for “天氣熱、打哈欠”
• Canonical form: tian qi re da ha qian
• 16 variant paths in the net

[Figure: lexicon net in which 氣 can be replaced by 記, 熱 by 樂, and 欠 by 見, 嗆, or 降.]
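The count of 16 paths can be reproduced with a small sketch that applies the four confusion rules (re → le, qi → ji, ci → zi, an → ang) to each canonical syllable. The prefix/suffix interpretation of the rules, the helper names, and the tiny whitelist of valid syllables are assumptions made for illustration; a real system would check candidates against the full Mandarin syllable inventory.

```python
from itertools import product

# Confusion rules from the slides, read here as prefix or suffix substitutions
# (this reading is an assumption).
RULES = [
    ("prefix", "re", "le"),
    ("prefix", "qi", "ji"),
    ("prefix", "ci", "zi"),
    ("suffix", "an", "ang"),
]

# Hypothetical whitelist; it filters out non-syllables such as "tiang".
VALID = {"tian", "qi", "ji", "re", "le", "da", "ha",
         "qian", "jian", "qiang", "jiang"}


def apply_rule(syllable, rule):
    """Return the rewritten syllable, or None if the rule does not apply."""
    kind, src, dst = rule
    if kind == "prefix" and syllable.startswith(src):
        return dst + syllable[len(src):]
    if kind == "suffix" and syllable.endswith(src):
        return syllable[:-len(src)] + dst
    return None


def variants(syllable):
    """All valid syllables reachable by applying the rules repeatedly."""
    found = {syllable}
    frontier = [syllable]
    while frontier:
        current = frontier.pop()
        for rule in RULES:
            candidate = apply_rule(current, rule)
            if candidate in VALID and candidate not in found:
                found.add(candidate)
                frontier.append(candidate)
    return sorted(found)


canonical = ["tian", "qi", "re", "da", "ha", "qian"]
slots = [variants(s) for s in canonical]   # one sausage slot per syllable
paths = list(product(*slots))              # every path through the net
```

Here qi and re each have 2 variants and qian has 4, so the net has 1 × 2 × 2 × 1 × 1 × 4 = 16 paths.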

Automatic Confusing Syllable Id.

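Input: corpus of Japanese learners of Chinese

1. Run forced alignment to obtain an initial segmentation.
2. Compare each segment against the 411 Mandarin syllables to find the confusable syllables of each sound.
3. Add the confusable syllables to the recognition net, then rerun forced alignment and segmentation.
4. Has the segmentation stopped changing?
   • No: go back to step 2.
   • Yes: output the confusable syllables and the recognition net.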

Error Pattern Identification (EPI)

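Common insertions/deletions from users, with 「朝辭白帝彩雲間」 as the target sentence:

• Ending anywhere, e.g. 「朝辭白帝」
• Starting anywhere, e.g. 「彩雲間」
• Starting and ending anywhere, e.g. 「白帝彩雲」
• Starting and ending anywhere, with skipped characters, e.g. 「白彩雲」
• Repeated character, e.g. 「朝…朝辭白帝彩雲間」
• Repeated word, e.g. 「朝辭…朝辭白帝彩雲間」
• Repeated character with a changed pronunciation, e.g. 「朝 (cao)…朝 (zhao) 辭白帝彩雲間」
• Two characters swapped, e.g. 「朝辭彩帝白雲間」
• Wrong characters, e.g. 「朝辭白帝黑山間」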

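Lexicon Net for EPI (I)

Detects utterances that start at the beginning and end anywhere.

Lexicon Net for EPI (II)

Detects utterances that start anywhere and end at the end.

Lexicon Net for EPI (III)

Detects utterances that start anywhere and end anywhere (without skipping characters).

Lexicon Net for EPI (IV)

Detects utterances that start anywhere and end anywhere, and may skip characters.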
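What each of the four nets accepts can be illustrated at the character level. The sketch below only enumerates or checks the accepted character strings for the target sentence 朝辭白帝彩雲間; it does not build an actual recognition net, and the function names are hypothetical.

```python
TARGET = "朝辭白帝彩雲間"


def net_end_anywhere(target):
    """EPI (I): start at the beginning, end anywhere (all prefixes)."""
    return {target[:k] for k in range(1, len(target) + 1)}


def net_start_anywhere(target):
    """EPI (II): start anywhere, end at the end (all suffixes)."""
    return {target[k:] for k in range(len(target))}


def net_start_end_anywhere(target):
    """EPI (III): start and end anywhere, no skipping (all substrings)."""
    n = len(target)
    return {target[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def accepted_with_skips(target, spoken):
    """EPI (IV): start and end anywhere, skipping allowed (subsequence test)."""
    remaining = iter(target)
    # "ch in remaining" consumes the iterator, so order is enforced.
    return len(spoken) > 0 and all(ch in remaining for ch in spoken)
```

For example, 白彩雲 is rejected by net (III) but accepted by net (IV), which is why the more permissive net is needed to catch skipped characters.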

Design Philosophy of Lexicon Nets

• We need to strike a balance between recognition accuracy and the coverage of the lexicon net.
• In the extreme, we can have a net for free syllable decoding to catch all error patterns.
• However, the benefit of free syllable decoding is offset by its not-so-high recognition rate.

Scoring Methods for Speech Assessment

Five phone-based scoring methods
• Duration-distribution scores (durDis)
• Log-likelihood scores (hmmLike)
• Log-posterior scores (hmmPost)
• Log-likelihood-distribution scores (likeDis)
• Rank ratio scores (rkRatio)

All are based on forced alignment to segment phones.

Method 1: Duration-distribution Scores

PDF of phone duration
• Obtained from forced alignment
• Normalized by speech rate
• Fitted by a log-normal PDF
• Max PDF ↔ score 100
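A minimal sketch of a duration-distribution score follows: a log-normal PDF is fitted to (already rate-normalized) durations of one phone, and a new duration is scored by its PDF value relative to the PDF maximum, so that the peak maps to 100. The linear scaling between PDF value and score, and the sample durations, are assumptions; the slides only state that the maximum PDF corresponds to a score of 100.

```python
import math


def fit_lognormal(durations):
    """Maximum-likelihood fit: mean and std of the log-durations."""
    logs = [math.log(d) for d in durations]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return mu, math.sqrt(var)


def lognormal_pdf(d, mu, sigma):
    z = (math.log(d) - mu) / sigma
    return math.exp(-0.5 * z * z) / (d * sigma * math.sqrt(2.0 * math.pi))


def duration_score(d, mu, sigma):
    """Score in (0, 100]; 100 is reached at the mode of the log-normal PDF."""
    mode = math.exp(mu - sigma ** 2)   # location of the PDF maximum
    return 100.0 * lognormal_pdf(d, mu, sigma) / lognormal_pdf(mode, mu, sigma)


# Hypothetical rate-normalized durations (seconds) of one phone in a corpus.
training_durations = [0.08, 0.10, 0.12, 0.09, 0.15, 0.11]
mu, sigma = fit_lognormal(training_durations)
peak = math.exp(mu - sigma ** 2)
```

Durations far from the typical value, whether too short or too long, receive a low score.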

Method 2: Log-likelihood Scores

Log-likelihood of phone $q_i$ with a duration of $d$ frames:

$$\hat{l}_i = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log p(y_t \mid q_i)$$

where $p(y_t \mid q_i)$ is the likelihood of the $t$-th frame with the observation vector $y_t$.

Method 3: Log-posterior Scores

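Log-posterior of phone $q_i$ with duration $d$:

$$\hat{\rho}_i = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log P(q_i \mid y_t)$$

where

$$P(q_i \mid y_t) = \frac{p(y_t \mid q_i)\, P(q_i)}{\sum_{j=1}^{m} p(y_t \mid q_j)\, P(q_j)}$$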
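The log-likelihood and log-posterior scores can be computed side by side from a table of frame likelihoods. In this sketch, `frame_likelihoods[t][j]` stands for $p(y_t \mid q_j)$ on the frames of one aligned segment; in a real system these values would come from the HMM acoustic models, and the numbers below are made up.

```python
import math


def log_likelihood_score(frame_likelihoods, i):
    """Average log p(y_t | q_i) over the d frames of the segment."""
    d = len(frame_likelihoods)
    return sum(math.log(frame[i]) for frame in frame_likelihoods) / d


def log_posterior_score(frame_likelihoods, priors, i):
    """Average log P(q_i | y_t), with the posterior obtained by Bayes' rule."""
    d = len(frame_likelihoods)
    total = 0.0
    for frame in frame_likelihoods:
        evidence = sum(p * prior for p, prior in zip(frame, priors))
        total += math.log(frame[i] * priors[i] / evidence)
    return total / d


# Hypothetical likelihoods for a 2-frame segment and m = 3 phones.
frames = [[0.06, 0.03, 0.01],
          [0.05, 0.025, 0.025]]
priors = [1 / 3, 1 / 3, 1 / 3]
like = log_likelihood_score(frames, 0)
post = log_posterior_score(frames, priors, 0)
```

The posterior normalizes each frame by the competing phones, so it is less sensitive than the raw likelihood to overall changes in signal level or channel.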

Method 4: Log-likelihood-distribution Scores

• Use the CDF of a Gaussian fitted to the log-likelihoods
• CDF = 1 ↔ score = 100
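As a sketch, the mapping can be written with the standard error function. The per-phone mean and standard deviation of the log-likelihood are assumed to have been estimated beforehand; the values used here are hypothetical.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def likelihood_distribution_score(log_likelihood, mu, sigma):
    """Map a phone's log-likelihood to 0..100 via the Gaussian CDF."""
    return 100.0 * gaussian_cdf(log_likelihood, mu, sigma)


# Hypothetical statistics of the log-likelihood for one phone.
MU, SIGMA = -60.0, 5.0
typical = likelihood_distribution_score(-60.0, MU, SIGMA)
good = likelihood_distribution_score(-50.0, MU, SIGMA)
```

A log-likelihood equal to the mean scores 50, and the score approaches 100 as the CDF approaches 1.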

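Method 5: Rank Ratio Scores

Rank ratio:

$$rr(q_j) = \frac{\mathrm{rank}(q_j) - 1}{\#\text{ of competing phones} - 1}$$

RR to score conversion:

$$\mathrm{score}(q_j; a, b) = \frac{100}{1 + \left( \dfrac{rr(q_j)}{a} \right)^{b}}$$

where parameters a, b are phone specific.

Possible sets of competing phones for the biphone x+y:
• *+y
• *+*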
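A small worked example of the rank ratio and its conversion to a score is sketched below. The competing phones, their log-likelihoods, and the parameter values a = 0.25 and b = 2 are hypothetical; in the slides, a and b are phone specific.

```python
def rank_ratio(target, log_likelihoods):
    """Rank of the target among competing phones, scaled to [0, 1].

    log_likelihoods maps each competing phone (including the target) to its
    log-likelihood on the aligned segment; at least two phones are required.
    """
    ordered = sorted(log_likelihoods, key=log_likelihoods.get, reverse=True)
    rank = ordered.index(target) + 1          # best phone has rank 1
    return (rank - 1) / (len(ordered) - 1)


def rank_ratio_score(rr, a, b):
    """Convert a rank ratio to a score: rr = 0 gives 100, rr = a gives 50."""
    return 100.0 / (1.0 + (rr / a) ** b)


# Hypothetical competing phones for the target "s".
competing = {"s": -42.0, "sh": -40.5, "z": -44.0, "c": -47.0, "x": -45.5}
rr = rank_ratio("s", competing)               # "s" is ranked 2nd of 5
score = rank_ratio_score(rr, a=0.25, b=2)
```

The parameter a is the rank ratio at which the score drops to 50, and b controls how sharply the score falls around that point.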

Examples of Rank Ratio Scores

[Figure: waveform of a recorded utterance with forced-alignment labels and rank ratio scores, overall score = 91.49. Syllable scores (phone scores in brackets): 一 (yi) 100 [i 100]; 寸 (cun) 83 [c 50, u 100, nn 100]; 想 (xiang) 100 [x 100, i 100, a 100, ng 100]; 思 (si) 63 [s 13, ii 100]; 一 (yi) 100 [i 100]; 寸 (cun) 100 [c 100, u 100, nn 100]; 灰 (hui) 100 [h 100, u 100, e 100, i 100]. Lower panel: pitch contours over time (Pitch1: unbroken; Pitch2: segmented).]

Demo of Our Prototype

ASR toolbox http://mirlab.org/jang/matlab/toolbox/asr

Command: goDemoSa.m

Intro. to Learning to Rank

Learning to rank
• A supervised learning algorithm which generates a ranking model based on a training set of partially ordered items. (A task somewhere between classification and regression.)

[Figure: a rank function maps a set of items (Items 1, 2, 3, 7, 9) to a list ordered by preference.]

Learning to Rank: Methods and App.

Methods
• Pointwise (e.g., Pranking)
• Pairwise (e.g., RankSVM, RankBoost, RankNet)
• Listwise (e.g., ListNet)

Applications
• Webpage ranking
• Machine translation
• Protein structure prediction

Application of LTR to SA

Why use LTR for SA?
• Human scoring is rank-based
  • Tsing Hua's grading system is moving from scores (0~100) to ranks (A, B, C, D, …).
• Combination of features (scores)
  • Features are complementary.
• Effective determination of ranking
  • LTR only generates numerical output whose ordering is as close as possible to the correct order, so the output still has to be converted into ranks.
  • An optimal DP approach is proposed for this conversion.

LTR Score Segmentation

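Given: LTR scores $s = [s_1, s_2, \dots, s_n]$ (sorted) with desired ranks $r = [r_1, r_2, \dots, r_n]$.

We want to find the separating scores $\theta = [\theta_1, \theta_2, \dots, \theta_{m-1}]$, with score-to-rank function $\mathrm{s2r}(s; \theta)$, such that

$$J(\theta) = \sum_{i=1}^{n} \left| r_i - \mathrm{s2r}(s_i; \theta) \right|$$

is minimized.

[Figure: the sorted score axis s is split by the separating scores $\theta_1, \dots, \theta_4$ into Rank 1 to Rank 5; the desired rank of each score is plotted above the axis.]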

LTR Score Segmentation by DP (I)

Formulate the problem in the DP framework
• Optimum-value function $D(i, j)$: the minimum cost of mapping $s_1, s_2, \dots, s_i$ to ranks $1, 2, \dots, j$
• Recurrent equation:

$$D(i, j) = |r_i - j| + \min \{ D(i-1, j),\; D(i-1, j-1) \}$$

• Boundary condition: $D(1, j) = |r_1 - j|,\; j \in [1, m]$
• Optimum cost: $D(n, m)$

LTR Score Segmentation by DP (II)

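Recurrent formula:

$$D(i, j) = |r_i - j| + \min \{ D(i-1, j),\; D(i-1, j-1) \}$$

Local constraint: $D(i, j)$ is reached from $D(i-1, j)$ or $D(i-1, j-1)$.

[Figure: DP grid with the desired ranks $r_1, \dots, r_{13}$ on the horizontal axis and the computed ranks 1 to 5 on the vertical axis. The separating scores read off the optimal path are the midpoints $(s_2 + s_3)/2$, $(s_6 + s_7)/2$, $(s_8 + s_9)/2$, and $(s_{11} + s_{12})/2$ for the first to fourth boundary.]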
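The DP-based score segmentation can be sketched as follows, using the recurrence, boundary condition, and optimum cost stated on the slides. The function name, the backtracking code, and the choice of midpoints between neighboring scores as separating scores are written for illustration and are not taken from the authors' implementation.

```python
def dp_segment(scores, ranks, m):
    """Map sorted scores to non-decreasing ranks 1..m with minimum total
    absolute difference from the desired ranks.

    Returns (cost, assigned_ranks, separating_scores).
    """
    n = len(scores)
    inf = float("inf")
    # D[i][j]: minimum cost of mapping scores[0..i] with scores[i] at rank j.
    D = [[inf] * (m + 1) for _ in range(n)]
    back = [[0] * (m + 1) for _ in range(n)]
    for j in range(1, m + 1):                  # boundary condition
        D[0][j] = abs(ranks[0] - j)
    for i in range(1, n):
        for j in range(1, m + 1):
            stay = D[i - 1][j]                 # previous score has rank j
            step = D[i - 1][j - 1] if j > 1 else inf   # or rank j - 1
            if stay <= step:
                best, prev = stay, j
            else:
                best, prev = step, j - 1
            D[i][j] = abs(ranks[i] - j) + best
            back[i][j] = prev
    # Backtrack from the optimum cost D(n, m).
    assigned = [0] * n
    j = m
    for i in range(n - 1, -1, -1):
        assigned[i] = j
        j = back[i][j]
    # A separating score is placed midway between scores of adjacent ranks.
    separating = [(scores[i] + scores[i + 1]) / 2
                  for i in range(n - 1) if assigned[i] != assigned[i + 1]]
    return D[n - 1][m], assigned, separating


cost, assigned, separating = dp_segment(
    [10, 20, 30, 40, 50, 60], [1, 1, 2, 2, 3, 3], m=3)
```

Because the desired ranks are used during the search, this segmentation is supervised, unlike clustering the scores with k-means.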

LTR Score Segmentation with DP (III)

[Figure: DP alignment of vec1 and vec2 (ranks 1 to 5 over about 250 items), DP total distance = 23. Panels: data distribution (class vs. x1) and DP path.]

Flow Charts of Our Experiment

Corpora for Experiments

WSJ0
• 8000 training utterances, 84 speakers
• Used for training biphone acoustic models for forced alignment

MIR-SD
• Recordings of about 4000 multi-syllable English words by 22 students (12 females and 10 males) with an intermediate competence level
• Originally designed for stress detection
• Available at http://mirlab.org/dataSet/public

Human Scoring of MIR-SD

Human scoring
• Only 50 utterances from each speaker of MIR-SD are scored by 2 humans, making a total of 1100 utterances.

Human scores are consistent:

Correlation     Inter-rater   HR1-GT   HR2-GT
Word-based      0.58          0.84     0.89
Speaker-based   0.78          0.96     0.93

Score distribution:

Score        1     2     3     4     5
Frequency    110   198   259   409   124
Percentage   10%   18%   24%   37%   11%

Examples of MIR-SD

Level 5 apparent, paragraphic, constellation

Level 3 additive, timorous, availably

Level 1 ambiguity, auxiliary, anachronism

Performance Indices

Performance indices used in the literature, illustrated with hr = [1 3 5 4 2 2] and cr = [2 3 5 2 1 4]:
• Recognition rate (rRate) = 33.33%
• Recognition rate with tolerance 1 (rRateT1) = 66.67%
• Average absolute difference (AADiff) = 1
• Correlation coefficient (Corr) = 0.54
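The four indices can be checked on the example vectors with a few lines of code. This sketch assumes hr holds the human ranks and cr the computed ranks, and uses the Pearson correlation coefficient; the function name is hypothetical.

```python
import math


def performance_indices(hr, cr):
    """Return (rRate, rRateT1, AADiff, Corr) for two equally long rank lists."""
    n = len(hr)
    diffs = [abs(h - c) for h, c in zip(hr, cr)]
    r_rate = sum(d == 0 for d in diffs) / n        # exact matches
    r_rate_t1 = sum(d <= 1 for d in diffs) / n     # matches within 1 rank
    aa_diff = sum(diffs) / n                       # average absolute difference
    mean_h, mean_c = sum(hr) / n, sum(cr) / n
    cov = sum((h - mean_h) * (c - mean_c) for h, c in zip(hr, cr))
    var_h = sum((h - mean_h) ** 2 for h in hr)
    var_c = sum((c - mean_c) ** 2 for c in cr)
    corr = cov / math.sqrt(var_h * var_c)          # Pearson correlation
    return r_rate, r_rate_t1, aa_diff, corr


r_rate, r_rate_t1, aa_diff, corr = performance_indices(
    [1, 3, 5, 4, 2, 2], [2, 3, 5, 2, 1, 4])
```

Rounded to two decimals, the results are 0.33, 0.67, 1.0, and 0.54, matching the values on the slide.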

Performance Evaluation of Different Scoring Methods

Method    Index     Raw     DP-based            k-means
                    score   inside   outside    inside   outside
durDis    Corr      0.209   0.217    0.189      0.202    0.194
          rRate     -       0.342    0.309      0.281    0.276
          rRateT1   -       0.783    0.771      0.701    0.696
          AADiff    -       0.906    0.942      1.109    1.122
hmmLike   Corr      0.120   0.168    0.102      0.144    0.154
          rRate     -       0.325    0.306      0.258    0.255
          rRateT1   -       0.780    0.757      0.692    0.689
          AADiff    -       0.928    0.973      1.158    1.165
hmmPost   Corr      0.084   0.297    0.265      0.192    0.216
          rRate     -       0.344    0.330      0.170    0.162
          rRateT1   -       0.811    0.798      0.565    0.561
          AADiff    -       0.862    0.893      1.494    1.499
likeDis   Corr      0.141   0.160    0.125      0.141    0.143
          rRate     -       0.316    0.308      0.247    0.247
          rRateT1   -       0.789    0.774      0.665    0.671
          AADiff    -       0.924    0.948      1.207    1.203
rkRatio   Corr      0.240   0.232    0.198      0.229    0.236
          rRate     -       0.333    0.316      0.269    0.268
          rRateT1   -       0.789    0.779      0.699    0.698
          AADiff    -       0.898    0.929      1.120    1.124

LTR Combination of Scores

Features for LTR
• durDis and rkRatio: raw scores
• hmmLike, hmmPost, likeDis: DP segmentation

LTR
• RankSVM
• Linear kernel

Baseline
• hmmPost with DP-based segmentation
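The pairwise idea behind a linear RankSVM can be sketched in plain Python: every pair of utterances with different human ranks yields a feature-difference vector, and a weight vector is trained so that the better utterance gets the higher score. The code below uses simple subgradient descent on the regularized pairwise hinge loss as a stand-in; it is not the RankSVM solver used in the experiment, and the feature values and hyperparameters are hypothetical.

```python
def pairwise_differences(features, ranks):
    """One difference vector x_i - x_j for every pair with rank_i > rank_j."""
    pairs = []
    for i in range(len(features)):
        for j in range(len(features)):
            if ranks[i] > ranks[j]:
                pairs.append([a - b for a, b in zip(features[i], features[j])])
    return pairs


def train_linear_ranker(features, ranks, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized pairwise hinge loss."""
    dim = len(features[0])
    w = [0.0] * dim
    pairs = pairwise_differences(features, ranks)
    for _ in range(epochs):
        for diff in pairs:
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            for k in range(dim):
                # The hinge term is active only when the pair violates the margin.
                grad = lam * w[k] - (diff[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w


def ltr_score(w, x):
    """Numerical LTR output; only its ordering is meaningful."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Hypothetical two-feature vectors (e.g. two normalized phone-based scores).
X = [[0.2, 0.1], [0.4, 0.3], [0.5, 0.6], [0.9, 0.8]]
human_ranks = [1, 2, 3, 4]
w = train_linear_ranker(X, human_ranks)
ltr_scores = [ltr_score(w, x) for x in X]
```

The resulting scores carry ordering information only, which is why a separate score segmentation step is still needed to turn them into ranks.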

Overall Performance Comparison

Legends
• Score segmentation
  • Circles: DP
  • Triangles: k-means
• Inside/outside tests
  • Solid lines: inside
  • Dashed lines: outside
• Black lines: baselines

Summary of the Experiment

Segmentation
• DP (supervised learning) is better than k-means (unsupervised learning).

Performance indices
• The correlation coefficient is not intuitive (consider [4 5 4] and [1 2 1]: the correlation is 1 although no rank matches).
• Recognition rate and sum of absolute differences can be optimized by LTR and DP segmentation.

Demo: Practice of Mandarin Idioms of Length 4 ( 一語中的 )

Level (difficulty) of an idiom is based on its frequency via Google search:
• 孤掌難鳴 ===> 260,000
• 鶼鰈情深 ===> 43,300
• 亡鈇意鄰 ===> 22,700
• 舉案齊眉 ===> 235,000

Can be adapted for English learning

Next step: multi-threading, fast decoding via FSM

Demo: Recitation Machine（唸唸不忘）

Support Mandarin & English

Support user-defined recitation script

Next step: multithreading for recording & recognition

Licensing for PC Applications

For Mandarin, English, Japanese

SA for Embedded Systems

Embedded platforms: PMP, iPhone, Androids

Demo: Tangible Companions

Chicken run (落跑雞)

Penguin for Tang Poetry (唐詩企鵝)

Robot Fighter (蘿蔔戰士)

Singing Bass & Dog (大嘴鱸魚和唱歌狗)

http://tw.youtube.com/watch?v=nFsNOKyEs64

http://neural.cs.nthu.edu.tw/jang/mir/vstar-demo.wmv

http://www.youtube.com/watch?v=BBwTBoIqQe4

http://neural.cs.nthu.edu.tw/jang/mir/mir_demo.mpg

Tools and Tutorials

Tools
• DCPR toolbox: http://mirlab.org/jang/matlab/toolbox/dcpr
• SAP toolbox: http://mirlab.org/jang/matlab/toolbox/sap
• ASR toolbox: http://mirlab.org/jang/matlab/toolbox/asr

Tutorials
• Data clustering and pattern recognition: http://mirlab.org/jang/books/dcpr
• Audio signal processing: http://mirlab.org/jang/books/audioSignalProcessing

Lab page (with demos): http://mirlab.org

Other SA Issues to be Addressed

Core technology
• Other acoustic features for scoring
  • Pitch (tone/intonation), volume, duration, pause, coarticulation
• Error pattern identification
• Credit assignment for sentence-level scores
• Lack of labeled corpora!

Application side
• Multimodal GUI
• Extensions
  • Slight adaptation
  • Paragraph-level SA
  • Text-free SA
• Beyond pronunciation
  • Translation + recognition + assessment
• Microphone types

Examples

Coarticulation
• Knock it off!
• Mom woke her up

Consonant + consonant
• Bus stop
• Push Shirley
• Ask question
• Jeff flew south through Tainan
• Exceptions
  • Change jobs
  • Which chair

Ref: “和英文系學生一起上英語聽說課”, 黃玟君老師

Examples

Changes due to coarticulation
• Would you like it?
• Won't you go?
• Raise your hand.
• It makes you look younger.

Softened sounds
• Junction
• Popcorn
• Fruitful

Can and can't
• I can read the letter.
• I can't read the letter.

d and t
• Better
• Cider

Most Likely to be Mispronounced

Within Taiwan
• Pleasure/pressure
• World/war/word
• Shirt/short
• Walk/work
• Flesh/fresh
• Supply/surprise
• Some/son
• Confirm/conform
• Cancel/cancer
• Mouth/mouse
• Measure/major
• Police/please
• Version/virgin

Conclusions

Conclusions
• SA calls for more cues than ASR
• SA requires techniques from ML/IR
• Multi-modal approach to SA is a must
  • “Popcorn”, “Thursday”

On-going & future work
• Tone recognition & assessment
• Reliable error pattern identification

References

• Witt, S. M. and Young, S. J., “Phone-level Pronunciation Scoring and Assessment for Interactive Language Learning”, Speech Communication 30, 95-108, 2000.
• Kim, Y., Franco, H., and Neumeyer, L., “Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction”, in Proceedings of the 4th European Conference on Speech Communication and Technology (Eurospeech '97), pp. 649-652, Rhodes, 1997.
• Neumeyer, L., Franco, H., Digalakis, V., and Weintraub, M., “Automatic Scoring of Pronunciation Quality”, Speech Communication 30, 83-93, 2000.
• Franco, H., Neumeyer, L., Digalakis, V., and Ronen, O., “Combination of Machine Scores for Automatic Grading of Pronunciation Quality”, Speech Communication 30, 121-130, 2000.
• Cincarek, T., Gruhn, R., Hacker, C., Nöth, E., and Nakamura, S., “Automatic Pronunciation Scoring of Words and Sentences Independent from the Non-Native's First Language”, Computer Speech and Language 23, 65-88, 2009.
• Crammer, K. and Singer, Y., “Pranking with Ranking”, in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2001.
• Joachims, T., “Optimizing Search Engines using Clickthrough Data”, in Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
• Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y., “An Efficient Boosting Algorithm for Combining Preferences”, in Proceedings of ICML, pp. 170-178, 1998.
• Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G., “Learning to Rank using Gradient Descent”, in Proceedings of ICML, pp. 89-96, 2005.
• Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., and Li, H., “Learning to Rank: From Pairwise Approach to Listwise Approach”, in Proceedings of the 24th International Conference on Machine Learning, pp. 129-136, Corvallis, OR, 2007.
• Chen, L.-Y. and Jang, J.-S. R., “Automatic Pronunciation Scoring using Learning to Rank and DP-based Score Segmentation”, submitted to Interspeech 2010.
