
A Speaker Verification System Under The Scope: Alize

Ett system för talarverifiering under luppen: Alize

TOMMIE GANNERT
[email protected]

Master’s Thesis in Speech Technology at TMH
Advisor: Håkan Melin

Examiner: Björn Granström

March 24, 2007


Abstract

Title: A Speaker Verification System Under The Scope: Alize

Automatic Speaker Recognition is the collective name for problems that involve identifying a person or verifying the identity of a person using her voice. The French speaker recognition toolkit Alize from the University of Avignon has been explored from the viewpoint of a toolkit already in use at KTH, called GIVES. The exploration has aimed to study the usability of the toolkit, its features and performance. Alize was found to be comparable with GIVES in terms of performance, and there are only a few differences in functionality between the two. The most notable difference between them is that, for now, the Alize system only uses one type of model (Gaussian Mixture Models), while GIVES can use several types of models (Hidden Markov Models, Gaussian Mixture Models, Artificial Neural Networks).

The paper describes the theoretical background of automatic speaker recognition, Alize as a starting point for the project, the experiments, and results. The results verify that Alize and GIVES are equal in performance, while the evaluation of Alize as a »toolkit» shows that it is not as easy to use as it could have been. It has very little documentation, and for some parts, the documentation does not reflect the current implementation. All in all, Alize works, but to be used by anyone but the authors as a general system for speaker recognition, it needs to become more user friendly and structurally simpler in its design.

Sammanfattning

Title: Ett system för talarverifiering under luppen: Alize

Automatic speaker recognition is the umbrella term for problems that can be solved by identifying a person, or confirming a person's identity, through her voice. The French speaker verification system Alize has been examined, taking as the starting point a system already in use at KTH, called GIVES. The goal of the study has been to examine the system's usability, functionality and performance. In terms of performance, Alize and GIVES are very similar, and little separates their functionality. The largest difference between them is that Alize currently only uses one type of model (Gaussian Mixture Models), whereas GIVES can use several types (Hidden Markov Models, Gaussian Mixture Models, Artificial Neural Networks).

The report describes the theoretical background of automatic speaker recognition, Alize as the starting point of the project, the experiments, and the results. The results support that Alize and GIVES are very similar in terms of performance. The evaluation of Alize as a »toolkit» indicates that Alize could be easier to work with. There is little documentation, and parts of it do not match the current implementation. As a whole Alize works, but to serve as a general system for speaker recognition it needs to be made more user friendly and structurally simpler.


Contents

1 Introduction
   1.1 Project Statement
   1.2 The Art of Automatic Speaker Recognition
   1.3 The Science of Automatic Speaker Verification
      1.3.1 Gaussian Mixture Models in Theory
      1.3.2 Likelihood and A Posteriori Probability
      1.3.3 Expectation Maximization
      1.3.4 Expectation Maximization on GMMs

2 The Method of Speaker Verification
   2.1 Audio Sampling
   2.2 Feature Extraction
   2.3 Energy Filtering
   2.4 Feature Normalization
   2.5 Background Model and Training
   2.6 Testing
   2.7 Score Normalization
      2.7.1 Z-norm
      2.7.2 T-norm
      2.7.3 H-norm
   2.8 Making the Decision
      2.8.1 Types of Errors
      2.8.2 Detection Error Tradeoff
      2.8.3 Equal Error Rate
      2.8.4 Defining the Threshold

3 Alize and LIA_RAL
   3.1 Overview
   3.2 The Alize Library
      3.2.1 Configuration Parser
      3.2.2 File I/O
      3.2.3 Feature Files
      3.2.4 Statistical Functions
      3.2.5 Gaussian Distributions
      3.2.6 Gaussian Mixtures
      3.2.7 Feature Segmentation
      3.2.8 Line-Based File I/O
      3.2.9 Vectors and Matrices
      3.2.10 Managers
   3.3 The LIA_RAL Library
      3.3.1 Energy Filtering
      3.3.2 Feature Normalization
      3.3.3 Background Model
      3.3.4 Target Model
      3.3.5 Testing
      3.3.6 Score Normalization
      3.3.7 Making the Decision
      3.3.8 Common Parameters
   3.4 The Sum of the Parts

4 Building a System
   4.1 Speech Databases
      4.1.1 SpeechDat and Gandalf
      4.1.2 PER
      4.1.3 File Segmentation
      4.1.4 GIVES Software
   4.2 Assembling the Parts
   4.3 GIVES Job Descriptions and Corpus Labels
   4.4 Feature Transformation
      4.4.1 Transformation
      4.4.2 Normalization
   4.5 Structure of the Make-files
   4.6 Run Logs

5 Experiments
   5.1 Data Sets
      5.1.1 Development and Evaluation
      5.1.2 SpeechDat
      5.1.3 Gandalf
      5.1.4 PER
   5.2 System Setups
      5.2.1 Parameter Exploration
      5.2.2 Performance Optimization

6 Results
   6.1 Parameter Exploration
   6.2 Performance Optimization

7 Comments and Conclusions
   7.1 On the Experimental Results
   7.2 Comments On Alize
   7.3 Comments On LIA_RAL
   7.4 Comments on the Project
      7.4.1 Time Disposition
      7.4.2 Missing Tasks
      7.4.3 Prior Knowledge
      7.4.4 Future Work
   7.5 Conclusions
   7.6 Acknowledgements

References

A Base Line Configuration
   A.1 TrainWorld — Background Model
   A.2 TrainTarget — Speaker Models
   A.3 ComputeTest — Segment Testing
   A.4 ComputeNorm — Score Normalization

B Parameter Exploration
   B.1 Configurations
   B.2 Results

C Performance Optimization
   C.1 TrainWorld — Background Model
   C.2 TrainTarget — Speaker Models
   C.3 ComputeTest — Segment Testing
   C.4 ComputeNorm — Score Normalization


Chapter 1

Introduction

This is the final report of a project focused on exploring a software toolkit called Alize and its companion LIA_RAL¹ [1]. Alize is a toolkit for automatic speaker recognition. As such, it contains algorithms which can identify a person from her voice. The LIA_RAL package uses Alize, and it was the interface used during the project. Both packages are currently being developed by the Laboratoire Informatique d'Avignon (the Computer Science Laboratory at the University of Avignon).

¹ Despite major efforts to reveal the meaning of »RAL» in »LIA_RAL», none has been found. The predecessor to Alize/LIA_RAL was named »AMIRAL», and is described as »A Block-Segmental Multirecognizer Architecture for Automatic Speaker Recognition». Hence, it would be logical to assume »RAL» meaning »Recognizer Architecture Library», or similar.

The paper begins with a short introduction to the mathematical statistics used in Alize, and an introduction to the problem domain of automatic speaker recognition, then continues with a brief exploration of the Alize and LIA_RAL implementations.

Using these building blocks, a working setup of Alize is then introduced, with the aim of comparing Alize and GIVES in terms of performance. GIVES (General Identity Verification System) is another system for automatic speaker recognition, developed at KTH. This second part also includes a short description of the databases used during the project. Results from running the system are presented before conclusions and comments.

Frederic Wils was the author of the initial version of Alize. Alize is now maintained together with Jean-Francois Bonastre, the author of LIA_RAL. The toolkits are presented in many papers, most notably [2] and [3].

GIVES was written primarily by Håkan Melin, with additional contributions by various others [4]. In particular, the GMM (Gaussian Mixture Model) part used during this project was written by Daniel Neiberg. Also, two major efforts to create Swedish speaker recognition databases resulted in Gandalf and PER, both of which are attributed to Håkan Melin.

1.1 Project Statement

This project was proposed by Håkan Melin on the website of the Department of Speech, Music and Hearing (Sw. »Tal, musik och hörsel», or TMH). Alize had been presented on various occasions by its authors, and the question was how this relatively new toolkit performed compared to an already existing system at KTH.

The focus of the project was to build a working speaker recognition system using the tools found in Alize and LIA_RAL, and compare the results and algorithms to those of GIVES. Some specific goals were defined in the project prospect, handed in during the first month of the project. The goals were chosen with the available time and previous work in mind:

• Compare the results of an Alize setup with that of Melin’s GIVES setup.

• Evaluate the Alize system on some of the databases available at the department (which GIVES had been evaluated with).

• Describe Alize and LIA_RAL in relation to GIVES.

• Identify what parts of Alize perform better than the equivalent GIVES parts.

1.2 The Art of Automatic Speaker Recognition

In general, speech contains large amounts of information, including gender, feelings, a message, an identity, and physical health. It is often easy for a human to hear these »dimensions» of information independently of each other. A 90-year-old woman will »sound like an old woman», and a tired man will »sound like a tired man», even though they may be delivering the same message. This extra information, which is not really needed for the transmission of the message, is used for estimating the credibility, the relevance and many other properties of the message.

When moving the analysis from the ear and brain to microphones and computers, these dimensions are separated early. Two fields of research have received extra attention in recent years, largely due to the advances in computational speed: voice recognition and speaker recognition. Voice recognition is the art of identifying what a person is saying. This identification should, of course, be tolerant to dialects, illness, and aging. The important information is the message being delivered. If two people say the same sentence, the computer should deliver the same answer.

Speaker recognition, on the other hand, is the art of identifying who is speaking. What is being said is not of importance, and is discarded. In this area of research, dialects are helpful, since they make it easier to distinguish between people. However, as in the case of voice recognition, illness, aging and temporary environmental noise can make it harder for the computer to deliver the identity. For this reason, systems must be constructed in a robust fashion.

There are two different approaches to speaker recognition: text-dependent and text-independent. A text-dependent system requires that the text spoken when using the system be the same as, or a combination of, the text used during training. For simple authentication, where the user starts by stating her identity and uses a simple pass-phrase or code, a text-dependent system is enough. However, other situations require other solutions. For instance, when tracking the identity of speakers in a conference room, it is often necessary to use a text-independent system. The area of most current research is the more generic text-independent type, where no assumptions can be made about what the text contains.

Over the years, researchers have also divided the speaker recognition problem into two sets, describing two different questions to be answered:

Speaker Verification where the input is an audio segment and a speaker model. The question to answer is »Did the speaker utter this segment?» and the output is »yes» or »no».

Speaker Identification where the input is an audio segment and any number of speaker models. The question to answer is »Who uttered this segment?» and the output is an identifier describing which model best fitted the segment.

There is, at least theoretically, a big difference in computational complexity between the two classes. In the verification case, there is no requirement to check other speaker models beside the input model. In practical applications, however, simply looking at one model will not give enough discrimination between models for the decision, since the system would not know anything about speech by looking at this one model alone. In other words, it would not be possible for the system to classify the audio segment if no classes have been defined. Defining classes can be accomplished either by using a pre-calculated composite of models (with or without the given speaker model), or by running tests against other (so-called impostor) models.

Speaker identification requires the algorithm to check a number of models to find the most likely match. This is usually done by some kind of scoring, for instance by applying the audio segment at hand to all speaker models and getting an absolute score. The output of the speaker identification would then be, for instance, the model with the highest score.

Since the LIA_RAL toolkit is targeted at Automatic Speaker Verification (ASV), this paper will not delve deeper into speaker identification.

1.3 The Science of Automatic Speaker Verification

During the last ten years, two implementation strategies have been in focus for research:

Hidden Markov Model (HMM) is the oldest approach and models each speaker as a Markov model.

Gaussian Mixture Model (GMM) where a sum of Gaussian probability distributions is used to model each speaker.

The Hidden Markov Model approach is often used on the phoneme level, where one HMM is used to model a single phoneme with a fixed set of states. In essence, it is a state machine using the input audio frames to determine the next state.

According to Auckenthaler et al. [5], the GMM approach is more efficient than phoneme-based HMMs on text-independent speech. LIA_RAL currently only uses GMMs, which is what this paper will discuss. Other techniques, like artificial neural networks and support vector machines, have also been used in various research projects.

1.3.1 Gaussian Mixture Models in Theory

The GMM approach to speaker recognition is purely statistical. Given a number of speaker audio features (transformed audio samples), a model is created. The creation of a model includes deciding upon fixed model parameters (such as the number of terms, defined later in this section) and training the model on speech data. Other names for Gaussian Mixture Models include »Weighted Normal Distribution Sums» and »Radial Basis Function Approximations».

Two example Gaussian probability density functions (in one dimension) can be seen in Figure 1.1(a). Mathematically they are denoted f_N(x, µ, σ²). When µ = 0 and σ = 1, the density is referred to as a standard Gaussian density. The term normal distribution is a more common name for the Gaussian distribution. However, since GMM contains the word »Gaussian», this paper will use the name »Gaussian distribution».

The parameter µ is the mean of the distribution, and σ² is its variance. Given these parameters and a stochastic variable X, X ∼ N(µ, σ²) is equivalent to

    p(x) = lim_{δ→0} ∫_{x−δ}^{x+δ} f_N(z, µ, σ²) dz

if x is an outcome of X. In essence, p(x) is the probability of seeing an infinitesimal interval around x in an infinite stream of outcomes of X. In this paper, p(x) will be used to denote »the probability of seeing x». This, however, implies »seeing an infinitesimal hypervolume around x» when x is multidimensional.

[Figure 1.1. Three different Gaussian distributions and one mixture: (a) two distributions in R, N(−1, 0.5) and N(1, 1); (b) an independent distribution in R²; (c) a dependent (rotated) distribution in R²; (d) a mixture with three components, approximating the dependent distribution in (c).]

A Gaussian distribution can also be extended to Rⁿ (known as the multivariate Gaussian distribution), where

    p(x) = lim_{δ→0} ∫_{x_n−δ}^{x_n+δ} … ∫_{x_1−δ}^{x_1+δ} f_N(z, µ, Σ²) dz

is the probability of encountering an infinitesimal hypervolume around the vector x. A two-dimensional distribution can be seen in Figure 1.1(b). Note that the mean is now a vector, and the variances are described using the n×n covariance matrix Σ². The use of a full covariance matrix allows the distribution to be rotated around the mean, as in Figure 1.1(c). Restricting Σ to a diagonal matrix² removes this possibility and makes calculations both easier and faster. Alize is capable of both full matrices (dependent dimensions) and diagonal matrices (independent dimensions). However, as we shall see, full matrices are rarely used with GMMs.

² Thus renaming it a variance matrix.

A mixture is a (finite) sum of N distributions,

    p_M(x) = Σ_{i=1}^{N} w_i p_i(x)

where the p_i are individual distributions (Gaussian distributions in the case of GMMs). The weights w_i determine how much influence each distribution has. Note that since 0 ≤ p_M, p_i ≤ 1, the sum of the weights must be 1, or p_M will not be a probability. It is easily seen that a distribution with a large variance and a large weight will have a great influence on the mixture, as illustrated in Figure 1.2. Figure 1.1(d) shows an approximation of Figure 1.1(c) using a GMM with three independent (diagonal variance matrix) distributions.
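To make the mixture concrete, the following C++ sketch (illustrative only, not code from Alize) evaluates the density of a diagonal-covariance GMM at a feature vector x. The per-component computation is done in the log domain to avoid floating-point underflow in high dimensions.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // One diagonal-covariance Gaussian component.
    struct DiagGaussian {
        std::vector<double> mean;  // µ, one entry per feature dimension
        std::vector<double> var;   // σ², the diagonal of the (co)variance matrix
    };

    // A Gaussian mixture: components p_i with weights w_i, where Σ w_i = 1.
    struct DiagGmm {
        std::vector<DiagGaussian> comp;
        std::vector<double> w;
    };

    // Density of a single diagonal-covariance Gaussian in Rⁿ.
    double gaussianDensity(const std::vector<double>& x, const DiagGaussian& g) {
        const double kTwoPi = 6.283185307179586;
        double logDensity = 0.0;
        for (std::size_t d = 0; d < x.size(); ++d) {
            const double diff = x[d] - g.mean[d];
            logDensity -= 0.5 * (std::log(kTwoPi * g.var[d]) + diff * diff / g.var[d]);
        }
        return std::exp(logDensity);
    }

    // Mixture density p_M(x) = Σ_i w_i p_i(x).
    double mixtureDensity(const std::vector<double>& x, const DiagGmm& m) {
        double p = 0.0;
        for (std::size_t i = 0; i < m.comp.size(); ++i)
            p += m.w[i] * gaussianDensity(x, m.comp[i]);
        return p;
    }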

The application of GMMs to speaker verification requires at least two algorithms. First, some way of determining the means, variances and weights is needed, to make the GMM model a speaker’s voice characteristics. Second, a method of »evaluating» audio segments using the model must be found, to verify the identity of the claimed speaker. Since the output of the GMM is a single probability, it has to be transformed into a »yes» or »no». This is most easily accomplished by thresholding p_M(x). This raises two related questions:

1. How are the parameters of the mixture determined?

2. How is the threshold determined?

Assuming a diagonal variance matrix, each mixture component has three parameters: the mean µ_i, the variance σ²_i, and the weight w_i. In a mixture of N distributions, these 3N parameters are usually trained using the Expectation Maximization (EM) algorithm, described later in this chapter.

Thresholding can be thought of as classifying a real value into one of two classes. In the case of probabilities this is often done by using a less-than relation. The threshold can be either floating or fixed. A floating threshold requires the evaluation of audio segments using more than one speaker model and compares the output probabilities. This is a slow approach, however, and it is avoided if possible, as it requires multiple evaluations of the same audio segment.

A fixed threshold is determined during training and only requires evaluation of one or two models during deployment. However, this method requires the models to be robust against noise, audio volume and other variations which could bias the probability calculations. The bias could make impostors likely to succeed for some models, while making it nearly impossible for even the true speaker to succeed on others. Some kind of normalization is needed to ensure the audio quality does not affect the results. Normalization and thresholding will be discussed further in Sections 2.4 and 2.8, respectively.

1.3.2 Likelihood and A Posteriori Probability

Two types of probabilities are used in the training and evaluation of GMMs, representing different properties of the data. The first type is the likelihood, p(x | H_M), or the probability that the data x was generated under hypothesis H_M: the hypothesis that the claimed speaker is the true speaker, assuming that the speaker in question is represented by model M. Given the data from a microphone and a GMM, it is trivial to calculate this probability. In itself, though, this likelihood is of minor importance in speaker recognition, as the interesting problem is the reverse: p(H_M | x), or the a posteriori probability. This is the probability that the claimed identity is the true identity given a speech segment. Fortunately, Bayes' theorem defines a relationship between the likelihood and the a posteriori probability, namely

    p(H_M | x) = p(x | H_M) p(H_M) / p(x)

where two other probabilities are introduced.

The first one, p(H_M), is the a priori probability that the claimed identity is the true identity. This means the system should be biased with information on which identities users most often use, i.e. the system will tend to give frequent users a higher probability than infrequent users. Since most systems are designed for a large speaker database, it is usually assumed that this probability is equal for all speakers, thus merely introducing a scaling factor. Hence, p(H_M) is often disregarded.

[Figure 1.2. A Gaussian Mixture Model in R with three terms: 0.6 · N(x, 0, 2) + 0.3 · N(x, 5, 1) + 0.1 · N(x, −6, 0.5).]

The second probability, p(x), is the likelihood of seeing feature x at all. Thus, a very common feature will have less impact than a rare one. Most often an approximation to p(x), discussed in Section 2.6, is used.

1.3.3 Expectation Maximization

As declared earlier, the most common technique for training the models is to use an Expectation Maximization (EM) algorithm [6]. The problem of training involves using a limited set of samples to determine (at least locally) optimal parameters for a model. In the case of speaker recognition, the samples are usually segments of audio where the speaker utters various sentences. Training of the model is accomplished by iterating over the segments and finding better and better parameters for the model (weights, means and variances). By itself, the EM-algorithm is only a pseudo-algorithm; to be a complete algorithm, the object of maximization must also be given. In the area of speaker recognition it is usually either a likelihood or an a posteriori probability. Thus, in statistical terms, the EM-algorithm is one way of finding maximum likelihood (ML) or maximum a posteriori probability (MAP) estimates for data under a hypothesis.

For the rest of this paper, θ will denote model parameters, θ* denotes optimal parameters, and θ′ is used to denote the new value of θ after a single optimization iteration. Thus, the EM-algorithm iteratively finds the θ′ which maximizes the expected probability of seeing the observed data, given the current parameters θ:

    θ′ = argmax_Θ E( p(x, Y | Θ) | x, θ )

where x is the available data and Y is a stochastic variable denoting some unknown (hidden) data. The expected value is usually called Q(θ), and the calculation of this value is the first part of each iteration.

1.3.4 Expectation Maximization on GMMs

For the purpose of this paper, both MAP and ML variants of the EM-algorithm for GMMs are needed. However, the derivation of the equations is very intricate, and the reader is referred to Bilmes [7] for the ML-version of the algorithm, and to the references provided by Reynolds et al. [8] for derivations of the MAP-version.

Calculating the E- and M-steps for Gaussian mixtures (ML) results in three equations for updating the mixture parameters θ = (w_1, …, w_N, µ_1, …, µ_N, σ_1, …, σ_N) [7]:

    w′_i = (1/n) Σ_{j=1}^{n} p(i | x_j, θ)

    µ′_i = [ Σ_{j=1}^{n} x_j p(i | x_j, θ) ] / [ Σ_{j=1}^{n} p(i | x_j, θ) ]

    σ′²_i = [ Σ_{j=1}^{n} (x_j − µ′_i)² p(i | x_j, θ) ] / [ Σ_{j=1}^{n} p(i | x_j, θ) ]

where w_i is the weight of the i:th distribution and p(i | x_j, θ) is the likelihood that x_j was generated by distribution i, usually the weighted average

    p(i | x_j, θ) = w_i p_i(x_j | θ_i) / Σ_{k=1}^{M} w_k p_k(x_j | θ_k)

The (·)² operator is defined as (·)(·)ᵀ for vectors.
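As an illustration of these update equations (not code from Alize, and one-dimensional for brevity where the equations above are multivariate), one ML EM iteration can be sketched in C++ as:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Parameters θ = (w_i, µ_i, σ²_i) of a one-dimensional GMM.
    struct Gmm1D {
        std::vector<double> w, mu, sigma2;
    };

    double gauss(double x, double mu, double sigma2) {
        const double d = x - mu;
        return std::exp(-0.5 * d * d / sigma2) / std::sqrt(6.283185307179586 * sigma2);
    }

    // One ML EM iteration over the sample set x, returning the updated model θ′.
    Gmm1D emStep(const Gmm1D& m, const std::vector<double>& x) {
        const std::size_t N = m.w.size(), n = x.size();
        // E-step: responsibilities p(i | x_j, θ), the weighted average above.
        std::vector<std::vector<double>> p(N, std::vector<double>(n));
        for (std::size_t j = 0; j < n; ++j) {
            double denom = 0.0;
            for (std::size_t i = 0; i < N; ++i) {
                p[i][j] = m.w[i] * gauss(x[j], m.mu[i], m.sigma2[i]);
                denom += p[i][j];
            }
            for (std::size_t i = 0; i < N; ++i) p[i][j] /= denom;
        }
        // M-step: the three update equations for w′_i, µ′_i and σ′²_i.
        Gmm1D out = m;
        for (std::size_t i = 0; i < N; ++i) {
            double Ni = 0.0, xSum = 0.0, devSum = 0.0;
            for (std::size_t j = 0; j < n; ++j) {
                Ni += p[i][j];
                xSum += p[i][j] * x[j];
            }
            out.w[i] = Ni / n;
            out.mu[i] = xSum / Ni;
            for (std::size_t j = 0; j < n; ++j)
                devSum += p[i][j] * (x[j] - out.mu[i]) * (x[j] - out.mu[i]);
            out.sigma2[i] = devSum / Ni;
        }
        return out;
    }

Repeating emStep a handful of times over the same sample set corresponds to the iteration scheme described at the end of this section.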

In the case of MAP-training, the equations become [8]

    w′_i = (r_w + N_i) / (n + N r_w)

    µ′_i = α^µ_i E_i(x) + (1 − α^µ_i) µ_i

    σ′²_i = α^σ_i E_i(x²) + (1 − α^σ_i)(σ²_i + µ²_i) − µ′²_i

where

    N_i = Σ_{j=1}^{n} p(i | x_j, θ)

    E_i(x) = (1/N_i) Σ_{j=1}^{n} p(i | x_j, θ) x_j

    E_i(x²) = (1/N_i) Σ_{j=1}^{n} p(i | x_j, θ) x_j²

with N as the number of mixture distributions, and r_w and α constants described in Section 2.5.

The algorithm is straightforward: simply repeat the three equations over the same sample set until a convergence criterion is met, or for a given number of iterations. Care must be taken not to over-train the model, i.e. adapt it too tightly to the training data. This would otherwise remove the generalization property of the model, which is the reason to create a model in the first place.


Chapter 2

The Method of Speaker Verification

This chapter will give the reader a brief introduction to the steps of performing automatic speaker recognition. It is assumed the reader is familiar with the basic statistical concepts introduced in the previous chapter.

During speaker verification using GMMs, a few steps are common in nearly all systems. They include:

Audio sampling usually accomplished by means of a microphone and an AD-converter. The audio waveforms are converted into discrete data called samples.

Labeling where training data is manually labeled with the speaker identity.

Feature Extraction where the audio is transformed from the time domain to a better representation of voice characteristics.

Energy Filtering where silence and noise are filtered out.

Normalization where feature vectors are normalized based on energy content.

Training the speaker models so they are able to represent one speaker each.

Testing (or evaluating) unknown audio samples using the trained models.

Score normalization where testing scores are compared and normalized in order to avoid environmental differences between training and evaluation.

Computing the decision on whether the claimed speaker is the true speaker or an impostor.

The »Labeling» step is needed during training, since the GMM approach uses a supervised training style, where the speaker is known for each training segment. This could be semi-automatic, in that the speech segments could be fed into some automatic classifier, thus only requiring the operator to identify one segment per speaker instead of all of them. See Figures 2.1 and 2.2.

[Figure 2.1. The first steps of transforming audio data: feature extraction, energy filtering and normalization, with the speaker identification supplied during training.]

[Figure 2.2. The procedure followed during evaluation of unknown audio: features and models are fed into testing, followed by score normalization and the decision.]

2.1 Audio Sampling

Most ASV (Automatic Speaker Verification) systems use 8 bits and around 8 kHz sampling frequency for their input. The main reason for this is probably that most research is performed on telephone lines, where the bandwidth is even more limited, so sampling at more than 8 kHz would give only minor improvements. However, as noted by Grassi et al. [9], some systems would perform better using a higher sampling rate. This indicates that better performance can still be achieved with higher-quality data. Data stored in A-law format uses a non-linear scale, and 8 bits can be used to represent a 12-bit linear dynamic range, with acceptable impact on detail resolution. To accommodate the higher dynamic range, higher values are more widely spaced than the lower values. This preserves both fine detail and a large dynamic range. It is a common representation for speech.
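A-law expansion itself is a small, well-known routine. The sketch below follows the common public-domain G.711 formulation (it is not code from Alize or GIVES):

    #include <cstdint>

    // Expand one 8-bit A-law byte to a linear 16-bit sample.
    std::int16_t alawToLinear(std::uint8_t aVal) {
        aVal ^= 0x55;                        // undo the even-bit inversion
        int t = (aVal & 0x0F) << 4;          // mantissa
        const int seg = (aVal & 0x70) >> 4;  // segment (exponent)
        switch (seg) {
        case 0:  t += 8; break;
        case 1:  t += 0x108; break;
        default: t += 0x108; t <<= seg - 1; break;
        }
        return static_cast<std::int16_t>((aVal & 0x80) ? t : -t);
    }

Each segment increment doubles the quantization step size, which is exactly the wider spacing of higher values described above.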

2.2 Feature Extraction

The plain audio samples retrieved in the previous step are not directly suitable for speaker recognition. Environmental noise is the most obvious problem, which totally changes the audio waveform. Other things, like the speaker having a cold, usually add noise to the speech, but rarely change the entire spectrum. A more robust representation is needed, and a variation on the Fourier transform, called cepstrum coefficients [10], currently dominates the field. It is defined as FT log FT f, where f is a windowed frame of audio. The length of the frame is usually on the order of 10–30 ms. Normally, the extraction is further improved by using the mel scale (from the word »melody»¹), a frequency scale thought to overcome the difference between the physical frequency and the perceived pitch.

¹ According to http://en.wikipedia.org/wiki/Mel_scale.

The result of a cepstrum transform is cepstral coefficients [11, p. 162], and in the case of a mel scale, they are called Mel-Frequency Cepstral Coefficients, or MFCCs. The low coefficients describe the relationship between formants, while the higher ones describe the source (the glottis) [11, p. 163]. A selection of around 10 values is taken from the lower coefficients. They can be used together with other audio information, like the fundamental frequency, to form features.

In summary, the audio samples are divided into frames. Each frame is transformed into the »cepstrum domain» to obtain simple key figures used as a representation of the characteristics of the speech signal.

2.3 Energy Filtering

In the real world, it is often not possible to assume that each feature file is packed with speech data. Many times, there are pauses, hesitation and environmental noise that cause the speech data to be blurred. To avoid training and testing on feature vectors that are not speech, most systems do energy detection. This detection divides the file into one or more segments, each labeled speech or non-speech. During all subsequent steps, the non-speech segments are ignored.
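A minimal sketch of such a detector, keeping frames whose log energy lies within a fixed margin of the loudest frame in the file (the 30 dB margin is an arbitrary assumption, and production detectors are usually more elaborate):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Mark a frame as speech if its log energy is within marginDb of the
    // loudest frame in the file.
    std::vector<bool> speechFrames(const std::vector<std::vector<double>>& frames,
                                   double marginDb = 30.0) {
        std::vector<double> energyDb(frames.size());
        double maxDb = -1.0e300;
        for (std::size_t t = 0; t < frames.size(); ++t) {
            double e = 1.0e-12;              // guard against log10(0)
            for (double s : frames[t]) e += s * s;
            energyDb[t] = 10.0 * std::log10(e);
            maxDb = std::max(maxDb, energyDb[t]);
        }
        std::vector<bool> isSpeech(frames.size());
        for (std::size_t t = 0; t < frames.size(); ++t)
            isSpeech[t] = energyDb[t] > maxDb - marginDb;
        return isSpeech;
    }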


2.4 Feature Normalization

The effects of the recording environment can be devastating to the results of speaker recognition. To reduce the negative effects of different characteristics from the microphone, the transmission channel and such, the features are normalized for mean and variance. A simple algorithm is to calculate the mean, µ, and variance, σ², of the feature vectors in the entire file. This is under the assumption that the environmental characteristics are constant during the entire recording. Alize also has the option of normalizing individual segments of each file. Each feature vector is then normalized using

    x′ = (x − µ) / σ

Some systems only normalize on the mean, µ. When using cepstral coefficients, this is called Cepstral Mean Subtraction (CMS). Other systems (like the LIA_RAL) normalize on both mean and variance.
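A direct C++ sketch of the file-level variant (assuming the whole file is available in memory; not Alize code):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Normalize every feature dimension over a whole file to zero mean and
    // unit variance, x′ = (x − µ) / σ.
    void normalizeFile(std::vector<std::vector<double>>& features) {
        if (features.empty()) return;
        const std::size_t n = features.size(), dims = features[0].size();
        for (std::size_t d = 0; d < dims; ++d) {
            double mean = 0.0, var = 0.0;
            for (const auto& f : features) mean += f[d];
            mean /= n;
            for (const auto& f : features) var += (f[d] - mean) * (f[d] - mean);
            const double sigma = std::sqrt(var / n);
            for (auto& f : features)
                f[d] = (f[d] - mean) / (sigma > 0.0 ? sigma : 1.0);
        }
    }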

2.5 Background Model and Training

One of the basic ideas of an ASV system is that, given a few training samples for a client, it should be able to classify samples in general. This means the training samples have to be transformed into a more general model. This is called training, or adaptation to training samples. For simple tasks, it can be as simple as just storing the training data and doing pattern matching. However, for the purpose of this project, a GMM is used.

Generally, it would be possible to generate one model per speaker by initializing it with random values and then training it into a usable model for the speaker. However, according to recent research [8], this leads to problems with under-training. If all models were generated independently, some distributions of the mixture would probably not be trained (due to the properties of the common training algorithm used). This leads to models with stochastic properties, where a previously unseen feature vector may be classified according to a randomly generated distribution. Unfortunately, this opens the possibility of generating nearly random answers if the model is heavily under-trained.

To avoid this, a background model is often used. The background model is trained using the EM algorithm (usually on the ML criterion), normally for 5–10 iterations. Training data is collected from a special set of speakers whose only purpose is to represent speech in general. This model is then used as a common starting point for training the various speaker models. By adapting from a single model, the element of chance still exists if the background model was not trained enough, but it will be evenly distributed among all of the speaker models.

The most widely used optimization criterion for speaker model training is to maximize the a posteriori probability that the claimed identity is the true identity given the data (max p(H_M | x, θ)). This is called MAP training, as opposed to ML training, where the criterion is to maximize the likelihood of the data being generated by the model (max p(x | H_M, θ)).

The approach of using a background model, trained with the ML criterion and later adapted to individual speakers using the MAP criterion, is referred to as GMM-UBM, or GMM Universal Background Model, by Reynolds et al. [8]. In fact, in the GMM-UBM, a modified version of the EM algorithm is used for adapting to the speaker speech data. In this version, to update the parameters θ, the simple expressions

    P(i | x_j) = w_i p_i(x_j) / Σ_{k=1}^{M} w_k p_k(x_j)

    N_i = Σ_{j=1}^{n} P(i | x_j)

    E_i(x) = (1/N_i) Σ_{j=1}^{n} P(i | x_j) x_j

    E_i(x²) = (1/N_i) Σ_{j=1}^{n} P(i | x_j) x_j²

are calculated. In essence, they are weights describing which distributions to assign the feature vector to. The parameter update is expressed in

    λ_i = α^w_i N_i / n + (1 − α^w_i) w_i

    w′_i = λ_i / Σ_{i=1}^{N} λ_i

    µ′_i = α^µ_i E_i(x) + (1 − α^µ_i) µ_i

    σ′²_i = α^σ_i E_i(x²) + (1 − α^σ_i)(σ²_i + µ²_i) − µ′²_i

where N is the number of distributions in the mixture. The α_i parameters are calculated using a »relevance factor», r^ρ:

    α^ρ_i = N_i / (N_i + r^ρ)

This is the standard algorithm used to train speaker models in Alize, as described in Section 3.3.4.

The update equations for µ′ and σ′² are equal to the equations described in Section 1.3.4 on EM with a MAP criterion. The equation for w′, however, is different, and has been found to work better than the standard equation from Section 1.3.4 [8].
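A sketch of means-only MAP adaptation in this style, reusing the Gmm1D struct and gauss() helper from the EM sketch in Section 1.3.4 (the relevance factor r is a tuning constant chosen by the caller, not a value taken from Alize):

    #include <cstddef>
    #include <vector>

    // One pass of MAP adaptation of the means only:
    // µ′_i = α_i E_i(x) + (1 − α_i) µ_i with α_i = N_i / (N_i + r).
    Gmm1D mapAdaptMeans(const Gmm1D& ubm, const std::vector<double>& x, double r) {
        const std::size_t N = ubm.w.size(), n = x.size();
        std::vector<double> Ni(N, 0.0), Ex(N, 0.0), post(N);
        for (std::size_t j = 0; j < n; ++j) {
            double denom = 0.0;
            for (std::size_t i = 0; i < N; ++i) {
                post[i] = ubm.w[i] * gauss(x[j], ubm.mu[i], ubm.sigma2[i]);
                denom += post[i];
            }
            for (std::size_t i = 0; i < N; ++i) {
                post[i] /= denom;            // P(i | x_j)
                Ni[i] += post[i];            // N_i
                Ex[i] += post[i] * x[j];     // unnormalized E_i(x)
            }
        }
        Gmm1D speaker = ubm;                 // weights and variances keep UBM values
        for (std::size_t i = 0; i < N; ++i) {
            if (Ni[i] == 0.0) continue;      // unseen component: keep the UBM mean
            const double alpha = Ni[i] / (Ni[i] + r);
            speaker.mu[i] = alpha * (Ex[i] / Ni[i]) + (1.0 - alpha) * ubm.mu[i];
        }
        return speaker;
    }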

It is interesting to note that adapting all of the parameters (w_i, µ_i, σ_i) is not always better than adapting only a subset. After training the background model, most systems only adapt the distribution means, or the means and weights. It has been found [5, p. 44] that adapting anything more than merely the means does not generally improve performance. Instead, the relatively sparse training data can result in over-training² the system. Still, the background model is adapted on all parameters. Since it is initialized with random parameter values, training only a subset of the parameters would leave random values in the model.

² Over-trained meaning »loses its generalization properties».

To avoid cases where the background model is over-trained, the variances are often clamped. Two types of clamping exist: variance flooring and variance ceiling, clamping at the lower and upper bounds, respectively.

2.6 Testing

After training, the preparation of the system is complete. The system should now be able to verify speakers given models and feature vectors. Before defining the test algorithm, however, a decision has to be made as to what the output of the testing should be. The simplest output from a GMM is a single scalar per audio segment and model. More advanced scores can be vectors covering different aspects of the speech. The likelihood ratio, defined by the Neyman-Pearson Lemma [12], is commonly used to calculate the score. It is the optimum score with respect to minimizing false negatives (the risk of rejecting a truth-telling speaker). The ratio is expressed as

    Λ_M = p(H_M | x, θ) / p(¬H_M | x, θ)

    log Λ_M = log p(H_M | x, θ) − log p(¬H_M | x, θ)

where p(¬H_M | x, θ) is the probability of H_M not being the right hypothesis given x (M is not the true speaker of x).

The latter is referred to as the log likelihood ratio (LLR). As most probability calculations in a speaker recognition system involve multiplication and division, the LLR is the type of score generally used. An approximation is often used for p(¬H_M | x), whereby instead of using a class of all models except M, the background model, W, is used. In fact, without this approximation, the LLR would require knowledge about every possible source of sound in the universe, a problematic task. However, for the sake of speaker verification, this implies that H_M is part of ¬H_M even though this is obviously false. This, however, eases computation, as the background model was used during training:

    log Λ_M = log p(H_M | x, θ) − log p(H_W | x, θ)

Likewise, in Section 1.3.2 the a posteriori probability was introduced with the equation

    p(H_M | x) = p(x | H_M) p(H_M) / p(x)

It was noted that p(x) is rarely used in its exact form, as it would require many more samples than needed for the rest of the system. In fact, this too would require knowledge of the entire sound universe. Instead, by assuming that all speakers and channel characteristics that will use the system are modelled in the background model, the simpler likelihood p(x | H_W) can be used. The assumption can be formalized as defining p(H_W) = 1, leading to p(x) = p(x | H_W) p(H_W) = p(x | H_W).
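Putting the approximation to work, a segment score can be sketched as the average frame log-likelihood ratio against the background model, reusing DiagGmm and mixtureDensity from the sketch in Section 1.3.1. The per-frame averaging is a common convention assumed here; it keeps scores comparable across segment lengths:

    #include <cmath>
    #include <vector>

    // log Λ_M for a segment: mean over frames of
    // log p(x_t | target) − log p(x_t | world).
    double segmentLlr(const std::vector<std::vector<double>>& frames,
                      const DiagGmm& target, const DiagGmm& world) {
        double llr = 0.0;
        for (const std::vector<double>& x : frames)
            llr += std::log(mixtureDensity(x, target))
                 - std::log(mixtureDensity(x, world));
        return llr / frames.size();
    }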

2.7 Score Normalization

Even though the previously described steps (sampling, feature transformation, training and testing) are enough for a speaker recognition system, research has shown that systems perform better with some score normalization applied. Its purpose is to make the score as speaker-independent as possible. Allowing an absolute threshold when making the decision permits faster algorithms than using relative thresholds.

During recent years, many techniques have been proposed. Three of the most important are Z-norm, T-norm and H-norm, discussed by Auckenthaler et al. [5]. All three will be briefly described in the following sections.

2.7.1 Z-norm

In Z-norm (Zero Normalization), scores from the target models are normalized by using impostor speech. The idea is to find how likely impostor speech is to have come from the target model. By evaluating a set of impostor segments for each model (as part of the setup), the distribution of impostor scores can be found using the statistical equations

    µ_I = (1/n) Σ_{i=1}^{n} log Λ_{I_i}

    σ_I = sqrt( (1/(n−1)) Σ_{i=1}^{n} (log Λ_{I_i} − µ_I)² )

and the score can be normalized by post-filtering it with

    log Λ_{M,Z} = (log Λ_M − µ_I) / σ_I

where µ_I and σ_I are the statistical properties of the scores from the impostor utterances evaluated on the model M. The advantage of this algorithm is that it can be performed off-line, during training.
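A sketch of the off-line estimation and the post-filtering step (plain C++, not LIA_RAL's ComputeNorm):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Impostor score statistics µ_I and σ_I for one target model.
    struct ZNormStats { double mu, sigma; };

    ZNormStats estimateZNorm(const std::vector<double>& impostorScores) {
        const std::size_t n = impostorScores.size();
        double mu = 0.0, ss = 0.0;
        for (double s : impostorScores) mu += s;
        mu /= n;
        for (double s : impostorScores) ss += (s - mu) * (s - mu);
        return { mu, std::sqrt(ss / (n - 1)) };  // unbiased σ_I, as above
    }

    // log Λ_{M,Z} = (log Λ_M − µ_I) / σ_I
    double zNorm(double score, const ZNormStats& z) {
        return (score - z.mu) / z.sigma;
    }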

2.7.2 T-norm

In T-norm (Test Normalization) [5], the idea is to compare each test utterance against possible impostor models. This results in a set of scores used for normalizing the score's mean and variance:

    log Λ_{M,T} = (log Λ_M − µ_I) / σ_I

where Λ_{M,T} is the T-normed score for model M (for the implied utterance x). Here, µ_I and σ_I are the statistical properties of the scores from the test utterance evaluated on impostor models. Of course, this has to be done on-line for each test utterance and is thus considered a costly approach.

2.7.3 H-norm

The H-norm (Handset Normalization) [13] uses the same technique as Z-norm, except the segments are also divided by channel characteristics before use. When used under conditions where channel characteristics are very dominant (like a database shared between cellular phones and land-line phones), its use can result in a large gain.

[Figure 2.3. Plots showing the graphical difference between the ROC plot and the DET plot: (a) a Receiver Operating Characteristic, with true acceptance rate plotted against false acceptance rate on linear scales from 0 to 1; (b) a Detection Error Tradeoff, with false rejection rate plotted against false acceptance rate on warped percentage scales, marking the point of EER.]

2.8 Making the Decision

The final step of the verification procedure is to actually decide which speaker has spoken the utterance. In the speaker verification case, the input contains a hypothesized model, and the problem is to compare the output of the model to a threshold. The identity hypothesis could, for example, have been taken from a speech recognizer that can identify names from a database.

Instead of finding a threshold, researchers often use the DET-plot and the EER scalar as measures of the performance of their verification system. A short introduction follows, as they will be used to describe the performance of Alize.

2.8.1 Types of Errors

In any speaker verification system, the output can be grouped into four types: True Acceptances, False Acceptances, True Rejections and False Rejections. The first type occurs when a true speaker is accepted, the second when an impostor is accepted. These are also called true and false positives, respectively. The third type occurs when an impostor is rejected, and the last type is the case where a true speaker is rejected.

These error rates are used in various combinations to represent the performance of a speaker verification system. They are, however, rarely used alone, and are often visualized in a diagram.

2.8.2 Detection Error Tradeoff

The DET was first defined by Martin et al. [14] in 1998³. It replaced the ROC (Receiver Operating Characteristic), which had been used for visualizing the relationship between the rates of true positives and false positives. The ROC-curve is not easily understood and is in general not a very good representation of the performance of a speaker verification system, regarding both ease of reading and the ability to draw conclusions from it. It is, however, easy to plot and has therefore been around since the 1940s.

³ Though it had been used informally before that.

In DET-plots, two things have changed. Instead of plotting a failure (false acceptance) against a success (true acceptance), the two possible types of error are plotted: false acceptances and false rejections. The second difference is the scale on the axes. While the ROC-curve uses a linear percentage scale on each axis, the DET-plot uses a transformation of the axes. The input is still a percentage, but by applying a transform, the plot can be made easier to read. Sample ROC- and DET-plots can be seen in Figure 2.3.

The score distributions for true speakers and impostors usually approximate two Gaussians, and it is therefore not surprising that a Gaussian deviate scale⁴ results in a straight line in the DET-plot. This, in turn, makes it easy to spot deviations from the straight line (i.e., where one of the distributions is not Gaussian) and to find differences in variances (where the inclination of the line is not −45°). The use of a DET-plot with Gaussian deviate transformed axes is the de facto standard for visualizing the performance of speaker verification systems.

2.8.3 Equal Error Rate

The Equal Error Rate (EER) is the standard scalar measure of the performance of a biometric verification system, be it based on speech or fingerprints. In essence, it is the point on the DET curve where the False Acceptance Rate and the False Rejection Rate are equal. For current GMM systems presented during the annual NIST SRE⁵ evaluations, this is often in the range of 1%–5%.

Note that this does not necessarily mean that all systems with an EER of 5% let 5% of the impostors in. For high-security tasks, it may be required to set the threshold higher, thus allowing fewer impostors in, with the side effect of producing more false rejections. This is why the DET-plot is called a Detection Error Tradeoff plot, not just a Detection Error plot.
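For concreteness, the EER can be estimated from two score lists by sweeping the threshold and taking the point where the two error rates are closest (a simple approximation that ignores interpolation between score values):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Estimate the EER from genuine (true-speaker) and impostor scores.
    double equalErrorRate(std::vector<double> genuine, std::vector<double> impostor) {
        std::sort(genuine.begin(), genuine.end());
        std::sort(impostor.begin(), impostor.end());
        std::vector<double> thresholds(genuine);
        thresholds.insert(thresholds.end(), impostor.begin(), impostor.end());
        double bestGap = 1.0, eer = 1.0;
        for (double t : thresholds) {
            // FRR: genuine scores below t (true speakers rejected).
            const double frr =
                double(std::lower_bound(genuine.begin(), genuine.end(), t)
                       - genuine.begin()) / genuine.size();
            // FAR: impostor scores at or above t (impostors accepted).
            const double far =
                double(impostor.end()
                       - std::lower_bound(impostor.begin(), impostor.end(), t))
                / impostor.size();
            if (std::abs(far - frr) < bestGap) {
                bestGap = std::abs(far - frr);
                eer = (far + frr) / 2.0;
            }
        }
        return eer;
    }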

2.8.4 Defining the Threshold

The threshold is decided upon during training and can, for instance, be set to the EER point (determined during training) as a starting point. In research, the thresholding step is often ignored, and instead the EER and DET-plot are used to show the overall system performance. During the studies of Alize, no threshold has been deployed, as all results will be presented either as DET plots or as EER values.

⁴ See for instance http://www.amex.com/servlet/AmexFnDictionary?pageid=display&titleid=4288
⁵ National Institute of Standards and Technology Speaker Recognition Evaluation


Chapter 3

Alize and LIA_RAL

The following chapter will focus on the two software packages to be evaluated; both implementation and usage information will be described. Alize version 1.06 and LIA_RAL version 1.3 [1] have been used during the evaluation.

The relationships between, and borders of, Alize and the LIA_RAL are not easily defined. Primarily, Alize is the framework of mathematics, statistics, and I/O on which the LIA_RAL is built. The major categories of functionality that exist in Alize are:

• Command line and configuration file parser

• File I/O abstraction

• Feature file reading and writing

• Statistical functions (including histograms, mean, and variance)

• Gaussian distributions (full and diagonal (co-)variance matrices)

• Gaussian mixtures (including reading and writing to file)

• Feature file labeling and segmentation

• Line-based file I/O

• Vectors and matrices

• Managers for storing named mixtures, feature files and segments

Each of the modules will be discussed further in the next sections. Note that there are no algorithms specific to speaker (or speech) recognition in Alize. Rather, it can be viewed as a generic statistics framework highly focused on the data structures and algorithms used in speech research. Key observations supporting this are the lack of training algorithms and the specialization on Gaussian distributions.

3.1 Overview

LIA_RAL is the package containing all code specific to speaker recognition, although most of the code would also be useful in speech recognition. As the LIA_RAL package was the primary toolbox during the project, much of the following sections will be targeted at the use of Alize in the LIA_RAL.

Code in the LIA_RAL package can be divided into two categories. First, there is a more generic library (referred to as SpkTools), containing the training algorithms and some specializations of the statistical functions found in Alize, applied to feature vectors.

Second, it contains many small programs, designed to work in a pipeline methodology. There is essentially one program per recognition step, as defined in Chapter 2. However, there is no skeleton to implement an entire speaker verification system. Connecting all programs of the pipeline requires reading the documentation, example configuration files and the various unit tests that exist for each program.

Both Alize and the LIA_RAL are implemented in C++, and Alize uses GNU Autotools¹ for platform independence. In this chapter, a more thorough description of the implementation of the packages will be given. It is assumed the reader has knowledge of basic object-oriented development terminology.

3.2 The Alize Library

As stated earlier, Alize can be divided into smaller modules. These will be described from the viewpoint of how they are used by the LIA_RAL.

Note that some functionality of Alize and the LIA_RAL may have been left out of this paper, as not all functions were used during the experiments of this project.

3.2.1 Configuration Parser

The configuration parser and manager can handle Java-style property files² with a slightly stricter syntax. There is also initial support for an XML-based file format. A configuration in Alize is simply a map containing key–value pairs.

In addition to files, the configuration can also be given on the command line. A separate class handles parsing of the command line and inserts the parameters into a configuration. So, having a command line of

--nbTrainIt 3

is roughly equivalent to a configuration file of

nbTrainIt 3

This makes it very easy to define a static configuration file and still allow some dynamic parameters to be defined on the command line.
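Internally, a configuration is thus nothing more than a string-to-string map built from both sources. As a minimal sketch of the concept (class and function names are hypothetical, not the actual Alize API), parsing such a property file could look like:

// Minimal sketch of a key-value configuration map in the Alize style.
// Hypothetical code, not the Alize API.
#include <iostream>
#include <map>
#include <sstream>
#include <string>

std::map<std::string, std::string> parseConfig(std::istream& in) {
    std::map<std::string, std::string> config;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key, value;
        if (fields >> key >> value)
            config[key] = value;  // a later occurrence overrides an earlier one
    }
    return config;
}

int main() {
    std::istringstream file("nbTrainIt 3\n");
    std::map<std::string, std::string> config = parseConfig(file);
    std::cout << config["nbTrainIt"] << "\n";  // prints 3
}

Command line arguments such as --nbTrainIt 3 would simply be inserted into the same map after stripping the leading dashes.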

3.2.2 File I/O

Two classes handle reading from and writing to files. They are mostly a C++-ification of the FileStreams found in Java and use the file API from C as the back-end. One notable difference from Java is that they have functions for swapping bytes, for use with different byte orderings (endianness) depending on processor type. This is most useful when reading feature files created on machines with a different processor type than the current one.
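The byte swapping itself is the usual bit-shuffling trick; a generic 4-byte swap (not the Alize functions, which are not documented here) is enough to illustrate:

// Generic endianness swap for a 32-bit value, as needed when a feature
// file was written on a machine with the opposite byte order.
#include <cstdint>

std::uint32_t swapBytes32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}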

3.2.3 Feature Files

For handling permanent storage of feature vectors, Alize supports some feature file formats, most notably HTK (http://htk.eng.cam.ac.uk/) and SPro4 (http://www.irisa.fr/metiss/guig/spro/). It also supports the transparent concatenation of several feature files into a single stream. This is accomplished using list files, which are text files containing lines with names of other feature files. With this technique, many operations can be extended to support multiple input files during one run.

Another example of its usefulness is for speech databases with speaker data divided into classes. For instance, if training a model requires a single feature file, but each speaker has one recorded file for each digit 0–9, and one with their name, then instead of concatenating the feature files on disk, it can be done in memory with minimal loss in performance. This feature will be used in Section 4.3.
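For the digit example above, such a list file would simply enumerate the per-utterance feature files, one per line (file names hypothetical); Alize then presents their concatenation as one feature stream:

F1025/digit0
F1025/digit1
...
F1025/digit9
F1025/name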

3.2.4 Statistical Functions

Alize can do simple, accumulating statistics. Accumulating vectors and returning resulting mean and variance estimates is possible, as well as feeding the data into a histogram generator. Despite its lack of training algorithms, Alize does have support functions for the EM algorithm in general. All operations are performed in double precision.
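As an illustration of what such accumulation amounts to (a minimal sketch; not the actual Alize statistics classes), a per-component mean and variance accumulator over feature vectors might look like:

// Sketch of per-component accumulation of mean and variance over feature
// vectors, in double precision. Hypothetical code, not the Alize classes.
#include <cstddef>
#include <vector>

class FrameAccumulator {
public:
    explicit FrameAccumulator(std::size_t dim)
        : count_(0), sum_(dim, 0.0), sumSq_(dim, 0.0) {}

    void accumulate(const std::vector<double>& x) {
        for (std::size_t k = 0; k < sum_.size(); ++k) {
            sum_[k]   += x[k];
            sumSq_[k] += x[k] * x[k];
        }
        ++count_;
    }

    double mean(std::size_t k) const { return sum_[k] / count_; }

    double variance(std::size_t k) const {
        const double m = mean(k);
        return sumSq_[k] / count_ - m * m;  // biased (ML) estimate
    }

private:
    std::size_t count_;
    std::vector<double> sum_, sumSq_;
};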

3.2.5 Gaussian Distributions

Alize can handle Gaussians with both full covariance matrices (almost never used with mixtures, as a full-covariance Gaussian can be approximated with a group of diagonal-variance Gaussians) and with diagonal variance matrices.
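With a diagonal variance matrix, the log-density of a D-dimensional feature vector x factors over the components, which is what makes the diagonal case so much cheaper to evaluate:

\[
\log \mathcal{N}(x;\,\mu,\,\sigma^2) = -\frac{1}{2}\sum_{k=1}^{D}\left(\log\!\left(2\pi\sigma_k^2\right) + \frac{(x_k-\mu_k)^2}{\sigma_k^2}\right)
\]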

3.2.6 Gaussian Mixtures

Although Alize is designed to support other distributions than Gaussians, only Gaussians are implemented and all code is targeted to Gaussians. This includes mixtures. There are functions to calculate basic values needed during EM/MAP estimation, like variance matrices and mean vectors from feature vectors. All mixtures, like the distributions, are targeted to multivariate problems, where the input is a vector rather than a scalar.

Mixtures can be read from and written to disk, though a single distribution cannot.
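For reference, the density a mixture assigns to a feature vector x is the usual weighted sum over its M component distributions,

\[
p(x) = \sum_{i=1}^{M} w_i\,\mathcal{N}(x;\,\mu_i,\,\Sigma_i), \qquad \sum_{i=1}^{M} w_i = 1.
\]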

3.2.7 Feature Segmentation

As noted in Section 2.3, there is a need to label audio segments and to filter out segments based on labels. This functionality exists in Alize, with a module working in parallel with the feature file I/O.

3.2.8 Line-Based File I/O

Except for feature files and mixture files, most files have the structure of text files divided into lines and fields. One module of Alize is targeted to reading and writing line/field files. This includes list files, containing the names of feature files to be concatenated, and score files.

3.2.9 Vectors and Matrices

Only some basic vector and matrix algorithms are implemented. Alize does not implement a generic vector or matrix library.

3.2.10 Managers

A group of classes, called Servers in Alize, handle the storage of named entities in memory. The concept is more commonly referred to as maps or managers in other systems. All speaker models, for instance, are keyed on their speaker identity and are stored in the Mixture Server. Features are cached in the Feature Server, and segment labels are stored in the Label Server. This makes it easy for the user to read definitions of speakers and feature files into memory, as there is no need to organize the permanent storage of models manually.

3.3 The LIA_RAL Library

The main purpose of the LIA_RAL is to wrap all algorithms in programs which can be run from a shell script. Note that the LIA_RAL does not contain functionality for feature extraction. All input is assumed to be »usable» feature vectors. The authors of the LIA_RAL use the SPro programs to perform the extraction [3].

A common property of all LIA_RAL programs is that no files are both read and written. Input is read, transformed, and then output to other files. This is very useful when running the system with make-files.

3.3.1 Energy Filtering

The EnergyDetector program reads a list of feature files (or a single feature file) and segments it based on speech energy. The energy measure is assumed to be the first component of each feature vector (usually returned by the SPro programs). A threshold is calculated and subsequently used to discard feature vectors with lower energy than the threshold. In its simplest mode (called MeanStd), the threshold is determined by training a GMM on the energy component and finding the top distribution (the distribution with the largest weight, w_i in Section 1.3.1). This distribution is used to compute the energy threshold as

T = µ_T − α·σ_T

where T is the threshold, (µ_T, σ_T) are the parameters of the top distribution, and α is an empirical constant.

Another, more involved thresholding method, called Weight, is also implemented. It uses the weight of the top distribution as the percentage of frames to pick out. The actual threshold is then determined by building a histogram (with 100 bins) from the feature vectors of the file. The idea is to sum the bins' frame counts from the top (highest value) until the requested percentage of frames has been covered. From this, the threshold is taken to be the upper boundary of the last bin visited. However, in the current implementation, the Weight method is incorrectly implemented and not used.
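A minimal sketch of the intended logic, as described above (hypothetical code, not the broken LIA_RAL implementation), could be:

// Sketch of the Weight thresholding idea: build a 100-bin histogram of the
// per-frame energies and walk it from the top bin downwards. Assumes a
// non-empty energy vector with a positive energy range.
#include <algorithm>
#include <vector>

double weightThreshold(const std::vector<double>& energies, double keepFraction) {
    const int nbBins = 100;
    const double lo = *std::min_element(energies.begin(), energies.end());
    const double hi = *std::max_element(energies.begin(), energies.end());
    const double binWidth = (hi - lo) / nbBins;
    std::vector<int> counts(nbBins, 0);
    for (double e : energies) {
        int bin = static_cast<int>((e - lo) / binWidth);
        if (bin >= nbBins) bin = nbBins - 1;  // put the maximum into the top bin
        ++counts[bin];
    }
    // Accumulate frame counts from the highest-energy bin downwards until
    // the requested fraction of frames has been covered.
    const double wanted = keepFraction * energies.size();
    int accumulated = 0;
    int bin = nbBins - 1;
    for (; bin > 0; --bin) {
        accumulated += counts[bin];
        if (accumulated >= wanted) break;
    }
    // The threshold is taken as the upper boundary of the last bin visited.
    return lo + (bin + 1) * binWidth;
}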

Input parameters include the thresholding method and α. The latter is used in both methods. As with most GMM training in the LIA_RAL, variance clamping (see Section 2.5) can be used to reduce the risk of over-training the (energy threshold) model.

3.3.2 Feature Normalization

The NormFeat program reads a list of feature files (or a single feature file) and normalizes them, either on a segment basis (given label files) or per file. Two modes of operation can be chosen for normalization: mean/variance normalization and Gaussianization. The first mode simply accumulates statistics of all feature vectors and normalizes using the expression in Section 2.4. The second mode performs feature warping [15] to transform the feature vector histogram onto a standard Gaussian distribution. The input parameter is mainly the mode selector.
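The expression in Section 2.4 is presumably the standard per-component transform,

\[
\hat{x}_t[k] = \frac{x_t[k] - \mu[k]}{\sigma[k]},
\]

where \(\mu[k]\) and \(\sigma[k]\) are the mean and standard deviation of component k estimated over the file or labeled segment being normalized.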

3.3.3 Background Model

Training of the background model with background speakers is performed with TrainWorld. The input is a list of feature segments and the output is a single GMM. Initially, all distributions are set to the global variance and mean, and all weights are equal. Then the model is iteratively trained with all feature vectors using the EM algorithm with an ML criterion. For each iteration, the program can normalize the model using a special technique called model warping [2]. The most notable input parameters to TrainWorld are:

• The number of distributions. This figure will determine the number of GMM distributions for the rest of the pipeline.

• A feature selection ratio. Not all feature vectors need to be used during training. Feature selection is based on probability.

• The number of training iterations.

Since the purpose of this program is background training using large amounts of speech data, the problem of over-training is not a major concern. However, code for variance clamping exists and is active in the experiments outlined in Chapter 5. If the speech data is highly incoherent, there is a risk of very high variance values, which is why a variance ceiling should be employed.
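The EM iterations referred to above use, in the ML case, the standard re-estimation formulas (the notation assumes, as in Section 2.5, that \(\gamma_t(i)\) is the posterior probability of component i given frame \(x_t\)):

\[
w_i = \frac{1}{T}\sum_{t=1}^{T}\gamma_t(i), \qquad
\mu_i = \frac{\sum_{t}\gamma_t(i)\,x_t}{\sum_{t}\gamma_t(i)}, \qquad
\sigma_{i,k}^2 = \frac{\sum_{t}\gamma_t(i)\,x_{t,k}^2}{\sum_{t}\gamma_t(i)} - \mu_{i,k}^2
\]

Variance clamping then simply bounds each \(\sigma_{i,k}^2\) to a floor (and, when enabled, a ceiling) after the update.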

3.3.4 Target Model

Once a background model has been created, the TrainTarget program is run to adapt the background model for each speaker (using a modified EM with a MAP criterion) to feature vectors. Input to the program is a list (called an index) with lines like

F1025 F1025/feature1
F1026 F1026/feature1 F1026/feature2

20

Page 27: A Speaker Verification System Under The Scope: Alize Ett system … · Abstract Title: A Speaker Verification System Under The Scope: Alize Automatic Speaker Recognition is the

The LIA_RAL Library

The first column designates the speaker identity; the remaining columns are feature file names. A prefix and a suffix can be set globally to allow for moving to different file formats and locations without having to update the indices.

As in the TrainWorld program, not all feature vectors are necessarily used in the initial training iterations. A selection ratio can be set for using only a subset. The GMMs can be normalized using the model warping algorithm, just as during background model training.

Furthermore, the modified EM algorithm in use (see Section 2.5) can use one of two modes, called »MapOccDep» and »MapConst2», respectively. In the first mode, the α_i family of variables is calculated using a relevance factor, while in the second mode, α_i is fixed. Also, the second mode only allows adaptation based on means. Variances and weights are never trained.
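Assuming the usual relevance-factor formulation of MAP adaptation (the document defers the details to Section 2.5), the »MapOccDep» mean update for component i is

\[
\hat{\mu}_i = \alpha_i\,E_i(x) + (1-\alpha_i)\,\mu_i, \qquad \alpha_i = \frac{n_i}{n_i + r},
\]

where \(E_i(x)\) is the occupancy-weighted data mean, \(n_i\) the soft count of frames assigned to component i, and r the relevance factor; »MapConst2» instead uses a fixed \(\alpha_i\).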

During training, the iterations have been divided into two parts. The first part can make use of the feature vector selection ratio described previously. The second part uses all vectors. This functionality has not been explored in greater depth in this project.

When using T-norm, this program is also used to generate the impostor models. This is no different from generating the »true» speaker models, except that the model features never contain speech feature vectors from the true speaker.

3.3.5 Testing

From a background model, the speaker models and a number of feature files, the ComputeTest program generates score files. The output score files contain one line per test (model and feature file, or segment, pair). The program is a generic probability evaluator and is also used with both Z-norm and T-norm to find scores. It uses the standard LLR measure for scores. One of the very few settings is the number of top distributions to use in the result. By pruning the mixture to, for instance, ten distributions, computational time can be reduced to be independent of mixture size. This is implemented by first calculating the LLK value for the world model, finding the top distributions, and then only evaluating the corresponding distributions in the target models. If only ten of 512 distributions are evaluated for each target model, performance can be greatly improved [8].
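The LLR measure mentioned above is, in its usual frame-averaged form,

\[
\mathrm{LLR}(X) = \frac{1}{T}\sum_{t=1}^{T}\Big(\log p(x_t \mid \lambda_{\text{target}}) - \log p(x_t \mid \lambda_{\text{world}})\Big),
\]

computed over the T frames of the test segment.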

3.3.6 Score Normalization

When using a score normalization technique, like Z-norm or T-norm, the ComputeTest step produces two or three score files. One file is the true speaker test, and one or two contain various impostor tests. ComputeNorm then calculates the actual score normalization. The output is a single, normalized score file.

A restriction can be applied to use only a fixed number of impostor tests during normalization: the cohort size. For T-norm this is the number of impostor models, and for Z-norm it is the number of impostor feature vector segments or files.
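Both normalizations share the same form; for T-norm, for instance, the raw score s of a test segment is shifted and scaled by the statistics of the scores the same segment obtains against the cohort of impostor models:

\[
s' = \frac{s - \mu_I}{\sigma_I}
\]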

3.3.7 Making the Decision

The final decision is made by the Scoring program, though ComputeTest also outputs a decision. It can be configured to always accept exactly one model per segment, for speaker identification tasks. The normal mode, however, is to input a scoring file and a threshold. One of several output modes must also be given, including »NIST», »ETF» and »MDTM». A last mode is also supported which performs a »light version» of T-norm, excluding the model with the highest score. The NIST mode is in compliance with the NIST SRE 2004 evaluation (the National Institute of Standards and Technology Speaker Recognition Evaluation, found at http://www.nist.gov/speech/tests/spk/), while ETF and MDTM are designed for ESTER 2005 (a French evaluation of broadcast news enriched transcription systems, listed at http://www.elda.org/article140.html).

3.3.8 Common Parameters

A number of parameters are common to all LIA_RAL programs. These include settings for verbose output and debug output, but also prefixes and suffixes for various file types. All files are divided into five file types:

• Feature files

• Mixture files

• Label files

• List files

• Index files

Each type of file can be prefixed and suffixed separately. This allows for shorter references from list files and index files, and makes it easier to move to other file formats.

Alize does not automatically detect the file format of a feature file, so a setting for the loading and saving types must be specified. The same is true for mixture files (which can be stored in a compact format, or as XML).

Also, Alize cannot determine the endianness of files, and a configuration directive must be used for this.

All programs that manage mixtures have thresholds for log likelihoods. This threshold is used to clamp the LLK value to an interval, to avoid problems with division by zero.

3.4 The Sum of the Parts

Because the LIA_RAL toolkit is built on top of Alize, it is not possible to use only the LIA_RAL. On the other hand, since Alize in itself is too generic to be an ASV toolkit, the LIA_RAL is needed in order to evaluate the inner workings of Alize. The packages must be used together. Hence, evaluating Alize alone would not be useful for the purpose of ASV. The interesting ASV algorithms are located in the LIA_RAL.

Furthermore, as the LIA_RAL package consists of small programs rather than a library for use in another C++ application, the user interaction with Alize is small. In fact, nearly all user actions are wrapped in LIA_RAL structures, the only exception being the configuration file parser, which is found in Alize.


Chapter 4

Building a System

This chapter contains a description of the system developed and used during the evaluation of the Alize and LIA_RAL packages. It assumes the reader is familiar with the Alize and LIA_RAL toolkits from reading Chapter 3. For the remainder of the paper, the name »Alize» will be used to denote both packages, as they are never used separately, and the intended package should be evident from context.

4.1 Speech Databases

This section will guide the reader through the three speech databases used during this project. The databases are the foundation for proper research in speech. As they must be organized manually, building a database is probably the most time-consuming task in speech research. Since more complex partitionings and tests require more data, it is advantageous if the database can be easily extended.

The three databases are all found at the Department of Speech, Music and Hearing at KTH.

4.1.1 SpeechDat and Gandalf

The first database to be used was the SpeechDat FDB 1000 database. It was used for training background models during the initial stages of the project. FDB 1000 is a database containing 1,000 speakers, where all sessions are recorded using a land-line (fixed) telephone. Though this database is intended for speech recognition rather than speaker recognition, it suits background model training well. Of course, this assumes the target models will be trained with audio having roughly the same channel characteristics. Due to the fact that each speaker has only one session, there is a very limited set of data for each speaker. This, unfortunately, means that the database cannot be used for training and testing in ASV. More information on SpeechDat can be found in Elenius [16].

The Gandalf database was used for training and testing during the initial stages of the project. The database contains a total of 86 speakers, with 48 males and 38 females. The Gandalf database was created for speaker recognition research and has several sessions (up to 29) per speaker. Gandalf was used together with SpeechDat while developing the system. More information on Gandalf can be found in Melin [4, Sec. 6.2].

4.1.2 PER

The final database to be used was the PER database. It is a database of cellular phone, land-line phone, and wide-band microphone recordings. The wide-band microphone samples were collected at the gate leading into the TMH (Speech, Music and Hearing) offices. To support new research in ASV, all participants were divided into three groups: background, target, and impostor. The target and impostor groups were not disjunct. For the background group, 51 male and 28 female speakers were recorded. A separate set of 54 speakers participated as target speakers (38 male and 16 female). Most of the target speakers also participated as impostor speakers, together with some additional new speakers, for a total of 98 impostor speakers (76 male and 22 female). Most of the effort of this project was placed on experimenting with the gate recordings of PER.

PER has been divided into four conditions, depending on the recording setup. These are named

• G8, gate, wide-band microphone in hallway,

• LO, land-line telephone in office,

• MO, cellular telephone in office, and

• MH, cellular telephone in hallway.

Most focus in this project was on the G8 condition, with additional tests run on LO. These are presumed to be of the best quality. More information on PER can be found in Melin [4, Sec. 6.3].

The difference between the MO and MH conditions is perhaps subtle, but theoretically significant. There may be influences by echoing and envelope in a hallway which do not exist in an office environment. To study these differences, Melin [4] included both conditions in his research.

4.1.3 File Segmentation

All three databases had label files already prepared by Håkan Melin using different automatic systems. In the case of SpeechDat and Gandalf, the files were generated by the use of an end-point analyzer, which used an energy measure together with the zero-crossing rate (using an algorithm by B. T. Lowerre, found at ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/tools/ep.1.2.tar.gz) to find the speech contents. The zero-crossing rate detector was set up to include an entire utterance, including any inter-word silences. For most feature files, this results in one speech label per file. Some label files contained no speech label, which had to be corrected (automatically, once the problem was found) as Alize would abort if no feature vectors could be used from a single file.

For the PER database, more detailed label files were available. These files were created using a speech recognizer. The hypotheses from the speech recognizer were compared to an expected utterance (name and digits) and the best hypothesis was selected. The time coordinates and IPA (International Phonetic Alphabet) information from the speech recognizer were recorded into the label file. The files, called tri_bst_02, are described in detail in [4, Sec. 5.6]. To create these files, a tri-phone based speech recognizer was used, and the best hypothesis was selected based on a priori knowledge about the utterance.

4.1.4 GIVES Software

GIVES is the counterpart to Alize and the LIA_RAL at TMH. In an attempt both to narrow the project down and to stay as close to GIVES as possible, it was decided to use the GIVES functionality for feature transformation. Since Alize does not have support for this, but GIVES does, it was a natural choice. This would give Alize the same starting point that GIVES had.

The only program used was AParam, which reads a settings file and audio files and transforms them into feature vector files. The settings file can contain a variety of filters, and is modelled as a directed acyclic graph, where the different stages are named and use input from earlier stages.

4.2 Assembling the Parts

A fundamental design decision was to make the entire system as modular as possible, since it was not known at the start of the project which databases and programs had to be used later. Another aspect was that, since the interface to Alize consists of separate programs, a pipeline architecture seemed like the natural choice. This led to the decision to use the Unix software build utility, make, to build the system.

The project was split into different make-files for each Alize program, to be easier to work with. A master make-file is responsible for connecting the pieces. Together, this includes everything from converting GIVES job descriptions into Alize format, to creating DET plots for the score files. During development, a goal was to minimize the number of recalculations needed when updating a parameter. Changing the list of speaker models used during testing, for instance, should not require remaking the world or speaker models, unless they do not exist or are out-of-date.

Because of the assumed big differences between male and female voices, all work was split to handle them separately. This also implies that no females were tested against male models and vice versa.

4.3 GIVES Job Descriptions and Corpus Labels

The first task at hand was to transform the job descriptions (background model training, speaker model training, and testing) used in GIVES to a format Alize could read. This proved to be a challenge, as the three databases all have somewhat different directory structures, thus requiring one transformation tool each to be written. The utilities to do the transforms were implemented using Bash or Perl, depending on the task.

All job descriptions are under a make-file umbrella. They are fetched from the source location on the network, transformed, and stored on local disk. One of the make-files is also responsible for building Alize list files for the Gandalf database. This had to be done because the Alize test files are »transposed» versions of GIVES job descriptions. This has the unfortunate side-effect that what are single feature files in GIVES have to be concatenated to suit Alize. Hence the usefulness of being able to create list files in Alize (described in Section 3.2.3).

An early decision was made as to how energy filtering of the feature files would be performed. Since all databases already had label files equivalent to energy information, there seemed to be no reason to use the Alize functions for energy filtering. This was in part due to project time constraints, and in part to make the setup as close to GIVES as possible.

Because of the small size and the static nature of label files, they were not under the supervision of a make-file. Instead, they were converted for the entire database at once. This made it possible to do the conversion before the job descriptions were available.

4.4 Feature Transformation

All databases have their audio data stored in the time domain, mostly A-law encoded with 8 kHz sampling frequency and 8-bit resolution (corresponding to 12 bits in linear dynamic range terms). These have to be converted into features before use. The GIVES program AParam was used for this. All files in the databases were transformed in one pass. The main reason for this was that the databases were located on the network, and it was not desirable to stream data across the network during testing. In relation to the other tasks in speaker verification, feature transformation is fast. Each of the three databases has approximately 2.5 GB of feature data in this setup.

4.4.1 Transformation

An illustration of the overall transformation algorithm can be seen in Figure 4.1. The first stage is a band-pass filter bank, set to 24 filters in the range 300–3,400 Hz. The filters are equally spaced on the mel scale and are depicted in Figure 4.2. Output is in the frequency domain. Then, a cosine transform is applied, which reduces the 24 coefficients from the first stage to 12 cepstrum coefficients. Finally, a RASTA [17] filter is applied to reduce the effects of channel characteristics, like echoes and envelope. The output consists of 24 feature components: the 12 components from the RASTA filter and their first-order differentials (discrete derivatives).

Figure 4.1. A schematic view of the feature transformation steps: FBANK → TCOS → RASTA → DELTA.

Figure 4.2. Equally spaced (in mel scale) filters used in the FBANK during feature transformation (amplification versus frequency, 0–3,500 Hz).
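The text does not spell out the mel mapping used by GIVES; the common definition, which equally spaced mel-scale filter banks typically follow, is

\[
m(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right),
\]

with f in Hz.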

The entire feature transformation design was taken from previous GIVES experiments, and is designed specifically for text-independent GMM systems. All feature vectors have 24 coefficients used by the GMMs.

4.4.2 Normalization

To ensure that the features input to Alize were conditioned correctly, all feature files used were normalized using the normalizer found in Alize, NormFeat. In addition to preconditioning, this also served as a check that all feature files needed later could be read.

The NormFeat program is slow (in relation to AParam), so an initial attempt to normalize all files was abandoned. Instead, one of the make-files generates one file per database containing all feature files that were referenced in any job description. Because of the large number of feature files, this was also run in a single pass outside the scope of the make-files. This required some manual effort when using a new job description, but it helped speed up normal operations in return.

4.5 Structure of the Make-files

When all feature files are in place, the rest of the steps are performed with make-files. The marginal time cost of trying a new configuration is therefore reduced to a minimum. The modules with their own make-file are:

• List Generation

• World (Background) Training

• Target Training

• Impostor Training (duplicate of Target Training)

• Testing

• T-norm Testing (duplicate of Testing)


• Score Normalization

• Master

The List Generator has already been described in this chapter. The World Trainer takes an Alize configuration file and a list of feature files containing background speaker audio. The Target Training, in turn, uses the background model previously created, in addition to a configuration file and an index file of speaker/segment groups. The Testing module uses target models, the background model, a configuration file and an index file of segment/speaker groups. Next, the Score Normalizer uses data from Testing and, possibly, from the T-norm Testing to create an overall score file. Finally, the score file is rendered as a DET plot, and an EER scalar is generated, using the tools available on NIST's website (http://www.nist.gov/speech/tools/index.htm) and Gnuplot (http://www.gnuplot.info/).

All of the different parts are run from the Master make-file, which also does some initial checks to see that all configuration files are in place before starting a job. The Master is also able to run a single job in parallel, with the females and males in different branches.

4.6 Run Logs

In order to be able to quickly analyze and otherwise use the results from a run, the key results were stored in a run log. The log used XML to be easily parseable by both humans and machines. Each log contains information on

• start time,

• end time,

• which host had run the job,

• module configuration names,

• the DET-plot points,

• EER values, and

• links to pre-rendered DET plots in EPS and PNG formats created using Gnuplot, for use in this paper and for previewing.

Additionally, links to the raw score files were saved.
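A run log along these lines might look like the following; all element names and values are invented for illustration, as the actual schema is not documented here:

<run host="tmh-server">
  <started>2006-09-01T12:00:00</started>
  <ended>2006-09-01T14:30:00</ended>
  <configs world="worldits-20" target="plain"/>
  <eer gender="female">7.4</eer>
  <eer gender="male">5.9</eer>
  <det gender="female" eps="plots/run76-f.eps" png="plots/run76-f.png"/>
  <scores href="scores/run76.raw"/>
</run>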

Since most of the development was done on non-backed-up space on a TMH server, it was important to save critical data, like test results and configurations. Subversion (found at http://subversion.tigris.org/) was used for revision control, and after each run log was created, it was automatically added to the repository. In the event of a disk crash, this would at least preserve the results and make it possible to rerun the tests after the feature files had been regenerated.



Chapter 5

Experiments

In this chapter, the experimental setup used for evaluating the system described in the previous chapters is presented. Configuration files, experiment hypotheses and data sets will all be discussed in depth.

The single most important task during experimentation was to create a system which could compete with GIVES in terms of EER performance. Since both systems have been under active development in recent years, it was expected that they would work equally well.

However, a few concepts in Alize lack a counterpart in GIVES. The most notable of these is the use of »model warping» (see Section 3.3.3), where every trained model can be normalized after each iteration of the training. Another interesting, albeit not unique, feature is the ability to change the MAP approximation algorithm used during training (see Section 3.3.4). These two features were used as the starting point for the experiments, together with the obvious studies of mixture sizes and the number of training iterations.

5.1 Data Sets

In this section, the data sets used in this project will be described. A data set is a subset of audio files and speakers from a specific database. The division into data sets serves two purposes. First, three sets (preferably disjunct) must be defined, to be given to the background training, the target training and the testing, respectively. Although background training and target training could use the same audio files, this is often avoided when there is enough available data, to avoid over-training. The training data and the testing data must, of course, always be disjunct.

5.1.1 Development and Evaluation

Another often-occurring division is into a development set and an evaluation set. These are depicted in Figure 5.1. Since current research in speaker verification technology requires much manual testing, tweaking, and analysis, most research is done iteratively:

1. Create the system

2. Define some system variables

3. Set these variables

4. Run the system using development data

5. Analyze, and refine the variables

6. Repeat from step four

7. Run the system using evaluation data

8. Use the evaluation output as a performance measure

Figure 5.1. The division of a database into two separate sets (a development set and an evaluation set), each providing data for background training, training and testing.

If only development data were used, the models would be adapted to the training data, and the entire system would be adapted to a local optimum for the testing data of the development set. It may then very well be that the measured system performance is due to an adaptation of the parameters to the development data. Thus, the relevance of the experiment would be questionable.

The use of disjunct development and evaluation sets avoids this problem. Instead of using the output from the development set as a performance measure, a last run, on evaluation data, is made.

However, because the project was targeted at the evaluation of Alize, rather than primarily at finding an optimal configuration, the division into development and evaluation has not been strictly followed. The experiments have been divided into two classes. The first was to explore the parameters of Alize, with a »what happens if» type of investigation. By far, most time was used for this. The second class involved finding an optimal configuration. For this class of experiments, SpeechDat was used together with Gandalf for development, and PER was used for evaluation. During parameter exploration, all three databases were used to a greater or lesser extent.

It should be noted that all sets were provided by the project supervisor in the form of GIVES job descriptions. This helped ensure that the bases of the experiments on both GIVES and Alize are the same. The names of the data sets used in this paper are also used by Melin [4]. They have been preserved for reference.

5.1.2 SpeechDat

The set of background speech used is called bs3w2, where from each session one digit sequence, three sentences and two words are included. The set contains a total of 561 unique female sessions and 399 unique male sessions, and a total of 3,303 female feature files and 2,358 male feature files. Because 40 of the speakers in the SpeechDat set are also available in Gandalf, these were removed using a list of exception sessions.

5.1.3 Gandalf

From Gandalf, a data set resembling the data set from the PER database, described later, was used. During training, the set fs0xd5/per-fiver was used. It contains five repetitions of a sentence (simulating the names in PER) and five repetitions of a ten-digit sequence from a single session. For testing, the set 1fs+1r4-fs0x-SSOnly was used. It contains one sentence (again, simulating a name) and four consecutive digits. Furthermore, it only tests voices of the same gender (it is assumed in most ASV research that it is easy to classify a voice as male or female).

In the end, the training set contained 18 female and 22 male speakers, with a total of 266 female feature files and 329 male feature files. The testing set contained another 532 female feature files and 437 male feature files. Using the models and testing set, 419 female and 508 male true speaker tests were performed. The testing set also contained 36 female and 46 male impostor tests, for use in T-norm score normalization.

5.1.4 PER

From PER, three sets were used: background, speaker training, and testing. The background set was called E2a-gen. Each background speaker contributes approximately 30 seconds of speech. For training, the accompanying set E2a was used. Finally, the testing set was called S2b. Since these sets are divided into conditions, they all contain at least four subsets (see Section 4.1.2).

As a comparative example, the G8 condition of the E2a-gen data set contained 277 female and 508 male feature files. These were used when training the female and male background models. The training set contained 16 female and 38 male true speakers, plus 28 female and 51 male impostor speakers.

Tests were performed with 1,029 female and 3,614 male true speaker tests. Impostor tests, for T-norm, comprised 208 female and 913 male tests. A complete description of these sets, including the other three conditions, can be found in Melin [4, App. C].

5.2 System Setups

To test Alize and LIA_RAL, a system had to be built (see Chapter 4), and parameters for the system had to be chosen. It was decided early in the project that the experiments should be divided into two groups: parameter exploration and performance optimization. The first of these, the »parameter exploration», was intended as a hands-on crash course in speech research, for learning how sensitive the various parameters were and what impact they had on the results.

On the other hand, the normal approach when evaluating a hypothesis (or a system) in speech research is to create a configuration with as high performance as possible. To show how Alize compares to GIVES, or any other system using Swedish databases, it was therefore also important to run performance optimization tests, where the goal was to have Alize and GIVES perform equally well under roughly equal conditions.

5.2.1 Parameter Exploration

A baseline setup was created with a configuration as close as possible to the configuration used with Alize during NIST's SRE 2004 evaluation [3], and named plain. Most of the parameters were taken from the example scripts available at the LIA_RAL website [1]. However, some parameters were found to be unused by the programs, while others were lacking in the configuration file. They also had to be regrouped to support the use of make-files in place of a simple shell script, thus supporting an iterative approach rather than a »waterfall».

For the baseline setup, GMMs with 512 distributions were used. The background model was trained using ten iterations. Speaker model training had the model warping feature enabled and did three iterations using the »MapOccDep» method, with a regression (relevance) factor of 14 (see Section 3.3.4). The same configuration applies to impostor model training (used for T-norm). During testing, the top ten distributions were used for LLR calculations. Finally, in the case of T-norm, it was applied with selection type »noSelect», i.e. all T-norm score data were used. A complete listing of the plain configuration can be found in Appendix A.

Both the G8 (gate) and LO (telephone) PER conditions were used, and the basic idea was to change one parameter at a time in the configuration and study the effects on the EER. Table 5.1 lists the tests performed. Of these tests, the empirically most interesting is the nodistnorm attempt, as it explores the model warping feature, currently not available in GIVES. Differences between the baseline configuration and the various exploration configurations are shown in detail in Appendix B.

Name          Description
plain         baseline configuration
notnorm       without T-norm
nodistnorm    without model warping
halftopdists  5 LLR distributions instead of 10
cohort        a selection of T-norm models was used
allworld      use all features during world training
dists/1024    1024 distributions instead of 512
dists/256     256 distributions
dists/128     128 distributions
dists/90      90 distributions
dists/64      64 distributions
mapreg/7      reg-factor 7 instead of 14
trainits/5    5 training iterations instead of 3
trainits/2    2 training iterations
worldits/20   20 world training iterations instead of 10
worldits/15   15 world training iterations
worldits/5    5 world training iterations

Table 5.1. Tests performed during the parameter exploration phase.

While testing GMM sizes, the 90-distribution GMM was added to investigate the effects of a size that is not a power of two. This test was set up to verify that Alize makes no assumptions about GMM sizes.

5.2.2 Performance Optimization

To see how the system performed under development conditions similar to GIVES, a classical development–evaluation test was performed. The system was trained with bs3w2 (SpeechDat) background models and fs0xd5/per-fiver (Gandalf) speaker models, and tests were performed with 1fs+1r4-fs0x-SSOnly (Gandalf). Using the knowledge from the parameter exploration runs of how various parameters influence the end result, a local optimum was found for the development data. Finally, the PER sets E2a-gen, E2a, and S2b were used to run the evaluation and to gather EER values. These tests, too, were performed on the G8 and LO conditions separately.

Three generations were run on development data, with three different configurations per generation. The first generation optimized the number of training iterations, and also helped decide whether to use model warping or not. The second generation was used to optimize the MAP regression factor, used in the speaker model training algorithm. The final generation was set up to see if using only a subset of the feature vectors during world training would be better.

After each generation, one configuration was selected by hand. The selection was based on the EER for both females and males. In the end, one configuration was left, and this was used for the final run on the PER data sets. The full final configuration is documented in Appendix C.


Chapter 6

Results

This chapter lists the results from the experiments described in the previous chapter. In the first section, the experimental results from the parameter exploration are presented. They are to be interpreted as guidance on the sensitivity of the various parameters, and served as a starting point for the performance optimization tests presented last.

6.1 Parameter Exploration

The key results for the parameter exploration on the G8 data set are visualized in Figure 6.1. A number of observations can be made from this plot and from Figure 6.2, corresponding to the LO data set. The underlying numerical results are presented in Table B.1.

First of all, three runs in Figure 6.1 show improvement in all data sets relative to plain: allworld, trainits/5, and worldits/15. It seems like the baseline configuration leads to under-trained models. Increasing the number of feature vectors used during background training (allworld) or increasing the number of training iterations (worldits/20) results in better matches, though the improvement in the case of more background data is marginal. There also seems to be a limit on the useful number of background iterations, as is shown in the LO case, with a small change in results between 15 and 20 world iterations. This is further indicated by the increase in EER when decreasing the number of iterations.

Figure 6.1. The EER values, in percent, for all runs (female and male) on the G8 data set. The horizontal lines show the values for the plain setup.

Figure 6.2. The EER values, in percent, for all runs (female and male) on the LO data set. The horizontal lines show the values for the plain setup.

Figure 6.3. EER values, in percent, for six GMM sizes (64–1,024 distributions), for the G8 and LO data sets, female and male.

Second, removing the model warping (nodistnorm) gives an EER improvement. Finally, any removal of T-norm data results in a performance degradation (notnorm and cohort).

A closer look at the number of distributions, seen in Figure 6.3, reveals that, although a greater number of distributions does improve performance, using more than 512 distributions gives only slightly better EER values. The exception to this is the LO male set, which seems to continue to improve even beyond 1,024 distributions.

Comparing the G8 and LO results, it can be seen that the relative changes over the baseline configuration are generally smaller for the LO (telephone) set. In Figure 6.4, the EER as a function of the number of training iterations is shown. Figure 6.4(a) shows the number of world training iterations, while Figure 6.4(b) depicts the number of speaker model training iterations. For all data sets in the world training iteration tests, raising the number of iterations above 15 results in only a small improvement, in relation to the major improvement for iteration numbers below 10. Changing the number of speaker training iterations, on the other hand, seems to give little change to the EER in all cases.

Figure 6.4. EER values, in percent, versus the number of training iterations (G8 and LO, female and male): (a) four different numbers of world training iterations; (b) three different numbers of target speaker training iterations.

Figure 6.5. DET-plots (false rejection rate versus false acceptance rate) for the female tests of two configurations (G8 and LO): (a) the plain results; (b) the worldits/20 results.

In Figure 6.5, standard DET-plots show details on the plain (baseline) and worldits/20 configurations. Generally, performance is significantly better for the G8 condition. The curves are almost straight and have a 45° inclination, as is normally expected from DET-plots. Unfortunately, the sparse data available results in low resolution at the extremes of the plots.

Note that all tests were run on the baseline configuration with only one parameter changed at a time; thus the results presented are in no way optimal. For instance, an increase in the number of world iterations would most probably result in better overall performance.

6.2 Performance Optimization

Experimental results on the four conditions are presented in Figure 6.6, with only the most successful runs of each generation plotted. The EER values of these plots, together with comparative results from the GMM subsystem of GIVES where available, are shown in Table 6.1. Note that GIVES normally runs a combination of an HMM subsystem and a GMM subsystem; hence, GIVES tends to perform better overall. To make the comparison more fair, only the results from the GMM subsystem have been shown. Complete results from GIVES can be found in Melin [4, Sec. 10.4]. The GIVES figures for MO and MH are approximate, as they also cover the HMM subsystem.

Figure 6.6. DET-plots (false rejection rate versus false acceptance rate) of the final run of the performance optimization experiments, for all four conditions: (a) females; (b) males.

             Alize                     GIVES GMM
          Female   Male   Average      Average
G8          5.6     3.9     4.8          4.2
LO         14.      6.8    11.           5.2
MO         10.      7.0     6.2          5.8
MH          5.5    11.      8.1          6.4
Average     8.9     5.9     7.4          5.4

Table 6.1. EER-values, in percent, for the final run.

Except for the MH condition, there seems to be consensus that males are easier to verify correctly. On the other hand, there are big differences in how well the two genders are verified. Ordering the male tests with lowest EER first results in G8, LO, MO, and MH, while an ordering based on the female tests results in MH, G8, MO, and LO. Hence, the claim that cellular phones (MH and MO) are harder to verify cannot be established from these results.

It is interesting to note the close relationship between the extremes: on the one hand G8, with a wide-band microphone in a hallway, and on the other hand MH, with its narrow-band microphone and transmission channel. For the female voices they are almost equal, while for the male voices they are the most extreme.

Additional analysis was done on the LO and MH conditions, as their performance is much lower than that of GIVES. Histograms were plotted for the score distributions in the four cases (two conditions with two genders). The results are presented in Figures 6.7 and 6.8.

As expected, the cases with high EER also have scores in a small interval. The females of LO and the males of MH both have very little separation between true and impostor speakers. Unfortunately, the explicit cause of this could not be found within the time limits of the project.

Figure 6.7. Score histograms (number of tests versus score, for true-speaker and impostor tests) of the LO condition: (a) females; (b) males.

Figure 6.8. Score histograms (number of tests versus score, for true-speaker and impostor tests) of the MH condition: (a) females; (b) males.


Chapter 7

Comments and Conclusions

This final chapter will focus on the conclusions from the experimental results, the author's experiences, and general comments on the systems used during the project. A few ideas for future work are also presented.

7.1 On the Experimental Results

It should be noted that even though the performance optimization was performed with development experiments on Gandalf and the evaluation run on PER, the parameter exploration still used PER. This can reduce the reliability of the experimental results. However, to get acquainted with the research area, the parameter exploration was necessary. The rationale for not using Gandalf was the greater amount of data available in PER, and the many types of data sets already available.

It is interesting to see that removing the model warping, a feature not available in GIVES, results in a minor decrease in EER on the system with higher audio quality (and more data), and no change in the telephone case. Since the use of Alize with model warping has previously shown little performance improvement on NIST data [2], the author is sceptical about the importance of this algorithm.

A straightforward conclusion from the results is that cohort selection during score normalization, where only a few T-norm models are taken into account, is not useful. This is perhaps due to the limited number of impostor models created from the data sets.

In other respects, the results of the experiments have been as expected. Differences exist between the three data sets used, but they were anticipated due to the statistical methods used. A single statistical model cannot model all data sets equally well. This is especially true for low-quality audio such as the PER LO data set.

7.2 Comments On Alize

There are many aspects to using Alize, including its class hierarchy, design patterns and performance. From a computer science point of view, Alize leaves much to be desired. It is a compact statistical kernel, but with a specialization towards speaker verification. However, the design patterns used are mostly taken from Java or similar languages, and do not follow the »C++ way» of solving problems. There is, for instance, no use of templates. This means that each new type of vector that needs to be used must also be implemented. Nothing uses the Standard Template Library, which means the kernel itself is much larger than necessary.

Further, the file format readers for feature files are highly specialized and use an intricate system of class inheritance instead of separating the file format readers and making them individual. This leads to problems when a new format has to be read, and this will be most obvious when the need arises outside of LIA (Laboratoire Informatique d'Avignon). The readers and writers also do not make use of the standard C++ streams for I/O, but have a reimplemented Java-like streams interface, which uses the standard C library FILE stream.

The feature server imposes some unnecessary restrictions on its buffer size. During the experimentation, a major hurdle was that the feature normalization program (NormFeat) failed to write features. After much debugging, it turned out that this was due to a too-small feature buffer. As there is no real need to buffer all feature files (due to independence between files), this was not an obvious cause of error.

Since very little of the code is unique to speaker recognition, Alize should be transformed into a generic statistics and I/O library, to better reflect its purpose. Alternatively, the LIA_RAL could be integrated into Alize to form one library. A reimplementation of Alize using the STL and C++ streams would probably make the library both smaller and easier to overview. The apparent Java style is not well suited to C++ and results in more code, and code which is harder to maintain.

7.3 Comments On LIA_RAL

As it is the LIA_RAL package that contains most of the speaker recognition algorithms, it seems to be the most natural platform for continued research. It contains unused code for Hidden Markov Models (HMM) and is probably better maintained than Alize. The current implementation does, however, have its drawbacks.

The use of a generic configuration map (key–value) as an argument to every function makes syntax checking problematic. It is possible that an erroneous configuration parameter is simply silently ignored, because of the lack of syntactic and semantic verification. This is most apparent in the handling of the verbose and debug parameters. They are treated differently in different programs: some programs test for the existence of the configuration parameters, while others use the boolean value of the parameters. This means that, for some programs, a configuration line of verbose false still means »be verbose». This is most confusing, and it is hard to spot in the source code, as the entire free-text configuration is passed around without checking.

A problem found late in the project was the lack of valid exit codes from the programs. Normally, a program can return an integer value between -128 and 127, with zero (by convention) meaning »success». However, it was found that none of the programs used in the make-files report errors. Though the errors are reported as text on the terminal, the exit code is always zero. This implies that some parsing of the output has to be done in order to determine whether a run was successful or not. During experimentation, the output of one run showed an EER of 0.2% because not all tests had been run, even though all programs (and thus the entire run) reported success (the real figure turned out to be 7% in a more successful rerun).

On the positive side, the use of prefix and suffix configuration parameters for file access proved to be very useful. Moving between different data sets and file formats was very easy. A straightforward generalization of this concept would be to allow an escape sequence (something like a variable substitution) in a single free-form string, instead of separate prefixes and suffixes. This change would also reduce the number of configuration parameters.

7.4 Comments on the Project

This section collects some thoughts on the project itself.



7.4.1 Time Disposition

The original time estimate did not reflect the actual work. Due to part-time work (50%), the project was not completed on time (due December 2005). This is unfortunate and could presumably have been avoided had the project been worked on full-time. It was further delayed by periods when the other work required more than 50% of the time.

In retrospect, working on the project involved seven tasks:

1. Reading and understanding the ASV field of research.

2. Understanding Alize and LIA_RAL.

3. Converting three TMH databases to Alize format.

4. Creating the make-files that build a speaker verification system.

5. Creating a batch job manager.

6. Running tests and evaluating the systems.

7. Presenting the work.

The total duration of the project was from May 2005 to February 2007. During 2005, the first three tasks were prioritized. Understanding Alize and LIA_RAL, as well as ASV itself, required more time than originally estimated. A large amount of time was spent reading source code and literature during 2005, but some make-file writing and the conversion of SpeechDat were also performed during this time.

In Spring 2006, a batch job manager was created to assist in running tests. As each test produces two DET-plots and two EER-values, it was necessary to structure the job archive for easy searching. The manager also made it possible to queue jobs and to detach the user interface from the server, making the setup more fault tolerant than running the jobs directly at the terminal.

When the job manager was working, the make-files were completed (Summer 2006).

Finally, testing and writing the paper were the major tasks of Autumn 2006. This is revision 76 of the paper.

7.4.2 Missing Tasks

The idea of connecting parts of GIVES with parts of Alize was not pursued during the project. When it was found that the performance of Alize was comparable to that of GIVES, it was decided to drop this idea, originally described in the project prospectus.

Because of the high EER seen in some cases in the results, more work should have been spent on drawing conclusions from the results. However, as the MO and MH conditions were only run very late in the project, there was no time to pursue the causes.

7.4.3 Prior Knowledge

Before this project, a course in speech technology (Sw. »Talteknologi») given at TMH had been completed. As this course was wide rather than deep, the author felt an urge to expand his knowledge of speech recognition. Also, some courses in machine learning (»Artificiella Neuronnät» and »Maskininlärning») were completed prior to this project, and proved somewhat useful when dealing with GMMs.

7.4.4 Future Work

There are aspects of Alize still not explored in the context of Swedish and the databases available at TMH. The most notable is the HMM module which, at least in the version used for this project, was not fully implemented. Furthermore, a few helper programs, such as the turn detector or the energy detector, could be of interest.

As noted in Melin [4], the performance of a system may change if the prompting behavior changes. Thus, a conversion to a real-time classification system may lead to different results. Due to restrictions in the current LIA_RAL implementation, this track was not pursued.


Another library more focused on machine learning is Torch (available at http://www.torch.ch/) from IDIAP, Switzerland. However, at the time of writing, that project seems to have stalled. It contains far more types of models than Alize does, for instance support vector machines. Exploration of this toolkit could be the basis for a new Master's Thesis project. Many other toolkits are used in the NIST Speaker Recognition Evaluations (SRE), and there seems to be little comparative data on these projects beyond the pure performance measures of the NIST SRE.

7.5 Conclusions

All-in-all, the Alize and LIA_RAL toolkits seem to fulfill their intended behavior in most aspects. They do not appear to include any performance-critical features not already available in GIVES. However, the vague documentation available makes them hard to use, and much time must be spent reading source code to find answers the documentation does not provide.

Performance-wise (EER), the system is comparable to GIVES with GMMs. This is not surprising, as they use the same algorithms. When GIVES is used with its HMM subsystem, GIVES performs better. Currently, Alize does not offer any HMM support, and it would be unfair to compare the two systems on results other than those from the GMM systems. The model warping feature was not found to be useful in the experiments of this project, and in some cases it reduced performance somewhat.

However, the Alize and LIA_RAL toolkits are still young and need more time and effort to become a useful alternative to GIVES. Using parts of Alize together with GIVES should be possible, given the modular nature of Alize. Intermixing is merely a matter of writing conversion routines for the representations of scores and mixture models. This project has not shown any part of Alize to perform better than the equivalent part in GIVES.

7.6 Acknowledgements

Many thanks to Håkan Melin for patiently answering many questions and also for reviewing and proofreading the paper in detail. Also thanks to Daniel Neiberg at TMH for shedding light on the GMM part of GIVES and its configuration. This project would not have been possible without their time and knowledge.

Finally, thanks go to Martin Nygren and Ibrahim Ayata for their time spent proofreading the paper.



References

[1] Bonastre, J.-F., Wils, F., Alize: a free, open tool for speaker recognition, http://www.lia.univ-avignon.fr/heberges/ALIZE/.

[2] Bonastre, J.-F., Scheffer, N., Fredouille, C., Matrouf, D., NIST'04 Speaker Recognition Evaluation Campaign: New LIA Speaker Detection Platform Based on Alize Toolkit, NIST Speaker Recognition Evaluation 2004.

[3] Bonastre, J.-F., Wils, F., Meignier, S., Alize, A Free Toolkit for Speaker Recognition, Proceedings of ICASSP 2005.

[4] Melin, H., Automatic Speaker Verification On Site and by Telephone: Methods, Applications and Assessment, Doctoral Thesis, Department of Speech, Music and Hearing, KTH, 2006.

[5] Auckenthaler, R., Carey, M., Lloyd-Thomas, H., Score Normalization for Text-Independent Speaker Verification Systems, Digital Signal Processing 10 (2000), pp. 42–54.

[6] Dempster, A.P., Laird, N.M., Rubin, D.B., Maximum-Likelihood From Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Ser. B, 39, 1977.

[7] Bilmes, J., A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, International Computer Science Institute, 1998.

[8] Reynolds, D.A., Quatieri, T.F., Dunn, R.B., Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing, 2000.

[9] Grassi, S., Ansorge, M., Pellandini, F., Farine, P.-A., Distributed Speaker Recognition Using the ETSI AURORA Standard, Proc. 3rd COST 276 Workshop on Information and Knowledge Management for Integrated Media Communication, pp. 120–125, 2002.

[10] Bogert, B.P., Healy, M.J.R., Tukey, J.W., The Quefrency Analysis of Time Series For Echoes: Cepstrum, Pseudo-Autocovariance, Cross-cepstrum, and Saphe Cracking, Proceedings of the Symposium on Time Series Analysis, Ch. 15, pp. 209–243, 1963.

[11] Holmes, J., Holmes, W., Speech Synthesis and Recognition, 2nd ed., Taylor and Francis, 2001.

[12] Jiang, H., Confidence Measures For Speech Recognition: A Survey, Speech Communication, Volume 45, No. 4, pp. 455–470, 2005.

[13] Reynolds, D.A., Comparison of Background Normalization Methods for Text-Independent Speaker Verification, Proceedings of Eurospeech, 1997, pp. 963–966.


[14] Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M., The DET Curve In Assessment of Detection Task Performance, Proceedings of Eurospeech 1997, Volume 4, pp. 1895–1898.

[15] Pelecanos, J., Sridharan, S., Feature Warping for Robust Speaker Verification, Proc. ISCA Workshop on Speaker Recognition - 2001: A Speaker Odyssey, 2001.

[16] Elenius, K., Experiences From Collecting Two Swedish Telephone Speech Databases, International Journal of Speech Technology, Volume 3, 2000.

[17] Hermansky, H., Morgan, N., Bayya, A., Kohn, P., RASTA-PLP Speech Analysis, ICSI Technical Report TR-91-069, Berkeley, 1991.


Appendix A

Baseline Configuration

This appendix is intended as a reference for the Experiments chapter. It is a listing of the configuration directives used for the baseline setup (plain) during the parameter exploration phase.

Note that some configuration directives are dynamically generated in the calling make-file. However, these are directives for search paths and gender declaration, and should not be of importance for the results. Also note that these files were taken from the PER configuration. Similar files exist for the SpeechDat/Gandalf combination. Finally, the TrainTarget and ComputeTest configurations for impostor model training are verbatim copies of the target models, hence they have been left out.

Note that the exclamation point present on the debug and verbose lines is a fix to a problem described in Section 7.3.

A.1 TrainWorld — Background Model

featureFilesPath ../../per/param/eval02/
mixtureFilesPath ./gmm/
labelFilesPath ../../per/labels/energy/eval02/
loadMixtureFileFormat RAW
loadMixtureFileExtension .gmm
saveMixtureFileFormat RAW
saveMixtureFileExtension .gmm
loadFeatureFileFormat SPRO4
loadFeatureFileExtension .norm.spro
bigEndian false
featureServerBufferSize ALL_FEATURES
distribType GD
frameLength 0.01
vectSize 24
labelSelectedFrames speech
!debug true
!verbose true
fileInit false
inputWorldFilename xxx
normalizeModel true


mixtureDistribCount 512
baggedFrameProbabilityInit 0.04
maxLLK 200
minLLK -200
baggedFrameProbability 0.1

nbTrainIt 6
initVarianceFlooring 0.5
initVarianceCeiling 10.0

nbTrainFinalIt 4
finalVarianceFlooring 0.5
finalVarianceCeiling 10.0

A.2 TrainTarget — Speaker Models

distribType GD
loadMixtureFileExtension .gmm
saveMixtureFileExtension .gmm
loadFeatureFileExtension .norm.spro
maxLLK 200
minLLK -200
bigEndian false
saveMixtureFileFormat RAW
loadMixtureFileFormat RAW
loadFeatureFileFormat SPRO4
featureServerBufferSize ALL_FEATURES
featureFilesPath ../../per/param/eval02/
mixtureFilesPath ./
labelFilesPath ../../per/labels/energy/eval02/
labelSelectedFrames speech
frameLength 0.01
mixtureServer /unwritable.ms        Hardcoded, not used in TrainTarget.cpp
!verbose true
!debug true
vectSize 24

nbTrainIt 3
nbTrainFinalIt 0
MAPAlgo MAPOccDep
MAPRegFactor 14
baggedFrameProbability 1
normalizeModel true

A.3 ComputeTest — Segment Testing

distribType GD
saveMixtureFileExtension .gmm


loadMixtureFileExtension .gmm
loadFeatureFileExtension .norm.spro
maxLLK 200
minLLK -200
bigEndian false
saveMixtureFileFormat RAW
loadMixtureFileFormat RAW
loadFeatureFileFormat SPRO4
featureServerBufferSize ALL_FEATURES
featureFilesPath ../../../data/per/param/eval02/
lstPath ./
labelFilesPath ../../../data/per/labels/energy/eval02/
!verbose true
!debug true
vectSize 24
labelSelectedFrames speech
frameLength 0.01

topDistribsCount 10
computeLLKWithTopDistribs COMPLETE
segmentalMode completeLLR

A.4 ComputeNorm — Score Normalization

verbose true
debug true
maxScoreDistribNb 500
maxIdNb 1000
maxSegNb 5000
selectType noSelect
tnormCohortNb 0
znormCohortNb 0


Appendix B

Parameter Exploration

This appendix contains the configurations and raw results from the parameter exploration tests. They are presented here for reference, and as a guide to the relevance of various dimensions in speaker recognition algorithms.

B.1 Configurations

All of the tests were performed by changing the value of one baseline parameter. For this reason, only the value of the changed parameter will be presented here. For the rest of the configuration, refer to Appendix A.

notnorm

The modules impostor, tnorm and score were completely removed. The plain configuration used all modules.

nodistnorm

The property normalizeModel was set to false for the modules world, target, and impostor. The property was set to true in the plain configuration.

halftopdists

The property topDistribsCount was set to 5 for the modules test and tnorm (10 in the plain configuration).

cohort

The property selectType was set to selectNBestByTestSegment, and tnormCohortNb was set to 15 for the module score (all score values were used in the plain configuration).

allworld

The property baggedFrameProbability was set to 1 for the module world. This property was 0.1 (10%) in the plain configuration.


dists/n

The property mixtureDistribCount in module world was set to the corresponding value n. The mixture size was 512 distributions in the plain configuration.

mapreg/7

The property MAPRegFactor was set to 7 for the modules target and impostor (as opposed to 14 in the plain configuration).

trainits/n

The property nbTrainIt in modules target and impostor was set to the corresponding value n. The number of training iterations in the plain configuration was set to 3.

worldits/n

The property nbTrainFinalIt in module world was set to the corresponding value n. This property had the value 4 in the plain configuration.

B.2 Results

                   EER (%)                  Change (%)
                G8          LO           G8              LO
Name            F     M     F     M      F       M       F      M
plain           7.8   7.3   16.7  9.0    0.0     0.0     0.0    0.0
notnorm         8.5   8.3   16.0  12.0   9.0     13.7    -4.2   33.3
nodistnorm      7.5   7.1   16.7  9.0    -3.9    -2.7    0.0    0.0
halftopdists    8.2   7.3   17.2  9.5    5.1     0.0     3.0    5.6
cohort          8.9   8.7   16.7  10.5   14.1    19.2    0.0    16.7
allworld        7.8   7.2   16.2  8.9    0.0     -1.4    -3.0   -1.1
dists/1024      7.5   7.6   16.6  8.2    -3.9    4.1     -0.6   -8.9
dists/256       7.2   7.0   17.4  9.7    -7.7    -4.1    4.2    7.8
dists/128       8.7   7.5   16.6  10.1   11.5    2.74    -0.6   12.2
dists/90        8.7   8.1   17.1  10.3   11.5    11.0    2.4    14.4
dists/64        9.7   8.2   18.3  11.5   24.4    12.3    9.6    27.8
mapreg/7        7.0   6.0   17.4  8.5    -10.26  -17.81  4.19   -5.56
trainits/5      7.4   6.9   16.2  8.7    -5.1    -5.5    -3.0   -3.3
trainits/2      8.0   7.0   18.4  10.3   2.6     -4.1    10.2   14.4
worldits/20     5.2   3.6   12.2  7.4    -33.3   -50.7   -27.0  -17.8
worldits/15     5.9   4.1   12.2  7.2    -24.4   -43.8   -27.0  -20.0
worldits/5      17.2  33.3  30.9  36.6   121.0   356.0   85.0   307.0

Table B.1. To the left, the resulting EER values, in percent, for the parameter exploration tests. To the right, the relative change of EER, in percent over plain. Lower is better in both cases.


Appendix C

Performance Optimization

This appendix is intended as a reference for the Experiments chapter. It is a listing of the configuration directives used during the final run of the performance optimization experiments.

Note that some configuration directives are dynamically generated in the calling make-file. However, these are directives for search paths and gender declaration, and should not be of importance for the results. Also note that these files were taken from the PER configuration. Similar files exist for the SpeechDat/Gandalf combination. Finally, the TrainTarget and ComputeTest configurations for impostor model training are verbatim copies of the target models, hence they have been left out.

Note that the exclamation point present on the debug and verbose lines is a fix to a problem described in Section 7.3.

C.1 TrainWorld — Background Model

# created by AlizeActiveJob
# Wed Nov 29 10:26:52 CET 2006
baggedFrameProbabilityInit 0.04
saveMixtureFileFormat RAW
initVarianceCeiling 10.0
nbTrainIt 5
loadFeatureFileFormat SPRO4
featureServerBufferSize ALL_FEATURES
mixtureFilesPath ./gmm/
finalVarianceFlooring 0.5
featureFilesPath ../../per/param/eval02/
nbTrainFinalIt 10
maxLLK 200
loadMixtureFileExtension .gmm
vectSize 24
saveMixtureFileExtension .gmm
mixtureDistribCount 512
baggedFrameProbability 0.2
initVarianceFlooring 0.5
loadFeatureFileExtension .norm.spro
labelFilesPath ../../per/labels/energy/eval02/


loadMixtureFileFormat RAW
bigEndian false
fileInit false
finalVarianceCeiling 10.0
labelSelectedFrames speech
minLLK -200
inputWorldFilename xxx
normalizeModel false
distribType GD
frameLength 0.01

C.2 TrainTarget — Speaker Models

# created by AlizeActiveJob
# Wed Nov 29 10:26:52 CET 2006
mixtureServer /unwritable.ms
saveMixtureFileFormat RAW
nbTrainIt 5
MAPRegFactor 7
loadFeatureFileFormat SPRO4
featureServerBufferSize ALL_FEATURES
mixtureFilesPath ./
featureFilesPath ../../per/param/eval02/
nbTrainFinalIt 0
maxLLK 200
loadMixtureFileExtension .gmm
vectSize 24
saveMixtureFileExtension .gmm
baggedFrameProbability 1
loadFeatureFileExtension .norm.spro
labelFilesPath ../../per/labels/energy/eval02/
loadMixtureFileFormat RAW
bigEndian false
labelSelectedFrames speech
MAPAlgo MAPOccDep
minLLK -200
normalizeModel false
distribType GD
frameLength 0.01

C.3 ComputeTest — Segment Testing

distribType GD
saveMixtureFileExtension .gmm
loadMixtureFileExtension .gmm
loadFeatureFileExtension .norm.spro
maxLLK 200
minLLK -200


bigEndian false
saveMixtureFileFormat RAW
loadMixtureFileFormat RAW
loadFeatureFileFormat SPRO4
featureServerBufferSize ALL_FEATURES
featureFilesPath ../../../data/per/param/eval02/
lstPath ./
labelFilesPath ../../../data/per/labels/energy/eval02/
!verbose true
!debug true
vectSize 24
labelSelectedFrames speech
frameLength 0.01

topDistribsCount 10
computeLLKWithTopDistribs COMPLETE
segmentalMode completeLLR

C.4 ComputeNorm — Score Normalization

verbose true
debug true
maxScoreDistribNb 500
maxIdNb 1000
maxSegNb 5000
selectType noSelect
tnormCohortNb 0
znormCohortNb 0
