
An Overview of Automatic Speaker Diarisation Systems

Sue Tranter, Member, IEEE, Douglas Reynolds*, Senior Member, IEEE

Abstract—Audio diarisation is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources and other signal source/channel characteristics. Diarisation can be used for helping speech recognition, facilitating the searching and indexing of audio archives and increasing the richness of automatic transcriptions, making them more readable. In this paper we provide an overview of the approaches currently used in a key area of audio diarisation, namely speaker diarisation, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarisation task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real Broadcast News systems and their portability to other domains and tasks such as meetings and speaker verification.

Index Terms—Speaker diarisation, speaker segmentation and clustering

    EDICS Category: SPE-SMIR; SPE-SPKR

    I. INTRODUCTION

The continually decreasing cost of and increasing access to processing power, storage capacity and network bandwidth is facilitating the amassing of large volumes of audio, including broadcasts, voice mails, meetings and other spoken documents. There is a growing need to apply automatic Human Language Technologies to allow efficient and effective searching, indexing and accessing of these information sources. Extracting the words being spoken in the audio using speech recognition technology provides a sound base for these tasks, but the transcripts are often hard to read and do not capture all the information contained within the audio. Other technologies are needed to extract meta-data which can make the transcripts more readable and provide context and information beyond a simple word sequence. Speaker turns and sentence boundaries are examples of such meta-data, both of which help provide a richer transcription of the audio, making transcripts more readable and potentially helping with other tasks such as summarisation, parsing or machine translation.

This work was supported by the Defense Advanced Research Projects Agency under grant MDA972-02-1-0013 and Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government. This paper is based on the ICASSP 2005 HLT special session paper [1].

Sue Tranter is with the Cambridge University Engineering Dept, Trumpington Street, Cambridge, CB2 1PZ, UK. E-mail: [email protected], phone: +44 1223 766972, fax: +44 1223 332662.

Douglas Reynolds is with M.I.T. Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420-9185, USA. E-mail: [email protected], phone: +1 781 981 4494, fax: +1 781 981 0186.

In general, a spoken document is a single channel recording that consists of multiple audio sources. Audio sources may be different speakers, music segments, types of noise, etc. For example, a broadcast news programme consists of speech from different speakers as well as music segments, commercials and sounds used to segue into reports (see Figure 1). Audio diarisation is defined as the task of marking and categorising the audio sources within a spoken document. The types and details of the audio sources are application specific. At the simplest, diarisation is speech versus non-speech, where non-speech is a general class consisting of music, silence, noise, etc., that need not be broken out by type. A more complicated diarisation would further mark where speaker changes occur in the detected speech and associate segments of speech (a segment is a section of speech bounded by non-speech or speaker change points) coming from the same speaker. This is usually referred to as speaker diarisation (a.k.a. who spoke when) or speaker segmentation and clustering and is the focus of most current research efforts in audio diarisation. This paper discusses the techniques commonly used for speaker diarisation, which allows searching audio by speaker, makes transcripts easier to read and provides information which could be used within speaker adaptation in speech recognition systems. Other audio diarisation tasks, such as explicitly detecting the presence of music (e.g. [2]), helping find the structure of a broadcast programme (e.g. [3]), or locating commercials to eliminate unrequired audio (e.g. [4]), also have many potential benefits but fall outside the scope of this paper.

Fig. 1. Example of audio diarisation on Broadcast News. Annotated phenomena may include different structural regions such as commercials, different acoustic events such as music or noise, and different speakers.

There are three primary domains which have been used for speaker diarisation research and development: broadcast news audio, recorded meetings and telephone conversations. The data from these domains differs in the quality of the recordings (bandwidth, microphones, noise), the amount and types of non-speech sources, the number of speakers, the durations and sequencing of speaker turns and the style/spontaneity of the speech. Each domain presents unique diarisation challenges, although often high-level system techniques tend to generalise well over several domains [5], [6]. The NIST Rich Transcription speaker evaluations [7] have primarily used both broadcast news and meeting data, whereas the NIST speaker recognition evaluations [8] have primarily used conversational telephone speech with summed sides (a.k.a. two-wire).

The diarisation task is also defined by the amount of specific prior knowledge allowed. There may be specific prior knowledge via example speech from the speakers in the audio, such as in a recording of a regular staff meeting. The task then becomes more like speaker detection or tracking tasks [9]. Specific prior knowledge could also be example speech from just a few of the speakers, such as common anchors on particular news stations, or knowledge of the number of speakers in the audio, perhaps for a teleconference over a known number of lines, or maybe the structure of the audio recording (e.g. music followed by story). However, for a more portable speaker diarisation system, it is desired to operate without any specific prior knowledge of the audio. This is the general task definition used in the Rich Transcription diarisation evaluations, where only the broadcaster and date of broadcast are known in addition to having the audio data, and we adopt this scenario when discussing speaker diarisation systems.

The aim of this paper is to provide an overview of current speaker diarisation approaches and to discuss performance and potential applications. In the next section we outline the general framework of diarisation systems and discuss different implementations of the key components within current systems. Performance is then measured in Section III in terms of the diarisation error rate (DER) using the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarisation evaluation data. Section IV looks at the use of these methods in real applications and the future directions for diarisation research.

    II. DIARISATION SYSTEM FRAMEWORK

In this section we review the key sub-tasks used to build current speaker diarisation systems. Most diarisation systems perform these tasks separately, although it is possible to perform some of the stages jointly (for example speaker segmentation and clustering), and the ordering of the stages often varies from system to system. A prototypical combination of the key components of a diarisation system is shown in Figure 2. For each task we provide a brief description of the common approaches employed and some of the issues in applying them.

    A. Speech Detection

The aim of this step is to find the regions of speech in the audio stream. Depending on the domain data being used, non-speech regions to be discarded can consist of many acoustic phenomena such as silence, music, room noise, background noise or cross-talk.

Fig. 2. A prototypical diarisation system. Most diarisation systems have components to perform speech detection, gender and/or bandwidth segmentation, speaker segmentation, speaker clustering and final resegmentation or boundary refinement.

The general approach used is maximum likelihood classification with Gaussian Mixture Models (GMMs) trained on labelled training data, although different class models can be used, such as multi-state HMMs. The simplest system uses just speech/non-speech models such as in [10], whilst [11] is similar but four speech models are used for the possible gender/bandwidth combinations. Noise and music are explicitly modelled in [12]–[14], which have classes for speech, music, noise, speech+music and speech+noise, whilst [15], [16] use wideband speech, narrowband speech, music and speech+music. The extra speech+xx models are used to help minimise the false rejection of speech occurring in the presence of music or noise, and this data is subsequently reclassified as speech.
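As an illustration of this maximum likelihood classification step, the sketch below fits one GMM per labelled class and assigns each frame to the highest-likelihood model. It uses scikit-learn's GaussianMixture; the feature layout, number of mixture components and class names are illustrative assumptions rather than details of any cited system.

```python
# Minimal sketch of GMM-based maximum likelihood speech/non-speech
# classification.  Model sizes and class names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_models(features_by_class, n_components=32, seed=0):
    """Fit one GMM per labelled class, e.g. 'speech', 'music', 'speech+music'."""
    models = {}
    for label, frames in features_by_class.items():   # frames: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        models[label] = gmm.fit(frames)
    return models

def classify_frames(models, frames):
    """Label each frame with the class whose model gives the highest log-likelihood."""
    labels = list(models)
    scores = np.stack([models[lab].score_samples(frames) for lab in labels])
    best = scores.argmax(axis=0)
    # Frames labelled speech+music or speech+noise would subsequently be
    # relabelled as speech, as described in the text.
    return [labels[i] for i in best]
```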

When operating on un-segmented audio, Viterbi segmentation (single pass or iterative with optional adaptation) using the models is employed to identify speech regions. If an initial segmentation is already available (for example the ordering of the key components may allow change point detection before non-speech removal), each segment is individually classified. Minimum length constraints [10] and heuristic smoothing rules [11], [14] may also be applied. An alternative approach, which does not use Viterbi decoding but instead a best model search with morphological rules, is described in [17].
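The following is a minimal two-state Viterbi decode of the kind mentioned above, in which a log-domain switching penalty stands in for the minimum-length and smoothing constraints. The per-frame log-likelihoods are assumed to come from speech and non-speech models such as the GMMs sketched earlier; the penalty value is an illustrative tuning parameter, not one taken from any cited system.

```python
import numpy as np

def viterbi_speech_nonspeech(loglik_nonspeech, loglik_speech, switch_penalty=20.0):
    """Two-state Viterbi decode over per-frame log-likelihoods.

    switch_penalty is a log-domain cost for changing state; it discourages very
    short segments, playing the role of a minimum-length/smoothing constraint.
    Returns a boolean array, True where a frame is decoded as speech.
    """
    ll = np.stack([loglik_nonspeech, loglik_speech])   # (2, T); state 1 = speech
    T = ll.shape[1]
    delta = ll[:, 0].copy()                            # best score ending in each state
    back = np.zeros((2, T), dtype=int)
    for t in range(1, T):
        new_delta = np.empty(2)
        for s in (0, 1):
            stay, switch = delta[s], delta[1 - s] - switch_penalty
            back[s, t] = s if stay >= switch else 1 - s
            new_delta[s] = max(stay, switch) + ll[s, t]
        delta = new_delta
    path = np.empty(T, dtype=int)                      # backtrace
    path[-1] = int(delta.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path.astype(bool)
```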

Silence can be removed in this early stage, using a phone recogniser (as in [16]) or energy constraint, or in a final stage of processing using a word recogniser (as in [13]) or energy constraint (as in the MIT system for RT-03 [18]). Regions which contain commercials, and thus are of no interest for the final output, can also be automatically detected and removed at this early stage [4], [18].

For broadcast news audio, speech detection performance is typically about 1% miss (speech in the reference but not in the hypothesis) and 1-2% false alarm (speech in the hypothesis but not in the reference). When the speech detection phase is run early in a system, or the output is required for further processing such as for transcription, it is more important to minimise speech miss than false alarm rates, since the former are unrecoverable errors in most systems. However, the diarisation error rate (DER), used to evaluate speaker diarisation performance, treats both forms of error equally.

For telephone audio, typically some form of standard energy/spectrum based speech activity detection is used, since non-speech tends to be silence or noise sources, although the GMM approach has also been successful in this domain [19].

For meeting audio, the non-speech can be from a variety of noise sources, like paper shuffling, coughing, laughing, etc. When supported, multiple channel meeting audio can be used to help speech activity detection [20].

    B. Change Detection

The aim of this step is to find points in the audio stream likely to be change points between audio sources. If the input to this stage is the un-segmented audio stream, then the change detection looks for both speaker and speech/non-speech change points. If a speech detector or gender/bandwidth classifier has been run first, then the change detector looks for speaker change points within each speech segment.

Two main approaches have been used for change detection. They both involve looking at adjacent windows of data and calculating a distance metric between the two, then deciding whether the windows originate from the same or a different source. The differences between them lie in the choice of distance metric and thresholding decisions.

The first general approach to change detection, used in [14], is a variation on the Bayesian Information Criterion (BIC) technique introduced in [21]. This technique searches for change points within a window using a penalised likelihood ratio test of whether the data in the window is better modelled by a single distribution (no change point) or two different distributions (change point). If a change is found, the window is reset to the change point and the search restarted. If no change point is found, the window is increased and the search is redone. Some of the issues in applying the BIC change detector are: (a) it has high miss rates on detecting short turns (< 2-5 seconds), so can be problematic to use on fast interchange speech like conversations, and (b) the full search implementation is computationally expensive (order N²), so most systems employ some form of computation reduction (e.g. [22]).
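A sketch of the penalised likelihood ratio (ΔBIC) test at a single candidate change point is given below, assuming full-covariance Gaussian models over multi-dimensional cepstral features. The window growing/resetting search and the value of the penalty weight are left out; both are tuned in practice as discussed above.

```python
import numpy as np

def n_logdet(frames):
    """N * log|S| for a full-covariance Gaussian fitted to the frames."""
    cov = np.cov(frames, rowvar=False, bias=True)
    return len(frames) * np.linalg.slogdet(cov)[1]

def delta_bic_change(window, change, lam=1.0):
    """Penalised likelihood ratio test for a change point at frame `change`.

    `window` is an (n_frames x dim) array of feature vectors.  Positive values
    favour two distributions (a change point); negative values favour a single
    Gaussian (no change).  lam is the penalty weight.
    """
    x, y = window[:change], window[change:]
    d = window.shape[1]
    penalty = 0.5 * (d + d * (d + 1) / 2.0) * np.log(len(window))
    return 0.5 * (n_logdet(window) - n_logdet(x) - n_logdet(y)) - lam * penalty
```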

A second technique, used first in [23] and later in [12], [16], [24], uses fixed length windows and represents each window by a Gaussian and the distance between them by the Gaussian divergence (symmetric KL-2 distance). The step-by-step implementation in [17] is similar but uses the generalised log likelihood ratio as the distance metric. The peaks in the distance function are then found and define the change points if their absolute value exceeds a pre-determined threshold chosen on development data. Smoothing the distance distribution or eliminating the smaller of neighbouring peaks within a certain minimum duration prevents the system from overgenerating change points at true boundaries. Single Gaussians are generally preferred to GMMs due to the simplified distance calculations. Typical window sizes are 1-2 or 2-5 seconds when using a diagonal or full covariance Gaussian respectively. As with BIC, the window length constrains the detection of short turns.
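The sketch below estimates diagonal-covariance Gaussians on two adjacent fixed-length windows, computes the symmetric KL (KL-2) distance between them, and picks peaks above a threshold. The window length, default threshold heuristic and peak handling are illustrative simplifications, not the settings of any cited system.

```python
import numpy as np

def kl2_diag(x, y, eps=1e-6):
    """Symmetric KL (KL-2) distance between diagonal Gaussians fitted to x and y."""
    mx, vx = x.mean(axis=0), x.var(axis=0) + eps
    my, vy = y.mean(axis=0), y.var(axis=0) + eps
    d = x.shape[1]
    return 0.5 * np.sum(vx / vy + vy / vx + (mx - my) ** 2 * (1 / vx + 1 / vy)) - d

def change_points(feats, win=100, threshold=None):
    """Slide two adjacent windows over the features, compute the distance curve
    and keep local maxima above a threshold (normally chosen on development
    data; the default here is only an illustrative fallback)."""
    dist = np.array([kl2_diag(feats[t - win:t], feats[t:t + win])
                     for t in range(win, len(feats) - win)])
    if threshold is None:
        threshold = dist.mean() + dist.std()
    return [t + win for t in range(1, len(dist) - 1)
            if dist[t] > max(dist[t - 1], dist[t + 1]) and dist[t] > threshold]
```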

Since the change point detection often only provides an initial base segmentation for diarisation systems, which will be clustered and often resegmented later, being able to run the change point detection very fast (typically less than 0.01×RT for a diagonal covariance system) is often more important than any performance degradation. In fact [10] and [17] found no significant performance degradation when using a simple initial uniform segmentation within their systems.

Both change detection techniques require a detection threshold to be empirically tuned for changes in audio type and features. Tuning the change detector is a tradeoff between the desire to have long, pure segments to aid in initialising the clustering stage, and minimising missed change points which produce contamination in the clustering.

Alternatively, or in addition, a word or phone decoding step with heuristic rules may be used to help find putative speaker change points, such as in the Cambridge RT-03 system [18]. However, this approach over-segments the speech data and requires some additional merging and clustering to form viable speech segments, and can miss boundaries in fast speaker interchanges where there is no silence or gender change between speakers.

    C. Gender/Bandwidth Classification

The aim of this stage is to partition the segments into common groupings of gender (male or female) and bandwidth (low-bandwidth: narrowband/telephone or high-bandwidth: studio). This is done to reduce the load on subsequent clustering, provide more flexibility in clustering settings (for example female speakers may have different optimal parameter settings to male speakers), and supply more side information about the speakers in the final output. If the partitioning can be done very accurately, and assuming no speaker appears in the same broadcast in different classes (for example both in the studio and via a pre-recorded field report), then performing this partitioning early on in the system can also help improve performance whilst reducing the computational load [25].

The potential drawback of this partitioning stage, however, is that if a subset of a speaker's segments is misclassified the errors can be unrecoverable, although it is possible to allow these classifications to change in a subsequent re-segmentation stage, such as in [17].

Classification for both gender and bandwidth is typically done using maximum likelihood classification with GMMs trained on labelled training data. Either two classifiers are run (one for gender and one for bandwidth) or joint models for gender and bandwidth are used. This can be done either in conjunction with the speech/non-speech detection process or after the initial segmentation. Bandwidth classification can also be done using a test on the ratio of spectral energy above and below 4 kHz. An alternative method of gender classification, used in [16], aligns the word recognition output of a fast ASR system with gender dependent models and assigns the most likely gender to each segment. This has a high accuracy but is unnecessarily computationally expensive if a speech recognition output is not already available, and segments ideally should be of a reasonable size (typically between 1 and 30 seconds). Gender classification error rates are around 1-2% and bandwidth classification error rates are around 3-5% for broadcast news audio.
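As a small illustration of the spectral energy ratio test mentioned above, the sketch below labels a segment as narrowband when only a small fraction of its energy lies above 4 kHz. The threshold value is an assumed illustration that would be tuned on labelled data.

```python
import numpy as np

def is_narrowband(signal, sample_rate=16000, threshold=0.02):
    """Return True when the fraction of spectral energy above 4 kHz is small,
    suggesting narrowband (telephone) rather than studio audio."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    high = spectrum[freqs >= 4000].sum()
    total = spectrum.sum() + 1e-12
    return (high / total) < threshold
```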

    D. Clustering

The purpose of this stage is to associate or cluster segments from the same speaker together. The clustering ideally produces one cluster for each speaker in the audio, with all segments from a given speaker in a single cluster. The predominant approach used in diarisation systems is hierarchical, agglomerative clustering with a BIC based stopping criterion [21], consisting of the following steps:

0) Initialise leaf clusters of tree with speech segments.
1) Compute pair-wise distances between each cluster.
2) Merge closest clusters.
3) Update distances of remaining clusters to new cluster.
4) Iterate steps 1-3 until stopping criterion is met.

The clusters are generally represented by a single full covariance Gaussian [5], [11], [14], [16], [24], but GMMs have also been used [10], [17], [26], sometimes being built using mean-only MAP adaptation of a GMM of the entire test file to each cluster for increased robustness. The standard distance metric between clusters is the generalised likelihood ratio. It is possible to use other representations or distance metrics, but these have been found the most successful within the BIC clustering framework.

The stopping criterion compares the BIC statistic from the two clusters being considered, x and y, with that of the parent cluster, z, should they be merged, the formulation being for the full covariance Gaussian case:

\[ \mathrm{BIC} = L - \tfrac{1}{2}\,\lambda\, M \log N \]

\[ \Delta\mathrm{BIC} = \tfrac{1}{2}\left[\, N_z \log|S_z| - N_x \log|S_x| - N_y \log|S_y| \,\right] - \lambda P, \qquad P = \frac{d(d+3)}{4}\,\log N_z \]

where L is the log likelihood of the data given the model, M is the number of free parameters, N the number of frames, S the covariance matrix and d the dimension of the feature vector. If the pair of clusters is best described by a single full covariance Gaussian, the ΔBIC will be low, whereas if there are two separate distributions, implying two speakers, the ΔBIC will be high. For each step, the pair of clusters with the lowest ΔBIC is merged and the statistics are recalculated. The process is generally stopped when the lowest ΔBIC is greater than a specified threshold, usually 0. The use of the number of frames in the parent cluster, N_z, in the penalty factor, P, represents a local BIC decision, i.e. just considering the clusters being combined. This has been shown to perform better than the corresponding global BIC implementation which uses the number of frames in the whole show, N_f, instead [18], [24], [27].
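A sketch of this local ΔBIC merge criterion for full-covariance Gaussian clusters, together with the greedy merge loop of steps 1-4 above, is given below. The penalty weight λ, the stopping threshold of 0 and the pooling of raw feature frames per cluster follow the description above; everything else (data layout, greedy search) is an illustrative simplification rather than any cited system's implementation.

```python
import numpy as np

def n_logdet(frames):
    """N * log|S| for the full-covariance Gaussian fitted to `frames`."""
    cov = np.cov(frames, rowvar=False, bias=True)
    return len(frames) * np.linalg.slogdet(cov)[1]

def delta_bic_merge(cluster_x, cluster_y, lam=1.0):
    """Local ΔBIC between two clusters of feature frames (n_frames x dim).

    Low values indicate that the pair is well modelled by one Gaussian (same
    speaker).  lam is the penalty weight λ from the formula above.
    """
    z = np.vstack([cluster_x, cluster_y])
    d = z.shape[1]
    penalty = (d * (d + 3) / 4.0) * np.log(len(z))
    return 0.5 * (n_logdet(z) - n_logdet(cluster_x) - n_logdet(cluster_y)) - lam * penalty

def agglomerative_bic(clusters, lam=1.0):
    """Greedy agglomeration (steps 1-4): merge the lowest-ΔBIC pair, stop at 0."""
    clusters = [np.asarray(c) for c in clusters]
    while len(clusters) > 1:
        scored = [(delta_bic_merge(clusters[i], clusters[j], lam), i, j)
                  for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        best, i, j = min(scored)
        if best > 0.0:                      # stopping criterion
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```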

Slight variations of this technique have also been used. For example, the system described in [10], [28] removes the need for tuning the penalty weight, λ, on development data, by ensuring that the number of parameters (M) in the merged and separate distributions are equal, although the base number of Gaussians, and hence the number of free parameters, needs to be chosen carefully for optimal effect. Alternatives to the penalty term, such as using a constant [29], or a weighted sum of the number of clusters and number of segments [12], have also had moderate success, but the BIC method has generally superseded these. Adding a Viterbi resegmentation between multiple iterations of clustering [24] or within a single iteration [10] has also been used to increase performance at the penalty of increased computational cost.

An alternative approach described in [30] uses a Euclidean distance between MAP-adapted GMMs and notes this is highly correlated with a Monte-Carlo estimation of the Gaussian divergence (symmetric KL-2) distance whilst also being an upper bound to it. The stopping criterion uses a fixed threshold, chosen on the development data, on the distance metric. The performance is comparable to the more conventional BIC method.

A further method described in [14] uses proxy speakers. A set of proxy models is applied to map segments into a vector space, then a Euclidean distance metric and an ad hoc occupancy stopping criterion are used, but the overall clustering framework remains the same. The proxy models can be built by adapting a universal background model (UBM), such as a 128 mixture GMM, to the test data segments themselves, thus making the system portable to different shows and domains whilst still giving a consistent performance gain over the BIC method.

Regardless of the clustering employed, the stopping criterion is critical to good performance and depends on how the output is to be used. Under-clustering fragments speaker data over several clusters, whilst over-clustering produces contaminated clusters containing speech from several speakers. For indexing information by speaker, both are suboptimal. However, when using cluster output to assist in speaker adaptation of speech recognition models, under-clustering may be suitable when a speaker occurs in multiple acoustic environments, and over-clustering may be advantageous in aggregating speech from similar speakers or acoustic environments.

    E. Joint Segmentation and Clustering

An alternative approach to running segmentation and clustering stages separately is to use an integrated scheme. This was first done in [12] by employing a Viterbi decode between iterations of agglomerative clustering, but an initial segmentation stage was still required. A more recent completely integrated scheme, based on an evolutive-HMM (E-HMM) where detected speakers help influence both the detection of other speakers and the speaker boundaries, was introduced in [31] and developed in [17], [32]. The recording is represented by an ergodic HMM in which each state represents a speaker and the transitions model the changes between speakers. The initial HMM contains only one state and represents all of the data. In each iteration, a short speech segment assumed to come from a non-detected speaker is selected and used to build a new speaker model by Bayesian adaptation of a UBM. A state is then added to the HMM to reflect this new speaker and the transition probabilities are modified accordingly. A new segmentation is then generated from a Viterbi decode of the data with the new HMM, and each model is adapted using the new segmentation. This resegmentation phase is repeated until the speaker labels no longer change. The process of adding new speakers is repeated until there is no gain in terms of comparable likelihood or there is no data left to form a new speaker. The main advantages of this integrated approach are to use all the information at each step and to allow the use of speaker recognition based techniques, like Bayesian adaptation of the speaker models from a UBM.

    F. Cluster Re-combination

In this relatively recent approach [24], state-of-the-art speaker recognition modelling and matching techniques are used as a secondary stage for combining clusters. The signal processing and modelling used in the clustering stage of section II-D are usually simple: no channel compensation, such as RASTA, since we wish to take advantage of common channel characteristics among a speaker's segments, and limited parameter distribution models, since the model needs to work with small amounts of data in the clusters at the start.

With cluster recombination, clustering is run to under-cluster the audio but still produce clusters with a reasonable amount of speech (> 30s). A UBM is built on training data to represent general speakers. Feature normalisation is applied to help reduce the effect of the acoustic environment. Maximum A Posteriori (MAP) adaptation (usually mean-only) is then applied on each cluster from the UBM to form a single model per cluster. The cross likelihood ratio (CLR) between any two given clusters is defined [24], [33]:

\[ \mathrm{CLR}(c_i, c_j) = \log \frac{L(x_i\,|\,j)\; L(x_j\,|\,i)}{L(x_i\,|\,\mathrm{ubm})\; L(x_j\,|\,\mathrm{ubm})} \]

where L(x_i|j) is the average likelihood per frame of the data x_i given the model of cluster j. The pair of clusters with the highest CLR is merged and a new model is created. The process is repeated until the highest CLR is below a predefined threshold chosen from development data. Because of the computational load at this stage, each gender/bandwidth combination is usually processed separately, which also allows more appropriate UBMs to be used for each case.
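The sketch below shows mean-only MAP adaptation of a scikit-learn GMM used as the UBM, and the CLR score between two clusters computed from average log-likelihoods per frame. The relevance factor and model sizes are assumed values, and feature normalisation is taken to have been applied to the inputs beforehand; this is an illustration of the mechanism rather than any cited system's implementation.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, frames, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one cluster's feature frames."""
    frames = np.asarray(frames)
    gamma = ubm.predict_proba(frames)                  # (n_frames, n_components)
    n_k = gamma.sum(axis=0)[:, None]                   # soft occupation counts
    f_k = gamma.T @ frames                             # first-order statistics
    adapted = copy.deepcopy(ubm)
    adapted.means_ = (f_k + relevance * ubm.means_) / (n_k + relevance)
    return adapted

def clr(ubm, frames_i, frames_j, relevance=16.0):
    """Cross likelihood ratio between two clusters, using average per-frame
    log-likelihoods (GaussianMixture.score)."""
    model_i = map_adapt_means(ubm, frames_i, relevance)
    model_j = map_adapt_means(ubm, frames_j, relevance)
    return (model_j.score(frames_i) - ubm.score(frames_i)
            + model_i.score(frames_j) - ubm.score(frames_j))
```

In use, the pair of clusters with the highest CLR would be merged (with a new model adapted on the pooled frames) until the highest value falls below a development-set threshold, as described above.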

Different types of feature normalisation have been used with this process, namely RASTA-filtered cepstra with 10s feature mean and variance normalisation [14] and feature warping [34] using a sliding window of 3 seconds [13], [16]. In [16] it was found that the feature normalisation was necessary to get significant gain from the cluster recombination technique.

When the clusters are merged, a new speaker model can be trained with the combined data and distances updated (as in [13], [16]) or standard clustering rules can be used with a static distance matrix (as in [14]). This recombination can be viewed as fusing intra- and inter-audio-file [33] speaker clustering techniques. On the RT-04F evaluation it was found that this stage significantly improves performance, with further improvements being obtained subsequently by using a variable prior iterative MAP approach for adapting the UBMs, and building new UBMs including all of the test data [16].

    G. Re-segmentation

The last stage found in many diarisation systems is a re-segmentation of the audio via Viterbi decoding (with or without iterations) using the final cluster models and non-speech models. The purpose of this stage is to refine the original segment boundaries and/or to fill in short segments that may have been removed for more robust processing in the clustering stage. Filtering the segment boundaries using a word or phone recogniser output can also help reduce the false-alarm component of the error rate [24].

    H. Finding Identities

Although current diarisation systems are only evaluated using relative speaker labels (such as spkr1), it is often possible to find the true identities of the speakers (such as Ted Koppel). This can be achieved by a variety of methods, such as building speaker models for people who are likely to be in the news broadcasts (such as prominent politicians or main news anchors and reporters) and including these models in the speaker clustering stage, or running speaker-tracking systems.

An alternative approach, introduced in [35], uses linguistic information contained within the transcriptions to predict the previous, current or next speaker. Rules are defined based on category and word N-grams chosen from the training data, and are then applied sequentially on the test data until the speaker names have been found. Blocking rules are used to stop rules firing in certain contexts; for example the sequence "[name] reports *" assigns the next speaker to be [name] unless * is the word "that". An extension of this system, described in [36], learns many rules and their associated probability of being correct automatically from the training data and then applies these simultaneously on the test data using probabilistic combination. Using automatic transcriptions and automatically found speaker turns naturally degrades performance, but potentially 85% of the time can be correctly assigned to the true speaker identity using this method.
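As a toy illustration of this kind of rule, the sketch below applies a single "[name] reports" pattern with the blocking word "that". Real systems derive many such rules and their reliabilities from training data; the pattern, names and sentences here are purely hypothetical.

```python
import re

# One illustrative rule: "<name> reports" predicts <name> as the next speaker,
# unless the word immediately after "reports" is "that" (a blocking context).
RULE = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+) reports\b(?!\s+that\b)")

def predict_next_speaker(transcript_segment):
    match = RULE.search(transcript_segment)
    return match.group(1) if match else None

print(predict_next_speaker("ABC's John Smith reports from Washington."))  # John Smith
print(predict_next_speaker("John Smith reports that the vote failed."))   # None
```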

Although primarily used for identifying the speaker names given a set of speaker clusters, this technique can associate the same name with more than one input cluster and therefore could be thought of as a high-level cluster-combination stage.

    I. Combining Different Diarisation Methods

Combining methods used in different diarisation systems could potentially improve performance over the best single diarisation system. It has been shown that the word error rate (WER) of an automatic speech recogniser can be consistently reduced when combining multiple segmentations, even if the individual segmentations themselves do not offer state-of-the-art performance in either DER or resulting WER [37]. Indeed it seems that diversity between the segmentation methods is just as important as the segmentation quality when being combined. It is expected that gains in DER are also possible by combining different diarisation modules or systems.

Several methods of combining aspects of different diarisation systems have been tried, for example the hybridization or piped CLIPS/LIA systems of [26], [38] and the Plug and Play CUED/MIT-LL system of [18], which both combine components of different systems together. A more integrated merging method is described in [38], whilst [26] describes a way of using the 2002 NIST speaker segmentation error metric to find regions in two inputs which agree and then uses these to train potentially more accurate speaker models. These systems generally produce performance gains, but tend to place some restriction on the systems being combined, such as the required architecture or equalising the number of speakers. An alternative approach introduced in [39] uses a cluster voting technique to compare the output of arbitrary diarisation systems, maintaining areas of agreement and voting using confidences or an external judging scheme in areas of conflict.

    III. EVALUATION OF PERFORMANCE

In this section we briefly describe the NIST RT-04F speaker diarisation evaluation and present the results when using the key techniques discussed in this paper on the RT-04F diarisation evaluation data.

A. Speaker Diarisation Error Measure

A system hypothesizes a set of speaker segments, each of which consists of a (relative) speaker-id label such as Mspkr1 or Fspkr2 and the corresponding start and end times. This is then scored against a reference ground-truth speaker segmentation which is generated using the rules given in [40]. Since the hypothesis speaker labels are relative, they must be matched appropriately to the true speaker names in the reference. To accomplish this, a one-to-one mapping of the reference speaker IDs to the hypothesis speaker IDs is performed so as to maximise the total overlap of the reference and (corresponding) mapped hypothesis speakers. Speaker diarisation performance is then expressed in terms of the miss (speaker in the reference but not in the hypothesis), false alarm (speaker in the hypothesis but not in the reference), and speaker-error (mapped reference speaker is not the same as the hypothesized speaker) rates. The overall diarisation error (DER) is the sum of these three components. A complete description of the evaluation measure and scoring software implementing it can be found at http://nist.gov/speech/tests/rt/rt2004/fall.
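The sketch below computes a simplified frame-level DER: it finds the one-to-one speaker mapping that maximises overlap (via the Hungarian algorithm) and then accumulates miss, false alarm and speaker error time. The official NIST scoring tool additionally handles no-score collars around segment boundaries and overlapping speech, which are ignored in this illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def der(ref, hyp):
    """Frame-level DER.  ref/hyp are integer speaker labels per frame; -1 = non-speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    ref_ids, hyp_ids = np.unique(ref[ref >= 0]), np.unique(hyp[hyp >= 0])
    # Overlap (in frames) between every reference and hypothesis speaker
    overlap = np.array([[np.sum((ref == r) & (hyp == h)) for h in hyp_ids]
                        for r in ref_ids])
    # One-to-one mapping maximising total overlap (Hungarian algorithm)
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {hyp_ids[c]: ref_ids[r] for r, c in zip(rows, cols)}

    ref_speech, hyp_speech = ref >= 0, hyp >= 0
    miss = np.sum(ref_speech & ~hyp_speech)
    falarm = np.sum(~ref_speech & hyp_speech)
    both = ref_speech & hyp_speech
    spkerr = np.sum([mapping.get(h, None) != r for r, h in zip(ref[both], hyp[both])])
    return (miss + falarm + spkerr) / max(np.sum(ref_speech), 1)

# One speaker-error frame and one false-alarm frame over four reference speech frames:
print(der([0, 0, 1, 1, -1], [5, 5, 5, 6, 6]))   # 0.5
```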

It should be noted that this measure is time-weighted, so the DER is primarily driven by (relatively few) loquacious speakers, and it is therefore more important to get the main speakers complete and correct than to accurately find speakers who do not speak much. This scenario models some tasks, such as tracking anchor speakers in Broadcast News for text summarisation, but there may be other tasks (such as speaker adaptation within automatic transcription, or ascertaining the opinions of several speakers in a quick debate) for which it is less appropriate. The same formulation can be modified to be speaker weighted instead of time weighted if necessary, but this is not discussed here. The utility of either weighting depends on the application of the diarisation output.

    B. Data

The RT-04F speaker diarisation data consists of one 30 minute extract from each of 12 different US broadcast news shows. These were derived from TV shows: three from ABC, three from CNN, two from CNBC, two from PBS, one from CSPAN and one from WBN. The style of show varied from a set of lectures from a few speakers (CSPAN) to rapid headline news reporting (CNN Headline News). Details of the exact composition of the data sets can be found in [40].

    C. Results

The results from the main diarisation techniques are shown in Figure 3. Using a top-down clustering approach with full covariance models, AHS distance metric and BIC stopping criterion gave a DER of between 20.5% and 22.5% [29]. The corresponding performance on the 6-show RT diarisation development data sets ranged from 15.9% to 26.9%, showing that the top-down method seems more unpredictable than the agglomerative method. This is thought to be because the initial clusters contain many speakers and segments may thus be assigned incorrectly early on, leading to an unrecoverable error. In contrast, the agglomerative scheme grows clusters from the original segments and should not contain impure multi-speaker clusters until very late in the clustering process. The agglomerative clustering BIC-based scheme got around 17-18% DER [10], [14], [16], [24], with Viterbi re-segmentation between each step providing a slight benefit to 16.4% [10]. Further improvements to around 13% were made using CLR cluster recombination and resegmentation [14]. The CLR cluster recombination stage which included feature warping produced a further reduction to around 8.5-9.5% [13], [16], and using the whole of the RT-04F evaluation data in the UBM build of the CLR cluster recombination stage gave a final performance of 6.9% [16]. The proxy model technique performed better than the equivalent BIC stages, giving 14.1% initially and 11.0% after CLR cluster recombination and resegmentation [14]. The E-HMM system, despite not being tuned on the US Broadcast News development sets, gave 16.1% DER.

Fig. 3. DERs (%) for different methods on the RT-04F evaluation data: top-down BIC, bottom-up BIC, bottom-up BIC + CLR recluster (MIT and LIMSI versions), proxy models, proxy models + CLR recluster (MIT) and E-HMM*. Each dot represents a different version of a system built using the indicated core technique. *E-HMM not tuned on the US Broadcast News development sets.

For the system with the lowest DER, the per-show results are given in Figure 4. Typical of most systems, there is a large variability in performance over the shows, reflecting the variability in the number of speakers, the dominance of speakers, and the style and structure of the speech. Most of the variability is from the speaker error component, due to over- or under-clustering. Reducing this variability is a focus of on-going work. Certain techniques or parameter settings can perform better for different styles of show. Systems may potentially be improved either by automatically detecting the type of show and modifying the choice of techniques or parameters accordingly, or by combining different systems directly, as discussed in section II-I.

Fig. 4. DER per show (miss, false alarm and speaker error components) for the system with lowest DER. There is a large variability between the different styles of show.

    IV. CONCLUSION AND FUTURE DIRECTIONS

There has been tremendous progress in task definition, data availability, scoring measures and technical approaches for speaker diarisation over recent years. The methods developed for Broadcast News diarisation are now being deployed across other domains, for example finding speakers within meeting data [6] and improving speaker recognition performance using multi-speaker train and test data in conversational telephone speech [14]. Other tasks, such as finding true speaker identities (see section II-H) or speaker tracking (e.g. in [41]), are increasingly using diarisation output as a starting point, and performance is approaching a level where it can be considered useful for real human interaction.

Additions to applications which display audio and optionally transcriptions, such as wavesurfer¹ (see Figure 5) or transcriber² (see Figure 6), allow users to see the current speaker information, understand the general flow of speakers throughout the broadcast or search for a particular speaker within the show. Experiments are also underway to ascertain if additional tasks, such as the process of annotating data, can be facilitated using diarisation output.

¹ Wavesurfer is available from www.speech.kth.se/wavesurfer
² Transcriber is available from http://trans.sourceforge.net/

Fig. 5. Accessing Broadcast News Audio from automatically derived speaker names via a wavesurfer plugin.

Fig. 6. Diarisation information, including automatically found speaker identities, being used in the Transcriber tool to improve readability and facilitate searching by speaker.

The diarisation tasks of the future will cover a wider scope than currently, both in terms of the amount of data (hundreds of hours) and the information required (speaker identity, speaker characteristics, or potentially even emotion). Current techniques will provide a firm base to start from, but new methods, particularly combining information from many different approaches (as is currently done in the speaker recognition field), will need to be developed to allow diarisation to be maximally beneficial to real users and potential downstream processing such as machine translation and parsing. Additionally, further development of tools to allow user interaction with diarisation output for specific jobs will help focus the research to contribute to high-utility human language technology.

    V. ACKNOWLEDGEMENTS

The authors would like to thank Claude Barras, Jean-Francois Bonastre, Corinne Fredouille, Patrick Nguyen and Chuck Wooters for their help in the construction of this paper.

    REFERENCES

[1] D. A. Reynolds and P. Torres-Carrasquillo, "Approaches and Applications of Audio Diarization," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. V, Philadelphia, PA, March 2005, pp. 953–956.


[2] J. Saunders, "Real-time Discrimination of Broadcast Speech/Music," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. II, Atlanta, GA, May 1996, pp. 993–996.

[3] Z. Liu, Y. Wang, and T. Chen, "Audio Feature Extraction and Analysis for Scene Segmentation and Classification," Journal of VLSI Signal Processing Systems, vol. 20, no. 1-2, pp. 61–79, October 1998.

[4] S. E. Johnson and P. C. Woodland, "A Method for Direct Audio Search with Applications to Indexing and Retrieval," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, Istanbul, Turkey, June 2000, pp. 1427–1430.

[5] Y. Moh, P. Nguyen, and J.-C. Junqua, "Towards Domain Independent Clustering," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. II, Hong Kong, April 2003, pp. 85–88.

[6] X. Anguera, C. Wooters, B. Peskin, and M. Aguilo, "Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System," in Proceedings of MLMI/NIST Workshop, Edinburgh, UK, July 2005, to appear.

[7] NIST, "Benchmark Tests: Rich Transcription (RT)." [Online]. Available: http://www.nist.gov/speech/tests/rt/

[8] NIST, "Benchmark Tests: Speaker Recognition." [Online]. Available: http://www.nist.gov/speech/tests/spk/

[9] A. Martin and M. Przybocki, "Speaker Recognition in a Multi-Speaker Environment," in Proc. European Conference on Speech Communication and Technology (Eurospeech), vol. 2, Aalborg, Denmark, September 2001, pp. 787–790.

[10] C. Wooters, J. Fung, B. Peskin, and X. Anguera, "Towards Robust Speaker Segmentation: The ICSI-SRI Fall 2004 Diarization System," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[11] P. Nguyen, L. Rigazio, Y. Moh, and J.-C. Junqua, "Rich Transcription 2002 Site Report. Panasonic Speech Technology Laboratory (PSTL)," in Proc. 2002 Rich Transcription Workshop (RT-02), Vienna, VA, May 2002.

[12] J.-L. Gauvain, L. Lamel, and G. Adda, "Partitioning and Transcription of Broadcast News Data," in Proc. Int. Conference on Spoken Language Processing (ICSLP), vol. 4, Sydney, Australia, December 1998, pp. 1335–1338.

[13] X. Zhu, C. Barras, S. Meignier, and J.-L. Gauvain, "Combining Speaker Identification and BIC for Speaker Diarization," in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005, pp. 2441–2444.

[14] D. A. Reynolds and P. Torres-Carrasquillo, "The MIT Lincoln Laboratory RT-04F Diarization Systems: Applications to Broadcast Audio and Telephone Conversations," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[15] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, and S. J. Young, "Segment Generation and Clustering in the HTK Broadcast News Transcription System," in Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998, pp. 133–137.

[16] R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, "The Cambridge University March 2005 Speaker Diarisation System," in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005, pp. 2437–2440.

[17] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, and L. Besacier, "Step-by-Step and Integrated Approaches in Broadcast News Speaker Diarization," Computer Speech and Language, to appear, 2005.

[18] S. E. Tranter and D. A. Reynolds, "Speaker Diarisation for Broadcast News," in Proc. Odyssey Speaker and Language Recognition Workshop, Toledo, Spain, June 2004, pp. 337–344.

[19] S. E. Tranter, K. Yu, G. Evermann, and P. C. Woodland, "Generating and Evaluating Segmentations for Automatic Speech Recognition of Conversational Telephone Speech," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. I, Montreal, Canada, May 2004, pp. 753–756.

[20] T. Pfau, D. Ellis, and A. Stolcke, "Multispeaker Speech Activity Detection for the ICSI Meeting Recorder," in Proc. IEEE ASRU Workshop, Trento, Italy, December 2001.

[21] S. S. Chen and P. S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," in Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998, pp. 127–132.

[22] B. Zhou and J. Hansen, "Unsupervised Audio Stream Segmentation and Clustering via the Bayesian Information Criterion," in Proc. Int. Conference on Spoken Language Processing (ICSLP), vol. 3, Beijing, China, October 2000, pp. 714–717.

[23] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic Segmentation, Classification and Clustering of Broadcast News," in Proc. 1997 DARPA Speech Recognition Workshop, Chantilly, VA, February 1997, pp. 97–99.

[24] C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, "Improving Speaker Diarization," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[25] S. Meignier, D. Moraru, C. Fredouille, L. Besacier, and J.-F. Bonastre, "Benefits of Prior Acoustic Segmentation for Automatic Speaker Segmentation," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. I, Montreal, Canada, May 2004, pp. 397–400.

[26] D. Moraru, S. Meignier, L. Besacier, J.-F. Bonastre, and I. Magrin-Chagnolleau, "The ELISA Consortium Approaches in Speaker Segmentation during the NIST 2002 Speaker Recognition Evaluation," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, Hong Kong, April 2003, pp. 89–92.

[27] M. Cettolo, "Segmentation, Classification and Clustering of an Italian Broadcast News Corpus," in Proc. Recherche d'Information Assistée par Ordinateur (RIAO), Paris, France, April 2000.

[28] J. Ajmera and C. Wooters, "A Robust Speaker Clustering Algorithm," in Proc. IEEE ASRU Workshop, St. Thomas, U.S. Virgin Islands, November 2003, pp. 411–416.

[29] S. E. Tranter, M. J. F. Gales, R. Sinha, S. Umesh, and P. C. Woodland, "The Development of the Cambridge University RT-04 Diarisation System," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[30] M. Ben, M. Betser, F. Bimbot, and G. Gravier, "Speaker Diarization using Bottom-up Clustering based on a Parameter-derived Distance between Adapted GMMs," in Proc. Int. Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, October 2004.

[31] S. Meignier, J.-F. Bonastre, C. Fredouille, and T. Merlin, "Evolutive HMM for Multi-Speaker Tracking System," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.

[32] S. Meignier, J.-F. Bonastre, and S. Igounet, "E-HMM Approach for Learning and Adapting Sound Models for Speaker Indexing," in Proc. Odyssey Speaker and Language Recognition Workshop, Crete, Greece, June 2001.

[33] D. Reynolds, E. Singer, B. Carlson, J. O'Leary, J. McLaughlin, and M. Zissman, "Blind Clustering of Speech Utterances Based on Speaker and Language Characteristics," in Proc. Int. Conference on Spoken Language Processing (ICSLP), vol. 7, Sydney, Australia, December 1998, pp. 3193–3196.

[34] J. Pelecanos and S. Sridharan, "Feature Warping for Robust Speaker Verification," in Proc. Odyssey Speaker and Language Recognition Workshop, Crete, Greece, June 2001.

[35] L. Canseco-Rodriguez, L. Lamel, and J.-L. Gauvain, "Speaker Diarization from Speech Transcripts," in Proc. Int. Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, October 2004, pp. 1272–1275.

[36] S. E. Tranter and P. C. Woodland, "Who Really Spoke When? - Finding Speaker Turns and Identities in Broadcast News Audio," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, May 2006, submitted.

[37] D. Y. Kim, M. J. F. Gales, and P. C. Woodland, "Performance Improvement in Broadcast News Transcription," IEEE Trans. on SAP, submitted, 2006.

[38] D. Moraru, S. Meignier, C. Fredouille, L. Besacier, and J.-F. Bonastre, "The ELISA Consortium Approaches in Broadcast News Speaker Segmentation during the NIST 2003 Rich Transcription Evaluation," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, Montreal, May 2004, pp. 373–376.

[39] S. E. Tranter, "Two-way Cluster Voting to Improve Speaker Diarisation Performance," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. I, Philadelphia, PA, March 2005, pp. 753–756.

[40] J. G. Fiscus, J. S. Garofolo, A. Le, A. F. Martin, D. S. Pallett, M. A. Przybocki, and G. Sanders, "Results of the Fall 2004 STT and MDE Evaluation," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[41] D. Istrate, N. Scheffler, C. Fredouille, and J.-F. Bonastre, "Broadcast News Speaker Tracking for ESTER 2005 Campaign," in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005, pp. 2445–2448.