
An Overview of Automatic Speaker Diarisation Systems

Sue Tranter, Member, IEEE, Douglas Reynolds*, Senior Member, IEEE

Abstract—Audio diarisation is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources and other signal source/channel characteristics. Diarisation can be used for helping speech recognition, facilitating the searching and indexing of audio archives and increasing the richness of automatic transcriptions, making them more readable. In this paper we provide an overview of the approaches currently used in a key area of audio diarisation, namely speaker diarisation, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarisation task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real Broadcast News systems and their portability to other domains and tasks such as meetings and speaker verification.

Index Terms—Speaker diarisation, speaker segmentation and clustering

    EDICS Category: SPE-SMIR; SPE-SPKR

    I. INTRODUCTION

The continually decreasing cost of and increasing access to processing power, storage capacity and network bandwidth is facilitating the amassing of large volumes of audio, including broadcasts, voice mails, meetings and other spoken documents. There is a growing need to apply automatic Human Language Technologies to allow efficient and effective searching, indexing and accessing of these information sources. Extracting the words being spoken in the audio using speech recognition technology provides a sound base for these tasks, but the transcripts are often hard to read and do not capture all the information contained within the audio. Other technologies are needed to extract meta-data which can make the transcripts more readable and provide context and information beyond a simple word sequence. Speaker turns and sentence boundaries are examples of such meta-data, both of which help provide a richer transcription of the audio, making transcripts more readable and potentially helping with other tasks such as summarisation, parsing or machine translation.

This work was supported by the Defense Advanced Research Projects Agency under grant MDA972-02-1-0013 and Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government. This paper is based on the ICASSP 2005 HLT special session paper [1].

Sue Tranter is with the Cambridge University Engineering Dept, Trumpington Street, Cambridge, CB2 1PZ, UK. E-mail: [email protected], phone: +44 1223 766972, fax: +44 1223 332662.

Douglas Reynolds is with M.I.T. Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420-9185, USA. E-mail: [email protected], phone: +1 781 981 4494, fax: +1 781 981 0186.

In general, a spoken document is a single channel recording that consists of multiple audio sources. Audio sources may be different speakers, music segments, types of noise, etc. For example, a broadcast news programme consists of speech from different speakers as well as music segments, commercials and sounds used to segue into reports (see Figure 1). Audio diarisation is defined as the task of marking and categorising the audio sources within a spoken document. The types and details of the audio sources are application specific. At the simplest, diarisation is speech versus non-speech, where non-speech is a general class consisting of music, silence, noise, etc., that need not be broken out by type. A more complicated diarisation would further mark where speaker changes occur in the detected speech and associate segments of speech (a segment is a section of speech bounded by non-speech or speaker change points) coming from the same speaker. This is usually referred to as speaker diarisation (a.k.a. who spoke when) or speaker segmentation and clustering and is the focus of most current research efforts in audio diarisation. This paper discusses the techniques commonly used for speaker diarisation, which allows searching audio by speaker, makes transcripts easier to read and provides information which could be used within speaker adaptation in speech recognition systems. Other audio diarisation tasks, such as explicitly detecting the presence of music (e.g. [2]), helping find the structure of a broadcast programme (e.g. [3]), or locating commercials to eliminate unrequired audio (e.g. [4]), also have many potential benefits but fall outside the scope of this paper.

Fig. 1. Example of audio diarisation on Broadcast News. Annotated phenomena may include different structural regions such as commercials, different acoustic events such as music or noise, and different speakers.

There are three primary domains which have been used for speaker diarisation research and development: broadcast news audio, recorded meetings and telephone conversations. The data from these domains differs in the quality of the recordings (bandwidth, microphones, noise), the amount and types of non-speech sources, the number of speakers, the durations and sequencing of speaker turns and the style/spontaneity of the speech. Each domain presents unique diarisation challenges, although often high-level system techniques tend to generalise well over several domains [5], [6]. The NIST Rich Transcription speaker evaluations [7] have primarily used both broadcast news and meeting data, whereas the NIST speaker recognition evaluations [8] have primarily used conversational telephone speech with summed sides (a.k.a. two-wire).

The diarisation task is also defined by the amount of specific prior knowledge allowed. There may be specific prior knowledge via example speech from the speakers in the audio, such as in a recording of a regular staff meeting. The task then becomes more like speaker detection or tracking tasks [9]. Specific prior knowledge could also be example speech from just a few of the speakers, such as common anchors on particular news stations, or knowledge of the number of speakers in the audio, perhaps for a teleconference over a known number of lines, or maybe the structure of the audio recording (e.g. music followed by story). However, for a more portable speaker diarisation system, it is desired to operate without any specific prior knowledge of the audio. This is the general task definition used in the Rich Transcription diarisation evaluations, where only the broadcaster and date of broadcast are known in addition to having the audio data, and we adopt this scenario when discussing speaker diarisation systems.

The aim of this paper is to provide an overview of current speaker diarisation approaches and to discuss performance and potential applications. In the next section we outline the general framework of diarisation systems and discuss different implementations of the key components within current systems. Performance is then measured in Section III in terms of the diarisation error rate (DER) using the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarisation evaluation data. Section IV looks at the use of these methods in real applications and the future directions for diarisation research.

    II. DIARISATION SYSTEM FRAMEWORK

In this section we review the key sub-tasks used to build current speaker diarisation systems. Most diarisation systems perform these tasks separately, although it is possible to perform some of the stages jointly (for example speaker segmentation and clustering), and the ordering of the stages often varies from system to system. A prototypical combination of the key components of a diarisation system is shown in Figure 2. For each task we provide a brief description of the common approaches employed and some of the issues in applying them.

    A. Speech Detection

The aim of this step is to find the regions of speech in the audio stream. Depending on the domain data being used, non-speech regions to be discarded can consist of many acoustic phenomena such as silence, music, room noise, background noise or cross-talk.

Fig. 2. A prototypical diarisation system. Most diarisation systems have components to perform speech detection, gender and/or bandwidth segmentation, speaker segmentation, speaker clustering and final resegmentation or boundary refinement.

The general approach used is maximum likelihood classification with Gaussian Mixture Models (GMMs) trained on labelled training data, although different class models can be used, such as multi-state HMMs. The simplest system uses just speech/non-speech models such as in [10], whilst [11] is similar but four speech models are used for the possible gender/bandwidth combinations. Noise and music are explicitly modelled in [12]–[14], which have classes for speech, music, noise, speech+music and speech+noise, whilst [15], [16] use wideband speech, narrowband speech, music and speech+music. The extra speech+xx models are used to help minimise the false rejection of speech occurring in the presence of music or noise, and this data is subsequently reclassified as speech.
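As an illustration of this maximum likelihood classification step, the sketch below fits one GMM per labelled class and assigns each frame to the highest-likelihood model. It uses scikit-learn's GaussianMixture; the feature layout, number of mixture components and class names are illustrative assumptions rather than details of any cited system.

```python
# Minimal sketch of GMM-based maximum likelihood speech/non-speech
# classification.  Model sizes and class names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_models(features_by_class, n_components=32, seed=0):
    """Fit one GMM per labelled class, e.g. 'speech', 'music', 'speech+music'."""
    models = {}
    for label, frames in features_by_class.items():   # frames: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        models[label] = gmm.fit(frames)
    return models

def classify_frames(models, frames):
    """Label each frame with the class whose model gives the highest log-likelihood."""
    labels = list(models)
    scores = np.stack([models[lab].score_samples(frames) for lab in labels])
    best = scores.argmax(axis=0)
    # Frames labelled speech+music or speech+noise would subsequently be
    # relabelled as speech, as described in the text.
    return [labels[i] for i in best]
```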

When operating on un-segmented audio, Viterbi segmentation (single pass or iterative with optional adaptation) using the models is employed to identify speech regions. If an initial segmentation is already available (for example the ordering of the key components may allow change point detection before non-speech removal), each segment is individually classified. Minimum length constraints [10] and heuristic smoothing rules [11], [14] may also be applied. An alternative approach, which does not use Viterbi decoding but instead a best model search with morphological rules, is described in [17].
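The following is a minimal two-state Viterbi decode of the kind mentioned above, in which a log-domain switching penalty stands in for the minimum-length and smoothing constraints. The per-frame log-likelihoods are assumed to come from speech and non-speech models such as the GMMs sketched earlier; the penalty value is an illustrative tuning parameter, not one taken from any cited system.

```python
import numpy as np

def viterbi_speech_nonspeech(loglik_nonspeech, loglik_speech, switch_penalty=20.0):
    """Two-state Viterbi decode over per-frame log-likelihoods.

    switch_penalty is a log-domain cost for changing state; it discourages very
    short segments, playing the role of a minimum-length/smoothing constraint.
    Returns a boolean array, True where a frame is decoded as speech.
    """
    ll = np.stack([loglik_nonspeech, loglik_speech])   # (2, T); state 1 = speech
    T = ll.shape[1]
    delta = ll[:, 0].copy()                            # best score ending in each state
    back = np.zeros((2, T), dtype=int)
    for t in range(1, T):
        new_delta = np.empty(2)
        for s in (0, 1):
            stay, switch = delta[s], delta[1 - s] - switch_penalty
            back[s, t] = s if stay >= switch else 1 - s
            new_delta[s] = max(stay, switch) + ll[s, t]
        delta = new_delta
    path = np.empty(T, dtype=int)                      # backtrace
    path[-1] = int(delta.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path.astype(bool)
```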

Silence can be removed in this early stage, using a phone recogniser (as in [16]) or energy constraint, or in a final stage of processing using a word recogniser (as in [13]) or energy constraint (as in the MIT system for RT-03 [18]). Regions which contain commercials, and thus are of no interest for the final output, can also be automatically detected and removed at this early stage [4], [18].

For broadcast news audio, speech detection performance is typically about 1% miss (speech in the reference but not in the hypothesis) and 1-2% false alarm (speech in the hypothesis but not in the reference). When the speech detection phase is run early in a system, or the output is required for further processing such as for transcription, it is more important to minimise speech miss than false alarm rates, since the former are unrecoverable errors in most systems. However, the diarisation error rate (DER), used to evaluate speaker diarisation performance, treats both forms of error equally.

For telephone audio, typically some form of standard energy/spectrum based speech activity detection is used, since non-speech tends to be silence or noise sources, although the GMM approach has also been successful in this domain [19].

For meeting audio, the non-speech can be from a variety of noise sources, like paper shuffling, coughing, laughing, etc. When supported, multiple channel meeting audio can be used to help speech activity detection [20].

    B. Change Detection

The aim of this step is to find points in the audio stream likely to be change points between audio sources. If the input to this stage is the un-segmented audio stream, then the change detection looks for both speaker and speech/non-speech change points. If a speech detector or gender/bandwidth classifier has been run first, then the change detector looks for speaker change points within each speech segment.

Two main approaches have been used for change detection. They both involve looking at adjacent windows of data and calculating a distance metric between the two, then deciding whether the windows originate from the same or a different source. The differences between them lie in the choice of distance metric and thresholding decisions.

The first general approach to change detection, used in [14], is a variation on the Bayesian Information Criterion (BIC) technique introduced in [21]. This technique searches for change points within a window using a penalised likelihood ratio test of whether the data in the window is better modelled by a single distribution (no change point) or two different distributions (change point). If a change is found, the window is reset to the change point and the search restarted. If no change point is found, the window is increased and the search is redone. Some of the issues in applying the BIC change detector are: (a) it has high miss rates on detecting short turns (< 2-5 seconds), so can be problematic to use on fast interchange speech like conversations, and (b) the full search implementation is computationally expensive (order N²), so most systems employ some form of computation reduction (e.g. [22]).
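A sketch of the penalised likelihood ratio (ΔBIC) test at a single candidate change point is given below, assuming full-covariance Gaussian models over multi-dimensional cepstral features. The window growing/resetting search and the value of the penalty weight are left out; both are tuned in practice as discussed above.

```python
import numpy as np

def n_logdet(frames):
    """N * log|S| for a full-covariance Gaussian fitted to the frames."""
    cov = np.cov(frames, rowvar=False, bias=True)
    return len(frames) * np.linalg.slogdet(cov)[1]

def delta_bic_change(window, change, lam=1.0):
    """Penalised likelihood ratio test for a change point at frame `change`.

    `window` is an (n_frames x dim) array of feature vectors.  Positive values
    favour two distributions (a change point); negative values favour a single
    Gaussian (no change).  lam is the penalty weight.
    """
    x, y = window[:change], window[change:]
    d = window.shape[1]
    penalty = 0.5 * (d + d * (d + 1) / 2.0) * np.log(len(window))
    return 0.5 * (n_logdet(window) - n_logdet(x) - n_logdet(y)) - lam * penalty
```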

A second technique, used first in [23] and later in [12], [16], [24], uses fixed length windows and represents each window by a Gaussian and the distance between them by the Gaussian divergence (symmetric KL-2 distance). The step-by-step implementation in [17] is similar but uses the generalised log likelihood ratio as the distance metric. The peaks in the distance function are then found and define the change points if their absolute value exceeds a pre-determined threshold chosen on development data. Smoothing the distance distribution or eliminating the smaller of neighbouring peaks within a certain minimum duration prevents the system from overgenerating change points at true boundaries. Single Gaussians are generally preferred to GMMs due to the simplified distance calculations. Typical window sizes are 1-2 or 2-5 seconds when using a diagonal or full covariance Gaussian respectively. As with BIC, the window length constrains the detection of short turns.
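The sketch below estimates diagonal-covariance Gaussians on two adjacent fixed-length windows, computes the symmetric KL (KL-2) distance between them, and picks peaks above a threshold. The window length, default threshold heuristic and peak handling are illustrative simplifications, not the settings of any cited system.

```python
import numpy as np

def kl2_diag(x, y, eps=1e-6):
    """Symmetric KL (KL-2) distance between diagonal Gaussians fitted to x and y."""
    mx, vx = x.mean(axis=0), x.var(axis=0) + eps
    my, vy = y.mean(axis=0), y.var(axis=0) + eps
    d = x.shape[1]
    return 0.5 * np.sum(vx / vy + vy / vx + (mx - my) ** 2 * (1 / vx + 1 / vy)) - d

def change_points(feats, win=100, threshold=None):
    """Slide two adjacent windows over the features, compute the distance curve
    and keep local maxima above a threshold (normally chosen on development
    data; the default here is only an illustrative fallback)."""
    dist = np.array([kl2_diag(feats[t - win:t], feats[t:t + win])
                     for t in range(win, len(feats) - win)])
    if threshold is None:
        threshold = dist.mean() + dist.std()
    return [t + win for t in range(1, len(dist) - 1)
            if dist[t] > max(dist[t - 1], dist[t + 1]) and dist[t] > threshold]
```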

Since the change point detection often only provides an initial base segmentation for diarisation systems, which will be clustered and often resegmented later, being able to run the change point detection very fast (typically less than 0.01×RT for a diagonal covariance system) is often more important than any performance degradation. In fact [10] and [17] found no significant performance degradation when using a simple initial uniform segmentation within their systems.

Both change detection techniques require a detection threshold to be empirically tuned for changes in audio type and features. Tuning the change detector is a tradeoff between the desire to have long, pure segments to aid in initialising the clustering stage, and minimising missed change points which produce contamination in the clustering.

Alternatively, or in addition, a word or phone decoding step with heuristic rules may be used to help find putative speaker change points, such as in the Cambridge RT-03 system [18]. However, this approach over-segments the speech data and requires some additional merging and clustering to form viable speech segments, and can miss boundaries in fast speaker interchanges where there is no silence or gender change between speakers.

    C. Gender/Bandwidth Classification

The aim of this stage is to partition the segments into common groupings of gender (male or female) and bandwidth (low-bandwidth: narrowband/telephone or high-bandwidth: studio). This is done to reduce the load on subsequent clustering, provide more flexibility in clustering settings (for example female speakers may have different optimal parameter settings to male speakers), and supply more side information about the speakers in the final output. If the partitioning can be done very accurately, and assuming no speaker appears in the same broadcast in different classes (for example both in the studio and via a pre-recorded field report), then performing this partitioning early on in the system can also help improve performance whilst reducing the computational load [25].

The potential drawback of this partitioning stage, however, is that if a subset of a speaker's segments is misclassified the errors can be unrecoverable, although it is possible to allow these classifications to change in a subsequent re-segmentation stage, such as in [17].

Classification for both gender and bandwidth is typically done using maximum likelihood classification with GMMs trained on labelled training data. Either two classifiers are run (one for gender and one for bandwidth) or joint models for gender and bandwidth are used. This can be done either in conjunction with the speech/non-speech detection process or after the initial segmentation. Bandwidth classification can also be done using a test on the ratio of spectral energy above and below 4 kHz. An alternative method of gender classification, used in [16], aligns the word recognition output of a fast ASR system with gender dependent models and assigns the most likely gender to each segment. This has a high accuracy but is unnecessarily computationally expensive if a speech recognition output is not already available, and segments ideally should be of a reasonable size (typically between 1 and 30 seconds). Gender classification error rates are around 1-2% and bandwidth classification error rates are around 3-5% for broadcast news audio.
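As a small illustration of the spectral energy ratio test mentioned above, the sketch below labels a segment as narrowband when only a small fraction of its energy lies above 4 kHz. The threshold value is an assumed illustration that would be tuned on labelled data.

```python
import numpy as np

def is_narrowband(signal, sample_rate=16000, threshold=0.02):
    """Return True when the fraction of spectral energy above 4 kHz is small,
    suggesting narrowband (telephone) rather than studio audio."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    high = spectrum[freqs >= 4000].sum()
    total = spectrum.sum() + 1e-12
    return (high / total) < threshold
```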

    D. Clustering

The purpose of this stage is to associate or cluster segments from the same speaker together. The clustering ideally produces one cluster for each speaker in the audio, with all segments from a given speaker in a single cluster. The predominant approach used in diarisation systems is hierarchical, agglomerative clustering with a BIC based stopping criterion [21], consisting of the following steps:

0) Initialise leaf clusters of tree with speech segments.
1) Compute pair-wise distances between each cluster.
2) Merge closest clusters.
3) Update distances of remaining clusters to new cluster.
4) Iterate steps 1-3 until stopping criterion is met.

The clusters are generally represented by a single full covariance Gaussian [5], [11], [14], [16], [24], but GMMs have also been used [10], [17], [26], sometimes being built using mean-only MAP adaptation of a GMM of the entire test file to each cluster for increased robustness. The standard distance metric between clusters is the generalised likelihood ratio. It is possible to use other representations or distance metrics, but these have been found the most successful within the BIC clustering framework.

The stopping criterion compares the BIC statistic from the two clusters being considered, x and y, with that of the parent cluster, z, should they be merged, the formulation being for the full covariance Gaussian case:

\[ \mathrm{BIC} = L - \tfrac{1}{2}\,\lambda\, M \log N \]

\[ \Delta\mathrm{BIC} = \tfrac{1}{2}\left[\, N_z \log|S_z| - N_x \log|S_x| - N_y \log|S_y| \,\right] - \lambda P, \qquad P = \frac{d(d+3)}{4}\,\log N_z \]

where L is the log likelihood of the data given the model, M is the number of free parameters, N the number of frames, S the covariance matrix and d the dimension of the feature vector. If the pair of clusters is best described by a single full covariance Gaussian, the ΔBIC will be low, whereas if there are two separate distributions, implying two speakers, the ΔBIC will be high. For each step, the pair of clusters with the lowest ΔBIC is merged and the statistics are recalculated. The process is generally stopped when the lowest ΔBIC is greater than a specified threshold, usually 0. The use of the number of frames in the parent cluster, N_z, in the penalty factor, P, represents a local BIC decision, i.e. just considering the clusters being combined. This has been shown to perform better than the corresponding global BIC implementation which uses the number of frames in the whole show, N_f, instead [18], [24], [27].
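A sketch of this local ΔBIC merge criterion for full-covariance Gaussian clusters, together with the greedy merge loop of steps 1-4 above, is given below. The penalty weight λ, the stopping threshold of 0 and the pooling of raw feature frames per cluster follow the description above; everything else (data layout, greedy search) is an illustrative simplification rather than any cited system's implementation.

```python
import numpy as np

def n_logdet(frames):
    """N * log|S| for the full-covariance Gaussian fitted to `frames`."""
    cov = np.cov(frames, rowvar=False, bias=True)
    return len(frames) * np.linalg.slogdet(cov)[1]

def delta_bic_merge(cluster_x, cluster_y, lam=1.0):
    """Local ΔBIC between two clusters of feature frames (n_frames x dim).

    Low values indicate that the pair is well modelled by one Gaussian (same
    speaker).  lam is the penalty weight λ from the formula above.
    """
    z = np.vstack([cluster_x, cluster_y])
    d = z.shape[1]
    penalty = (d * (d + 3) / 4.0) * np.log(len(z))
    return 0.5 * (n_logdet(z) - n_logdet(cluster_x) - n_logdet(cluster_y)) - lam * penalty

def agglomerative_bic(clusters, lam=1.0):
    """Greedy agglomeration (steps 1-4): merge the lowest-ΔBIC pair, stop at 0."""
    clusters = [np.asarray(c) for c in clusters]
    while len(clusters) > 1:
        scored = [(delta_bic_merge(clusters[i], clusters[j], lam), i, j)
                  for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        best, i, j = min(scored)
        if best > 0.0:                      # stopping criterion
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```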

Slight variations of this technique have also been used. For example, the system described in [10], [28] removes the need for tuning the penalty weight, λ, on development data, by ensuring that the number of parameters (M) in the merged and separate distributions are equal, although the base number of Gaussians, and hence the number of free parameters, needs to be chosen carefully for optimal effect. Alternatives to the penalty term, such as using a constant [29], or a weighted sum of the number of clusters and number of segments [12], have also had moderate success, but the BIC method has generally superseded these. Adding a Viterbi resegmentation between multiple iterations of clustering [24] or within a single iteration [10] has also been used to increase performance at the penalty of increased computational cost.

An alternative approach described in [30] uses a Euclidean distance between MAP-adapted GMMs and notes this is highly correlated with a Monte-Carlo estimation of the Gaussian divergence (symmetric KL-2) distance whilst also being an upper bound to it. The stopping criterion uses a fixed threshold, chosen on the development data, on the distance metric. The performance is comparable to the more conventional BIC method.

A further method described in [14] uses proxy speakers. A set of proxy models is applied to map segments into a vector space, then a Euclidean distance metric and an ad hoc occupancy stopping criterion are used, but the overall clustering framework remains the same. The proxy models can be built by adapting a universal background model (UBM), such as a 128 mixture GMM, to the test data segments themselves, thus making the system portable to different shows and domains whilst still giving a consistent performance gain over the BIC method.

Regardless of the clustering employed, the stopping criterion is critical to good performance and depends on how the output is to be used. Under-clustering fragments speaker data over several clusters, whilst over-clustering produces contaminated clusters containing speech from several speakers. For indexing information by speaker, both are suboptimal. However, when using cluster output to assist in speaker adaptation of speech recognition models, under-clustering may be suitable when a speaker occurs in multiple acoustic environments, and over-clustering may be advantageous in aggregating speech from similar speakers or acoustic environments.

    E. Joint Segmentation and Clustering

An alternative approach to running segmentation and clustering stages separately is to use an integrated scheme. This was first done in [12] by employing a Viterbi decode between iterations of agglomerative clustering, but an initial segmentation stage was still required. A more recent completely integrated scheme, based on an evolutive-HMM (E-HMM) where detected speakers help influence both the detection of other speakers and the speaker boundaries, was introduced in [31] and developed in [17], [32]. The recording is represented by an ergodic HMM in which each state represents a speaker and the transitions model the changes between speakers. The initial HMM contains only one state and represents all of the data. In each iteration, a short speech segment assumed to come from a non-detected speaker is selected and used to build a new speaker model by Bayesian adaptation of a UBM. A state is then added to the HMM to reflect this new speaker and the transition probabilities are modified accordingly. A new segmentation is then generated from a Viterbi decode of the data with the new HMM, and each model is adapted using the new segmentation. This resegmentation phase is repeated until the speaker labels no longer change. The process of adding new speakers is repeated until there is no gain in terms of comparable likelihood or there is no data left to form a new speaker. The main advantages of this integrated approach are to use all the information at each step and to allow the use of speaker recognition based techniques, like Bayesian adaptation of the speaker models from a UBM.

    F. Cluster Re-combination

In this relatively recent approach [24], state-of-the-art speaker recognition modelling and matching techniques are used as a secondary stage for combining clusters. The signal processing and modelling used in the clustering stage of section II-D are usually simple: no channel compensation, such as RASTA, since we wish to take advantage of common channel characteristics among a speaker's segments, and limited parameter distribution models, since the model needs to work with small amounts of data in the clusters at the start.

With cluster recombination, clustering is run to under-cluster the audio but still produce clusters with a reasonable amount of speech (> 30s). A UBM is built on training data to represent general speakers. Feature normalisation is applied to help reduce the effect of the acoustic environment. Maximum A Posteriori (MAP) adaptation (usually mean-only) is then applied on each cluster from the UBM to form a single model per cluster. The cross likelihood ratio (CLR) between any two given clusters is defined [24], [33]:

\[ \mathrm{CLR}(c_i, c_j) = \log \frac{L(x_i\,|\,j)\; L(x_j\,|\,i)}{L(x_i\,|\,\mathrm{ubm})\; L(x_j\,|\,\mathrm{ubm})} \]

where L(x_i|j) is the average likelihood per frame of the data x_i given the model of cluster j. The pair of clusters with the highest CLR is merged and a new model is created. The process is repeated until the highest CLR is below a predefined threshold chosen from development data. Because of the computational load at this stage, each gender/bandwidth combination is usually processed separately, which also allows more appropriate UBMs to be used for each case.
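The sketch below shows mean-only MAP adaptation of a scikit-learn GMM used as the UBM, and the CLR score between two clusters computed from average log-likelihoods per frame. The relevance factor and model sizes are assumed values, and feature normalisation is taken to have been applied to the inputs beforehand; this is an illustration of the mechanism rather than any cited system's implementation.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, frames, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one cluster's feature frames."""
    frames = np.asarray(frames)
    gamma = ubm.predict_proba(frames)                  # (n_frames, n_components)
    n_k = gamma.sum(axis=0)[:, None]                   # soft occupation counts
    f_k = gamma.T @ frames                             # first-order statistics
    adapted = copy.deepcopy(ubm)
    adapted.means_ = (f_k + relevance * ubm.means_) / (n_k + relevance)
    return adapted

def clr(ubm, frames_i, frames_j, relevance=16.0):
    """Cross likelihood ratio between two clusters, using average per-frame
    log-likelihoods (GaussianMixture.score)."""
    model_i = map_adapt_means(ubm, frames_i, relevance)
    model_j = map_adapt_means(ubm, frames_j, relevance)
    return (model_j.score(frames_i) - ubm.score(frames_i)
            + model_i.score(frames_j) - ubm.score(frames_j))
```

In use, the pair of clusters with the highest CLR would be merged (with a new model adapted on the pooled frames) until the highest value falls below a development-set threshold, as described above.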

Different types of feature normalisation have been used with this process, namely RASTA-filtered cepstra with 10s feature mean and variance normalisation [14] and feature warping [34] using a sliding window of 3 seconds [13], [16]. In [16] it was found that the feature normalisation was necessary to get significant gain from the cluster recombination technique.

When the clusters are merged, a new speaker model can be trained with the combined data and distances updated (as in [13], [16]) or standard clustering rules can be used with a static distance matrix (as in [14]). This recombination can be viewed as fusing intra- and inter-audio-file [33] speaker clustering techniques. On the RT-04F evaluation it was found that this stage significantly improves performance, with further improvements being obtained subsequently by using a variable prior iterative MAP approach for adapting the UBMs, and building new UBMs including all of the test data [16].

    G. Re-segmentation

The last stage found in many diarisation systems is a re-segmentation of the audio via Viterbi decoding (with or without iterations) using the final cluster models and non-speech models. The purpose of this stage is to refine the original segment boundaries and/or to fill in short segments that may have been removed for more robust processing in the clustering stage. Filtering the segment boundaries using a word or phone recogniser output can also help reduce the false-alarm component of the error rate [24].

    H. Finding Identities

Although current diarisation systems are only evaluated using relative speaker labels (such as spkr1), it is often possible to find the true identities of the speakers (such as Ted Koppel). This can be achieved by a variety of methods, such as building speaker models for people who are likely to be in the news broadcasts (such as prominent politicians or main news anchors and reporters) and including these models in the speaker clustering stage, or running speaker-tracking systems.

An alternative approach, introduced in [35], uses linguistic information contained within the transcriptions to predict the previous, current or next speaker. Rules are defined based on category and word N-grams chosen from the training data, and are then applied sequentially on the test data until the speaker names have been found. Blocking rules are used to stop rules firing in certain contexts; for example the sequence "[name] reports *" assigns the next speaker to be [name] unless * is the word "that". An extension of this system, described in [36], learns many rules and their associated probability of being correct automatically from the training data and then applies these simultaneously on the test data using probabilistic combination. Using automatic transcriptions and automatically found speaker turns naturally degrades performance, but potentially 85% of the time can be correctly assigned to the true speaker identity using this method.
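As a toy illustration of this kind of rule, the sketch below applies a single "[name] reports" pattern with the blocking word "that". Real systems derive many such rules and their reliabilities from training data; the pattern, names and sentences here are purely hypothetical.

```python
import re

# One illustrative rule: "<name> reports" predicts <name> as the next speaker,
# unless the word immediately after "reports" is "that" (a blocking context).
RULE = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+) reports\b(?!\s+that\b)")

def predict_next_speaker(transcript_segment):
    match = RULE.search(transcript_segment)
    return match.group(1) if match else None

print(predict_next_speaker("ABC's John Smith reports from Washington."))  # John Smith
print(predict_next_speaker("John Smith reports that the vote failed."))   # None
```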

Although primarily used for identifying the speaker names given a set of speaker clusters, this technique can associate the same name with more than one input cluster and therefore could be thought of as a high-level cluster-combination stage.

    I. Combining Different Diarisation Methods

Combining methods used in different diarisation systems could potentially improve performance over the best single diarisation system. It has been shown that the word error rate (WER) of an automatic speech recogniser can be consistently reduced when combining multiple segmentations, even if the individual segmentations themselves do not offer state-of-the-art performance in either DER or resulting WER [37]. Indeed it seems that diversity between the segmentation methods is just as important as the segmentation quality when being combined. It is expected that gains in DER are also possible by combining different diarisation modules or systems.

Several methods of combining aspects of different diarisation systems have been tried, for example the hybridization or piped CLIPS/LIA systems of [26], [38] and the Plug and Play CUED/MIT-LL system of [18], which both combine components of different systems together. A more integrated merging method is described in [38], whilst [26] describes a way of using the 2002 NIST speaker segmentation error metric to find regions in two inputs which agree and then uses these to train potentially more accurate speaker models. These systems generally produce performance gains, but tend to place some restriction on the systems being combined, such as the required architecture or equalising the number of speakers. An alternative approach introduced in [39] uses a cluster voting technique to compare the output of arbitrary diarisation systems, maintaining areas of agreement and voting using confidences or an external judging scheme in areas of conflict.

    III. EVALUATION OF PERFORMANCE

In this section we briefly describe the NIST RT-04F speaker diarisation evaluation and present the results when using the key techniques discussed in this paper on the RT-04F diarisation evaluation data.

A. Speaker Diarisation Error Measure

A system hypothesizes a set of speaker segments, each of which consists of a (relative) speaker-id label such as Mspkr1 or Fspkr2 and the corresponding start and end times. This is then scored against a reference ground-truth speaker segmentation which is generated using the rules given in [40]. Since the hypothesis speaker labels are relative, they must be matched appropriately to the true speaker names in the reference. To accomplish this, a one-to-one mapping of the reference speaker IDs to the hypothesis speaker IDs is performed so as to maximise the total overlap of the reference and (corresponding) mapped hypothesis speakers. Speaker diarisation performance is then expressed in terms of the miss (speaker in the reference but not in the hypothesis), false alarm (speaker in the hypothesis but not in the reference), and speaker-error (mapped reference speaker is not the same as the hypothesized speaker) rates. The overall diarisation error (DER) is the sum of these three components. A complete description of the evaluation measure and scoring software implementing it can be found at http://nist.gov/speech/tests/rt/rt2004/fall.
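The sketch below computes a simplified frame-level DER: it finds the one-to-one speaker mapping that maximises overlap (via the Hungarian algorithm) and then accumulates miss, false alarm and speaker error time. The official NIST scoring tool additionally handles no-score collars around segment boundaries and overlapping speech, which are ignored in this illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def der(ref, hyp):
    """Frame-level DER.  ref/hyp are integer speaker labels per frame; -1 = non-speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    ref_ids, hyp_ids = np.unique(ref[ref >= 0]), np.unique(hyp[hyp >= 0])
    # Overlap (in frames) between every reference and hypothesis speaker
    overlap = np.array([[np.sum((ref == r) & (hyp == h)) for h in hyp_ids]
                        for r in ref_ids])
    # One-to-one mapping maximising total overlap (Hungarian algorithm)
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {hyp_ids[c]: ref_ids[r] for r, c in zip(rows, cols)}

    ref_speech, hyp_speech = ref >= 0, hyp >= 0
    miss = np.sum(ref_speech & ~hyp_speech)
    falarm = np.sum(~ref_speech & hyp_speech)
    both = ref_speech & hyp_speech
    spkerr = np.sum([mapping.get(h, None) != r for r, h in zip(ref[both], hyp[both])])
    return (miss + falarm + spkerr) / max(np.sum(ref_speech), 1)

# One speaker-error frame and one false-alarm frame over four reference speech frames:
print(der([0, 0, 1, 1, -1], [5, 5, 5, 6, 6]))   # 0.5
```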

It should be noted that this measure is time-weighted, so the DER is primarily driven by (relatively few) loquacious speakers, and it is therefore more important to get the main speakers complete and correct than to accurately find speakers who do not speak much. This scenario models some tasks, such as tracking anchor speakers in Broadcast News for text summarisation, but there may be other tasks (such as speaker adaptation within automatic transcription, or ascertaining the opinions of several speakers in a quick debate) for which it is less appropriate. The same formulation can be modified to be speaker weighted instead of time weighted if necessary, but this is not discussed here. The utility of either weighting depends on the application of the diarisation output.

    B. Data

The RT-04F speaker diarisation data consists of one 30 minute extract from each of 12 different US broadcast news shows. These were derived from TV shows: three from ABC, three from CNN, two from CNBC, two from PBS, one from CSPAN and one from WBN. The style of show varied from a set of lectures from a few speakers (CSPAN) to rapid headline news reporting (CNN Headline News). Details of the exact composition of the data sets can be found in [40].

    C. Results

The results from the main diarisation techniques are shown in Figure 3. Using a top-down clustering approach with full covariance models, AHS distance metric and BIC stopping criterion gave a DER of between 20.5% and 22.5% [29]. The corresponding performance on the 6-show RT diarisation development data sets ranged from 15.9% to 26.9%, showing that the top-down method seems more unpredictable than the agglomerative method. This is thought to be because the initial clusters contain many speakers and segments may thus be assigned incorrectly early on, leading to an unrecoverable error. In contrast, the agglomerative scheme grows clusters from the original segments and should not contain impure multi-speaker clusters until very late in the clustering process. The agglomerative clustering BIC-based scheme got around 17-18% DER [10], [14], [16], [24], with Viterbi re-segmentation between each step providing a slight benefit to 16.4% [10]. Further improvements to around 13% were made using CLR cluster recombination and resegmentation [14]. The CLR cluster recombination stage which included feature warping produced a further reduction to around 8.5-9.5% [13], [16], and using the whole of the RT-04F evaluation data in the UBM build of the CLR cluster recombination stage gave a final performance of 6.9% [16]. The proxy model technique performed better than the equivalent BIC stages, giving 14.1% initially and 11.0% after CLR cluster recombination and resegmentation [14]. The E-HMM system, despite not being tuned on the US Broadcast News development sets, gave 16.1% DER.

Fig. 3. DERs (%) for different methods on the RT-04F evaluation data: top-down BIC, bottom-up BIC, bottom-up BIC + CLR recluster (MIT and LIMSI versions), proxy models, proxy models + CLR recluster (MIT) and E-HMM*. Each dot represents a different version of a system built using the indicated core technique. *E-HMM not tuned on the US Broadcast News development sets.

For the system with the lowest DER, the per-show results are given in Figure 4. Typical of most systems, there is a large variability in performance over the shows, reflecting the variability in the number of speakers, the dominance of speakers, and the style and structure of the speech. Most of the variability is from the speaker error component, due to over- or under-clustering. Reducing this variability is a focus of on-going work. Certain techniques or parameter settings can perform better for different styles of show. Systems may potentially be improved either by automatically detecting the type of show and modifying the choice of techniques or parameters accordingly, or by combining different systems directly, as discussed in section II-I.

Fig. 4. DER per show (miss, false alarm and speaker error components) for the system with lowest DER. There is a large variability between the different styles of show.

    IV. CONCLUSION AND FUTURE DIRECTIONS

There has been tremendous progress in task definition, data availability, scoring measures and technical approaches for speaker diarisation over recent years. The methods developed for Broadcast News diarisation are now being deployed across other domains, for example finding speakers within meeting data [6] and improving speaker recognition performance using multi-speaker train and test data in conversational telephone speech [14]. Other tasks, such as finding true speaker identities (see section II-H) or speaker tracking (e.g. in [41]), are increasingly using diarisation output as a starting point, and performance is approaching a level where it can be considered useful for real human interaction.

Additions to applications which display audio and optionally transcriptions, such as wavesurfer¹ (see Figure 5) or transcriber² (see Figure 6), allow users to see the current speaker information, understand the general flow of speakers throughout the broadcast or search for a particular speaker within the show. Experiments are also underway to ascertain if additional tasks, such as the process of annotating data, can be facilitated using diarisation output.

¹ Wavesurfer is available from www.speech.kth.se/wavesurfer
² Transcriber is available from http://trans.sourceforge.net/

Fig. 5. Accessing Broadcast News Audio from automatically derived speaker names via a wavesurfer plugin.

Fig. 6. Diarisation information, including automatically found speaker identities, being used in the Transcriber tool to improve readability and facilitate searching by speaker.

The diarisation tasks of the future will cover a wider scope than currently, both in terms of the amount of data (hundreds of hours) and the information required (speaker identity, speaker characteristics, or potentially even emotion). Current techniques will provide a firm base to start from, but new methods, particularly combining information from many different approaches (as is currently done in the speaker recognition field), will need to be developed to allow diarisation to be maximally beneficial to real users and potential downstream processing such as machine translation and parsing. Additionally, further development of tools to allow user interaction with diarisation output for specific jobs will help focus the research to contribute to high-utility human language technology.

    V. ACKNOWLEDGEMENTS

The authors would like to thank Claude Barras, Jean-Francois Bonastre, Corinne Fredouille, Patrick Nguyen and Chuck Wooters for their help in the construction of this paper.

    REFERENCES

[1] D. A. Reynolds and P. Torres-Carrasquillo, "Approaches and Applications of Audio Diarization," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. V, Philadelphia, PA, March 2005, pp. 953–956.


[2] J. Saunders, "Real-time Discrimination of Broadcast Speech/Music," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. II, Atlanta, GA, May 1996, pp. 993–996.

[3] Z. Liu, Y. Wang, and T. Chen, "Audio Feature Extraction and Analysis for Scene Segmentation and Classification," Journal of VLSI Signal Processing Systems, vol. 20, no. 1-2, pp. 61–79, October 1998.

[4] S. E. Johnson and P. C. Woodland, "A Method for Direct Audio Search with Applications to Indexing and Retrieval," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, Istanbul, Turkey, June 2000, pp. 1427–1430.

[5] Y. Moh, P. Nguyen, and J.-C. Junqua, "Towards Domain Independent Clustering," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. II, Hong Kong, April 2003, pp. 85–88.

[6] X. Anguera, C. Wooters, B. Peskin, and M. Aguilo, "Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System," in Proceedings of MLMI/NIST Workshop, Edinburgh, UK, July 2005, to appear.

[7] NIST, "Benchmark Tests: Rich Transcription (RT)." [Online]. Available: http://www.nist.gov/speech/tests/rt/

[8] NIST, "Benchmark Tests: Speaker Recognition." [Online]. Available: http://www.nist.gov/speech/tests/spk/

[9] A. Martin and M. Przybocki, "Speaker Recognition in a Multi-Speaker Environment," in Proc. European Conference on Speech Communication and Technology (Eurospeech), vol. 2, Aalborg, Denmark, September 2001, pp. 787–790.

[10] C. Wooters, J. Fung, B. Peskin, and X. Anguera, "Towards Robust Speaker Segmentation: The ICSI-SRI Fall 2004 Diarization System," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[11] P. Nguyen, L. Rigazio, Y. Moh, and J.-C. Junqua, "Rich Transcription 2002 Site Report. Panasonic Speech Technology Laboratory (PSTL)," in Proc. 2002 Rich Transcription Workshop (RT-02), Vienna, VA, May 2002.

[12] J.-L. Gauvain, L. Lamel, and G. Adda, "Partitioning and Transcription of Broadcast News Data," in Proc. Int. Conference on Spoken Language Processing (ICSLP), vol. 4, Sydney, Australia, December 1998, pp. 1335–1338.

[13] X. Zhu, C. Barras, S. Meignier, and J.-L. Gauvain, "Combining Speaker Identification and BIC for Speaker Diarization," in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005, pp. 2441–2444.

[14] D. A. Reynolds and P. Torres-Carrasquillo, "The MIT Lincoln Laboratory RT-04F Diarization Systems: Applications to Broadcast Audio and Telephone Conversations," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[15] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, and S. J. Young, "Segment Generation and Clustering in the HTK Broadcast News Transcription System," in Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998, pp. 133–137.

[16] R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, "The Cambridge University March 2005 Speaker Diarisation System," in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005, pp. 2437–2440.

[17] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, and L. Besacier, "Step-by-Step and Integrated Approaches in Broadcast News Speaker Diarization," Computer Speech and Language, to appear, 2005.

[18] S. E. Tranter and D. A. Reynolds, "Speaker Diarisation for Broadcast News," in Proc. Odyssey Speaker and Language Recognition Workshop, Toledo, Spain, June 2004, pp. 337–344.

[19] S. E. Tranter, K. Yu, G. Evermann, and P. C. Woodland, "Generating and Evaluating Segmentations for Automatic Speech Recognition of Conversational Telephone Speech," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. I, Montreal, Canada, May 2004, pp. 753–756.

[20] T. Pfau, D. Ellis, and A. Stolcke, "Multispeaker Speech Activity Detection for the ICSI Meeting Recorder," in Proc. IEEE ASRU Workshop, Trento, Italy, December 2001.

[21] S. S. Chen and P. S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," in Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998, pp. 127–132.

[22] B. Zhou and J. Hansen, "Unsupervised Audio Stream Segmentation and Clustering via the Bayesian Information Criterion," in Proc. Int. Conference on Spoken Language Processing (ICSLP), vol. 3, Beijing, China, October 2000, pp. 714–717.

[23] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic Segmentation, Classification and Clustering of Broadcast News," in Proc. 1997 DARPA Speech Recognition Workshop, Chantilly, VA, February 1997, pp. 97–99.

[24] C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, "Improving Speaker Diarization," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[25] S. Meignier, D. Moraru, C. Fredouille, L. Besacier, and J.-F. Bonastre, "Benefits of Prior Acoustic Segmentation for Automatic Speaker Segmentation," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. I, Montreal, Canada, May 2004, pp. 397–400.

[26] D. Moraru, S. Meignier, L. Besacier, J.-F. Bonastre, and I. Magrin-Chagnolleau, "The ELISA Consortium Approaches in Speaker Segmentation during the NIST 2002 Speaker Recognition Evaluation," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, Hong Kong, April 2003, pp. 89–92.

[27] M. Cettolo, "Segmentation, Classification and Clustering of an Italian Broadcast News Corpus," in Proc. Recherche d'Information Assistée par Ordinateur (RIAO), Paris, France, April 2000.

[28] J. Ajmera and C. Wooters, "A Robust Speaker Clustering Algorithm," in Proc. IEEE ASRU Workshop, St. Thomas, U.S. Virgin Islands, November 2003, pp. 411–416.

[29] S. E. Tranter, M. J. F. Gales, R. Sinha, S. Umesh, and P. C. Woodland, "The Development of the Cambridge University RT-04 Diarisation System," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[30] M. Ben, M. Betser, F. Bimbot, and G. Gravier, "Speaker Diarization using Bottom-up Clustering based on a Parameter-derived Distance between Adapted GMMs," in Proc. Int. Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, October 2004.

[31] S. Meignier, J.-F. Bonastre, C. Fredouille, and T. Merlin, "Evolutive HMM for Multi-Speaker Tracking System," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.

[32] S. Meignier, J.-F. Bonastre, and S. Igounet, "E-HMM Approach for Learning and Adapting Sound Models for Speaker Indexing," in Proc. Odyssey Speaker and Language Recognition Workshop, Crete, Greece, June 2001.

[33] D. Reynolds, E. Singer, B. Carlson, J. O'Leary, J. McLaughlin, and M. Zissman, "Blind Clustering of Speech Utterances Based on Speaker and Language Characteristics," in Proc. Int. Conference on Spoken Language Processing (ICSLP), vol. 7, Sydney, Australia, December 1998, pp. 3193–3196.

[34] J. Pelecanos and S. Sridharan, "Feature Warping for Robust Speaker Verification," in Proc. Odyssey Speaker and Language Recognition Workshop, Crete, Greece, June 2001.

[35] L. Canseco-Rodriguez, L. Lamel, and J.-L. Gauvain, "Speaker Diarization from Speech Transcripts," in Proc. Int. Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, October 2004, pp. 1272–1275.

[36] S. E. Tranter and P. C. Woodland, "Who Really Spoke When? - Finding Speaker Turns and Identities in Broadcast News Audio," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, May 2006, submitted.

[37] D. Y. Kim, M. J. F. Gales, and P. C. Woodland, "Performance Improvement in Broadcast News Transcription," IEEE Trans. on SAP, submitted, 2006.

[38] D. Moraru, S. Meignier, C. Fredouille, L. Besacier, and J.-F. Bonastre, "The ELISA Consortium Approaches in Broadcast News Speaker Segmentation during the NIST 2003 Rich Transcription Evaluation," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, Montreal, May 2004, pp. 373–376.

[39] S. E. Tranter, "Two-way Cluster Voting to Improve Speaker Diarisation Performance," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. I, Philadelphia, PA, March 2005, pp. 753–756.

[40] J. G. Fiscus, J. S. Garofolo, A. Le, A. F. Martin, D. S. Pallett, M. A. Przybocki, and G. Sanders, "Results of the Fall 2004 STT and MDE Evaluation," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, November 2004.

[41] D. Istrate, N. Scheffler, C. Fredouille, and J.-F. Bonastre, "Broadcast News Speaker Tracking for ESTER 2005 Campaign," in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005, pp. 2445–2448.