Soft missing-feature mask generation for robot audition


  • PALADYN Journal of Behavioral Robotics

    Research Article DOI: 10.2478/s13230-010-0005-1 JBR 1(1) 2010 37-47

    Soft missing-feature mask generation for Robot Audition

    Toru Takahashi1, Kazuhiro

    Nakadai23 , Kazunori Komatani1,

    Tetsuya Ogata1, Hiroshi G. Okuno1

    1 Department of Intelligence and Science

    and Technology

    Graduate School of Informatics,

    Kyoto University, Kyoto 606-8501, Japan

    2 Honda Research Institute Japan Co., Ltd.,

    8-1 Honcho, Wako, Saitama 351-0114, Japan,

    3 Mechanical and Environmental Informatics,

    Graduate School of Information Science and

    Engineering, Tokyo Institute of Technology,

    Tokyo, 152-8552, Japan

    Received 21 February 2010

    Accepted 19 March 2010

    Abstract

    This paper describes an improvement in automatic speech recognition (ASR) for robot audition, introducing Missing Feature Theory (MFT) based on soft missing feature masks (MFMs) to realize natural human-robot interaction. In an everyday environment, a robot's microphones capture various sounds besides the user's utterances. Although sound-source separation is an effective way to enhance the user's utterances, it inevitably produces errors due to reflection and reverberation. MFT is able to cope with these errors: first, MFMs are generated based on the reliability of time-frequency components; then ASR weights the time-frequency components according to the MFMs. We propose a new method to automatically generate soft MFMs, consisting of continuous values from 0 to 1 based on a sigmoid function. The proposed MFM generation was implemented for HRP-2 using HARK, our open-sourced robot audition software. Preliminary results show that the soft MFM outperformed a hard (binary) MFM in recognizing three simultaneous utterances. In a human-robot interaction task, the minimum interval between two adjacent loudspeakers was reduced from 60 degrees to 30 degrees by using soft MFMs.

    Keywords

    Robot audition, HARK, missing feature theory, soft mask generation, simultaneous speech recognition, automatic speech recognition, sound source separation, sound localization

    1. Introduction

    Human-robot interaction (HRI) is one of the most essential topics in behavioral robotics. HRI is improved by the inclusion of a natural speech communication function using robot-embedded microphones, because we generally use speech in our daily communication. In an everyday environment, a user may barge in, i.e., interrupt a robot while it is speaking, or several users may speak at the same time, which is termed simultaneous speech. In addition, the robot itself generates sounds with its fans and actuators, so the robot must be able to deal with multiple sound sources simultaneously. A conventional approach in human-robot interaction is to use a microphone near the speaker's mouth to collect only the desired speech. Kismet of MIT has a pair of microphones with pinnae, but a human partner still used a microphone close to the speaker's mouth [4]. A group communication robot, Robita of Waseda University, assumes that each human participant uses a headset microphone [16]. Robot audition was therefore proposed in [18] to realize a hearing capability that allows a robot to listen to several things simultaneously using its embedded microphones.

    Robot audition has now been actively studied for more than ten years, as typified by organized sessions on robot audition at the IEEE/RSJ International Conferences on Intelligent Robots and Systems (IROS 2004-2009), and by a special session on robot audition at the IEEE International Conference on Acoustics, Speech and Signal Processing


    (ICASSP 2009) of the Signal Processing Society. Sound-source separation as a pre-processing stage for automatic speech recognition (ASR) is an actively-studied research topic in this field.

    Hara et al. reported a humanoid robot, HRP-2, which uses a microphone array to localize and separate a mixture of sounds, and which is capable of recognizing speech commands in a noisy environment [12]. HRP-2 can recognize one speaker's utterance in the presence of noise or interfering speakers. Nakadai et al. reported SIG, a humanoid robot which uses a pair of microphones to separate multiple speech signals through an active direction-pass filter, and recognizes each separated speech phrase using ASR [20]. They demonstrated that even when three speakers uttered words at the same time, SIG was able to recognize what each speaker said. However, since their system used 51 acoustic models trained under different conditions at the same time, it incurs a high computational cost, and performance deteriorates in an environment with unexpected and/or dynamically changing noises. Kim et al. developed another binaural sound-source localization and separation method by integrating sound-source localization obtained by CSP (Cross-power Spectrum Phase) with that obtained from visual information using an EM algorithm [14]. This system assumes that only one predominant sound exists in each time frame. Valin et al. developed sound-source localization and separation by Geometric Source Separation, and a multi-channel post-filter with 8 microphones to perform speaker tracking [32, 33].

    Sound-source separation is an ill-posed problem, however, because it is impossible to perfectly estimate the effect of reverberation and of environmental noises which change dynamically, using microphones embedded in a mobile robot. Thus, sound-source separation produces separation errors. To remove such errors, a non-linear speech enhancement method such as Minima Controlled Recursive Averaging


    (MCRA) [5] or Minimum Mean Square Error (MMSE) [10] estimation is often used. Such non-linear speech enhancement indeed removes the separation errors, but it also generates distortions such as musical noise, which degrades ASR performance.

    ASR systems, on the other hand, assume that the input speech is clean or contaminated by a known noise source, because their main target is telephony applications, which generally involve a high signal-to-noise ratio (SNR).

    There is, therefore, a mismatch between pre-processing sound-source separation and ASR systems. It follows that one of the most important issues in robot audition is the integration between pre-processing and ASR.

    1.1. Missing Feature Theory

    Missing Feature Theory (MFT) is a promising approach for integrating pre-processing and ASR. MFT is a technique known to improve the noise-robustness of speech recognition by masking out unreliable acoustic features using a so-called missing feature mask (MFM) [6, 15, 27]. The effectiveness of MFT has been widely reported in connected digit recognition for telephony applications [1, 8], speaker verification [9, 23], de-reverberation [24], and recognition of binaurally separated speech [35].

    Yamamoto, Nakadai et al. were the first research group to introduce Missing Feature Theory to integrate ASR with a binaural robot audition system [39]. First, the reliability of each time-frequency (TF) component was estimated by comparing separated speech with the corresponding clean speech. Then, a hard MFM consisting of 0 or 1 for each TF component was generated from the reliability using a manually-defined threshold. Since this mask generation algorithm used reference speech signals to estimate the reliability, the generated MFM is called an a priori MFM. Although they used an a priori hard MFM, they showed a remarkable improvement in the speech recognition of separated sounds, which demonstrated the effectiveness of MFT approaches.

    Automatic MFM generation then arises as an issue; in fact, this is the primary issue in MFT approaches, and it remains an open question despite numerous MFT studies. Although most work on automatic MFM generation focuses on single-channel or binaural input, Yamamoto, Valin et al. developed an automatic MFM generation process based on microphone-array processing [37]. First, they showed that unreliable features produced by pre-processing are mainly caused by energy leakage from other sound sources. A microphone-array-based technique was developed to estimate the reliability of each time-frequency component from this energy leakage, by considering the properties of a multi-channel post-filter process and environmental noises. Their automatic MFM generation was able to correctly estimate around 70% of unreliable TF components compared to the a priori MFM. The ASR performance thus improved drastically, and simultaneous speech recognition of three voices was attained. However, they still used a hard, binary MFM consisting of values equal to 0 or 1, while the reliability of each TF component is estimated as a continuous value in the range 0 to 1. This means that some useful information contained in the estimated reliability may be lost if a hard MFM is used.

    1.2. Soft Missing Feature Mask

    A soft MFM with continuous values from 0 to 1 was reported to be a better masking approach than a hard MFM [27], both because soft masking can directly deal with the reliability of an input signal, and because probabilistic methods can be applied at the same time. Bayesian mask estimation algorithms were proposed in [28, 29], while Barker et al. [2] used a sigmoid function to estimate a soft MFM. We therefore believe that a soft MFM also improves the performance of robot audition in the recognition of pre-processed (separated) speech. A hard MFM approach may work when only a small number of time-frequency components overlap between the target speech and a noise, but when the noise is also speech, as in barge-in or simultaneous speech, many time-frequency components overlap. Since a soft masking approach directly uses the reliability, it can deal with overlapped time-frequency components properly.

    Figure 1. Geometric source separation with multi-channel post-filter. (Block diagram: M-source simultaneous speech captured by an N-channel microphone array is processed by Geometric Source Separation, a multi-channel post-filter, background- and leak-noise estimation, sigmoid-based automatic MFM generation, and acoustic feature extraction; the acoustic features and MFMs are passed to MFT-based ASR.)

    In this paper, we present an automatic soft-MFM generation method based on a sigmoid function, which is implemented as a module of our open-sourced robot audition software, HARK [21]. To show the validity of the proposed soft-MFM method, we demonstrate its effectiveness through tasks including simultaneous speech recognition and human-robot interaction, with a humanoid HRP-2 robot taking a meal order.

    The rest of this paper is organized as follows: Section 2 describes the design of the soft-MFM generation algorithm for robot audition. Section 3 describes the components of our robot audition system. Section 4 describes the implementation of the system with the proposed soft-MFM generation module using HARK, our robot audition software. Section 5 evaluates the proposed soft-MFM generation method through recognition of three simultaneous speeches and through a human-robot interaction scenario in which HRP-2 takes a meal order. The last section concludes the paper.

    2. The Design of Soft Missing Feature Mask

    This section describes the design of our soft MFM, which is based on reliability estimation for time-frequency components. First, the reliability of a time-frequency component is defined; then separated speeches are analyzed based on the measured reliability in order to model soft-MFM generation. Parameter optimization for the modeled soft-MFM generation is also shown.

    2.1. Definition of reliability

    Figure 1 shows the core steps of pre-processing in HARK, i.e., Geometric Source Separation (GSS) [25] and multi-channel post-filtering. GSS is a hybrid sound-source separation method between beamforming and blind source separation. An N-channel input signal consisting of M sound sources s_m is separated into individual source estimates y_m. We use an 8-channel microphone array (N = 8), and the number of sound sources M is determined by a sound localization module (see Sec. 3.1). As mentioned in the previous section, however, sound-source separation is an ill-posed problem, and thus y_m still includes non-stationary cross-talk (leakage) and stationary background noise. Multi-channel post-filtering suppresses both types of noise and produces a noise-suppressed signal \hat{s}_m.

    The reliability of \hat{s}_m for each time-frequency component (frame and frequency indices are omitted for simplicity) is defined by

    R = \frac{\hat{s}_m + bn}{y_m},   (1)

    where bn is the background noise, estimated separately using MCRA [5]. Note that R reflects the leakage level, because leakage is the dominant factor in making a time-frequency component unreliable, as mentioned in the previous section.
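    As a worked illustration of Eq. (1), the following minimal NumPy sketch computes R for every time-frequency component of one separated source; the array names and the small epsilon that guards against division by zero are our own assumptions, not part of HARK.

    ```python
    import numpy as np

    def reliability(s_hat, y, bn, eps=1e-10):
        """Eq. (1): R = (s_hat + bn) / y, evaluated per time-frequency component.

        s_hat : post-filter output (noise-suppressed spectrum) of source m
        y     : GSS output spectrum of source m (still contains leakage)
        bn    : stationary background-noise estimate (e.g. from MCRA)
        All arguments are non-negative arrays of shape (frames, frequency_bins).
        """
        return (s_hat + bn) / np.maximum(y, eps)
    ```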

    2.2. Analysis of separated speech based on reliability

    We analyzed the characteristics of R and found that there are two peaks in the histogram of R for separated speech when three speeches were uttered simultaneously. One peak corresponds to the leakage components, and the other to the target-speech components. We checked several loudspeaker intervals, from 10, 20, ..., up to 80 and 90 degrees, and found the same tendency for every interval.

    2.3. Modeling a soft mask

    In hard masking, a hard MFM is generated by thresholding as follows:

    HM_m = \begin{cases} 1, & R > T_{MFM} \\ 0, & \text{otherwise,} \end{cases}   (2)

    where T_{MFM} is a threshold. Dynamic acoustic features, called \Delta features, are commonly used together with static acoustic features to improve ASR performance. \Delta features are calculated by linear regression over five consecutive frames. Let the static acoustic features be m(k); the \Delta features are then defined by

    \Delta m(k) = \frac{1}{\sum_{i=-2}^{2} i^2} \sum_{i=-2}^{2} i\, m(k+i),   (3)

    where k is the frame index. Hard masks for the \Delta features are defined in the same way:

    \Delta HM_m(k) = \prod_{i=k-2,\, i \neq k}^{k+2} HM_m(i),   (4)

    where i runs over the frames neighbouring frame k.
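    The hard mask of Eq. (2), the \Delta-feature regression of Eq. (3) and the \Delta mask of Eq. (4) can be written compactly as below; this is a sketch only, and the handling of the first and last two frames is our own assumption.

    ```python
    import numpy as np

    def hard_mask(R, t_mfm):
        """Eq. (2): 1 where the reliability exceeds the threshold T_MFM, else 0."""
        return (R > t_mfm).astype(float)

    def delta(m):
        """Eq. (3): linear-regression deltas over five consecutive frames.
        m has shape (frames, dims); frame borders are padded by repetition."""
        pad = np.pad(m, ((2, 2), (0, 0)), mode="edge")
        num = sum(i * pad[2 + i: 2 + i + len(m)] for i in range(-2, 3))
        return num / 10.0                  # denominator: sum of i^2 for i = -2..2

    def delta_hard_mask(hm):
        """Eq. (4): the Delta component at frame k is reliable only if the static
        components at frames k-2, k-1, k+1, k+2 are all reliable."""
        pad = np.pad(hm, ((2, 2), (0, 0)), constant_values=1.0)
        out = np.ones_like(hm)
        for i in (-2, -1, 1, 2):
            out *= pad[2 + i: 2 + i + len(hm)]
        return out
    ```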

    Such a linear discrimination with T_{MFM}, however, leads to misclassified time-frequency components; we therefore decided to introduce soft masking. We assume that the two groups (leakage and target speech) follow Gaussian distributions. The distribution function of a Gaussian with mean \mu and standard deviation \sigma is

    d(x) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right),   (5)

    \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt.   (6)

    Let the distribution functions for leakage and target speech be d_n(R) and d_s(R), with parameters (\mu_n, \sigma_n) and (\mu_s, \sigma_s), respectively. A normalized speech reliability can then be defined by

    B(R) = \frac{d_s(R)}{d_s(R) + 1 - d_n(R)}   (7)

    = \frac{1 + \operatorname{erf}\!\left(\frac{R-\mu_s}{\sigma_s\sqrt{2}}\right)}{2 + \operatorname{erf}\!\left(\frac{R-\mu_s}{\sigma_s\sqrt{2}}\right) - \operatorname{erf}\!\left(\frac{R-\mu_n}{\sigma_n\sqrt{2}}\right)}.   (8)

    This is a sigmoid-like function defined using error functions erf(·). Since evaluating B(R) is computationally expensive, we decided to use a typical sigmoid function Q(R) rather than this complicated function directly. We then defined a soft MFM based on Q(R) as follows [30]:

    SM_m = w_1 Q(R \mid a, b),   (9)

    Q(x \mid a, b) = \begin{cases} \dfrac{1}{1 + \exp(-a(x-b))}, & x > b \\ 0, & \text{otherwise,} \end{cases}   (10)

    where w_1 is a weight factor for the static features (w_1 \geq 0). Q(\cdot \mid a, b) is a modified sigmoid function with two tunable parameters: a controls the slope of the sigmoid, while b is an offset along the x-axis. We also defined soft masks for the \Delta features as

    \Delta SM_m(k) = w_2 \prod_{i=k-2,\, i \neq k}^{k+2} Q(R(i) \mid a, b),   (11)

    where w_2 is a weight factor for the dynamic (\Delta) features (w_2 \geq 0).
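    A minimal sketch of the resulting soft-mask generation (Eqs. (9)-(11)) is given below; the frame-edge padding is our own assumption, and the reliability array R would come from Eq. (1).

    ```python
    import numpy as np

    def q_sigmoid(x, a, b):
        """Eq. (10): modified sigmoid, zero at or below the offset b."""
        return np.where(x > b, 1.0 / (1.0 + np.exp(-a * (x - b))), 0.0)

    def soft_mask_static(R, a, b, w1):
        """Eq. (9): static soft MFM."""
        return w1 * q_sigmoid(R, a, b)

    def soft_mask_delta(R, a, b, w2):
        """Eq. (11): Delta soft MFM, the product of Q over frames k-2..k+2
        excluding k; R has shape (frames, frequency_bins)."""
        q = q_sigmoid(R, a, b)
        pad = np.pad(q, ((2, 2), (0, 0)), constant_values=1.0)
        out = np.ones_like(q)
        for i in (-2, -1, 1, 2):
            out *= pad[2 + i: 2 + i + len(q)]
        return w2 * out

    # With the optimized parameter set p_opt = (a, b, w1, w2) = (40, 0.5, 0.1, 0.2):
    # sm_static = soft_mask_static(R, 40, 0.5, 0.1)
    # sm_delta  = soft_mask_delta(R, 40, 0.5, 0.2)
    ```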

    2.4. Parameter optimization for soft masking

    Figure 2 shows the relationship between soft and hard MFMs. When a is infinite and the weights are set to 1.0, the soft MFM of Eq. (10) behaves as a hard MFM; in this case, b acts as the threshold T_{MFM}. Parameters a and b could in principle be derived from Eqs. (10) and (7), but it is difficult to obtain analytical solutions for them. In addition, for w_1 and w_2 we have no theoretical basis for parameter estimation. We thus measured the recognition performance for three simultaneous speech signals in order to optimize these parameters for a robot with eight omni-directional microphones, as shown in Figure 8. Simultaneous speech signals were recorded in a room with RT_{20} = 0.35 s. Three different words were played simultaneously at the same volume from three loudspeakers located 2 m away from the robot.


    Figure 2. Sigmoid function (Eq. (10)) for soft mask generation with b = 0.15 and a = 20, 80, 140 (horizontal axis: reliability R from 0 to 1; vertical axis: soft mask value from 0 to 1).

    Table 1. Search space for the soft MFM parameters.

        parameter   range       step
        a           40 ...
        b           0.1 - 1.5   0.1
        w1          0.1 - 1.5   0.1
        w2          0.1 - 1.5   0.1

    Each word was selected from the ATR phonetically-balanced word set consisting of 216 Japanese words. The direction of one loudspeaker was fixed in front of the robot, and the other two were located at ±10, ±20, ..., ±80, ±90 degrees relative to the robot. For each configuration, 200 combinations of three different words were played.

    Table 1 shows the search space for the soft-MFM parameter set p = (a, b, w_1, w_2). Figure 3 shows an example of w_1-w_2 parameter optimization for the center speaker when the loudspeakers were located at 0, 90, and -90 degrees. For the other conditions, we obtained a similar tendency in the w_1-w_2 parameter optimization. We also performed parameter optimization for a and b, and found that a similar result is obtained for every layout. We therefore obtained the optimized parameter set p_opt defined by

    p_{opt} = \arg\max_{p} \frac{1}{9} \sum_{\theta = 10^\circ, 20^\circ, \dots, 90^\circ} \frac{1}{3}\bigl(W_C(a, b, w_1, w_2) + W_R(a, b, w_1, w_2) + W_L(a, b, w_1, w_2)\bigr),   (12)

    where W_C, W_R, and W_L are the numbers of correctly recognized words for the center, right, and left loudspeakers, whose locations are (0, \theta, -\theta) degrees, respectively. Finally, we obtained the optimal parameter set for the soft MFM as p_opt = (40, 0.5, 0.1, 0.2).
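    The optimization of Eq. (12) is a plain grid search over the ranges of Table 1. The sketch below shows its structure; `word_correct` is a hypothetical scoring function that runs the recognizer for one loudspeaker and one layout, and the search range for a is left as a single value because it is not fully recoverable here.

    ```python
    import itertools
    import numpy as np

    def optimize_mask_parameters(word_correct):
        """Grid search for p_opt as in Eq. (12).

        word_correct(theta, speaker, a, b, w1, w2) is assumed to return the number
        of correctly recognized words for one loudspeaker ('C', 'R' or 'L') when
        the side loudspeakers are placed at +/- theta degrees.
        """
        a_values = (40,)                          # placeholder range for a
        grid = np.arange(0.1, 1.51, 0.1)          # b, w1, w2: 0.1 .. 1.5, step 0.1
        best_score, best_p = -np.inf, None
        for a, b, w1, w2 in itertools.product(a_values, grid, grid, grid):
            # average over the nine layouts and the three loudspeakers
            score = np.mean([word_correct(theta, spk, a, b, w1, w2)
                             for theta in range(10, 100, 10)
                             for spk in ("C", "R", "L")])
            if score > best_score:
                best_score, best_p = score, (a, b, w1, w2)
        return best_p
    ```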

    Figure 3. ASR performance (word correct rate, %) for the center loudspeaker when the three loudspeakers were located at (0, 90, -90) degrees, shown as contours over the static weight w_1 and the delta weight w_2.

    3. A Robot Audition System

    Our robot audition system consists of five major components, shown in Figure 1. Our proposed soft-MFM generation was described in the previous section. This section explains the other four components:

    A: Geometric Source Separation,
    B: Multi-channel post-filter,
    C: Acoustic feature extraction,
    D: Missing-Feature-Theory-based Automatic Speech Recognition (MFT-ASR).

    Besides these five components, our robot audition system uses several techniques such as sound-source localization and tracking, which are described in [38].

    3.1. Geometric source separation

    GSS is a hybrid algorithm of Blind Source Separation (BSS) and beamforming [25]. BSS has a number of limitations, such as the permutation and scaling problems, which can be relaxed in GSS by introducing geometric constraints obtained from the locations of microphones and sound sources. Unlike the Linearly Constrained Minimum Variance (LCMV) beamformer, which minimizes the output power subject to a distortion-less constraint, GSS explicitly minimizes cross-talk, leading to faster adaptation. The method is also interesting for use in the mobile robotics context because it allows easy addition and removal of sources. Using some approximations, it is also possible to implement the separation with relatively low complexity.

    Our GSS was modified to provide faster adaptation using a stochastic gradient and shorter time-frame estimation. The locations of sound sources are estimated with Multiple Signal Classification (MUSIC). This is a frequency-domain adaptive beamforming method which


    produces a sharp local peak corresponding to a sound-source direction, which improves its noise robustness in the real world.

    To formulate GSS, suppose that there are M sources and N (\geq M) microphones. The spectrum vector of the M sources at frequency \omega is denoted s(\omega) = [s_1(\omega), s_2(\omega), \dots, s_M(\omega)]^T, and the spectrum vector of the signals captured by the N microphones at frequency \omega is denoted x(\omega) = [x_1(\omega), x_2(\omega), \dots, x_N(\omega)]^T, where T is the transpose operator. x(\omega) is then given by

    x(\omega) = H(\omega)\, s(\omega),   (13)

    where H(\omega) is a transfer function matrix; each component H_{nm} of this matrix represents the transfer function from the m-th source to the n-th microphone. Source separation is generally formulated as

    y(\omega) = W(\omega)\, x(\omega),   (14)

    where W(\omega) is called the separation matrix. Separation is defined as finding a W(\omega) such that the output signal y(\omega) equals s(\omega). In order to estimate W(\omega), GSS introduces two cost functions, separation sharpness (J_{SS}) and geometric constraint (J_{GC}), defined by

    J_{SS}(W) = \left\| E\left[ yy^H - \mathrm{diag}[yy^H] \right] \right\|^2,   (15)

    J_{GC}(W) = \left\| \mathrm{diag}[WD - I] \right\|^2,   (16)

    where \|\cdot\|^2 is the Frobenius norm, diag[\cdot] is the diagonal operator, E[\cdot] is the expectation operator, and H is the conjugate transpose operator. D is a transfer function matrix based on a direct sound path between each sound source and each microphone. The total cost function J(W) is

    J(W) = \alpha_S J_{SS}(W) + J_{GC}(W),   (17)

    where \alpha_S is the weighting parameter which controls the balance between the separation cost and the geometric-constraint cost. This parameter is usually set to \|x^H x\|^{-2}, following [34]. In the online version of GSS, W is updated by minimizing J(W):

    W^{t+1} = W^{t} - \mu \nabla J(W^{t}),   (18)

    where W^{t} denotes W at the current time step t, \nabla J(W) is the update direction of W, and \mu is a step-size parameter.
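    To make the GSS cost concrete, the sketch below evaluates Eqs. (15)-(17) for one frequency bin with NumPy, approximating the expectation by an average over a short block of frames; the variable names and the block averaging are our own assumptions, and the online update of Eq. (18) would then step W against the gradient of this cost.

    ```python
    import numpy as np

    def gss_cost(W, D, x_frames, alpha_s):
        """Total GSS cost J(W) of Eq. (17) for a single frequency bin.

        W        : (M, N) separation matrix
        D        : (N, M) direct-path transfer function matrix
        x_frames : (T, N) microphone spectra used to approximate E[.]
        alpha_s  : weight between separation sharpness and geometric constraint
        """
        y = x_frames @ W.T                                    # y = W x for each frame
        E = y.T @ y.conj() / len(y)                           # E[y y^H], shape (M, M)
        j_ss = np.linalg.norm(E - np.diag(np.diag(E))) ** 2   # Eq. (15): cross-talk energy
        j_gc = np.linalg.norm(np.diag(W @ D - np.eye(W.shape[0]))) ** 2  # Eq. (16)
        return alpha_s * j_ss + j_gc                          # Eq. (17)
    ```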

    3.2. Multi-channel post-filter

    A multi-channel post-filter is used to enhance the output of the GSS algorithm [36]. It is a spectral filter using the optimal noise estimator described in [10]. This method is a kind of spectral subtraction [3], but it generates less musical noise and distortion because it takes temporal and spectral continuities into account. We extended the original noise estimator to estimate both stationary and non-stationary noise by using multi-channel information, whereas most post-filters only address the reduction of a specific type of noise, i.e., stationary background noise [17].

    The output y of GSS forms the input of the multi-channel post-filter. The output \hat{s} of the multi-channel post-filter is defined as

    \hat{s} = G\, y,   (19)

    where G is a spectral gain. The estimation of G is based on a minimum mean-square error estimate of the spectral amplitude; to estimate G, the noise variance must be estimated.

    The noise variance estimate \lambda_m is expressed as

    \lambda_m = \lambda_m^{stat.} + \lambda_m^{leak},   (20)

    where \lambda_m^{stat.} is the estimate of the stationary component of the noise for source m at frame t and frequency f, and \lambda_m^{leak} is the estimate of source leakage.

    We compute the stationary noise estimate \lambda_m^{stat.} using the MCRA technique [5]. To estimate \lambda_m^{leak}, we assume that the interference from the other sources is reduced by a factor \eta (typically -10 dB to -5 dB) by GSS. The leakage estimate is thus expressed as

    \lambda_m^{leak} = \eta \sum_{i=0,\, i \neq m}^{M-1} Z_i,   (21)

    where Z_i is the smoothed spectrum of the i-th source Y_i, recursively defined (with \alpha = 0.7) as [40]:

    Z_m(f, t) = \alpha Z_m(f, t-1) + (1 - \alpha) Y_m(f, t).   (22)
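    The noise-variance bookkeeping of Eqs. (20)-(22) can be sketched as a small stateful estimator; the class layout, the default attenuation of -10 dB, and the use of magnitude spectra are our own assumptions.

    ```python
    import numpy as np

    class LeakNoiseEstimator:
        """Leakage plus stationary noise variance (Eqs. (20)-(22)), one frame at a time."""

        def __init__(self, n_sources, n_bins, eta=10 ** (-10 / 10), alpha=0.7):
            self.eta = eta        # assumed leakage attenuation by GSS (-10 dB here)
            self.alpha = alpha    # smoothing constant of Eq. (22)
            self.Z = np.zeros((n_sources, n_bins))

        def update(self, Y, lambda_stat):
            """Y: (n_sources, n_bins) current GSS output spectra;
            lambda_stat: stationary (MCRA) noise estimate of the same shape.
            Returns lambda = lambda_stat + lambda_leak for every source."""
            self.Z = self.alpha * self.Z + (1.0 - self.alpha) * Y     # Eq. (22)
            total = self.Z.sum(axis=0, keepdims=True)
            lambda_leak = self.eta * (total - self.Z)                 # Eq. (21): sum over i != m
            return lambda_stat + lambda_leak                          # Eq. (20)
    ```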

    3.3. Acoustic feature extraction

    To estimate the reliability of acoustic features, we have to exploit the fact that noises and distortions are usually concentrated in certain areas of the spectro-temporal space. Most conventional ASR systems use Mel-Frequency Cepstral Coefficients (MFCC) [26] as acoustic features, but noises and distortions spread over all coefficients in MFCC. In general, cepstrum-based acoustic features like MFCC are therefore not suitable for MFT-ASR, so we use the Mel-Scale Log Spectrum (MSLS) as the acoustic feature. MSLS is obtained by applying an inverse discrete cosine transform to the MFCCs. Three normalization processes are then applied in order to obtain noise-robust acoustic features: mean-power normalization, spectrum-peak emphasis, and spectrum-mean normalization. The details are described in [22]. These three normalization processes correspond to three normalizations performed on MFCC: C0 normalization, liftering, and cepstrum mean normalization. The acoustic feature vector comprises 13 MSLS features, their derivatives, and log power, i.e., a 27-dimensional MSLS-based acoustic vector was used.
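    As described above, MSLS is the inverse DCT of the MFCC vector; a hedged sketch is shown below. The SciPy DCT type and normalization, and the exact composition of the 27-dimensional vector, are our own assumptions, and the three normalization steps of [22] are omitted.

    ```python
    import numpy as np
    from scipy.fftpack import idct

    def msls_from_mfcc(mfcc):
        """Mel-Scale Log Spectrum obtained by an inverse DCT of 13-dimensional
        MFCC vectors; mfcc has shape (frames, 13)."""
        return idct(mfcc, type=2, axis=1, norm="ortho")

    def acoustic_feature_vector(msls, delta_msls, delta_log_power):
        """27-dimensional vector: 13 static MSLS features, their 13 deltas,
        and one log-power term."""
        return np.concatenate([msls, delta_msls, delta_log_power[:, None]], axis=1)
    ```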

    3.4. Missing Feature Theory based ASR

    Several robot audition systems with pre-processing and ASR have been reported so far [11, 19]. Such systems simply combine pre-processing with ASR and focus on improving SNR and real-time processing. Two critical issues remain: what kinds of pre-processing are required for ASR, and how ASR can use the characteristics of the pre-processing, beyond using an acoustic model with multi-condition training. We exploited an interfacing scheme between pre-processing and ASR based on MFT.

    MFT uses MFMs defined over a time-frequency map to improve ASR. Each MFM specifies whether a spectral value for a frequency bin in a specific time frame is reliable or not. Unreliable acoustic features caused by errors in pre-processing are masked using the MFMs, and only reliable ones are used for the likelihood calculation in the ASR decoder. The decoder is an HMM-based recognizer, as commonly used in conventional ASR systems; only the estimation of the output probability in the decoder is modified in MFT-ASR.

    Let M(i) be the element of the MFM vector that represents the reliability of the i-th acoustic feature. The output probability b_j(x) is given by

    b_j(x) = \sum_{l=1}^{L} P(l \mid S_j) \exp\left\{ \sum_{i=1}^{N} M(i) \log f(x(i) \mid l, S_j) \right\},   (23)

    where P(\cdot) is a probability operator, x(i) is the i-th element of the acoustic feature vector, N is the size of the acoustic feature vector, S_j is the j-th state, and l indexes the L mixture components with densities f(\cdot \mid l, S_j).

    For implementation, we used Multiband Julian [13], which is based on the Japanese real-time large-vocabulary speech-recognition engine Julian [31]. It supports various HMM types such as shared-state triphones and tied-mixture models. A network grammar is supported as the language model. It can work as a stand-alone or client-server application; to run it as a server, we modified the system so that acoustic features and MFMs can be communicated via a network.

    4. System Implementation

    This section introduces our open-sourced robot audition software, HARK; we then describe the implementation of the proposed soft-MFM generation as a new HARK module, and a robot audition system built with the new module.

    4.1. Open-Sourced Robot Audition Software HARK

    We consider a software environment for robot audition research. Most studies focus on their own robot platforms, and their systems are unavailable to other researchers and research groups. It is convenient and useful for robot audition researchers to share a common platform, because researchers then do not need to build their own robot platforms from scratch, and they can easily exchange a module to compare it with another. We therefore implemented each technique described in the previous sections as a component of the modular architecture called FlowDesigner (http://flowdesigner.sourceforge.net/) [7]. FlowDesigner provides a flexible and efficient software development environment, achieved through flexible replacement of modules and fast data communication between modules; these advantages come from the pull architecture of FlowDesigner.

    We then released the set of components as HARK (Honda Research Institute Japan Audition for Robots with Kyoto University; the word also means "listen" in old English), which is available at http://winnie.kuis.kyoto-u.ac.jp/HARK/ [21]. HARK provides a user-customizable total robot audition system including multi-channel sound acquisition, sound localization, sound separation, and ASR. HARK also

    provides opportunities to discuss general applicability, platform dependency, customization and tuning. Difficulties such as customization to another platform or tuning to another acoustic environment have not yet been discussed widely. In addition, HARK has the potential to stimulate the robot audition research area and to provide an effective tool for interdisciplinary research such as natural language processing, navigation, and HRI. According to user feedback, the performance and stability of HARK will continue to improve.

    As related work, Valin released sound-source localization and separation software for robots called "ManyEars" as General Public License (GPL) open-source software (OSS). This was the first software that could provide generally-applicable and customizable robot audition systems. The only missing function is ASR: "ManyEars" is limited to sound-source localization and separation. ASR has not yet been included, as it has many parameters which affect the performance of a total robot audition system.

    4.2. Implementation of soft MFM generation for HARK

    Figure 4 shows an example of a robot audition system constructed using HARK. A green rectangle represents a module, while a connection between modules is indicated by a black arrow. The top-left module (AudioStreamFromMic) captures sounds using a robot-embedded microphone array. After frequency analysis (MultiFFT), sound sources are localized using LocalizeMUSIC, SourceTracker, and SourceIntervalExtender. The localized sound sources are separated with GSS, and PostFilter enhances the separated sounds. MSLS features are calculated using MSLSExtraction, Delta, and FeatureRemover in the mel-frequency domain (MelFilterBank). The MFM is estimated in SMFMGeneration. Finally, the MSLS features and the corresponding MFMs are sent to MFT-ASR via a socket interface using SpeechRecognitionClient. These modules are prepared in advance, so users can easily construct their own robot audition systems by selecting modules and connecting them using a GUI.

    The newly-developed module is shown as SMFMGeneration at the center of Figure 4. It has four terminals, shown as black dots on the left and right edges of the box. The three terminals on the left edge correspond to the input signals \hat{s}_m, y_m, and bn in Equation (1). The terminal on the right edge is the output, i.e., a soft-MFM vector for one frame. This module also has three parameters, FBANK, THRESHOLD, and TILT, shown in the property setting window in Figure 5, which appears by double-clicking the green box of SMFMGeneration. FBANK is the number of dimensions of the static part of an MFM vector defined in Eq. (9), which is 13 in our setting. THRESHOLD and TILT correspond to a and b defined in Eq. (10), respectively.

    5. Evaluation

    To evaluate the proposed robot audition system with soft-MFM generation, simultaneous speech recognition was performed as an isolated word recognition task. The system was also used in a human-robot interaction scenario, with the robot taking a meal order.

    5.1. Experimental setup

    We used a humanoid robot, HRP-2, with eight microphones around the top of its head for an experiment on simultaneous speech recognition. It was placed at the center of a circle, as shown in Figure 8.


    Figure 4. A robot audition system constructed by HARK modules.

    Figure 5. A parameter tab of a soft mask generation module.

    Three loudspeakers were used to play three speeches simultaneously. One loudspeaker was fixed in front of the robot, and the two other loudspeakers were located at ±30, ±60, ±90, ±120, or ±150 degrees, as shown in Table 2. The distance between the robot and each loudspeaker was 1 m. The four combinations of sound sources shown in Table 3 were used. We thus generated 20 test data sets (5 layouts x 4 combinations). Each test dataset consists of 200 combinations of three different words randomly selected from the ATR phonetically-balanced word set of 216 Japanese words.

    For the acoustic model in ASR, we trained a 3-state, 16-mixture triphone model based on a Hidden Markov Model (HMM) using the 27-dimensional MSLS features. To make the evaluation fair, we performed an open test, that is, the acoustic model was trained on a speech corpus different from the test data. For the training data, we used a Japanese News Article Speech Database containing 47,308 utterances by 300 speakers. After adding 20 dB of white noise to the speech data, the acoustic model was trained with the white-noise-added training data, which is a well-known technique for improving the noise-robustness of an acoustic model for ASR.

    Figure 8. The humanoid robot HRP-2 with an 8-channel microphone array: a) the humanoid robot HRP-2; b) layout of the microphones.


    Table 2. Loudspeaker locations.

        Layout   center (deg.)   left (deg.)   right (deg.)
        LL1      0               30            -30
        LL2      0               60            -60
        LL3      0               90            -90
        LL4      0               120           -120
        LL5      0               150           -150

    Table 3. Sound source combinations.

        Combination   center     left       right
        SC1           female 1   female 2   female 3
        SC2           female 4   female 5   female 6
        SC3           male 1     male 2     male 3
        SC4           male 4     male 5     male 6

    5.2. Recognition of three simultaneous speeches

    For comparison, we evaluated three kinds of MFMs:

    1. hard MFM: the conventional hard MFM defined by Eqs. (2) and (4);

    2. soft MFM (unweighted): the soft MFM defined by Eqs. (9) and (11) with w_1 = w_2 = 1;

    3. soft MFM (proposed): the proposed soft MFM defined by Eqs. (9) and (11) using the optimized parameter set p_opt.

    Word correct rates were measured with these MFMs for every test dataset described above. Figures 9-11 show the averaged word correct rates for the center, left and right speakers, respectively.

    For the center speaker, our proposed soft MFM drastically improved ASR performance. For the left and right speakers the improvement was smaller, but we still find improvements to some extent, especially when the angle between loudspeakers is narrow. This difference is caused by the layout of the three speakers: the sound from the center speaker is affected by both the left and right speakers, while only the center speaker has a large effect on each of the side speakers. Thus, the number of overlapping TF components for the center speaker is larger than for either the left or the right speaker individually, and their degree of overlap is also higher. This shows that the proposed soft MFM is able to cope with a large number of overlapping TF components, even in highly-overlapped cases. Averaged over the three-speaker cases, the improvement of the proposed soft MFM reached around 10 points.

    When we focus on the difference between the unweighted soft MFM and the proposed soft MFM, we find a tendency similar to the difference between the soft and hard MFMs; that is, the optimization of the weighting factors is more effective when two speakers are closer together. This means that the weighting factors work effectively in dealing with highly overlapped TF components.

    Figure 9. Word correct rate (%) for the center speaker versus speaker interval (30-150 degrees), for the soft MFM with weights, the soft MFM without weights, and the hard MFM.

    Figure 10. Word correct rate (%) for the left speaker versus speaker interval (30-150 degrees), for the soft MFM with weights, the soft MFM without weights, and the hard MFM.

    Figure 11. Word correct rate (%) for the right speaker versus speaker interval (30-150 degrees), for the soft MFM with weights, the soft MFM without weights, and the hard MFM.

    5.3. Human-robot interaction scenario

    We applied the robot audition system including the proposed soft-MFM generation to a human-robot interaction scenario. Figure 12 shows snapshots of the scenario. In this scenario, the robot takes a meal order: three customers simultaneously ask the robot for what they want, and the robot localizes and separates their speeches and then recognizes the separated speech. This demonstration was performed in the same room mentioned in Section 2.4. With the hard MFM we were previously using, speakers had to keep intervals of more than 60 degrees from the neighboring speaker in this kind of real situation, even though benchmark tests showed that the system maintains performance with a speaker interval of 30 degrees; the gap is caused by dynamically-changing noises in a real-world environment. On the other hand, the robot with our proposed system maintained ASR performance even with a 30-degree interval in this scenario. This means that the proposed soft MFM approach was effective in a real-world environment where dynamically-changing noises exist to some extent.

    6. Conclusion and future work

    We have presented an improvement in automatic speech recognition for robot audition that helps realize natural human-robot interaction, a topic essential to behavioral robotics. Missing feature theory is adopted to integrate microphone-array-based pre-processing, i.e., sound-source localization and separation, with ASR. For the missing feature mask, we used a soft missing feature mask taking continuous values between 0 and 1, instead of the conventional hard missing feature mask taking a binary value of 0 or 1. The soft missing feature mask is generated automatically by estimating the reliability of each time-frequency component based on a sigmoid function. The automatic soft mask generation is incorporated as a set of modules into the HARK open-sourced robot audition system.

    The resulting HARK-based robot audition system with automatic soft mask generation improves the performance of automatic speech recognition for three simultaneous speeches, in particular for intervals between two adjacent speakers as narrow as 30 degrees, whereas the conventional system only worked for speaker separations greater than 60 degrees. The soft mask system therefore provides opportunities to deploy a robot audition system in more realistic multi-party interaction. As a proof of concept, a humanoid HRP-2 demonstrated taking three meal orders at the same time.

    Future work includes detailed analysis and more applications, for example, extensive benchmarking to analyse the performance of automatic speech recognition over wide variations of speaker configuration and acoustic environment, and application of the HARK-based system to actual multi-party interactions. Typical scenarios will include barge-in utterances, utterances of moving talkers, and recognition while the robot is in motion. These applications are expected to provide more experience in coping with mixtures of sounds, and to guide new research towards the symbiosis of humans and robots through verbal communication and auditory scene analysis.

    Figure 12. Snapshots of the meal-order-taking task: 1. HRP-2 waits for a user. 2. A user asks HRP-2 to take an order ("Could you take an order?"). 3. HRP-2 starts to take orders ("May I help you?"). 4. Three users simultaneously utter their orders to HRP-2 ("Coffee", "Coke", "Orange juice"). 5. HRP-2 rephrases the order to the right user ("You will drink coke."); as each user is localized, HRP-2 turns its body toward each user when rephrasing. 6. HRP-2 rephrases the order to the center user ("You will drink orange juice."). 7. HRP-2 rephrases the order to the left user ("You will drink coffee."). 8. HRP-2 thanks the users ("Thank you for using our service.").

    Acknowledgments

    Our research is partially supported by a Grant-in-Aid for Scientific Research and the Global COE Program.


    References

    [1] J. Barker, M. Cooke, and P. Green. Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise. In Proceedings of Eurospeech 2001, pages 213-216. ESCA, 2001.

    [2] J. Barker, L. Josifovski, M. Cooke, and P. Green. Soft decisions in missing data techniques for robust automatic speech recognition. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000), volume I, pages 373-376, 2000.

    [3] S. F. Boll. A spectral subtraction algorithm for suppression of acoustic noise in speech. In Proceedings of the 1979 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-79), pages 200-203. IEEE, 1979.

    [4] C. Breazeal. Emotive qualities in robot speech. In Proceedings of the 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2001), pages 1389-1394, 2001.

    [5] I. Cohen and B. Berdugo. Speech enhancement for non-stationary noise environments. Signal Processing, 81(2):2403-2418, 2001.

    [6] M. Cooke, P. Green, L. Josifovski, and A. Vizinho. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication, 34(3):267-285, May 2000.

    [7] C. Côté, D. Létourneau, F. Michaud, J. M. Valin, Y. Brosseau, C. Raïevsky, M. Lemay, and V. Tran. Reusability tools for programming mobile robots. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), pages 1820-1825. IEEE, 2004.

    [8] J. de Veth, F. de Wet, B. Cranen, and L. Boves. Missing feature theory in ASR: Make sure you miss the right type of features. In Proceedings of the Workshop on Robust Methods for ASR in Adverse Conditions, Tampere, pages 231-234, 1999.

    [9] A. Drygajlo and M. El-Maliki. Speaker verification in noisy environments with combined spectral subtraction and missing feature theory. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1998), pages 121-124, 1998.

    [10] Y. Ephraim and D. Malah. Speech enhancement using minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-32(6):1109-1121, 1984.

    [11] I. Hara, F. Asano, H. Asoh, J. Ogata, N. Ichimura, Y. Kawai, F. Kanehiro, H. Hirukawa, and K. Yamamoto. Robust speech interface based on audio and video information fusion for humanoid HRP-2. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), pages 2404-2410. IEEE, 2004.

    [12] I. Hara, F. Asano, Y. Kawai, F. Kanehiro, and K. Yamamoto. Robust speech interface based on audio and video information fusion for humanoid HRP-2. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), pages 2404-2410, 2004.

    [13] Multiband Julius. http://www.furui.cs.titech.ac.jp/mbandjulius/.

    [14] H. D. Kim, K. Komatani, T. Ogata, and H. G. Okuno. Human tracking system integrating sound and face localization using an EM algorithm in real environments. Advanced Robotics, 23(6):629-653, 2007.

    [15] R. P. Lippmann and B. A. Carlson. Robust speech recognition with time-varying filtering, interruptions, and noise. In Proceedings of the 1997 ISCA 5th European Conference on Speech Communication and Technology (EuroSpeech 1997), pages 365-372, 1997.

    [16] Y. Matsusaka, T. Tojo, S. Kuota, K. Furukawa, D. Tamiya, K. Hayata, Y. Nakano, and T. Kobayashi. Multi-person conversation via multi-modal interface: a robot who communicates with multi-user. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech 1999), pages 1723-1726, 1999.

    [17] I. A. McCowan and H. Bourlard. Microphone array post-filter for diffuse noise field. In ICASSP 2002, volume 1, pages 905-908, 2002.

    [18] K. Nakadai, T. Lourens, H. G. Okuno, and H. Kitano. Active audition for humanoid. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-2000), pages 832-839. AAAI, 2000.

    [19] K. Nakadai, D. Matsuura, H. G. Okuno, and H. Tsujino. Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots. Speech Communication, 44(1-4):97-112, October 2004.

    [20] K. Nakadai, D. Matsuura, H. G. Okuno, and H. Tsujino. Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots. Speech Communication, 44(1-4):97-112, 2004.

    [21] K. Nakadai, H. G. Okuno, H. Nakajima, Y. Hasegawa, and H. Tsujino. An open source software system for robot audition HARK and its evaluation. In Proceedings of the 2008 IEEE-RAS International Conference on Humanoid Robots (Humanoids 2008), pages 561-566, 2008.

    [22] Y. Nishimura, T. Shinozaki, K. Iwano, and S. Furui. Noise-robust speech recognition using multi-band spectral features. In Proceedings of the 148th Acoustical Society of America Meeting, number 1aSC7, 2004.

    [23] M. T. Padilla, T. F. Quatieri, and D. A. Reynolds. Missing feature theory with soft spectral subtraction for speaker verification. In Proceedings of the 8th International Conference on Spoken Language Processing (InterSpeech 2006), pages 913-916, 2006.

    [24] H. M. Park and R. M. Stern. Missing feature speech recognition using dereverberation and echo suppression in reverberant environments. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), volume IV, pages 381-384, 2007.

    [25] L. C. Parra and C. V. Alvino. Geometric source separation: Merging convolutive source separation with geometric beamforming. IEEE Transactions on Speech and Audio Processing, 10(6):352-362, 2002.

    [26] R. Plomp, L. C. W. Pols, and J. P. van de Geer. Dimensional analysis of vowel spectra. Journal of the Acoustical Society of America, 41(3):707-712, 1967.

    [27] B. Raj and R. M. Stern. Missing-feature approaches in speech recognition. IEEE Signal Processing Magazine, 22(5):101-116, 2005.

    [28] P. Renevey and A. Drygajlo. Missing feature theory and probabilistic estimation of clean speech components for robust speech recognition. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech 1999), pages 2627-2630, 1999.

    [29] M. L. Seltzer, B. Raj, and R. M. Stern. A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. Speech Communication, 43:379-393, 2004.

    [30] T. Takahashi, K. Nakadai, K. Komatani, T. Ogata, and H. G. Okuno. Missing-feature-theory-based robust simultaneous speech recognition system with non-clean speech acoustic model. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pages 2730-2735, 2009.

    [31] T. Kawahara and A. Lee. Free software toolkit for Japanese large vocabulary continuous speech recognition. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 4, pages 476-479, 2000.

    [32] J. M. Valin, F. Michaud, and J. Rouat. Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems, 55(3):216-228, 2007.

    [33] J. M. Valin, J. Rouat, and F. Michaud. Enhanced robot audition based on microphone array source separation with post-filter. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), pages 2123-2128, 2004.

    [34] J. M. Valin, J. Rouat, and F. Michaud. Enhanced robot audition based on microphone array source separation with post-filter. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), pages 2123-2128. IEEE, 2004.

    [35] F. Wang, Y. Takeuchi, N. Ohnishi, and N. Sugie. A mobile robot with active localization and discrimination of a sound source. Journal of the Robotics Society of Japan, 15(2):61-67, 1997.

    [36] S. Yamamoto, K. Nakadai, J. M. Valin, J. Rouat, F. Michaud, K. Komatani, T. Ogata, and H. G. Okuno. Making a robot recognize three simultaneous sentences in real-time. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2005), pages 897-902. IEEE, 2005.

    [37] S. Yamamoto, J. M. Valin, K. Nakadai, T. Ogata, and H. G. Okuno. Enhanced robot speech recognition based on microphone array source separation and missing feature theory. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2005), pages 1489-1494. IEEE, 2005.

    [38] S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J. M. Valin, K. Komatani, T. Ogata, and H. G. Okuno. Design and implementation of a robot audition system for automatic speech recognition of simultaneous speech. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2007), pages 111-116. IEEE, 2007.

    [39] S. Yamamoto, K. Nakadai, H. Tsujino, T. Yokoyama, and H. G. Okuno. Improvement of robot audition by interfacing sound source separation and automatic speech recognition with missing feature theory. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2004), pages 1517-1523. IEEE, 2004.

    [40] S. Yamamoto, K. Nakadai, J. M. Valin, J. Rouat, F. Michaud, K. Komatani, T. Ogata, and H. G. Okuno. Genetic algorithm-based improvement of robot hearing capabilities in separating and recognizing simultaneous speech signals. In Proceedings of the 19th International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems (IEA/AIE 2006), volume LNAI 4031, pages 207-217. Springer-Verlag, 2006.
