Soft missing-feature mask generation for robot audition
-
Upload
toru-takahashi -
Category
Documents
-
view
214 -
download
2
Transcript of Soft missing-feature mask generation for robot audition
-
PALADYN Journal of Behavioral Robotics
Research Article DOI: 10.2478/s13230-010-0005-1 JBR 1(1) 2010 37-47
Soft missing-feature mask generation for Robot Audition
Toru Takahashi1, Kazuhiro
Nakadai23 , Kazunori Komatani1,
Tetsuya Ogata1, Hiroshi G. Okuno1
1 Department of Intelligence and Science
and Technology
Graduate School of Informatics,
Kyoto University, Kyoto 606-8501, Japan
2 Honda Research Institute Japan Co., Ltd.,
8-1 Honcho, Wako, Saitama 351-0114, Japan,
3 Mechanical and Environmental Informatics,
Graduate School of Information Science and
Engineering, Tokyo Institute of Technology,
Tokyo, 152-8552, Japan
Received 21 February 2010
Accepted 19 March 2010
Abstract
This paper describes an improvement in automatic speech recognition (ASR) for robot audition by introducing Miss-
ing Feature Theory (MFT) based on soft missing feature masks (MFM) to realize natural human-robot interaction. In
an everyday environment, a robots microphones capture various sounds besides the users utterances. Although
sound-source separation is an effective way to enhance the users utterances, it inevitably produces errors due to
reflection and reverberation. MFT is able to cope with these errors. First, MFMs are generated based on the reli-
ability of time-frequency components. Then ASR weighs the time-frequency components according to the MFMs.
We propose a new method to automatically generate soft MFMs, consisting of continuous values from 0 to 1 based
on a sigmoid function. The proposed MFM generation was implemented for HRP-2 using HARK, our open-sourced
robot audition software. Preliminary results show that the soft MFM outperformed a hard (binary) MFM in recogniz-
ing three simultaneous utterances. In a human-robot interaction task, the interval limitations between two adjacent
loudspeakers were reduced from 60 degrees to 30 degrees by using soft MFMs.
Keywords
Robot Audition HARK missing-feature-theory soft mask generation simultaneous speech recognition Automatic Speech Recog-nition sound source separation sound localization
1. Introduction
Human-robot interaction (HRI) is one of the most essential topics in be-
havioral robotics. HRI is improved by the inclusion of a natural speech
communication function with robot-embedded microphones because
we generally use speech in our daily communication. In an everyday
environment a user may barge in or interrupt a robot while it is speak-
ing, or several users may speak at the same time, which is termed si-
multaneous speech. In addition, the robot itself generates sounds due
to its fans and actuators, so the robot must be able to deal with multi-
ple sound sources simultaneously. A conventional approach in human-
robot interaction is to use microphones near the speakers mouth to col-
lect only the desired speech. Kismet of MIT has a pair of microphones
with pinnae, but a human partner still used a microphone close to the
speakers mouth [4]. A group communication robot, Robita of Waseda
University, assumes that each human participant uses a headset micro-
phone [16]. Thus, Robot Audition was proposed to realize the hearing
capability that allows a robot to listen to several things simultaneously
by using the its embedded microphones in [18].
Robot audition has now been actively studied for more than ten years,
as typified by organized sessions on robot audition at the IEEE/RAS
International Conferences on Intelligent Robots and Systems (IROS
20042009), and also a special session on robot audition at the IEEE
International Conference on Acoustics Speech and Signal Processing
E-mail: {tall,komatani,ogata,okuno}@kuis.kyoto-u.ac.jpE-mail: [email protected]
(ICASSP 2009) of the Signal Processing Society. Sound source sepa-
ration as pre-processing of automatic speech recognition (ASR) is an
actively-studied research topic in this field.
Hara et al. reported a humanoid robot, HRP-2, which uses a micro-
phone array to localize and separate a mixture of sounds, and which is
capable of recognizing speech commands in a noisy environment [12].
HRP-2 can recognize one speakers utterance under noisy or interfering
speakers. Nakadai et al. reported SIG, a humanoid robot which uses a
pair of microphones to separate multiple speech signals through an ac-
tive direction-pass filter, and recognizes each separated speech phrase
using ASR [20]. They demonstrated that even when three speakers
utter words at the same time, SIG was able to recognize what each
speaker said. However, since their system used 51 acoustic models
trained under different conditions at the same time, the system incurs
a high computational cost, and performance deteriorates in an environ-
ment with unexpected and/or dynamically changing noises. Kim et al.
have developed another binaural sound-source localization and sep-
aration method by integrating sound-source localization obtained by
CSP (Cross-power Spectrum Phase) and that obtained by visual infor-
mation with an EM algorithm [14]. This system assumes that only one
predominant sound exists in each time frame. Valin et al. have devel-
oped sound-source localization and separation by Geometric Source
Separation, and a multi-channel post-filter with 8 microphones to per-
form speaker tracking [32, 33].
Sound-source separation is an ill-posed problem, however, because
it is impossible to perfectly estimate the effect of reverberation and
environmental noises which change dynamically using microphones
embedded in a mobile robot. Thus, sound-source separation pro-
duces separation errors. To remove such errors, a non-linear speech
enhancement method such as Minima Controlled Recursive Average
37
-
PALADYN Journal of Behavioral Robotics
(MCRA) [5] or Minimum Mean Square Error (MMSE) [10] is often used.
Indeed, non-linear speech enhancement removes the separation er-
rors, but it also generates some distortions like musical noise, which
drops ASR performance.
ASR systems, on the other hand, assume that the input speech is clean
or contaminated with a known noise source, because their target is
mainly telephony applications, which generally involve a high signal-to-
noise ratio (SNR).
There is, therefore, a mismatch between pre-processing sound source
separation and ASR systems. It follows that one of the most important
issues in robot audition is integration between preprocessing andASR.
1.1. Missing Feature Theory
Missing Feature Theory (MFT) is a promising approach for integration
between pre-processing and ASR. MFT is a technique which is know
to improve the noise-robustness of speech recognition by masking out
unreliable acoustic features using a so-called missing feature mask
(MFM) [6, 15, 27]. The effectiveness of MFT has been widely reported
in connected digit recognition for telephony applications [1, 8], speaker
verification [9, 23], de-reverberation [24], and recognition of separated
speech in a binaural way [35].
Yamamoto and Nakadai et al. are the first research group to introduce
a Missing Feature Theory (MFT) to integrate ASR with a binaural robot
audition system [39]. First, the reliability of each time-frequency (TF)
component was estimated by comparing separated speech with the
corresponding clean speech. Then, a hard MFM consisting of 0 or 1
for each TF component was generated based on the reliability using
a manually-defined threshold. Since this mask generation algorithm
used reference speech signals to estimate the reliability, the generated
MFM is called a priori MFM. Although they used a priori hard MFM,they showed a remarkable improvement in the speech recognition of
separated sounds. This showed the effectiveness of MFT approaches.
Automatic MFM generation rises as an issue; actually, this is the pri-mary issue in MFT approaches, and remains an open question despite
numerous MFT studies. Although most works on automatic MFM gen-
eration focus on single-channel input, or on binaural input, Yamamoto
and Valin et al. have developed an automatic MFM generation process
based on microphone-array processing [37].
First, they showed that unreliable features generated by pre-processing
are mainly caused by energy leakage from other sound sources. A
microphone-array-based technique was developed to estimate the re-
liability of each time-frequency component from this energy leakage, by
considering the properties of a multi-channel post-filter process and en-
vironmental noises. Their automatic MFM generation was able to cor-
rectly estimate around 70% of unreliable TF components, compared to
a prioriMFM. Thus, the ASR performance drastically improved, and si-multaneous speech recognition of three voices was attained. However,
they still used a hard binary MFM consisting of a value equal to 0 or 1,
while the reliability of each TF component is estimated as a continuous
value in the range 0 to 1.
This means that some useful information which is contained in the es-
timated reliability may be lost if hard MFM is used.
1.2. Soft Missing Feature Mask
A soft MFM with a continuous value from 0 to 1 was reported as a
better masking approach[27] than hard MFM,both because soft mask-
ing can directly deal with the reliability of an input signal, and because
probabilistic methods can be applied at the same time. Bayesian mask
estimation algorithms were proposed in [28, 29], while Barker et al.[2]
M-SourceSimultaneous
Speeches
MFM
SeparatedSignal
GeometricSource
Separation
Multi-channel
Post-Filter
AcousticFeature
Extraction
Results
MFT-BasedASR
ym sm^
sN
s1
Noise-supressed
Signal
AcousticFeatureAutomatic MFM
Generation
LeakNoise
Estimation
BackgroundNoise
Estimation
bnm
N-channel
SigmoidFunction
R
Figure 1. Geometric source separation with multi-channel post-filter.
used a sigmoid function to estimate a soft MFM. We therefore believe
that a soft MFM also improves the performance of robot audition in
the recognition of pre-processed (separated) speeches. A hard MFM
approach may work when a small number of time-frequency compo-
nents are overlapped between the target speech and a noise, but in
speech-noise cases such as barge-in or simultaneous speech, many
time-frequency components are overlapped. Since a soft masking ap-
proach directly uses reliability, it can also deal with overlapped time-
frequency components properly.
In this paper, we present an automatic, soft-MFM generation method
based on a sigmoid function which is then implemented as a mod-
ule of our open-sourced robot audition software, HARK [21]. To show
the validity of the proposed soft-MFM method, we demonstrate its ef-
fectiveness through tasks including simultaneous speech recognition,
and human-robot interaction involving a humanoid HRP-2 robot taking
a meal order.
The rest of this paper is organized as follows: Section 2 describes the
design of a soft-MFM-generation algorithm for robot audition. Section 3
describes the implementation of a robot-audition system with the pro-
posed soft-MFM generation method, using HARK, our robot audition
software. Section 4 illustrates how HRP-2 receives a meal order by
means of robot audition functions. Section 5 evaluates our proposed
soft-MFM-generation method through recognition of three simultane-
ous speeches and a human-robot interaction scenario. The last section
concludes this paper.
2. The Design of Soft Missing FeatureMask
This section describes the design of our soft MFM which is based on
reliability estimation for time-frequency components. First, the reliability
of the time-frequency component is defined, then separated speeches
are analyzed based on the measured reliability in order to model soft-
MFM generation. Parameter optimization for the modeled soft-MFM
generation is also shown.
2.1. Definition of reliability
Figure 1 shows the core steps of pre-processing in HARK, i.e. Geomet-
ric Source Separation (GSS) [25] and multi-channel post-filtering. GSS
38
-
PALADYN Journal of Behavioral Robotics
is a hybrid sound-source separation method between beam forming
and blind-source separation. Thus, an N-channel input signal whichconsists ofM sound sources sm is separated into each sound source,ym. We use an 8-channel microphone array (N = 8), and the num-ber of sound sources, M, is decided in a sound localization module(see Sec.3.1). As mentioned in the previous section, however, sound-
source separation is an ill-posed problem, and thus ym still includesnon-stationary cross-talk (leakage) and stationary background noises.
Multi-channel post-filtering suppresses both of these types of noise and
produces a noise-suppressed signal sm.The reliability of sm for each time-frequency component (frame and fre-quency indices are omitted for simplification) was defined by
R = sm + bnym . (1)
where bn is a background noise which is separately estimated usingMCRA [5]. Note that R corresponds to leakage level because leakageis a dominant factor in making a time-frequency component unreliable,
as mentioned in the previous section.
2.2. Analysis of separated speech based on reliability
We analyzed the characteristics of R and found that there are twopeaks in the histogram for separated speeches when three speeches
were uttered simultaneously.
One peak corresponds to the leakage components, and the other
matches target-speech components. We checked several intervals
from 10, 20, , 80, 90 degrees, and found the same tendency forevery interval.
2.3. Modeling a soft mask
In hard masking, a hard MFM is generated by thresholding as follows:
HMm ={
1, R > TMFM0, otherwise (2)
where TMFM is a threshold. Dynamic acoustic features, called fea-tures, are commonly used with static acoustic features to improve ASR
performance. features are calculated by linear regression of five con-secutive frames. Let static acoustic features be m(k), features arethen defined by
m(k) = 12i=2 i2
2i=2 i m(k + i), (3)
where k represents frequency indices. Thus, hard masks for featuresare defined in the same way.
HMm(k) =k+2
i=k2,i 6=kHMm(i). (4)
where k now shows the frame index.
Such a linear discrimination with TMFM , however, leads to misclassi-fied time-frequency components; we therefore decided to introduce
soft masking. We assume that these two groups follow Gaussian dis-
tributions. The distribution function for a Gaussian is defined by
d(x) = 12
(1 + erf
( x 2
))(5)
erf(x) = 2 x0 e
t2dt (6)
Let the distribution functions for leakage and target speech be dn(R)and ds(R), respectively. A normalized speech reliability can be definedby
B(R) = ds(R)ds(R) + dn(R) (7)
=1 + erf
(Rss2)
2 + erf(Rss2) erf
(Rnn2) (8)
This is a sigmoid-like function defined using error functions erf(). Sincethere is a high calculation cost for B(R), we decided to use a typicalsigmoid function Q(R) rather than to use this complicated function di-rectly. We then defined a soft MFM based on Q(R) as follows [30]:
SMm = w1Q(R|a, b), (9)Q(x|a, b) =
{ 11+exp(a(xb)) , x > b0, otherwise , (10)
where w1 is an weight factor for static features (0.0 w1). Q(|a, b)is a modified sigmoid function which has two tunable parameters; acorresponds to a trend of the sigmoid function while b represents anx-offset. We also defined soft masks for features as
SMm(k) = w2k+2
i=k2,i6=kQ(R(i|a, b)). (11)
where w2 is an weight factor for dynamic () features (0.0 w2).
2.4. Parameter optimization for soft masking
Figure 2 shows the relationship between soft and hard MFMs. When
a is infinity and w = 1.0 in Equation (10), a soft MFM works as a hardMFM. In this case, b works as threshold, TMFM . Parameters a and bcan be derived from Eqs. (10) and (7), but it is difficult to attain analytical
solutions for them. In addition, for w1 and w2, we have no theoreticalevidence for parameter estimation. We thus measured the recognition
performance of three simultaneous speech signals in order to optimize
these parameters for a robot having eight omni-directional microphones
as shown in Figure 8. Simultaneous speech signals were recorded in a
room with RT20 = 0.35. Three different words were played simultane-ously at the same volume from three loudspeakers located 2m away
39
-
PALADYN Journal of Behavioral Robotics
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Reliability
Soft
mas
k
a = 140, b = 0.15
a = 80, b = 0.15
a = 20, b = 0.15
Figure 2. Sigmoid function (Eq.(10) for soft mask generation when b = 0.15and a = 20, 80, 140.
Table 1. A search space for soft MFM parameters
parameter range stepa 40 b 0.1 1.5 0.1w1 0.1 1.5 0.1w2 0.1 1.5 0.1
from the robot. Each word was selected from the ATR phonetically bal-
anced wordset consisting of 216 Japanese words. The direction of one
loudspeaker was fixed in front of the robot, and the others were located
at 10, 20, , 80, 90 degrees to the robot. For each configu-ration, 200 combinations of the three different words were played.
Table 1 shows a search space for a soft MFM parameter set p =(a, b, w1, w2). Figure 3 shows an example of w1-w2 parameter op-timization for the center speaker when the loud speakers were located
at 0, 90, and -90 degrees. For other conditions, we obtained a similar
tendency forw1-w2 parameter optimization. We also performed param-eter optimization for a and b, and found that a similar result is obtainedfor every layout. We therefore obtained the optimized parameter set
popt defined by
popt = argmaxp
1
9
90=10
1
3(WC(a, b, w1, w2) +
+WR(a, b, w1, w2) + WL(a, b, w1, w2)) (12)
where WC , WR , and WL indicate the number of correct words foreach of the center, right and left loudspeakers where their locations are
(0, ,) degrees, respectively.Finally, we attained the optimal parameter set for the soft MFM as
popt = (40, 0.5, 0.1, 0.2).
76
76
78
78
78
78
80
80
80 80
80
82
82
82
82
82
82
82
82
84
84
84
84
86
86
8688
delta weight (w2)
sta
tic
wei
gh
t (w
1)
Center speaker in 90deg. condition
0.5 1 1.5
0.5
1
Figure 3. ASR Performance for the center loudspeaker in a word correct rate.This is the case where three loudspeakers were located at (0. 90.-90). This shows the results for the parameters w1 and w2.
3. A Robot Audition System
Our robot audition system consists of five major components shown
in Figure 1. Our proposed soft-MFM generation was described in the
previous section. This section explains the other four components:
A: Geometric Source Separation,
B: Multi-channel post-filter,
C: Acoustic feature extraction,
D: Missing-Feature-Theory-based Automatic Speech Recognition
(MFT-ASR).
Besides the five components, our robot audition system uses several
techniques such as sound-source localization and tracking, which are
described in [38].
3.1. Geometric source separation
GSS is a hybrid algorithm of Blind Source Separation (BSS) and beam-
forming [25]. BSS has a number of limitations such as permutation and
scaling problems, which can be relaxed in GSS by the introduction of
geometric constraints. These are obtained from the locations of mi-
crophones and sound sources. Unlike the Linearly Constrained Mini-
mum Variance (LCMV) beamformer which minimizes the output power
subject to a distortion-less constraint, GSS explicitly minimizes cross-
talk, leading to faster adaptation. The method is also interesting for
use in the mobile robotics context because it allows easy addition and
removal of sources. Using some approximations, it is also possible to
implement separation with relatively low complexity.
Our GSS was modified so as to provide faster adaptation using
stochastic-gradient and shorter time-frame estimation. The locations
of sound sources are estimated with Multiple Signal Classification (MU-
SIC). This is a frequency-domain adaptive beamforming method which
40
-
PALADYN Journal of Behavioral Robotics
produces a sharp local peak corresponding to a sound-source direc-
tion, thus its noise robustness improves in the real world.
To formulate GSS, suppose that there are M sources and N ( M)microphones. A spectrum vector of M sources at frequency , s(),is denoted as [s1()s2() . . . sM ()]
T , and a spectrum vector of sig-
nals captured by the N microphones at frequency , x(), is denotedas [x1()x2() . . . xN ()]
T , where T represents a transpose operator.x() is, then, calculated as
x() = H()s(), (13)
where H() is a transfer function matrix. Each component Hnm of thismatrix represents the transfer function from them-th source to the n-thmicrophone. The source separation is generally formulated as
y() = W ()x(), (14)
whereW () is called a separation matrix. The separation is definedas findingW () which satisfies the condition that output signal y() isthe same as s(). In order to estimateW (), GSS introduces two costfunctions, that is, separation sharpness (JSS ) and geometric constraints(JGC ) defined by
JSS(W ) = E [yyH diag[yyH ]]2, (15)
JGC (W ) = diag[WD I ]2, (16)
where 2 indicates the Frobenius norm, diag[] is the diagonal opera-tor, E [] represents the expectation operator andH represents the con-jugate transpose operator. D shows a transfer function matrix basedon a direct sound path between a sound source and each microphone.
The total cost function J(W ) is represented as
J(W ) = SJSS(W ) + JGC (W ), (17)
where S represents the weighting parameter which controls theweighting between the separation cost and the cost of the geomet-
ric constraint. This parameter is usually set to xHx2 according to[34]. In an online version of GSS,W is updated by minimizing J(W )
W t+1 = W t J(W t), (18)
where W t denotes W at the current time step t, J(W ) is defined as
an update direction ofW , and means a step-size parameter.
3.2. Multi-channel post-filter
A multi-channel post-filter is used to enhance the output of the GSS
algorithm[36]. It is a spectral filter using an optimal noise estimator de-
scribed in [10]. This method is a kind of spectral subtraction [3], but
it generates less musical noises and distortion, because it takes tem-
poral and spectral continuities into account. We extended the original
noise estimator to estimate both stationary and non-stationary noise
by using multi-channel information, while most post-filters only address
the reduction of a specific type of noise: stationary background noise
[17].
The output of GSS y forms an input of the multi-channel post-filter;
An output of the multi-channel post-filter s, is defined as
s = Gy, (19)
where G is a spectral gain. The estimation of G is based on minimummean-square error estimation of spectral amplitude. To estimate G,noise variance is estimated.
The noise variance estimation m is expressed as
m = stat.m +
leakm , (20)
where stat.m is the estimate of the stationary component of the noisefor source m, at frame t for frequency f , and leakm is the estimate ofsource leakage.
We computed the stationary noise estimate, stat.m , using MCRA tech-nique [5]. To estimate leakm , we assumed that the interference fromother sources is reduced by factor (typically -10dB -5 dB) byLSS. The leakage estimate is thus expressed as
leakm = M1
i=0,i 6=mZi, (21)
where Zi is the smoothed spectrum of the m-th source, Ym and recur-sively defined (with 0.7) [40]:
Zm(f, t) = Zm(f, t 1) + (1 )Ym(f, t). (22)
3.3. Acoustic feature extraction
To estimate the reliability of acoustic features, we have to exploit the
fact that noises and distortions are usually concentrated in some ar-
eas in the spectro-temporal space. Most conventional ASR systems
useMel-Frequency Cepstral Coecient (MFCC) [26] as an acous-tic feature, but noises and distortions are spread to all coefficients in
MFCC. In general, Cepstrum-based acoustic features like MFCC are
not suitable for MFT-ASR, Therefore, we use Mel-Scale Log Spec-trum (MSLS) as an acoustic feature.MSLS is obtained by applying inverse discrete cosine transformation
to MFCCs. Three normalization processes are then applied in order
to obtain noise-robust acoustic features: mean-power normalization,
spectrum-peak emphasis and spectrum-mean normalization. The de-
tails are described in [22]. These three normalization processes corre-
spond to three normalization performed against MFCC; C0 normaliza-
tion, liftering, and Cepstrum mean normalization. The acoustic-feature
vector composes 13 MSLS features, their derivatives and log power,i.e., a 27-dimensional MSLS-based acoustic vector was used.
3.4. Missing Feature Theory based ASR
Several robot audition systems with pre-processing and ASR have
been reported so far [11, 19]. Such systems just combine pre-
processing with ASR, and focus on the improvement of SNR and real-
time processing.
Two critical issues remain: what kinds of pre-processing are required
for ASR, and how does ASR use the characteristics of pre-processing
besides using an acoustic model with multi-condition training. We ex-
ploited an interfacing scheme between preprocessing and ASR based
on MFT.
41
-
PALADYN Journal of Behavioral Robotics
MFT uses MFMs in a temporal-frequency map to improve ASR. Each
MFM specifies whether a spectral value for a frequency bin in a specific
time frame is reliable or not. Unreliable acoustic features caused by er-
rors in preprocessing are masked using MFMs, and only reliable ones
are used for a likelihood calculation in the ASR decoder. The decoder
is an HMM-based recognizer, which is commonly used in conventional
ASR systems. The estimation process of output probability in the de-
coder is modified in MFT-ASR.
Let M(i) be a MFM vector that represents the reliability of the i-thacoustic feature. The output probability bj (x) is given by the follow-ing equation:
bj (x) =Ll=1
P(l|Sj ) exp
{Ni=1
M(i) log f(x(i)|l, Sj )
}, (23)
where P() is a probability operator, x(i) is an acoustic feature vector,N is the size of the acoustic feature vector, and Sj is the j-th state.
For implementation, we used Multiband Julian [13], which is based
on the Japanese real-time large-vocabulary speech-recognition engine,
Julian [31]. It supports various HMM types such as shared-state tri-
phones and tied-mixture models. Network grammar is supported for
a language model. It can work as a stand-alone or client-server ap-
plication. To run as a server, we modified the system to be able to
communicate acoustic features and MFM via a network.
4. System Implementation
This section introduces our open-sourced robot audition software,
HARK, then we describe implementation of the proposed soft-MFM
generation as a new module for HARK, and a robot audition system
with the new module.
4.1. Open-Sourced Robot Audition Software HARK
We consider a software environment for robot audition research. Most
studies focus on their own robot platforms, and their systems are un-
available for other researchers and research groups. It is pleasant
and useful for robot audition researchers to share a common platform,
because the researchers do not need to make their own robot plat-
forms from scratch, and they can easily change a module to compare
it with another. Thus, we implemented each technique described in
the previous sections as a component for modular-based architecture
called FlowDesigner [7]1. FlowDesigner provides a flexible and effi-
cient software development environment, which is achieved by flex-
ible replacement of modules and fast data communication between
modules. These advantages are achieved by the pull architecture of
FlowDesigner.
We then released a set of components as HARK (Honda Research
Institute Japan Audition for Robots with Kyoto University; the word
also means of listen in old English).2 [21]. HARK provides a user-
customizable total robot audition system including multi-channel sound
acquisition, sound localization, sound separation and ASR. HARK also
1 http://owdesigner.sourceforge.net/2 It is available at http://winnie.kuis.kyoto-u.ac.jp/HARK/.
provides opportunities to discuss general applicability, platform depen-
dency, customization and tuning. Difficulties such as customization to
another platform or tuning to another acoustic environment have not
yet been discussed. In addition, HARK has a possibility to stimulate
the robot audition research area, and to provide an effective tool for
interdisciplinary research such as natural language processing, naviga-
tion, and HRI. According to user feedback, performance and stability
of HARK will improve.
As related work, Valin released sound-source localization and separa-
tion software for robots called "ManyEars" as General Public License
(GPL) open-source software (OSS). This is the first software which
can provide generally-applicable and customizable robot audition sys-
tems. The only missing function is ASR. "ManyEars" is limited to sound-
source localization and separation. ASR has yet not been included as
it has a lot of parameters which affect the performance of a total robot
audition system.
4.2. Implementation of soft MFM generation forHARK
Figure 4 shows an example of a robot audition system constructed
using HARK. A green rectangle represents a module, while a con-
nection between modules is indicated by a black arrow. The top-
left module (AudioStreamFromMic) captures sounds using a robot-embedded microphone array. After frequency analysis (MultiFFT),sound sources are localized using LocalizeMUSIC,SourceTracker,and SourceIntervalExtender. The localized sound sources are sep-arated with GSS, and PostFilter enhances the separated sounds.MSLS features are calculated using MSLSExtraction, Delta, andFeatureRemover in a Mel-frequency domain (MelFilterBank). TheMFM is estimated in SMFMGeneration. Finally, MSLS features andthe corresponding MFMs are sent to MFT-ASR via a Socket interface
usingSpeechRecognitionClient. Thesemodules are prepared in ad-vance. Thus, users can easily construct their own robot audition sys-
tems by selecting modules and connecting them using a GUI interface.
The newly-developed module is shown as SMFMGeneration at thecenter of Figure 4. It has four terminals which are shown as black dots
on the left and right edges of the box. Three terminals on the left edge
correspond to the input signals such as sm, ym, and bn in Equation (1).The last terminal on the right edge shows the output, that is, a soft-
MFM vector for a frame. This module also has three parameters such
as FBANK, THRESHOLD, and TILT shown in the property setting
window in Figure 5, which appears by double-clicking the green box
of SMFMGeneration. FBANK represents the number of dimensionsfor a static part of a MFM vector defined in Eq. (9), that is 13 in our set-
ting. THRESHOLD and TILT correspond to a and b defined in Eq. (10),respectively.
5. Evaluation
To evaluate the proposed robot audition system with soft-MFM gener-
ation, simultaneous speech recognition was performed in a manner of
isolated word recognition. Also, the systemwas introduced to a human-
robot interaction scenario, that is, a with the robot taking a meal order.
5.1. Experimental setup
We used a humanoid robot HRP-2 with eight microphones around the
top of the head for an experiment of simultaneous speech recognition.
It was placed at the center of a circle in Figure 8. Three loudspeakers
42
-
PALADYN Journal of Behavioral Robotics
Figure 4. A robot audition system constructed by HARK modules.
Figure 5. A parameter tab of a soft mask generation module.
were used to play three speeches simultaneously. A loudspeaker was
fixed in front of the robot, and two other loudspeakers were located
at 30, 60, 90, 120, or 150 shown in Table 2. The distancebetween the robot and each loudspeaker was 1m. Four combinations
of sound sources were used shown in Table 3. Thus, we generated
20 test data sets (3 4). Each test dataset consists of 200 combina-tions of three different words randomly-selected from ATR phonetically
balanced 216 Japanese words.
For an acoustic model in ASR, we trained a 3-state and 16-mixture tri-
phone model based on Hidden Markov Model (HMM) using 27 dimen-
sional MSLS features. To make evaluation fair, we performed an open
test, that is, the acoustic model was trained with a different speech
corpus from test data. For training data, we used a Japanese News
a) A humanoid robot HRP-2.
b) Layout of microphones.
Figure 8. A humanoid robot HRP-2 with an 8 ch microphone array.
Article Speech Database containing 47,308 utterances by 300 speak-
ers. After adding 20dB of white noise to the speech data, the acoustic
model was trained with the white-noise-added training data, which is a
well-known technique for improving the noise-robustness of an acous-
tic model for ASR.
43
-
PALADYN Journal of Behavioral Robotics
Table 2. Loudspeaker locations.
Layout center (deg.) left (deg.) right (deg.)
LL1 0 30 -30
LL2 0 60 -60
LL3 0 90 -900
LL4 0 120 -120
LL5 0 150 -150
Table 3. Sound Source Combination.
combination center left right
SC1 female 1 female 2 female 3
SC2 female 4 female 5 female 6
SC3 male 1 male 2 male 3
SC4 male 4 male 5 male 6
5.2. Recognition of three simultaneous speeches
For comparison, we evaluated three kinds of MFMs as follows:
1. hard MFM : conventional hard MFM defined by Eqs. (2) and (4).
2. soft MFM (unweighted) : soft MFM defined by Eqs. (9) and (11)
with w1 = w2 = 1.
3. soft MFM (proposed) : the proposed soft MFM defined by
Eqs. (9) and (11) using the optimized MFM parameter set popt .
Word correct rates were measured with these MFMs for every test
dataset described above.
Figsures 911 illustrate averaged word correct rates for the center, left
and right speakers, respectively.
For the center speaker, we can say that our proposed soft MFM drasti-
cally improved ASR performance. For the left or right speaker, while the
improvement was less than that for the center speaker, we still find im-
provements to some extent, especially, when the angle between loud-
speakers is narrow. This difference is caused by the layout of the three
speakers. The sound from the center speaker is affected by both theleft and right speakers, while only the center speaker has a large ef-
fect on each of the side speakers. Thus, the number of overlapping
TF components for the center speaker is larger than that of either the
left or the right speaker individually. Also, their overlapping level for the
center speaker is higher than the others. This proves that the proposed
soft MFM is able to cope with the large number of overlapping TF com-
ponents, even in the highly-overlapped cases. The improvement of
the proposed soft MFM reached around 10 points by averaging three-
speaker cases.
When we focus on the difference between the unweighted soft MFM
and the proposed soft MFM, we can find a similar tendency with respect
to the difference between the soft and the hard MFMs; that is, the opti-
mization of weighting factors is more effective when two speakers are
closer together. This means that weighting factors work effectively to
deal with highly overlapped TF components.
30 60 90 120 150 3050
55
60
65
70
75
80
85
90
95
100
Speaker intervals (degrees)
Wo
rd C
orr
ect
Rat
e (%
)
Soft (w/ weight) Soft (w/o weight) Hard
Figure 9. Word correct rate for the center speaker.
30 60 90 120 150 3050
55
60
65
70
75
80
85
90
95
100
Speaker intervals (degrees)
Wo
rd C
orr
ect
Rat
e (%
)
Soft (w/ weight) Soft (w/o weight) Hard
Figure 10. Word correct rate for the left speaker.
5.3. Human-robot interaction scenario
We applied a robot audition system including the proposed soft MFM
generation to a human-robot interaction scenario. Figure 12 shows
snapshots of the scenario. In this scenario, the robot receives ameal or-
der, with three customers simultaneously asking the robot for what they
want. The robot localizes and separates their speeches, and the rec-
ognizes the separated speeches. This demonstration was performed
in the same room mentioned in Section 2.4. With the hard MFM we
were previously using, speakers had to keep more than 60 degree in-
tervals from the neighboring speaker for this kind of real situation, while
benchmark tests show that the system maintains performance even
when the interval between speakers is 30 degrees. This is caused
by dynamically-changing noises in a real-world environment. On the
44
-
PALADYN Journal of Behavioral Robotics
30 60 90 120 150 3050
55
60
65
70
75
80
85
90
95
100
Speaker intervals (degrees)
Wo
rd C
orr
ect
Rat
e (%
)
Soft (w/ weight) Soft (w/o weight) Hard
Figure 11. Word correct rate for the right speaker.
other hand, the robot with our proposed system maintained ASR per-
formance even in the case of a 30 degree interval for this scenario. This
means that the proposed soft MFM approach was effective in a real-
world environment where dynamically-changing noises exist to some
extent.
6. Conclusion and future work
We have presented an improvement in automatic speech recognition
of robot audition to allow it to realize natural human-robot interaction
which is a topic essential to behavioral robotics. The missing-feature
theory is adopted to integrate microphone-array-based pre-processing
of sound-source localization and separation. For the missing feature
mask, we used a soft missing feature mask taking a continuous value
between 0 and 1, instead of a conventional hard missing feature mask
taking a binary value, 0 or 1. The soft missing feature mask is generated
automatically by estimating the reliability of a time-frequency compo-
nent based on a sigmoid function. The automatic soft mask generation
is incorporated as a set of modules into the HARK open-sourced robot
audition system.
The resulting HARK-based robot audition system with automatic soft
mask generation improves the performance of automatic speech
recognition in the case of three simultaneous speeches, in particular
for narrower intervals of two adjacent speakers up to 30 degrees. The
conventional system worked for speaker-sepearations greater than 60
degrees. Therefore, the soft mask system provides opportunities to de-
ploy a robot audition system in more realistic multi-party interaction. As
a proof of concept, a humanoid HRP-2 demonstrated the role of taking
three meal orders at the same time.
Future work includes detailed analysis and more applications (for ex-
ample, extensive benchmarking to analyse the performance of auto-
matic speech recognition with wide variations of speaker configuration
under various acoustic environments); and application of the HARK-
based system to actual multi-party interactions. Typical scenarios will
include barge-in utterances, utterances of moving talkers, and recog-
nition while the robot is in motion. These applications are expected
Could you take an order?
May I help you?
3. HRP-2 starts to take orders. 4. Three users simultaneously utter to place an order to HRP-2.
Coffee
Coke
Orangejuice
You will drink coke.
5. HRP-2 rephrases the order to the right user. As each user is located, HRP-2 turns its body toward each user when rephrasing.
6. HRP-2 rephrases the order to the center user.
You will drinkorange juice.
7. HRP-2 rephrases the order to the left user.
Thank you for using our service.
You will drink coffee.
8. HRP-2 thanks the users.
1. HRP-2 waits for a user. 2. A user asks HRP-2 to take an order.
Figure 12. Snapshots of a meal order taking task.
to gather more experience in coping with a mixture of sounds, and to
guide new research towards symbiosis of human and robots through
verbal communication and auditory scene analysis.
Acknowledgments
Our research is partially supported by the Grant-in-Aid for Scientific Re-
search and Global COE Program.
45
-
PALADYN Journal of Behavioral Robotics
References
[1] J. Barker, M. Cooke, and P. Green. Robust ASR based on clean
speech models: An evaluation of missing data techniques for con-
nected digit recognition in noise. In Procedings of Eurospeech-
2001, pages 213216. ESCA, 2001.
[2] J. Barker, L. Josifovski, M. Cooke, and P. Green. Soft decisions in
missing data techniques for robust automatic speech recognition.
In Proc. of 6th International Conference on Spoken Language
Processing (ICSLP-2000), volume I, pages 373376, 2000.
[3] S. F. Boll. A spectral subtraction algorithm for suppression
of acoustic noise in speech. In roceedings of 1979 Interna-
tional Conference on Acoustics, Speech, and Signal Processing
(ICASSP-79), pages 200203. IEEE, 1979.
[4] C. Breazeal. Emotive qualities in robot speech. In Proceedings
of 2001 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS 2001), pages 13891394, 2001.
[5] I. Cohen and B. Berdugo. Speech enhancement for non-
stationary noise environments. Signal Processing, 81(2):2403
2418, 2001.
[6] M. Cooke, P. Green, L. Josifovski, and A. Vizinho. Robust au-
tomatic speech recognition with missing and unreliable acoustic
data. Speech Communication, 34(3):267285, May 2000.
[7] C. Ct, D. Ltourneau, F. Michaud, J. M. Valin, Y. Brosseau,
C. Rievsky, M. Lemay, and V. Tran. Reusability tools for pro-
gramming mobile robots. In Proceedings of IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems (IROS
2004), pages 18201825. IEEE, 2004.
[8] J. de Veth, F. de Wet, B. Cranen, and L. Boves. Missing feature
theory in asr: Make sure you miss the right type of features. In
Proceedings of Workshop on Robust Methods for ASR in Adverse
Conditions, Tampere, pages 231234, 1999.
[9] A. Drygajlo and M. El-Maliki. Speaker verification in noisy envi-
ronments with combined spectral subtraction and missing feature
theory. In Proceedings of 1998 IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP 1998), pages
121124, 1998.
[10] Y. Ephraim and D. Malah. Speech enhancement using mini-
mum mean-square error short-time spectral amplitude estimator.
IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-32(6):11091121, 1984.
[11] I. Hara, F. Asano, H. Asoh, J. Ogata, N. Ichimura, Y. Kawai,
F. Kanehiro, H. Hirukawa, and K. Yamamoto. Robust speech in-
terface based on audio and video information fusion for humanoid
HRP-2. In Proceedings of IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2004), pages 24042410.
IEEE, 2004.
[12] H. Isao, A. Futoshi, K. Yoshihiro, K. Fumio, and Y. Kiyoshi. Ro-
bust speech interface based on audio and video information fu-
sion for humanoid hrp-2. In Proceedings of 2004 IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems (IROS
2004), pages 24042410, 2004.
[13] Multiband Julius. http://www.furui.cs.titech.ac.jp/mbandjulius/.
[14] H. D. Kim, K. Komatani, T. Ogata, and H. G. Okuno. Human track-
ing system integrating sound and face localization using em algo-
rithm in real environments. Advanced Robotics, 23(6):629653,
2007.
[15] R. P. Lippmann and B. A. Carlson. Robust speech recognition with
time-varying filtering, interruptions, and noise. In Proceedings of
1997 ISCA 5th European Conference on Speech Communication
and Technology (EuroSpeech 1997), pages 365372, 1997.
[16] Y. Matsusaka, T. Tojo, S. Kuota, K. Furukawa, D. Tamiya, K. Hay-
ata, Y. Nakano, and T. Kobayashi. Multi-person conversation via
multi-modal interface a robot who communicates with multi-user.
In Proceedings of 6th European Conference on Speech Com-
munication Technology (Eurospeech 1999), pages 17231726,
1999.
[17] I. A. McCowan and H. Bourlard. Microphone array post-filter for
diffuse noise field. In ICASSP-2002, volume 1, pages 905908,
2002.
[18] K. Nakadai, T. Lourens, H. G. Okuno, and H. Kitano. Active au-
dition for humanoid. In Proc. of 17th National Conference on
Artificial Intelligence (AAAI-2000), pages 832839. AAAI, 2000.
[19] K. Nakadai, D. Matasuura, H. G. Okuno, and H. Tsujino. Improve-
ment of recognition of simultaneous speech signals using av inte-
gration and scattering theory for humanoid robots. Speech Com-
munication, 44(1-4):97112, October 2004.
[20] K. Nakadai, D. Matsuura, H. G. Okuno, and H. Tsujino. Improve-
ment of recognition of simultaneous speech signals using av inte-
gration and scattering theory for humanoid robots. Speech Com-
munication, 44(1-4):97112, 2004.
[21] K. Nakadai, H. G. Okuno, H. Nakajima, Y. Hasegawa, and H. Tsu-
jino. An open source software system for robot audition hark and
its evaluation. In Proceedings of 2008 IEEE/RAS International
Conference on Humanoid Robots (HUMANOIDS 2008), pages
561566, 2008.
[22] Y. Nishimura, T. Shinozaki, K. Iwano, and S. Furui. Noise-robust
speech recognition using multi-band spectral features. In Pro-
ceedings of 148th Acoustical Society of America Meetings, num-
ber 1aSC7, 2004.
[23] M. T. Padilla, T. F. Quantieri, and D. A. Reynolds. Missing feature
theory with soft spectral subtraction for speaker verification. In
Proceedings of the 8th International Congress on Spoken Lan-
guage Processing (InterSpeech 2006), pages 913916, 2006.
[24] H.M. Park and R. M. Stern. Missing feature speech recognition us-
ing dereverberation and echo suppression in reerberation environ-
ments. In Proceedings of 2007 IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP 2007), vol-
ume IV, pages 381384, 2007.
[25] L. C. Parra and C. V. Alvino. Geometric source separation: Mergin
convolutive source separation with geometric beamforming. IEEE
Transactions on Speech and Audio Processing, 10(6):352362,
2002.
[26] R. Plomp, L. C. W. Pols, and J. P. van de Geer. Dimensional anal-
ysis of vowel spectra. Acoustical Society of America, 41(3):707
712, 1967.
[27] B. Raj and R. M. Stern. Missing-feature approaches in speech
recognition. Signal Processing Magazine, 22(5):101116, 2005.
[28] P. Renevey and A. Drygajlo. Missing feature theory and proba-
bilistic estimation of clean speech components for robust speech
recognition. In Proceedings of European Conference on Speech
Communication Technology (Eurospeech-1999), pages 2627
2630, 1999.
[29] M. L. Seltzer, B. Raj, and R. M. Stern. A bayesian classifier for
spectrographic mask estimation for missing feature speech recog-
nition. Speech Communication, 43:379393, 2004.
[30] T. Takahashi, K. Nakadai, K. Komatani, T. Ogata, and H. G. Okuno.
Missing-feature-theory-based robust simultaneous speech recog-
nition system with non-clean speech acoustic model. In Pro-
ceedings of 2009 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems (IROS 2009), pages 27302735, 2009.
[31] K. Tatsuya and L. Akinobu. Free software toolkit for Japanese
large vocabulary continuous speech recognition. In International
Conference on Spoken Language Processing (ICSLP), volume 4,
pages 476479, 2000.
[32] J. M. Valin, F. Michaud, and J. Rouat. Robust localization and
tracking of simultaneous moving sound sources using beamform-
46
-
PALADYN Journal of Behavioral Robotics
ing and particle filtering. Robotics and Autonomous Systems
Journal, 55(3):216228, 2007.
[33] J. M. Valin, J. Rouat, and F. Michaud. Enhanced robot audi-
tion based on microphone array source separation with post-filter.
In Proc. IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pages 21332128, 2004.
[34] J. M. Valin, J. Rouat, and F. Michaud. Enhanced robot audi-
tion based on microphone array source separation with post-filter.
In Proceedings of IEEE/RSJ International Conference on Intelli-
gent Robots and Systems (IROS 2004), pages 21232128. IEEE,
2004.
[35] F.Wang, Y. Takeuchi, N. Ohnishi, and N. Sugie. Amobile robot with
active localization and discrimination of a sound source. Journal
of Robotic Society of Japan, 15(2):6167, 1997.
[36] S. Yamamoto, K. Nakadai, J. M. Valin, J. Rouat, F. Michaud, ,
K. Komatani, T. Ogata, and H. G. Okuno. Making a robot recog-
nize three simultaneous sentences in real-time. In Proceedings
of IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS 2005), pages 897902. IEEE, 2005.
[37] S. Yamamoto, J. M. Valin, K. Nakadai, T. Ogata, and H. G. Okuno.
Enhanced robot speech recognition based on microphone array
source separation and missing feature theory. In Proceedings
of IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS 2005), pages 14891494. IEEE, 2005.
[38] S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J. M. Valin,
K. Komatani, T. Ogata, and H. G. Okuno. Design and implemen-
tation of a robot audition system for automatic speech recognition
of simultaneous speech. In Proceedings of IEEE Workshop on
Automatic Speech Recognition and Understanding (ASRU-2007),
pages 111116. IEEE, 2007.
[39] S. Yamamoto, K. Nakadai, H. Tsujino, T. Yokoyama, and H. G.
Okuno. Improvement of robot audition by interfacing sound source
separation and automatic speech recognition with missing fea-
ture theory. In Proceedings of IEEE International Conference on
Robotics and Automation (ICRA 2004), pages 15171523. IEEE,
2004.
[40] S. Yamamoto, K. Nakadai, J. M. Valin, J. Rouat, F. Michaud, K. Ko-
matani, T. Ogata, and H. G. Okuno. Genetic algorithm-based im-
provement of robot hearing capabilities inseparating and recogniz-
ing simultaneous speech signals. In Proceedings of 19th Inter-
national Conference on Industrial, Engineering, and Other Appli-
cations of Applied Intelligent Systems (IEA/AIE06), volume LNAI
4031, pages 207217. Springer-Verlag, 2006.
47