
METHODOLOGY FOR DEVELOPING SPEECH AND NOISE DATABASES FOR SOLDIER-BORNE AUTOMATIC SPEECH RECOGNITION

    Kevin D. Colburn1, Cynthia L. Blackwell2, Randall D. Sullivan3, Gary E. Riccio1, and Olivier Deroo4

    1Exponent, Inc.

    21 Strathmore Road

    Natick, MA 01760

Corresponding author's e-mail: [email protected]

    2U.S. Army Soldier and Biological Chemical Command

    Natick Soldier Center

    Kansas St.

    Natick, MA 01760

    3The Wexford Group International

931 Front Avenue, Columbus, GA 31901

    4Babel Technologies SA

    33 Boulevard Dolez

    B-7000 Mons, Belgium

    Abstract: A soldier-borne Automatic Speech Recognition (ASR) system would enable the dismounted soldier of the near

    future to control other electronic subsystems without removing hands from weapon or eyes from target, thus increasing the

    soldier's mobility, lethality, and survivability. Soldier-borne ASR systems face two primary challenges to successful

    performance: varying background noise, including complex combinations of speech and background noises, and shifts in

    user state. Traditionally, ASR systems perform accurately only when the training speech and noise conditions are identical

or substantially similar to the operating speech and noise conditions. A survey of existing databases revealed a lack of operationally relevant speech and noise databases for dismounted infantry operations. The purpose of this paper is to

    provide a methodology for creating such databases. We summarize our analyses and conclusions about existing databases,

    the definition of various operational noises and individual speech variations that occur on the battlefield, the

    experimentation plan for the collection of data, its planned refinement into useful databases, and the planned validation of

those databases.

1. INTRODUCTION

The soldier of the near future will be equipped with a growing number of body-worn electronic subsystems. It is critical to

minimize the impact of these systems' control mechanisms on the soldier's physical and cognitive capabilities; a mechanism

    that enables control of electronic subsystems without removal of hands from weapon or eyes from target will naturally

increase soldier mobility, lethality, and survivability. A recent collaboration [Exponent, 2002] between Exponent and the U.S. Army's Natick Soldier Center, addressing System Voice Control (SVC), demonstrated the possibilities for

    implementing a hands-free, and to some extent eyes-free, control mechanism. Part of the SVC effort involved an

    investigation of variations in speech signal and noise and their influence on recognition performance.

    Automatic speech recognition (ASR) for the soldier faces two primary challenges to successful performance: varying

    background noise and shifts in user state. Battlefield acoustic conditions involve complex combinations of speech and

    background noises, including noises that are continuous, impulsive (and often intermittent), stationary, non-stationary, and

    of extremely high sound pressure levels. In addition, the soldier may shift states rapidly, e.g., from whispering (to avoid

revealing one's position) to shouting (primarily because of background noise).

    ASR systems tend to perform well only when the training speech and noise conditions are identical or substantially

    similar to the operating speech and noise conditions. The SVC effort revealed a dearth of operationally valid speech and

    battlefield noise databases. Until new databases are created, systems intended for use on the battlefield can be neither


    properly trained nor properly evaluated. The SVC effort provided a foundation for an effort to develop such databases and

    to demonstrate that their use in the development of robust ASR systems will enable those systems to perform in battlefield

    noise conditions. Building on that foundation, we describe a methodology that can achieve the following goals:

• Collect data that will allow soldier-worn ASR systems to be trained with acoustic conditions that are identical or substantially similar to those of the environment in which the systems will operate.

• Collect data needed for databases that can be used by others for ASR research and development that is relevant to Army stakeholders.

• Create the databases according to established standards, and document the process to facilitate replication and elaboration by others.

• Validate the databases in a way that is free of specific ties to a particular ASR system or subsystem vendor.

Once trained with the databases, ASR systems should improve their performance in the traditionally challenging

    acoustic conditions experienced by dismounted soldiers in combat and support roles. Reliably high levels of performance in

such conditions are a prerequisite for soldier acceptability of SVC. The resulting databases would also complement

    the ongoing refinement of ASR algorithms, systems, and general architecture by creating a test bed for systematically

    comparing and evaluating ASR systems and components. Thus, the databases created using this methodology would form

    an enabling technology for robust ASR systems.

2. SURVEY OF EXISTING DATABASES

COTS/GOTS databases for speech signal variations that are used for customization of SVC products in the commercial

    sector were obtained and categorized in terms of the dimensions of acoustic variation that they represent. Some databases

    are composed of continuous speech, which is not ideal for command-style ASR training. Others focus on discrete words

    and numbers, which is not ideal for dictation or continuous-speech ASR. COTS/GOTS databases of environmental noise,

relevant to military applications, were evaluated by a process similar to that used for the speech databases. One database,

    NOISEX, was selected as the best available because of its range of noise types as well as the rigor taken in documenting the

    sources and technical details associated with the samples in the database. However, that range of noise types provides

    insufficient variation across dismounted-infantry-appropriate operational environments, and the noises differ (drastically in

    some cases) in intensity and complexity from the noises that dismounted infantry would actually hear. Is this the only

    problem with that database? This is a key statement in that it provides the basis for the work you propose. Need something

    stronger here it used the wrong microphone is not a strong reason for recommending the need for this soldier-bornedatabase. What is the real problem?

    The survey clearly revealed that recordings of speech and environmental noise that are appropriate to dismounted-

infantry operations are needed.

3. DEFINITION OF OPERATIONAL NOISE: OPERATIONAL ENVIRONMENT FACTORS

The acoustic operating environment for soldier-worn ASR systems will vary by mission, operational mode, duty position,

    terrain, and other such factors.

Every mission involves stages, such as planning, movement, and actions on the objective. A stage may exhibit a

    preponderance of particular noises or noise types, though the frequency of occurrence of the noises will vary significantly.

    Offensive, defensive, and stability missions are likely to be similar acoustically and can be combined in a general combat

category. A support mission is less likely to contain weapons, munitions, or explosives noises, so it can be considered a second category of mission.

    ASR systems will be required to perform in all of the operational modes soldiers may encounter in combat (generally

    referred to as dismounted, mounted, mounted supported by dismounted, and dismounted supported by mounted), each of

    which generates particular noises. Dismounted operations are performed on foot by forces in close combat (offensive,

    defensive, and stability) and support. Mounted operations include all operations utilizing ground, air, and maritime

    vehicles, all of which generate noise. The mounted mode is also characterized by the firing of weapons organic to the

    vehicle during combat operations. Both mounted and dismounted modes are likely to be supported by Joint and coalition

    forces and assets (e.g., air and naval fire support), generating a wide variety of noises.

    Soldiers in different duty positions will experience different acoustic environments, though many will have common

aspects. Duty position examples include: rifleman/machine gunner; grenadier/special operations soldier; medic; vehicle

    refueler; engineer (combat/sapper); engineer (bridge builder); and field artilleryman.


    The type of terrain expected to be encountered may dictate the inclusion or exclusion of entire classes of vehicles,

weapons, and the like, and the presence or absence of these noise sources will influence the soldier's acoustic environment.

    Terrain will also influence the amount of echo and reverberation in the acoustic environment.

4. DEFINITION OF OPERATIONAL NOISE: BATTLEFIELD NOISE SOURCES

    The preceding section addressed the factors influencing the set of noises to which a soldier might be exposed. This section

    identifies the specific noise sources that populate those sets. The sources that a soldier is likely to encounter in combat are

divided into natural groupings that have different effects on ASR performance: vehicles and aircraft tend to produce stationary, continuous noise, whereas weapons, especially machine guns, are intermittent and impulsive; munitions and

    explosives tend to be non-stationary, impulsive, and often generate overpressure at the impact point; and environmental and

    infrastructure noises can be stationary, non-stationary, continuous, impulsive, and intermittent, and can introduce effects

    such as reverberation.

    The noise of weapons, munitions, or explosives rounds impacting various targets is different from the noise the

    weapons, munitions, or explosives generate at the firing position, and these target sounds should be collected to replicate

    the noise of enemy munitions fired at U.S. forces in battle. One possibility for recording noises at the target position is to

    emplace one expendable microphone close to the target position to experience the effects of overpressure and another, non-

    expendable microphone at a safe distance to record the entire duration of the noise.

    Any of the vehicles typically found on the battlefield can produce noise from its engine, drive train, wheels/tracks, and

    airflow around the vehicle. The noise from the vehicle will generally be constant (stationary), though engine speed and

    travel speed (via wind and/or wave effects) will affect intensity and acoustic frequency. The firing of weapons mounted on

    the vehicle or carried by the rider(s) will produce additional noises. Aircraft typically found on or over the battlefield

    include helicopters, fighter jets, and transport planes. Any of the aircraft can produce noise from engines, propellers, rotors,

    airflow around the aircraft, and from the firing of weapons mounted on the aircraft or carried by the rider(s).

    Other noise sources that can affect speech recognition and should thus be considered for data collection include the

    following: weather-related noise such as heavy rain, thunder, and wind; natural terrain, which rarely generates additive

    noise, but can attenuate or amplify battlefield noise and create echo and reverberation; urbanized terrain, which like natural

    terrain, can either attenuate or amplify battlefield noise, but can also generate noises; and gasoline and diesel generators

    used by U.S. forces in combat and support. These last two paragraphs are key. The fact that the DI is exposed to thesenoises directly is the pitch for this work to be done and what separates this work from previous databases. I believe

    something of this sort is worth saying/pointing out here.

    Dismounted infantry are, uniquely, exposed directly to all of these noise sources and therefore require new databases

    for ASR that accurately reflect their unique acoustic environment.

5. DEFINITION OF OPERATIONAL NOISE: OPERATIONAL SPEECH

Speech typically encountered in the battlefield includes voice communications from leaders, members of squads/crews, and

    command groups. Some messages tend to recur. The standard formats of some of these typically recurring messages are

    prime examples of the kinds of vocabulary used at the battalion level and below. These messages are predominantly spoken

    over a radio or, in some instances, person-to-person. Many could potentially be spoken into a soldier- or leader-worn

computer using ASR.

Standard message format examples are as follows, in order of likelihood of occurrence in combat and stability

    operations: Spot Report (SPOTREP), Size, Activity, Location, Unit or Uniform, Time and Equipment (SALUTE), or Size

    Activity, Location and Time (SALT) report; call for fire; brevity codes; fire commands for machine-gun teams and anti-

    tank weapons; immediate Close Air Support (CAS) request; medical evacuation request (Medevac); Ammunition,

    Casualties, and Equipment (ACE) report; Warning Order/Fragmentary Order; SLANT report/combat power report; and

    free-text message. We are not aware of any examples of soldiers talking to computers in the current force.

6. INDIVIDUAL VARIATION


The following dimensions are quasi-static and represent variation across individuals: age, accent, gender, and education level. For a database to accurately represent any of these dimensions, the characteristics of both the experimental subjects and of the Army soldier population must be known. The database would need adequate sample sizes for various subpopulations. Without adequate sample sizes, best practices indicate that databases should be annotated with

    descriptions of the subpopulations from which the speech samples are collected. Ambient environmental conditions could

    be recorded to some degree, and qualitative observations of respiration rate, anxiety level, and any particular tasks that the

    soldier performs could be recorded when possible. The most effective ASR system will take advantage of the likely

    emergence of smart card technology that will be carried by each soldier. The emergence of such a solution allows for the

    storage of various data, including a voice template for each soldier. Although this will not eliminate the need for some

    general baseline algorithm training to provide a base ASR capability, it will likely obviate the need to train the system on a

    large set of data that represents the quasi-static dimensions of variation listed above.

    Dynamic dimensions of variation represent variation within individuals, are less likely to be captured in a voice

    profile, and thus must be addressed in real-time by any ASR system:

• Fatigue: At Exponent's direction, the University of Massachusetts Exercise Science Department conducted experiments and generated a database of spoken utterances recorded during standing, locomotion, and fatigue. The

    data from these experiments are still being analyzed. To our knowledge, no other databases have been released that

    capture speech in combination with locomotion and fatigue.

• Heat and cold: The U.S. Army Research Institute of Environmental Medicine developed the physiological Heat Strain Index (HSI), based on core body temperature and heart rate, for humans [Moran, 1998]. A similar index

    called the Cold Strain Index (CSI) was then developed [Moran, 1999], though the effect of either heat stress or

    cold stress on speech production has not been addressed. There may be value in relating HSI or CSI values in test

    subjects to trends in speech data collected from those subjects, but the required data collection (core body

    temperature and heart rate) would be outside the scope of the database development effort.

• Cognitive load and anxiety level: These characteristics are ill-suited for ready, quantitative assessment during field speech data collection.

• Intensity: Whispered and high-stress/high-intensity speech are traditionally the most challenging to ASR systems, and therefore should be assigned a high priority for recording. We expect that the Lombard effect [Lombard, 1911]

    would be pervasive throughout the data collection, given the anticipated background sound pressure levels of the

    training exercises.

It is probably feasible to collect adequate sample sizes for each of these dimensions. Much of the data would be collected in a laboratory setting.

7. SPEAKER DEPENDENCE IN ASR SYSTEMS

    A speaker-independent solution for SVC would avoid the need for an initial training session for each soldier and an

additional training session when one soldier needs to use another soldier's system. A speaker-independent system is

    typically less accurate than a speaker-dependent system and requires far more training data, but is likely to be more noise-

robust, especially when the operating or testing conditions do not match the training conditions exactly, as will often be the case with soldiers due to background noise and speaker stress.

    Speaker-independent systems allow the recognition of words that were not in the training set, but require many

    speakers during the training phase. One type requires a total of at least 200 repetitions of each keyword from a group of at

    least 50 different speakers for training and testing; another requires at least 100 phonetically balanced sentences from a

    group of at least 50 different speakers; and a third type requires more speakers (preferably 300 or more) and more data.

Speaker-dependent systems are generally capable of accurately recognizing only words that were in the training set;

    any other word is rejected as an out-of-vocabulary utterance. Furthermore, the system will not be robust to noise that is

    substantially different from noise present in its training data. Because a different model is created for each speaker, though,

    the quasi-static sources of inter-speaker variation are inherently addressed and thus no other speakers are needed during

training. For isolated word recognition, a minimum of two repetitions of each keyword by the speaker is necessary during

    training to achieve good recognition rates.


    The amount of data required to train a speaker-independent system is thus much greater than for a speaker-dependent

    system. For either system, a sufficient amount of data must be collected to divide the data into separate training and testing

    sets. Typically, 10-20% of the data should be set aside for testing. Each set must be large enough to achieve statistical

significance for the accuracy results. This database development effort would likely be conducted with the expectation of being used for speaker-dependent systems. This is an important decision because although a speaker-dependent system can

    be trained on data gathered for a speaker-independent system, the converse is not possible.
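As a minimal sketch of such a held-out split (plain random selection in Python; splitting by speaker or session, which may be preferable when testing speaker-independent systems, is not shown):

```python
import random

def train_test_split(utterances, test_fraction=0.15, seed=42):
    """Set aside a fraction of collected utterances for testing.

    test_fraction follows the 10-20% guideline above; the seed makes the
    split reproducible so training and testing sets never overlap between runs.
    """
    assert 0.10 <= test_fraction <= 0.20, "guideline: reserve 10-20% for testing"
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, testing set)
```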

    The data collection in this effort would be variable, opportunistic, and often unpredictable. Single-word utterances of

    relevant vocabulary words would occur frequently, given the verbal exchanges that one should target (as described in

    Section 5). It is unrealistic to expect soldiers in training exercises to speak phonetically balanced sentences that are not part

    of their typical lexicon.

    It would be prudent to begin the effort with a pilot study to demonstrate the ASR performance advantage of recording

    speech and noise combinations in the battlefield. The nature of a pilot study would limit the amount of data collected in the

    study. After separation into training and testing sets, the amount of pilot data would likely be large enough to train only a

    speaker-dependent system. Both speaker-dependent and speaker-independent systems could be tested with the data, though:

    one could train a speaker-independent system with a specific training database, as in previous Exponent SVC experiments,

    test it on the pilot test data, and then compare the results to those of a speaker-dependent system tested and trained on the

    pilot data. In short, one could record both a speaker-dependent training and testing database and a speaker-independent

    testing database.

    Beyond the pilot study, for the full database development effort, a goal for the remainder of the data collection should

    be to collect a sufficient amount of data to enable testing of any combination of speaker-dependent and speaker-

    independent systems, as well as noise-canceling algorithms and other front-end additions to the models. Ideally, data would

    also be collected that would allow training of speaker-independent systems, but whether a sufficient number of phonetically

    balanced sentences from a sufficient number of different speakers could be recorded remains to be seen.

8. LOGISTICS OF DATABASE DEVELOPMENT

Collecting speech and noise data during actual combat operations would be nearly impossible for a variety of reasons, so

data would be collected primarily during live-fire training exercises. For mobile speech collection instrumentation, one or more air conduction microphones would be placed as close as possible to the location where an ASR system's close-talk microphone

would be placed. The characteristics of the recorded speech signal would vary significantly with microphone placement, while the characteristics of the recorded background noise may vary little with microphone placement, depending on the

type of microphone used. Bone conduction microphones would also be considered and, if used, placed in the soldier's

    helmet.

    In order to determine the noise or noise combinations to be recorded, four aspects of the noise or noise combinations

    should be considered:

• Probability of occurrence in the soldier's environment: Support operations may be more predictable than combat.

• Degree of challenge posed to ASR: The effect of duration, frequency, intermittence, intensity, etc. on an ASR system's ability to detect contemporaneous speech.

• Ease of collection (equipment, methods, and personnel).

• Probability of occurrence in a training exercise: Likely combinations of noise are easier to predict in a training exercise than in the battlefield because the position and motion of noise sources are somewhat predetermined.

    To integrate these factors into the decision process for data collection, one would assign each noise or noise combination a

numerical value in the range of 1 to 10 for each factor, where a value of 10 represents the most favorable state for a factor (e.g., the highest degree of challenge). One would then assign weights to each of the four factors in order of importance.

    Then the sum for each noise or noise combination would reflect a composite rating of four weighted factors, and the greater

    the value of the sum, the more desirable the noise or noise combination would be for recording.
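A minimal sketch of that weighted-sum rating follows; the factor weights are illustrative assumptions, since the text specifies only the 1-10 scale and ordering the weights by importance:

```python
# Illustrative weights, ordered by assumed importance (not values from the text).
WEIGHTS = {
    "occurrence_in_environment": 4,
    "challenge_to_asr": 3,
    "occurrence_in_exercise": 2,
    "ease_of_collection": 1,
}

def composite_rating(scores: dict) -> int:
    """Weighted sum of the four factor scores (each 1-10, 10 = most favorable)."""
    assert all(1 <= v <= 10 for v in scores.values()), "factor scores must be 1-10"
    return sum(WEIGHTS[factor] * scores[factor] for factor in WEIGHTS)

# E.g., machine-gun fire while a tank passes by:
rating = composite_rating({
    "occurrence_in_environment": 9,
    "challenge_to_asr": 8,
    "occurrence_in_exercise": 7,
    "ease_of_collection": 5,
})
print(rating)  # 9*4 + 8*3 + 7*2 + 5*1 = 79; higher sums are recorded first
```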

    Once the noises and noise combinations to be collected have been identified, other collection factors should be

    considered: if possible, five repetitions of each noise or noise combination should be recorded; noises should be recorded at

various distances from the microphone; and if an intermittent noise source (such as a machine gun) is fired differently by different

    soldiers, then each style of firing should be recorded (i.e., capture whatever occurrence frequencies exist).


    The data collection requires a minimum of two teams: a coordination team and a data collection team. The two-person

coordination team would coordinate with the targeted installations and units through, at a minimum, the following: division-


    level and above operations section (G3)/Directorate of Plans and Training (DPT)/Directorate of Plans, Training, and

    Mobilization (DPTM); brigade- and battalion-level operations section (S3); installation range control and safety office.


The viability of locating portions of the instrumentation off-soldier and off-vehicle would be explored. Unlike the mobile instrumentation, stationary instrumentation would be the only means of collecting noise data in down-range target

locations and would be inexpensive in the event of loss due to munitions effects.

    In addition to the acoustic data recording, other data may be collected in order to identify the various components of

    speech and noise combinations present in the recordings. This could involve simultaneous GPS signal recording,

    simultaneous video and/or additional audio recording, real-time note-taking by data collection personnel, or other methods.
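A sketch of the per-recording annotation record this implies; the field names are assumptions for illustration, since only the categories of ancillary data are specified:

```python
from dataclasses import dataclass, field
from typing import Optional, List, Tuple

@dataclass
class RecordingAnnotation:
    """Ancillary data identifying the components of one speech/noise recording."""
    recording_id: str
    start_time_utc: str                          # when the segment was captured
    gps_position: Optional[Tuple[float, float]] = None  # (lat, lon) from simultaneous GPS logging
    noise_sources: List[str] = field(default_factory=list)  # e.g., ["tank pass-by", "machine gun"]
    microphone: str = "air conduction, close-talk position"
    video_reference: Optional[str] = None        # pointer to simultaneous video, if recorded
    collector_notes: str = ""                    # real-time notes by data collection personnel
```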

    The collected acoustic data would be prepared, annotated, and transcribed according to Linguistic Data Consortium

    guidelines. Preparation would involve uploading audio from the recording medium to computer disk and segmentation into

    units appropriately sized for analysis. National Institute of Standards and Technology (NIST) SPHERE headers would be

    attached to the audio to facilitate research, and the data would be divided into training, development, and evaluation data

    sets. Transcription of the data would be conducted according to a modified Hub-5 convention for acoustic databases to be

    used in ASR.
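As a sketch of the header-attachment step, the following writes a minimal NIST SPHERE header ahead of raw little-endian PCM audio. The field set follows common SPHERE conventions; actual data preparation would rely on established NIST/LDC tooling rather than hand-rolled headers:

```python
def write_sphere_file(path, pcm_bytes, sample_rate=16000, channels=1, bytes_per_sample=2):
    """Write raw PCM audio with a minimal 1024-byte NIST SPHERE header.

    A sketch only: fields follow common SPHERE usage (the header begins with
    "NIST_1A" and its size, ends with "end_head", and is padded to 1024 bytes).
    """
    fields = [
        "NIST_1A",
        "   1024",
        f"sample_rate -i {sample_rate}",
        f"channel_count -i {channels}",
        f"sample_n_bytes -i {bytes_per_sample}",
        f"sample_count -i {len(pcm_bytes) // (bytes_per_sample * channels)}",
        "sample_byte_format -s2 01",  # little-endian two-byte samples
        "sample_coding -s3 pcm",
        "end_head",
    ]
    header = ("\n".join(fields) + "\n").encode("ascii").ljust(1024, b" ")
    with open(path, "wb") as f:
        f.write(header)
        f.write(pcm_bytes)
```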

    An increase in ASR recognition accuracy when tested in operationally relevant noise conditions would clearly validate

    the database generation. In addition, another validation is possible: that of combining noises in the laboratory for training

    and testing of ASR systems. As the data are collected, we expect that several noises would occur simultaneously (e.g., the

    noise from a tank passing by as a machine gun is fired). The first step in validation would be to record those same noises

    individually. Next, they would be combined in the laboratory. Finally, this combination would be used to train an ASR

    system. With identical test conditions, similar performance with a system trained on simultaneously-occurring noises versus

    a system trained on noises combined in the laboratory would validate the combination method. Widely dissimilar results

    would invalidate the method and would necessitate much more extensive data collection in the field.
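A minimal sketch of the laboratory combination step, assuming the individually recorded noises have been loaded as floating-point arrays in [-1, 1] at a common sample rate:

```python
import numpy as np

def combine_noises(noise_a: np.ndarray, noise_b: np.ndarray, gain_b: float = 1.0) -> np.ndarray:
    """Additively mix two individually recorded noises (e.g., tank pass-by + machine gun).

    gain_b scales the second source toward the relative level observed when the
    noises occurred simultaneously in the field; estimating that level from the
    field recording is the hard part and is not shown here.
    """
    n = min(len(noise_a), len(noise_b))      # mix only the overlapping span
    mix = noise_a[:n] + gain_b * noise_b[:n]
    return np.clip(mix, -1.0, 1.0)           # guard against clipping
```

An ASR system trained on such laboratory mixtures would then be compared, under identical test conditions, against one trained on field recordings of the same simultaneously occurring noises.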

9. CONCLUSION

This paper delineates a methodology to create operationally valid speech and battlefield noise databases for ASR

    systems intended for battlefield use by dismounted infantry. There is a paucity of available speech and noise databases for

military operational environments as compared to other application domains. The value of the databases lies in gaining a higher-fidelity representation of the acoustic environment in which dismounted infantry operate. The methodology leads to

    a higher-fidelity representation because speech and noise data would be collected under relatively controllable conditions

    that are as similar as possible to the actual operating environment. Creating such databases and using them to train and

    evaluate ASR systems would enable sufficiently effective ASR performance to achieve soldier acceptability, and thus to

    grant the desired increase in mobility, lethality, and survivability that would come with hands-free operation of electronic

subsystems.

10. REFERENCES

1. Future Warrior Technology Integration: System Voice Control Phase III Final Report. Exponent, Inc. (2002). Natick, MA.

2. Future Warrior Technology Integration: System Voice Control EY2000 Report. Exponent, Inc. (2000). Natick, MA.

3. Lombard, E. (1911). Le signe de l'élévation de la voix. Ann. Maladies Oreille, Larynx, Nez, Pharynx, 37: 101-119.

4. Moran, D.S., Shitzer, A., and Pandolf, K.B. (1998). A physiological strain index to evaluate heat stress. American Journal of Physiology, 275 (Regulatory Integrative Comp. Physiology 44): R129-R134.

5. Moran, D.S., Castellani, J.W., O'Brien, C., Young, A.J., and Pandolf, K.B. (1999). Evaluating physiological strain during cold exposure using a new cold strain index. American Journal of Physiology, 277 (Regulatory Integrative Comp. Physiology 46): R556-R564.