Aug. 14, 2009 NCMMSC2009 Masato AKAGI (赤木 正人) , Professor School of Information...


Multi-layer model for expressive speech perception and its application to expressive speech synthesis. Aug. 14, 2009, NCMMSC2009. Masato AKAGI (赤木 正人), Professor, School of Information Science, Japan Advanced Institute of Science and Technology.


  • Multi-layer model for expressive speech perception and its application to expressive speech synthesis
    Aug. 14, 2009, NCMMSC2009

    Masato AKAGI, Professor, School of Information Science, Japan Advanced Institute of Science and Technology

    *

    BASIC CONCEPTS

    Speech production and perception are human activities. We study speech production and perception as human activities and construct useful models for advanced sound processing systems.

    AISL (Masato Akagi), IIPL (Jianwu DANG)

    *

    Research Issues

    *

    Motivation: Global and Universal Communication

    Speech is the most natural and important means of human-human communication in our daily life. Even without understanding a language, we can still judge the expressive content of a voice, such as emotions.

    Our study aims at:
      • constructing universal communication environments beyond languages, nations, and cultures, based on non-linguistic information, and
      • globalizing and universalizing human-human communication, so that elders, infants, handicapped persons, etc. and/or machines can communicate with one another, as can people of different languages, nations, and cultures.

    *

    Problems

    To make communication possible beyond languages, nations, and cultures, we need biological features of speech production and perception that are common to all humans, independent of language, nation, and culture, that is:
      • common organ movements for production,
      • common features produced by common movements,
      • common impressions and brain activities caused by presenting common acoustic features, and
      • common behaviors among communicators.

    We have to:
      • discuss what is essential in the production and perception of non-linguistic information in the speech chain,
      • find biological features common among humans that do not depend on languages, nations, and cultures, and
      • apply these common features to human-machine communication as well as human-human communication.

    *

    Strategies in Chain Structure

      • Production → Acoustic features → Perception (forward)
      • Perception → Acoustic features → Production (backward)
      • Interaction between perception and production via the brain
      • Finding common features
      • Applications:
          adding non-linguistic information to speech (in synthesis)
          dialog understanding based on recognition of non-linguistic information (in recognition)

    *

    Research subjects

    *

    Modeling of emotional speech perception

    Target: Emotional Speech

    *

    Basic Concept: How do we define an angry voice?

      • Usual: "A voice in which the power of the components in the high-frequency region is 10 dB higher than in the neutral voice." Right, but...
      • For emotional speech: "A loud voice", "a shrill voice", etc.

    *

    Multi-layer model of auditory impression

    Concept:
      • high-level psychological features such as emotions (Neutral, Sad, Joy, etc.) are explained by semantic primitives described by relevant adjectives,
      • each semantic primitive is conveyed by certain physical acoustic features, and
      • each high-level psychological feature is therefore related to certain semantic primitives and, through them, to acoustic features.

    *

    Development of the model: For emotions

    5 emotions: Neutral, Joy, Sad, Cold Anger, and Hot Anger, selected from the database produced by the Fujitsu Laboratory and recorded by a professional actress.

    Experiment 1: Examine utterances in terms of emotion.

    Experiment 2: Construct a perceptual space of utterances in different categories using MDS.

    Experiment 3: Determine suitable primitive features for the perceptual model.

    *

    Exp. 1: Examine subjects' perception of expressive utterances

                    Neutral   Joy   Cold Anger   Sadness   Hot Anger
    Neutral           98%     12%      10%          5%        1%
    Joy                0%     87%       0%          0%        0%
    Cold Anger         2%      1%      86%          3%        2%
    Sadness            0%      0%       4%         92%        0%
    Hot Anger          0%      0%       0%          0%       97%
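The Exp. 1 percentages can be checked with a few lines of Python. This is a minimal sketch that assumes the flattened table reads row by row in the order Neutral, Joy, Cold Anger, Sadness, Hot Anger; the slide text does not say which axis holds the intended category and which the perceived one, so the code only reports column sums and the diagonal (matching) entries. The names `TABLE`, `column_sums`, and `mean_diagonal` are illustrative, not from the slides.

```python
LABELS = ["Neutral", "Joy", "Cold Anger", "Sadness", "Hot Anger"]

# Percentages copied from the Exp. 1 table.
# Assumption: the flattened slide text is read row by row.
TABLE = [
    [98, 12, 10, 5, 1],   # Neutral
    [0, 87, 0, 0, 0],     # Joy
    [2, 1, 86, 3, 2],     # Cold Anger
    [0, 0, 4, 92, 0],     # Sadness
    [0, 0, 0, 0, 97],     # Hot Anger
]


def column_sums(table):
    """Sum each column; a column of percentages should total 100."""
    return [sum(row[j] for row in table) for j in range(len(table[0]))]


def diagonal_rates(table, labels):
    """Return the matching (diagonal) percentage for each category."""
    return {label: table[i][i] for i, label in enumerate(labels)}


def mean_diagonal(table):
    """Average of the diagonal entries across all categories."""
    return sum(table[i][i] for i in range(len(table))) / len(table)


if __name__ == "__main__":
    print(column_sums(TABLE))
    print(diagonal_rates(TABLE, LABELS))
    print(mean_diagonal(TABLE))
```

Under this reading every column totals 100%, which suggests that each column describes one stimulus category and each row one response category.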

    *

    Exp. 2: Construct a psychological distance model; Exp. 3: Determine suitable primitive features

    *

    Select 17 semantic primitives

    Semantic primitives: bright, monotonous, dark, heavy, high, clear, low, noisy, strong, quiet, weak, sharp, calm, fast, unstable, slow, well-modulated

    *

    Appropriately describe human nature by fuzzy logic

    [Figure: graded labels "Slightly joyful", "joyful", "Very joyful"; example statement "This time he is joyful"; scale marks 20cm, 40cm, 60cm]

    *

    Calculate a regression line to fit the output of the FIS

      • The slope of the regression line indicates the relationship.
      • If the slope is positive, the relationship is a positive correlation, and vice versa.
      • The larger the absolute value of the slope, the closer the relationship.
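The slope-based reading can be illustrated with an ordinary least-squares fit. This is a minimal sketch, not the author's implementation: it assumes `xs` holds the input ratings of one semantic primitive and `ys` the corresponding FIS outputs, and the sample numbers are hypothetical.

```python
def regression_slope(xs, ys):
    """Least-squares slope of the line fitted to the points (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def describe(slope):
    """Return the sign of the relationship and its strength (absolute slope)."""
    if slope > 0:
        direction = "positive"
    elif slope < 0:
        direction = "negative"
    else:
        direction = "none"
    return direction, abs(slope)


if __name__ == "__main__":
    # Hypothetical data: primitive rating (x) versus FIS output (y).
    ratings = [0, 1, 2, 3, 4]
    fis_output = [1.0, 1.5, 2.0, 2.5, 3.0]
    print(describe(regression_slope(ratings, fis_output)))
```

Comparing slopes across primitives is only meaningful when all ratings share the same scale, which this sketch assumes.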

    *

    Results are compatible with the way humans respond when they perceive speech

    *

    Analyze acoustic features for building the relationship

      • 27 acoustic features were measured.
      • Conduct correlation analysis.
      • Select acoustic features whose correlation coefficient is over 0.6.
      • 16 acoustic features are most related to semantic primitives.
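The selection step can be sketched as a Pearson correlation followed by a threshold. The slide gives only the 0.6 cutoff, so this sketch makes two assumptions: the coefficient is Pearson's r, and the cutoff applies to its absolute value. The feature names and values below are hypothetical and are not taken from the 27 measured features.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(features, ratings, threshold=0.6):
    """Keep features whose |r| against the primitive ratings exceeds threshold."""
    selected = {}
    for name, values in features.items():
        r = pearson(values, ratings)
        if abs(r) > threshold:  # assumption: cutoff applies to |r|
            selected[name] = r
    return selected


if __name__ == "__main__":
    # Hypothetical ratings of one semantic primitive for five utterances.
    ratings = [1, 2, 3, 4, 5]
    # Hypothetical acoustic measurements for the same five utterances.
    features = {
        "mean_f0": [100, 120, 140, 160, 180],
        "power_range": [5, 4, 3, 2, 1],
        "duration": [3, 1, 4, 1, 5],
    }
    print(select_features(features, ratings))
```

Running the selection once per semantic primitive and taking the union of the selected names would give a feature set like the one the slide describes.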

    *

    Understand the vague nature of human perception from the resulting relationships: Joy

    *

    *

    Synthesis of Emotional Speech

    Target: Verify the three-layered model (from bottom to top)

    *

    Approach: construct a model for expressive speech perception

    [Figure: three layers (Acoustic features, Semantic primitives, Expressive speech categories); label: Verify]

    *

    Implementation: Flow of Main Function

    [Block diagram: Original Utterance, STRAIGHT Analysis, F0 contour & Spectrum, Segmentation measurement, Segmentation Information, Control parameters variation, Morphing Process, Morphed Utterance]

    *

    Implementation: Flow of Morphing Process

    [Block diagram]
      • Inputs: F0 Contour, Spectrum, Segmentation Information
      • Control parameters: F0 contour, spectrum, time duration, power envelope
      • Processing blocks: F0 Contour Modification, Spectrum Modification, Time Duration Modification, STRAIGHT Synthesis, Resynthesized Signal (modification of F0, spectrum, time duration), Power Envelope Modification
      • Output: Morphed speech signal

    *

    Modifying Acoustic Features

    Decompose acoustic features to modify them independently.

    [Figure label: Speech wave]

    *

    Temporal decomposition (TD)

    Temporal decomposition (Atal, 1983)
    N: number of frames
    K: number of events; K