
www.everyzing.com

Are We Ready?

A Look at the State of the Art in Speech-to-text Applications

Marie Meteer

August 2007

http://www.everyzing.com/


Overview

• Speech Recognition: The State of the Art
– A look back at where it came from
– Elements of the models
– State of the art performance

• Applications: Making them work
– Call Center Analytics
– Voicemail Transcription
– Needles in Haystacks
– Multimedia search

BBN Technologies’ Speech Milestones

Timeline, 1976 to 2005 (years marked: 1976, 1982, 1986, 1992, 1994, 1995, 1998, 2000, 2002, 2003, 2004, 2005)

• Early continuous speech recognizer using natural language understanding
• First software-only, real-time, large-vocabulary, speaker-independent, continuous speech recognizer
• First 40,000 word real time speech recognizer
• Pioneered statistical language understanding and data extraction
• Early adopter of statistical hidden Markov models
• Introduced context dependent phonetic units
• Rough’n’Ready prototype system for browsing audio
• Exceeded DARPA EARS targets
• Audio Indexer System – 1st generation
• Broadcast Monitoring System delivered to U.S. Gov’t. – 2nd generation
• DARPA EARS Program Award
• AVOKE STX 1.0 introduced
• AVOKE STX 2.0 with Domain Development Tools

Progress in Speech Recognition, 1990s

[Chart: word error rate (%) by year, 1987 to 1998, for these tasks: Resource Management; Resource Mgt Spkr Dep.; WSJ 5K Vocab; WSJ 64K Vocab; Broadcast News; SWBD Conversational Telephone; Connected Digits; Airline Task; Call Home]

BBN’s 2003 Performance Exceeds Word Error Rate Goals

[Chart: DARPA EARS for ASR Performance. Word error rate (axis 0 to 60) by year (2002, 2003, 2005, 2007), with series for Broadcast news ceiling, Broadcast news floor, Telephony ceiling, and Telephony floor]

Elements of a Speech Model

• Dictionary
– List of all the words and their pronunciations, the sequence of “phonemes” that make up the word
  • Real Networks: R-IY-L N-EH-T-W-ER-K-S
– Dictionary tool automatically creates phonetic pronunciations for most words

• Acoustic Model
– Captures the relationship between the sounds and the phonemes
– Specific to a language (e.g. English, Spanish) and a channel (e.g. telephony, broadcast)

• Domain Model
– Captures the sequences of words in the language using a “tri-gram” model, that is, the likelihood of a word given the two previous words
– Can be as general as “Conversational” or as specific as “Technology”
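To make the “tri-gram” idea concrete, here is a minimal sketch of estimating the likelihood of a word given the two previous words from raw counts. The function names and the three-sentence toy corpus are illustrative assumptions, not part of any BBN or EveryZing tool, and a production domain model would add smoothing for word sequences it has never seen.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word histories from tokenized sentences."""
    trigram_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        # Pad so the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return trigram_counts, history_counts

def trigram_prob(word, prev2, prev1, trigram_counts, history_counts):
    """Unsmoothed estimate of P(word | prev2, prev1); 0.0 if the history is unseen."""
    seen = history_counts[(prev2, prev1)]
    if seen == 0:
        return 0.0
    return trigram_counts[(prev2, prev1, word)] / seen

# Toy corpus (hypothetical): three short voicemail-style sentences.
corpus = [
    "give me a call".split(),
    "give me a minute".split(),
    "give me a call back".split(),
]
tri, hist = train_trigram(corpus)
p_call = trigram_prob("call", "me", "a", tri, hist)
print(f"P(call | me a) = {p_call:.3f}")
```

Because the counts come straight from the training text, swapping a “Conversational” corpus for a “Technology” corpus changes which word sequences the recognizer prefers, which is why the domain model can be general or specific.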

Model Requirements

• Acoustic Data
– Minimum of 50-100 hours of transcribed data
– English Broadcast News model trained on 1600 hours of broadcast news data
– Training data must be a precise transcription with corresponding audio file (including partial words, “um”, laughs, etc.)

• Domain Modeling Data
– Text data, either transcribed from audio or off the web
– Does not have to be as precise as for acoustic modeling
– Has to model both the vocabulary and “style” of speaking

• Dictionary
– Phonetic pronunciations of all of the words

Word Accuracy

• Recognition performance varies based on audio quality and domain
– Within News, factors include
  • Speaker
  • Audio quality
  • Background music
– Across Domains, factors include
  • Speaking style
  • Out-of-vocabulary rate
  • Audio quality

DOMAIN           ACCURACY (%)
News             74.5
Movie Reviews    77.8
Technology       79.4
Gaming           59.45
Religion         68.2

SPEAKER                          ACCURACY (%)
Male Anchor                      82
Female Anchor                    76
Non-native over the telephone    53
Commercial                       55
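The slides do not say how the accuracy figures were scored. A common convention, assumed here, is word accuracy = 100 × (1 − errors / reference words), where errors are the substitutions, deletions and insertions found by aligning the recognizer output against a reference transcript. The sketch below follows that convention; the function names and the example sentence pair are illustrative.

```python
def word_errors(reference, hypothesis):
    """Minimum substitutions + deletions + insertions (word-level edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of an extra word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_accuracy(reference, hypothesis):
    """Word accuracy in percent under the assumed 1 - WER convention."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcript is empty")
    return 100.0 * (1 - word_errors(reference, hypothesis) / n_ref)

# Illustrative pair: the recognizer drops one short word.
ref = "how big a victory for the president"
hyp = "how big victory for the president"
acc = word_accuracy(ref, hyp)
print(f"word accuracy: {acc:.1f}%")
```

Under this convention insertions count against the score, so accuracy can go negative on very noisy audio.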

Document Retrieval Accuracy

• To correctly retrieve a document, a search term only has to be found once in the document

• The table below reports on document retrieval accuracy based on words occurring 2 or more times in the document compared with overall word accuracy.

[Chart: ACCURACY, RECALL, and PRECISION (vertical axis 0 to 120) plotted for 15 items numbered 1 to 15]
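A back-of-the-envelope calculation shows why a term that only has to be found once is easier to retrieve than a single word is to recognize. The sketch assumes, as a simplification that is not stated in the slides, that each occurrence of the term is recognized independently with probability equal to the overall word accuracy. The chance of catching at least one of k occurrences is then 1 − (1 − p)^k.

```python
def hit_probability(word_accuracy, occurrences):
    """Chance that at least one of `occurrences` mentions is recognized correctly.

    Assumes independent recognition of each mention (a simplifying assumption).
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    return 1.0 - (1.0 - word_accuracy) ** occurrences

# 0.745 is the News word accuracy quoted in the deck; the independence model is hypothetical.
for k in (1, 2, 3, 5):
    print(k, round(hit_probability(0.745, k), 3))
```

Real recognition errors are correlated (a name missing from the vocabulary is missed every time), so this is an upper-bound style intuition, not a prediction.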

Markets and Applications

Call Center Recording

Government Intelligence

Enterprise Search (webcasts, corp info)

Broadcast Monitoring & Retrieval

(audio/video publication)

Digital Asset Production

Consumer Search (video search)

AVOKE Caller Experience Analytics

• Breakthrough Caller Experience Analytics
– The Only True End-to-End Solution
  • From dialing to termination
– Multiple Techniques To Extract Understanding
  • Prompt and speech recognition, telephony data, and human annotation
– Data-Driven Insights
  • With drill-down to listen for root cause
– Zero Integration
  • No on-site hardware or software

• To Manage & Optimize Contact Processes
– Improve Operational Visibility
– Reduce Agent Time by 15-30+%
– Boost First Call Resolution
– Eliminate Customer Dissatisfiers

Full Text & Keyword Search

Search for words spoken by callers or agents

View call with full text of caller and call center – including all IVR(s), queue(s) and agent(s)

Voicemail Transcription

• Requirements
– Near real time transcription
– High accuracy, especially on names
– Frequently very noisy conditions (non-native speaker calling on a cell phone from a street corner in Germany)

• Solution
– Speech recognition automates a “first pass”
– Human correction provides accuracy
– Full human transcription on poor quality calls
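The split between human correction and full human transcription can be sketched as a routing decision. Everything specific below is an assumption made for illustration: the `confidence` field, the 0.5 cutoff, and the task names are hypothetical, and the slides do not say how poor quality calls are detected.

```python
from dataclasses import dataclass

@dataclass
class Voicemail:
    message_id: str
    rough_transcript: str  # first-pass recognizer output
    confidence: float      # hypothetical recognizer confidence between 0 and 1

# Hypothetical cutoff; a real deployment would tune this on its own calls.
FULL_TRANSCRIPTION_THRESHOLD = 0.5

def route(voicemail, threshold=FULL_TRANSCRIPTION_THRESHOLD):
    """Pick the human task for a message based on first-pass confidence."""
    if voicemail.confidence < threshold:
        return "full_transcription"  # poor quality: transcribe from the audio
    return "correction"              # usable first pass: fix the rough transcript

queue = [
    Voicemail("vm-1", "hi tom i can't make the meeting", 0.82),
    Voicemail("vm-2", "hi time and make them eating", 0.31),
]
routes = {vm.message_id: route(vm) for vm in queue}
print(routes)
```

The design point is that the recognizer never has to be right on its own; it only has to save the transcriber time on the calls where its output is usable.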

Voicemail Solution: Human in the Loop

• Phone message is left
• Speech recognizer produces a rough transcript
• Transcribers fix the output of the speech recognizer
• Correct transcription goes back to the server

“Hi Tom. I can’t make the meeting but I’m available to call in. Give me a call at 101-555-1212. Thanks.”

Result: High Quality, Lower Cost

Custom Applications: Broadcast Monitoring

• Automatic translation of Arabic transcript from Language Weaver MT

• Automatic transcription of Arabic speech from BBN Audio Indexer

• Real-time streaming video (<5 min delay)

MultiMedia Search


…let’s look at the overall picture not just Obama and and Clinton Brett how do you assess the overall dynamics of what's happened over the course of the last three months how big -- victory for the president how big a defeat for the Democrat well it it. He would have been a bigger defeat it was a victory. This is this is -- reprieve cents for the president it's only as bill pointed out for months worth of funding. And it's and this issue's going to come up again in the Democrats are going to continue to try to impose restrictions on the with a president for a just war -- vote to be funded completely which is what. We're just talking about so. This is just justices have a battle he wanted that's that's nice for him but there's another one coming in just a few months. And of course what we have now is this whole idea that is taken hold and it's it's out there in the in the public parlance about September being in the big month not helpful to the president's cause -- -- for prisoners efforts you know we're not going to -- all the troops on the ground until next month and then visiting get to bounce of the summer to try to fix the situation. Probably unrealistic which in September's going to be a tough month of. ...

Problem:

Search engines have historically had very little to work with in terms of properly discovering and indexing multimedia content:

Opportunity:

The value of multimedia content is “trapped” inside the files, out of view of search engines. Titles and tags miss key concepts within the files:

Multimedia Consumption


Consumption:

• Automatic extraction of key terms and concepts for tagging, categorization

• Patent-pending “Snippet” navigation technology enables users to jump to relevant segments of the clip

• Social media integration drives RSS subscription, bookmarking, etc.

• Full text output enables related content presentation
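Jumping to the relevant segment of a clip depends on knowing when each word was spoken. The sketch below is not EveryZing's “Snippet” technology; it is a minimal illustration that assumes the recognizer returns a start time for every word and builds a plain inverted index from words to times. The timestamps and the two-second lead-in are made up.

```python
from collections import defaultdict

def build_index(timed_words):
    """Map each lowercased word to the sorted start times (seconds) where it occurs."""
    index = defaultdict(list)
    for start_seconds, word in timed_words:
        index[word.lower()].append(start_seconds)
    for times in index.values():
        times.sort()
    return dict(index)

def find_snippets(index, term, lead_in=2.0):
    """Playback start points for a term, backed up by `lead_in` seconds for context."""
    return [max(0.0, t - lead_in) for t in index.get(term.lower(), [])]

# Hypothetical time-aligned recognizer output for one clip.
clip = [
    (0.0, "let's"), (0.5, "look"), (1.0, "at"), (12.5, "Obama"),
    (13.5, "Clinton"), (95.0, "September"), (140.5, "September"),
]
index = build_index(clip)
starts = find_snippets(index, "september")
print(starts)
```

The same index also supports tagging and related-content features, since it records which terms occur in which clip and how often.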

Multimedia Discovery


Search Term          EveryZing Results    FoxSports Results    EveryZing Increase
Manny Ramirez        22                   7                    214%
Yankees              281                  111                  153%
Manchester United    21                   2                    950%
Golf                 214                  170                  25%
Federer              45                   15                   200%
David Beckham        36                   17                   111%
Tom Brady            53                   31                   71%

Example: FoxSports.com

• EveryZing Media Merchandising indexes the full contents of FoxSports Multimedia files.

• As a result, EveryZing is able to significantly increase the number of keyword results

• Great discovery leads to increased consumption and enhanced monetization opportunities.
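The “EveryZing Increase” column in the FoxSports.com comparison is consistent with a relative increase over the FoxSports result count, that is, 100 × (EveryZing − FoxSports) / FoxSports. The short sketch below reproduces three of the rows; the function name is illustrative.

```python
def percent_increase(new_count, baseline_count):
    """Relative increase of new_count over baseline_count, in percent."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (new_count - baseline_count) / baseline_count

# (EveryZing results, FoxSports results) taken from the comparison table.
results = {
    "Manny Ramirez": (22, 7),
    "Manchester United": (21, 2),
    "Federer": (45, 15),
}
increases = {term: percent_increase(ez, fox) for term, (ez, fox) in results.items()}
for term, pct in increases.items():
    print(f"{term}: {pct:.0f}%")
```

Small baselines inflate the percentage: Manchester United goes from 2 results to 21, which is 19 additional clips but a 950% increase.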

Summary

• Speech recognition takes an inaccessible data structure (audio) and turns it into an accessible one (text)

• It’s far from perfect, but it’s a big jump from nothing

• Takeaway: It’s the task that matters. Find the right role, and speech recognition works

• (Corollary: A good prompt is worth two years of research)


Media Merchandising Solutions

Thank you!

Marie Meteer, VP of Speech and NLP


www.everyzing.com


