Page 1:

A Survey on Automatic Text/Speech Summarization

Shih-Hsiang Lin (林士翔)

Department of Computer Science & Information Engineering

National Taiwan Normal University

References:
1. D. Das and A. F. T. Martins, A Survey on Automatic Text Summarization, 2007
2. Y. T. Chen et al., A probabilistic generative framework for extractive broadcast news speech summarization, IEEE Trans. on ASLP, 2009
3. E. Hovy's tutorial, Automated Text Summarization, COLING/ACL 1998
4. D. Radev's tutorial, Text Summarization, SIGIR 2004
5. Berlin's lecture, A Brief Review of Extractive Summarization Research, 2008

Page 2:

NLP Related Technologies


Page 3:

Outline

• Introduction
• Single-Document Summarization

– Early work

– Supervised Methods

– Unsupervised Methods

• Multi-Document Summarization
– Not available yet …

• Evaluation
– ROUGE

– Information-Theoretic Method


Page 4:


Introduction

• The subfield of summarization has been investigated by the NLP community for nearly half a century
– “A text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that” – (Radev, 2000)
· Summaries may be produced from a single document or multiple documents
· Summaries should preserve important information
· Summaries should be short

• Terminology in the summarization literature
– Extraction: identify important sections of the text

– Abstraction: produce important material in a new way

– Fusion: combine extracted parts coherently

– Compression: throw out unimportant sections of the text

– Indicative vs. Informative vs. Critical

– Generic vs. Query-oriented

– Single-Document Summarization vs. Multi-Document Summarization

Page 5:

Introduction (cont.)

• Input (Jones, 1997)
– Subject type: domain

– Genre: newspaper articles, editorials, letters, reports...

– Form: regular text structure; free-form

– Source size: single doc; multiple docs (few; many)

• Purpose
– Situation: embedded in a larger system (MT, IR) or not?

– Audience: focused or general

– Usage: IR, sorting, skimming...

• Output
– Completeness: include all aspects, or focus on some?

– Format: paragraph, table, etc.

– Style: informative, indicative, critical...

*This slide was adapted from Prof. Hovy’s presentation

Page 6:

Introduction (cont.)

• A Summarization Machine

*This slide was adapted from Prof. Hovy’s presentation

Page 7:

Introduction (cont.)

• A brief history of summarization


Page 8:

Speech Summarization

• Fundamental problems with speech summarization
– Disfluencies, hesitations, repetitions, repairs, …

– Difficulties of sentence segmentation

– More spontaneous portions of speech (e.g., interviews in broadcast news) are less amenable to standard text summarization

– Speech recognition errors

• Speech Summarization
– Speech-to-text summarization
· The documents can be easily looked through
· The parts of the documents that are interesting to users can be easily extracted
· Information extraction and retrieval techniques can be easily applied to the documents
– Speech-to-speech summarization
· Wrong information due to speech recognition errors can be avoided
· Prosodic information, such as the emotion of speakers, that is conveyed only by speech can be presented

*This slide was adapted from Prof. Furui’s presentation

Page 9:

Single-Document Summarization: Early Work

• The most cited paper on summarization is that of (Luhn, 1958)
– The frequency of a particular word in an article provides a useful measure of its significance
– Several key ideas put forward in this paper have assumed importance in later work on summarization (see the sketch below):
· Words were stemmed to their root forms, and stop words were deleted
· A list of content words was compiled and sorted by decreasing frequency, the index providing a significance measure of the word
· A significance factor was derived that reflects the number of occurrences of significant words within a sentence
· All sentences are ranked in order of their significance factor, and the top-ranking sentences are finally selected to form the auto-abstract
• Baxendale (1958) also suggested that “sentence position” is helpful in finding salient parts of documents
– He examined 200 paragraphs and found that in 85% of the paragraphs the topic sentence came first, and in 7% it came last
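As a concrete illustration, here is a minimal Python sketch of Luhn-style significance scoring. The stopword list, frequency threshold, and whitespace tokenization are illustrative simplifications, and the factor is computed over the whole sentence rather than Luhn's bracketed word clusters:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "that"}  # illustrative

def luhn_summary(sentences, top_k=2, freq_threshold=2):
    """Luhn-style extraction: frequent content words are 'significant',
    and each sentence is scored by how densely it packs them."""
    tokenized = [[w.lower().strip(".,;:") for w in s.split()] for s in sentences]
    freq = Counter(w for sent in tokenized for w in sent if w not in STOPWORDS)
    significant = {w for w, c in freq.items() if c >= freq_threshold}

    def significance_factor(sent):
        hits = [i for i, w in enumerate(sent) if w in significant]
        if not hits:
            return 0.0
        span = hits[-1] - hits[0] + 1        # words bracketed by significant words
        return len(hits) ** 2 / span         # (significant words)^2 / span length

    ranked = sorted(range(len(sentences)),
                    key=lambda i: significance_factor(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]  # keep original order
```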


Page 10:

Single-Document Summarization: Early Work (cont.)

• Edmundson (1969) describes a system that produces document extracts
– His primary contribution was the development of a typical structure for an extractive summarization experiment (400 technical documents)
– Four kinds of features are used:
· Word frequency and positional features
· Cue words: the presence of words like significant or hardly
· The skeleton of the document: whether the sentence is a title or heading
– Weights were attached to each of these features manually to score each sentence (a sketch follows below)
· About 44% of the auto-extracts matched the manual extracts
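The manual weighting can be sketched as a simple linear combination; the weights below are placeholders, not the tuned values from the 1969 experiments:

```python
def edmundson_score(cue, key, title, location, weights=(1.0, 1.0, 1.0, 1.0)):
    """Edmundson-style sentence scoring: a manually weighted linear
    combination of the cue, key (frequency), title, and location features."""
    a, b, c, d = weights
    return a * cue + b * key + c * title + d * location
```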


Page 11:

Single-Document Summarization: Supervised Methods

• In the 1990s, with the advent of machine learning techniques in NLP, a series of seminal publications appeared that employed statistical techniques to produce document extracts

• Kupiec et al. (1995) used a naive-Bayes classifier to categorize each sentence as worthy of extraction or not
– Let s be a particular sentence, S the set of sentences that make up the summary, and F_1, F_2, …, F_k the features

– Assuming independence of the features (the resulting posterior is shown below)

– Two additional features are used: sentence length and the presence of uppercase words

– Feature analysis revealed that a system using only the position and the cue features, along with the sentence length, performed best

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$
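A minimal sketch of this scoring rule, assuming the component probabilities have already been estimated from a labeled corpus (here passed in as dictionaries keyed by feature value); log-space avoids numerical underflow:

```python
import math

def naive_bayes_score(features, p_feat_given_summary, p_feat, p_summary):
    """Kupiec-style score: log P(s in S | F1..Fk) up to a constant, under
    the feature-independence assumption."""
    log_post = math.log(p_summary)
    for f in features:
        log_post += math.log(p_feat_given_summary[f]) - math.log(p_feat[f])
    return log_post  # rank sentences by this value and take the top ones
```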

Page 12:

Single-Document Summarization: Supervised Methods (cont.)

• Aone et al. (1999) also incorporated a naive-Bayes classifier, but with richer features
– Signature words: derived from term frequency (TF) and inverse document frequency (IDF), as in the formulas below
– Named-entity tagger
– Shallow discourse analysis
– Synonyms and morphological variants were also merged (accomplished with WordNet)

• Lin and Hovy (1997) studied the importance of the sentence position feature
– Since discourse structure varies significantly across domains, they made an important contribution by investigating techniques for tailoring the position method towards optimality over a genre
· They measured the yield of each sentence position against the topic keywords
· They then ranked the sentence positions by their average yield to produce the Optimal Position Policy (OPP) for topic positions for the genre

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

where n_{i,j} is the count of term t_i in document d_j and |D| is the number of documents in the collection
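A small sketch of this weighting, assuming naive whitespace tokenization; signature words would be the terms with the highest tf-idf weights:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per document, where tf is the
    normalized in-document count and idf = log(|D| / document frequency)."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))
    weights = []
    for doc in tokenized:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({t: (c / total) * math.log(len(docs) / df[t])
                        for t, c in counts.items()})
    return weights
```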

Page 13:

Single-Document Summarization: Supervised Methods (cont.)

• Lin (1999) broke away from the assumption that features are independent of each other
– He modeled the problem of sentence extraction using decision trees instead of a naive-Bayes classifier (see the sketch below)
– Some novel features were introduced in his paper:
· Query signature: normalized score given to sentences depending on the number of query words that they contain
· IR signature: score given to sentences depending on the number and scores of IR signature words included (the m most salient words in the corpus)
· Average lexical connectivity: the number of terms shared with other sentences divided by the total number of sentences in the text
· Numerical data: value 1 when a sentence contains a number
· Proper name, pronoun or adjective, weekday or month, quotation (scored like the previous feature)
· Sentence length, sentence order
– Feature analysis suggested that the IR signature was a valuable feature, corroborating the early findings of Luhn (1958)
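A hypothetical sketch of such a classifier using scikit-learn; the toy feature layout (query signature, IR signature, lexical connectivity, numerical data, length, order) and the depth limit are illustrative assumptions, not details from Lin (1999):

```python
from sklearn.tree import DecisionTreeClassifier

# One row per sentence; y marks gold-standard summary sentences.
X_train = [[0.5, 0.8, 0.3, 1, 24, 1],   # toy values for illustration only
           [0.0, 0.2, 0.1, 0, 11, 5]]
y_train = [1, 0]

clf = DecisionTreeClassifier(max_depth=4)    # depth limit is an arbitrary choice
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_train)[:, 1]    # rank sentences by P(summary)
```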


Page 14:

• Conroy and O'Leary (2001) modeled the problem of extracting a sentence from a document using a hidden Markov model (HMM)

– The HMM was structured as follows:
· 2s + 1 states, alternating between summary states and non-summary states
· “Hesitation” was allowed only in non-summary states and “skipping next state” only in summary states
· The transition matrix M̂ can be estimated from a training corpus; element M̂_{i,j} is the empirical probability of transitioning from state i to state j
· Associated with each state i is an output function b_i(O) = P(O | state i); the features are assumed to be multivariate-normal distributed, and the training data is used to compute the maximum likelihood estimate of the mean and covariance matrix (with a shared covariance)
– Three features were used: the position of the sentence, the number of terms in the sentence, and the likeliness of the sentence terms given the document terms (a decoding sketch follows below)

Single-Document Summarization: Supervised Methods (cont.)

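A simplified decoding sketch follows: it collapses the 2s + 1 chain to two states (0 = non-summary, 1 = summary) with shared-covariance Gaussian outputs, and all parameter shapes are assumptions for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_extract(obs, trans, start, means, cov):
    """MAP state path for an (n_sentences, n_features) observation matrix;
    sentences decoded into state 1 are selected for the summary."""
    obs, trans, start = np.asarray(obs), np.asarray(trans), np.asarray(start)
    n, k = obs.shape[0], trans.shape[0]
    emit = np.array([[multivariate_normal.logpdf(o, means[s], cov)
                      for s in range(k)] for o in obs])
    delta = np.full((n, k), -np.inf)         # best log-probability so far
    back = np.zeros((n, k), dtype=int)       # backpointers
    delta[0] = np.log(start) + emit[0]
    for t in range(1, n):
        for s in range(k):
            cand = delta[t - 1] + np.log(trans[:, s])
            back[t, s] = int(np.argmax(cand))
            delta[t, s] = cand[back[t, s]] + emit[t, s]
    path = [int(np.argmax(delta[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```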

Page 15:

Single-Document Summarization: Supervised Methods (cont.)

• Osborne (2002) used log-linear models to obviate the assumption of feature independence
– Let c be a label, s the item we are interested in labeling, f_i the i-th feature, and λ_i the corresponding feature weight
– The conditional log-linear model can be stated as follows:

$$P(c \mid s) = \frac{1}{Z(s)} \exp\Big( \sum_i \lambda_i f_i(c, s) \Big)$$

– The authors added a non-uniform prior P(c) to the model, claiming that a plain log-linear model tends to reject too many sentences for inclusion in a summary:

$$\text{label}(s) = \arg\max_{c \in C} P(c)\, P(c \mid s) = \arg\max_{c \in C} \Big[ \log P(c) + \sum_i \lambda_i f_i(c, s) \Big]$$

– The features included word pairs, sentence length, sentence position, and naive discourse features like “inside introduction” or “inside conclusion” (see the sketch below)
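A minimal sketch of this decision rule, assuming the feature functions, weights, and prior have been estimated elsewhere; the normalizer Z(s) is constant in c and can be dropped:

```python
def loglinear_label(s, feature_fns, weights, log_prior,
                    labels=("summary", "non-summary")):
    """Pick the label maximizing log P(c) + sum_i lambda_i * f_i(c, s)."""
    def score(c):
        return log_prior[c] + sum(w * f(c, s) for f, w in zip(feature_fns, weights))
    return max(labels, key=score)
```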

Page 16:

Single-Document Summarization: Supervised Methods (cont.)

• Svore et al. (2007) proposed an algorithm based on neural nets and the use of third-party datasets to perform extractive summarization
– They trained a model that could infer the proper ranking of sentences
· The ranking was accomplished using RankNet, which is based on neural networks
· For the training set, they used ROUGE-1 to score the similarity of a human-written highlight and a sentence in the document
· These similarity scores were used as “soft labels” during training, contrasting with other approaches where sentences are “hard-labeled” as selected or not
– Another novelty of the framework lay in the use of features that derived information from query logs of Microsoft's news search engine and from Wikipedia entries (third-party datasets)
· They conjectured that if a document sentence contains keywords used in the news search engine, or entities found in Wikipedia articles, then there is a greater chance of that sentence appearing in the highlight
– They generated 10 features for each sentence in each document:
· Is first sentence, sentence position, SumBasic score (unigram), SumBasic bigram score, title similarity score, average news query term score, news query term sum score, relative news query term score, average Wikipedia entity score, Wikipedia entity sum score


Page 17:

Single-Document Summarization: Supervised Methods (cont.)

• Other kinds of supervised summarizers include:
– Support vector machines (SVM) (Hirao et al. 2002)

– Gaussian Mixture Models (GMM) (Murray et al. 2005)

– Conditional Random Fields (CRFs) (Shen et al. 2007)

• In general, extractive summarization can be treated as a two-class (summary/non-summary) classification problem (Lin et al. 2009)
– A sentence S_i is represented by a set of M features X_i = {x_{i,1}, …, x_{i,m}, …, x_{i,M}}
– To summarize documents at different summarization ratios, the important sentences of a document can be selected (or ranked) based on the posterior probability P(S_i | X_i) of a sentence being included in the summary given its feature set

Page 18:

Single-Document Summarization: Unsupervised Methods

• Gong (2001) proposed using the vector space model (VSM)
– Vector representations of the sentences and of the document to be summarized are built using statistical weighting such as TF-IDF
– Sentences are ranked based on their proximity to the document, as in the cosine formula below
· Maximum Marginal Relevance (MMR) (Murray et al. 2005) can be applied so that the summary covers the important and mutually different concepts of a document (see the sketch below)

[Figure: sentence vectors S_i and the document vector D in a two-dimensional vector space]

$$sim(S_i, D) = \frac{\vec{S_i} \cdot \vec{D}}{\|\vec{S_i}\|\, \|\vec{D}\|}$$

$$S_i^{MMR} = \arg\max_{S_i \notin \text{Summ}} \Big[ a \cdot Sim(S_i, D) - (1 - a) \cdot Sim(S_i, \text{Summ}) \Big]$$
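A greedy selection sketch of MMR; the similarity inputs are assumed precomputed (e.g., cosine similarities from the VSM above), and the trade-off value a is illustrative:

```python
def mmr_select(sim_doc, sim_sent, n_select, a=0.7):
    """sim_doc[i] = Sim(S_i, D); sim_sent[i][j] = Sim(S_i, S_j).
    Greedily pick sentences that are relevant yet non-redundant."""
    selected, candidates = [], set(range(len(sim_doc)))
    while candidates and len(selected) < n_select:
        def mmr(i):
            redundancy = max((sim_sent[i][j] for j in selected), default=0.0)
            return a * sim_doc[i] - (1 - a) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```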

Page 19:

Single-Document Summarization: Unsupervised Methods (cont.)

• Latent Semantic Analysis (LSA) (Gong 2001)
– Construct a “term-sentence” matrix for a given document
– Perform Singular Value Decomposition (SVD) on the “term-sentence” matrix
· The right singular vectors with larger singular values represent the dimensions of the more important latent semantic concepts in the document
· Each sentence of the document is represented as a semantic vector in the reduced space (see the sketch below)

[Figure: SVD of the term-sentence matrix, A = U Σ V^T, where A is the J × M matrix of J content words by M sentences (row j holds the information of word w_j, column i the information of sentence S_i), U is the left singular vector matrix, Σ = diag(σ_1, σ_2, …, σ_K) is the singular value matrix, and V^T is the right singular vector matrix]
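A sketch of the selection step under these assumptions: for each leading latent concept, pick the not-yet-chosen sentence with the largest entry in the corresponding right singular vector:

```python
import numpy as np

def lsa_summary(A, n_sentences=3):
    """A is the J x M term-sentence matrix; returns indices of selected
    sentences, one per leading latent concept."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    for k in range(min(n_sentences, Vt.shape[0])):
        order = np.argsort(-np.abs(Vt[k]))   # sentences ranked for concept k
        pick = next(i for i in order if i not in chosen)
        chosen.append(int(pick))
    return sorted(chosen)
```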

Page 20:

Single-Document Summarization: Unsupervised Methods (cont.)

• Probabilistic Generative Framework (Chen et al. 2009)
– Criterion: maximum a posteriori (MAP), as in the formula below
– Sentence generative model P(D | S_i)
· Each sentence of the document is treated as a probabilistic generative model
· Language Model (LM), Sentence Topic Model (STM) and Word Topic Model (WTM) approaches are initially investigated
– Sentence prior model P(S_i)
· The sentence prior is simply set to uniform here
· It may instead reflect duration/position, correctness of sentence boundaries, confidence scores, prosodic information, etc.

$$P(S_i \mid D) = \frac{P(D \mid S_i)\, P(S_i)}{P(D)} \overset{\text{rank}}{=} P(D \mid S_i)\, P(S_i)$$

where P(D | S_i) is the sentence generative model and P(S_i) is the sentence prior model

Page 21:

Single-Document Summarization: Unsupervised Methods (cont.)

– Language Model (LM) Approach (Literal Term Matching):

$$P_{LM}(D \mid S_i) = \prod_{w_j \in D} \big[ \lambda\, P(w_j \mid S_i) + (1 - \lambda)\, P(w_j \mid C) \big]^{c(w_j, D)}$$

where P(w_j | S_i) is the sentence model, P(w_j | C) is the collection model, λ is a weighting parameter, and c(w_j, D) is the count of w_j in D (a scoring sketch of this approach follows below)

– Sentence Topic Model (STM) Approach (Concept Matching):

$$P_{STM}(D \mid S_i) = \prod_{w_j \in D} \Big[ \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid S_i) \Big]^{c(w_j, D)}$$

– Word Topic Model (WTM) Approach (Concept Matching):

$$P_{WTM}(D \mid S_i) = \prod_{w_j \in D} \Big[ \sum_{w_m \in S_i} \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid M_{w_m}) \Big]^{c(w_j, D)}$$
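A sketch of the LM scoring rule, assuming tokenized input and a background collection count table; the interpolation weight lam and the tiny smoothing floor are illustrative choices:

```python
import math
from collections import Counter

def lm_score(doc_tokens, sent_tokens, collection_counts, lam=0.5):
    """log P_LM(D | S_i): each document word is generated by interpolating
    the sentence unigram MLE with a collection (background) model."""
    sent = Counter(sent_tokens)
    coll_total = sum(collection_counts.values())
    logp = 0.0
    for w, c in Counter(doc_tokens).items():
        p_s = sent[w] / len(sent_tokens)                  # P(w | S_i)
        p_c = collection_counts.get(w, 0) / coll_total    # P(w | C)
        logp += c * math.log(lam * p_s + (1 - lam) * p_c + 1e-12)
    return logp  # rank the sentences of D by this likelihood (times the prior)
```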

Page 22:

Multi-Document Summarization

• Task Characteristics
– Input: a set of documents on the same topic
· Retrieved during an IR search
· Clustered by a news browser
· Problem: same topic or same event?
– Output: a paragraph-length summary
· Salient information across documents
· Similarities between topics?
– Redundancy removal is critical
• Application-oriented task
– News portals, presenting articles from different sources
– Corporate emails organized by subject

– Medical reports about a patient


Page 23:

Evaluation

• Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin 2004)
– Let R = {r_1, …, r_m} be a set of reference summaries, and let s be a summary generated automatically by a system; let d_n be a binary vector representing the n-grams contained in a document d
– The metric ROUGE-N is an n-gram recall based statistic:

$$\text{ROUGE-N}(s) = \frac{\sum_{r \in R} \langle r_n, s_n \rangle}{\sum_{r \in R} \langle r_n, r_n \rangle}$$

where ⟨·, ·⟩ denotes the usual inner product of vectors

– The various versions of ROUGE were evaluated by computing the correlation coefficient between ROUGE scores and human judgment scores

ROUGE-2 performed the best among the ROUGE-N variants

Example (n-gram overlap between a reference and a system summary):
昨天 馬英九 訪問 中國大陸 (Yesterday, Ma Ying-jeou visited mainland China)
昨天 馬英九 結束 訪問 回國 (Yesterday, Ma Ying-jeou ended his visit and returned home)
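A count-based sketch of ROUGE-N recall (clipped n-gram matches over total reference n-grams), one common reading of the inner-product form above:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system_tokens, reference_token_lists, n=2):
    """Clipped n-gram matches between system and references, divided by the
    total number of n-grams in the references (a recall statistic)."""
    sys_ngrams = ngrams(system_tokens, n)
    matched = total = 0
    for ref in reference_token_lists:
        ref_ngrams = ngrams(ref, n)
        total += sum(ref_ngrams.values())
        matched += sum(min(c, sys_ngrams[g]) for g, c in ref_ngrams.items())
    return matched / total if total else 0.0
```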

Page 24:

Evaluation (cont.)

• Lin et al. (2006) also proposed an information-theoretic method for the automatic evaluation of summaries
– The central idea is to use a divergence measure (i.e., the Jensen-Shannon divergence) between a pair of probability distributions
· The first distribution is derived from an automatic summary and the second from a set of reference summaries
– Let D = {d_1, …, d_n} be the set of documents to summarize
· A distribution parameterized by θ_R generates the reference summaries
· A summarization system is governed by some distribution θ_A
· We may define a good summarizer as one for which θ_A is close to θ_R
· One information-theoretic measure between distributions that is adequate for this is the KL divergence:

$$KL(p_A \,\|\, p_R) = \sum_{i} p_A(i) \log \frac{p_A(i)}{p_R(i)}$$

· However, the KL divergence is unbounded: it goes to infinity whenever p_R(i) vanishes and p_A(i) does not
· Another problem is that the KL divergence is not symmetric

Page 25:

Evaluation (cont.)

– Hence, they propose to use the Jensen-Shannon (JS) divergence, which is bounded and symmetric:

$$JS(p_A \,\|\, p_R) = \frac{1}{2} KL(p_A \,\|\, r) + \frac{1}{2} KL(p_R \,\|\, r) = H(r) - \frac{1}{2} H(p_A) - \frac{1}{2} H(p_R)$$

where r = ½ (p_A + p_R) and H(·) denotes entropy

– To evaluate a summary S_A against a reference summary S_R, the negative JS divergence can be used (see the sketch below):

$$\text{Score}(S_A \mid S_R) = -JS(p_{S_A} \,\|\, p_{S_R})$$
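A small numpy sketch of this measure, assuming the two summaries have been turned into word distributions over a shared vocabulary:

```python
import numpy as np

def js_divergence(p_a, p_r):
    """Jensen-Shannon divergence between two probability vectors."""
    p_a, p_r = np.asarray(p_a, float), np.asarray(p_r, float)
    r = 0.5 * (p_a + p_r)
    def kl(p, q):
        mask = p > 0                     # 0 * log(0/q) is taken as 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return 0.5 * kl(p_a, r) + 0.5 * kl(p_r, r)

# Score(S_A | S_R) = -js_divergence(p_SA, p_SR): closer to 0 means better.
```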