Improving Text Categorization Bootstrapping via Unsupervised Learning

Presenter: Bo-Sheng Wang

Authors: Alfio Gliozzo, Ido Dagan

TSLP, 2009

Outline

• Motivation
• Objectives
• Methodology
• Evaluation
• Experiments
• Conclusions
• Comments

Motivation

• Supervised systems for text categorization require large amounts of hand-labeled texts

• Intensional Learning (IL) inherently suffers from a score scaling problem and has very little information about the intension of a category.

Objectives

• Investigate and address two specific weaknesses that inherently affect the IL schema.

Latent Semantic Indexing (LSI)

Gaussian Mixture Algorithm

Methodology-Latent Semantic Indexing

Vector Semantic Model
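
The slide content for this step did not survive, so the following is only a minimal sketch of the general idea: documents and category seed terms are represented as term vectors and compared by cosine similarity. It uses a plain bag-of-words space; LSI would compute the same cosine after projecting both vectors into a lower-dimensional latent space obtained by singular value decomposition. The function names and the toy seeds are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn whitespace-separated text into a term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_categories(document, seeds):
    """Score a document against each category's seed terms, best first.

    `seeds` maps a category name to a list of seed terms (hypothetical data).
    """
    doc_vec = bag_of_words(document)
    scores = {
        name: cosine_similarity(doc_vec, bag_of_words(" ".join(terms)))
        for name, terms in seeds.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_seeds = {
        "sports": ["football", "match", "team"],
        "finance": ["bank", "stock", "market"],
    }
    for name, score in rank_categories("the team won the football match", toy_seeds):
        print(name, round(score, 3))
```

In the plain term space a document that shares no word with the seeds scores zero, which is the sparsity problem a latent space is meant to soften.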

Methodology-Gaussian Mixture Algorithm

• This paper proposes mapping the similarity values into class posterior probabilities using unsupervised estimation of Gaussian mixtures.
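
A minimal sketch of this mapping, under stated assumptions: the similarity scores for one category are modeled as a mixture of two one-dimensional Gaussians (one component for documents outside the category, one for documents inside it), the mixture is fitted with plain EM without any labels, and the posterior of the higher-mean component is read as the probability of category membership. The function names, the initialization, and the toy scores are illustrative choices, not the authors' implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(scores, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture to unlabeled scores with EM.

    Component 0 is initialized at the lowest score and component 1 at the
    highest, so component 1 plays the role of "in the category".
    """
    low, high = min(scores), max(scores)
    weights = [0.5, 0.5]
    means = [low, high]
    spread = max((high - low) ** 2 / 4.0, min_var)
    variances = [spread, spread]

    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            joint = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in range(2)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / total for j in joint])

        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            if mass < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = mass / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / mass
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / mass
            variances[k] = max(var, min_var)

    return weights, means, variances


def posterior_in_category(score, weights, means, variances):
    """Posterior probability of the higher-mean component, by Bayes' rule."""
    joint = [weights[k] * gaussian_pdf(score, means[k], variances[k]) for k in range(2)]
    total = sum(joint)
    return joint[1] / total if total > 0.0 else 0.5


if __name__ == "__main__":
    # Hypothetical similarity scores of nine documents for one category.
    toy_scores = [0.05, 0.08, 0.10, 0.12, 0.07, 0.09, 0.80, 0.85, 0.90]
    params = fit_two_gaussians(toy_scores)
    for s in (0.06, 0.88):
        print(s, posterior_in_category(s, *params))
```

Because the output is a probability rather than a raw similarity, scores for different categories become comparable on a common scale, which is the score scaling issue the slide refers to.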

Seeds

Evaluation-Impact of LSI Similarity and GM on IL Performance

Evaluation-Extensional vs. Intensional Learning

• A major point of comparison between IL and EL is the amount of supervision required to obtain a given level of performance.

Experiments

Conclusions

• We obtained competitive performance using only the category names as initial seeds.

• The approach drastically reduces the number of seeds required while significantly improving performance.

Comments

• Advantages: Performance

• Disadvantage: Time

• Applications: Text Mining
