Text Categorization
PengBo, 10/31/2010
Today's Outline
- Text Categorization: problem definition
- Building a classifier: Naïve Bayes classifier, K-Nearest Neighbor classifier
- Evaluation
Definition
Given:
- An instance x ∈ X, where X is the instance language or instance space. (Issue: how to represent text documents.)
- A fixed set of categories C = {c1, c2, …, cn}
Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function.
We want to know how to build categorization functions ("classifiers").
Text Categorization Examples
Assign labels to each document or web page:
- Labels are most often topics, such as Yahoo categories: e.g., "finance", "sports", "news>world>asia>business"
- Labels may be genres: e.g., "editorials", "movie-reviews", "news"
- Labels may be opinions: e.g., "like", "hate", "neutral"
- Labels may be domain-specific and binary: e.g., "interesting-to-me" vs. "not-interesting-to-me"; "spam" vs. "not-spam"; "contains adult language" vs. "doesn't"
Classification Methods
人工分类 Manual classification Used by Yahoo!, Looksmart, about.com, ODP, Medline Accurate but expensive to scale
自动文本分类 Automatic document classification 基于规则: Hand-coded rule-based systems
Spam mail filter,… 有监督的学习: Supervised learning of a document-
label assignment function No free lunch: requires 人工标注的训练集 hand-
classified training data Note that many commercial systems use a
mixture of methods
Think about it…
How do we represent text documents and categories?
- Vectors & regions
- Strings & language (models)
How do we build categorization functions?
- Closeness/similarity to regions
- Probability of generating the string under a class's language model
K-Nearest Neighbors
Classes in a Vector Space
[Figure: training documents plotted in vector space, clustered into three class regions: Government, Science, Arts]
Classification Using Vector Spaces
- Each training doc is a point (vector) labeled by its topic (= class)
- Hypothesis: docs of the same class form a contiguous region of space
- We define surfaces to delineate classes in space
[Figure: a test document falling inside the Government region among the Government, Science, and Arts regions is labeled Government]
Is the similarity hypothesis true in general?
k Nearest Neighbor Classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d
- Count the number i of documents in N that belong to c
- Estimate P(c|d) as i/k
- Choose as class argmax_c P(c|d) [= the majority class]
A sketch of this procedure follows below.
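A minimal sketch in Python, assuming cosine similarity over sparse term-weight dicts; the names (cosine, knn_classify) and the data layout are illustrative assumptions, not from the slides.

import math
from collections import Counter

def cosine(u, v):
    # u, v: sparse vectors as dicts mapping term -> weight
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(d, train, k=5):
    # train: list of (vector, label) pairs.
    # Take the k training docs most similar to d and return the majority label.
    neighbors = sorted(train, key=lambda ex: cosine(d, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]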
Example: k = 6 (6NN)
[Figure: a test document and its 6 nearest neighbors among the Government, Science, and Arts regions. What is P(science | test doc)?]
Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x: compute the similarity between x and all examples in D, and assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called: case-based learning, memory-based learning, lazy learning.
Why k?
Using only the single closest example to determine the categorization is subject to errors due to:
- a single atypical example;
- noise (i.e., error) in the category label of a single training example.
A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
The value of k is typically odd to avoid ties; 3 and 5 are most common.
kNN decision boundaries
[Figure: decision boundaries between the Government, Science, and Arts regions]
Boundaries are in principle arbitrary surfaces, but usually polyhedra.
Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric.
- The simplest for a continuous m-dimensional instance space is Euclidean distance.
- The simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
- For text, cosine similarity of tf.idf-weighted vectors is typically most effective (a sketch follows below).
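A minimal tf.idf sketch, assuming raw term frequency and log idf — one common convention; the slides do not pin down the exact variant. The resulting vectors can be fed to the knn_classify sketch above.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one dict (term -> tf.idf weight) per doc.
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    idf = {t: math.log(n / d) for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

# Example:
docs = [["love", "is", "patient"], ["love", "is", "kind"]]
vecs = tfidf_vectors(docs)   # feed these to knn_classify above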
Illustration of 3 Nearest Neighbor for Text Vector Space
Nearest Neighbor with Inverted Index
- Naively, finding the nearest neighbors requires a linear search through all |D| documents in the collection.
- But determining the k nearest neighbors is the same as determining the top-k best retrievals, using the test document as a query against a database of training documents.
- So use standard vector-space inverted-index methods to find the k nearest neighbors.
- Testing time: O(B|Vt|), where |Vt| is the number of distinct terms in the test document and B is the average number of training documents in which a test-document word appears. Typically B << |D|.
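A sketch of the retrieval-style approach, assuming simple dict-based postings lists; build_index and top_k are illustrative names.

import heapq
from collections import defaultdict

def build_index(train_vecs):
    # train_vecs: list of dicts (term -> weight). Postings: term -> [(doc_id, weight)].
    index = defaultdict(list)
    for doc_id, vec in enumerate(train_vecs):
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def top_k(query_vec, index, k):
    # Accumulate dot-product scores over the postings of query terms only.
    # For unit-length vectors the dot product equals cosine similarity.
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, w in index.get(term, ()):
            scores[doc_id] += qw * w
    return heapq.nlargest(k, scores, key=scores.get)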
kNN: Discussion
- No training necessary
- No feature selection necessary
- Scales well with a large number of classes: no need to train n classifiers for n classes
- Classes can influence each other: small changes to one class can have a ripple effect
- Scores can be hard to convert to probabilities
Naïve Bayes
Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values D = \langle x_1, x_2, \ldots, x_n \rangle, into one of the classes c_j \in C.

c_{MAP} = \operatorname{argmax}_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
        = \operatorname{argmax}_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
        = \operatorname{argmax}_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)
Naïve Bayes Assumption
- P(c_j): can be estimated from the frequency of classes in the training examples.
- P(x_1, x_2, \ldots, x_n \mid c_j): O(|X|^n \cdot |C|) parameters; could only be estimated if a very, very large number of training examples was available.
[Figure: Bayes net with class node Flu and feature nodes X1…X5: fever, sinus, cough, runny nose, muscle ache]
Conditional Independence Assumption
Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i \mid c_j). Features detect term presence and are independent of each other given the class:

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)
Learning the Model
First attempt: maximum likelihood estimates — simply use the frequencies in the data:

\hat{P}(c_j) = \frac{N(C = c_j)}{N}

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}

[Figure: Bayes net with class node C and feature nodes X1…X6]
Problem with Maximum Likelihood
What if we have seen no training cases where the patient had no flu but did have muscle aches? Then the maximum likelihood estimate is zero:

\hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0

Zero probabilities cannot be conditioned away, no matter the other evidence:

\operatorname{argmax}_c \hat{P}(c) \prod_i \hat{P}(x_i \mid c)
Smoothing to Eliminate Zeros
Add-one (Laplace) smoothing:

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}

where k is the number of values of X_i. This acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in. (A sketch follows below.)
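A one-function sketch of the smoothed estimate; the Counter-based counting interface is an illustrative assumption.

from collections import Counter

def smoothed_estimate(counts, class_counts, x, c, k):
    # Add-one estimate of P(X = x | C = c).
    # counts: Counter over (value, class) pairs; class_counts: Counter over classes;
    # k: number of possible values of attribute X.
    return (counts[(x, c)] + 1) / (class_counts[c] + k)

# Example: no "no-flu with muscle aches" cases were seen, yet the estimate stays nonzero:
counts = Counter({("muscle-ache", "flu"): 6})
class_counts = Counter({"flu": 10, "no-flu": 8})
p = smoothed_estimate(counts, class_counts, "muscle-ache", "no-flu", k=2)  # (0+1)/(8+2) = 0.1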
Document Generative Model
"Love is patient, love is kind."
Basic idea: a bag of words. Two generative models:
- A binary independence model — multivariate binomial (Bernoulli) generation: feature X_i is a term; the value of X_i is 1 or 0, indicating whether term X_i is present in the doc or not.
- A multinomial unigram language model — multinomial generation: feature X_i is a term position; the value of X_i is the term at position i; position-independent.
[Figure: bag-of-words illustration with the words "love", "is", "patient", "kind"]
Bernoulli Naive Bayes Classifiers
Multivariate binomial model:
- One feature X_w for each word in the dictionary
- X_w = true in document d if w appears in d
Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.
[Figure: bag-of-words illustration with the words "love", "is", "patient", "kind"]
Multinomial Naive Bayes Classifiers II
Multinomial = class-conditional unigram:
- One feature X_i for each word position in the document; the feature's values are all the words in the dictionary
- The value of X_i is the word in position i
Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions.

c_{NB} = \operatorname{argmax}_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)
       = \operatorname{argmax}_{c_j \in C} P(c_j)\, P(x_1 = \text{``our''} \mid c_j) \cdots P(x_n = \text{``text''} \mid c_j)

Still too many possibilities. So make a second assumption: classification is independent of the positions of the words — word appearance does not depend on position. Just have one multinomial feature predicting all words, i.e., use the same parameters for each position.
Multinomial Naive Bayes Classifiers

P(X_i = w \mid c) = P(X_j = w \mid c) \quad \text{for all positions } i, j, \text{ words } w, \text{ and classes } c

Parameter estimation:
- Binomial model: \hat{P}(X_w = t \mid c_j) = the fraction of documents of topic c_j in which word w appears.
- Multinomial model: \hat{P}(X_i = w \mid c_j) = the fraction of times word w appears across all documents of topic c_j.
Naive Bayes algorithm (Multinomial model)
Naive Bayes algorithm (Bernoulli model)
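A minimal Python sketch of both models, following the add-one-smoothed formulation developed above; the function names and the (token_list, class) data layout are illustrative assumptions, not the slides' own pseudocode.

import math
from collections import Counter, defaultdict

def train_multinomial(docs):
    # docs: list of (token_list, class). Returns (priors, cond_probs, vocab).
    vocab = {t for d, _ in docs for t in d}
    priors, cond = {}, defaultdict(dict)
    for c in {c for _, c in docs}:
        class_docs = [d for d, cc in docs if cc == c]
        priors[c] = len(class_docs) / len(docs)
        tf = Counter(t for d in class_docs for t in d)
        total = sum(tf.values())
        for t in vocab:  # add-one smoothing over the vocabulary
            cond[c][t] = (tf[t] + 1) / (total + len(vocab))
    return priors, cond, vocab

def apply_multinomial(priors, cond, vocab, doc):
    # Score every token occurrence; ignore terms outside the training vocabulary.
    scores = {c: math.log(p) + sum(math.log(cond[c][t]) for t in doc if t in vocab)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

def train_bernoulli(docs):
    # Counts document frequency per term rather than token frequency.
    vocab = {t for d, _ in docs for t in d}
    priors, cond = {}, defaultdict(dict)
    for c in {c for _, c in docs}:
        class_docs = [set(d) for d, cc in docs if cc == c]
        priors[c] = len(class_docs) / len(docs)
        for t in vocab:  # add-one smoothing over the two outcomes present/absent
            df = sum(1 for d in class_docs if t in d)
            cond[c][t] = (df + 1) / (len(class_docs) + 2)
    return priors, cond, vocab

def apply_bernoulli(priors, cond, vocab, doc):
    # Bernoulli scores every vocabulary term, whether present in the doc or absent.
    present = set(doc)
    scores = {}
    for c, p in priors.items():
        s = math.log(p)
        for t in vocab:
            s += math.log(cond[c][t] if t in present else 1 - cond[c][t])
        scores[c] = s
    return max(scores, key=scores.get)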
NB Example
c(5)=?
Multinomial NB Classifier
- Feature likelihood estimates
- Posterior
Result: c(5) = China
NB Example
c(5) = ?
Bernoulli NB Classifier
- Feature likelihood estimates
- Posterior
Result: c(5) ≠ China
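The numeric tables for these two worked examples are not reproduced here. Assuming the slides follow the standard worked example from IIR Ch13 — training docs d1 "Chinese Beijing Chinese", d2 "Chinese Chinese Shanghai", d3 "Chinese Macao" labeled China (c), d4 "Tokyo Japan Chinese" labeled not-China (\bar{c}), and test doc d5 "Chinese Chinese Chinese Tokyo Japan" — the multinomial computation with add-one smoothing would be:

\hat{P}(c) = 3/4, \qquad \hat{P}(\bar{c}) = 1/4
\hat{P}(\text{Chinese} \mid c) = (5+1)/(8+6) = 3/7, \qquad \hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) = (0+1)/(8+6) = 1/14
\hat{P}(\text{Chinese} \mid \bar{c}) = (1+1)/(3+6) = 2/9, \qquad \hat{P}(\text{Tokyo} \mid \bar{c}) = \hat{P}(\text{Japan} \mid \bar{c}) = (1+1)/(3+6) = 2/9
P(c \mid d_5) \propto \tfrac{3}{4} \cdot (\tfrac{3}{7})^3 \cdot \tfrac{1}{14} \cdot \tfrac{1}{14} \approx 0.0003
P(\bar{c} \mid d_5) \propto \tfrac{1}{4} \cdot (\tfrac{2}{9})^3 \cdot \tfrac{2}{9} \cdot \tfrac{2}{9} \approx 0.0001

so the multinomial model picks China, while the analogous Bernoulli computation, which also multiplies in (1 − p) for the absent terms Beijing, Shanghai, and Macao, prefers not-China.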
Classification
Multinomial vs Multivariate binomial?
Multinomial is in general better
Classification Evaluation
Let's think about it…
How do we run evaluation experiments for classifiers?
- Dataset: training set / test set
- Measures: recall, precision, F1, accuracy
- Generalization performance
Classic Reuters Data Set
- The most (over)used data set: 21,578 documents
- 9,603 training and 3,299 test articles (ModApte split)
- 118 categories; an article can be in more than one category, so learn 118 binary category distinctions
- Average document: about 90 types, 200 tokens
- Average number of classes assigned: 1.24 for docs with at least one category
- Only about 10 of the 118 categories are large. Common categories (#train, #test):
  • Earn (2877, 1087) • Acquisitions (1650, 179) • Money-fx (538, 179) • Grain (433, 149) • Crude (389, 189)
  • Trade (369, 119) • Interest (347, 131) • Ship (197, 89) • Wheat (212, 71) • Corn (182, 56)
Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia
- Speed of training a statistical classifier: some methods are very cheap, some very costly
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions: kNN and methods with complex preprocessing requirements
- Effort in creating the training set or hand-built classifier: human hours per topic
Measuring Classification: Figures of Merit
In the real world: economic measures. Your choices are:
- Do no classification: that has a cost (hard to compute)
- Do it all manually: an easy-to-compute cost, if you are doing it that way now
- Do it all with an automatic classifier: mistakes have a cost
- Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
Commonly the last method is the most cost-efficient and is the one adopted.
Per-class evaluation measures
Let c_{ij} be the number of documents whose actual class is i and whose predicted class is j.
- Recall: the fraction of docs in class i classified correctly:

  R_i = \frac{c_{ii}}{\sum_j c_{ij}}

- Precision: the fraction of docs assigned class i that are actually about class i:

  P_i = \frac{c_{ii}}{\sum_j c_{ji}}

- "Correct rate" (1 − error rate): the fraction of docs classified correctly:

  \frac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}}

[Figure: contingency table over classes A, B, C; rows are the actual class, columns the predicted class]
A sketch computing these measures follows below.
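A small sketch computing per-class recall and precision from a confusion matrix, assuming a nested-list layout c[i][j] with no all-zero rows or columns; the matrix values in the example are made up for illustration.

def per_class_measures(c):
    # c[i][j]: number of docs of actual class i predicted as class j.
    n = len(c)
    recall = [c[i][i] / sum(c[i]) for i in range(n)]
    precision = [c[i][i] / sum(c[j][i] for j in range(n)) for i in range(n)]
    return recall, precision

# Example with 3 classes A, B, C:
conf = [[10, 2, 1],
        [3, 12, 0],
        [1, 1, 8]]
r, p = per_class_measures(conf)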
Measuring Classification
- Overall error rate: not a good measure for small classes. Why? A classifier that never assigns a rare class already achieves a low error rate on it.
- Precision/recall for classification decisions.
- F1 measure: 1/F1 = 1/2 (1/P + 1/R).
- Correct estimate of the size of the category. Why is this different? Stability over time / category drift.
- Utility: the costs of false positives and false negatives may be different; for example, cost = tp − 0.5·fp.
Generalization Performance
- Results can vary based on sampling error due to different training and test sets.
- Average results over multiple training and test sets (splits of the overall data) for the best estimates.
- Ideally, test and training sets would be independent on each trial — but this would require too much labeled data.
Good practice department
N-fold cross-validation (a sketch follows below):
- Partition the data into N equal-sized disjoint segments.
- Run N trials, each time using a different segment of the data for testing and training on the remaining N − 1 segments.
- This way, at least the test sets are independent.
- Report the average classification accuracy over the N trials.
- Typically, N = 10.
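A minimal sketch of the procedure, with train_fn and accuracy_fn as assumed caller-supplied hooks.

import random

def cross_validate(data, train_fn, accuracy_fn, n=10, seed=0):
    # data: list of labeled examples; train_fn(train) -> classifier;
    # accuracy_fn(classifier, test) -> float. Returns mean accuracy over n folds.
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]  # n disjoint, near-equal segments
    accs = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accs.append(accuracy_fn(train_fn(train), test))
    return sum(accs) / n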
Good practice department II
Learning curves:
- We would like to know how performance varies with the number of training instances.
- Learning curves plot classification accuracy on independent test data (y-axis) versus the number of training examples (x-axis).
- One can do both of the above and produce learning curves averaged over multiple trials from cross-validation.
How to combine multiple measures?
If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes into one contingency table, then evaluate.
Micro- vs. Macro-Averaging: Example

Class 1:
                  Truth: yes   Truth: no
Classifier: yes       10           10
Classifier: no        10          970

Class 2:
                  Truth: yes   Truth: no
Classifier: yes       90           10
Classifier: no        10          890

Micro-averaged table:
                  Truth: yes   Truth: no
Classifier: yes      100           20
Classifier: no        20         1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
Why this difference? Microaveraging is dominated by the large class (Class 2), while macroaveraging gives every class equal weight.
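The same numbers in a short sketch, with the per-class (tp, fp) counts taken from the tables above.

def precision(tp, fp):
    return tp / (tp + fp)

per_class = [(10, 10), (90, 10)]   # (tp, fp) for Class 1 and Class 2

macro = sum(precision(tp, fp) for tp, fp in per_class) / len(per_class)  # 0.7
micro = precision(sum(tp for tp, _ in per_class),
                  sum(fp for _, fp in per_class))                        # 100/120 ≈ 0.83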
Exercise
Federalist papers 1787-1788 年间由
Hamilton, Jay and Madison 用笔名发表的 77 篇短文,来劝说NY 支持 US Constitution
其中 12 篇 papers 的作者存在争议
谁是作者? 谁是作者?
53
Author identification
In 1964, Mosteller and Wallace solved the problem: Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.
- It is a text categorization problem
- They identified 70 function words as good candidates for authorship analysis
- Using statistical inference, they concluded the author was Madison
[Figure: feature selection feeding a classifier]
Function Words for Author Identification
[Figure: the function words used for author identification]
Summary
Definition: the category of x is c(x) ∈ C
K-Nearest Neighbor
Naïve Bayes
- Bayesian methods: Bernoulli NB classifier, multinomial NB classifier
Categorization evaluation
- Training data / test data
- Over-fitting & generalization

c_{NB} = \operatorname{argmax}_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)
       = \operatorname{argmax}_{c_j \in C} P(c_j)\, P(x_1 = \text{``our''} \mid c_j) \cdots P(x_n = \text{``text''} \mid c_j)
Thank You!
Q&A
Readings
[1] IIR, Ch. 13 and Ch. 14.2.
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.
Bernoulli trial
A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".
Binomial Distribution
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
Multinomial Distribution
The multinomial distribution is a generalization of the binomial distribution:
- each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk;
- there are n independent trials;
- we can use a random variable Xi to indicate the number of times outcome number i was observed over the n trials.
Bayes' Rule

P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}

where P(h) is the prior, P(D \mid h) the likelihood, and P(h \mid D) the posterior.
Use Bayes Rule to Gamble
Someone draws an envelope at random and offers to sell it to you. How much should you pay?
Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's red: now how much should you pay?
Prosecutor's fallacy
You win the lottery jackpot. You are then charged with having cheated, for instance by having bribed lottery officials.
At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely.
Maximum a posteriori Hypothesis
h_{MAP} = \operatorname{argmax}_{h \in H} P(h \mid D)
        = \operatorname{argmax}_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \operatorname{argmax}_{h \in H} P(D \mid h)\, P(h) \qquad \text{(as } P(D) \text{ is constant)}
Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D \mid h) term:

h_{ML} = \operatorname{argmax}_{h \in H} P(D \mid h)
Likelihood
A likelihood function is a conditional probability function considered as a function of its second argument with its first argument held fixed. Given a parameterized family of probability density functions

x \mapsto f(x \mid \theta)

where \theta is the parameter, the likelihood function is

L(\theta \mid x) = f(x \mid \theta)

where x is the observed outcome of an experiment:
- when f(x \mid \theta) is viewed as a function of x with \theta fixed, it is a probability density function;
- when viewed as a function of \theta with x fixed, it is a likelihood function.
Reuters Text Categorization data set (Reuters-21578) document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT><TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
New Reuters: RCV1: 810,000 docs
Top topics in Reuters RCV1
PKU Tianwang (北大天网): Chinese Web Page Classification
- Dozens of students from different majors were organized to manually build a large-scale Chinese web page sample collection based on a hierarchical category model.
- It includes 12,336 training web page instances and 3,269 test instances, distributed over 12 top-level classes and 733 categories in total; each category has on average 17 training instances and 4.5 test instances.
- Tianwang provides the sample collection free to interested researchers; product number: YQ-WEBENCH-V0.8.
- Chinese Information Retrieval Forum: www.cwirf.org. Classification evaluations are run at SEWM, the national conference on search engines and web mining.
PKU Tianwang: Chinese Web Page Classification

ID   Category name             #Categories   #Training   #Test
1    Humanities & Arts              24            419      110
2    News & Media                    7            125       19
3    Business & Economy             48            839      214
4    Entertainment & Leisure        88           1510      374
5    Computers & Internet           58            925      238
6    Education                      18            286       85
7    Countries & Cultures           53            891      235
8    Natural Science               113           1892      514
9    Government & Politics          18            288       84
10   Social Science                104           1765      479
11   Health & Medicine             136           2295      616
12   Society & Culture              66           1101      301
     Total                         733          12336     3269
Concept Drift
- Categories change over time.
- Example: "president of the united states" — in 1999, "clinton" was a great feature; in 2002, "clinton" was a bad feature.
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection can hurt robustness against concept drift.
Think about it…
- Describe the process — can you tell what it is?
- How do you do it?
- Why do we talk about it here? What does it mean for information overload?
[Figure: an eagle vs. a broiler chicken]
Recall: Vector Space Representation
- Each document is a vector, one component for each term (= word).
- Normalize to unit length.
- High-dimensional vector space: terms are axes; 10,000+ dimensions, or even 100,000+; docs are vectors in this space.