Programming Collective Intelligence - Chapter 6: Document Filtering

Document Filtering: Programming Collective Intelligence, Ch. 6

허윤

Document Filtering

Filtering == Classification Problem

Data Mining Problem

Estimation, Classification, Prediction

Clustering, Description

Affinity Grouping

Document? A set of features -> text document, image, etc.

p( document ) = ?

Spam Filtering

Binary Classification Problem

‘Spam’ or ‘Ham’

Techniques

Naïve Bayesian Classifier

Support Vector Machine

Decision Tree

Rule vs. Model: pros and cons

Spam Filtering in Practice

Source: Sahil Puri et al., “Comparison and Analysis of Spam Detection Algorithms”, 2013, IJAIEM

Source: Rene, “New insights into Gmail’s spam filtering”, 2012, emailmarketingtipps.de

Naïve Bayesian Classifier

Bayes Theorem

Naïve?

Bayes' theorem with a strong independence assumption

The classifier ignores the evidence term

Posterior1 > Posterior2 or Posterior1 < Posterior2

Example

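1. Probability that box A is selected: P( A ) = 7 / 10

2. Probability of drawing a white ball from box A: P( white | A ) = 2 / 10

3. Probability of selecting A from the bag and drawing a white ball from box A

4. Probability of a white ball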

Example

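A white ball came out of one of the boxes... What is the probability that it came from A? P( A | white )

What is the probability that it came from B? P( B | white )

P( A | white ) = ?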
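The question on this slide can be worked through numerically. The sketch below is only an illustration: P( A ) = 7 / 10 and the 2 / 10 chance of a white ball from A come from the slides, while the assumption that A and B are the only boxes and the value used for the chance of a white ball from B are hypothetical, since the transcript does not preserve them.

```python
from fractions import Fraction

def posterior(prior_a, like_a, prior_b, like_b):
    """Return P(A | white) for two boxes A and B using Bayes' rule."""
    joint_a = prior_a * like_a    # P(A and white)
    joint_b = prior_b * like_b    # P(B and white)
    evidence = joint_a + joint_b  # P(white), summed over both boxes
    return joint_a / evidence

p_a = Fraction(7, 10)              # from the slide
p_white_given_a = Fraction(2, 10)  # from the slide
p_b = 1 - p_a                      # assumption: A and B are the only boxes
p_white_given_b = Fraction(1, 2)   # hypothetical value, not from the slide

p_a_given_white = posterior(p_a, p_white_given_a, p_b, p_white_given_b)
print(p_a_given_white)  # 14/29 under the assumed values above
```

With a different value for the white-ball chance in B the posterior changes, but the computation stays the same.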

Bayes Rule

❶ Conditional Prob. A given B ❷ Conditional Prob. B given A

❸ Bayes Rule
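The formulas for these three steps were not preserved in the transcript. The standard statement, written with generic events A and B, is sketched here; the third line follows from solving the first two for P(A ∩ B) and equating them.

```latex
\begin{align}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \tag{1} \\
P(B \mid A) &= \frac{P(A \cap B)}{P(A)} \tag{2} \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)} \tag{3}
\end{align}
```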

Document Representation: extracting words from a document
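The code for this slide did not survive in the transcript. A minimal sketch of word extraction, assuming a function named getwords (the name appears later in the deck) and hypothetical length bounds of 3 to 19 characters, could look like this:

```python
import re

def getwords(doc, min_len=3, max_len=19):
    """Split text on non-word characters and return the unique lowercase words.

    min_len and max_len are illustrative assumptions, not values from the slides.
    """
    words = (w.lower() for w in re.split(r"\W+", doc))
    return {w for w in words if min_len <= len(w) <= max_len}

features = getwords("Nobody owns the water.")
```

Returning a set means each word counts at most once per document, which matches treating a document as a set of features.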

Implementation: Preparation


Representation of Classifier

{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}

# getwords

How to access dict
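The nested dictionary shown above maps each feature to per-category counts. One way to update and read it safely is sketched below; the helper names incf and fcount and the second dictionary cc are assumptions made for illustration.

```python
from collections import defaultdict

# feature -> category -> count, mirroring the nested dict on the slide
fc = defaultdict(lambda: defaultdict(int))
# category -> number of documents seen (hypothetical companion structure)
cc = defaultdict(int)

def incf(feature, category):
    """Increase the count of a feature/category pair by one."""
    fc[feature][category] += 1

def fcount(feature, category):
    """Read a count; .get avoids creating entries for unseen pairs."""
    return fc.get(feature, {}).get(category, 0)

# Rebuild the counts from the slide
for _ in range(6):
    incf("python", "good")
for _ in range(3):
    incf("the", "good")
    incf("the", "bad")
```

Reading through fcount returns 0 for pairs that were never counted, so lookups do not grow the dictionary as a side effect.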


Training
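The training code itself is missing from the transcript. A minimal sketch, assuming a Classifier class that counts features per category and a getwords feature extractor (both names and the two sample sentences are illustrative), might be:

```python
import re
from collections import defaultdict

def getwords(doc):
    # Assumed extractor: unique lowercase words of 3 to 19 characters
    return {w.lower() for w in re.split(r"\W+", doc) if 2 < len(w) < 20}

class Classifier:
    def __init__(self, getfeatures):
        self.fc = defaultdict(lambda: defaultdict(int))  # feature -> category -> count
        self.cc = defaultdict(int)                       # category -> document count
        self.getfeatures = getfeatures

    def train(self, item, category):
        """Count every feature of the item once for the given category."""
        for feature in self.getfeatures(item):
            self.fc[feature][category] += 1
        self.cc[category] += 1

    def fcount(self, feature, category):
        return self.fc.get(feature, {}).get(category, 0)

cl = Classifier(getwords)
cl.train("the quick brown fox jumps over the lazy dog", "good")
cl.train("make quick money in the online casino", "bad")
```

After these two calls, "quick" has been seen once in each category and "casino" only in "bad".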


Result


Recall

Bayesian Theorem

p( category | doc ) = p( doc | category ) * p( category ) / p( doc )

Implementation : Classifier

P( feature | category ) as prior

Assumed Probability to resolve data sparseness
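One common way to apply an assumed probability is a weighted average between the assumed value and the observed ratio, so that rarely seen features are pulled toward the assumption. The sketch below is a hedged illustration; the parameter names and the defaults of weight 1.0 and assumed probability 0.5 are assumptions, not values stated on the slide.

```python
def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    """Blend an observed probability with an assumed one.

    basic_prob:  observed P(feature | category) from raw counts
    total_count: how often the feature was seen across all categories
    """
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)

# A feature never seen at all falls back to the assumed probability.
unseen = weighted_prob(basic_prob=0.0, total_count=0)
# A feature seen once, never in this category, lands between 0 and 0.5.
seen_once = weighted_prob(basic_prob=0.0, total_count=1)
```

As total_count grows, the result approaches basic_prob and the assumption matters less.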


Results


P( document | category ) as likelihood
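Under the independence assumption, the likelihood of a document is the product of its per-feature probabilities. A minimal sketch follows; the function name doc_prob, the callable feature_prob, and the probability table are hypothetical stand-ins for whatever the original implementation used.

```python
from math import prod

def doc_prob(features, category, feature_prob):
    """Multiply P(feature | category) over all features of a document."""
    return prod(feature_prob(f, category) for f in features)

# Hypothetical per-feature probabilities, for illustration only
table = {("quick", "good"): 0.5, ("rabbit", "good"): 0.25}
likelihood = doc_prob(["quick", "rabbit"], "good", lambda f, c: table[(f, c)])
```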


P( document | category ) * p( category )


Classifying
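Putting the pieces together, classification picks the category with the largest p( document | category ) * p( category ), since the evidence term is the same for every category. The sketch below is an assumption-laden illustration rather than the original code: the class name NaiveBayes, its method names, the smoothing defaults, and the sample sentences are all chosen for this example.

```python
import re
from collections import defaultdict

def getwords(doc):
    # Assumed extractor: unique lowercase words of 3 to 19 characters
    return {w.lower() for w in re.split(r"\W+", doc) if 2 < len(w) < 20}

class NaiveBayes:
    def __init__(self, getfeatures, weight=1.0, assumed_prob=0.5):
        self.fc = defaultdict(lambda: defaultdict(int))  # feature -> category -> count
        self.cc = defaultdict(int)                       # category -> document count
        self.getfeatures = getfeatures
        self.weight, self.assumed_prob = weight, assumed_prob

    def train(self, item, category):
        for f in self.getfeatures(item):
            self.fc[f][category] += 1
        self.cc[category] += 1

    def feature_prob(self, f, category):
        """Smoothed P(feature | category); category must have been trained."""
        counts = self.fc.get(f, {})
        basic = counts.get(category, 0) / self.cc[category]
        total = sum(counts.values())
        return (self.weight * self.assumed_prob + total * basic) / (self.weight + total)

    def score(self, item, category):
        """P(document | category) * P(category), ignoring the evidence term."""
        p = self.cc[category] / sum(self.cc.values())
        for f in self.getfeatures(item):
            p *= self.feature_prob(f, category)
        return p

    def classify(self, item):
        return max(self.cc, key=lambda c: self.score(item, c))

nb = NaiveBayes(getwords)
for text, label in [
    ("Nobody owns the water.", "good"),
    ("the quick rabbit jumps fences", "good"),
    ("buy pharmaceuticals now", "bad"),
    ("make quick money at the online casino", "bad"),
    ("the quick brown fox jumps", "good"),
]:
    nb.train(text, label)

print(nb.classify("quick rabbit"), nb.classify("quick money"))
```

A production filter would usually add per-category thresholds so that borderline messages are not marked as spam; that refinement is left out here.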


Result


Recall: Naïve Bayesian Classifier

Fisher’s Method

Fisher’s Method

First, p( document | category ) = p( feature_1 | category ) * p( feature_2 | category ) * … * p( feature_N | category )

p( category | document ) ??

p( category | feature ) = (# of documents having feature in category) / (# of documents having feature)
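The ratio on this slide can be computed directly from the feature counts. The sketch below follows the ratio exactly as stated here and reuses the count dictionary shown earlier in the deck; the function name category_prob and the choice to return None for unseen features are assumptions for illustration.

```python
from fractions import Fraction

def category_prob(fc, feature, category):
    """Return P(category | feature) as a ratio of document counts.

    fc maps feature -> category -> number of documents containing the feature.
    Returns None when the feature has never been seen.
    """
    counts = fc.get(feature, {})
    total = sum(counts.values())
    if total == 0:
        return None
    return Fraction(counts.get(category, 0), total)

# Counts taken from the dictionary shown earlier in the deck
fc = {"python": {"bad": 0, "good": 6}, "the": {"bad": 3, "good": 3}}
p_good_given_python = category_prob(fc, "python", "good")
p_bad_given_the = category_prob(fc, "the", "bad")
```

Here "python" points entirely to "good", while "the" is split evenly and so carries no evidence either way.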

Q&A

Thank You
