Programming Collective Intelligence - Chapter 6: Document Filtering
Document Filtering: Programming Collective Intelligence, Ch. 6
Yun Her (허윤)
Document Filtering
Filtering == Classification Problem
Data Mining Problems
Estimation, Classification, Prediction
Clustering, Description
Affinity Grouping
Document? A set of features -> text document, image, etc.
p( document ) = ?
Spam Filtering
Binary Classification Problem
‘Spam’ or ‘Ham’
Techniques
Naïve Bayesian Classifier
Support Vector Machine
Decision Tree
Rule vs. Model: pros and cons
Spam Filtering in Practice
Source: Sahil Puri et al., “Comparison and Analysis of Spam Detection Algorithms”, 2013, IJAIEM
Source: Rene, “New insights into Gmail’s spam filtering”, 2012, emailmarketingtipps.de
Naïve Bayesian Classifier
Bayes Theorem
Naïve?
Bayes' theorem with a strong independence assumption
The classifier ignores the evidence term
Posterior1 > Posterior2 or Posterior1 < Posterior2
Example
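1. Probability that box A is selected: P( A ) = 7 / 10
2. Probability of drawing a white ball from box A: P( white ball | A ) = 2 / 10
3. Probability of picking A from the pouch and then drawing a white ball from box A
4. Probability of a white ball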
Example
A white ball turned up from somewhere...
Probability that it came from A? P( A | white ball )
Probability that it came from B? P( B | white ball )
P( A | white ball ) = ?
Bayes Rule
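❶ Conditional probability of A given B: P( A | B ) = P( A, B ) / P( B )
❷ Conditional probability of B given A: P( B | A ) = P( A, B ) / P( A )
❸ Bayes' rule: P( A | B ) = P( B | A ) * P( A ) / P( B )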
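The box example can be worked through numerically. This is only a sketch: P( A ) and P( white ball | A ) come from the example above, while the two-box setup (so P( B ) = 1 - P( A )) and the value of P( white ball | B ) are assumptions made purely for illustration, since the transcript does not give them.

```python
from fractions import Fraction


def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence


# Values stated in the example.
p_a = Fraction(7, 10)              # P(A)
p_white_given_a = Fraction(2, 10)  # P(white ball | A)

# ASSUMPTIONS (not in the slides): there are only two boxes, A and B,
# and P(white ball | B) is a made-up value chosen for illustration.
p_b = 1 - p_a
p_white_given_b = Fraction(9, 10)

# Total probability of drawing a white ball (the evidence term).
p_white = p_a * p_white_given_a + p_b * p_white_given_b

p_a_given_white = posterior(p_a, p_white_given_a, p_white)
p_b_given_white = posterior(p_b, p_white_given_b, p_white)

print(p_a_given_white, p_b_given_white)
```

Under these assumed numbers, box B is the more likely source of the white ball even though box A is picked more often, because the likelihood term outweighs the prior.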
Document Representation: extracting words from the document
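A minimal sketch of word extraction, assuming ASCII text and treating each distinct lowercase word as one feature. The function name `getwords` matches the name mentioned later in the transcript; the length limits (`min_len`, `max_len`) and the choice to return a set are assumptions of this sketch.

```python
import re


def getwords(doc, min_len=3, max_len=19):
    """Return the set of distinct lowercase words found in doc."""
    # Split on any run of characters that are not ASCII letters.
    tokens = re.split(r"[^A-Za-z]+", doc)
    # Very short and very long tokens are dropped; empty strings
    # produced by the split are dropped by the same length check.
    return {t.lower() for t in tokens if min_len <= len(t) <= max_len}


print(sorted(getwords("Nobody owns the water. THE water!")))
```

Because a set is returned, a word counts once per document no matter how often it appears.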
Implementation: Preparation
Representation of Classifier
{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}
# getwords
How to access dict
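One way to read and update the nested feature-count dict shown above. The helper names `incf` and `fcount` are illustrative and not confirmed by the transcript, and the `'casino'` feature is a made-up example.

```python
# Feature counts per category, as in the representation above.
fc = {'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}


def incf(fc, feature, category):
    """Increase the count of a feature/category pair by one."""
    fc.setdefault(feature, {})
    fc[feature][category] = fc[feature].get(category, 0) + 1


def fcount(fc, feature, category):
    """Return how often a feature was seen in a category (0 if never)."""
    return fc.get(feature, {}).get(category, 0)


incf(fc, 'casino', 'bad')  # hypothetical new feature
print(fcount(fc, 'python', 'good'), fcount(fc, 'casino', 'bad'))
```

Using `get` with a default avoids a `KeyError` for features or categories that have not been seen yet.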
Implementation: Preparation
Training
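Training amounts to counting: for each labeled document, increment the count of every feature in that category and the document count of the category. The sketch below assumes a class named `Classifier` with attributes `fc` and `cc`; these names and the two sample sentences are illustrative, not taken from the slides.

```python
import re
from collections import defaultdict


def getwords(doc):
    """Distinct lowercase words of 3 to 19 letters (an assumed feature extractor)."""
    return {w.lower() for w in re.split(r"[^A-Za-z]+", doc) if 2 < len(w) < 20}


class Classifier:
    def __init__(self, getfeatures):
        # fc[feature][category] -> number of documents with that feature
        self.fc = defaultdict(lambda: defaultdict(int))
        # cc[category] -> number of documents in that category
        self.cc = defaultdict(int)
        self.getfeatures = getfeatures

    def train(self, item, category):
        """Count every feature of item under category, then count the document."""
        for feature in self.getfeatures(item):
            self.fc[feature][category] += 1
        self.cc[category] += 1


cl = Classifier(getwords)
# Made-up training documents.
cl.train("the quick brown fox jumps", "good")
cl.train("make quick money in the online casino", "bad")
print(dict(cl.fc["quick"]), dict(cl.cc))
```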
Implementation: Preparation
Result
Implementation: Preparation
Recall
Bayesian Theorem
p( category | doc ) = p( doc | category ) * p( category ) / p( doc )
Implementation : Classifier
P( feature | category ) as prior
Assumed Probability to resolve data sparseness
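A sketch of how an assumed probability can smooth sparse counts: the raw estimate is blended with a default guess, and the guess loses influence as more observations arrive. The default values `weight=1.0` and `assumed_prob=0.5`, and the function names, are assumptions of this sketch.

```python
def fprob(feature_count, category_count):
    """Raw estimate of P(feature | category) from counts."""
    if category_count == 0:
        return 0.0
    return feature_count / category_count


def weightedprob(basic_prob, total_seen, weight=1.0, assumed_prob=0.5):
    """Weighted average of the assumed probability and the raw estimate.

    total_seen is how many times the feature appeared across all categories.
    """
    return (weight * assumed_prob + total_seen * basic_prob) / (weight + total_seen)


# A feature never seen before falls back to the assumed probability.
print(weightedprob(fprob(0, 0), total_seen=0))
# A feature seen once, always in this category, is pulled toward 0.5.
print(weightedprob(fprob(1, 1), total_seen=1))
```

Without this step, a word seen only once in a 'bad' document would get probability 0 for 'good', which is too extreme a conclusion from a single observation.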
Implementation : Classifier
Results
Implementation : Classifier
P( document | category ) as likelihood
Implementation : Classifier
P( document | category ) * p( category )
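Under the independence assumption, P( document | category ) is the product of the per-feature probabilities, and multiplying by P( category ) gives a score proportional to the posterior. A minimal sketch, with the function names `docprob` and `score` chosen for illustration; the probabilities are passed in directly so the block stands alone.

```python
import math


def docprob(feature_probs):
    """P(document | category) as a product of P(feature | category) values."""
    return math.prod(feature_probs)


def score(feature_probs, category_prior):
    """P(document | category) * P(category); the evidence P(document) is ignored."""
    return docprob(feature_probs) * category_prior


# Made-up numbers: two features with probability 0.5 each, prior 0.5.
print(score([0.5, 0.5], 0.5))
```

For long documents the product can underflow; summing logarithms instead is a common alternative, at the cost of no longer returning a probability.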
Implementation : Classifier
Classifying
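Classification then picks the category with the highest score. The sketch below also supports optional per-category thresholds, so that a category such as 'bad' is chosen only when its score clearly beats the others; the thresholds, the `default` label, and the function name are assumptions of this sketch rather than something the transcript shows.

```python
def classify(scores, default="unknown", thresholds=None):
    """Return the best-scoring category, or default if the win is not clear.

    scores maps category -> P(document | category) * P(category).
    thresholds maps category -> factor by which it must beat every other score.
    """
    thresholds = thresholds or {}
    if not scores:
        return default
    best = max(scores, key=scores.get)
    factor = thresholds.get(best, 1.0)
    for category, value in scores.items():
        if category == best:
            continue
        # Reject the winner if some other category is too close.
        if value * factor > scores[best]:
            return default
    return best


print(classify({"good": 0.2, "bad": 0.1}))
print(classify({"good": 0.1, "bad": 0.2}, thresholds={"bad": 3.0}))
```

For spam filtering this asymmetry is useful, since marking a legitimate mail as spam is usually costlier than letting a spam mail through.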
Implementation : Classifier
Result
Implementation : Classifier
Recall: Naïve Bayesian Classifier
Fisher’s Method
First, p( document | category ) = p( feature_1 | category ) * p( feature_2 | category ) * … * p( feature_N | category )
p( category | document ) ??
p( category | feature ) = ( # of documents having feature in category ) / ( # of documents having feature )
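A sketch of the count-based p( category | feature ) above, followed by one usual way of combining the per-feature values in Fisher's method: multiply them, take -2 ln of the product, and evaluate it with an inverse chi-square function with 2N degrees of freedom. The combination step is not spelled out in the transcript, so treat it and the names `cprob`, `invchi2`, and `fisherprob` as assumptions.

```python
import math


def cprob(docs_with_feature_in_category, docs_with_feature):
    """p(category | feature) from document counts."""
    if docs_with_feature == 0:
        return 0.0
    return docs_with_feature_in_category / docs_with_feature


def invchi2(chi, df):
    """Upper-tail chi-square probability for an even number of degrees of freedom."""
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisherprob(feature_probs):
    """Combine p(category | feature) values into a single score in [0, 1]."""
    if any(p <= 0.0 for p in feature_probs):
        return 0.0  # log(0) is undefined; smoothing would normally prevent this
    product = math.prod(feature_probs)
    fscore = -2.0 * math.log(product)
    return invchi2(fscore, 2 * len(feature_probs))


# Made-up per-feature probabilities for two categories.
print(fisherprob([0.9, 0.9]), fisherprob([0.1, 0.1]))
```

With a single feature the combined score reduces to that feature's own probability, which is a quick sanity check on the inverse chi-square step.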
Q&A
Thank You