Progress Report 2007-11-09 15:31:39
Midterm Report 2007-11-09
Inuiguchi Lab, M2: 氏久達博 (Tatsuhiro Ujihisa)
• Background and objectives
• Fuzzy-Rough Sets Theory
• Certainty Qualification
• Web Categorization
• Experiments
• Current progress and future work
Background
• We want to automatically classify the enormous volume of documents on the WWW (World Wide Web)
• The input features, such as the occurrence frequency of each word in a target document, are high-dimensional
• We want to reduce the number of dimensions so that redundant information is cut while automatic classification remains highly accurate
Background
• Transformation-based methods [5] destroy the semantics of the data during reduction
• Entropy-based methods [6] require a threshold parameter
• We need a method that relies only on information obtained from the data itself and loses no necessary information
Background
• R. Jensen and Q. Shen proposed a web categorization method using fuzzy-rough reduction [1]
• The case study in [1] reported classification accuracy of up to 74.9%
Objective
• Using Certainty Qualification [2], a better lower approximation than the conventional one can be constructed. Our objective is to use it to perform web categorization by the method of [1] with higher accuracy.
Fuzzy-Rough Sets
• An extension of rough sets in which the equivalence relation is replaced by a similarity relation
• Real-valued data can be used as-is, without discretization
Fuzzy-Rough Sets
• A fuzzy set Q is approximated by a family of fuzzy sets Φ, i.e. by the equivalence classes [x]_P of a similarity relation P
• It is expressed as a pair of approximations: the lower approximation ("certainly contained") and the upper approximation ("partially overlapping")
• Object set X: crisp set
• Decision attribute Q: fuzzy set
• Partition Φ: level-two fuzzy set
• Fuzzy implicator I(a, b)
• Fuzzy negator n(a)
• Necessity measure N_A(B)
• Fuzzy-rough set (R_*(A), R^*(A))
Fuzzy lower approximation
• Fuzzy P-lower approximation of a fuzzy equivalence class F (P: condition attributes):

\mu_{\underline{P}X}(F) = \inf_x \max\{1 - \mu_F(x), \mu_X(x)\}

• Fuzzy P-lower approximation for an object x:

\mu_{\underline{P}X}(x) = \sup_{F \in U/P} \min\left(\mu_F(x), \mu_{\underline{P}X}(F)\right)
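The two formulas above can be sketched numerically. The slides contain no code, so the following is a minimal Python sketch assuming the min/max formulation shown on this slide; the toy universe, memberships, and all names are illustrative, not taken from the report.

```python
# Sketch of the fuzzy P-lower approximation from the slide.
# Toy data and names are illustrative assumptions.

def lower_approx_of_class(mu_F, mu_X):
    """mu_{P_X}(F) = inf_x max{1 - mu_F(x), mu_X(x)}."""
    return min(max(1.0 - f, x) for f, x in zip(mu_F, mu_X))

def lower_approx(x_idx, partition, mu_X):
    """mu_{P_X}(x) = sup_{F in U/P} min(mu_F(x), mu_{P_X}(F))."""
    return max(min(mu_F[x_idx], lower_approx_of_class(mu_F, mu_X))
               for mu_F in partition)

# Toy universe of 3 objects, two fuzzy equivalence classes, one fuzzy set X.
partition = [[0.8, 0.2, 0.0],   # membership of each object in class F1
             [0.2, 0.8, 1.0]]   # membership of each object in class F2
mu_X = [0.9, 0.3, 0.1]

memberships = [lower_approx(i, partition, mu_X) for i in range(3)]
# → approximately [0.8, 0.2, 0.1]
```

Note how the inner `min` realizes the inf over objects and the outer `max` the sup over classes, matching the two formulas.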
Fuzzy-Rough Reduct
• A minimal subset of P that preserves the following value is called a fuzzy-rough reduct:

\mu_{POS_P(Q)}(x) = \sup_{X \in U/Q} \mu_{\underline{P}X}(x)

\gamma_P(Q) = \frac{\sum_{x \in U} \mu_{POS_P(Q)}(x)}{|U|}

where POS_P(Q) is the fuzzy positive region.
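The positive region and dependency degree above can be sketched as follows; this is an illustrative Python sketch with toy data, assuming the min/max lower approximation, and is not code from the report.

```python
# Sketch of the fuzzy positive region mu_{POS_P(Q)} and the dependency
# degree gamma_P(Q). Toy data and names are illustrative assumptions.

def class_lower(mu_F, mu_X):
    # mu_{P_X}(F) = inf_x max{1 - mu_F(x), mu_X(x)}
    return min(max(1.0 - f, x) for f, x in zip(mu_F, mu_X))

def pos_membership(x_idx, partition_P, partition_Q):
    # mu_{POS_P(Q)}(x) = sup_{X in U/Q} sup_{F in U/P} min(mu_F(x), mu_{P_X}(F))
    return max(
        max(min(mu_F[x_idx], class_lower(mu_F, mu_X)) for mu_F in partition_P)
        for mu_X in partition_Q)

def gamma(partition_P, partition_Q, n_objects):
    # gamma_P(Q) = (sum over x of mu_{POS_P(Q)}(x)) / |U|
    return sum(pos_membership(i, partition_P, partition_Q)
               for i in range(n_objects)) / n_objects

# Two fuzzy condition classes over 3 objects, two crisp decision classes.
partition_P = [[0.8, 0.2, 0.0], [0.2, 0.8, 1.0]]
partition_Q = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
dependency = gamma(partition_P, partition_Q, 3)
# → approximately 0.8
```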
QUICKREDUCT
• A greedy algorithm for computing a fuzzy-rough reduct quickly
• [Embedded excerpt from [1], Fuzzy Sets and Systems 141 (2004), p. 477 — Fig. 2: "The fuzzy–rough QUICKREDUCT algorithm", with C = P and D = Q in the figure's notation]
• Unlike the crisp case, the fuzzy dependency need not reach 1 even for consistent data, so [1] normalizes by the dependency of the full attribute set (γ'); the algorithm starts from the empty set, repeatedly adds the attribute that most increases γ', and terminates when adding any remaining attribute no longer increases the dependency
• For n attributes the worst case requires (n² + n)/2 evaluations of the dependency function, but since reduction is done once, before the classifier runs, this cost does not affect the run-time of the resulting system [1]
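The greedy control flow of QUICKREDUCT can be sketched as below. This is an illustrative Python skeleton, not the report's implementation: the dependency function is passed in as a stand-in (here a toy lookup table), whereas the real one would be the fuzzy-rough γ' over the decision table.

```python
# Greedy QUICKREDUCT skeleton: repeatedly add the attribute that most
# increases the dependency degree; stop when no attribute improves it.
# `gamma` is a stand-in dependency function; the toy values are illustrative.

def quickreduct(attributes, gamma):
    reduct, best = set(), gamma(frozenset())
    while True:
        candidate, cand_gamma = None, best
        for a in attributes - reduct:
            g = gamma(frozenset(reduct | {a}))
            if g > cand_gamma:
                candidate, cand_gamma = a, g
        if candidate is None:      # no attribute increases the dependency
            return reduct
        reduct.add(candidate)
        best = cand_gamma

# Toy dependency table: 'a' alone gives 0.6, adding 'c' raises it to 0.8,
# and 'b' adds nothing, so the reduct should be {'a', 'c'}.
table = {frozenset(): 0.0,
         frozenset('a'): 0.6, frozenset('b'): 0.2, frozenset('c'): 0.3,
         frozenset('ab'): 0.6, frozenset('ac'): 0.8, frozenset('bc'): 0.4,
         frozenset('abc'): 0.8}
reduct = quickreduct({'a', 'b', 'c'}, table.__getitem__)
# → {'a', 'c'}
```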
Web Categorization
• [Embedded excerpt from [1], Fuzzy Sets and Systems 141 (2004), p. 481 — Fig. 5: "Modular decomposition of the classification system"]
• [1] deliberately uses conventional classifiers, to show the power of attribute reduction itself; better classifiers would improve absolute accuracy but not the reduced-versus-unreduced comparison
• In the bookmark-classification case study of [1], fuzzy RSAR reduced the attributes to around 35% of the original, and classification accuracy on the reduced data was almost as good as on the unreduced data
Keyword acquisition
• Extract the words that make up a document and store them with weights that reflect their importance
• In this experiment the TF-IDF value is used as the word weight (described later)
Keyword selection
• Reduce a decision table in which the objects are the documents, the condition attributes are the words, the condition attribute values are the word weights, and the decision attribute is each document's category
• This extracts the minimal set of words needed for web categorization
Classification
• Classify the test data using the reduced decision table
• Available classifiers include the vector space model (VSM) [4] and the Boolean inexact model (BIM) [3]
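As a concrete illustration of the VSM step, the sketch below classifies a test document by cosine similarity against training vectors over the selected keywords. This is a minimal Python sketch with made-up data, not the report's classifier.

```python
# Minimal vector space model (VSM) classifier sketch: each document is a
# weight vector over the selected keywords; a test document gets the
# category of its most similar training vector. All data are illustrative.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(doc, training):
    """training: list of (weight_vector, category) pairs."""
    return max(training, key=lambda t: cosine(doc, t[0]))[1]

training = [([0.9, 0.1, 0.0], "Arts"),
            ([0.1, 0.8, 0.7], "Computers and Internet")]
label = classify([0.0, 0.9, 0.5], training)
# → "Computers and Internet"
```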
TF-IDF
• A guideline for extracting the characteristic words of a document
• tf: number of occurrences of the target word in the target document
• N: total number of documents
• df: number of documents that contain the target word

tf \cdot \log\frac{N}{df}
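The weight formula above can be sketched directly; the toy corpus and names below are illustrative.

```python
# TF-IDF weight as on the slide: tf * log(N / df). Toy corpus; names are
# illustrative assumptions.
import math

def tfidf(tf, N, df):
    return tf * math.log(N / df)

docs = [["fuzzy", "rough", "set"],
        ["fuzzy", "web"],
        ["web", "page", "web"]]
N = len(docs)

def weight(term, doc):
    tf = doc.count(term)                    # occurrences in this document
    df = sum(term in d for d in docs)       # documents containing the term
    return tfidf(tf, N, df)

w = weight("web", docs[2])   # tf = 2, df = 2, so 2 * log(3/2)
```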
Certainty Qualification
• Guarantees the precision of the approximation
• Consider the largest A and the smallest B that satisfy N_A(B) ≥ q
• This changes \mu_{\underline{P}X}(x), and hence γ and everything derived from it
• The computation of the lower approximation is replaced by

\mu_{\underline{\Phi}(A)}(x) = \max_{i=1,2,\dots,n} \sigma^{[I]}\left(\mu_{F_i}(x), N_{F_i}(A)\right)

where

\sigma^{[I]}(a, b) = \inf_{0 \le h \le 1} \{\, h \mid I(a, h) \ge b \,\}

• This adds one extra level of inf computation
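The replaced lower approximation can be sketched numerically as below. The inner inf over h is done by grid search, which is exactly the extra level of inf computation the slide mentions. The Łukasiewicz implicator, the grid resolution, and the toy data are illustrative assumptions, not choices stated in the report.

```python
# Numeric sketch of the certainty-qualified lower approximation:
# sigma[I](a, b) = inf{h in [0,1] | I(a, h) >= b}, approximated by a grid
# search over h. Implicator, resolution, and toy data are assumptions.

def lukasiewicz(a, h):
    # I(a, h) = min(1, 1 - a + h)
    return min(1.0, 1.0 - a + h)

def sigma(a, b, I=lukasiewicz, steps=1000):
    # inf over h in [0, 1] of {h | I(a, h) >= b}, on a grid of `steps` points
    return min((k / steps for k in range(steps + 1)
                if I(a, k / steps) >= b), default=1.0)

def cq_lower(x_idx, partition, necessity):
    # mu(x) = max_i sigma(mu_{F_i}(x), N_{F_i}(A))
    return max(sigma(mu_F[x_idx], n) for mu_F, n in zip(partition, necessity))

partition = [[0.8, 0.2], [0.2, 0.8]]   # mu_{F_i}(x): two classes, two objects
necessity = [0.7, 0.4]                 # N_{F_i}(A) for each class
val = cq_lower(0, partition, necessity)
# for Lukasiewicz, sigma(a, b) = max(0, a + b - 1), so val is close to 0.5
```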
Problems with the proposed method
• The computational cost of constructing the certainty-qualification-based fuzzy lower approximation
• Even before QUICKREDUCT runs, building the lower approximation takes a long time
• Some approximate solution method will be needed
Experiments
• Perform web categorization both with and without certainty qualification, and show that the former achieves higher accuracy regardless of the classifier used
• Also compare the computation times
• Computed from 60 pages extracted from each of five Yahoo! Directories categories, such as Arts, Business and Economy, and Computers and Internet
• Total number of keywords: about 3000
• Each document is assumed to belong to exactly one category
Processing flow
• Prepare the documents in advance
• Extract all words from all documents
• Compute the TF-IDF value of every word in every document, normalize it into the range [0, 1], and store it in a database
• Compute the reduct of this table
Current progress
• Implemented KeywordAcquisition and KeywordSelection using fuzzy-rough sets without certainty qualification
• Sped up the processing with QUICKREDUCT
• Processing still takes a very long time: with 100 words it does not finish even after 5 hours
• Currently revising the program to make it faster
• Considering whether the case where a document belongs to more than one category needs to be handled
• Considering filtering of the words adopted as condition attributes
Future work
• Build the classifiers: BIM [3] and VSM [4]
• Compare against the reduct obtained with certainty-qualification-based fuzzy-rough sets
References
[1] R. Jensen, Q. Shen: Fuzzy-rough attribute reduction with application to web categorization, Fuzzy Sets and Systems 141 (2004) 469–485.
[2] M. Inuiguchi, T. Tanino: Fuzzy rough sets based on certainty qualifications, 2000.
[3] G. Salton, E.A. Fox, H. Wu: Extended Boolean information retrieval, Communications of the ACM, 1983.
[4] G. Salton, A. Wong, C.S. Yang: A vector space model for automatic indexing, Communications of the ACM, 1975.
[5] P. Devijver, J. Kittler: Pattern Recognition: A Statistical Approach, Prentice-Hall, 1982.
[6] T. Mitchell: Machine Learning, McGraw-Hill, 1997.