Post on 02-Jan-2016

Department of Computer Science, National Tsing Hua University, No. 101 Kuang Fu Road, Hsinchu 300, Taiwan

Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan

a Department of Computer Science, University of Vermont, Burlington, VT 05405, USA

b School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China

Integrating induction and deduction for noisy data mining

Presenter: 陳重光

Outline(1/2)

1. Introduction

2. Noise modeling with associative corruption rules

– A systematic noise handling framework

– Problem statement

• Input data

• Notations and definitions

– Method

• Algorithm ACF

• Algorithm ACB

Outline(2/2)

3. Experiments

– Experiment settings

– Experimental results

4. Conclusions

1. Introduction(1/3)

1. The main purpose of data mining is to discover knowledge from large amounts of data.

2. Data mining covers many topics, of which the three most fundamental are:

– Classification

– Cluster analysis

– Association analysis

3. Other classification methods

– Linear regression models

– Classification rules

– Neural networks

1. Introduction(2/3)

4. Data mining techniques have been applied to many fundamental research domains

– Biology

– Medicine

– Ecology

5. Two essential driving forces push data mining research forward:

– Large amounts of data

– Powerful hardware support

1. Introduction(3/3)

6. The main purpose of this paper is to study the integration of induction and deduction for noisy data mining.

Noise modeling with associative corruption rules

a) In large-scale data mining applications, erroneous entries in the data are almost unavoidable.

b) Such noise degrades the dataset's truthfulness, which directly affects data quality.

c) The robustness of data mining results relies crucially on the quality of the underlying data.

A systematic noise handling framework

• Noise from different sources can be traced by analyzing the erroneous data items, unless the noise is completely random.

• Gaussian noise follows a normal distribution with a certain mean and variance, so it can be regarded as a kind of systematic noise.
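To illustrate why Gaussian noise counts as systematic, here is a minimal sketch, assuming paired clean and noisy numeric values are available. The mean 0.5, the standard deviation 0.1, and all variable names are illustrative assumptions, not values from the slides. The point is only that the noise parameters can be recovered from the erroneous items.

```python
import random
import statistics

# Hypothetical setup: clean values plus additive Gaussian noise
# with an assumed mean of 0.5 and standard deviation of 0.1.
rng = random.Random(42)
clean = [rng.uniform(0, 10) for _ in range(10_000)]
noisy = [x + rng.gauss(0.5, 0.1) for x in clean]

# Analysing the erroneous items: the residuals expose the noise parameters.
residuals = [n - c for c, n in zip(clean, noisy)]
est_mean = statistics.fmean(residuals)
est_std = statistics.stdev(residuals)

print(f"estimated mean={est_mean:.3f}, estimated std={est_std:.3f}")
```

With enough samples, the estimates land close to the assumed parameters, which is what makes this kind of noise traceable.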

1. To understand the nature of the noise.

2. To eliminate the noise from the source data so as to benefit the subsequent data mining process.

[Diagram labels: Nature, Data]

Problem statement

a) To derive the associative rules that corrupt the original clean dataset Dc1.

b) To eliminate the noise from the noisy data Dobs and construct a robust learner for supervised learning.

Input data

a) A subset of instances suspected of containing noise is identified based on a certain criterion.

b) Error correction rules are proposed and applied to this subset of data.

Notations and definitions

• Before defining the concepts used in this study, we introduce the following notation:

– A: the set of feature attributes of Dobs; A = {A1, A2, ..., AN};

– C: the class attribute of Dobs;

– V: the value space of the corresponding attributes; V = {V1, V2, ..., VN, VC}, where Vi corresponds to Ai and Ai ∈ (A ∪ {C});

– H: a 2-tuple structure <Ai, vi>, where

• Ai ∈ (A ∪ {C}); Ai is called the Head of H;

• vi ∈ Vi; vi is the value of the Head, and Vi corresponds to Ai;

– T: a 2-tuple structure <p, v>, where

• p = <Ai, vi> is a structure H; Ai is called the Head of T;

• v ∈ Vi; v is the modified value of the Head, and Vi corresponds to Ai;

• vi ≠ v.
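The structures H and T can be pictured as small record types. The following is a minimal sketch, assuming attribute names and values are plain strings. The class layout, field names, and the example attribute "windy" are illustrative assumptions. Only the constraint vi ≠ v comes from the notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class H:
    """2-tuple <Ai, vi>: an attribute (the Head) and one of its values."""
    head: str   # Ai, a feature attribute or the class attribute
    value: str  # vi, a value from Vi


@dataclass(frozen=True)
class T:
    """2-tuple <p, v>: a structure H plus the modified value of its Head."""
    p: H
    v: str  # modified value, also drawn from Vi

    def __post_init__(self):
        # The notation requires vi != v: the value must actually change.
        if self.p.value == self.v:
            raise ValueError("modified value must differ from the original")

    @property
    def head(self) -> str:
        # The Head of T is the Head of its inner structure p.
        return self.p.head


# Hypothetical example: the value of "windy" is changed from "no" to "yes".
t = T(H("windy", "no"), "yes")
print(t.head, t.p.value, "->", t.v)
```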

Method

Problem description

[Figure: problem description. 1) Noise formation: clean data Dc1 is corrupted into noisy data Dobs1. 2) Noise correction: noisy data Dobs2 is corrected into Dcor2.]

• In this study, we propose a deductive learning procedure to derive these corruption rules.

• The idea follows a two-step approach.

– First, we propose an algorithm called ACF (Associative Corruption Forward) to learn the noise formation mechanism from Dc1 to Dobs1.

– Second, we propose an algorithm called ACB (Associative Corruption Backward) that corrects Dobs2.

Algorithm ACF

• Algorithm ACF is used to infer the set of AC rules R1 that corrupt Dc1.

• It employs classification rule induction.
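The slides do not spell out the induction procedure, so the following is only a rough sketch of the idea, not the ACF algorithm itself. It assumes Dc1 and Dobs1 are aligned lists of records, and it stands in a simple single-condition rule search for the classification rule induction step. The function name, the rule representation, and the toy attributes are all hypothetical.

```python
def induce_corruption_rules(clean, noisy):
    """Guess rules of the form (condition) => (attr: old -> new).

    clean, noisy: aligned lists of dicts mapping attribute -> value.
    Returns a list of (condition, (attr, old, new), precision) tuples.
    """
    # Step 1: collect every distinct value change observed between the datasets.
    changes = []
    for c, n in zip(clean, noisy):
        for attr in c:
            key = (attr, c[attr], n[attr])
            if c[attr] != n[attr] and key not in changes:
                changes.append(key)

    rules = []
    for attr, old, new in changes:
        # Instances that could have been corrupted: clean value equals `old`.
        candidates = [(c, n) for c, n in zip(clean, noisy) if c[attr] == old]
        best_score, best_cond = None, None
        # Step 2: pick the single condition that best predicts the change.
        for cond_attr in clean[0]:
            if cond_attr == attr:
                continue
            for cond_val in {c[cond_attr] for c, _ in candidates}:
                covered = [n for c, n in candidates if c[cond_attr] == cond_val]
                hits = sum(1 for n in covered if n[attr] == new)
                score = (hits / len(covered), hits)
                if best_score is None or score > best_score:
                    best_score, best_cond = score, (cond_attr, cond_val)
        if best_cond is not None:
            rules.append((best_cond, (attr, old, new), best_score[0]))
    return rules


# Toy data: "windy" flips from "no" to "yes" whenever "outlook" is "sunny".
d_c1 = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
d_obs1 = [dict(r) for r in d_c1]
d_obs1[0]["windy"] = "yes"
d_obs1[1]["windy"] = "yes"

rules = induce_corruption_rules(d_c1, d_obs1)
print(rules)
```

A real rule learner would handle multi-condition rules and partial corruption rates. This sketch only shows how paired clean and noisy data expose the rule.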

Algorithm ACB

• Algorithm ACB (Associative Corruption Backward) is used for noise correction.

• Noise correction is not a strict one-to-one mapping.

• ACB builds a Naive Bayes learner based on Dc1 for each noise-corrupted attribute.
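One way to read these bullets: a value that matches a rule's output may be a corrupted value or a genuine one, so a Naive Bayes model trained on Dc1 can choose between the candidates. The sketch below follows that reading, which is an assumption rather than the ACB algorithm as published. The class `CategoricalNB`, the function `correct`, and the toy attributes are hypothetical.

```python
import math
from collections import Counter, defaultdict


class CategoricalNB:
    """Tiny Naive Bayes over categorical attributes with Laplace smoothing."""

    def __init__(self, rows, target):
        self.features = [a for a in rows[0] if a != target]
        self.n = len(rows)
        self.class_counts = Counter(r[target] for r in rows)
        self.value_counts = defaultdict(Counter)  # (feature, class) -> value counts
        self.domain = defaultdict(set)            # feature -> seen values
        for r in rows:
            for f in self.features:
                self.value_counts[(f, r[target])][r[f]] += 1
                self.domain[f].add(r[f])

    def log_score(self, row, cls):
        # log P(cls) + sum of log P(feature value | cls), all smoothed.
        n_cls = self.class_counts[cls]
        score = math.log((n_cls + 1) / (self.n + len(self.class_counts)))
        for f in self.features:
            count = self.value_counts[(f, cls)][row[f]]
            score += math.log((count + 1) / (n_cls + len(self.domain[f])))
        return score

    def best(self, row, candidates):
        return max(candidates, key=lambda cls: self.log_score(row, cls))


def correct(noisy_rows, rule, model):
    """Undo one rule (cond) => (attr: old -> new) where the model prefers `old`."""
    (cond_attr, cond_val), (attr, old, new) = rule
    corrected = []
    for row in noisy_rows:
        fixed = dict(row)
        # A matching value may be corrupted or genuine, so let the model decide.
        if row[cond_attr] == cond_val and row[attr] == new:
            fixed[attr] = model.best(row, [old, new])
        corrected.append(fixed)
    return corrected


# Toy clean data Dc1: "windy" is "no" when hot and "yes" when cool.
d_c1 = [
    {"outlook": "sunny", "temp": "hot", "windy": "no"},
    {"outlook": "sunny", "temp": "hot", "windy": "no"},
    {"outlook": "rain", "temp": "hot", "windy": "no"},
    {"outlook": "sunny", "temp": "cool", "windy": "yes"},
    {"outlook": "rain", "temp": "cool", "windy": "yes"},
    {"outlook": "rain", "temp": "cool", "windy": "yes"},
]
model = CategoricalNB(d_c1, target="windy")
rule = (("outlook", "sunny"), ("windy", "no", "yes"))

d_obs2 = [
    {"outlook": "sunny", "temp": "hot", "windy": "yes"},   # likely corrupted
    {"outlook": "sunny", "temp": "cool", "windy": "yes"},  # likely genuine
    {"outlook": "rain", "temp": "hot", "windy": "yes"},    # rule does not apply
]
d_cor2 = correct(d_obs2, rule, model)
print([r["windy"] for r in d_cor2])
```

Only the first record is changed. The second matches the rule's output but the model judges "yes" more plausible, which is why a blind inversion of the rule would not be enough.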

Experiments

• Our experiments have two objectives.

– First, we want to examine whether algorithm ACF can accurately derive the AC rules.

– Second, we seek to verify whether our noise correction procedure ACB can produce a higher-quality dataset Dcor2 in terms of supervised learning.

Experiment settings

• We evaluate system performance on datasets collected from the UCI database repository and compare with reference [22].

• To evaluate the performance of the proposed method, we first separate Dclean into two parts:

– A dataset Dbase.

– A corresponding testing set Dtest.
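The split of Dclean can be sketched as a seeded random partition. The slides do not give the proportions, so the 30% test fraction, the seed, and the function name below are illustrative assumptions.

```python
import random


def split_clean(d_clean, test_fraction=0.3, seed=0):
    """Partition Dclean into (Dbase, Dtest) with a reproducible shuffle."""
    rows = list(d_clean)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Hypothetical usage with ten placeholder records.
d_base, d_test = split_clean(list(range(10)))
print(len(d_base), len(d_test))
```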

Experimental results

• The set of AC rules R that corrupts the original clean dataset may contain more than one AC rule. However, the following restrictions apply to R:

– Every rule in R is an AC rule;

– For any two rules in R, their right-hand sides differ from each other;

– If (P => Q) ∈ R, where Q = <p, v>, then predicate p does not appear on either the left-hand or the right-hand side of any other rule in R.
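These restrictions can be checked mechanically. Here is a minimal sketch, assuming each rule is stored as `(lhs, (p, v))`, where `lhs` is a list of `(attribute, value)` predicates and `p` is the predicate being modified. This representation and the example attributes are hypothetical. Because the slides do not give the full definition of an AC rule, the first restriction is approximated by the check vi ≠ v only.

```python
def satisfies_restrictions(rules):
    """Check a rule set R against the listed restrictions.

    Each rule is (lhs, (p, v)): lhs is a list of (attr, value) predicates,
    p is the (attr, value) predicate being modified, v is the new value.
    """
    for i, (_, (p_i, v_i)) in enumerate(rules):
        # Partial stand-in for "is an AC rule": the value must change.
        if p_i[1] == v_i:
            return False
        for j, (lhs_j, (p_j, v_j)) in enumerate(rules):
            if i == j:
                continue
            # Right-hand sides of any two rules must differ.
            if (p_i, v_i) == (p_j, v_j):
                return False
            # p must not appear on either side of any other rule.
            if p_i in lhs_j or p_i == p_j:
                return False
    return True


ok_rules = [
    ([("outlook", "sunny")], (("windy", "no"), "yes")),
    ([("temp", "hot")], (("humidity", "high"), "low")),
]
bad_rules = [
    ([("outlook", "sunny")], (("windy", "no"), "yes")),
    ([("windy", "no")], (("humidity", "high"), "low")),  # reuses a modified predicate
]
print(satisfies_restrictions(ok_rules), satisfies_restrictions(bad_rules))
```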

Basic information on the datasets for the experiment.

Comparative results of five models, m0 through m4. m0 is a benchmark learner built on the noise-free data Dclean.

Conclusions(1/2)

• We introduce a systematic noise handling framework in which deductive reasoning about noise information and inductive learning from the input data are neatly integrated.

• We propose a method to handle noise caused by Associative Corruption (AC) rules in supervised learning.

Conclusions(2/2)

• To correct Dobs2, we design a two-step method comprising algorithms ACF and ACB.

• In our experiments, we show that the method can infer the noise formation mechanism accurately and correct the noise appropriately, thereby enhancing the quality of the original dataset.