03 Data Preprocessing

Instructor: ดร.หทัยรัตน์ เกตุมณียรัตน์, Department of Production and Information Technology Management
Chapter 3: Preparing Data for Mining (Data Preprocessing)
336331 Data Mining


Transcript of 03 Data Preprocessing

  • Chapter 3: Preparing Data for Mining (Data Preprocessing)

    336331 Data Mining

  • Data in the real world are dirty:

    - Incomplete data: missing attribute values (Missing value), e.g., recorded as N/A
    - Noisy data: containing errors (Error) or outliers (Outliers)
    - Inconsistent data: values that disagree with each other

  • Example of dirty data:

    Cust_ID  Name  Income  Age  Birthday
    001      A     n/a     200  12/10/79
    002      -     $2000   25   27 Dec 81
    003      C     -10000  27   18 Feb 20

  • Major tasks in data preprocessing: 1) Data Cleaning, 2) Data Integration, 3) Data Transformation, 4) Data Reduction

  • Forms of data preprocessing (figure): data cleaning; data integration; data transformation, e.g., the values -2, 32, 100, 59, 48 normalized to -0.02, 0.32, 1.00, 0.59, 0.48; and data reduction, e.g., reducing 2,000 transactions (T1, T2, ..., T2000) with attributes A1...A126 to a smaller attribute set A1...A115.

  • 1) Data Cleaning: fill in missing values (Missing Value), smooth noisy data, and correct inconsistencies in the data.

  • Missing values (Missing value): attribute values that were not recorded, e.g., cells shown as ??? in the data.

  • How to handle missing values (Missing value):

    1. Ignore the tuple
    2. Fill in the missing value manually
    3. Use a global constant to fill in the missing value
    4. Use the attribute mean to fill in the missing value
    5. Use the attribute mean for all samples belonging to the same class as the given tuple
    6. Use the most probable value to fill in the missing value

  • Ignore the tuple: usually done when the class label is missing (in classification tasks); not effective when many attribute values are missing.

    Fill in the missing value manually: tedious and often infeasible for large data sets.

    Use a global constant to fill in the missing value: e.g., replace every missing value with the label "unknown".

  • Use the attribute mean to fill in the missing value: e.g., replace a missing income with the attribute's average, such as 12,000.

    Use the attribute mean for all samples belonging to the same class as the given tuple: compute the mean only from tuples that belong to the same class.

  • Use the most probable value to fill in the missing value: predict it with regression (Regression), a Bayesian formula (Bayesian formula), or a decision tree (Decision tree).
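    A minimal Python sketch of the missing-value strategies listed above; the DataFrame and its column names (Income, Age, Class) are illustrative, not data from the slides.

    import pandas as pd

    df = pd.DataFrame({
        "Income": [None, 2000.0, -10000.0, 3500.0],
        "Age":    [20,   25,     27,       None],
        "Class":  ["A",  "B",    "A",      "B"],
    })

    # 1. Ignore the tuple: drop rows that contain any missing value.
    dropped = df.dropna()

    # 3. Use a global constant such as "unknown" for missing values.
    constant_filled = df["Income"].astype(object).fillna("unknown")

    # 4. Use the attribute mean to fill in the missing value.
    mean_filled = df["Income"].fillna(df["Income"].mean())

    # 5. Use the attribute mean of samples belonging to the same class.
    class_mean_filled = df.groupby("Class")["Income"].transform(
        lambda s: s.fillna(s.mean()))

    print(mean_filled.tolist())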

  • Noisy data (Noisy data): random errors or variance in the data. Possible causes include faulty data collection instruments, data entry problems, and errors during data transmission (Data Transmission).

  • How to handle noisy data: Binning Methods, Regression, Clustering.

  • Binning Methods: first sort the data and partition them into bins (Partition). Binning smooths a value by consulting its neighborhood (Neighborhood), so it performs local smoothing (Local Smoothing) one bin (or bucket) at a time. The values in a bin can be smoothed by the bin mean (Bin Means), the bin median (Bin Medians), or the bin boundaries (Bin Boundaries).

  • Binning Method Example:
    Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
    Partition into (equi-depth) bins of depth 3:
      Bin 1: 4, 8, 15
      Bin 2: 21, 21, 24
      Bin 3: 25, 28, 34
    Smoothing by bin means:
      Bin 1: 9, 9, 9
      Bin 2: 22, 22, 22
      Bin 3: 29, 29, 29
    Smoothing by bin boundaries:
      Bin 1: 4, 4, 15
      Bin 2: 21, 21, 24
      Bin 3: 25, 25, 34
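    A small Python sketch of equi-depth binning with smoothing by bin means and by bin boundaries, reproducing the price example on this slide.

    prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    depth = 3  # equi-depth bins of three values each

    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: every value becomes the mean of its bin.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value is replaced by the closer
    # of the bin's minimum and maximum.
    by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                     for b in bins]

    print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]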

  • Regression: smooth by fitting the data into regression functions.

    Linear Regression: Y = α + βX

    Multiple Linear Regression: Y = b0 + b1X1 + b2X2 + ... + bmXm

  • Regression (figure): the line y = x + 1 is fitted to the data by minimizing the least-square error (Least-square error); a data value X1 is smoothed to the corresponding value Y1' on the fitted line.

  • Clustering: similar values are organized into groups (clusters); values that fall outside all clusters are treated as outliers (Outlier) and can be detected and removed.

  • 2) Data Transformation: transform the data into forms suitable for mining, for example by scaling values to fall within a small, specified range (Normalization). Methods: min-max normalization, z-score normalization, normalization by decimal scaling, and Sigmoidal normalization.

  • Min-Max Normalization: map a value v of attribute A into the new range [new_minA, new_maxA]:

    v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA

    Example: income has minimum 12,000 (min) and maximum 98,000 (max), and is mapped to [0, 1]. The value 73,600 is transformed to:

    (73,600 - 12,000) / (98,000 - 12,000) * (1.00 - 0) + 0 = 0.716
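    A one-function Python sketch of min-max normalization, reproducing the income example above (min 12,000, max 98,000, target range [0, 1], value 73,600).

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716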

  • Z-Score normalization: transform the values of attribute A so that they have mean 0 and standard deviation 1:

    v' = (v - meanA) / stand_devA

    Example: income has mean 54,000 (mean) and standard deviation 16,000 (stand_dev). The value 73,600 is transformed with the Z-Score to:

    (73,600 - 54,000) / 16,000 = 1.225
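    A one-function Python sketch of z-score normalization, reproducing the income example above (mean 54,000, standard deviation 16,000, value 73,600).

    def z_score(v, mean_a, std_a):
        return (v - mean_a) / std_a

    print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225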

  • Decimal scaling: normalize by moving the decimal point of the values of attribute A:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

    Example: the values of A range from -986 to 917, so the maximum absolute value is |-986| = 986. Dividing by 1,000 (j = 3) transforms -986 into -986 / 10^3 = -0.986.
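    A small Python sketch of normalization by decimal scaling: it finds the smallest j such that every scaled value has absolute value below 1, matching the example above (range -986 to 917, so j = 3 and -986 becomes -0.986).

    def decimal_scaling(values):
        j = 0
        while max(abs(v) for v in values) / 10 ** j >= 1:
            j += 1
        return j, [v / 10 ** j for v in values]

    print(decimal_scaling([-986, 917]))   # (3, [-0.986, 0.917])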

  • Sigmoidal normalization: normalize input into [-1, 1] by first standardizing the value, y = (v - mean) / std, and then applying

    y' = (1 - e^(-y)) / (1 + e^(-y))

    (Figure: the sigmoid curve of y' against x, bounded between -1 and 1.)

  • 3) Data Integration: combine data from multiple sources into a single coherent store. Key issues:

    1. Handling data redundancies (Data Redundancies) and data inconsistencies (Data Inconsistencies)

    2. Schema integration and resolving conflicts among attribute values from different sources

  • Schema Integration: use metadata (Metadata) to match equivalent entities from different sources, e.g., Cust_id in database A and CustNumber in database B refer to the same entity. This is an important step in Data Warehousing.

  • Data Integration: combines data from multiple sources into a coherent store.

    Schema integration: integrate metadata from different sources.

    Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-# refer to the same customer.

    Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons include different representations and different scales, e.g., metric vs. British units.

  • Data Integration (Cont.): redundant data occur often when multiple databases are integrated.

    The same attribute may have different names in different databases.

    One attribute may be a derived attribute in another table, e.g., annual revenue.

    Redundant data may be detected by correlation analysis.

    Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

  • Data Integration: Correlation analysis

    The correlation between attributes A and B can be measured by

    r_A,B = Σ (A - mean_A)(B - mean_B) / ((n - 1) σ_A σ_B)

    If r_A,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.

    The mean of A is mean_A = (Σ A) / n, and the standard deviation of A is σ_A = sqrt( Σ (A - mean_A)^2 / (n - 1) ).
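    A small Python sketch of the correlation analysis above: r_A,B computed from two attribute columns (the sample values are made up for illustration).

    import math

    A = [3.0, 5.0, 7.0, 9.0, 11.0]
    B = [10.0, 14.0, 19.0, 27.0, 30.0]

    n = len(A)
    mean_A, mean_B = sum(A) / n, sum(B) / n
    std_A = math.sqrt(sum((a - mean_A) ** 2 for a in A) / (n - 1))
    std_B = math.sqrt(sum((b - mean_B) ** 2 for b in B) / (n - 1))

    r_AB = sum((a - mean_A) * (b - mean_B)
               for a, b in zip(A, B)) / ((n - 1) * std_A * std_B)

    # r_AB > 0 means A and B are positively correlated; a value close to 1
    # (or -1) suggests one attribute is redundant given the other.
    print(round(r_AB, 3))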

  • 4) Data Reduction: obtain a reduced representation of the data set that is much smaller in volume but produces (almost) the same analytical results.

  • 4) Data Reduction (cont.): data reduction strategies:

    Data cube aggregation
    Dimensionality reduction (remove unimportant attributes)
    Data Compression
    Numerosity reduction (fit data into models)
    Discretization and concept hierarchy generation

  • Data Reduction: Data cube aggregation

    The data can be aggregated so that the result summarizes them. Ex.: the data consist of the AllElectronics sales per quarter for the years 2002 to 2004. They can be aggregated to summarize the total sales per year instead of per quarter, without loss of information necessary for the analysis task.

  • Data Reduction: Data cube aggregation

    Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction.

    (Figure: a data cube and its lattice of cuboids.)

  • Data Reduction: Dimensionality reduction

    Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features.

    Fewer attributes also reduce the number of patterns, making the patterns easier to understand.

    Heuristic methods: step-wise forward selection, step-wise backward elimination, combining forward selection and backward elimination, decision-tree induction.

  • Data Reduction: Dimensionality reduction

    Step-wise forward selection: start with an empty set of attributes, called the reduced set. The best of the original attributes is determined and added to the reduced set; at each subsequent iteration, the best of the remaining original attributes is added. A sketch follows the example below.

    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    Initial reduced set: {}
    => {A1}
    => {A1, A4}
    Reduced attribute set: {A1, A4, A6}
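    A rough Python sketch of step-wise forward selection. The scoring function is a stand-in for any measure of how well a candidate attribute subset preserves the class distribution (e.g., cross-validated accuracy); the toy score used here simply prefers A1, A4, and A6.

    def forward_selection(attributes, score, k):
        """Grow a reduced set by adding the best remaining attribute k times."""
        reduced, remaining = set(), set(attributes)
        for _ in range(k):
            best = max(remaining, key=lambda a: score(reduced | {a}))
            reduced.add(best)
            remaining.remove(best)
        return reduced

    toy_score = lambda s: len(s & {"A1", "A4", "A6"})
    print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score, 3))
    # {'A1', 'A4', 'A6'} (set order may vary)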

  • Data Reduction: Dimensionality reduction

    Step-wise backward elimination: start with the full set of attributes; at each step, remove the worst attribute remaining in the set.

    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    => {A1, A3, A4, A5, A6}
    => {A1, A4, A5, A6}
    Reduced attribute set: {A1, A4, A6}

  • Data Reduction: Dimensionality reduction

    Combining forward selection and backward elimination: at each step, select the best attribute and remove the worst from among the remaining attributes.

  • Data Reduction: Dimensionality reduction

    Decision-tree induction: induce a tree from the data; the attributes that appear in the tree form the reduced attribute set.

    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    (Figure: a tree that splits on A4, then on A1 and A6, with leaves labeled Class 1 and Class 2.)
    => Reduced attribute set: {A1, A4, A6}

  • Data Reduction: Data Compression

    String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion.

    Audio/video compression: typically lossy compression, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole.

    Time sequences are not audio: they are typically short and vary slowly with time.

  • Data Reduction: Numerosity reduction

    Reduce the data volume by choosing alternative, smaller forms of data representation.

    Types of numerosity reduction:

    Parametric methods: assume the data fit some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Example: regression.

    Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling.

  • Data Reduction: Numerosity reduction

    Histograms: a popular data reduction technique. Divide the data into buckets and store the average (or sum) for each bucket. Histograms can be constructed optimally in one dimension using dynamic programming and are related to quantization problems.

    (Figure: an equi-width histogram of values from 10,000 to 100,000, with bucket counts up to about 40.)
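    A minimal Python sketch of histogram-based numerosity reduction: equi-width buckets that keep only the count and mean of the values they cover. The price values are illustrative, not taken from the figure.

    from collections import defaultdict

    prices = [12, 14, 15, 21, 22, 24, 25, 25, 28, 31, 34, 38]
    width = 10  # equi-width buckets: [10, 20), [20, 30), [30, 40)

    buckets = defaultdict(list)
    for p in prices:
        buckets[(p // width) * width].append(p)

    # Store only a summary (count, mean) per bucket instead of the raw values.
    summary = {lo: (len(v), round(sum(v) / len(v), 2))
               for lo, v in sorted(buckets.items())}
    print(summary)   # {10: (3, 13.67), 20: (6, 24.17), 30: (3, 34.33)}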

  • Data Reduction: Numerosity reduction

    Clustering: partition the data set into clusters, and store only the cluster representations. This can be very effective if the data are clustered, but not if the data are smeared. Clusterings can be hierarchical and stored in multi-dimensional index tree structures. There are many choices of clustering definitions and clustering algorithms.

  • Data Reduction: Numerosity reduction

    Sampling: obtain a small sample s to represent the whole data set N.

    Simple Random Sample Without Replacement (SRSWOR): the probability of drawing any tuple in D is 1/N.

    Simple Random Sample With Replacement (SRSWR): a drawn tuple is placed back and may be drawn again.

    Cluster/Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
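    A small Python sketch of the three sampling schemes above, using the standard random module; D and the class labels are illustrative.

    import random

    D = list(range(1, 101))   # the data set, N = 100 tuples
    s = 10                    # desired sample size

    # Simple Random Sample Without Replacement (SRSWOR)
    srswor = random.sample(D, s)

    # Simple Random Sample With Replacement (SRSWR)
    srswr = [random.choice(D) for _ in range(s)]

    # Stratified sampling: keep the class proportions of a skewed data set.
    labels = ["young"] * 70 + ["middle"] * 25 + ["senior"] * 5
    strata = {c: [i for i, l in enumerate(labels) if l == c] for c in set(labels)}
    stratified = [i for c, idx in strata.items()
                  for i in random.sample(idx, max(1, round(len(idx) * s / len(labels))))]

    print(len(srswor), len(srswr), len(stratified))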

  • Data Reduction: Numerosity reduction

    (Figure: the raw data set from which samples are drawn.)

  • Data Reduction: Numerosity reduction

    (Figure: the raw data compared with a cluster/stratified sample.)

  • Data Reduction: Numerosity reduction examples (figure)

  • Data Reduction: Numerosity reduction

    Hierarchical Reduction: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).

    Ex.: Suppose that an index tree contains 10,000 tuples with keys ranging from 1 up to some maximum, and uses 6 buckets for the key. Each bucket contains roughly 10,000/6 items, and the buckets are delimited by pointers to the data keys 986, 3396, 5411, 8392, and 9544, respectively.

    The use of multidimensional index trees as a form of data reduction relies on an ordering of the attribute values in each dimension.

  • Data Reduction: Discretization

    Three types of attributes:
    Nominal: values from an unordered set
    Ordinal: values from an ordered set
    Continuous: real numbers

    Discretization: divide the range of a continuous attribute into intervals.
    Some classification algorithms only accept categorical attributes.
    Discretization reduces data size and prepares the data for further analysis.

  • Data Reduction: Discretization

    Typical methods (all of them can be applied recursively):

    Binning
    Histogram analysis
    Clustering analysis
    Entropy-based discretization
    Segmentation by natural partitioning

  • Data Reduction: Discretization

    Entropy-based discretization. Entropy:

    H(X) = - Σ_{x ∈ A_X} P(x) log2 P(x)

    Example: Coin Flip. A_X = {heads, tails}, P(heads) = P(tails) = 1/2, and log2(1/2) = -1, so H(X) = 1. What about a two-headed coin?

    Conditional Entropy:

    H(X | Y) = Σ_{y ∈ A_Y} P(y) H(X | y)

  • Data Reduction: Discretization

    Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    H(S, T) = (|S1| / |S|) H(S1) + (|S2| / |S|) H(S2)

    The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

    The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., H(S) - H(T, S) < δ.

    Experiments show that it may reduce data size and improve classification accuracy.
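    A minimal Python sketch of entropy-based binary discretization, assuming a list of (value, class label) samples; the values and labels are made up for illustration.

    import math
    from collections import Counter

    samples = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]

    def entropy(labels):
        n = len(labels)
        return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

    def best_boundary(samples):
        """Pick the boundary T minimizing H(S, T) = |S1|/|S| H(S1) + |S2|/|S| H(S2)."""
        best = None
        for t in sorted({v for v, _ in samples})[1:]:   # candidate boundaries
            s1 = [c for v, c in samples if v < t]
            s2 = [c for v, c in samples if v >= t]
            h = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(samples)
            if best is None or h < best[1]:
                best = (t, h)
        return best

    print(best_boundary(samples))   # (7, 0.0): a boundary at 7 separates the classes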

  • Data Reduction: Discretization

    Segmentation by natural partitioning: a simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.

    Distinct values at the most significant digit | Natural intervals (equi-width)
    3, 6, 9                                       | 3
    7                                             | 3 (grouped 2-3-2)
    2, 4, 8                                       | 4
    1, 5, 10                                      | 5

  • Data Reduction: Discretization, Segmentation by natural partitioning (figure)

  • Data Reduction: Concept Hierarchy

    Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < province_or_state < country.

  • Data Reduction: Concept Hierarchy

    Automatic Concept Hierarchy Generation: some concept hierarchies can be generated automatically based on the analysis of the number of distinct values per attribute in the given data set. The attribute with the most distinct values is placed at the lowest level of the hierarchy. Note the exception: weekday, month, quarter, year.

  • Data Reduction: Concept Hierarchy

    Automatic Concept Hierarchy Generation example:

    country            15 distinct values
    province_or_state  65 distinct values
    city               3,567 distinct values
    street             674,339 distinct values

  • HW#3

    1. What is Data preprocessing?

    2. Why Preprocess the Data?

    3. What are the major tasks in Data Preprocessing?

    4. What is the data cleaning task?

    5. How to Handle Missing Data?

    6. What are the normalization methods?

  • HW#3

    7. The attribute income has a minimum of $50,000 (min) and a maximum of $150,000 (max). A value of $100,000 for income is to be mapped to the new range [3, 5]. Calculate the transformed income.

    8. The attribute income has a mean of $76,000 (mean) and a standard deviation of $12,500 (std). A value of $95,000 for income is to be transformed with z-score normalization. Calculate the transformed income.

    9. The attribute A ranges from -650 to 999 and is normalized by decimal scaling. Is j = 2? What is the transformed value of -650?

    10. What are the tasks in Data Integration?

    11. What are the data reduction strategies?

  • LAB 3

    bank-data-missvalue.csv