Introduction, Data, Data Mining Methods: Association Rules, Classification ...

Page 1:

Big Data

Outline

Introduction

Data

Data mining methods

1. Association Rules

2. Classification

3. Clustering

Applications

Conclusion

Page 2:

What Is Data Mining?

Data mining (knowledge discovery from data): extraction of interesting (non-trivial, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

Alternative names include knowledge discovery in databases (KDD)

Page 3:

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, A.I., Algorithm, Visualization, and Other Disciplines]

Page 4:

Why Not Traditional Data Analysis?

Tremendous amount of data

High complexity of data

Page 5:

Why Data Mining?

“Necessity is the mother of invention”

Data mining: automated analysis of massive data sets

Page 6:

Business Objectives

[Figure: business objectives diagram; only its citation labels survive: [PPS 03], [BL 97], [MMPH 03], [BM 98], [LPSHG 01], [SM 00], [FCJ 01], [MPT 99]]

Page 7:

Data Mining: On What Kinds of Data?

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Advanced data sets and advanced applications

Data streams and sensor data

Time-series data, temporal data, sequence data (incl. bio-sequences)

Structure data, graphs, social networks and multi-linked data

Object-relational databases

Heterogeneous databases and legacy databases

Spatial data and spatiotemporal data

Multimedia database

Text databases

The World-Wide Web

Page 8:

Steps of Data Mining

Page 9:

Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of effort!)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

Choosing functions of data mining: summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Page 10:

Data Mining Functionalities

Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection and rare events analysis

Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining: e.g., digital camera → large SD memory
  Periodicity analysis
  Similarity-based analysis
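The outlier analysis item above can be illustrated with a small sketch. It flags values that sit far from the median, measured in units of the median absolute deviation (MAD). The MAD rule, the 3.5 threshold, and the transaction amounts are assumptions chosen for illustration; the slides do not prescribe a particular outlier test.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [12.0, 15.5, 14.0, 13.2, 16.1, 980.0, 14.8]
print(mad_outliers(amounts))
```

Whether a flagged value is noise or a genuine exception (for example, fraud) still has to be decided by someone who knows the domain.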

Page 11:

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
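To make the example rules concrete, the sketch below scores them against the five market-basket transactions using support and confidence. These two measures are the standard way to rate association rules, but the slide itself does not define them, so treat the function names and definitions here as an added assumption.

```python
# The five market-basket transactions from the table (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

rules = [
    ({"Diaper"}, {"Beer"}),
    ({"Milk", "Bread"}, {"Eggs", "Coke"}),
    ({"Beer", "Bread"}, {"Milk"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

On this data, {Diaper} → {Beer} holds in 3 of 5 transactions (support 0.6) and in 3 of the 4 transactions that contain Diaper (confidence 0.75), while {Milk, Bread} → {Eggs, Coke} has support 0 because no transaction contains both Eggs and Coke.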

Page 12:

Classification: Definition

Given a collection of records (training set)

Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Page 13:

Classification Task

[Figure: the training set is fed to a tree induction algorithm, which learns a model, here a decision tree (induction); the model is then applied to the test set (deduction)]

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
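The learn-then-apply workflow in the figure can be sketched in a few lines. The example below does not use a decision tree; it swaps in the simplest possible learner, a majority-class baseline, purely to show the mechanics of induction, deduction, and accuracy measurement. Splitting the ten labeled records 7/3 is also an assumption, made because the slide's own test set has no class labels to score against.

```python
from collections import Counter

# The ten labeled records from the training table:
# (Attrib1, Attrib2, Attrib3 in thousands, Class)
records = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"), ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"), ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"), ("No", "Small", 90, "Yes"),
]

def learn_majority(train):
    """Induction: 'learn' the most common class among the training records."""
    majority = Counter(r[-1] for r in train).most_common(1)[0][0]
    return lambda attributes: majority  # the model ignores the attributes

def accuracy(model, labeled):
    """Fraction of records whose predicted class matches the true class."""
    return sum(model(r[:-1]) == r[-1] for r in labeled) / len(labeled)

# Hold out the last three labeled records to play the role of a test set.
train, held_out = records[:7], records[7:]
model = learn_majority(train)          # induction
print(accuracy(model, train))          # score on the records it was built from
print(accuracy(model, held_out))       # deduction on unseen records
```

The baseline gets 6 of 7 training records right but only 1 of 3 held-out records, which is the reason accuracy is measured on a separate test set. Any real learner, such as a tree induction algorithm, would slot in where `learn_majority` is.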

Page 14:

Weather Data: Play or not Play?

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: Outlook is the forecast, no relation to the Microsoft email program
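One way a tree induction algorithm might pick the first attribute to split on is information gain: the drop in class entropy after partitioning the records by that attribute. The slides do not say which splitting criterion is used, so the choice of information gain here is an assumption. The sketch computes it for all four attributes of the weather data.

```python
from collections import Counter
from math import log2

# Weather records: (Outlook, Temperature, Humidity, Windy, Play?)
rows = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, col):
    """Entropy of Play? minus the weighted entropy after splitting on column col."""
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

gains = {name: information_gain(rows, i) for i, name in enumerate(ATTRS)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

Outlook comes out with the largest gain (about 0.247 bits), followed by Humidity, Windy, and Temperature, so under this criterion Outlook would be chosen as the root.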

Page 15:

Example Tree for “Play?”

Outlook
  sunny: Humidity
    high: No
    normal: Yes
  overcast: Yes
  rain: Windy
    true: No
    false: Yes
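As a quick check, the example tree can be written as a plain function and run against the 14 weather records from the previous page. The function name `predict_play` is made up for this sketch; the branch structure follows the tree, which never consults Temperature.

```python
def predict_play(outlook, humidity, windy):
    """Walk the example tree: Outlook at the root, then Humidity or Windy."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    # remaining branch: rain
    return "No" if windy == "true" else "Yes"

# Weather records reduced to the attributes the tree uses:
# (Outlook, Humidity, Windy, Play?)
rows = [
    ("sunny", "high", "false", "No"), ("sunny", "high", "true", "No"),
    ("overcast", "high", "false", "Yes"), ("rain", "high", "false", "Yes"),
    ("rain", "normal", "false", "Yes"), ("rain", "normal", "true", "No"),
    ("overcast", "normal", "true", "Yes"), ("sunny", "high", "false", "No"),
    ("sunny", "normal", "false", "Yes"), ("rain", "normal", "false", "Yes"),
    ("sunny", "normal", "true", "Yes"), ("overcast", "high", "true", "Yes"),
    ("overcast", "normal", "false", "Yes"), ("rain", "high", "true", "No"),
]

correct = sum(predict_play(o, h, w) == play for o, h, w, play in rows)
print(f"{correct}/{len(rows)} records classified correctly")
```

The tree reproduces the label of every record it was built from. That says nothing yet about unseen days, which is what a separate test set is for.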

Page 16:

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: observed earthquake epicenters should be clustered along continental faults
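The grouping behind these applications can be sketched with k-means, one common clustering method. The slides do not name an algorithm, so k-means is an assumption here, and the coordinates are invented stand-ins for epicenter locations rather than real data.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move each
    centroid to the mean of its assigned points, for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical (x, y) positions forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (0.8, 1.1), (8.3, 7.9), (7.8, 8.2)]
centroids, clusters = kmeans(points, centroids=[points[0], points[2]])
for center, members in zip(centroids, clusters):
    print(center, members)
```

Starting centroids are passed in explicitly so the run is repeatable; in practice they are often chosen at random, and the number of clusters has to be decided beforehand.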

Page 17:

Data Mining for Retail Industry

Retail industry: huge amounts of data on sales, customer shopping history, etc.

Applications of retail data mining
  Identify customer buying behaviors
  Discover customer shopping patterns and trends
  Improve the quality of customer service
  Achieve better customer retention and satisfaction
  Enhance goods consumption ratios
  Design more effective goods transportation and distribution policies

Page 18:

Data Mining in Retail Industry: Examples

Design and construction of data warehouses based on the benefits of data mining
  Multidimensional analysis of sales, customers, products, time, and region

Analysis of the effectiveness of sales campaigns

Customer retention: analysis of customer loyalty
  Use customer loyalty card information to register sequences of purchases of particular customers
  Use sequential pattern mining to investigate changes in customer consumption or loyalty
  Suggest adjustments on the pricing and variety of goods

Purchase recommendation and cross-reference of items

Page 19:

Financial Data Mining

Classification and clustering of customers for targeted marketing
  Multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer with an appropriate customer group

Detection of money laundering and other financial crimes
  Integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
  Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)

Page 21:

Data Mining for Telecomm. Industry (1)

A rapidly expanding and highly competitive industry and a great demand for data mining
  Understand the business involved
  Identify telecommunication patterns
  Catch fraudulent activities
  Make better use of resources
  Improve the quality of service

Multidimensional analysis of telecommunication data
  Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.

Page 22:

Data Mining for Telecomm. Industry (2)

Fraudulent pattern analysis and the identification of unusual patterns
  Identify potentially fraudulent users and their atypical usage patterns
  Detect attempts to gain fraudulent entry to customer accounts
  Discover unusual patterns which may need special attention

Multidimensional association and sequential pattern analysis
  Find usage patterns for a set of communication services by customer group, by month, etc.
  Promote the sales of specific services
  Improve the availability of particular services in a region

Use of visualization tools in telecommunication data analysis

Page 23:

Biomedical and DNA Data Analysis

DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T)

Gene: a sequence of hundreds of individual nucleotides arranged in a particular order

Humans have around 30,000 genes (later estimates are closer to 20,000 protein-coding genes)

Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes

Semantic integration of heterogeneous, distributed genome databases
  Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
  Data cleaning and data integration methods developed in data mining will help

Page 24:

DNA Analysis: Examples

Similarity search and comparison among DNA sequences
  Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
  Identify gene sequence patterns that play roles in various diseases

Association analysis: identification of co-occurring gene sequences
  Most diseases are not triggered by a single gene but by a combination of genes acting together
  Association analysis may help determine the kinds of genes that are likely to co-occur in target samples

Path analysis: linking genes to different disease development stages
  Different genes may become active at different stages of the disease
  Develop pharmaceutical interventions that target the different stages separately

Visualization tools and genetic data analysis
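A very rough sketch of similarity search among DNA sequences: break each sequence into overlapping substrings of length k (k-mers) and compare the sets with Jaccard similarity. This is a deliberately simple stand-in, not an alignment-based method, and the slides do not specify a technique. The sequences and names (`query`, `seq1`, `seq2`) are invented for illustration.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=3):
    """Shared k-mers divided by all distinct k-mers found in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

# Hypothetical sequences: seq1 differs from the query in one base, seq2 is unrelated.
query = "ACGTACGT"
candidates = {"seq1": "ACGTACGA", "seq2": "TTTTGGGG"}

ranked = sorted(candidates, key=lambda name: jaccard(query, candidates[name]), reverse=True)
for name in ranked:
    print(name, jaccard(query, candidates[name]))
```

Here `seq1` shares 4 of 5 distinct 3-mers with the query (similarity 0.8) and `seq2` shares none. The same k-mer sets could feed the class comparison mentioned above by counting which patterns occur more often in diseased than in healthy samples.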

Page 25:

Other Applications

Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat

Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Page 26:

Conclusion

Data mining is a young discipline with wide and diverse applications

There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications