Data Warehousing and Data Mining - Bharathidasan

Content

• Introduction

• Overview of data mining technology

• Association rules

• Classification

• Clustering

• Applications of data mining

• Commercial tools

• Conclusion

Introduction

• What is data mining?

• Why do we need to ‘mine’ data?

• On what kind of data can we ‘mine’?

What is data mining?

• The process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

• A part of the Knowledge Discovery in Databases (KDD) process.

Why data mining?

• The explosive growth in data collection

• The storing of data in data warehouses

• The availability of increased access to data from Web navigation and intranets

• We have to find a more effective way to use these data in the decision support process than just using traditional query languages

On what kind of data?

• Relational databases

• Data warehouses

• Transactional databases

• Advanced database systems: object-relational, spatial and temporal, time-series, multimedia, text, WWW

GeneFilter Comparison Report

GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 1 Name: N2#1finaladj

INTENSITIES (RAW and NORMALIZED)

ORF NAME   GENE NAME CHRM  F  G  R     RAW GF1  RAW GF2   NORM GF1   NORM GF2 DIFFERENCE  RATIO
YAL001C    TFC3         1  1  A  1  2    12.03     7.38     403.83     209.79     194.04   1.92
YBL080C    PET112       2  1  A  1  3    53.21    35.62   1,786.11   1,013.13     772.98   1.76
YBR154C    RPB5         2  1  A  1  4    79.26    78.51   2,660.73   2,232.86     427.87   1.19
YCL044C                 3  1  A  1  5    53.22    44.66   1,786.53   1,270.12     516.41   1.41
YDL020C    SON1         4  1  A  1  6    23.80    20.34     799.06     578.42     220.64   1.38
YDL211C                 4  1  A  1  7    17.31    35.34     581.00   1,005.18    -424.18  -1.73
YDR155C    CPH1         4  1  A  1  8   349.78   401.84  11,741.98  11,428.10     313.88   1.03
YDR346C                 4  1  A  1  9    64.97    65.88   2,180.87   1,873.67     307.21   1.16
YAL010C    MDM10        1  1  A  2  2    13.73     9.61     461.03     273.36     187.67   1.69
YBL088C    TEL1         2  1  A  2  3     8.50     7.74     285.38     220.01      65.37   1.30
YBR162C                 2  1  A  2  4   226.84   293.83   7,614.82   8,356.39    -741.57  -1.10
YCL052C    PBN1         3  1  A  2  5    41.28    34.79   1,385.79     989.41     396.38   1.40
YDL028C    MPS1         4  1  A  2  6     7.95     6.24     266.99     177.34      89.65   1.51
YDL219W                 4  1  A  2  7    16.08    11.33     539.93     322.20     217.74   1.68
YDR163W                 4  1  A  2  8    19.13    14.19     642.17     403.56     238.61   1.59
YDR354W    TRP4         4  1  A  2  9    62.24    40.74   2,089.48   1,158.64     930.84   1.80
YAL018C                 1  1  A  3  2    10.72     8.81     359.75     250.60     109.15   1.44
YBL096C                 2  1  A  3  3    10.91     8.98     366.40     255.40     111.00   1.43
YBR169C    SSE2         2  1  A  3  4    17.33    27.81     581.80     790.84    -209.05  -1.36
YCL060C                 3  1  A  3  5    17.99    24.75     603.96     703.75     -99.79  -1.17
YDL036C                 4  1  A  3  6    14.22     8.86     477.39     251.94     225.44   1.89
YDL227C    HO           4  1  A  3  7    25.61    31.52     859.71     896.46     -36.75  -1.04
YDR171W    HSP42        4  1  A  3  8   102.08    98.37   3,426.83   2,797.58     629.25   1.22
YDR362C                 4  1  A  3  9    16.32    12.95     547.96     368.39     179.57   1.49
YAL026C    DRS2         1  1  A  4  2    11.32     7.97     379.85     226.53     153.33   1.68
YBL102W    SFT2         2  1  A  4  3    55.88    63.74   1,875.82   1,812.81      63.02   1.03
YBR177C                 2  1  A  4  4    63.31    29.03   2,125.20     825.60   1,299.60   2.57
YCL068C                 3  1  A  4  5     8.33     4.47     279.51     127.16     152.35   2.20
YDL044C    MTF2         4  1  A  4  6    11.73     6.96     393.88     198.07     195.81   1.99
YDL235C    YPD1         4  1  A  4  7    38.71    30.20   1,299.33     858.83     440.50   1.51
YDR179C                 4  1  A  4  8    12.77    11.05     428.60     314.12     114.48   1.36
YDR370C                 4  1  A  4  9    16.70    15.30     560.62     435.13     125.49   1.29
YAL034C    FUN19        1  1  A  5  2    20.89    24.21     701.32     688.59      12.73   1.02
YBL111C                 2  1  A  5  3    22.38    13.67     751.39     388.69     362.70   1.93
YBR185C    MBA1         2  1  A  5  4    38.42    19.96   1,289.61     567.78     721.83   2.27
YCLX03C                 3  1  A  5  5     8.69     3.66     291.77     104.11     187.66   2.80
YDL052C    SLC1         4  1  A  5  6    52.37    49.87   1,758.05   1,418.33     339.73   1.24
YDL243C                 4  1  A  5  7    15.56    12.95     522.24     368.30     153.94   1.42
YDR186C                 4  1  A  5  8    16.48    15.01     553.30     426.75     126.55   1.30
YDR378C                 4  1  A  5  9    31.13    28.08   1,045.01     798.50     246.50   1.31
YAL040C    CLN3         1  1  A  6  2   126.65   107.34   4,251.70   3,052.61   1,199.08   1.39
YBR006W                 2  1  A  6  3    22.74    11.10     763.49     315.55     447.94   2.42
YBR193C                 2  1  A  6  4    14.81    15.55     497.07     442.20      54.87   1.12
YCLX11W                 3  1  A  6  5   161.96   175.34   5,436.86   4,986.41     450.44   1.09
YDL060W                 4  1  A  6  6    29.84    37.13   1,001.65   1,055.98     -54.34  -1.05
YDR003W                 4  1  A  6  7    23.99    23.22     805.48     660.25     145.22   1.22
YDR194C    MSS116       4  1  A  6  8    66.58    47.16   2,235.07   1,341.29     893.78   1.67
YDR386W                 4  1  A  6  9    11.27     5.75     378.27     163.46     214.81   2.31
YAL047C                 1  1  A  7  2    15.54    11.30     521.74     321.28     200.46   1.62
YBR012W-B               2  1  A  7  3    54.70    79.97   1,836.29   2,274.15    -437.86  -1.24
YBR201W    DER1         2  1  A  7  4    21.67    19.57     727.49     556.64     170.85   1.31
YCR007C                 3  1  A  7  5    25.02    15.96     840.01     453.76     386.25   1.85
YDL068W                 4  1  A  7  6    18.32    13.11     614.83     372.78     242.05   1.65

[Figure labels: Structure: 3D anatomy; Function: 1D signal; Metadata: annotation]

Overview of data mining technology

• Data Mining vs. Data Warehousing

• Data Mining as a part of Knowledge Discovery Process

• Goals of Data Mining and Knowledge Discovery

• Types of Knowledge Discovery during Data Mining

Data Mining vs. Data Warehousing

• A Data Warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated.

• This makes it much easier and more efficient to run queries over data that originally came from different sources.

• The goal of data warehousing is to support decision making with data!
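
As a rough illustration of the integration idea above, the following sketch loads two made-up source systems with different schemas into one in-memory SQLite table and then answers a question with a single query. The source names, field names, and figures are all hypothetical; a real warehouse would involve far more cleaning and modelling.

```python
import sqlite3

# Two hypothetical source systems with different schemas (made-up data).
web_orders = [
    {"customer": "alice", "total_cents": 1250},
    {"customer": "bob", "total_cents": 400},
]
store_sales = [
    ("ALICE", 7.50),
    ("carol", 20.00),
]


def build_warehouse(web_rows, store_rows):
    """Load both sources into one integrated table with a common schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, source TEXT)")
    # Web source: amounts are in cents, so convert to currency units.
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, 'web')",
        [(r["customer"].lower(), r["total_cents"] / 100.0) for r in web_rows],
    )
    # Store source: names are inconsistently cased, so normalise them.
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, 'store')",
        [(name.lower(), amount) for name, amount in store_rows],
    )
    return conn


def spend_per_customer(conn):
    """One query over data that originally lived in two systems."""
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    )
    return dict(cursor.fetchall())


warehouse = build_warehouse(web_orders, store_sales)
totals = spend_per_customer(warehouse)
print(totals)
```

Once the data share one schema, the per-customer total no longer needs to know which system a sale came from.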

Knowledge Discovery in Databases and Data Mining

• The non-trivial extraction of implicit, previously unknown, and potentially useful information from databases.

Goals of Data Mining and KDD

• Prediction: How certain attributes within the data will behave in the future.

• Identification: Identify the existence of an item, an event, or an activity.

• Classification: Partition the data into categories.

• Optimization: Optimize the use of limited resources.

Applications of data mining

• Market analysis: marketing strategies, advertisement

• Risk analysis and management: finance and financial investments, manufacturing and production

• Fraud detection and detection of unusual patterns (outliers): telecommunication, financial transactions, anti-terrorism (!!!)
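
One simple way to detect unusual patterns (outliers) is a robust score based on the median. The sketch below is an assumed illustration, not a method taken from these slides: it flags transaction amounts whose modified z-score, computed from the median absolute deviation (MAD), exceeds a conventional cutoff of 3.5. The function name and sample amounts are hypothetical.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), which are less distorted by extreme values than
    the mean and standard deviation.
    """
    med = median(amounts)
    mad = median([abs(x - med) for x in amounts])
    if mad == 0:
        # All values (or most of them) are identical; no scale to compare to.
        return []
    return [x for x in amounts if 0.6745 * abs(x - med) / mad > threshold]


# Made-up transaction amounts with one obviously unusual value.
transactions = [20, 22, 19, 21, 20, 23, 500]
suspicious = flag_outliers(transactions)
print(suspicious)
```

Real fraud detection would combine many attributes per transaction; a single-variable score like this only shows the principle.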

Applications

Service providers in the mobile phone and utilities industries

• Mobile phone and utilities companies use Data Mining and Business Intelligence to predict ‘churn’, the term they use for when a customer leaves their company to get their phone/gas/broadband from another provider.

• They collate billing information, customer services interactions, website visits and other metrics to give each customer a probability score, then target offers and incentives to customers whom they perceive to be at a higher risk of churning.
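
The probability score described above could be sketched, under heavy assumptions, as a logistic function of a few customer metrics. The feature names, weights, and cutoff below are invented for illustration; in practice the weights would be fitted to historical data on customers who did and did not leave.

```python
import math

# Hypothetical weights; a real model would learn these from past churn data.
WEIGHTS = {"late_bills": 0.8, "support_calls": 0.5, "site_visits": -0.1}
BIAS = -2.0


def churn_probability(features):
    """Map a customer's metrics to a score between 0 and 1 (logistic model)."""
    z = BIAS + sum(w * features.get(name, 0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def at_risk(customers, cutoff=0.5):
    """Return the ids of customers whose score is at or above the cutoff."""
    return [cid for cid, feats in customers.items()
            if churn_probability(feats) >= cutoff]


# Made-up customers.
customers = {
    "c1": {"late_bills": 3, "support_calls": 4, "site_visits": 1},
    "c2": {"late_bills": 0, "support_calls": 0, "site_visits": 10},
}
targets = at_risk(customers)
print(targets)
```

Customers returned by `at_risk` would be the ones to receive retention offers and incentives.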

Retail

• Retailers segment customers into ‘Recency, Frequency, Monetary’ (RFM) groups and target marketing and promotions to those different groups.

• A customer who spends little but often and last did so recently will be handled differently to a customer who spent big but only once, and also some time ago. The former may receive loyalty, upsell and cross-sell offers, whereas the latter may be offered a win-back deal, for instance.
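
A minimal sketch of RFM segmentation, assuming each customer's history is a list of (date, amount) purchases. The thresholds (30 days, 5 purchases) and the offer labels are hypothetical choices made only to mirror the two customers described above.

```python
from datetime import date


def rfm(purchases, today):
    """Return (recency in days, frequency, monetary total) for one customer."""
    recency = (today - max(d for d, _ in purchases)).days
    frequency = len(purchases)
    monetary = sum(amount for _, amount in purchases)
    return recency, frequency, monetary


def suggest_offer(purchases, today, recent_days=30, frequent=5):
    """Pick an offer type from recency and frequency (illustrative rules)."""
    recency, frequency, _ = rfm(purchases, today)
    if recency <= recent_days and frequency >= frequent:
        return "loyalty / upsell / cross-sell"
    if recency > recent_days and frequency == 1:
        return "win-back"
    return "standard"


today = date(2024, 6, 30)
# Spends little, but often and recently (made-up data).
regular = [(date(2024, 6, day), 5.0) for day in (1, 5, 10, 20, 28)]
# Spent big, but only once and some time ago (made-up data).
lapsed = [(date(2023, 1, 15), 900.0)]

print(rfm(regular, today), suggest_offer(regular, today))
print(rfm(lapsed, today), suggest_offer(lapsed, today))
```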

Cross Selling and Up Selling

• Buy an Android Phone, Get recommendations for memory card / accessories

• Buy an Android Phone, Get recommendations for an iPhone

Winback Offers

Applications

E-commerce

• Cross-sells and up-sells through their websites

• One of the most famous of these is, of course, Amazon, who use sophisticated mining techniques to drive their ‘People who viewed that product also liked this’ functionality.
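
A very simplified version of "also liked" recommendations can be built by counting which items appear together in the same session or basket. This is an assumed co-occurrence sketch, not a description of any retailer's actual system; the sessions and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(sessions):
    """Count, for each item, how often every other item shares a session."""
    counts = {}
    for items in sessions:
        # set() ensures each item is counted once per session.
        for a, b in permutations(set(items), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def also_liked(counts, item, k=2):
    """Return up to k items most often seen together with the given item."""
    return [other for other, _ in counts.get(item, Counter()).most_common(k)]


# Made-up browsing sessions.
sessions = [
    ["phone", "memory card", "case"],
    ["phone", "memory card"],
    ["phone", "charger"],
    ["case", "charger"],
]
counts = build_cooccurrence(sessions)
print(also_liked(counts, "phone", k=1))
```

Association rule mining refines the same idea by also measuring how reliable each "bought A, so suggest B" pairing is.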

Applications

Supermarkets

• Famously, supermarket loyalty card programmes are usually driven mostly, if not solely, by the desire to gather comprehensive data about customers for use in data mining.

Crime prevention agencies

• Data mining is used to spot trends across large volumes of data, helping with everything from where to deploy police manpower (where is crime most likely to happen and when?), to who to search at a border crossing (based on age/type of vehicle, number/age of occupants, border crossing history), and even which intelligence to take seriously in counter-terrorism activities.

Applications of data mining

• Text mining (news groups, email, documents) and Web mining

• Stream data mining

• DNA and bio-data analysis: disease outcomes, effectiveness of treatments, identification of new drugs

Closing thoughts!

• Data mining is a “decision support” process in which we search for patterns of information in data.

• This technique can be used on many types of data.

• Overlaps with machine learning, statistics, artificial intelligence, databases, visualization…