Data Warehousing and Data Mining - Bharathidasan
Content
• Introduction
• Overview of data mining technology
• Association rules
• Classification
• Clustering
• Applications of data mining
• Commercial tools
• Conclusion
Introduction
• What is data mining?
• Why do we need to ‘mine’ data?
• What kind of data can we ‘mine’?
What is data mining?
• The process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
• A part of the Knowledge Discovery in Databases (KDD) process.
Why data mining?
• The explosive growth in data collection
• The storing of data in data warehouses
• The availability of increased access to data from Web navigation and intranets
• We need a more effective way to use these data in the decision support process than traditional query languages alone
On what kind of data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced database systems
  - Object-relational
  - Spatial and temporal
  - Time-series
  - Multimedia, text
  - WWW
  - …
GeneFilter Comparison Report
GeneFilter 1 Name: GeneFilter 1 Name:
O2#1 8-20-99adjfinal N2#1finaladj
INTENSITIES
ORF NAME  GENE NAME  CHRM  F  G  R  GF1 (RAW)  GF2 (RAW)  GF1 (NORMALIZED)  GF2 (NORMALIZED)  DIFFERENCE  RATIO
YAL001C TFC3 1 1 A 1 2 12.03 7.38 403.83 209.79 194.04 1.92
YBL080C PET112 2 1 A 1 3 53.21 35.62 "1,786.11" "1,013.13" 772.98 1.76
YBR154C RPB5 2 1 A 1 4 79.26 78.51 "2,660.73" "2,232.86" 427.87 1.19
YCL044C 3 1 A 1 5 53.22 44.66 "1,786.53" "1,270.12" 516.41 1.41
YDL020C SON1 4 1 A 1 6 23.80 20.34 799.06 578.42 220.64 1.38
YDL211C 4 1 A 1 7 17.31 35.34 581.00 "1,005.18" -424.18 -1.73
YDR155C CPH1 4 1 A 1 8 349.78 401.84 "11,741.98" "11,428.10" 313.88 1.03
YDR346C 4 1 A 1 9 64.97 65.88 "2,180.87" "1,873.67" 307.21 1.16
YAL010C MDM10 1 1 A 2 2 13.73 9.61 461.03 273.36 187.67 1.69
YBL088C TEL1 2 1 A 2 3 8.50 7.74 285.38 220.01 65.37 1.30
YBR162C 2 1 A 2 4 226.84 293.83 "7,614.82" "8,356.39" -741.57 -1.10
YCL052C PBN1 3 1 A 2 5 41.28 34.79 "1,385.79" 989.41 396.38 1.40
YDL028C MPS1 4 1 A 2 6 7.95 6.24 266.99 177.34 89.65 1.51
YDL219W 4 1 A 2 7 16.08 11.33 539.93 322.20 217.74 1.68
YDR163W 4 1 A 2 8 19.13 14.19 642.17 403.56 238.61 1.59
YDR354W TRP4 4 1 A 2 9 62.24 40.74 "2,089.48" "1,158.64" 930.84 1.80
YAL018C 1 1 A 3 2 10.72 8.81 359.75 250.60 109.15 1.44
YBL096C 2 1 A 3 3 10.91 8.98 366.40 255.40 111.00 1.43
YBR169C SSE2 2 1 A 3 4 17.33 27.81 581.80 790.84 -209.05 -1.36
YCL060C 3 1 A 3 5 17.99 24.75 603.96 703.75 -99.79 -1.17
YDL036C 4 1 A 3 6 14.22 8.86 477.39 251.94 225.44 1.89
YDL227C HO 4 1 A 3 7 25.61 31.52 859.71 896.46 -36.75 -1.04
YDR171W HSP42 4 1 A 3 8 102.08 98.37 "3,426.83" "2,797.58" 629.25 1.22
YDR362C 4 1 A 3 9 16.32 12.95 547.96 368.39 179.57 1.49
YAL026C DRS2 1 1 A 4 2 11.32 7.97 379.85 226.53 153.33 1.68
YBL102W SFT2 2 1 A 4 3 55.88 63.74 "1,875.82" "1,812.81" 63.02 1.03
YBR177C 2 1 A 4 4 63.31 29.03 "2,125.20" 825.60 "1,299.60" 2.57
YCL068C 3 1 A 4 5 8.33 4.47 279.51 127.16 152.35 2.20
YDL044C MTF2 4 1 A 4 6 11.73 6.96 393.88 198.07 195.81 1.99
YDL235C YPD1 4 1 A 4 7 38.71 30.20 "1,299.33" 858.83 440.50 1.51
YDR179C 4 1 A 4 8 12.77 11.05 428.60 314.12 114.48 1.36
YDR370C 4 1 A 4 9 16.70 15.30 560.62 435.13 125.49 1.29
YAL034C FUN19 1 1 A 5 2 20.89 24.21 701.32 688.59 12.73 1.02
YBL111C 2 1 A 5 3 22.38 13.67 751.39 388.69 362.70 1.93
YBR185C MBA1 2 1 A 5 4 38.42 19.96 "1,289.61" 567.78 721.83 2.27
YCLX03C 3 1 A 5 5 8.69 3.66 291.77 104.11 187.66 2.80
YDL052C SLC1 4 1 A 5 6 52.37 49.87 "1,758.05" "1,418.33" 339.73 1.24
YDL243C 4 1 A 5 7 15.56 12.95 522.24 368.30 153.94 1.42
YDR186C 4 1 A 5 8 16.48 15.01 553.30 426.75 126.55 1.30
YDR378C 4 1 A 5 9 31.13 28.08 "1,045.01" 798.50 246.50 1.31
YAL040C CLN3 1 1 A 6 2 126.65 107.34 "4,251.70" "3,052.61" "1,199.08" 1.39
YBR006W 2 1 A 6 3 22.74 11.10 763.49 315.55 447.94 2.42
YBR193C 2 1 A 6 4 14.81 15.55 497.07 442.20 54.87 1.12
YCLX11W 3 1 A 6 5 161.96 175.34 "5,436.86" "4,986.41" 450.44 1.09
YDL060W 4 1 A 6 6 29.84 37.13 "1,001.65" "1,055.98" -54.34 -1.05
YDR003W 4 1 A 6 7 23.99 23.22 805.48 660.25 145.22 1.22
YDR194C MSS116 4 1 A 6 8 66.58 47.16 "2,235.07" "1,341.29" 893.78 1.67
YDR386W 4 1 A 6 9 11.27 5.75 378.27 163.46 214.81 2.31
YAL047C 1 1 A 7 2 15.54 11.30 521.74 321.28 200.46 1.62
YBR012W-B 2 1 A 7 3 54.70 79.97 "1,836.29" "2,274.15" -437.86 -1.24
YBR201W DER1 2 1 A 7 4 21.67 19.57 727.49 556.64 170.85 1.31
YCR007C 3 1 A 7 5 25.02 15.96 840.01 453.76 386.25 1.85
YDL068W 4 1 A 7 6 18.32 13.11 614.83 372.78 242.05 1.65
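The last four numbers in each row of the report above are the normalized GF1 and GF2 intensities, followed by DIFFERENCE and RATIO. Judging only from the printed values (the source does not state the formulas), DIFFERENCE looks like normalized GF1 minus normalized GF2, and RATIO looks like a signed fold change: the larger intensity divided by the smaller, made negative when GF2 exceeds GF1. The sketch below checks that reading against two rows; the function names are hypothetical.

```python
def difference(gf1_norm: float, gf2_norm: float) -> float:
    """Assumed definition: normalized GF1 minus normalized GF2."""
    return gf1_norm - gf2_norm


def signed_ratio(gf1_norm: float, gf2_norm: float) -> float:
    """Assumed definition: larger / smaller, negative when GF2 > GF1."""
    if gf1_norm >= gf2_norm:
        return gf1_norm / gf2_norm
    return -(gf2_norm / gf1_norm)


# Normalized intensities copied from two rows of the report.
rows = {
    "YAL001C": (403.83, 209.79),   # report: DIFFERENCE 194.04, RATIO 1.92
    "YDL211C": (581.00, 1005.18),  # report: DIFFERENCE -424.18, RATIO -1.73
}

for orf, (gf1, gf2) in rows.items():
    print(orf, round(difference(gf1, gf2), 2), round(signed_ratio(gf1, gf2), 2))
```

Both rows reproduce the printed DIFFERENCE and RATIO values to two decimals, which supports the assumed definitions but does not prove them for every row.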
Structure - 3D Anatomy
Function – 1D Signal
Metadata – Annotation
Overview of data mining technology
• Data Mining vs. Data Warehousing
• Data Mining as a part of Knowledge Discovery Process
• Goals of Data Mining and Knowledge Discovery
• Types of Knowledge Discovery during Data Mining
Data Mining vs. Data Warehousing
• A Data Warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated.
• This makes it much easier and more efficient to run queries over data that originally came from different sources.
• The goal of data warehousing is to support decision making with data!
Knowledge Discovery in Databases and Data Mining
• The non-trivial extraction of implicit, previously unknown, and potentially useful information from databases.
Goals of Data Mining and KDD
• Prediction: How certain attributes within the data will behave in the future.
• Identification: Identify the existence of an item, an event, an activity.
• Classification: Partition the data into categories.
• Optimization: Optimize the use of limited resources.
Applications of data mining
• Market analysis
  - Marketing strategies
  - Advertisement
• Risk analysis and management
  - Finance and financial investments
  - Manufacturing and production
• Fraud detection and detection of unusual patterns (outliers)
  - Telecommunication
  - Financial transactions
  - Anti-terrorism (!!!)
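To make "detection of unusual patterns (outliers)" concrete, here is a minimal sketch that flags transaction amounts lying far from the mean, measured in standard deviations (a z-score test). The slides do not name a specific technique; the z-score rule, the threshold of 2.0 and the sample amounts are all assumptions chosen for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the (assumed) threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts for one customer.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(find_outliers(amounts))  # the 500 transaction is flagged
```

A single extreme value inflates the mean and standard deviation it is measured against, so real fraud systems tend to use more robust statistics or learned models; this sketch only shows the basic idea.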
Applications
Service providers in the mobile phone and utilities industries
• Mobile phone and utilities companies use data mining and business intelligence to predict ‘churn’, the term they use for a customer leaving to get their phone/gas/broadband from another provider.
• They collate billing information, customer service interactions, website visits and other metrics to give each customer a probability score, then target offers and incentives at customers they perceive to be at higher risk of churning.
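The scoring-and-targeting step described above can be sketched as follows. The slides do not say which model providers use; this example assumes a logistic model with made-up weights, and the feature names, weights and threshold are hypothetical. In practice the weights would be learned from historical customer data.

```python
import math

# Hypothetical weights; a real model would learn these from past churn data.
WEIGHTS = {
    "late_payments": 0.8,    # billing information
    "support_calls": 0.5,    # customer service interactions
    "website_visits": -0.1,  # engagement assumed to lower churn risk
}
BIAS = -2.0


def churn_probability(customer: dict) -> float:
    """Map a customer's metrics to a probability score in (0, 1)."""
    z = BIAS + sum(w * customer.get(name, 0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def customers_to_target(customers: dict, threshold: float = 0.5) -> list:
    """Return names of customers at or above the risk threshold, riskiest first."""
    scores = {name: churn_probability(c) for name, c in customers.items()}
    at_risk = [name for name, p in scores.items() if p >= threshold]
    return sorted(at_risk, key=lambda name: -scores[name])


customers = {
    "alice": {"late_payments": 0, "support_calls": 1, "website_visits": 10},
    "bob": {"late_payments": 3, "support_calls": 4, "website_visits": 1},
}
print(customers_to_target(customers))  # ['bob']
```

Only customers whose score crosses the threshold receive retention offers, which mirrors the idea of concentrating incentives on those perceived to be at higher risk.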
Retail
• Retailers segment customers into ‘Recency, Frequency, Monetary’ (RFM) groups and target marketing and promotions to those different groups.
• A customer who spends little but often and last did so recently will be handled differently from a customer who spent big but only once, some time ago. The former may receive loyalty, upsell and cross-sell offers, whereas the latter may be offered a win-back deal, for instance.
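A minimal sketch of the RFM idea, assuming each customer's history is a list of (purchase date, amount) pairs. The cut-off values (30 days, 5 purchases, 500 in spend) and the function names are hypothetical; retailers typically derive such thresholds from their own data, often by ranking customers into quantiles.

```python
from datetime import date


def rfm(purchases, today):
    """Return (recency in days, frequency, monetary total) for one customer."""
    recency = (today - max(d for d, _ in purchases)).days
    frequency = len(purchases)
    monetary = sum(amount for _, amount in purchases)
    return recency, frequency, monetary


def suggested_offer(recency, frequency, monetary,
                    recent_days=30, frequent=5, big_spend=500.0):
    """Pick an offer type from RFM values using assumed thresholds."""
    if recency <= recent_days and frequency >= frequent:
        return "loyalty / upsell / cross-sell"
    if recency > recent_days and frequency == 1 and monetary >= big_spend:
        return "win-back"
    return "standard"


today = date(2024, 6, 30)
# Spends little but often, and recently.
frequent_small = [(date(2024, 6, d), 12.0) for d in (1, 5, 9, 14, 20, 27)]
# Spent big, only once, some time ago.
one_off_big = [(date(2023, 11, 2), 900.0)]

print(suggested_offer(*rfm(frequent_small, today)))  # loyalty / upsell / cross-sell
print(suggested_offer(*rfm(one_off_big, today)))     # win-back
```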
Cross-selling and up-selling
• Cross-selling: buy an Android phone, get recommendations for a memory card or accessories
• Up-selling: buy an Android phone, get recommendations for an iPhone
Applications
E-commerce
• Cross-sells and up-sells through their websites
• One of the most famous examples is, of course, Amazon, which uses sophisticated mining techniques to drive its ‘People who viewed that product also liked this’ functionality.
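One simple way to approximate an "also viewed" feature is to count how often two products appear in the same browsing session and recommend the most frequent companions. This is only an illustrative sketch: the slides do not describe Amazon's actual method, and the session data and function names here are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_views(sessions):
    """Count, for each product, how often every other product shares a session."""
    co_views = defaultdict(Counter)
    for viewed in sessions:
        for a, b in permutations(set(viewed), 2):
            co_views[a][b] += 1
    return co_views


def also_viewed(co_views, product, top_n=3):
    """Return the products most often viewed together with `product`."""
    counts = co_views.get(product, Counter())
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "headphones"],
    ["phone", "case", "headphones"],
]
co_views = build_co_views(sessions)
print(also_viewed(co_views, "phone", top_n=2))  # ['case', 'charger']
```

Raw co-occurrence counts favour popular products, so production recommenders usually normalize them or use association-rule measures such as confidence and lift.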
Applications
Supermarkets
• Famously, supermarket loyalty card programmes are usually driven mostly, if not solely, by the desire to gather comprehensive data about customers for use in data mining.
Crime prevention agencies
• Data mining is used to spot trends across large volumes of data, helping with everything from where to deploy police manpower (where and when is crime most likely to happen?) to whom to search at a border crossing (based on age/type of vehicle, number/age of occupants, border crossing history) and even which intelligence to take seriously in counter-terrorism activities.
Applications of data mining
• Text mining (newsgroups, email, documents) and Web mining
• Stream data mining
• DNA and bio-data analysis
  - Disease outcomes
  - Effectiveness of treatments
  - Identification of new drugs