KBS - KBS - Data Mining Intoutsi/DM1.SoSe19/lectures/... · 2019. 4. 10. · Lecture 1:...
Transcript of KBS - KBS - Data Mining Intoutsi/DM1.SoSe19/lectures/... · 2019. 4. 10. · Lecture 1:...
-
Data Mining I
Summer semester 2019
Lecture 1: Introduction
Lectures: Prof. Dr. Eirini Ntoutsi
TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali
Fakultät für Elektrotechnik und InformatikInstitut für Verteilte Systeme
AG Intelligente Systeme - Data Mining group
-
About me
03/2016 - present Associate Professor Faculty of Electrical Engineering & Computer Science Leibniz University Hannover L3S Research Center (since May 2016)
02/2012 - 02/2016 Post-doctoral researcher & lecturer Institute for Informatics, LMU Munich, Germany
02/2010 - 01/2012 Alexander von Humboldt postdoc fellowInstitute for Informatics, LMU Munich, Germany
2009: Data Mining Expert National Hellenic Organization (OTE), Athens, Greece
04/2007 – 02/2009 Co-Founder and AI expert
NeeMo Startup, Greece
09/2003 – 09/2008 PhD in Data MiningUniversity of Piraeus, Athens, Greece
09/2001 – 09/2003 MSc, Computer Science/ Text MiningPolytechnic School, University of Patras, Greece
09/1996 – 09/2001 Diploma, Computer Engineering and Informatics/ AI GamesPolytechnic School, University of Patras, Greece
2Learning from streaming data
Current focus areas:• Data Stream Mining/ Adaptive Machine Learning• Responsible AI: Fairness-Aware Machine Learning
-
Outline
■ Why to study Data Mining?
■ Why we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logististics
■ Things you should know from this lecture
■ Homework/ Tutorial
3Data Mining I @SS19: Introduction
-
Why to study Data Mining/Machine Learning – famous quotes*
■ “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)
■ “Machine learning is the next Internet” (Tony Tether, Director, DARPA)
■ “Machine learning is the hot new thing” (John Hennessy, President, Stanford)
■ “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
■ “Machine learning is going to result in a real revolution” (Greg Papadopoulos, Former CTO, Sun)
■ “Machine learning today is one of the hottest aspects of computer science” (Steve Ballmer, CEO, Microsoft)
4
*Source: Pedro Domingos http://courses.cs.washington.edu/courses/cse446/15sp/slides/intro.pdf
Data Mining I @SS19: Introduction
Disclaimer: I use the terms data mining and machine learning (sometimes also Artificial Intelligence (AI) interchangeably here and through the lecture.We will discuss the similarities/differences later. In both cases, we talk to learning from data.
-
Data Mining – Data Science – Big Data – Machine Learning – Deep Learning Analytics …
■ New fancy words for knowledge discovery from data
❑ Data mining, machine learning have been focusing on knowledge discovery from data for decades
❑ Well defined set of tasks and solutions
■ Big data and analytics are more business terms and ill-defined
■ The same holds today for AI
5
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.“
Source: Dan Ariely, Duke University
Data Mining I @SS19: Introduction
-
Ever increasing interest … the ``rebranding’’ effect
6
Source: Google trends, query on 9.4.2019
Data Mining I @SS19: Introduction
-
Why to study Data Mining - Data Scientist: The sexiest job of 21st century
7
“If “sexy” means having rare qualities that are much in demand, data scientists are alreadythere. They are difficult and expensive to hire and, given the very competitive market for theirservices, difficult to retain. There simply aren’t a lot of people with their combination ofscientific background and computational and analytical skills.”
Source: Harvard Business Review. Data Scientist: The Sexiest Job of the 21st Century. October 2012 link
Data Mining I @SS19: Introduction
Source: https://www.slideshare.net/IBMBDA/myths-and-mathemagical-superpowers-of-data-scientists
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/https://www.slideshare.net/IBMBDA/myths-and-mathemagical-superpowers-of-data-scientists
-
A good conjuncture for ML/DM/DS (data-driven learning)
8Data Mining I @SS19: Introduction
Data deluge Machine Learningadvances
Computer power Enthusiasm
-
World-wide competition on Artificial Intelligence (AI)
■ From FORSCHUNGSGIPFEL 2019 - Künstliche Intelligenz – Innovationstreiber einer neuen Generation, 19/3/2019, Berlin
■ Cédric Villani’s talk on the geopolitics of AIThere are 3 fierce competitions
■ Competition for human talent
■ Competition for infrastructure
■ Competition for data
■ More info on
❑ Track #fogipf19 in Twitter
❑ Check videos online: http://www.forschungsgipfel.de/2019/videos
9Data Mining I @SS19: Introduction
https://twitter.com/hashtag/fogipf19?src=hashhttp://www.forschungsgipfel.de/2019/videos
-
Outline
■ Why to study Data Mining?
■ Why we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logististics
■ Things you should know from this lecture
■ Homework/ Tutorial
11Data Mining I @SS19: Introduction
-
Why we need Data Mining
■ Huge amounts of data are collected nowadays from different application domains
■ “We are drowning in information but starving for knowledge” John Naibett link
■ The amount and the complexity of the collected data does not allow for manual analysis.
Telecommunication
Astronomy
Banks Biology
Internet
Supermarkets
12
IoT
Data Mining I @SS19: Introduction
http://www.kdnuggets.com/news/2007/n06/3i.html
-
Examples of data sources: The Internet
■ Internet users
13
Web 2.0: A world of opinionsUser generated content
Data Mining I @SS19: Introduction
Source: http://www.internetlivestats.com/internet-users/
-
Examples of data sources: Internet of things
■ The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics, software, sensors, and network connectivity, which enables these objects to collect and exchange data.
Source: https://en.wikipedia.org/wiki/Internet_of_Things
14
Image source:http://tinyurl.com/prtfqxf
Source: http://blogs.cisco.com/diversity/the-internet-of-things-infographic
During 2008, the number of things connected to the internet surpassed the number of people on earth… By 2020 there will be 50 billion … vs 7.3 billion people (2015).
These things are everything, smartphones, tablets, refrigerators …. cattle.
Data Mining I @SS19: Introduction
https://en.wikipedia.org/wiki/Internet_of_Things
-
Examples of data sources: data intensive science
15
Slide from:http://research.microsoft.com/en-us/um/people/gray/talks/nrc-cstb_escience.ppt
“Increasingly, scientific breakthroughs willbe powered by advanced computingcapabilities that help researchersmanipulate and explore massive datasets.”
-The Fourth Paradigm – Microsoft
Examples of e-science applications:• Earth and environment• Health and wellbeing
− E.g., The Human Genome Project (HGP)
• Citizen science• Scholarly communication• Basic science
− E.g., CERN
Data Mining I @SS19: Introduction
-
Examples of data sources: Manufacturing
■ Andrew Ng Says Factories Are AI’s Next Frontier
Source: https://www.technologyreview.com/s/609770/andrew-ng-says-factories-are-ais-next-frontier/
16
Image source: https://images.readwrite.com/wp-content/uploads/2018/03/AAEAAQAAAAAAAAueAAAAJDY1NmFl
N2NhLWExZTUtNDRhNy1iMWQ5LTViZGM3NTFlODczYQ.jpg
Companies are making major investments in AI and industrial analytics to help drive their digital transformation
Data Mining I @SS19: Introduction
Image source: https://cdn-sv1.deepsense.ai/wp-content/uploads/2018/04/Spot-the-flaw-Visual-quality-control-in-
manufacturing-1140x337.jpg
https://www.technologyreview.com/s/609770/andrew-ng-says-factories-are-ais-next-frontier/
-
Examples of data sources: We … the data subjects
■ Wherever we go, we are "datafied".
■ Smartphones are tracking our locations.
■ We leave a data trail in our web browsing.
■ Interaction in social networks.
■ Privacy is an important issue … not covered though in this lecture → privacy aware data mining
❑ Check the EU General Data Protection Regulation (https://eugdpr.org/)
■ e.g., https://www.whitecase.com/publications/article/chapter-5-key-definitions-unlocking-eu-general-data-protection-regulation
17Data Mining I @SS19: Introduction
https://eugdpr.org/
-
From data to knowledge of different types
18Data Mining I @SS19: Introduction
Data Methods Knowledge
Call records
Movie ratings
Telescope images
Outlier Detection Detect fraud cases
Collaborative filtering Recommend movies to users
ClassificationIs it an «early», «intermediate» or «late formation» star?
News articles ClusteringWhat are the topics people discuss about in the news today?
-
Short break (5’) – Get to know us better
■ What is your field of study?
❑ Informatik? Informationstechnik? Elektrotechnik?
■ What is your interest in the data mining field?
■ Are there people from Physics, Medicine, Engineering Sciences in the audience?
19Data Mining I @SS19: Introduction
-
Outline
■ Why to study Data Mining?
■ Why we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logististics
■ Things you should know from this lecture
■ Homework/ Tutorial
20Data Mining I @SS19: Introduction
-
What is KDD
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.
[Fayyad, Piatetsky-Shapiro, and Smyth 1996]
Remarks:
● valid: the discovered patterns should also hold for new, previously unseen problem instances.
● novel: at least to the system and preferably to the user
● potentially useful: they should lead to some benefit to the user or task
● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some post-processing
21Data Mining I @SS19: Introduction
Clarification: The term databases does not refer exclusively to relational databases storing structured data … it can be any data storage and also structured, semi-structured, non-structured data
-
The KDD process and the Data Mining step
22
Patterns
Knowledge
[Fayyad, Piatetsky-Shapiro & Smyth, 1996]
Transformed data
Target data
Preprocessed data
Sele
ctio
n:
•Se
lect
a r
elev
ant
dat
aset
or
focu
s o
n a
su
bse
t o
f a
dat
aset
•Fi
le /
DB
/
Pre
pro
cess
ing/
Cle
anin
g:•
Inte
grat
ion
of
dat
a fr
om
d
iffe
ren
t d
ata
sou
rces
•N
ois
e re
mo
val
•M
issi
ng
valu
es
Tran
sfo
rmat
ion
:•
Sele
ct u
sefu
l fea
ture
s•
Feat
ure
tra
nsf
orm
atio
n/
dis
cret
izat
ion
•D
imen
sio
nal
ity
red
uct
ion
Dat
a M
inin
g:•
Sear
ch f
or
pat
tern
s o
f in
tere
st
Eval
uat
ion
:•
Eval
uat
e p
atte
rns
bas
ed o
n
inte
rest
ingn
ess
mea
sure
s•
Stat
isti
cal v
alid
atio
n o
f th
e M
od
els
•V
isu
aliz
atio
n•
Des
crip
tive
Sta
tist
ics
Data
Data Mining I @SS19: Introduction
-
A modern version: The Data Science process
23Data Mining I @SS19: Introduction
-
The interdisciplinary nature of KDD 1/2
24
KDD
Machine Learning
Databases
Statistics
Data visualization
Pattern recognition
Algorithms Other disciplines
Data Mining I @SS19: Introduction
-
The interdisciplinary nature of KDD 2/2
25
Statistics Machine Learning
Databases
KDD
Model based inferenceFocus on numerical
data
Theory + methodsFocus on small datasets
Scalability to large data setsNew data types (web data, micro-arrays, social data ...)
Integration with commercial databases[Chen, Han & Yu 1996]
[Berthold & Hand 1999] [Mitchell 1997]
Data Mining I @SS19: Introduction
-
How do machines learn?
■ ML “gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)
■ We don’t codify the solution. We don’t even know it!
■ Data is the key & the learning algorithm
26Data Mining I @SS19: Introduction
Algorithms
Models
Models
(semi)Automatic
decision making
Data
How can we build computer programs that automatically improve with experience?
Tom Mitchell, Machine Learning book
-
More formally: How do machines learn?
■ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tom Mitchell, Machine Learning 1997.
■ Example: A backgammon learning problem
❑ Task T: playing backgammon
❑ Performance measure P: % of games won against opponents
❑ Training experience E: playing practice games against itself
■ Example: Exam performance
❑ Task T: predict whether a student will pass the final DM exam or not
❑ Experience E: historical records of students that took the DM exam
❑ Performance measure P: % of correctly identified students
27Data Mining I @SS19: Introduction
-
(Machine) Learning from experience/feedback 1/2
■ “Experience comes in terms of data (the so called, instances or examples) from the specific problem/ application”
■ Datasets consists of instances (also known as examples or objects)
❑ e.g., in a university database: students, professors, courses, grades,…
❑ e.g., in a library database: books, users, loans, publishers, ….
❑ e.g., in a movie database: movies, actors, director,…
■ Instances are described through features (also known as attributes or variables)
❑ E.g. a course is described in terms of a title, description, lecturer, teaching frequency etc.
❑ An easy to visualize example: if our data are in a database table, the rows are the instances and the columns are the features.
28Data Mining I @SS19: Introduction
-
(Machine) Learning from experience/feedback 2/2
■ Except for the instance description, we might also have feedback on those instances from some “teacher”/”expert“
❑ E.g., whether a student passed the exam
■ The direct feedback is known as label, i.e., each instance is associated with a label labeleddataset
■ But we might have no feedback at all unlabeled dataset
■ There might be also indirect feedback
29Data Mining I @SS19: Introduction
Unlabeled datasetLabeled dataset
Lecture 2 is devoted on getting to know our data!!!
-
Short break (5’) – Modeling students data for the exam performance task
■ Recall our learning example
■ Example: Exam performance
❑ Task T: predict whether a student will pass the final DM exam or not
❑ Experience E: historical records of students that took the DM exam
❑ Performance measure P: % of correctly identified students
■ If students are the learning instances, what sort of features could I use to describe each of them?
■ What could be the feedback (direct, indirect) for the learning model (if any)?
30Data Mining I @SS19: Introduction
-
Outline
■ Why to study Data Mining?
■ Why we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logististics
■ Things you should know from this lecture
■ Homework/ Tutorial
31Data Mining I @SS19: Introduction
-
Different learning tasks
Based on the feedback we have on the data, we can distinguish between:
■ Direct-feedback instances
❑ the correct response /label is provided for each instance by the “teacher”
❑ e.g., good or bad product
■ No-feedback instances
❑ no evaluation/label of the instance is provided, since there is no “teacher“
❑ e.g., no information on whether a product is good or bad, just the description of the product/instance
■ Indirect-feedback instances
❑ less feedback is given, since not the proper action, but only an evaluation of the chosen action is given by the teacher
32Data Mining I @SS19: Introduction
Supervised learning
Reinforcement learning
Unsupervised learning
-
Different learning tasks: Supervised learning
■ Supervised learning/ Predictive:
❑ A description of the instances and their class labels is available (training set)
❑ The goal is to learn a mapping from the instances to the class labels, i.e., given a future unseen instance to predict its class label
■ Typical examples covered in this lecture:
❑ Classification
❑ Outlier detection
❑ Regression
33Data Mining I @SS19: Introduction
-
Classification: an example
■ The goal is to learn a mapping from the “height, width space” to the class space (nails, screw,paper clips)
■ For the new objects, the result of the classification if one of the class labels {nails, screw,paper clips}
34Data Mining I @SS19: Introduction
Screw
Nails
Paper clips
Hei
ght
[cm
]
Width[cm]
instance width height class
1 2,6 4,5 Screw
2 3,7 7,3 Nails
3 4,1 6,5 Paper Clips
4 8,5 8,1 Screw
5 9,5 5,5 Nails
… … … …
New objectNew object
-
Classification applications 1/2
■ Application: Fraud Detection
❑ Goal: Predict fraudulent cases in credit card transactions.
❑ Approach:
■ Use credit card transactions and the information on its account-holder as attributes.
❑ When does a customer buy, what does he buy, how often he pays on time, etc
■ Label past transactions as fraud or fair transactions. This forms the class attribute.
■ Learn a model for the class of the transactions.
■ Use this model to detect fraud by observing credit card transactions on an account.
35Data Mining I @SS19: Introduction
-
Classification applications 2/2
■ Application: Churn prediction in telco
❑ Goal: Predict whether a customer is likely to be lost to a competitor
❑ Approach:
■ Use detailed record of transactions with each of the past and present customers, to find attributes.
❑ How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
■ Label the customers as loyal or disloyal (class attribute).
■ Find a model for customer loyalty
■ Use this model to predict churn and organize possible retain strategies.
36Data Mining I @SS19: Introduction
-
Example: Google News
37Data Mining I @SS19: Introduction
-
A huge variety of classification algorithms
38Data Mining I @SS19: Introduction
Decision trees k nearest neighbours
Support vector machines
Neural networks Bayesian classifiers
Ensembles
-
Supervised learning: Regression
■ Similar to classification, but the feature-result to be learned is continuous rather than discrete.
■ Goal: Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
39Data Mining I @SS19: Introduction
Given this data, a friend has a house 750 square feet - how much can they be expected to get?
Source: Andrew Ng ML course, Coursera
-
Application: Precision farming
■ Create a production curve depending on multiple parameters like soil characteristics, weather, used fertilizers.
■ Only the appropriate amount of fertilizers given the environmental settings (soil, weather) will result in maximum yield.
■ Controlling the effects of over-fertilization on the environment is also important
40
Water capacity
Soil parametersWeatherFertilizers
…
Fertilizers
productionproduction
curve
Data Mining I @SS19: Introduction
-
Different learning tasks: Unsupervised learning
■ Unsupervised learning/ Descriptive:
❑ Only a description of the instances is available
❑ No feedback/labels are available
❑ The goal is to discover groups of similar instances
■ Typical subtasks covered in this lecture:
❑ clustering
❑ association rules mining
❑ outlier detection
41Data Mining I @SS19: Introduction
-
Clustering: an example
■ Each point described in terms of its height and width
■ No information on the actual classes (nails, paper clips) is available to the clustering algorithm.
42
Cluster 1Cluster 2
Hei
ght
[cm
]
Width[cm]
Data Mining I @SS19: Introduction
instance width height
1 2,6 4,5
2 3,7 7,3
3 4,1 6,5
4 8,5 8,1
5 9,5 5,5
… … …
-
Clustering applications 1/2
Application: Market Segmentation
■ Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
■ Approach:
❑ Collect different attributes of customers based on their geographical and lifestyle related information.
■ E.g., age, income, education, family status, ….
❑ Find clusters of similar customers.
❑ Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
43Data Mining I @SS19: Introduction
-
Clustering applications 2/2
Application: Document clustering
■ Find groups of documents (topics) that are similar to each other based on the important terms appearing in them.
■ Approach:
❑ Identify important terms in each document.
❑ Form a similarity measure between documents.
❑ Cluster based on the similarity measure.
■ Gain:
❑ Help the end user to navigate in the collection of documents (based on the extracted clusters).
❑ Utilize the clusters to relate a new document or search term to clustered documents.
■ Check for example, Google News.
44Data Mining I @SS19: Introduction
-
Example: Google News
45Data Mining I @SS19: Introduction
-
A huge variety of clustering algorithms
46Data Mining I @SS19: Introduction
Partitioning methods (k-Means)
Grid-based methods (CLIQUE)
Model-based methods (DBSCAN)
Hierarchical methods
Constraint-based methods
Model-based methods(EM)
-
Unsupervised learning: Association rules mining
■ Task: Find all rules in the database, in the following form:
If x, y, z are contained in a set M, then t is also contained in M with a probability of at least X%.
47
a,b,c,d,eb,c,da,b,c,da,b,c,d,ea,c,e,fd,c,e,fa,b,c,d,f
In 5 out of 7 cases (~71 %)b,c,d appear together
a,b,c,d,eb,c,da,b,c,da,b,c,d,ea,c,e,fd,c,e,fa,b,c,d,f
In 5 out of 5 cases (100 %) it holds that:If b,c then d also exists.
Data Mining I @SS19: Introduction
• a= milk• b=cheese• c =wine
• d= pasta• e= yogurt• f = apples
-
Application: Market basket analysis
■ Result:
❑ Frequently purchased items together may be better to be positioned close to each other: E.g. since diapers are often purchased together with beers => Place beer in the way from diapers to the checkout
❑ Generate recommendations for customers with similar baskets:=> e.g. Customers that bought „Star Wars“, might be also interested in „The lord of the rings “.
48
Shopping basket
DataWarehouse
Possible generalizations:• Paprika-Chips Snacks • Enrichment of customer data
Association rules
Data Mining I @SS19: Introduction
-
Unsupervised|Supervised Learning: Outlier detection
■ Outlier detection is defined as identification of non-typical data
■ Outliers might indicate
❑ possible abuse of credit cards, mobile phones
❑ data errors
❑ device failures
49Data Mining I @SS19: Introduction
-
Application
■ Analysis of the SAT.1-Ran-Soccer-Database (Season 1998/99)
❑ 375 players
❑ Primary attributes: Name, #games, #goals, playing position (goalkeeper, defense, midfield, offense),
❑ Derived attribute: Goals per game
❑ Outlier analysis (playing position, #games, #goals)
■ Result: Top 5 outliers
50
Rank Name # games #goals position Explanation
1 Michael Preetz 34 23 Offense Top scorer overall
2 Michael Schjönberg 15 6 Defense Top scoring defense player
3 Hans-Jörg Butt 34 7 Goalkeeper Goalkeeper with the most goals
4 Ulf Kirsten 31 19 Offense 2nd scorer overall
5 Giovanne Elber 21 13 Offense High #goals/per game
Data Mining I @SS19: Introduction
Note: “Outliers” is not necessarily a negative term.
-
Short break (5’) – Learning from the student data
■ Recall our learning example
■ Example: Exam performance
❑ Task T: predict whether a student will pass the final DM exam or not
❑ Experience E: historical records of students that took the DM exam
❑ Performance measure P: % of correctly identified students
■ If students are the learning instances, what sort of features could I use to describe each of them?
■ What could be the feedback/label for the learning model (if any)?
■ What could be a supervised learning task here?
❑ For classification? For prediction?
■ What could be an unsupervised learning task ?
❑ For clustering, frequent itemsets mining?
■ What could be an outlier detection problem here?
51Data Mining I @SS19: Introduction
-
Outline
■ Why to study Data Mining?
■ Why we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logististics
■ Things you should know from this lecture
■ Homework/ Tutorial
52Data Mining I @SS19: Introduction
-
Course logistics 1/3
■ Class schedule
❑ Lectures: Wednesdays, 12:15 - 13:45, Multimedia-Hörsaal (3703 - 023), Appelstraße 4.
❑ Tutorials: Monday: 10:00 - 11:30 , Monday: 13:30 - 15:00, Tuesday: 11:45 - 13:15, Tuesday: 13:30 - 15:00, Room 235, Gebaeude 3703, Appelstraße 4.
■ StudIP as a common information sharing place
❑ Up to date announcements and material
❑ Use the forum for your questions. They might benefit everyone!
■ Exam:
❑ Written exam, 90’
■ You are allowed to bring a hand-written A4 with formulas etc (No need to memorize) – Each student should have her own A4 (copied are not allowed)
❑ The exam will be based on the material discussed in the class plus the tutorials.
■ Exam date: Monday 28.8.2019, 08:30-11:00, Rooms: F 102, F 303
53Data Mining I @SS19: Introduction
-
Overview of the lectures (current planning)
1. Introduction
2. Getting to know our data
3. Association Rules Mining
4. Clustering
6. Classification
7. Outlier Detection
54Data Mining I @SS19: Introduction
-
Course logistics 2/3
■ Projects
❑ Focus on the complete KDD pipeline for two different learning tasks:
■ Classification: 15/5/2019 & Clustering: 26/6/2019
■ Groups of 2 (Please form the teams by yourselves)
■ Goal: how to run a data mining case study? From data preprocessing to transformation, learning algorithm, evaluation and presentation of the results. Both analysis and presentation part are important.
■ We will use Kaggle for result submission (but you have to submit the report separately)
■ We will have a poster session at the end where each team present its results
■ Bonus schema
❑ Pass both projects: you switch to the next best grade
■ e.g., from 1.7→1,3
❑ Each member ``inherits’’ the grade of the group
❑ Extra bonus for those that score best in Kaggle (system) & those with the best poster (voting)
55Data Mining I @SS19: Introduction
-
Course logistics 3/3
56Data Mining I @SS19: Introduction
■ Teaching Assistants
❑ Vasileios Iosifidis
■ Room 240, 2nd floor, Appelstraße 4
❑ Tai Le Quy
■ Room 010, Ground floor, Appelstraße 4
❑ Maximilian Idahl
■ -
❑ Wazed Ali
❑ Shaheer Asghar
■ Lecturer
❑ Prof. Dr. Eirini Ntoutsi
■ Room 203, 2nd floor, Appelstraße 4
Contact via email:[email protected]
Please use [DM1] in the subject
mailto:[email protected]
-
Tutorials: Organization
■ 4 tutorial groups
❑ Times:
■ Monday, 10:00 – 11:30 and 13:30 – 15:00
■ Tuesday, 11:45 – 13:15 and 13:30 – 15:00
❑ Room: 235, Appelstraße 4
❑ Registration for groups in Stud.IP, unlocked today at 20:00
57Data Mining I @SS19: Introduction
-
Tutorials: Why to attend
Why should you attend the tutorials?
Solving theoretical and algorithmic parts is an excellent preparation for the exam
Working on the implementation part is useful for the bonus projects
Each tutorial will consist of
1. a theoretical part (e.g., properties of an algorithm)
2. an algorithmic part (e.g., applying an algorithm on a particular dataset)
3. an implementation part (e.g., how to run a data mining analysis in Python)
58SoSe19: DM I - Tutorial
-
Tutorials: structure
1 worksheet per week
Announced after the lecture, so you have the chance to prepare beforehand
Solutions will be available via Stud.IP
Theoretical and algorithmic parts will be mainly presented at the blackboard (with help from you)
Implementation part in form of jupyter notebooks (interactive python)
Use Stud.IP for any tutorial related question.
This is the fastest way to get an answer
Your question might be relevant to other students
Posing and answering questions is a great way to learn
Or send an e-mail to [email protected]
59SoSe19: DM I - Tutorial
-
Tutorials: For next week (1st tutorial)
Take a look at python and jupyter notebooks
Installation:
Anaconda python distribution includes most packages needed for this course
Step-by-step installation guide for Windows/Mac/Linux: https://docs.anaconda.com/anaconda/
Quick start guide for jupyter notebooks: https://jupyter.readthedocs.io/en/latest/content-quickstart.html
Alternative:
Jupyter notebook environment Google Colab: https://colab.research.google.com/notebooks/welcome.ipynb
Free, requires no setup, runs entirely in the cloud
60SoSe19: DM I - Tutorial
https://docs.anaconda.com/anaconda/https://jupyter.readthedocs.io/en/latest/content-quickstart.htmlhttps://colab.research.google.com/notebooks/welcome.ipynb
-
Tutorials: Python resources
There are lots of great python tutorials
Recommended: http://scipy-lectures.org/intro/language/python_language.html
Official: https://docs.python.org/3/tutorial/
If you prefer video tutorials:
https://pythonprogramming.net/python-fundamental-tutorials/ or
https://youtu.be/YYXdXT2l-Gg
And on jupyter notebooks
https://medium.com/codingthesmartway-com-blog/getting-started-with-jupyter-notebook-for-python-4e7082bd5d46 or
http://opentechschool.github.io/python-data-intro/core/notebook.html or
https://youtu.be/HW29067qVWk
61SoSe19: DM I - Tutorial
http://scipy-lectures.org/intro/language/python_language.htmlhttps://docs.python.org/3/tutorial/https://pythonprogramming.net/python-fundamental-tutorials/https://youtu.be/YYXdXT2l-Gghttps://medium.com/codingthesmartway-com-blog/getting-started-with-jupyter-notebook-for-python-4e7082bd5d46http://opentechschool.github.io/python-data-intro/core/notebook.htmlhttps://youtu.be/HW29067qVWk
-
Tutorials 1-2 plan
62SoSe19: DM I - Tutorial
Tutorial 1 + 2 will include introductions to
Python basics
Arrays in NumPy
Data manipulation with Pandas
Visualization using matplotlib
Data mining and analysis tools in scikit-learn
Goal: Running a data analysis process in python, from data selection to pattern evaluation
Useful for the projects
Learning-by-doing
-
Textbook and recommended readings
■ Textbook:
❑ Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Addison-Wesley, 2014
❑ New edition is expected in May 2019
■ Recommended readings
❑ Meira and Zaki, Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014
❑ Mitchell T. M., Machine Learning, McGraw-Hill, 1997
❑ Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011
❑ C. Aggarwal, Data Mining the textbook, 2015
❑ Witten I. H., Frank E., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2016.
63Data Mining I @SS19: Introduction
http://images-eu.amazon.com/images/P/0071154671.03.LZZZZZZZ.jpg
-
Online resources
■ Machine Learning class by Andrew Ng, Stanford
❑ http://ml-class.org/
■ Tom Mitchel’s lectures on youtube
❑ www.youtube.com/playlist?list=PLAJ0alZrN8rD63LD0FkzKFiFgkOmEtltQ
■ Kdnuggets: Data Mining and Analytics resources
❑ http://www.kdnuggets.com/
64Data Mining I @SS19: Introduction
-
Tools
■ Several options for either commercial or free/ open source tools
❑ Check an up to date list at: http://www.kdnuggets.com/software/suites.html
■ Commercial tools offered by major vendors
❑ e.g., IBM, Microsoft, Oracle …
■ Free/ open source tools
65
Weka
Elki
R
SciPy + NumPy
OrangeRapid Miner (free, commercial versions)
Data Mining I @SS19: Introduction
http://www.kdnuggets.com/software/suites.html
-
Outline
■ Why to study Data Mining?
■ Why we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logististics
■ Things you should know from this lecture
■ Homework/ Tutorial
66Data Mining I @SS19: Introduction
-
Things you should know from this lecture
■ KDD definition
■ KDD process
■ DM step
■ Supervised vs Unsupervised learning
■ Main DM tasks
❑ Clustering
❑ Classification
❑ Regression
❑ Association rules mining
❑ Outlier detection
67Data Mining I @SS19: Introduction
-
Outline
■ Why to study Data Mining?
■ Why we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logististics
■ Things you should know from this lecture
■ Homework/ Tutorial
68Data Mining I @SS19: Introduction
-
Homework/ Tutorial
■ Homework: Think of some real world applications that you find suitable for Data Mining.
❑ Why?
❑ What type of patterns would you look for?
❑ Would you approach it as a supervised or unsupervised learning task?
■ Readings:
❑ Tan P.-N., Steinbach M., Kumar V book, Chapter 1.
❑ U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.
69Data Mining I @SS19: Introduction
-
Acknowledgement
■ The slides are based on
❑ KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, EshrefJanuzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)
❑ Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/
❑ Pedro Domingos Machine Lecture course slides at the University of Washington
❑ Machine Learning book by T. Mitchel slides at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html
❑ Thank you to all TAs contributing to their improvement, namely Vasileios Iosifidis, Damianos Melidis, Tai Le Quy, Han Tran
70Data Mining I @SS19: Introduction
http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html