Post on 06-Apr-2017
Lecture Format
• Explanation of the basic terms and concepts
• The terms and most of the slides are in English; the explanation is in Arabic
• Detailed explanation of the fundamentals
• Quick pointers to well-known algorithms
• Simple practical examples
• A suggested learning plan
What is Machine Learning?
• A science that enables the computer to learn on its own instead of being programmed in detail
• It distills the essence of the data by building models, and makes decisions and future predictions based on them
Regression
•Regression analysis is a statistical process for estimating the relationships among variables
•Used to predict continuous outcomes
Regression Examples
Linear Regression
Line equation: y = b + a·x (intercept b, slope a)
Model/hypothesis: ŷ = w0 + w1·x
[Figure: a fitted line through data points; x = square meters, y = price]
Linear Regression
[Figure: fitted line ŷ = w0 + w1·x; x = square meters, y = price (×1000)]
Example: w0 = 50, w1 = 1.8, x = 500 ⇒ ŷ = 50 + 1.8·500 = 950
Linear Regression
How to quantify error?
[Figure: data points with vertical residuals to the fitted line; x = square meters, y = price]
Linear Regression
Residual Sum of Squares (RSS):
RSS(w0, w1) = Σ_{i=1..N} (y_i − ŷ_i)², where ŷ_i = w0 + w1·x_i
This is the cost function.
Linear Regression
How to choose the best model?
Choose the w0 and w1 that give the lowest RSS value = find the best line.
[Figure: several candidate lines through the data; x = square meters, y = price]
Optimization
Image from https://ccse.lbl.gov/Research/Optimization/index.html
Optimization (convex)
[Figure: convex curve; derivative < 0 left of the minimum, derivative = 0 at the minimum, derivative > 0 right of it]
"The gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction." - Wikipedia
Optimization (convex)
Image from http://codingwiththomas.blogspot.com/2012/09/particle-swarm-optimization.html
[Figure: bowl-shaped RSS surface over the (w0, w1) plane]
Optimization (convex)
• To find the best line, there are two ways:
  • Analytical (normal equation)
  • Iterative (gradient descent)
• First, let's compute the gradient of our cost function:

∇RSS(w0, w1) = [ −2 Σ_{i=1..N} (y_i − ŷ_i) ; −2 Σ_{i=1..N} (y_i − ŷ_i)·x_i ]

where ŷ_i = w0 + w1·x_i
Gradient Descent
w_{t+1} = w_t − η·∇RSS, i.e.

[w0_{t+1} ; w1_{t+1}] = [w0_t ; w1_t] − η·[ −2 Σ_{i=1..N} (y_i − ŷ_i) ; −2 Σ_{i=1..N} (y_i − ŷ_i)·x_i ]

Update the weights to minimize the cost function. η is the step size (an important hyper-parameter).
Gradient Descent
Image from wikimedia.org
Linear Regression: Algorithm
• Objective: min over w0, w1 of J(w0, w1), here J(w0, w1) = RSS(w0, w1)
• Initialize w0, w1, e.g. random numbers or zeros
• for number of iterations or until a stopping criterion:
  • Compute the gradient: ∇J
  • W_{t+1} = W_t − η·∇J, where W = [w0 ; w1]
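The algorithm above can be sketched in plain Python; the toy data, the step size η, and the iteration count below are illustrative choices, not from the lecture:

```python
# Minimal gradient-descent sketch for simple linear regression.

def fit_line(xs, ys, eta=0.001, iters=20000):
    """Fit y_hat = w0 + w1*x by minimizing RSS with gradient descent."""
    w0, w1 = 0.0, 0.0                                  # initialize with zeros
    for _ in range(iters):
        # Gradient of RSS = sum (y - y_hat)^2:
        #   dRSS/dw0 = -2 * sum(y - y_hat)
        #   dRSS/dw1 = -2 * sum((y - y_hat) * x)
        g0 = -2.0 * sum(y - (w0 + w1 * x) for x, y in zip(xs, ys))
        g1 = -2.0 * sum((y - (w0 + w1 * x)) * x for x, y in zip(xs, ys))
        w0 -= eta * g0                                 # step against the gradient
        w1 -= eta * g1
    return w0, w1

# Noise-free points generated from the slide's example weights w0=50, w1=1.8
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [50 + 1.8 * x for x in xs]
w0, w1 = fit_line(xs, ys)   # recovers approximately (50, 1.8)
```

On noise-free data the loop recovers the generating weights almost exactly; with a step size that is too large, the updates diverge instead.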
Model/hypothesis: ŷ = w0 + w1·x
Cost function: RSS(w0, w1) = Σ_{i=1..N} (y_i − ŷ_i)²
Optimization: w_{t+1} = w_t − η·∇RSS
The Essence of Machine Learning
Linear Regression: Multiple features
• Example: for house pricing, in addition to size in square meters, we can use city, location, number of rooms, number of bathrooms, etc.
• The model/hypothesis becomes:

ŷ = w0 + w1·x_1 + w2·x_2 + ⋯ + wn·x_n, where n = number of features
Representation
• Vector representation of n features:

ŷ = w0 + w1·x_1 + w2·x_2 + ⋯ + wn·x_n

x = [x_0 = 1, x_1, x_2, …, x_n]ᵀ    w = [w0, w1, w2, …, wn]ᵀ

ŷ = wᵀx = [w0 w1 w2 ⋯ wn] · [x_0, x_1, x_2, …, x_n]ᵀ
Representation
• Matrix representation of m data samples and n features:

ŷ^(i) = w0 + w1·x_1^(i) + w2·x_2^(i) + ⋯ + wn·x_n^(i), where i denotes the i-th data sample

X = [ x_0^(0)  x_1^(0)  ⋯  x_n^(0)
      x_0^(1)  x_1^(1)  ⋯  x_n^(1)
      ⋮        ⋮        ⋱  ⋮
      x_0^(m)  x_1^(m)  ⋯  x_n^(m) ]    (size m × n)

w = [w0, w1, w2, …, wn]ᵀ    (size n × 1)

ŷ = [ŷ^(0), ŷ^(1), ŷ^(2), …, ŷ^(m)]ᵀ = X·w
Analytical solution (normal equation)
w = (XᵀX)⁻¹·Xᵀ·y
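As a sketch, the normal equation can be evaluated directly with NumPy; the toy data is illustrative, and `np.linalg.solve` is used instead of forming the inverse explicitly, which is numerically safer:

```python
import numpy as np

# Toy data generated from y = 3 + 2x, so the solution should be w = [3, 2]
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3 + 2 * x

# Prepend a column of ones so that w[0] acts as the intercept w0
X = np.column_stack([np.ones_like(x), x])

# Normal equation: w = (X^T X)^{-1} X^T y, solved as a linear system
w = np.linalg.solve(X.T @ X, X.T @ y)   # approximately [3., 2.]
```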
Analytical vs. Gradient Descent
• Gradient descent: must select the parameter η. Analytical solution: no parameter selection.
• Gradient descent: needs many iterations. Analytical solution: no iterations.
• Gradient descent: works with a large number of features. Analytical solution: slow with a large number of features.
Demo
• Matrix operations
• Simple linear regression implementation
• Scikit-learn library's linear regression
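For the scikit-learn item of the demo, a minimal sketch might look like the following; the size/price numbers are made up to match the slide's w0 = 50, w1 = 1.8 example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sizes in square meters and prices in thousands (illustrative toy data)
X = np.array([[100.0], [200.0], [300.0], [400.0]])  # one feature per row
y = 50 + 1.8 * X.ravel()                            # prices from w0=50, w1=1.8

model = LinearRegression()       # fits the intercept (w0) by default
model.fit(X, y)

intercept = model.intercept_     # approximately 50.0
slope = model.coef_[0]           # approximately 1.8
price_500 = model.predict([[500.0]])[0]   # approximately 950, the slide's example
```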
Classification
[Figure: two classes of points separated in the (x1, x2) plane]
Classification Examples
Logistic Regression
• How to turn a regression problem into a classification one?
• y = 0 or 1
• Map values to the [0, 1] range

g(x) = 1 / (1 + e^(−x))
Sigmoid/Logistic Function
Logistic Regression
• Model (sigmoid/logistic function):

h_w(x) = g(wᵀx) = 1 / (1 + e^(−wᵀx))

• Interpretation (probability):

h_w(x) = P(y = 1 | x; w)
if h_w(x) ≥ 0.5 ⇒ y = 1; if h_w(x) < 0.5 ⇒ y = 0
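The model and the 0.5 threshold can be sketched in a few lines of plain Python; the weights below are made up for illustration:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """h_w(x) = g(w^T x): probability that y = 1, thresholded at 0.5."""
    z = sum(wi * xi for wi, xi in zip(w, x))   # w^T x
    p = sigmoid(z)
    return (1 if p >= 0.5 else 0), p

# Hypothetical weights [w0, w1]; x includes the constant feature x0 = 1
w = [-4.0, 2.0]
label, p = predict(w, [1.0, 3.0])   # z = 2, p = g(2), about 0.88, so class 1
```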
Logistic Regression
[Figure: decision boundary separating the two classes in the (x1, x2) plane]
Logistic Regression
• Cost function
J(w) = −[ y·log(h_w(x)) + (1 − y)·log(1 − h_w(x)) ]

where h_w(x) = g(wᵀx) = 1 / (1 + e^(−wᵀx))
Logistic Regression
• Optimization: gradient descent
• Exactly like linear regression
• Find the best w parameters that minimize the cost function
Logistic Regression: Algorithm
• Objective: min over w0, w1 of J(w0, w1), here J(w0, w1) is the logistic regression cost function
• Initialize w0, w1, e.g. random numbers or zeros
• for number of iterations or until a stopping criterion:
  • Compute the gradient: ∇J (not derived here)
  • w_{t+1} = w_t − η·∇J, where w = [w0 ; w1]
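A sketch of the loop above on made-up 1-D data, using the standard gradient of the logistic cost, Σ (h_w(x) − y)·x, which the slides leave out:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, eta=0.1, iters=5000):
    """Gradient descent on the logistic cost; each row of X starts with x0 = 1."""
    w = [0.0] * len(X[0])                       # initialize weights to zeros
    for _ in range(iters):
        # Standard gradient of the logistic cost: sum over samples of (h - y) * x
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            h = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (h - yi) * xj
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

# Made-up separable data: class 0 for x < 2.5, class 1 for x > 2.5
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
w = fit_logistic(X, y)
p_low = sigmoid(w[0] + w[1] * 1.0)   # probability for x = 1: close to 0
p_high = sigmoid(w[0] + w[1] * 4.0)  # probability for x = 4: close to 1
```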
Logistic Regression
• Multi-class classification
Logistic Regression
One-vs-All
Logistic Regression: Multiple Features
[Figures: with two features the decision boundary is a line in (x1, x2); with three features it is a plane in (x1, x2, x3)]
Line - Plane - Hyperplane
Demo
• Scikit-learn library’s logistic regression
Other Classification Algorithms
Neural Networks
input hidden output
Support Vector Machines (SVM)
[Figure: maximum-margin boundary between two classes in the (x1, x2) plane]
Decision Trees
[Figure: decision tree predicting Safe/Risky from Credit (excellent/fair/poor), Term (3 years/5 years), and Income (high/low)]
K Nearest Neighbors (KNN)
[Figure: a query point "?" in the (x1, x2) plane classified by its 5 nearest neighbors]
Clustering
Clustering
• Unsupervised learning
• Group similar items into clusters
• K-Means algorithm
Clustering Examples
K-Means
[Figure sequence: K-Means iterations in the (x1, x2) plane: centroids are initialized, points are assigned to the nearest centroid, centroids are recomputed from their assigned points, and the process repeats until convergence]
K-Mean Algorithm
• Select number of clusters: 𝐾𝐾 (number of centroids 𝜇𝜇1, … , 𝜇𝜇𝐾𝐾)• Given dataset of size N• 𝑓𝑓𝑓𝑓𝑓𝑓 𝑛𝑛𝑛𝑛𝑛𝑛𝑏𝑏𝑛𝑛𝑓𝑓 𝑓𝑓𝑓𝑓 𝑖𝑖𝑖𝑖𝑛𝑛𝑓𝑓𝑎𝑎𝑖𝑖𝑖𝑖𝑓𝑓𝑛𝑛𝑖𝑖 𝑖𝑖:• 𝑓𝑓𝑓𝑓𝑓𝑓 𝑖𝑖 = 1 𝑖𝑖𝑓𝑓 𝑁𝑁:
• 𝑐𝑐𝑖𝑖 ≔ assign cluster 𝑐𝑐𝑖𝑖 to sample 𝑎𝑎𝑖𝑖 as the smallest Euclidean distance between 𝑎𝑎𝑖𝑖 and the centroids
• 𝑓𝑓𝑓𝑓𝑓𝑓 𝑘𝑘 = 1 𝑖𝑖𝑓𝑓 𝐾𝐾:• 𝜇𝜇𝑘𝑘 ≔ mean of the points assigned to cluster 𝑐𝑐𝑘𝑘
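The two steps above can be sketched in plain Python for 2-D points; the toy blobs and initial centroids are illustrative:

```python
import math

def kmeans(points, centroids, iters=10):
    """Plain-Python sketch of the K-Means loop for 2-D points."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k, c in enumerate(clusters):
            if c:
                new_centroids.append((sum(p[0] for p in c) / len(c),
                                      sum(p[1] for p in c) / len(c)))
            else:
                new_centroids.append(centroids[k])  # keep empty clusters in place
        centroids = new_centroids
    return centroids

# Two obvious blobs; the initial centroids are deliberately poor guesses
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids = kmeans(points, centroids=[(0, 0), (1, 1)])
# One centroid ends near (1/3, 1/3), the other near (28/3, 28/3)
```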
Demo
• Scikit-learn library's k-means
Machine Learning
• Supervised: Regression, Classification
• Unsupervised: Clustering
Other Machine Learning Algorithms
• Probabilistic models
• Ensemble methods
• Reinforcement learning
• Recommendation algorithms (e.g., matrix factorization)
•Deep Learning
Linear vs Non-linear
[Figures: classes in (x1, x2) that no straight line separates; price vs. square meters following a curve rather than a line]
• Multi-layer neural networks
• Support Vector Machines (kernel trick)

Kernel trick: map (x1, x2) to [x1, x2, x1² + x2²]
Image from http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
Practical aspects
Data Preprocessing
• Missing data
  • Elimination (samples/features)
  • Imputation
• Categorical data
  • Mapping (for ordinal features)
  • One-hot encoding (for nominal features)
• Feature scaling (normalization, standardization)
• Data/problem-specific preprocessing (e.g., images, signals, text)
x_norm^(i) = (x^(i) − x_min) / (x_max − x_min)

x_std^(i) = (x^(i) − μ_x) / σ_x, where μ_x is the mean of feature x and σ_x its standard deviation
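Both scaling formulas fit in a few lines of plain Python; one illustrative feature is used, and the population standard deviation is assumed:

```python
# Min-max normalization and standardization of a single feature.
xs = [10.0, 20.0, 40.0, 50.0]

# Min-max normalization: maps the feature into [0, 1]
x_min, x_max = min(xs), max(xs)
x_norm = [(x - x_min) / (x_max - x_min) for x in xs]

# Standardization: zero mean, unit standard deviation (population sigma)
mu = sum(xs) / len(xs)
sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
x_std = [(x - mu) / sigma for x in xs]
```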
Model Evaluation
• Splitting the data (training, validation, testing) is IMPORTANT
  • No hard rule: usually 60%-20%-20% is fine
• k-fold cross-validation
  • If the dataset is very small: leave-one-out
• Fine-tuning hyper-parameters
  • Automated hyper-parameter selection
  • Using the validation set
Training Validation Testing
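The 60%-20%-20% split can be sketched as follows; the seed and the stand-in data are arbitrary:

```python
import random

data = list(range(100))              # stand-in for 100 labeled samples
random.seed(0)                       # fixed seed so the split is reproducible
random.shuffle(data)                 # shuffle first to avoid ordering bias

n = len(data)
train = data[: int(0.6 * n)]              # 60% for training
val = data[int(0.6 * n): int(0.8 * n)]    # 20% for validation (hyper-parameters)
test = data[int(0.8 * n):]                # 20% reserved for final testing only
```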
Performance Measures
• Depends on the problem. Some well-known measures:
• Classification measures
  • Accuracy
  • Confusion matrix and related measures
• Regression
  • Mean Squared Error
  • R² metric
• Clustering performance measures are not straightforward and are not discussed here
Performance Measures: Accuracy
• If we have 100 people, one of whom has cancer, what is the accuracy if we classify all of them as not having cancer? (99%, even though the classifier is useless)
• Accuracy is not a good measure for a heavily imbalanced class distribution

Accuracy = correct predictions / all predictions
Performance Measures: Confusion matrix
                      Actual Positive        Actual Negative
Predicted Positive    True Positives (TP)    False Positives (FP)
Predicted Negative    False Negatives (FN)   True Negatives (TN)

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 · (Precision · Recall) / (Precision + Recall)

Precision is a measure of result relevancy; recall is a measure of how many truly relevant results are returned; the F-measure is the harmonic mean of precision and recall.
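With made-up confusion-matrix counts, the four measures compute as:

```python
# Illustrative counts for a binary classifier's confusion matrix
TP, FP, FN, TN = 40, 10, 20, 30

accuracy = (TP + TN) / (TP + FP + FN + TN)           # 70 / 100 = 0.7
precision = TP / (TP + FP)                           # 40 / 50  = 0.8
recall = TP / (TP + FN)                              # 40 / 60  = 0.666...
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean
```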
Performance Measures: Mean Squared Error (MSE)
• Defined as:

MSE = (1/n) Σ_{i=1..n} (y_i − ŷ_i)²

• Gives an idea of how wrong the predictions were
• Only captures the magnitude of the error, not the direction (e.g., over- or under-predicting)
• Root Mean Squared Error (RMSE) is the square root of MSE, which has the same units as the data
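The definition can be checked with a few lines of Python on made-up predictions:

```python
# MSE and RMSE on illustrative true/predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

n = len(y_true)
mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n  # 0.375
rmse = mse ** 0.5        # same units as y, unlike MSE
```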
Performance Measures: R2
• A statistical measure of how close the data are to the fitted regression line
• Also known as the coefficient of determination
• Ranges from 0 (no fit) to 1 (perfect fit)

SS_res = Σ_i (y_i − ŷ_i)²
SS_tot = Σ_i (y_i − ȳ)², where ȳ is the mean of the observed data

R² = 1 − SS_res / SS_tot
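The three formulas above translate directly to Python; the data is made up, and a perfect fit would give R² = 1:

```python
# R^2 (coefficient of determination) on illustrative data
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

y_mean = sum(y_true) / len(y_true)
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum
ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)               # total sum
r2 = 1 - ss_res / ss_tot   # 0.98 here: close to a perfect fit
```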
Dimensionality Reduction
Curse of dimensionality
Image from http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/
When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
Feature selection
• Exhaustive (all subsets)
• Forward stepwise
• Backward stepwise
• Forward-backward
• Many more…
Features compression/projection
• Project to lower dimensional space while preserving as much information as possible
• Principal Component Analysis (PCA)
• Unsupervised method
Image from http://compbio.pbworks.com/w/page/16252905/Microarray%20Dimension%20Reduction
Overfitting and Underfitting
Overfitting
[Figures: a high-degree curve passing through every training point (price (×1000) vs. square meters); an overly wiggly decision boundary in (x1, x2)]
High Variance
Underfitting
[Figures: a model too simple for the data: a straight line missing the trend (price (×1000) vs. square meters); a too-simple decision boundary in (x1, x2)]
High Bias
Training vs. Testing Errors
• Accuracy on the training set is not representative of model performance
• We need to calculate the accuracy on the test set (new, unseen examples)
• The goal is a model that generalizes to unseen data
Bias and variance trade-off
[Figure: training and testing error vs. model complexity; testing error is high at both extremes: high bias (underfitting) at low complexity, high variance (overfitting) at high complexity]
The optimum is low bias and low variance.
Learning Curves
[Figures: training and validation error vs. number of training samples; under high bias the two curves converge at a high error, while under high variance a large gap remains between them]
Regularization
• To prevent overfitting
• Decreases the complexity of the model
• Example of a regularized regression model (Ridge Regression):

RSS(w0, w1) = Σ_{i=1..N} (y_i − ŷ_i)² + λ Σ_{j=1..k} w_j², k = number of weights

• λ is a very important hyper-parameter
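The ridge cost above has a closed-form minimizer, w = (XᵀX + λI)⁻¹·Xᵀ·y. A small NumPy sketch follows; the data and λ are illustrative, and for brevity the intercept is penalized too, which many implementations avoid:

```python
import numpy as np

# Toy data generated from y = 3 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3 + 2 * x
X = np.column_stack([np.ones_like(x), x])   # first column: intercept feature

lam = 0.1
# Ridge closed form: (X^T X + lambda * I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # lambda = 0 recovers the plain RSS fit

# The penalty shrinks the overall weight norm toward zero
```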
Debugging a Learning Algorithm
• From “Machine Learning” course on coursera.org, by Andrew Ng
• Get more training examples: fixes high variance
• Try smaller sets of features: fixes high variance
• Try getting additional features: fixes high bias
• Try adding polynomial features (e.g., x1², x2², x1·x2, etc.): fixes high bias
• Try decreasing λ: fixes high bias
• Try increasing λ: fixes high variance
What is the best ML algorithm?
• "No free lunch" theorem: there is no one model that works best for every problem
• We need to try and compare different models and assumptions
• Machine learning is full of uncertainty
The Weka Syndrome
Tips for Applying Machine Learning
• Understand the problem you are solving well
  • What is the problem?
  • What exactly needs to be solved?
  • Do you need to learn things about the domain (medical, marketing, …)?
• Understand the available data well
  • Data size
  • Run some descriptive statistics on the data
  • Are there missing parts? How will you handle them?
Tips for Applying Machine Learning
• Pick several algorithms to test, based on the problem description and the available data
  • Regression? Classification? Clustering? Other?
  • Do you need regularization?
• Distinctive features
  • Are they ready-made, or do you need to extract them (e.g., from images or text)?
  • Do you need to reduce them (feature selection or projection)?
  • Do you need scaling?
Tips for Applying Machine Learning
• Design the evaluation
  • How will you split the data? (60% training, 20% validation, 20% testing)
  • Evaluation measures
  • Hyper-parameter selection (using the validation split)
  • Plot learning curves to assess bias and variance
• Then what?
  • More data?
  • Fewer features, more features, or combinations of them?
• After all of this, apply the algorithms you chose to the testing split, and pick the one you believe is most suitable
A Suggested Program for Learning the Field
• Review the following topics in mathematics and statistics:
  • Descriptive statistics
  • Inferential statistics
  • Probability
  • Linear algebra
  • Basics of differential equations
• Read the book "Python Machine Learning"
A Suggested Program for Learning the Field
• Enroll in the "Machine Learning" course on coursera.org (www.coursera.org/learn/machine-learning) and solve all the exercises
• Learn a programming language and its machine-learning libraries:
  • Matlab
  • Python
  • R
A Suggested Program for Learning the Field
• Sign up on Kaggle (www.kaggle.com) and work on some of the available datasets and challenges
• Suggested starting points:
  • Titanic: Machine Learning from Disaster
    https://www.kaggle.com/c/titanic
  • House Prices: Advanced Regression Techniques
    https://www.kaggle.com/c/house-prices-advanced-regression-techniques
  • Digit Recognizer
    https://www.kaggle.com/c/digit-recognizer
A Suggested Program for Learning the Field
• Another review of statistics and mathematics, strengthening the things you now know you need
• Other suggested (more advanced) machine learning books
A Suggested Program for Learning the Field
• Focus on a field you like:
  • Business forecasting (e.g., marketing)
  • Natural language processing
  • Computer vision
• Present the results of your work
  • Share your work (e.g., code and results) with people and ask for their feedback
  • Give a course in the field you specialized in
Thank you for attending and listening.