Introduction to Machine Learning


Organized by www.qunaieer.com

Lecture Format

• Explanation of the basic terms and concepts
• The terms and most of the slides are in English; the explanation is in Arabic
• Detailed explanation of the fundamentals
• Quick pointers to the well-known algorithms
• Simple practical examples
• A suggested learning plan

What is Machine Learning?

• A science that enables computers to learn on their own, instead of being programmed in detail
• Distilling the essence of the data by building models, then making decisions and future predictions based on them

Regression

•Regression analysis is a statistical process for estimating the relationships among variables

•Used to predict continuous outcomes

Regression Examples

Linear Regression

Line equation: $y = b + ax$, where $b$ is the intercept and $a$ is the slope.

Model/hypothesis: $\hat{y} = w_0 + w_1 x$

[Figure: price ($y$) vs. square meters ($x$) with a fitted line]

Linear Regression

Model: $\hat{y} = w_0 + w_1 x$

[Figure: price (*1000, $y$) vs. square meters ($x$) with a fitted line]

Example: $w_0 = 50$, $w_1 = 1.8$, $x = 500 \Rightarrow \hat{y} = 950$

Linear Regression

How to quantify error?

[Figure: price ($y$) vs. square meters ($x$), data points around a fitted line]

Linear Regression

Residual Sum of Squares (RSS), the cost function:

$$RSS(w_0, w_1) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \quad \text{where } \hat{y}_i = w_0 + w_1 x_i$$

Linear Regression

How to choose the best model?

Choose the $w_0$ and $w_1$ that give the lowest RSS value = find the best line.

[Figure: candidate lines over the price ($y$) vs. square meters ($x$) data]

Optimization

Image from https://ccse.lbl.gov/Research/Optimization/index.html

Optimization (convex)

[Figure: a convex curve; derivative < 0 to the left of the minimum, derivative = 0 at the minimum, derivative > 0 to the right]

"The gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction." - Wikipedia

Optimization (convex)

Image from http://codingwiththomas.blogspot.com/2012/09/particle-swarm-optimization.html

[Figure: the convex RSS surface as a function of $w_0$ and $w_1$]

Optimization (convex)

• First, let's compute the gradient of our cost function
• To find the best line, there are two ways:
  • Analytical (normal equation)
  • Iterative (gradient descent)

$$\nabla RSS(w_0, w_1) = \begin{bmatrix} -2\sum_{i=1}^{N} (y_i - \hat{y}_i) \\ -2\sum_{i=1}^{N} (y_i - \hat{y}_i)\,x_i \end{bmatrix}, \quad \text{where } \hat{y}_i = w_0 + w_1 x_i$$

Gradient Descent

Update the weights to minimize the cost function; $\eta$ is the step size (an important hyper-parameter):

$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} - \eta\,\nabla RSS$$

$$\begin{bmatrix} w_0^{t+1} \\ w_1^{t+1} \end{bmatrix} = \begin{bmatrix} w_0^{t} \\ w_1^{t} \end{bmatrix} - \eta \begin{bmatrix} -2\sum_{i=1}^{N} (y_i - \hat{y}_i) \\ -2\sum_{i=1}^{N} (y_i - \hat{y}_i)\,x_i \end{bmatrix}$$

Gradient Descent

Image from wikimedia.org

Linear Regression: Algorithm

• Objective: $\min_{w_0, w_1} J(w_0, w_1)$, here $J(w_0, w_1) = RSS(w_0, w_1)$
• Initialize $w_0, w_1$, e.g. random numbers or zeros
• For a number of iterations, or until a stopping criterion is met:
  • Compute the gradient: $\nabla J$
  • $W^{t+1} = W^{t} - \eta\,\nabla J$, where $W = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$
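As a concrete illustration, here is a minimal NumPy sketch of this loop; the toy data, step size, and iteration count are assumptions for the example, not from the lecture:

```python
import numpy as np

# Toy data generated from y = 50 + 1.8x (noise-free, so the best fit is known)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 50 + 1.8 * x

w0, w1 = 0.0, 0.0            # initialize the weights with zeros
eta = 0.01                   # step size
for _ in range(10000):       # fixed iteration count as the stopping criterion
    y_hat = w0 + w1 * x                      # model: y_hat = w0 + w1*x
    grad_w0 = -2 * np.sum(y - y_hat)         # dRSS/dw0
    grad_w1 = -2 * np.sum((y - y_hat) * x)   # dRSS/dw1
    w0 -= eta * grad_w0
    w1 -= eta * grad_w1

print(w0, w1)  # approaches (50, 1.8)
```

Note that the step size only works because the feature values are small; unscaled features (e.g. hundreds of square meters) would need a much smaller $\eta$, which is one motivation for the feature scaling discussed later.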

Model/hypothesis: $\hat{y} = w_0 + w_1 x$

Cost function: $RSS(w_0, w_1) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

Optimization: $\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} - \eta\,\nabla RSS$

The Essence of Machine Learning

Linear Regression: Multiple features

• Example: for house pricing, in addition to size in square meters, we can use city, location, number of rooms, number of bathrooms, etc

• The model/hypothesis becomes

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n, \quad \text{where } n = \text{number of features}$$

Representation

• Vector representation of $n$ features:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

$$\boldsymbol{x} = \begin{bmatrix} x_0 = 1 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$$

$$\hat{y} = \boldsymbol{w}^{T}\boldsymbol{x} = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

Representation

• Matrix representation of $m$ data samples and $n$ features, where $i$ indexes the $i$th data sample:

$$\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_n x_n^{(i)}$$

$$X = \begin{bmatrix} x_0^{(0)} & x_1^{(0)} & \cdots & x_n^{(0)} \\ x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} \; (\text{size } m \times n), \quad \boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \; (\text{size } n \times 1)$$

$$\hat{\boldsymbol{y}} = X\boldsymbol{w}: \quad \begin{bmatrix} \hat{y}^{(0)} \\ \hat{y}^{(1)} \\ \hat{y}^{(2)} \\ \vdots \\ \hat{y}^{(m)} \end{bmatrix} = \begin{bmatrix} x_0^{(0)} & x_1^{(0)} & \cdots & x_n^{(0)} \\ x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} \times \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$$

Analytical solution (normal equation)

$$\boldsymbol{w} = (X^{T}X)^{-1}X^{T}\boldsymbol{y}$$
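A quick NumPy check of this formula; the data points are assumed for illustration:

```python
import numpy as np

# Design matrix with x0 = 1 as the first column (intercept term)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([51.8, 53.6, 55.4, 57.2])  # generated from y = 50 + 1.8x

w = np.linalg.inv(X.T @ X) @ X.T @ y    # w = (X^T X)^(-1) X^T y
print(w)                                # ≈ [50.0, 1.8]
```

In practice `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq` is preferred over an explicit inverse for numerical stability.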

Analytical vs. Gradient Descent

• Gradient descent: must select the parameter $\eta$; Analytical solution: no parameter selection
• Gradient descent: a lot of iterations; Analytical solution: no need for iterations
• Gradient descent: works with a large number of features; Analytical solution: slow with a large number of features

Demo

• Matrix operations
• A simple linear regression implementation
• Scikit-learn library's linear regression
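The demo code itself is not part of the transcript; a sketch of what the scikit-learn part might look like, using assumed house-price data consistent with the earlier example ($w_0 = 50$, $w_1 = 1.8$):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[100], [200], [300], [400], [500]])  # square meters
y = np.array([230, 410, 590, 770, 950])            # price (*1000), from y = 50 + 1.8x

model = LinearRegression()
model.fit(X, y)                        # fits intercept (w0) and slope (w1)
print(model.intercept_, model.coef_)   # ≈ 50.0, [1.8]
print(model.predict([[500]]))          # ≈ [950.0], matching the earlier example
```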

Classification

[Figure: two classes of points in the $(x_1, x_2)$ plane]

Classification Examples

Logistic Regression

• How to turn a regression problem into a classification one?
• $y = 0$ or $1$
• Map values to the $[0, 1]$ range

Sigmoid/Logistic Function:

$$g(x) = \frac{1}{1 + e^{-x}}$$
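A one-line Python sketch of this function (NumPy assumed):

```python
import numpy as np

def sigmoid(x):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [0.000045, 0.5, 0.999955]
```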

Logistic Regression

• Model (sigmoid/logistic function):

$$h_{\boldsymbol{w}}(\boldsymbol{x}) = g(\boldsymbol{w}^{T}\boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{w}^{T}\boldsymbol{x}}}$$

• Interpretation (probability):

$$h_{\boldsymbol{w}}(\boldsymbol{x}) = p(y = 1 \mid \boldsymbol{x}; \boldsymbol{w})$$

$$\text{if } h_{\boldsymbol{w}}(\boldsymbol{x}) \ge 0.5 \Rightarrow y = 1, \quad \text{if } h_{\boldsymbol{w}}(\boldsymbol{x}) < 0.5 \Rightarrow y = 0$$

Logistic Regression

[Figure: decision boundary separating two classes in the $(x_1, x_2)$ plane]

Logistic Regression

• Cost function:

$$J(\boldsymbol{w}) = -\big[\, y \log(h_{\boldsymbol{w}}(\boldsymbol{x})) + (1 - y) \log(1 - h_{\boldsymbol{w}}(\boldsymbol{x})) \,\big]$$

where

$$h_{\boldsymbol{w}}(\boldsymbol{x}) = g(\boldsymbol{w}^{T}\boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{w}^{T}\boldsymbol{x}}}$$

Logistic Regression

• Optimization: gradient descent
• Exactly like linear regression
• Find the best $\boldsymbol{w}$ parameters that minimize the cost function

Logistic Regression: Algorithm

• Objective: $\min_{w_0, w_1} J(w_0, w_1)$, here $J(w_0, w_1)$ is the logistic regression cost function
• Initialize $w_0, w_1$, e.g. random numbers or zeros
• For a number of iterations, or until a stopping criterion is met:
  • Compute the gradient: $\nabla J$ (not discussed here)
  • $\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} - \eta\,\nabla J$, where $\boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$

Logistic Regression

• Multi-class classification

Logistic Regression

One-vs-All

Logistic Regression: Multiple Features

[Figures: a line separating two classes in the $(x_1, x_2)$ plane, and a plane separating them in $(x_1, x_2, x_3)$ space]

Line - Plane - Hyperplane

Demo

• Scikit-learn library’s logistic regression
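The demo code is not included in the transcript; a plausible sketch with assumed toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two features (x1, x2) and binary labels, two well-separated groups
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[1.5, 1.5], [5.5, 5.5]]))  # → [0 1]
print(clf.predict_proba([[3.0, 3.0]]))        # [P(y=0), P(y=1)] near the boundary
```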

Other Classification Algorithms

Neural Networks

[Figure: a feed-forward network with input, hidden, and output layers]

Support Vector Machines (SVM)

[Figure: a separating boundary between two classes in the $(x_1, x_2)$ plane]

Decision Trees

[Figure: an example tree predicting whether a loan is Safe or Risky, splitting on Credit (excellent/fair/poor), Income (high/low), and Term (3 year/5 year)]

K Nearest Neighbors (KNN)

[Figure: a query point "?" in the $(x_1, x_2)$ plane classified by its 5 nearest neighbors]
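A sketch of KNN with scikit-learn, on assumed toy data; $k = 5$ mirrors the figure:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5)  # the 5 nearest neighbors, as in the figure
knn.fit(X, y)                              # KNN simply memorizes the training data
print(knn.predict([[4, 4]]))               # majority vote among the 5 closest points
```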

Clustering

Clustering

• Unsupervised learning
• Group similar items into clusters
• K-Means algorithm

Clustering Examples

K-Means

[Figures: successive K-Means iterations in the $(x_1, x_2)$ plane: centroids are placed, points are assigned to the nearest centroid, centroids are recomputed, and the process repeats until the assignments stabilize]

K-Means Algorithm

• Select the number of clusters $K$ (number of centroids $\mu_1, \ldots, \mu_K$)
• Given a dataset of size $N$
• For a number of iterations:
  • For $i = 1$ to $N$:
    • $c_i :=$ the cluster whose centroid has the smallest Euclidean distance to sample $x_i$
  • For $k = 1$ to $K$:
    • $\mu_k :=$ mean of the points assigned to cluster $k$
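A minimal NumPy sketch of this algorithm; initializing the centroids from randomly chosen data points is an assumption here, since the lecture does not specify the initialization:

```python
import numpy as np

def kmeans(X, K, iterations=100, seed=0):
    """Cluster the rows of X into K clusters; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize the K centroids as randomly chosen data points (one common choice)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        for k in range(K):
            if np.any(c == k):
                centroids[k] = X[c == k].mean(axis=0)
    return c, centroids
```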

Demo

• Scikit-learn library's k-means
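Again the demo code is not in the transcript; a sketch with assumed data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 8.0], [8.0, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)                # e.g. [0 0 0 1 1 1]; cluster ids are arbitrary
print(km.cluster_centers_)   # the two centroids
```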

Machine Learning

[Diagram: Machine Learning → Supervised (Regression, Classification) and Unsupervised (Clustering)]

Other Machine Learning Algorithms

• Probabilistic models
• Ensemble methods
• Reinforcement Learning
• Recommendation algorithms (e.g., Matrix Factorization)
• Deep Learning

Linear vs Non-linear

[Figures: a non-linear decision boundary in the $(x_1, x_2)$ plane, and a non-linear fit of price vs. square meters]

• Multi-layer Neural Networks
• Support Vector Machines (kernel trick)

Kernel trick: map the features into a higher-dimensional space where the classes become linearly separable, e.g. $(x_1, x_2) \mapsto [x_1, x_2, x_1^2 + x_2^2]$

Image from http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

Practical aspects

Data Preprocessing

• Missing data
  • Elimination (samples/features)
  • Imputation
• Categorical data
  • Mapping (for ordinal features)
  • One-hot-encoding (for nominal features)
• Feature scaling (normalization, standardization)
• Data/problem specific preprocessing (e.g., images, signals, text)

$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$$

$$x_{std}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}, \quad \text{where } \mu_x \text{ is the mean of feature } x \text{ and } \sigma_x \text{ its standard deviation}$$
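Both formulas are available in scikit-learn; a small sketch with assumed feature values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])  # one feature

print(MinMaxScaler().fit_transform(X).ravel())    # normalization: maps to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # standardization: zero mean, unit variance
```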

Model Evaluation

• Splitting the data (training, validation, testing) is IMPORTANT
  • No hard rule: usually 60%-20%-20% will be fine
• k-fold cross-validation
  • If the dataset is very small: leave-one-out
• Fine-tuning hyper-parameters
  • Automated hyper-parameter selection
  • Using the validation set

[Figure: data split into Training | Validation | Testing]
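A sketch of producing such a 60%-20%-20% split with scikit-learn's train_test_split applied twice; the data is a placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # placeholder data

# Hold out 20% for testing, then split the remainder 75%/25% → 60%/20%/20% overall
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```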

Performance Measures

• Depends on the problem
• Some of the well-known measures are:
  • Classification measures
    • Accuracy
    • Confusion matrix and related measures
  • Regression measures
    • Mean Squared Error
    • R2 metric
• Clustering performance measurement is not straightforward and will not be discussed here

Performance Measures: Accuracy

• If we have 100 persons, one of them having cancer: what is the accuracy if we classify all of them as having no cancer? (99%, even though the one cancer case is missed)

• Accuracy is not a good measure for heavily biased class distributions

$$Accuracy = \frac{\text{correct predictions}}{\text{all predictions}}$$

Performance Measures: Confusion matrix

Confusion matrix:

Predicted \ Actual      Positive                Negative
Positive                True Positives (TP)     False Positives (FP)
Negative                False Negatives (FN)    True Negatives (TN)

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$$

$$Precision = \frac{TP}{TP + FP} \quad \text{(a measure of result relevancy)}$$

$$Recall = \frac{TP}{TP + FN} \quad \text{(a measure of how many truly relevant results are returned)}$$

$$F\text{-}measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \quad \text{(the harmonic mean of precision and recall)}$$
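All of these measures are available in scikit-learn; a sketch with assumed labels (note that scikit-learn's confusion_matrix puts actual classes on rows and predicted classes on columns, the transpose of the table above):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted classes

print(confusion_matrix(y_true, y_pred))  # [[TN FP], [FN TP]]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))          # the F-measure above
```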

Performance Measures: Mean Squared Error (MSE)

• Defined as

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

• Gives an idea of how wrong the predictions were
• Only gives an idea of the magnitude of the error, not the direction (e.g. over- or under-predicting)
• Root Mean Squared Error (RMSE) is the square root of MSE, which has the same units as the data

Performance Measures: R2

• A statistical measure of how close the data are to the fitted regression line
• Also known as the coefficient of determination
• Has a value between 0 and 1 for no fit and perfect fit, respectively

$$SS_{res} = \sum_{i} (y_i - \hat{y}_i)^2$$

$$SS_{tot} = \sum_{i} (y_i - \bar{y})^2, \quad \text{where } \bar{y} \text{ is the mean of the observed data}$$

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
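A sketch of MSE, RMSE, and R2 with scikit-learn, on assumed predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([230.0, 410.0, 590.0, 770.0, 950.0])
y_pred = np.array([250.0, 400.0, 600.0, 760.0, 940.0])

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))         # MSE, and RMSE in the same units as y
print(r2_score(y_true, y_pred))  # coefficient of determination
```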

Dimensionality Reduction

Curse of dimensionality

Image from http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/

When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.

Feature selection

• Comprehensive (all subsets)
• Forward stepwise
• Backward stepwise
• Forward-Backward
• Many more…

Feature compression/projection

• Project to a lower-dimensional space while preserving as much information as possible
• Principal Component Analysis (PCA)
  • An unsupervised method

Image from http://compbio.pbworks.com/w/page/16252905/Microarray%20Dimension%20Reduction
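A PCA sketch with scikit-learn, on assumed random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples with 10 features

pca = PCA(n_components=2)             # project onto the top 2 principal components
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance preserved by each component
```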

Overfitting and Underfitting

Overfitting

[Figures: an overly complex model fitting every training point, shown for regression (price (*1000) vs. square meters) and for classification in the $(x_1, x_2)$ plane]

High Variance

Underfitting

[Figures: a model too simple to capture the pattern, shown for classification in the $(x_1, x_2)$ plane and for regression (price (*1000) vs. square meters)]

High Bias

Training vs. Testing Errors

• Accuracy on the training set is not representative of model performance
• We need to calculate the accuracy on the test set (new, unseen examples)
• The goal is a model that generalizes to unseen data

Bias and variance trade-off

[Figure: training and testing error vs. model complexity (low to high); testing error is high at both ends: high bias on the left, high variance on the right]

The optimum is low bias and low variance.

Learning Curves

[Figures: training and validation error vs. number of training samples, one panel for the high-bias case and one for the high-variance case]

Regularization

• To prevent overfitting
• Decrease the complexity of the model
• Example of a regularized regression model (Ridge Regression):

$$RSS(w_0, w_1) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 + \lambda \sum_{j=1}^{k} w_j^2, \quad k = \text{number of weights}$$

• $\lambda$ is a very important hyper-parameter
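In scikit-learn, ridge regression's alpha parameter plays the role of $\lambda$; a sketch with assumed data:

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])  # square meters
y = np.array([230.0, 410.0, 590.0, 770.0, 950.0])            # price (*1000)

ridge = Ridge(alpha=1.0)  # larger alpha = stronger penalty, simpler model
ridge.fit(X, y)
print(ridge.intercept_, ridge.coef_)
```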

Debugging a Learning Algorithm

• From the “Machine Learning” course on coursera.org, by Andrew Ng:
  • Get more training examples → fixes high variance
  • Try smaller sets of features → fixes high variance
  • Try getting additional features → fixes high bias
  • Try adding polynomial features (e.g., $x_1^2$, $x_2^2$, $x_1 x_2$, etc.) → fixes high bias
  • Try decreasing $\lambda$ → fixes high bias
  • Try increasing $\lambda$ → fixes high variance

What is the best ML algorithm?

• "No free lunch" theorem: there is no one model that works best for every problem
• We need to try and compare different models and assumptions
• Machine learning is full of uncertainty

The Weka Syndrome

Tips for Applying Machine Learning

• Understand the problem you are about to solve very well
  • What is the problem?
  • What exactly needs to be solved?
  • Do you need to learn things about the domain (medical, marketing, ...)?

• Understand the available data very well
  • The size of the data
  • Run some descriptive statistics on the data
  • Are there missing parts? How will you handle them?

Tips for Applying Machine Learning

• Shortlist several algorithms to test, based on the description of the problem and the available data
  • Regression? Classification? Clustering? Other?
  • Do you need regularization?

• Features
  • Are they ready-made, or do you need to extract them (e.g., from images or text)?
  • Do you need to reduce them (feature selection or projection)?
  • Do you need scaling?

Tips for Applying Machine Learning

• Design the evaluation
  • How will you split the data? (60% training, 20% validation, 20% testing)
  • Evaluation measures
  • Hyper-parameter selection (using the validation split)
  • Plot learning curves to assess bias and variance

• What to do next?
  • More data?
  • Fewer features, more features, or combinations of them?

• After you finish all of this, apply the algorithms you selected to the testing split, and pick the one you believe is the most suitable

A Suggested Program for Learning the Field

• Review the following topics in mathematics and statistics:
  • Descriptive Statistics
  • Inferential Statistics
  • Probability
  • Linear Algebra
  • Basics of differential equations

• Read the book "Python Machine Learning"

A Suggested Program for Learning the Field

• Enroll in the "Machine Learning" course on coursera.org (www.coursera.org/learn/machine-learning) and solve all of the exercises

• Learn a programming language and the machine-learning libraries for it:
  • Matlab
  • Python
  • R

A Suggested Program for Learning the Field

• Sign up at Kaggle (www.kaggle.com) and work on some of the available datasets and challenges

• Suggested starting points:
  • Titanic: Machine Learning from Disaster
    https://www.kaggle.com/c/titanic
  • House Prices: Advanced Regression Techniques
    https://www.kaggle.com/c/house-prices-advanced-regression-techniques
  • Digit Recognizer
    https://www.kaggle.com/c/digit-recognizer

A Suggested Program for Learning the Field

• Review statistics and mathematics again, to strengthen the areas you now know you need
• Other suggested (more advanced) machine learning books

A Suggested Program for Learning the Field

• Focus on a field you want:
  • Business forecasting (e.g., marketing)
  • Natural language processing
  • Computer vision

• Show the results of your work
  • Share your work with people (e.g., the code and the results) and ask for their feedback
  • Give a course in the field you specialized in

Thank you for attending and listening.