Introduction to Neural Networks
Transcript of Introduction to Neural Networks
Media IC & System Lab
柯揚
Agenda
• Origin of Neural Networks
• Math of Neural Networks
• Machine Learning
• Deep Learning
• Convolutional Neural Networks
• Recurrent Neural Networks
Origin
[Diagram: the cycle Theory → Prediction → Verification → Observation]
Origin
Origin
• What if we don’t have a theory?
  • In physics, we look for variables and their relations: 𝑓 = 𝑚𝑎
• What if we have variables but cannot figure out their relations?
• Examples:
  • Recognize a person: pixels of an image → identity
  • Solve a riddle: riddle description → solution description
  • …
Origin
Take a look at nature
Origin
• Hubel and Wiesel: recordings from the cat visual cortex showed single cells that respond to simple, localized stimuli
Origin → Math
• Single cells that do a simple computation:
  • Hidden: 𝑧ᵢ = 𝜎(Σⱼ₌₁,₂,₃ 𝑤ᴴᵢⱼ 𝑥ⱼ + 𝑏ᴴᵢ) for 𝑖 = 1, 2, 3, 4
  • 𝜎: ℝ → ℝ … activation function, traditionally tanh(𝑥)
  • Σⱼ₌₁,₂,₃ 𝑤ᴴᵢⱼ 𝑥ⱼ is a scalar product: ⟨𝒘ᴴᵢ, 𝒙⟩
  • 𝒛 = 𝜎(𝑊ᴴ𝒙 + 𝒃ᴴ) in matrix-vector notation
• Output: 𝒚 = 𝑊ᴼ𝒛 + 𝒃ᴼ
• We call 𝑊ᴴ and 𝑊ᴼ weights, because they give (more or less) weight to the input
• 𝒃ᴴ and 𝒃ᴼ are called biases, because they bias the output in one direction
• Two matrix multiplications. How many MACs (multiply-accumulate operations)?
[Figure: fully connected network with inputs 𝑥₁, 𝑥₂, 𝑥₃, hidden units 𝑧₁–𝑧₄, and outputs 𝑦₁, 𝑦₂]
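The two-layer computation above can be sketched in a few lines. This is a minimal illustration, not code from the slides; the random weights and the input values are placeholders, and the MAC count simply counts one multiply-accumulate per weight entry.

```python
import numpy as np

# Sketch of the 3 -> 4 -> 2 network from the figure, assuming tanh activation.
rng = np.random.default_rng(0)
W_H, b_H = rng.normal(size=(4, 3)), np.zeros(4)   # hidden weights and bias
W_O, b_O = rng.normal(size=(2, 4)), np.zeros(2)   # output weights and bias

x = np.array([1.0, 2.0, 3.0])      # illustrative input
z = np.tanh(W_H @ x + b_H)         # hidden layer: z = sigma(W_H x + b_H)
y = W_O @ z + b_O                  # output layer: y = W_O z + b_O

# One MAC per weight entry: 3*4 for the hidden layer + 4*2 for the output layer
macs = W_H.size + W_O.size
print(macs)  # 20
```

So the answer to the slide's question for this small network is 3·4 + 4·2 = 20 MACs.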
Math
• Neural Network: compute 𝒛 = 𝜎(𝑊ᴴ𝒙 + 𝒃ᴴ) and 𝒚 = 𝑊ᴼ𝒛 + 𝒃ᴼ
  • 𝒙 are our inputs; 𝒚 are our outputs.
• How do we get our parameters 𝜃 = (𝑊ᴴ, 𝒃ᴴ, 𝑊ᴼ, 𝒃ᴼ)?
• The neural network should do something very well, so optimise it to do that…
Math
• Example: predict the temperature tomorrow from the same day of the last five years
  • Input 𝑥 ∈ ℝ⁵: temperature vector for the last five years
  • Output 𝑦 ∈ ℝ: temperature tomorrow
  • The network produces predictions ŷ ∈ ℝ
• To measure how well our neural network is doing, we need some kind of loss measure
  • Take mean-squared error in this example
[Figure: the five inputs are T on July 17th of 2014–2018; the prediction target is T on July 17th, 2019]
Math → Machine Learning
• We have:
  • A neural network: 𝑓(𝑥; 𝜃) = ŷ
  • A loss: ℒ(𝑌, Ŷ) = (1/𝐷) Σᵢ₌₁..𝐷 (𝑦ᵢ − ŷᵢ)²
    • 𝑌 = (𝑦₁, …, 𝑦_𝐷), Ŷ = (ŷ₁, …, ŷ_𝐷)
• How do we get our truth, 𝑌?
  • Collect data…
  • An observation is the best truth you can get.
Machine Learning
• Learning from Observations
• Recipe for practitioners:
  1. Collect (lots of) examples 𝑋 → 𝑌 (data + labels)
  2. Choose a suitable function 𝑓(𝑥; 𝜃)
  3. Choose a loss function ℒ
  4. Optimize 𝜃 to minimize ℒ
Machine Learning
• 1. Collect data
  • Most algorithms assume that data is IID: independent and identically distributed
• An example: tell dogs apart from cats
  • Give your algorithm 990 images of dogs, 10 images of cats
  • If your algorithm always answers “dog”, what’s the error? Only 1%, even though it has learned nothing about cats.
  → The data is not IID
Machine Learning
• 2. Choose a suitable function 𝑓(𝑥; 𝜃)
• How to choose a good function?
• Avoid under-/overfitting
• How to check this?
  • Train on one part of the data: (𝑋_Train, 𝑌_Train)
  • Validate on the other part: (𝑋_Test, 𝑌_Test)
Machine Learning
• 3. Loss ℒ & 4. Optimisation
• Optimisation in high school: set ∂ℒ(𝑌, 𝑓(𝑋; 𝜃))/∂𝜃 = 0 and solve for 𝜃
• No closed-form solution…
• Instead: gradient descent
Machine Learning
• 3. Loss ℒ & 4. Optimisation
• Gradient descent: iterative minimisation, take the steepest descent direction in each step:
  • 𝜃ₜ₊₁ = 𝜃ₜ − 𝜂 ∂ℒ/∂𝜃
  • 𝜂 … step size (how far you walk before you check your direction)
  • ∂ℒ/∂𝜃 … gradient (you need a loss function that is differentiable)
• Does not give a global optimum!
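The update rule above can be run by hand on a toy loss. A minimal sketch, assuming the 1-D loss ℒ(𝜃) = (𝜃 − 3)², whose gradient is 2(𝜃 − 3) and whose minimum is at 𝜃 = 3:

```python
# Gradient descent on L(theta) = (theta - 3)^2, gradient dL/dtheta = 2*(theta - 3).
theta = 0.0   # arbitrary starting point
eta = 0.1     # step size
for _ in range(100):
    grad = 2.0 * (theta - 3.0)
    theta = theta - eta * grad   # theta_{t+1} = theta_t - eta * dL/dtheta

print(round(theta, 4))  # converges toward the minimum at 3.0
```

For this convex toy loss the iteration does reach the global minimum; the slide's caveat applies to the non-convex losses of real networks.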
Machine Learning → Deep Learning
• Deep Learning: machine learning with deep neural networks
  • Deep = a lot of layers
• Each layer is a function 𝑦ₗ = 𝜎(𝑊ₗ𝑥ₗ + 𝑏ₗ) where
  • 𝑥ₗ is the layer’s input (the previous layer’s output)
  • 𝑦ₗ is the layer’s output
  • 𝜃ₗ = (𝑊ₗ, 𝑏ₗ) are the parameters of the layer
• Three-layer neural network:
  • 𝑦₃ = 𝜎(𝑊₃ 𝜎(𝑊₂ 𝜎(𝑊₁𝑥₁ + 𝑏₁) + 𝑏₂) + 𝑏₃)
• How do we compute the gradient to update the parameters?
  • ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃) with ∂𝑦₃/∂𝑏₃ = 𝜎′ and ∂𝑦₃/∂𝑊₃ = 𝜎′𝑥₃
  • ∂ℒ/∂𝜃₂ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝜃₂) with ∂𝑦₂/∂𝑏₂ = 𝜎′ and ∂𝑦₂/∂𝑊₂ = 𝜎′𝑥₂
  • ∂ℒ/∂𝜃₁ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝑦₁)(∂𝑦₁/∂𝜃₁) with ∂𝑦₁/∂𝑏₁ = 𝜎′ and ∂𝑦₁/∂𝑊₁ = 𝜎′𝑥₁
Deep Learning
• How do we compute the gradient to update the parameters?
  • ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃)
  • ∂ℒ/∂𝜃₂ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝜃₂)
  • ∂ℒ/∂𝜃₁ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝑦₁)(∂𝑦₁/∂𝜃₁)
• Gradient computation with the chain rule yields shared factors; compute them only once, back to front:
  • ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃)
  • ∂ℒ/∂𝜃₂ = (∂ℒ/∂𝑦₂)(∂𝑦₂/∂𝜃₂), reusing ∂ℒ/∂𝑦₂ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)
  • ∂ℒ/∂𝜃₁ = (∂ℒ/∂𝑦₁)(∂𝑦₁/∂𝜃₁), reusing ∂ℒ/∂𝑦₁ = (∂ℒ/∂𝑦₂)(∂𝑦₂/∂𝑦₁)
→ Backpropagation
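The backward sweep can be demonstrated on a toy network of three scalar layers 𝑦ₗ = tanh(𝑤ₗ𝑥ₗ + 𝑏ₗ). This is an illustrative sketch, not the lecture's code; the parameter values and the squared-error loss are arbitrary. The key line is the last one in the loop, which updates the shared factor ∂ℒ/∂𝑦ₗ exactly once per layer.

```python
import numpy as np

# Forward pass through three scalar tanh layers; acts = [x1, y1, y2, y3].
def forward(x, params):
    acts = [x]
    for w, b in params:
        acts.append(np.tanh(w * acts[-1] + b))
    return acts

params = [(0.5, 0.1), (-1.2, 0.0), (0.8, -0.3)]   # (w_l, b_l), toy values
target = 0.25
acts = forward(0.7, params)

# Loss L = (y3 - target)^2, so the seed gradient is dL/dy3 = 2*(y3 - target).
dL_dy = 2.0 * (acts[-1] - target)
grads = [None] * 3
for l in reversed(range(3)):
    w, b = params[l]
    s = 1.0 - np.tanh(w * acts[l] + b) ** 2        # sigma'(pre) for tanh
    grads[l] = (dL_dy * s * acts[l],               # dL/dw_l = dL/dy_l * sigma' * x_l
                dL_dy * s)                         # dL/db_l = dL/dy_l * sigma'
    dL_dy = dL_dy * s * w                          # shared factor dL/dy_{l-1}, computed once
```

Checking `grads[0][0]` against a finite-difference estimate of ∂ℒ/∂𝑤₁ confirms the chain rule is applied correctly.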
Deep Learning
• In practice, for standard operations, this isn’t implemented manually
• Modern deep learning frameworks support automatic differentiation
  → Tell them what 𝑓 looks like and they will take care of the rest
Deep Learning
• The non-linearity 𝜎
• Let’s look at the last layer’s gradient: ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃)
  • 𝑦₃ = 𝜎(𝑊₃𝑥₃ + 𝑏₃)
  • For each parameter: ∂𝑦₃/∂𝑏₃ = 𝜎′ and ∂𝑦₃/∂𝑊₃ = 𝜎′𝑥₃
  • 𝜎′ is shorthand for 𝜎′(𝑊₃𝑥₃ + 𝑏₃)
  • Traditionally: 𝜎(𝑥) = tanh(𝑥) (from biology)
• What if 𝑊₃𝑥₃ + 𝑏₃ is very far away from 0?
Deep Learning
• The non-linearity 𝜎
• Vanishing gradient: the gradient may become very small in a deep network, because tanh′ is close to 0 far from the origin and the chain rule multiplies many such factors
• Modern non-linearity:
  • 𝜎(𝑥) = ReLU(𝑥) = 𝑥 if 𝑥 ≥ 0, else 0
  • 𝜎′(𝑥) = ReLU′(𝑥) = 1 if 𝑥 ≥ 0, else 0
• Easy to compute, and the gradient only vanishes on one half…
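The saturation problem is easy to see numerically. A minimal sketch comparing tanh′(𝑥) = 1 − tanh²(𝑥) with ReLU′(𝑥) for a pre-activation far from 0 (the value 10.0 is just an illustrative example):

```python
import math

# tanh saturates: its derivative collapses far from zero.
def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# ReLU passes the gradient through unchanged on the positive half.
def relu_grad(x):
    return 1.0 if x >= 0 else 0.0

print(tanh_grad(10.0))  # ~8.2e-09: essentially no gradient flows back
print(relu_grad(10.0))  # 1.0
```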
Deep Learning
• Optimisation
• (Iterative) gradient descent: 𝜃ₜ₊₁ = 𝜃ₜ − 𝜂 ∂ℒ/∂𝜃
• In each iteration: compute gradients for all examples
• With big data, that takes a lot of time…
→ Stochastic Gradient Descent (SGD):
  • Randomly choose a small set of examples (called a “batch”)
  • Compute the gradient only for that subset of examples
Deep Learning → Convolutional Neural Networks
• How to choose the right 𝑓?
• Sample from a salary questionnaire:
  • Age, Gender, Experience, Education, Salary, Bonus, …
  • Can I change their order without changing the meaning of the sample?
  • Salary, Gender, Education, Bonus, Age, Experience, … → yes, the order carries no meaning
• Natural-language sentence:
  • 我喜歡機器學習 (“I like machine learning”)
  • Can I change the order?
  • 機器喜歡我學習 (“The machine likes me studying”) → no, the meaning changes
Convolutional Neural Networks
• In a lot of data, we have direct neighborhood relations:
  • Images
  • Audio
  • Video
  • Natural language
  • …
• All data that is sampled according to a single (or multiple) changing variable(s):
  • Time series: sampled at different times
  • Images: sampled at different locations
Convolutional Neural Networks
• Convolution: neighborhood operator
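The neighborhood idea can be shown in one dimension before moving to images. A minimal sketch (a "valid" convolution as used in deep learning, i.e. without kernel flipping; the input and kernel values are arbitrary):

```python
# 1-D valid convolution: each output is a weighted sum over a neighborhood of the input.
def conv1d(x, k):
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]

# A [1, 1] kernel sums each pair of neighbors:
print(conv1d([1, 2, 3, 4], [1, 1]))  # [3, 5, 7]
```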
Convolutional Neural Networks
• How to use convolutions in a neural network?
• Example for images: the data has four dimensions (𝐵, 𝐶, 𝐻, 𝑊)
  • 𝐵 “batch”: the examples in a batch
  • 𝐶 “channels”: the channels of each example (RGB for an image)
  • 𝐻, 𝑊 “spatial dimensions”: the intensities of a single channel of one example
• The weights of a single layer have four dimensions (𝐹, 𝐶, 𝐾_𝐻, 𝐾_𝑊)
  • 𝐹 “filters”: the filters of a layer
  • 𝐶 “channels”: the channels of a layer; correspond to the channels of the data
  • 𝐾_𝐻, 𝐾_𝑊 “kernel dimensions”: the actual kernel that is applied to a single channel
Convolutional Neural Networks
• How to use convolutions in a neural network?
• Input: single image, 1 × 3 × 4 × 4
• Weight: single kernel, 1 × 3 × 2 × 2
  • There’s one kernel slice for each channel!
• Output: single image with a single channel, 1 × 1 × 3 × 3
  • 𝑰_𝑅 ∗ 𝑲_𝑂 + 𝑰_𝐺 ∗ 𝑲_𝐺 + 𝑰_𝐵 ∗ 𝑲_𝑃 (one kernel slice per input channel, summed)
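That per-channel-then-sum computation can be written out explicitly. A minimal sketch with made-up input values and an all-ones kernel; each of the 3 channels is convolved with its own 2 × 2 kernel slice and the results are added into one output channel:

```python
import numpy as np

# One filter applied to one image: img (C, H, W), kern (C, KH, KW) -> (H-KH+1, W-KW+1).
def conv2d_single(img, kern):
    C, H, W = img.shape
    _, KH, KW = kern.shape
    out = np.zeros((H - KH + 1, W - KW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the (C, KH, KW) patch by the kernel and sum over ALL channels.
            out[i, j] = np.sum(img[:, i:i+KH, j:j+KW] * kern)
    return out

img = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)  # a 3x4x4 "image"
kern = np.ones((3, 2, 2))                                  # one 2x2 slice per channel
print(conv2d_single(img, kern).shape)  # (3, 3): single-channel 3x3 output
```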
Convolutional Neural Networks
• How to use convolutions in a neural network?
  • 𝑰_𝑅 ∗ 𝑲_𝑂 + 𝑰_𝐺 ∗ 𝑲_𝐺 + 𝑰_𝐵 ∗ 𝑲_𝑃 has size 1 × 1 × 3 × 3
  • Use the non-linearity again: ReLU(𝑰_𝑅 ∗ 𝑲_𝑂 + 𝑰_𝐺 ∗ 𝑲_𝐺 + 𝑰_𝐵 ∗ 𝑲_𝑃)
• Use multiple filters:
  • Filter 1: 1 × 3 × 2 × 2 → output 1 × 1 × 3 × 3
  • Filter 2: 1 × 3 × 2 × 2 → output 1 × 1 × 3 × 3
  • Filter 3: 1 × 3 × 2 × 2 → output 1 × 1 × 3 × 3
  • Together: weights 3 × 3 × 2 × 2, output 1 × 3 × 3 × 3
Convolutional Neural Networks
• How to use convolutions in a neural network?
• One layer:
  • Input: (𝐵, 𝐶, 𝐻, 𝑊)
  • Filters: (𝐹, 𝐶, 𝐾_𝐻, 𝐾_𝑊)
  • Output: (𝐵, 𝐹, 𝐻 − 𝐾_𝐻 + 1, 𝑊 − 𝐾_𝑊 + 1)
  • Apply the non-linearity to each of the outputs
• Each layer is characterized by
  • #Filters
  • Kernel size
• Channels are the filters of the previous layer!
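The shape rule above (no padding, stride 1) can be captured as a small helper; the function name is my own, not from the slides:

```python
# Output shape of a conv layer: input (B, C, H, W), filters (F, C, KH, KW)
# -> output (B, F, H - KH + 1, W - KW + 1), assuming no padding and stride 1.
def conv_out_shape(inp, filt):
    B, C, H, W = inp
    F, C2, KH, KW = filt
    assert C == C2, "filter channels must match input channels"
    return (B, F, H - KH + 1, W - KW + 1)

print(conv_out_shape((1, 3, 4, 4), (3, 3, 2, 2)))  # (1, 3, 3, 3), as on the slide
```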
Convolutional Neural Networks
• Calculation exercise
• How many MACs for the example with three filters?
  • Input 1 × 3 × 4 × 4
  • Filter 3 × 3 × 2 × 2
• For 𝑁 images?
  • Input 𝑁 × 3 × 4 × 4
  • Filter 3 × 3 × 2 × 2
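One way to work the exercise: each output element costs 𝐶·𝐾_𝐻·𝐾_𝑊 MACs, and there are 𝐵·𝐹·out_H·out_W output elements. A sketch (no padding, stride 1; the helper name is mine):

```python
# MACs of a conv layer = (#output elements) * (MACs per output element).
def conv_macs(inp, filt):
    B, C, H, W = inp
    F, _, KH, KW = filt
    out_h, out_w = H - KH + 1, W - KW + 1
    return B * F * out_h * out_w * (C * KH * KW)

print(conv_macs((1, 3, 4, 4), (3, 3, 2, 2)))  # 1*3*3*3 outputs * 12 MACs each = 324
print(conv_macs((8, 3, 4, 4), (3, 3, 2, 2)))  # N=8 images: 8 * 324 = 2592
```

So the single-image answer is 324 MACs, and it simply scales linearly with 𝑁.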
Convolutional Neural Networks
• Padding: add 0s around the outside of the input data to preserve the size
  • Example: 3 × 3 kernel, pad the input with 1 line of zeros on the outside
• Pooling: reduce the spatial size
  • Higher layers work on more “abstract” data
  • Max pooling: keep the maximum of a 𝐾_𝐻 × 𝐾_𝑊 patch (no calculation)
  • Average pooling: compute the mean value in a 𝐾_𝐻 × 𝐾_𝑊 patch
  • For 𝐾_𝐻 × 𝐾_𝑊 = 2 × 2 each spatial dimension is reduced by a factor of 2:
    1 × 3 × 4 × 4 → 1 × 3 × 2 × 2
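Both pooling variants can be sketched on a single channel; the 4 × 4 input values below are just illustrative, and non-overlapping 2 × 2 patches are assumed:

```python
import numpy as np

# 2x2 pooling on one channel: each non-overlapping 2x2 patch -> one value.
def pool2x2(x, op):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = op(x[i:i+2, j:j+2])
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x, np.max))   # max pooling: keeps the largest value per patch
print(pool2x2(x, np.mean))  # average pooling: mean value per patch
```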
Convolutional Neural Networks
• Calculation exercise
• How many MACs with padding and 2 × 2 max pooling?
  • Input 1 × 3 × 4 × 4
  • Filter 4 × 3 × 3 × 3
  • What’s the output size?
• How many MACs with padding and 2 × 2 average pooling?
  • Input 1 × 3 × 4 × 4
  • Filter 4 × 3 × 3 × 3
  • What’s the output size?
Convolutional Neural Networks
• Multiple layers:
  • Input 1 × 3 × 4 × 4
  • Layer 1: filter 4 × 3 × 3 × 3, w/ padding
  • 2 × 2 max pooling
  • Layer 2: filter 1 × 4 × 1 × 1, w/ padding
• Q:
  • What’s the output of layer 1?
  • After max pooling?
  • And after layer 2?
Classification Loss Functions
• Classification loss
• “Softmax”: 𝑃(Class 𝑐 | ŷ) = exp(ŷ_𝑐) / Σₖ₌₁..𝐶 exp(ŷ_𝑘)
  • Converts the prediction vector ŷ to probabilities
• Cross entropy: ℒ = −Σₖ₌₁..𝐶 [ 𝑦ₖ log 𝑃(𝑘 | ŷ) + (1 − 𝑦ₖ) log(1 − 𝑃(𝑘 | ŷ)) ]
  • 𝒚 is a 1-hot vector: it’s 0 everywhere except for where the true class is
• Cats and dogs and mice example:
  • (1, 0, 0) is the label for cat
  • (0, 1, 0) for dog
  • (0, 0, 1) for mouse
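Softmax and the cross entropy above (in the binary-style form with both the 𝑦ₖ and (1 − 𝑦ₖ) terms, as on the slide) fit in a few lines. The score vector below is an arbitrary example:

```python
import math

# Softmax: turn raw scores into probabilities that sum to 1.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Cross entropy over a 1-hot label, matching the slide's formula.
def cross_entropy(y_onehot, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_onehot, probs))

p = softmax([2.0, 1.0, 0.1])        # e.g. network scores for (cat, dog, mouse)
loss = cross_entropy([1, 0, 0], p)  # true label: cat
print(round(sum(p), 6))  # 1.0: softmax outputs form a probability distribution
```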
Recurrent Neural Networks
• Time series / language modelling
  • Input/output for each time step
• The weights are the same in each time step (similar to convolution)
  • Update state: ℎₜ = 𝜎(𝑉ℎₜ₋₁ + 𝑈𝑥ₜ₋₁ + 𝑏_ℎ)
  • Compute output: 𝑜ₜ = 𝜎(𝑊ℎₜ + 𝑏_𝑜)
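The two equations can be unrolled over a toy sequence. This is a minimal sketch with random weights and made-up inputs; the point is that the same 𝑈, 𝑉, 𝑊 are reused at every time step, with only the state ℎ carrying information forward.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 2)) * 0.1   # input  -> hidden
V = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (recurrence)
W = rng.normal(size=(3, 4)) * 0.1   # hidden -> output
b_h, b_o = np.zeros(4), np.zeros(3)

h = np.zeros(4)                     # initial state
outputs = []
for x in [np.ones(2), np.zeros(2), np.ones(2)]:   # a toy 3-step input sequence
    h = np.tanh(V @ h + U @ x + b_h)   # update state (same U, V every step)
    o = np.tanh(W @ h + b_o)           # compute output (same W every step)
    outputs.append(o)

print(len(outputs), outputs[0].shape)  # 3 outputs, one 3-vector per time step
```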
Recurrent Neural Networks
• Disadvantage:
  • Difficult to parallelise; use of a GPU is less efficient
• Multiple layers:
  • Run an RNN in one direction first (left to right)
  • Run a second RNN in the other direction, using the 𝑜ₜ as inputs
Summary
• Neural networks: layers of linear functions + non-linearities
• Convolutional neural networks: layers of convolutions + non-linearities
• Recurrent neural networks: time steps of linear functions + non-linearities
• Machine learning: learning from observations
  • DFLO = Data + Function + Loss + Optimisation