Introduction to Neural Networks
Transcript of Introduction to Neural Networks
Media IC & System Lab
柯揚
Agenda
• Origin of Neural Networks
• Math of Neural Networks
• Machine Learning
• Deep Learning
• Convolutional Neural Networks
• Recurrent Neural Networks
Origin
[Diagram: the cycle Theory → Prediction → Verification → Observation]
Origin
Origin
• What if we don’t have a theory?
  • In physics, we look for variables and their relations: 𝑓 = 𝑚𝑎
• What if we have variables but cannot figure out their relations?
• Examples:
  • Recognize a person: pixels of an image → identity
  • Solve a riddle: riddle description → solution description
  • …
Origin
Take a look at nature
Origin
• Hubel and Wiesel: recordings from the cat visual cortex showed single cells that respond to simple, localized stimuli
Origin → Math
• Single cells that do a simple computation:
  • Hidden: 𝑧ᵢ = 𝜎(Σⱼ₌₁,₂,₃ 𝑤ᴴᵢⱼ 𝑥ⱼ + 𝑏ᴴᵢ) for 𝑖 = 1, 2, 3, 4
  • 𝜎: ℝ → ℝ … activation function, traditionally tanh(𝑥)
  • Σⱼ₌₁,₂,₃ 𝑤ᴴᵢⱼ 𝑥ⱼ is a scalar product: ⟨𝒘ᴴᵢ, 𝒙⟩
  • 𝒛 = 𝜎(𝑊ᴴ𝒙 + 𝒃ᴴ) in matrix-vector notation
• Output: 𝒚 = 𝑊ᴼ𝒛 + 𝒃ᴼ
• We call 𝑊ᴴ and 𝑊ᴼ weights, because they give (more or less) weight to the input
• 𝒃ᴴ and 𝒃ᴼ are called biases, because they bias the output in one direction
• Two matrix multiplications. How many MACs (multiply-accumulate operations)?
[Figure: fully connected network with inputs 𝑥₁, 𝑥₂, 𝑥₃, hidden units 𝑧₁–𝑧₄, and outputs 𝑦₁, 𝑦₂]
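The two-layer computation above can be sketched in a few lines. This is a minimal illustration, not code from the slides; the random weights and the input values are placeholders, and the MAC count simply counts one multiply-accumulate per weight entry.

```python
import numpy as np

# Sketch of the 3 -> 4 -> 2 network from the figure, assuming tanh activation.
rng = np.random.default_rng(0)
W_H, b_H = rng.normal(size=(4, 3)), np.zeros(4)   # hidden weights and bias
W_O, b_O = rng.normal(size=(2, 4)), np.zeros(2)   # output weights and bias

x = np.array([1.0, 2.0, 3.0])      # illustrative input
z = np.tanh(W_H @ x + b_H)         # hidden layer: z = sigma(W_H x + b_H)
y = W_O @ z + b_O                  # output layer: y = W_O z + b_O

# One MAC per weight entry: 3*4 for the hidden layer + 4*2 for the output layer
macs = W_H.size + W_O.size
print(macs)  # 20
```

So the answer to the slide's question for this small network is 3·4 + 4·2 = 20 MACs.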
Math
• Neural Network: compute 𝒛 = 𝜎(𝑊ᴴ𝒙 + 𝒃ᴴ) and 𝒚 = 𝑊ᴼ𝒛 + 𝒃ᴼ
  • 𝒙 are our inputs; 𝒚 are our outputs.
• How do we get our parameters 𝜃 = (𝑊ᴴ, 𝒃ᴴ, 𝑊ᴼ, 𝒃ᴼ)?
• The neural network should do something very well, so optimise it to do that…
Math
• Example: predict the temperature tomorrow from the same day of the last five years
  • Input 𝑥 ∈ ℝ⁵: temperature vector for the last five years
  • Output 𝑦 ∈ ℝ: temperature tomorrow
  • The network produces predictions ŷ ∈ ℝ
• To measure how well our neural network is doing, we need some kind of loss measure
  • Take mean-squared error in this example
[Figure: the five inputs are T on July 17th of 2014–2018; the prediction target is T on July 17th, 2019]
Math → Machine Learning
• We have:
  • A neural network: 𝑓(𝑥; 𝜃) = ŷ
  • A loss: ℒ(𝑌, Ŷ) = (1/𝐷) Σᵢ₌₁..𝐷 (𝑦ᵢ − ŷᵢ)²
    • 𝑌 = (𝑦₁, …, 𝑦_𝐷), Ŷ = (ŷ₁, …, ŷ_𝐷)
• How do we get our truth, 𝑌?
  • Collect data…
  • An observation is the best truth you can get.
Machine Learning
• Learning from Observations
• Recipe for practitioners:
  1. Collect (lots of) examples 𝑋 → 𝑌 (data + labels)
  2. Choose a suitable function 𝑓(𝑥; 𝜃)
  3. Choose a loss function ℒ
  4. Optimize 𝜃 to minimize ℒ
Machine Learning
• 1. Collect data
  • Most algorithms assume that data is IID: independent and identically distributed
• An example: tell dogs apart from cats
  • Give your algorithm 990 images of dogs, 10 images of cats
  • If your algorithm always answers “dog”, what’s the error? Only 1%, even though it has learned nothing about cats.
  → The data is not IID
Machine Learning
• 2. Choose a suitable function 𝑓(𝑥; 𝜃)
• How to choose a good function?
• Avoid under-/overfitting
• How to check this?
  • Train on one part of the data: (𝑋_Train, 𝑌_Train)
  • Validate on the other part: (𝑋_Test, 𝑌_Test)
Machine Learning
• 3. Loss ℒ & 4. Optimisation
• Optimisation in high school: set ∂ℒ(𝑌, 𝑓(𝑋; 𝜃))/∂𝜃 = 0 and solve for 𝜃
• No closed-form solution…
• Instead: gradient descent
Machine Learning
• 3. Loss ℒ & 4. Optimisation
• Gradient descent: iterative minimisation, take the steepest descent direction in each step:
  • 𝜃ₜ₊₁ = 𝜃ₜ − 𝜂 ∂ℒ/∂𝜃
  • 𝜂 … step size (how far you walk before you check your direction)
  • ∂ℒ/∂𝜃 … gradient (you need a loss function that is differentiable)
• Does not give a global optimum!
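The update rule above can be run by hand on a toy loss. A minimal sketch, assuming the 1-D loss ℒ(𝜃) = (𝜃 − 3)², whose gradient is 2(𝜃 − 3) and whose minimum is at 𝜃 = 3:

```python
# Gradient descent on L(theta) = (theta - 3)^2, gradient dL/dtheta = 2*(theta - 3).
theta = 0.0   # arbitrary starting point
eta = 0.1     # step size
for _ in range(100):
    grad = 2.0 * (theta - 3.0)
    theta = theta - eta * grad   # theta_{t+1} = theta_t - eta * dL/dtheta

print(round(theta, 4))  # converges toward the minimum at 3.0
```

For this convex toy loss the iteration does reach the global minimum; the slide's caveat applies to the non-convex losses of real networks.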
Machine Learning → Deep Learning
• Deep Learning: machine learning with deep neural networks
  • Deep = a lot of layers
• Each layer is a function 𝑦ₗ = 𝜎(𝑊ₗ𝑥ₗ + 𝑏ₗ) where
  • 𝑥ₗ is the layer’s input (the previous layer’s output)
  • 𝑦ₗ is the layer’s output
  • 𝜃ₗ = (𝑊ₗ, 𝑏ₗ) are the parameters of the layer
• Three-layer neural network:
  • 𝑦₃ = 𝜎(𝑊₃ 𝜎(𝑊₂ 𝜎(𝑊₁𝑥₁ + 𝑏₁) + 𝑏₂) + 𝑏₃)
• How do we compute the gradient to update the parameters?
  • ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃) with ∂𝑦₃/∂𝑏₃ = 𝜎′ and ∂𝑦₃/∂𝑊₃ = 𝜎′𝑥₃
  • ∂ℒ/∂𝜃₂ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝜃₂) with ∂𝑦₂/∂𝑏₂ = 𝜎′ and ∂𝑦₂/∂𝑊₂ = 𝜎′𝑥₂
  • ∂ℒ/∂𝜃₁ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝑦₁)(∂𝑦₁/∂𝜃₁) with ∂𝑦₁/∂𝑏₁ = 𝜎′ and ∂𝑦₁/∂𝑊₁ = 𝜎′𝑥₁
Deep Learning
• How do we compute the gradient to update the parameters?
  • ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃)
  • ∂ℒ/∂𝜃₂ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝜃₂)
  • ∂ℒ/∂𝜃₁ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)(∂𝑦₂/∂𝑦₁)(∂𝑦₁/∂𝜃₁)
• Gradient computation with the chain rule yields shared factors; compute them only once, back to front:
  • ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃)
  • ∂ℒ/∂𝜃₂ = (∂ℒ/∂𝑦₂)(∂𝑦₂/∂𝜃₂), reusing ∂ℒ/∂𝑦₂ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝑦₂)
  • ∂ℒ/∂𝜃₁ = (∂ℒ/∂𝑦₁)(∂𝑦₁/∂𝜃₁), reusing ∂ℒ/∂𝑦₁ = (∂ℒ/∂𝑦₂)(∂𝑦₂/∂𝑦₁)
→ Backpropagation
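The backward sweep can be demonstrated on a toy network of three scalar layers 𝑦ₗ = tanh(𝑤ₗ𝑥ₗ + 𝑏ₗ). This is an illustrative sketch, not the lecture's code; the parameter values and the squared-error loss are arbitrary. The key line is the last one in the loop, which updates the shared factor ∂ℒ/∂𝑦ₗ exactly once per layer.

```python
import numpy as np

# Forward pass through three scalar tanh layers; acts = [x1, y1, y2, y3].
def forward(x, params):
    acts = [x]
    for w, b in params:
        acts.append(np.tanh(w * acts[-1] + b))
    return acts

params = [(0.5, 0.1), (-1.2, 0.0), (0.8, -0.3)]   # (w_l, b_l), toy values
target = 0.25
acts = forward(0.7, params)

# Loss L = (y3 - target)^2, so the seed gradient is dL/dy3 = 2*(y3 - target).
dL_dy = 2.0 * (acts[-1] - target)
grads = [None] * 3
for l in reversed(range(3)):
    w, b = params[l]
    s = 1.0 - np.tanh(w * acts[l] + b) ** 2        # sigma'(pre) for tanh
    grads[l] = (dL_dy * s * acts[l],               # dL/dw_l = dL/dy_l * sigma' * x_l
                dL_dy * s)                         # dL/db_l = dL/dy_l * sigma'
    dL_dy = dL_dy * s * w                          # shared factor dL/dy_{l-1}, computed once
```

Checking `grads[0][0]` against a finite-difference estimate of ∂ℒ/∂𝑤₁ confirms the chain rule is applied correctly.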
Deep Learning
• In practice, for standard operations, this isn’t implemented manually
• Modern deep learning frameworks support automatic differentiation
  → Tell them what 𝑓 looks like and they will take care of the rest
Deep Learning
• The non-linearity 𝜎
• Let’s look at the last layer’s gradient: ∂ℒ/∂𝜃₃ = (∂ℒ/∂𝑦₃)(∂𝑦₃/∂𝜃₃)
  • 𝑦₃ = 𝜎(𝑊₃𝑥₃ + 𝑏₃)
  • For each parameter: ∂𝑦₃/∂𝑏₃ = 𝜎′ and ∂𝑦₃/∂𝑊₃ = 𝜎′𝑥₃
  • 𝜎′ is shorthand for 𝜎′(𝑊₃𝑥₃ + 𝑏₃)
  • Traditionally: 𝜎(𝑥) = tanh(𝑥) (from biology)
• What if 𝑊₃𝑥₃ + 𝑏₃ is very far away from 0?
Deep Learning
• The non-linearity 𝜎
• Vanishing gradient: the gradient may become very small in a deep network, because tanh′ is close to 0 far from the origin and the chain rule multiplies many such factors
• Modern non-linearity:
  • 𝜎(𝑥) = ReLU(𝑥) = 𝑥 if 𝑥 ≥ 0, else 0
  • 𝜎′(𝑥) = ReLU′(𝑥) = 1 if 𝑥 ≥ 0, else 0
• Easy to compute, and the gradient only vanishes on one half…
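The saturation problem is easy to see numerically. A minimal sketch comparing tanh′(𝑥) = 1 − tanh²(𝑥) with ReLU′(𝑥) for a pre-activation far from 0 (the value 10.0 is just an illustrative example):

```python
import math

# tanh saturates: its derivative collapses far from zero.
def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# ReLU passes the gradient through unchanged on the positive half.
def relu_grad(x):
    return 1.0 if x >= 0 else 0.0

print(tanh_grad(10.0))  # ~8.2e-09: essentially no gradient flows back
print(relu_grad(10.0))  # 1.0
```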
Deep Learning
• Optimisation
• (Iterative) gradient descent: 𝜃ₜ₊₁ = 𝜃ₜ − 𝜂 ∂ℒ/∂𝜃
• In each iteration: compute gradients for all examples
• With big data, that takes a lot of time…
→ Stochastic Gradient Descent (SGD):
  • Randomly choose a small set of examples (called a “batch”)
  • Compute the gradient only for that subset of examples
Deep Learning → Convolutional Neural Networks
• How to choose the right 𝑓?
• Sample from a salary questionnaire:
  • Age, Gender, Experience, Education, Salary, Bonus, …
  • Can I change their order without changing the meaning of the sample?
  • Salary, Gender, Education, Bonus, Age, Experience, … → yes, the order carries no meaning
• Natural-language sentence:
  • 我喜歡機器學習 (“I like machine learning”)
  • Can I change the order?
  • 機器喜歡我學習 (“The machine likes me studying”) → no, the meaning changes
Convolutional Neural Networks
• In a lot of data, we have direct neighborhood relations:
  • Images
  • Audio
  • Video
  • Natural language
  • …
• All data that is sampled according to a single (or multiple) changing variable(s):
  • Time series: sampled at different times
  • Images: sampled at different locations
Convolutional Neural Networks
• Convolution: neighborhood operator
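The neighborhood idea can be shown in one dimension before moving to images. A minimal sketch (a "valid" convolution as used in deep learning, i.e. without kernel flipping; the input and kernel values are arbitrary):

```python
# 1-D valid convolution: each output is a weighted sum over a neighborhood of the input.
def conv1d(x, k):
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]

# A [1, 1] kernel sums each pair of neighbors:
print(conv1d([1, 2, 3, 4], [1, 1]))  # [3, 5, 7]
```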
Convolutional Neural Networks
• How to use convolutions in a neural network?
• Example for images: the data has four dimensions (𝐵, 𝐶, 𝐻, 𝑊)
  • 𝐵 “batch”: the examples in a batch
  • 𝐶 “channels”: the channels of each example (RGB for an image)
  • 𝐻, 𝑊 “spatial dimensions”: the intensities of a single channel of one example
• The weights of a single layer have four dimensions (𝐹, 𝐶, 𝐾_𝐻, 𝐾_𝑊)
  • 𝐹 “filters”: the filters of a layer
  • 𝐶 “channels”: the channels of a layer; correspond to the channels of the data
  • 𝐾_𝐻, 𝐾_𝑊 “kernel dimensions”: the actual kernel that is applied to a single channel
Convolutional Neural Networks
• How to use convolutions in a neural network?
• Input: single image, 1 × 3 × 4 × 4
• Weight: single kernel, 1 × 3 × 2 × 2
  • There’s one kernel slice for each channel!
• Output: single image with a single channel, 1 × 1 × 3 × 3
  • 𝑰_𝑅 ∗ 𝑲_𝑂 + 𝑰_𝐺 ∗ 𝑲_𝐺 + 𝑰_𝐵 ∗ 𝑲_𝑃 (one kernel slice per input channel, summed)
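That per-channel-then-sum computation can be written out explicitly. A minimal sketch with made-up input values and an all-ones kernel; each of the 3 channels is convolved with its own 2 × 2 kernel slice and the results are added into one output channel:

```python
import numpy as np

# One filter applied to one image: img (C, H, W), kern (C, KH, KW) -> (H-KH+1, W-KW+1).
def conv2d_single(img, kern):
    C, H, W = img.shape
    _, KH, KW = kern.shape
    out = np.zeros((H - KH + 1, W - KW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the (C, KH, KW) patch by the kernel and sum over ALL channels.
            out[i, j] = np.sum(img[:, i:i+KH, j:j+KW] * kern)
    return out

img = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)  # a 3x4x4 "image"
kern = np.ones((3, 2, 2))                                  # one 2x2 slice per channel
print(conv2d_single(img, kern).shape)  # (3, 3): single-channel 3x3 output
```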
Convolutional Neural Networks
• How to use convolutions in a neural network?
  • 𝑰_𝑅 ∗ 𝑲_𝑂 + 𝑰_𝐺 ∗ 𝑲_𝐺 + 𝑰_𝐵 ∗ 𝑲_𝑃 has size 1 × 1 × 3 × 3
  • Use the non-linearity again: ReLU(𝑰_𝑅 ∗ 𝑲_𝑂 + 𝑰_𝐺 ∗ 𝑲_𝐺 + 𝑰_𝐵 ∗ 𝑲_𝑃)
• Use multiple filters:
  • Filter 1: 1 × 3 × 2 × 2 → output 1 × 1 × 3 × 3
  • Filter 2: 1 × 3 × 2 × 2 → output 1 × 1 × 3 × 3
  • Filter 3: 1 × 3 × 2 × 2 → output 1 × 1 × 3 × 3
  • Together: weights 3 × 3 × 2 × 2, output 1 × 3 × 3 × 3
Convolutional Neural Networks
• How to use convolutions in a neural network?
• One layer:
  • Input: (𝐵, 𝐶, 𝐻, 𝑊)
  • Filters: (𝐹, 𝐶, 𝐾_𝐻, 𝐾_𝑊)
  • Output: (𝐵, 𝐹, 𝐻 − 𝐾_𝐻 + 1, 𝑊 − 𝐾_𝑊 + 1)
  • Apply the non-linearity to each of the outputs
• Each layer is characterized by
  • #Filters
  • Kernel size
• Channels are the filters of the previous layer!
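The shape rule above (no padding, stride 1) can be captured as a small helper; the function name is my own, not from the slides:

```python
# Output shape of a conv layer: input (B, C, H, W), filters (F, C, KH, KW)
# -> output (B, F, H - KH + 1, W - KW + 1), assuming no padding and stride 1.
def conv_out_shape(inp, filt):
    B, C, H, W = inp
    F, C2, KH, KW = filt
    assert C == C2, "filter channels must match input channels"
    return (B, F, H - KH + 1, W - KW + 1)

print(conv_out_shape((1, 3, 4, 4), (3, 3, 2, 2)))  # (1, 3, 3, 3), as on the slide
```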
Convolutional Neural Networks
• Calculation exercise
• How many MACs for the example with three filters?
  • Input 1 × 3 × 4 × 4
  • Filter 3 × 3 × 2 × 2
• For 𝑁 images?
  • Input 𝑁 × 3 × 4 × 4
  • Filter 3 × 3 × 2 × 2
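One way to work the exercise: each output element costs 𝐶·𝐾_𝐻·𝐾_𝑊 MACs, and there are 𝐵·𝐹·out_H·out_W output elements. A sketch (no padding, stride 1; the helper name is mine):

```python
# MACs of a conv layer = (#output elements) * (MACs per output element).
def conv_macs(inp, filt):
    B, C, H, W = inp
    F, _, KH, KW = filt
    out_h, out_w = H - KH + 1, W - KW + 1
    return B * F * out_h * out_w * (C * KH * KW)

print(conv_macs((1, 3, 4, 4), (3, 3, 2, 2)))  # 1*3*3*3 outputs * 12 MACs each = 324
print(conv_macs((8, 3, 4, 4), (3, 3, 2, 2)))  # N=8 images: 8 * 324 = 2592
```

So the single-image answer is 324 MACs, and it simply scales linearly with 𝑁.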
Convolutional Neural Networks
• Padding: add 0s around the outside of the input data to preserve the size
  • Example: 3 × 3 kernel, pad the input with 1 line of zeros on the outside
• Pooling: reduce the spatial size
  • Higher layers work on more “abstract” data
  • Max pooling: keep the maximum of a 𝐾_𝐻 × 𝐾_𝑊 patch (no calculation)
  • Average pooling: compute the mean value in a 𝐾_𝐻 × 𝐾_𝑊 patch
  • For 𝐾_𝐻 × 𝐾_𝑊 = 2 × 2 each spatial dimension is reduced by a factor of 2:
    1 × 3 × 4 × 4 → 1 × 3 × 2 × 2
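Both pooling variants can be sketched on a single channel; the 4 × 4 input values below are just illustrative, and non-overlapping 2 × 2 patches are assumed:

```python
import numpy as np

# 2x2 pooling on one channel: each non-overlapping 2x2 patch -> one value.
def pool2x2(x, op):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = op(x[i:i+2, j:j+2])
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x, np.max))   # max pooling: keeps the largest value per patch
print(pool2x2(x, np.mean))  # average pooling: mean value per patch
```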
Convolutional Neural Networks
• Calculation exercise
• How many MACs with padding and 2 × 2 max pooling?
  • Input 1 × 3 × 4 × 4
  • Filter 4 × 3 × 3 × 3
  • What’s the output size?
• How many MACs with padding and 2 × 2 average pooling?
  • Input 1 × 3 × 4 × 4
  • Filter 4 × 3 × 3 × 3
  • What’s the output size?
Convolutional Neural Networks
• Multiple layers:
  • Input 1 × 3 × 4 × 4
  • Layer 1: filter 4 × 3 × 3 × 3, w/ padding
  • 2 × 2 max pooling
  • Layer 2: filter 1 × 4 × 1 × 1, w/ padding
• Q:
  • What’s the output of layer 1?
  • After max pooling?
  • And after layer 2?
Classification Loss Functions
• Classification loss
• “Softmax”: 𝑃(Class 𝑐 | ŷ) = exp(ŷ_𝑐) / Σₖ₌₁..𝐶 exp(ŷ_𝑘)
  • Converts the prediction vector ŷ to probabilities
• Cross entropy: ℒ = −Σₖ₌₁..𝐶 [ 𝑦ₖ log 𝑃(𝑘 | ŷ) + (1 − 𝑦ₖ) log(1 − 𝑃(𝑘 | ŷ)) ]
  • 𝒚 is a 1-hot vector: it’s 0 everywhere except for where the true class is
• Cats and dogs and mice example:
  • (1, 0, 0) is the label for cat
  • (0, 1, 0) for dog
  • (0, 0, 1) for mouse
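Softmax and the cross entropy above (in the binary-style form with both the 𝑦ₖ and (1 − 𝑦ₖ) terms, as on the slide) fit in a few lines. The score vector below is an arbitrary example:

```python
import math

# Softmax: turn raw scores into probabilities that sum to 1.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Cross entropy over a 1-hot label, matching the slide's formula.
def cross_entropy(y_onehot, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_onehot, probs))

p = softmax([2.0, 1.0, 0.1])        # e.g. network scores for (cat, dog, mouse)
loss = cross_entropy([1, 0, 0], p)  # true label: cat
print(round(sum(p), 6))  # 1.0: softmax outputs form a probability distribution
```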
Recurrent Neural Networks
• Time series / language modelling
  • Input/output for each time step
• The weights are the same in each time step (similar to convolution)
  • Update state: ℎₜ = 𝜎(𝑉ℎₜ₋₁ + 𝑈𝑥ₜ₋₁ + 𝑏_ℎ)
  • Compute output: 𝑜ₜ = 𝜎(𝑊ℎₜ + 𝑏_𝑜)
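The two equations can be unrolled over a toy sequence. This is a minimal sketch with random weights and made-up inputs; the point is that the same 𝑈, 𝑉, 𝑊 are reused at every time step, with only the state ℎ carrying information forward.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 2)) * 0.1   # input  -> hidden
V = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (recurrence)
W = rng.normal(size=(3, 4)) * 0.1   # hidden -> output
b_h, b_o = np.zeros(4), np.zeros(3)

h = np.zeros(4)                     # initial state
outputs = []
for x in [np.ones(2), np.zeros(2), np.ones(2)]:   # a toy 3-step input sequence
    h = np.tanh(V @ h + U @ x + b_h)   # update state (same U, V every step)
    o = np.tanh(W @ h + b_o)           # compute output (same W every step)
    outputs.append(o)

print(len(outputs), outputs[0].shape)  # 3 outputs, one 3-vector per time step
```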
Recurrent Neural Networks
• Disadvantage:
  • Difficult to parallelise; use of a GPU is less efficient
• Multiple layers:
  • Run an RNN in one direction first (left to right)
  • Run a second RNN in the other direction, using the 𝑜ₜ as inputs
Summary
• Neural networks: layers of linear functions + non-linearities
• Convolutional neural networks: layers of convolutions + non-linearities
• Recurrent neural networks: time steps of linear functions + non-linearities
• Machine learning: learning from observations
  • DFLO = Data + Function + Loss + Optimisation