Deep Learning Theory and Practice
Lecture 5
Introduction to deep neural networks
Dr. Ted Willke [email protected]
Tuesday, January 21, 2020
Review of Lecture 4

• Adaline ‘neuron’ minimizes the squared loss
The Adaline Algorithm:

1: $w(1) = 0$
2: for iteration $t = 1, 2, 3, \ldots$
3:   pick a point (at random) $(x^*, y^*) \in D$
4:   compute $s(t) = w(t)^T x^*$ (forward pass of ‘signal’)
5:   update the weights: $w(t+1) = w(t) + \eta \cdot (y^* - s(t)) \cdot x^*$ (backward pass of updates)
6:   $t \leftarrow t + 1$
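A minimal NumPy sketch of this loop (the function name, the uniform random pick, and the fixed iteration budget are my own illustrative choices; the slides give only the six steps above):

```python
import numpy as np

def adaline_sgd(X, y, eta=0.01, iters=1000, seed=0):
    """Sketch of the Adaline algorithm above.

    X : (N, d) data matrix (a bias column of 1s is prepended here).
    y : (N,) real-valued targets (or +/-1 labels).
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # bias coordinate x0 = 1
    w = np.zeros(Xb.shape[1])                      # step 1: w(1) = 0
    for _ in range(iters):                         # step 2: for t = 1, 2, 3, ...
        n = rng.integers(len(y))                   # step 3: pick a point at random
        s = w @ Xb[n]                              # step 4: forward pass, s = w^T x*
        w = w + eta * (y[n] - s) * Xb[n]           # step 5: backward pass of updates
    return w
```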
Review of Lecture 4

• Logistic regression: Better classification
Uses $h_w(x) = \theta(w^T x)$, where $\theta(s) = \dfrac{1}{1 + e^{-s}}$.

Gives us the probability of $y$ being the label: $P(y \mid x) = \theta(y\, w^T x)$.
• Learning should strive to maximize this joint probability over the training data:

$P(y_1, \ldots, y_N \mid x_1, \ldots, x_N) = \prod_{n=1}^{N} P(y_n \mid x_n).$
• The principle of maximum likelihood says we can do this if we minimize this error:

$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right).$
• We can’t minimize this analytically, but we can numerically/iteratively drive $\nabla_w E_{in}(w) \to 0$:

1. Compute the gradient: $g_t = \nabla E_{in}(w(t))$
2. Move in the direction: $\hat{v}_t = -g_t$
3. Update the weights: $w(t+1) = w(t) + \eta \hat{v}_t$
4. Repeat until converged! (A convex problem.)
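As a concrete instance of these four steps, here is a sketch of batch gradient descent on the cross-entropy error above (function names and the fixed iteration budget are my assumptions, not course code):

```python
import numpy as np

def cross_entropy(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    """grad E_in(w) = -(1/N) * sum_n y_n x_n / (1 + exp(y_n w^T x_n))."""
    return -((y / (1.0 + np.exp(y * (X @ w)))) @ X) / len(y)

def logistic_gd(X, y, eta=0.1, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        g = gradient(w, X, y)   # 1. compute the gradient
        v = -g                  # 2. move in the direction v = -g
        w = w + eta * v         # 3. update the weights
    return w                    # 4. (a fixed budget stands in for 'until converged')
```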
Review: Gradient Descent

Ball on complicated hilly terrain:
- rolls down to a local valley
- this is called a local minimum
Questions:
1. How to get to the bottom of the deepest valley?
2. How to do this when we don’t have gravity :-)?
Today’s Lecture
• Review of gradient descent
• What is a deep neural network?
• How do we train one?
• How do we train one efficiently?
• Tutorial: Image classification using a logistic regression network

(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks guys!)
Our $E_{in}$ has only one valley

… because $E_{in}(w)$ is a convex function of $w$.
Can you prove this for logistic regression?
How to ‘roll’ down

Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction $\hat{v}$:

$w(t+1) = w(t) + \eta \hat{v}$

We get to select $\hat{v}$.

Select $\hat{v}$ to make $E_{in}(w(t+1))$ as small as possible.

What’s the best direction to take a step in?
The gradient is the fastest way to roll down

The change in $E_{in}$ is approximately $\Delta E_{in} \approx \eta\, \nabla E_{in}(w(t))^T \hat{v}$.

What choice of $\hat{v}$ will maximize the descent (make $\Delta E_{in}$ as negative as possible)?
Maximizing the descent

How do we maximize the descent $\Delta E_{in}$?

$\Delta E_{in} \approx \eta \nabla E_{in}(w(t))^T \hat{v} = \eta \|\nabla E_{in}(w(t))\| \|\hat{v}\| \cos\theta$, where $\theta$ is the angle between $\nabla E_{in}$ and $\hat{v}$.

The inner product is maximized when $\cos\theta = 1$, i.e., when $\hat{v}$ points in the direction of $\nabla E_{in}(w(t))$.

Therefore, we take the largest negative step when

$\hat{v} = -\dfrac{\nabla E_{in}(w(t))}{\|\nabla E_{in}(w(t))\|}.$
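A quick numerical sanity check of this claim. The quadratic bowl below is an illustrative stand-in for $E_{in}$ (not anything from the course): the normalized negative gradient beats 1,000 random unit directions.

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])       # a convex quadratic 'E_in'
E = lambda w: 0.5 * w @ H @ w
gradE = lambda w: H @ w

rng = np.random.default_rng(0)
w, eta = np.array([1.0, -2.0]), 1e-4
v_star = -gradE(w) / np.linalg.norm(gradE(w))  # steepest-descent direction

best_random = min(E(w + eta * v / np.linalg.norm(v)) - E(w)
                  for v in rng.standard_normal((1000, 2)))
print(E(w + eta * v_star) - E(w) <= best_random)  # True: no direction does better
```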
‘Rolling down’ = iterating the negative gradient
The ‘Goldilocks’ step size

(Too small an $\eta$ converges slowly; too large an $\eta$ overshoots and can oscillate; ‘just right’ is in between.)
Fixed learning rate gradient descent

Define $g_t = \nabla E_{in}(w(t))$. Then take the step

$w(t+1) = w(t) - \eta\, g_t,$

an effective step of size $\eta \|g_t\|$ in the unit direction $-g_t / \|g_t\|$ (reduces step size as minimum is approached).

Gradient descent can minimize any smooth function (to a local minimum).
Summary of linear models

| Model | Credit analysis task | Error measure (algorithm) |
| --- | --- | --- |
| Perceptron | Approve or deny | Classification error (PLA) |
| Linear regression | Amount of credit | Squared error (pseudo-inverse) |
| Logistic regression | Probability of default | Cross-entropy error (gradient descent) |
Today’s Lecture
• Review of gradient descent
• What is a deep neural network?
• How do we train one?
• How do we train one efficiently?
• Tutorial: Image classification using a logistic regression network
The neural network - biologically inspired

[Figure: biological function and biological structure of a neuron]
Biological inspiration, not bio-literalism

Engineering success can draw upon biological inspiration at many levels of abstraction. We must account for the unique demands and constraints of the in-silico system.
XOR: A limitation of the linear model

[Figure: XOR data, which no single linear separator can classify]
$f = h_1 \bar{h}_2 + \bar{h}_1 h_2$, where

$h_1(x) = \mathrm{sign}(w_1^T x), \qquad h_2(x) = \mathrm{sign}(w_2^T x)$
Perceptrons for OR and AND

$\mathrm{OR}(x_1, x_2) = \mathrm{sign}(x_1 + x_2 + 1.5), \qquad \mathrm{AND}(x_1, x_2) = \mathrm{sign}(x_1 + x_2 - 1.5)$

(for inputs $x_1, x_2 \in \{-1, +1\}$; see the sketch below)
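A sketch of these two perceptrons, composed to implement the XOR decomposition $f = h_1\bar{h}_2 + \bar{h}_1 h_2$ from the previous slide (helper names are mine):

```python
import numpy as np

def sign(s):
    return np.where(s >= 0, 1, -1)

def OR(a, b):  return sign(a + b + 1.5)   # OR(x1, x2) = sign(x1 + x2 + 1.5)
def AND(a, b): return sign(a + b - 1.5)   # AND(x1, x2) = sign(x1 + x2 - 1.5)
def NOT(a):    return -a                  # negation flips a +/-1 value

def XOR(h1, h2):
    # f = h1 h2-bar + h1-bar h2, written with OR and AND
    return OR(AND(h1, NOT(h2)), AND(NOT(h1), h2))

for h1 in (-1, +1):
    for h2 in (-1, +1):
        print(h1, h2, XOR(h1, h2))  # +1 exactly when the inputs disagree
```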
Representing $f$ using OR and AND

$f = h_1 \bar{h}_2 + \bar{h}_1 h_2 = \mathrm{OR}\!\left(\mathrm{AND}(h_1, \bar{h}_2),\ \mathrm{AND}(\bar{h}_1, h_2)\right)$
The multilayer perceptron

[Figure: the XOR network drawn as a 3-layer ‘feedforward’ network built from linear signals $w^T x$; the intermediate layers are hidden layers]
Universal Approximation

Any target function $f$ that can be decomposed into linear separators can be implemented by a 3-layer MLP.
A powerful model

[Figure: Target function vs. approximations using 8 perceptrons and 16 perceptrons]
Red flags for generalization and optimization.
What tradeoff is involved here?
Minimizing $E_{in}$

The combinatorial challenge for the MLP is even greater than that of the perceptron.

$E_{in}$ is not smooth (due to $\mathrm{sign}(\cdot)$), so we cannot use gradient descent.

$\mathrm{sign}(x) \approx \tanh(x) \;\longrightarrow\;$ use gradient descent to minimize $E_{in}$.
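A two-line illustration of the smooth surrogate (values are purely illustrative):

```python
import numpy as np

x = np.linspace(-4, 4, 9)
print(np.sign(x))                      # hard threshold: not differentiable at 0
print(np.round(np.tanh(x), 2))         # smooth approximation, -> +/-1 for |x| large
print(np.round(1 - np.tanh(x)**2, 2))  # its derivative, used by backpropagation
```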
The deep neural network
input layer $l = 0$; hidden layers $0 < l < L$; output layer $l = L$
How the network operates
Weights $w_{ij}^{(l)}$, with

- $1 \le l \le L$ layers
- $0 \le i \le d^{(l-1)}$ inputs
- $1 \le j \le d^{(l)}$ outputs

Each unit outputs

$x_j^{(l)} = \theta\!\left(s_j^{(l)}\right) = \theta\!\left(\sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)}\, x_i^{(l-1)}\right)$

Apply $x$ to the input layer $x_1^{(0)}, \ldots, x_{d^{(0)}}^{(0)}$ and propagate forward to get $x_1^{(L)} = h(x)$.

$\theta(s) = \tanh(s) = \dfrac{e^s - e^{-s}}{e^s + e^{-s}}$
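A NumPy sketch of this forward computation (the weight-matrix layout, with a bias row multiplying $x_0 = 1$, is my own convention; the slides define only the equations above):

```python
import numpy as np

def forward(x, weights):
    """Forward pass: x -> x^(0) -> ... -> x^(L), with theta = tanh.

    weights[l] has shape (d^(l-1) + 1, d^(l)); row 0 multiplies the
    bias coordinate x0 = 1. Returns h(x) plus all x's and s's, which
    the backward pass sketched later reuses.
    """
    xs, ss = [np.concatenate(([1.0], x))], []
    for W in weights:
        s = W.T @ xs[-1]                                # s^(l) = (W^(l))^T x^(l-1)
        ss.append(s)
        xs.append(np.concatenate(([1.0], np.tanh(s))))  # x^(l) = theta(s^(l))
    return xs[-1][1], xs, ss                            # h(x) = x_1^(L)
```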
Today’s Lecture
• Review of gradient descent
• What is a deep neural network?
• How do we train one?
• How do we train one efficiently?
• Tutorial: Image classification using a logistic regression network
How can we efficiently train a deep network?
Gradient descent minimizes

$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} e(h(x_n), y_n)$

by iterative steps along $-\nabla E_{in}$:

$\Delta w = -\eta \nabla E_{in}(w)$

$\nabla E_{in}$ is based on ALL examples $(x_n, y_n)$: ‘batch’ GD.

(For logistic regression, $e(h(x_n), y_n) = \ln(1 + e^{-y_n w^T x_n})$.)
The stochastic aspect
‘Average’ direction:

$\mathbb{E}_n\!\left[-\nabla e(h(x_n), y_n)\right] = -\frac{1}{N} \sum_{n=1}^{N} \nabla e(h(x_n), y_n) = -\nabla E_{in}$

Pick one $(x_n, y_n)$ at a time. Apply GD to $e(h(x_n), y_n)$.

This is stochastic gradient descent (SGD): a randomized version of GD.
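A minimal SGD sketch (the looping scheme and names are my own; the slide specifies only "pick one point, apply GD to its error"):

```python
import numpy as np

def sgd(X, y, grad_e, eta=0.1, epochs=10, seed=0):
    """Stochastic gradient descent: one randomly picked example per update.

    grad_e(w, x, y) returns the gradient of the pointwise error e(h(x), y);
    for logistic regression this is -y * x / (1 + exp(y * (w @ x))).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):      # visit the data in random order
            w -= eta * grad_e(w, X[n], y[n])   # GD step on one example's error
    return w

# Usage with the logistic-regression error from earlier:
# w = sgd(X, y, lambda w, x, y: -y * x / (1 + np.exp(y * (w @ x))))
```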
Benefits of SGD
Randomization helps:

1. cheaper computation
2. randomization
3. simple

Rule of thumb: $\eta = 0.1$ works (adjust empirically, on an exponential scale).
The linear signal
Input $s^{(l)}$ is a linear combination (using the weights) of the outputs $x^{(l-1)}$ of the previous layer:

$s^{(l)} = (W^{(l)})^T x^{(l-1)}$

(recall the linear signal $s = w^T x$)
Forward propagation: Computing $h(x)$
Minimizing $E_{in}$

Using $\theta = \tanh$ makes $E_{in}$ differentiable, so we can use gradient descent (or SGD) $\longrightarrow$ a local minimum.
Gradient descent of $E_{in}$

We need the gradient of the pointwise error with respect to the weights in every layer: $\partial e / \partial W^{(l)}$ for $l = 1, \ldots, L$.
Numerical Approach

Approximate each partial derivative with a finite difference, perturbing one weight at a time: approximate, and inefficient (a separate error evaluation per weight) :-(
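For concreteness, a central-difference sketch of the numerical approach (an illustrative helper, not course code), which makes the inefficiency visible: two error evaluations per weight.

```python
import numpy as np

def numerical_gradient(e, w, eps=1e-6):
    """Finite-difference estimate of de/dw for a scalar error e(w).

    e : callable taking the flattened weight vector, returning the error.
    Cost: 2 * w.size evaluations of e, hence 'inefficient'.
    """
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps                   # perturb one weight at a time
        w_minus[i] -= eps
        g[i] = (e(w_plus) - e(w_minus)) / (2 * eps)
    return g
```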
Algorithmic Approach :-)

$e$ is a function of $s^{(l)}$, and $s^{(l)} = (W^{(l)})^T x^{(l-1)}$, so (chain rule)

$\dfrac{\partial e}{\partial W^{(l)}} = x^{(l-1)} \left(\delta^{(l)}\right)^T, \qquad \delta^{(l)} \equiv \dfrac{\partial e}{\partial s^{(l)}} \ \text{(the ‘sensitivity’)}$
Computing $\delta^{(l)}$ using the chain rule

Multiple applications of the chain rule give a backward recursion:

$\delta^{(l-1)} = \theta'(s^{(l-1)}) \otimes \left[W^{(l)} \delta^{(l)}\right]$

where $\otimes$ is componentwise multiplication and the bias component of $W^{(l)} \delta^{(l)}$ is dropped.
The backpropagation algorithm
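The algorithm itself appeared as a figure on this slide. Below is a sketch that pairs with the forward() function above, using the sensitivity recursion from the previous slide; the squared pointwise error $e = (x_1^{(L)} - y)^2$ and all names are my own illustrative choices.

```python
import numpy as np

def backward(weights, xs, ss, y):
    """Backpropagation: returns grads[l], the gradient of e w.r.t. weights[l].

    xs, ss come from forward(); e = (x^(L) - y)^2 with theta = tanh.
    """
    grads = [None] * len(weights)
    # Output sensitivity: delta^(L) = de/ds^(L) = 2 (x^(L) - y) theta'(s^(L))
    delta = 2.0 * (np.tanh(ss[-1]) - y) * (1.0 - np.tanh(ss[-1]) ** 2)
    for l in range(len(weights) - 1, -1, -1):
        # x^(l-1) (delta^(l))^T in the slide's 1-indexed notation
        grads[l] = np.outer(xs[l], delta)
        if l > 0:
            # delta^(l-1) = theta'(s^(l-1)) (componentwise) [W^(l) delta^(l)],
            # with the bias component of W^(l) delta^(l) dropped
            delta = (1.0 - np.tanh(ss[l - 1]) ** 2) * (weights[l] @ delta)[1:]
    return grads
```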
Algorithm for gradient descent on $E_{in}$

Can do a batch version or a sequential version (SGD).
Digits Data
Today’s Lecture
• Review of gradient descent
• What is a deep neural network?
• How do we train one?
• How do we train one efficiently?
• Tutorial: Image classification using a logistic regression network
Further reading
• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.
• Goodfellow et al. (2016) Deep Learning. https://www.deeplearningbook.org/
• Boyd, S., and Vandenberghe, L. (2018) Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares. http://vmls-book.stanford.edu/
• VanderPlas, J. (2016) Python Data Science Handbook. https://jakevdp.github.io/PythonDataScienceHandbook/