Transcript of Big Data Analytics - Stanford University (infolab.stanford.edu/~echang/BigDat2015/BigDat2015...)
Big Data Analytics: Architectures, Algorithms, and Applications
Part #2: Intro to Deep Learning
Edward Chang 張智威, HTC (prior: Google & U. California)
Simon Wu, HTC (prior: Twitter & Microsoft)
Three Lectures
• Lecture #1: Scalable Big Data Algorithms
– Scalability issues
– Key algorithms with application examples
• Lecture #2: Intro to Deep Learning
– Autoencoder & Sparse Coding
– Graph models: CNN, MRF, & RBM
• Lecture #3: Analytics Platform [by Simon Wu]
– Intro to LAMA platform
– Code lab
1/27/15 Ed Chang @ BigDat 2015
Acknowledging Slide Contributors
• Geoffrey Hinton • Yoshua Bengio • Russ Salakhutdinov • Kai Yu • Yann LeCun • Andrew Ng • Steven Seitz
Lecture #2 Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
Representation?
Knowledge or feature extraction in image processing involves using algorithms to detect and isolate various desired edges or shapes.
Low-level: edge detection, corner detection, ridge detection, or more generally the Scale-Invariant Feature Transform (SIFT)
Curvature: shape information, blob detection
Hough transform: lines, circles/ellipses, arbitrary shapes (Generalized Hough Transform)
Typical Image/Video Representation Based on Domain Knowledge and Human Priors
Template matching (medical imaging)
Flexible methods for 2D, 3D, or 3D+time edge extraction, road detection, MRI, fMRI
Color and texture representations: histograms, various transformations for conducting frequency-domain analysis, e.g., wavelets
Motion: motion detection, e.g., optical flow, global or area based
…Much Related Work on Representation
Key Design Goals for Representation
Design features x that are invariant and selective
• Good invariance
– The same object should have the same features
• Good selectivity (disentanglement)
– Different objects should exhibit different features for telling them apart
Once x has been designed, find label y for x and then learn p(y|x)
Challenges
• Invariance affected by noise
– Environmental factors (e.g., lighting conditions, occlusion)
– Equipment factors (e.g., different camera brands yield different colors and gamma corrections)
– Aliasing (e.g., cars have different models, hence different features)
• Selectivity requires good similarity functions
• Labeled data is tough to acquire
– Learning a robust model requires big data
Remedy #1: Learn ϕ from Data, p(x|ϕ) ≈ p*(x)
• Instead of designing features by hand, learn features ϕ from data
• Data: not just the original data, but variants added to the data
– E.g., adding scaled, rotated, cropped, mirrored, and gamma-adjusted images
• Instead of requiring invariant features as input to a model, let the model cope with invariance
• Then, learn features ϕ for predicting p(x|ϕ) accurately (p(x|ϕ) ≈ p*(x)) in an unsupervised way from data that already covers variant conditions
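The variant-generation step above can be sketched with plain NumPy. The transform set mirrors the slide's bullet (scale, rotate, crop, mirror, gamma adjust), but the specific parameters below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def augment(img, gamma=0.8, crop=4):
    """Generate simple variants of a grayscale image with values in [0, 1].

    Transform set follows the slide: scale, rotate, crop, mirror, gamma.
    Parameters (2x scale, 90-degree rotation, 4-pixel crop, gamma=0.8)
    are illustrative, not from the lecture.
    """
    h, w = img.shape
    scaled = img[::2, ::2]                      # nearest-neighbor 2x downscale
    rotated = np.rot90(img)                     # stand-in for arbitrary rotation
    cropped = img[crop:h - crop, crop:w - crop] # central crop (shifted view)
    mirrored = img[:, ::-1]                     # left-right reflection
    gammaed = np.clip(img, 0, 1) ** gamma       # simulates camera gamma curves
    return {"scaled": scaled, "rotated": rotated,
            "cropped": cropped, "mirrored": mirrored, "gamma": gammaed}

img = np.random.rand(32, 32)
variants = augment(img)   # five extra training images per original
```

Each variant would be added to the unlabeled training pool alongside the original image.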
Remedy #2: Deep Model
• Learn representation in a hierarchical way [T. Serre, T. Poggio; MIT 2005]
Lecture Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to Construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
Multiple-Layer Networks: Neural Network (NN) Model
An elementary neuron with R inputs is shown below. Each input is weighted with an appropriate w. The sum of the weighted inputs and the bias forms the input to the transfer function f. Neurons can use any differentiable transfer function f to generate their output.
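A minimal sketch of the elementary neuron just described: output = f(w·x + b). The logistic choice of f anticipates the logsig function discussed next; the function names here are mine, not the lecture's.

```python
import math

def neuron(x, w, b, f=lambda e: 1.0 / (1.0 + math.exp(-e))):
    """Elementary neuron: weighted sum of R inputs plus bias, passed through f."""
    e = sum(wi * xi for wi, xi in zip(w, x)) + b  # net input to the transfer function
    return f(e)

y = neuron([1.0, 2.0], [0.5, -0.25], 0.1)  # a single activation in (0, 1)
```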
NN Model: Transfer Functions (Activation Functions)
Multilayer networks often use the log-sigmoid transfer function logsig. The function logsig generates outputs between 0 and 1 as the neuron's net input goes from negative to positive infinity.
NN Model: Feedforward Network
A single-layer network of S logsig neurons having R inputs is shown below in full detail on the left and with a layer diagram on the right.
Example: Four-Layer NN
Input Layer → Hidden Layer #1 → Hidden Layer #2 → Output Layer (y)
NN Model: Learning Algorithm
The following slides describe the learning process of a multi-layer neural network employing the backpropagation algorithm. To illustrate this process, the three-layer neural network with two inputs and one output shown in the picture below is used:
Learning Algorithm: Backpropagation
Each neuron is composed of two units. The first unit adds the products of weight coefficients and input signals. The second unit realizes a nonlinear function, called the neuron transfer (activation) function. Signal e is the adder's output signal, and y = f(e) is the output signal of the nonlinear element. Signal y is also the output signal of the neuron.
Feed Forward
The pictures below illustrate how the signal feeds forward through the network. Symbols w(xm)n represent the weights of connections between network input xm and neuron n in the input layer. Symbols yn represent the output signal of neuron n.
Propagation of signals through the hidden layer: symbols wmn represent the weights of connections between the output of neuron m and the input of neuron n in the next layer.
Learning Algorithm: Forward Pass
Propagation of signals through the output layer.
Learning Algorithm: Backpropagation
To teach the neural network, we need a training data set. The training data set consists of input signals (x1 and x2) assigned with corresponding targets (desired outputs) z. Network training is an iterative process. In each iteration, the weight coefficients of nodes are modified using new data from the training data set. The modification is calculated using the algorithm described below. Each teaching step starts with forcing both input signals from the training set. After this stage, we can determine the output signal values for each neuron in each network layer.
In the next step, the output signal of the network y is compared with the desired output value (the target z), found in the training data set. The difference is called the error signal δ of the output-layer neuron.
The idea is to propagate the error signal δ (computed in a single teaching step) back to all neurons whose output signals were inputs for the neuron in question.
The weight coefficients wmn used to propagate errors back are equal to those used during computing the output value; only the direction of data flow is changed (signals are propagated from outputs to inputs, one after the other). This technique is used for all network layers. If propagated errors come from several neurons, they are added. The illustration is below:
Learning Algorithm: Backpropagation
When the error signal for each neuron is computed, the weight coefficients of each neuron's input nodes may be modified. In the formulas below, df(e)/de represents the derivative of the activation function of the neuron whose weights are modified.
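The feed-forward and error-backpropagation steps described above can be sketched for a tiny network with two inputs and one output, as in the slides' example. The hidden-layer size, learning rate η, and iteration count below are illustrative choices of mine.

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(0)
x = rng.normal(size=2)          # two input signals x1, x2
z = np.array([1.0])             # desired output (target)
W1 = rng.normal(size=(3, 2))    # input -> hidden weights
W2 = rng.normal(size=(1, 3))    # hidden -> output weights
eta = 0.5                       # learning rate (illustrative)

for _ in range(500):
    # Feed forward: e is each adder's output, y = f(e).
    e1 = W1 @ x;  y1 = sigmoid(e1)
    e2 = W2 @ y1; y  = sigmoid(e2)
    # Error signal of the output-layer neuron, times df(e)/de = f(e)(1 - f(e)).
    delta2 = (z - y) * y * (1 - y)
    # Propagate delta back through the same weights W2.
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)
    # Weight modification: w += eta * delta * input signal.
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
```

After training on this single example, the network output y approaches the target z.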
Sigmoid function f(e) and its derivative f′(e)
f(e) = 1 / (1 + e^(−βe)), where β is the parameter for the slope.
Hence
f′(e) = df(e)/de = β e^(−βe) / (1 + e^(−βe))² = β f(e) (1 − f(e))
For simplicity, take the slope parameter β = 1, so f′(e) = f(e) (1 − f(e)).
http://link.springer.com/chapter/10.1007%2F3-540-59497-3_175#page-1
http://mathworld.wolfram.com/SigmoidFunction.html
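A quick finite-difference check of the identity f′(e) = β f(e)(1 − f(e)); this is my own sanity test, not part of the slides.

```python
import math

def f(e, beta=1.0):
    """Sigmoid f(e) = 1 / (1 + exp(-beta * e))."""
    return 1.0 / (1.0 + math.exp(-beta * e))

def fprime_closed(e, beta=1.0):
    """Closed form from the slide: beta * f(e) * (1 - f(e))."""
    return beta * f(e, beta) * (1.0 - f(e, beta))

def fprime_numeric(e, beta=1.0, h=1e-6):
    """Central finite difference of f."""
    return (f(e + h, beta) - f(e - h, beta)) / (2 * h)

for e in (-2.0, 0.0, 1.5):
    assert abs(fprime_closed(e) - fprime_numeric(e)) < 1e-6
    assert abs(fprime_closed(e, 2.0) - fprime_numeric(e, 2.0)) < 1e-6
```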
Autoencoder: NN for Unsupervised Compression
h_{w,b}(x) ≈ x
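A minimal autoencoder matching h_{w,b}(x) ≈ x: encode into fewer hidden units than inputs, decode back, and train by backpropagating the reconstruction error. The layer sizes, linear decoder, and training details below are illustrative assumptions, not the lecture's specification.

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))            # 200 samples, 10 features each
H = 4                                     # bottleneck: compress 10 -> 4 units

W1 = rng.normal(scale=0.1, size=(H, 10)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(10, H)); b2 = np.zeros(10)
eta = 0.05

losses = []
for _ in range(300):
    A = sigmoid(X @ W1.T + b1)            # encode: hidden activations
    Xh = A @ W2.T + b2                    # decode: linear reconstruction
    err = Xh - X                          # h_{w,b}(x) - x
    losses.append(float((err ** 2).mean()))
    # Backpropagate the squared reconstruction error.
    dW2 = err.T @ A / len(X); db2 = err.mean(0)
    dA = (err @ W2) * A * (1 - A)
    dW1 = dA.T @ X / len(X); db1 = dA.mean(0)
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1
```

The reconstruction error shrinks as the hidden layer learns a compressed code for x.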
Parameter Learning
• 10×10 images with 100 pixels
• x ∈ R^100: a huge space of possible configurations
• H hidden units
– H = 100? H = 50? (cf. PCA)
• Too computationally intensive to learn w
Learning Algorithm
• Suppose ϕ (or h) is a set of hidden variables
• Model image x with k independent hidden features ϕ_i plus additive noise ν:
x = Σ_{i=1}^{k} a_i ϕ_i + ν(x)
• The goal is to find hidden features such that the posterior P(x|ϕ) is as close as possible to P*(x), i.e., to minimize the KL divergence between the two
…Learning Algorithm
• Minimize the KL divergence between the two distributions:
D(P*(x) || P(x|ϕ)) = ∫ P*(x) log [P*(x) / P(x|ϕ)] dx
• Since P*(x) is constant across choices of ϕ, minimizing KL ⇒ maximizing the log-likelihood:
ϕ* = argmax_ϕ log P(x|ϕ)
• This reduces (revisited later) to:
ϕ*, a* = argmin_{ϕ,a} Σ_{j=1}^{m} ‖x^(j) − Σ_{i=1}^{k} a_i^(j) ϕ_i‖² + λ Σ_{i=1}^{k} S(a_i^(j))
Lecture Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to Construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
General Priors in Real-World Data [Y. Bengio, et al., 2014]
• A hierarchical organization of factors
• Smoothness
– x ≈ y ⇒ f(x) ≈ f(y)
– Nearest-neighbor assumption
• Local manifold
– Clustered, low degrees of freedom (e.g., PCA)
• Distributed representations
– Feature reuse; abstract & invariant representations
– Dynamic and partial
Smoothness: Nearest Neighbor Model
• Every learning model is a variant of the nearest-neighbor model
• Similar objects should reside in the neighborhood of a feature subspace
Local Low-Dimensional Manifolds
K. Yu and A. Ng, Tutorial: Feature Learning for Image Classification, Part 3: Image Classification using Sparse Coding: Advanced Topics, ECCV 2010.
Data manifold: locally linear
Smooth, Local, Sparse
K. Yu and A. Ng, Tutorial: Feature Learning for Image Classification, Part 3: Image Classification using Sparse Coding: Advanced Topics, ECCV 2010.
Data, basis, locally linear data manifold: each datum can be represented as a sparse combination of its neighboring anchors.
Sparse Coding [Olshausen & Field, 1996]
• Find a representation of data, unsupervised
– Traditionally PCA (too contrived; why?)
• Find over-complete bases in an efficient way
• x ≈ aϕ, where x ∈ R^n and ϕ ∈ R^m, m > n
• The coefficients a cannot be uniquely determined
• Thus, impose sparsity on the coefficients
• k-sparsity
Sparse Coding
[Diagram: x (N×1) ≈ ϕ (a fixed N×M dictionary) × a (M×1, K-sparse)]
What is Sparse Coding
min_{a,ϕ} Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
Sparse coding (Olshausen & Field, 1996) was originally developed to explain early visual processing in the brain (edge detection in V1).
Training: given a set of random patches x, learn a dictionary of bases [Φ1, Φ2, …]
Coding: for a data vector x, solve LASSO to find the sparse coefficient vector a
Sparse Coding: Training Time
Input: images x1, x2, …, xm (each in R^d)
Learn: dictionary of bases ϕ1, ϕ2, …, ϕk (also in R^d)
min_{a,ϕ} Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
Alternating optimization:
1. Fix the dictionary ϕ1, ϕ2, …, ϕk, optimize a (a standard LASSO problem)
2. Fix the activations a, optimize the dictionary ϕ1, ϕ2, …, ϕk (a convex QP problem)
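The alternating scheme above can be sketched in NumPy. An ISTA loop stands in for the LASSO solver in step 1, and a regularized least-squares refit with column normalization stands in for the QP in step 2; these solver choices and the hyperparameters are my assumptions, not the lecture's.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(X, Phi, lam, n_iter=100):
    """Step 1 (LASSO): fix the dictionary Phi (d x k), solve for codes A (k x m)."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    A = np.zeros((Phi.shape[1], X.shape[1]))
    for _ in range(n_iter):                   # ISTA iterations
        A = soft_threshold(A - Phi.T @ (Phi @ A - X) / L, lam / L)
    return A

def update_dictionary(X, A, eps=1e-8):
    """Step 2: fix the codes A, refit Phi by least squares, renormalize columns."""
    Phi = X @ A.T @ np.linalg.pinv(A @ A.T + eps * np.eye(len(A)))
    return Phi / np.maximum(np.linalg.norm(Phi, axis=0), eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 50))                 # 50 patches in R^16
Phi = rng.normal(size=(16, 32))               # over-complete: 32 > 16 bases
Phi /= np.linalg.norm(Phi, axis=0)
for _ in range(10):                           # alternate the two steps
    A = sparse_code(X, Phi, lam=0.5)
    Phi = update_dictionary(X, A)
```

The soft-thresholding step is what produces exact zeros in the codes, i.e., the sparsity the L1 penalty asks for.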
Sparse Coding: Testing Time
Input: an unseen image patch x_i (in R^d) and the previously learned ϕ_i's
Output: the representation [a_{i,1}, a_{i,2}, …, a_{i,k}] of image patch x_i
x_i ≈ 0.8 * … + 0.3 * … + 0.5 * … (basis images not shown)
Represent x_i as: a_i = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, …]
Coding solves the same LASSO objective with the dictionary held fixed:
min_a Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
Justifications & Examples
• Probabilistic interpretation
• Human visual cortex
– Not enforcing orthogonal bases like PCA
– Over-completeness preserves more features (scales, orientations)
Revisit Autoencoder's Probabilistic Interpretation
• Suppose ϕ (or h) is a set of hidden variables
• Model image x with k independent hidden features ϕ_i plus additive noise ν:
x = Σ_{i=1}^{k} a_i ϕ_i + ν(x)
• The goal is to find hidden features such that P(x|ϕ) is as close as possible to P*(x), i.e., to minimize the KL divergence between the two
…Probabilistic Interpretation
• Minimize the KL divergence between the two distributions:
D(P*(x) || P(x|ϕ)) = ∫ P*(x) log [P*(x) / P(x|ϕ)] dx
• Since P*(x) is constant across choices of ϕ, maximize the log-likelihood:
ϕ* = argmax_ϕ log P(x|ϕ)
…Probabilistic Interpretation
• We need the two terms P(x|a, ϕ) and P(a) because
P(x|ϕ) = ∫ P(x|a, ϕ) P(a) da
• Assume the white noise ν is Gaussian with variance σ²:
P(x|a, ϕ) = (1/Z) exp( −‖x − Σ_{i=1}^{k} a_i ϕ_i‖² / 2σ² )
• To determine P(x|ϕ), we need the prior P(a). Assume the source features are independent:
P(a) = Π_{i=1}^{k} p(a_i)
…Probabilistic Interpretation
• Add the sparsity assumption: every image is a combination of few features, so we would like the probability distribution of a_i to be peaked at zero with high kurtosis; S(a_i) controls the shape:
P(a_i) = (1/Z) exp(−β S(a_i))
P(x|ϕ) = ∫ P(x|a, ϕ) P(a) da,  P(a) = Π_{i=1}^{k} p(a_i)
…Probabilistic Interpretation
• Over all input data, the problem reduces to:
ϕ* = argmax_ϕ log P(x|ϕ)
Max Σ_{j=1}^{m} log ∫ P(x|a, ϕ) P(a) da
= Max Σ_{j=1}^{m} log ∫ exp( −‖x − Σ_i a_i ϕ_i‖² / 2σ² ) Π_i exp(−β S(a_i)) da
= Max Σ_{j=1}^{m} log ∫ exp( −‖x − Σ_i a_i ϕ_i‖² / 2σ² − β Σ_i S(a_i) ) da
→ Min Σ_{j=1}^{m} ‖x^(j) − Σ_{i=1}^{k} a_i^(j) ϕ_i‖² + λ Σ_{i=1}^{k} S(a_i^(j))
…Probabilistic Interpretation
• Maximizing the log-likelihood is equivalent to minimizing the energy function:
ϕ*, a* = argmin_{ϕ,a} Σ_{j=1}^{m} ‖x^(j) − Σ_{i=1}^{k} a_i^(j) ϕ_i‖² + λ Σ_{i=1}^{k} S(a_i^(j))
• The choices of S(·), L1 or log penalty, correspond to the use of the Laplacian and the Cauchy prior, respectively:
P(a_i) ∝ exp(−β |a_i|)   (Laplacian)
P(a_i) ∝ β / (1 + a_i²)   (Cauchy)
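The prior-to-penalty correspondence follows by taking negative logarithms; a one-line check of my own, in the slides' notation:

```latex
-\log P(a_i) = \beta\,\lvert a_i\rvert + \text{const}
\;\Rightarrow\; S(a_i) = \lvert a_i\rvert \quad (\text{L1 penalty}),
\qquad
-\log P(a_i) = \log\bigl(1 + a_i^2\bigr) + \text{const}
\;\Rightarrow\; S(a_i) = \log\bigl(1 + a_i^2\bigr) \quad (\text{log penalty}).
```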
Justifications & Examples
• Probabilistic interpretation
• Human visual cortex
– Not enforcing orthogonal bases like PCA
– Over-completeness preserves more features (scales, orientations)
Feature Invariance
• The human visual system works remarkably well
• "Mental" model (T. Serre, T. Poggio; MIT 2005)
– Ventral visual pathway
– Deep learning
Visual Pathway [Hubel & Wiesel, 1968]
Primary Visual Cortex (V1)
Extrastriate Cortex
Feedforward Path of the Ventral Stream
• Invariance (overcomplete)
– V1: starting with scale/position/orientation invariance over a restricted range
– Then invariance to viewpoints and other transformations
• Multi-layer, multi-area (deep)
– V2 and V3 (shape): increasing complexity of the optimal stimulus
• Feedforward
– First 150 milliseconds of perception
– No color information (in V4)
– Without feedback
Six Steps of HMAX [T. Serre, T. Poggio; MIT 2005]
Multi-layer Visual Pathway
• Edge detection, multi-scale, multi-direction (on/off, simple)
– Using multi-scale, multi-direction Gabor filters
• Edge pooling (max, invariance)
– Keep "strong" features
• Unsupervised clustering (or)
– Clustering edges into patches
V1-Like Bases
Multi-layer Visual Pathway
• Part detection (on/off, simple)
– Find matching patches in photos
• Part pooling (max, invariance)
– Identify useful patches/parts
• Supervised learning
– Object ← parts
Edges and Parts
Six Steps of HMAX [T. Serre, T. Poggio; MIT 2005]
• Edge detection, multi-scale/direction (on/off, simple)
– Using multi-scale, multi-orientation Gabor filters
• Edge pooling (max, invariance)
– Keep "strong" features
• Unsupervised clustering (or)
– Clustering edges into patches
• Part detection (on/off, simple)
– Find matching patches in photos
• Part pooling (max, invariance)
– Identify useful patches/parts
• Supervised learning
– Object ← parts
Revisit Challenges of Representation Learning
• Invariance affected by noise
– Environmental factors (e.g., lighting conditions, occlusion)
– Equipment factors (e.g., different camera brands yield different colors and gamma corrections)
– Aliasing (e.g., cars have different models, hence different features)
• Labeled data is tough to acquire
– Robust models require big data
• Selectivity requires good similarity functions
Lecture Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to Construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
Example of Sparse Models
• Because the 2nd and 4th elements of w are non-zero, these are the two selected features in x
• Globally-aligned sparse representation
x1 [ | | | | | | ]
x2 [ | | | | | | ]
xm [ | | | | | | ]
…
x3 [ | | | | | | ]
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
…
[ 0 | 0 | 0 0 ]
f(x) = <w,x>, where w=[0, 0.2, 0, 0.1, 0, 0]
Example of Sparse Activations
• Different x have different dimensions activated
• Locally-shared sparse representation: similar x's tend to have similar non-zero dimensions, but not all
a1 [ 0 | | | 0 … 0 ]
a2 [ | | | 0 0 … 0 ]
am [ 0 0 0 | | … 0 ]
…
a3 [ | 0 | | 0 … 0 ]
x1
x2 x3
xm
Example of Sparse Activations
• Preserving manifold structure (i.e., clusters, manifolds)
a1 [ | | | 0 0 … 0 ] a2 [ 0 | | | 0 … 0 ]
am [ 0 0 0 0 | … 0 ]
…
a3 [ 0 0 | | | … 0 ]
x1 x2 x3
xm
Similarity Theories
• Objects are similar in all respects (Richardson, 1928)
• Objects are similar in some respects (Tversky, 1977)
• Similarity is a process of determining respects, rather than using predefined respects (Goldstone, 1994)
Similarity Theories
• Objects are similar in all or some respects
• Minkowski function
– D = (Σ_{i=1..M} (p_i − q_i)^n)^{1/n}
• Weighted Minkowski function
– D = (Σ_{i=1..M} w_i (p_i − q_i)^n)^{1/n}
• The same w is imposed on all pairs of objects p and q
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
…
[ 0 | 0 | 0 0 ]
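The two distance functions above can be written out directly. Note that I take absolute differences, which matches the usual Minkowski definition and keeps odd exponents n well-behaved; that absolute value is my addition to the slide's formula.

```python
def minkowski(p, q, n=2):
    """D = (sum_i |p_i - q_i|^n)^(1/n); n = 2 gives Euclidean distance."""
    return sum(abs(pi - qi) ** n for pi, qi in zip(p, q)) ** (1.0 / n)

def weighted_minkowski(p, q, w, n=2):
    """Same, but dimension i carries a fixed weight w_i applied to all pairs."""
    return sum(wi * abs(pi - qi) ** n
               for wi, pi, qi in zip(w, p, q)) ** (1.0 / n)

d = minkowski([0, 0], [3, 4])  # the classic 3-4-5 Euclidean case
```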
DPF: Dynamic Partial Function [B. Li, E. Chang, et al., MM Systems 2013]
• Similarity is a process of determining respects, rather than using predefined respects (Goldstone, 1994)
a1 [ 0 | | | 0 … 0 ]
a2 [ | | | 0 0 … 0 ]
am [ 0 0 0 | | … 0 ]
…
a3 [ | 0 | | 0 … 0 ]
a1 [ | | | 0 0 … 0 ] a2 [ 0 | | | 0 … 0 ]
am [ 0 0 0 0 | … 0 ]
…
a3 [ 0 0 | | | … 0 ]
[Figures: Average Distance vs. Feature Number (1–144) between original images and their variants under GIF compression, scaling up/down, cropping, and rotation.]
DPF: Dynamic Partial Function [B. Li, E. Chang, et al., MM Systems 2013]
• Which place is similar to Kyoto?
• Partial
• Dynamic
• Dynamic Partial Function
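A sketch of the DPF idea as I read it from these slides: instead of weighting all dimensions with one fixed w, each comparison dynamically keeps only the m dimensions on which the two objects differ least, i.e., the "respects" are determined per pair. The exact formulation in the cited paper may differ; this is an illustrative reconstruction.

```python
def dpf(p, q, m, r=2):
    """Dynamic Partial Function: a Minkowski-style distance computed over
    only the m smallest per-dimension differences, chosen anew for each
    pair (partial: m < M dimensions; dynamic: the subset depends on p, q)."""
    diffs = sorted(abs(pi - qi) for pi, qi in zip(p, q))
    return sum(d ** r for d in diffs[:m]) ** (1.0 / r)

# The selected dimensions change with the pair being compared:
d1 = dpf([1, 0, 9, 0], [1, 0, 0, 0], m=3)  # the one large mismatch is ignored
d2 = dpf([5, 5, 5, 5], [5, 5, 5, 5], m=3)  # identical objects
```

With m = M (all dimensions kept) this reduces to the plain Minkowski function.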
Precision/Recall
Partial, Dynamic: Low-Dimensional Manifolds
K. Yu and A. Ng, Tutorial: Feature Learning for Image Classification, Part 3: Image Classification using Sparse Coding: Advanced Topics, ECCV 2010.
Data manifold: locally linear
Part #1 Summary
• Overcomplete representation
• Sparse weighting vector a for x
• Autoencoders & sparse coding
– Equivalent models: one with implicit and one with explicit f(x)
Autoencoders
– also involve activation and reconstruction
– but have an explicit f(x), e.g., a sigmoid function
– do not necessarily enforce sparsity on a
– but if sparsity is put on a, often get improved results [e.g., sparse RBM, Lee et al., NIPS 2008]
[Diagram: x → f(x) (encoding) → a → g(a) (decoding) → x′]
Sparse Coding
min_{a,ϕ} Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
– a is sparse
– a is often of higher dimension than x
– the activation a = f(x) is a nonlinear implicit function of x
– the reconstruction x′ = g(a) is linear & explicit
[Diagram: x → f(x) (encoding) → a → g(a) (decoding) → x′]
Hierarchical Sparse Coding
Sparse Coding → Pooling → Sparse Coding → Pooling
Learning from unlabeled data
Yu, Lin, & Lafferty, CVPR 11; Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 11
DEEP MODELS CNN, MRF & RBM
Recap: NN
• Other network architectures: how the different neurons are connected to each other
• Layer 1 → Layer 2 → Layer 3 → Layer 4: in a traditional NN, neurons in a layer are fully connected to all neurons in the next layer.
CNN: NN Considers Sparse Coding
The Replicated Feature Approach (Hinton: the dominant approach for neural networks)
• Use many different copies of the same feature detector with different positions
– Could also replicate across scale and orientation (tricky and expensive)
– Replication greatly reduces the number of free parameters to be learned
• Use several different feature types, each with its own map of replicated detectors
– Allows each patch of image to be represented in several ways → overcomplete
The red connections all have the same weight.
CNN Architecture: Convolutional Layers
Spatially-local correlation
– Spatial information is encoded in the network
– Sparse connectivity
[Diagram: Layer 1 → Layer 2, partial convolutional layer]
Pooling the Outputs of Replicated Feature Detectors
Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.
– This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.
– Taking the maximum of the four (like HMAX) works slightly better (G. Hinton).
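Both pooling variants just described, averaging and taking the maximum of four neighboring detectors, amount to pooling non-overlapping 2×2 windows of a feature map:

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """Pool non-overlapping 2x2 windows of a feature map (H and W even)."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)  # group each 2x2 neighborhood
    if mode == "max":
        return blocks.max(axis=(1, 3))           # HMAX-style max pooling
    return blocks.mean(axis=(1, 3))              # average pooling

fmap = np.array([[1., 2., 0., 0.],
                 [3., 4., 0., 1.],
                 [0., 0., 5., 6.],
                 [0., 2., 7., 8.]])
pool2x2(fmap, "max")    # [[4., 1.], [2., 8.]]
pool2x2(fmap, "mean")   # [[2.5, 0.25], [0.5, 6.5]]
```

Either way, the next layer receives one output per 2×2 neighborhood, a quarter as many inputs.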
Convolutional Networks [LeCun 97]
• Convolution (feature detection)
• Sub-sampling (multi-scale)
• Perform C & S iteratively to form a deep-learning network
• Learn weights from data
• Location information (where an object is) is lost
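One C + S stage can be sketched as follows. The 3×3 vertical-edge kernel and stride-2 sub-sampling are illustrative choices of mine, not LeNet's actual parameters, and the "convolution" is implemented as correlation, as is common in CNN practice.

```python
import numpy as np

def convolve2d(img, kernel):
    """Valid-mode 2D sliding-window correlation (CNN-style 'convolution')."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

def subsample(fmap, s=2):
    """Sub-sampling: keep every s-th activation (the multi-scale step)."""
    return fmap[::s, ::s]

img = np.zeros((8, 8)); img[:, 4:] = 1.0   # image with one vertical edge
kernel = np.array([[-1., 0., 1.]] * 3)      # responds to vertical edges
fmap = subsample(convolve2d(img, kernel))   # one convolution + sub-sampling stage
```

Stacking such stages, with learned kernels, is exactly the iterative C & S structure listed above.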
The 82 Errors Made by LeNet5
Notice that most of the errors are cases that people find quite easy. The human error rate is probably 20 to 30 errors, but nobody has had the patience to measure it.
Hinton, NIPS 2013
Ciresan's Brute-Force Approach
• LeNet uses knowledge about the invariances in its design:
– the local connectivity
– the weight-sharing
– the pooling
• Achieves about 80 errors
– This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008)
• Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:
– For each training image, they produce many new training examples by applying many different transformations.
– They can then train a large, deep, dumb net on a GPU without much overfitting.
• Improves to 35 errors
The Errors Made by the Ciresan et al. Net
The top printed digit is the right answer. The bottom two printed digits are the network's best two guesses. The right answer is almost always in the top 2 guesses. With model averaging they can now get about 25 errors.
From Hand-Written Digits to 3-D Objects
• Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:
– A hundred times as many classes (1,000 vs. 10)
– A hundred times as many pixels (256 × 256 color vs. 28 × 28 gray)
– Two-dimensional images of three-dimensional scenes
– Cluttered scenes requiring segmentation
– Multiple objects in each image
• Will the same type of CNN work?
The ILSVRC-2012 Competition on ImageNet
• The dataset has 1.2 million high-resolution training images.
• The classification task:
– Get the "correct" class in your top 5 bets. There are 1,000 classes.
• The localization task:
– For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
Examples
Error Rates in the ILSVRC-2012 Competition
(classification / classification & localization)
• University of Tokyo: 26.1% / 53.6%
• Oxford University Computer Vision Group: 26.9% / 50.0%
• INRIA (French national research institute in CS) + XRCE (Xerox Research Centre Europe): 27.0%
• University of Amsterdam: 29.5%
• University of Toronto (Alex Krizhevsky): 16.4% / 34.1%
A Neural Network for ImageNet
• Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its architecture was:
– 7 hidden layers, not counting some max-pooling layers
– The early layers were convolutional
– The last two layers were globally connected
• The activation functions were:
– Rectified linear units in every hidden layer, f(x) = max(0, x). These train much faster and are more expressive than logistic units.
– Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Tricks That Significantly Improve Generalization
• Bagging: train on random 224×224 patches from the 256×256 images to get more data. Also use left-right reflections of the images. At test time, combine the opinions from ten different patches: the four 224×224 corner patches plus the central 224×224 patch, plus the reflections of those five patches.
• Dropout (sparsification): use "dropout" to regularize the weights in the globally connected layers (which contain most of the parameters). Dropout means that half of the hidden units in a layer are randomly removed for each training example. This stops hidden units from relying too much on other hidden units.
Dropout: An efficient way to average many large neural nets (http://arxiv.org/abs/1207.0580)

• Consider a neural net with one hidden layer of H units.
• Each time we present a training example, we randomly omit each hidden unit with probability 0.5.
• So we are randomly sampling from 2^H different architectures. All architectures share weights.
Dropout as a form of model averaging: an extreme Bagging

• We sample from 2^H models, so only a few of the models ever get trained, and each gets at most one training example.
  – This is as extreme as Bagging can get.
• The sharing of the weights means that every model is very strongly regularized.
  – It's a much better regularizer than L2 or L1 penalties that pull the weights towards zero.
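A minimal sketch of the mechanism (the function name and the activation-scaling variant are mine; the paper halves the outgoing weights at test time, which is equivalent):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5, train=True):
    """Dropout on a hidden layer h.
    Training: zero each unit independently with probability p_drop.
    Test: keep all units but scale by (1 - p_drop), which approximates
    averaging the predictions of all 2^H thinned networks."""
    if train:
        mask = rng.random(h.shape) >= p_drop
        return h * mask
    return h * (1.0 - p_drop)

h = np.ones(8)
print(dropout_forward(h, train=True))   # roughly half the units zeroed
print(dropout_forward(h, train=False))  # all units scaled to 0.5
```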
DEEP MODELS CNN, MRF & RBM
Russ S. KDD 04 Tutorial
Directed Graphs: Bayesian Networks

• General factorization: p(x) = ∏_k p(x_k | pa_k), where pa_k denotes the parents of x_k.
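The general factorization can be made concrete on a tiny chain A → B → C of binary variables (the probability tables below are made-up illustrative numbers):

```python
# Joint p(a, b, c) = p(a) * p(b | a) * p(c | b), following the
# general factorization p(x) = prod_k p(x_k | pa_k).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    # Each factor conditions only on the node's parents.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The joint sums to 1 over all 2^3 configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0
```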
"Explaining Away"
• Causal inference for directed graphs has one subtlety.
• Illustration: the pixel colour in an image depends on both the surface colour and the lighting colour; once the image colour is observed, its two causes become dependent, and one cause being likely "explains away" the other.
[Figure: image colour node with parent nodes surface colour and lighting colour]
C. Bishop, ECCV tutorial
Shortcomings of Back-propagation

• It requires labeled training data.
  – Almost all data is unlabeled.
• The learning time does not scale well.
  – It is very slow in networks with multiple hidden layers.
  – Backward pass: the signal dE/dy diminishes as the number of layers increases.
• It can get stuck in poor local optima.
  – These are often quite good, but for deep nets they are far from optimal.
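The diminishing backward signal can be seen numerically: with logistic units, each layer multiplies dE/dy by a local derivative of at most 0.25, so the signal shrinks geometrically with depth. A small illustrative sketch (the best-case setup is mine):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Backpropagate a unit error signal through 10 logistic layers with
# pre-activation 0.0, where the derivative is at its maximum (0.25).
signal = 1.0
for layer in range(10):
    s = logistic(0.0)
    signal *= s * (1.0 - s)  # multiply by the local derivative
    print(f"after layer {layer + 1}: {signal:.2e}")

# Even in this best case the signal decays as 0.25^n, so after 10
# layers it is already ~1e-6: early layers learn extremely slowly.
```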
MRF & RBM: Directed → Undirected Graphs
Markov Random Field (MRF) Components

• A set of sites or pixels P = {1, …, m}: each pixel is a site.
• Each pixel p has a neighborhood, N = {N_p | p ∈ P}.
• A set of random variables (a random field), one per pixel: X = {X_p | p ∈ P}, denoting the label at each pixel.
• Each random variable takes a value x_p from the set of labels L = {l_1, …, l_k}.
• A joint event {X_1 = x_1, …, X_m = x_m} is called a configuration, abbreviated X = x.
• The joint probability of such a configuration is p(X = x), or p(x) for short.
• There are k^m possible configurations.

From slides by S. Seitz, University of Washington
Markov Random Field: Hammersley–Clifford Theorem

• The joint distribution p(x) is a product of non-negative functions over the cliques (neighbourhoods) of the graph:

  p(x) = (1/Z) ∏_{c ∈ C} ψ_c(x_c)

• where the ψ_c are the clique potentials, and Z is the normalization constant (partition function) that makes the distribution sum to 1.
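The clique-potential factorization can be checked by hand on a tiny 3-pixel chain with binary labels, where the cliques are the two neighbouring pairs (the smoothing potential below is a made-up example):

```python
import itertools

# Pairwise clique potential: favour neighbouring pixels with equal labels.
def psi(xi, xj):
    return 2.0 if xi == xj else 1.0

labels = (0, 1)
configs = list(itertools.product(labels, repeat=3))  # k^m = 2^3 = 8

# Z sums the unnormalized clique products over all configurations.
Z = sum(psi(x[0], x[1]) * psi(x[1], x[2]) for x in configs)

def p(x):
    return psi(x[0], x[1]) * psi(x[1], x[2]) / Z

print(Z)             # 18.0
print(p((0, 0, 0)))  # all-equal configurations are the most likely
```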
Equilibrium Interpretation

• E_{P_θ}[x_i x_j]: expected value of the product of states at thermal equilibrium when nothing is clamped.
• E_{P_data}[x_i x_j]: expected value of the product of states at thermal equilibrium when the training data is clamped on the visible units.
• The log-likelihood gradient is their difference:

  ∂L(θ)/∂θ_ij = E_{P_data}[x_i x_j] − E_{P_θ}[x_i x_j]
Model Learning (Similar to MRF)

• The model expectation E_{P_θ}[v_i h_j] is expensive to compute: there is an exponential number of configurations (over all possible images), so use MCMC.
• The data expectation E_{P_data}[v_i h_j] is simple to compute.
• The log-likelihood gradient:

  ∂L(θ)/∂θ_ij = E_{P_data}[v_i h_j] − E_{P_θ}[v_i h_j]
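The slide only says "use MCMC" for the model term; a common practical shortcut (not spelled out here) is one-step contrastive divergence, CD-1. A minimal sketch for a binary RBM, with made-up sizes and biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v_data, W):
    """One step of contrastive divergence for a binary RBM.
    Approximates E_data[v h] - E_model[v h] for the weight update."""
    # Positive phase: training data clamped on the visible units.
    h_prob = sigmoid(v_data @ W)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step stands in for the equilibrium term.
    v_recon = sigmoid(h_sample @ W.T)
    h_recon = sigmoid(v_recon @ W)
    positive = v_data.T @ h_prob
    negative = v_recon.T @ h_recon
    return (positive - negative) / v_data.shape[0]

# Toy example: 4 visible units, 3 hidden units, batch of 5.
W = 0.01 * rng.standard_normal((4, 3))
v = (rng.random((5, 4)) < 0.5).astype(float)
grad = cd1_gradient(v, W)
print(grad.shape)  # (4, 3); used as W += learning_rate * grad
```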
Latest ImageNet Competition Update
Key References

• Deep Learning video lectures: http://videolectures.net/Top/Computer_Science/Machine_Learning/Deep_Learning/
• A Data-Driven Study on Image Feature Extraction and Fusion, Zhiyu Wang, Fangtao Li, Edward Y. Chang, and Shiqiang Yang, Google Technical Report, April 2012.
• Foundations of Large-Scale Multimedia Information Management and Retrieval, E. Y. Chang, Springer, 2011.
• Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng, In Proceedings of the Twenty-Sixth International Conference on Machine Learning, 2009.
• Robust Object Recognition with Cortex-like Mechanisms, T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426, 2007.
• Object Recognition from Local Scale-Invariant Features, D. G. Lowe, In IEEE International Conference on Computer Vision (ICCV), 1999.
…Key References

• A Tutorial on Energy-Based Learning, Yann LeCun et al., Predicting Structured Data, MIT Press, 2006.
• Dropout: A Simple Way to Prevent Neural Networks from Overfitting, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Journal of Machine Learning Research, 2014.
• A Fast Learning Algorithm for Deep Belief Nets, G. Hinton, S. Osindero, and Y.-W. Teh, Neural Computation, 2006.
• Representation Learning Tutorial, Yoshua Bengio, ICML 2012.
• Representation Learning: A Review and New Perspectives, Y. Bengio, A. Courville, and P. Vincent, April 2014.
• Convolutional Networks for Images, Speech, and Time Series, Y. LeCun and Y. Bengio, The Handbook of Brain Theory and Neural Networks, 3361, 310, 1995.
• Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?, Olshausen & Field, Vision Research, 37(23), pp. 3311–3325, 1997.
• Deep Learning Tutorial, R. Salakhutdinov, KDD, 2014.
APPENDIX