Transcript of Big Data Analytics - Stanford University (infolab.stanford.edu/~echang/BigDat2015/BigDat2015...)
Big Data Analytics: Architectures, Algorithms, and Applications
Part #2: Intro to Deep Learning
Edward Chang 張智威, HTC (prior: Google & U. California)
Simon Wu, HTC (prior: Twitter & Microsoft)
Three Lectures
• Lecture #1: Scalable Big Data Algorithms
– Scalability issues
– Key algorithms with application examples
• Lecture #2: Intro to Deep Learning
– Autoencoder & Sparse Coding
– Graph models: CNN, MRF, & RBM
• Lecture #3: Analytics Platform [by Simon Wu]
– Intro to LAMA platform
– Code lab
1/27/15 Ed Chang @ BigDat 2015
Acknowledging Slide Contributors
• Geoffrey Hinton • Yoshua Bengio • Russ Salakhutdinov • Kai Yu • Yann LeCun • Andrew Ng • Steven Seitz
Lecture #2 Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
Representation?
Knowledge or feature extraction in image processing involves using algorithms to detect and isolate various desired edges or shapes.
Low-level: edge detection, corner detection, ridge detection, or more generally the Scale-Invariant Feature Transform (SIFT)
Curvature: shape information, blob detection
Hough transform: lines, circles/ellipses, arbitrary shapes (Generalized Hough Transform)
Typical Image/Video Representation Based on Domain Knowledge and Human Priors
Template matching (medical imaging)
Flexible methods for 2D, 3D, or 3D+time edge extraction, road detection, MRI, fMRI
Color and texture representations: histograms, various transformations for conducting frequency-domain analysis, e.g., wavelets
Motion: motion detection, e.g., optical flow, global or area based
…Much Related Work on Representation
Key Design Goals for Representation
Design features x that are invariant and selective
• Good invariance
– The same object should have the same features
• Good selectivity (disentanglement)
– Different objects should exhibit different features for telling them apart
Once x has been designed, find label y for x and then learn p(y|x)
Challenges
• Invariance affected by noise
– Environmental factors (e.g., lighting conditions, occlusion)
– Equipment factors (e.g., different camera brands yield different colors and gamma corrections)
– Aliasing (e.g., cars have different models, hence different features)
• Selectivity requires good similarity functions
• Labeled data is tough to acquire
– Learning a robust model requires big data
Remedy #1: Learn ϕ from Data, p(x|ϕ) ≈ p*(x)
• Instead of designing features by hand, learn features ϕ from data
• Data: not just the original data, but variants added to the data
– E.g., adding scaled, rotated, cropped, mirrored, and gamma-adjusted images
• Instead of requiring invariant features as input to a model, let the model cope with invariance
• Then, learn features ϕ for predicting p(x|ϕ) accurately (p(x|ϕ) ≈ p*(x)) in an unsupervised way from data that already covers variant conditions
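The variant-generation step above can be sketched with plain NumPy. The transform set mirrors the slide's bullet (scale, rotate, crop, mirror, gamma adjust), but the specific parameters below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def augment(img, gamma=0.8, crop=4):
    """Generate simple variants of a grayscale image with values in [0, 1].

    Transform set follows the slide: scale, rotate, crop, mirror, gamma.
    Parameters (2x scale, 90-degree rotation, 4-pixel crop, gamma=0.8)
    are illustrative, not from the lecture.
    """
    h, w = img.shape
    scaled = img[::2, ::2]                      # nearest-neighbor 2x downscale
    rotated = np.rot90(img)                     # stand-in for arbitrary rotation
    cropped = img[crop:h - crop, crop:w - crop] # central crop (shifted view)
    mirrored = img[:, ::-1]                     # left-right reflection
    gammaed = np.clip(img, 0, 1) ** gamma       # simulates camera gamma curves
    return {"scaled": scaled, "rotated": rotated,
            "cropped": cropped, "mirrored": mirrored, "gamma": gammaed}

img = np.random.rand(32, 32)
variants = augment(img)   # five extra training images per original
```

Each variant would be added to the unlabeled training pool alongside the original image.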
Remedy #2: Deep Model
• Learn representation in a hierarchical way [T. Serre, T. Poggio; MIT 2005]
Lecture Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to Construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
Multiple-Layer Networks: Neural Network (NN) Model
An elementary neuron with R inputs is shown below. Each input is weighted with an appropriate w. The sum of the weighted inputs and the bias forms the input to the transfer function f. Neurons can use any differentiable transfer function f to generate their output.
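A minimal sketch of the elementary neuron just described: output = f(w·x + b). The logistic choice of f anticipates the logsig function discussed next; the function names here are mine, not the lecture's.

```python
import math

def neuron(x, w, b, f=lambda e: 1.0 / (1.0 + math.exp(-e))):
    """Elementary neuron: weighted sum of R inputs plus bias, passed through f."""
    e = sum(wi * xi for wi, xi in zip(w, x)) + b  # net input to the transfer function
    return f(e)

y = neuron([1.0, 2.0], [0.5, -0.25], 0.1)  # a single activation in (0, 1)
```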
NN Model: Transfer Functions (Activation Functions)
Multilayer networks often use the log-sigmoid transfer function logsig. The function logsig generates outputs between 0 and 1 as the neuron's net input goes from negative to positive infinity.
NN Model: Feedforward Network
A single-layer network of S logsig neurons having R inputs is shown below in full detail on the left and with a layer diagram on the right.
Example: Four-Layer NN
Input Layer → Hidden Layer #1 → Hidden Layer #2 → Output Layer (y)
NN Model: Learning Algorithm
The following slides describe the learning process of a multi-layer neural network employing the backpropagation algorithm. To illustrate this process, the three-layer neural network with two inputs and one output shown in the picture below is used:
Learning Algorithm: Backpropagation
Each neuron is composed of two units. The first unit adds the products of weight coefficients and input signals. The second unit realizes a nonlinear function, called the neuron transfer (activation) function. Signal e is the adder's output signal, and y = f(e) is the output signal of the nonlinear element. Signal y is also the output signal of the neuron.
Feed Forward
The pictures below illustrate how the signal feeds forward through the network. Symbols w(xm)n represent the weights of connections between network input xm and neuron n in the input layer. Symbols yn represent the output signal of neuron n.
Propagation of signals through the hidden layer: symbols wmn represent the weights of connections between the output of neuron m and the input of neuron n in the next layer.
Learning Algorithm: Forward Pass
Propagation of signals through the output layer.
Learning Algorithm: Backpropagation
To teach the neural network, we need a training data set. The training data set consists of input signals (x1 and x2) assigned with corresponding targets (desired outputs) z. Network training is an iterative process. In each iteration, the weight coefficients of nodes are modified using new data from the training data set. The modification is calculated using the algorithm described below. Each teaching step starts with forcing both input signals from the training set. After this stage, we can determine the output signal values for each neuron in each network layer.
In the next step, the output signal of the network y is compared with the desired output value (the target z), found in the training data set. The difference is called the error signal δ of the output-layer neuron.
The idea is to propagate the error signal δ (computed in a single teaching step) back to all neurons whose output signals were inputs for the neuron in question.
The weight coefficients wmn used to propagate errors back are equal to those used during computing the output value; only the direction of data flow is changed (signals are propagated from outputs to inputs, one after the other). This technique is used for all network layers. If propagated errors come from several neurons, they are added. The illustration is below:
Learning Algorithm: Backpropagation
When the error signal for each neuron is computed, the weight coefficients of each neuron's input nodes may be modified. In the formulas below, df(e)/de represents the derivative of the activation function of the neuron whose weights are modified.
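The feed-forward and error-backpropagation steps described above can be sketched for a tiny network with two inputs and one output, as in the slides' example. The hidden-layer size, learning rate η, and iteration count below are illustrative choices of mine.

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(0)
x = rng.normal(size=2)          # two input signals x1, x2
z = np.array([1.0])             # desired output (target)
W1 = rng.normal(size=(3, 2))    # input -> hidden weights
W2 = rng.normal(size=(1, 3))    # hidden -> output weights
eta = 0.5                       # learning rate (illustrative)

for _ in range(500):
    # Feed forward: e is each adder's output, y = f(e).
    e1 = W1 @ x;  y1 = sigmoid(e1)
    e2 = W2 @ y1; y  = sigmoid(e2)
    # Error signal of the output-layer neuron, times df(e)/de = f(e)(1 - f(e)).
    delta2 = (z - y) * y * (1 - y)
    # Propagate delta back through the same weights W2.
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)
    # Weight modification: w += eta * delta * input signal.
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
```

After training on this single example, the network output y approaches the target z.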
Sigmoid function f(e) and its derivative f′(e)
f(e) = 1 / (1 + e^(−βe)), where β is the parameter for the slope.
Hence
f′(e) = df(e)/de = β e^(−βe) / (1 + e^(−βe))² = β f(e) (1 − f(e))
For simplicity, take the slope parameter β = 1, so f′(e) = f(e) (1 − f(e)).
http://link.springer.com/chapter/10.1007%2F3-540-59497-3_175#page-1
http://mathworld.wolfram.com/SigmoidFunction.html
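A quick finite-difference check of the identity f′(e) = β f(e)(1 − f(e)); this is my own sanity test, not part of the slides.

```python
import math

def f(e, beta=1.0):
    """Sigmoid f(e) = 1 / (1 + exp(-beta * e))."""
    return 1.0 / (1.0 + math.exp(-beta * e))

def fprime_closed(e, beta=1.0):
    """Closed form from the slide: beta * f(e) * (1 - f(e))."""
    return beta * f(e, beta) * (1.0 - f(e, beta))

def fprime_numeric(e, beta=1.0, h=1e-6):
    """Central finite difference of f."""
    return (f(e + h, beta) - f(e - h, beta)) / (2 * h)

for e in (-2.0, 0.0, 1.5):
    assert abs(fprime_closed(e) - fprime_numeric(e)) < 1e-6
    assert abs(fprime_closed(e, 2.0) - fprime_numeric(e, 2.0)) < 1e-6
```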
Autoencoder: NN for Unsupervised Compression
h_{w,b}(x) ≈ x
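A minimal autoencoder matching h_{w,b}(x) ≈ x: encode into fewer hidden units than inputs, decode back, and train by backpropagating the reconstruction error. The layer sizes, linear decoder, and training details below are illustrative assumptions, not the lecture's specification.

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))            # 200 samples, 10 features each
H = 4                                     # bottleneck: compress 10 -> 4 units

W1 = rng.normal(scale=0.1, size=(H, 10)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(10, H)); b2 = np.zeros(10)
eta = 0.05

losses = []
for _ in range(300):
    A = sigmoid(X @ W1.T + b1)            # encode: hidden activations
    Xh = A @ W2.T + b2                    # decode: linear reconstruction
    err = Xh - X                          # h_{w,b}(x) - x
    losses.append(float((err ** 2).mean()))
    # Backpropagate the squared reconstruction error.
    dW2 = err.T @ A / len(X); db2 = err.mean(0)
    dA = (err @ W2) * A * (1 - A)
    dW1 = dA.T @ X / len(X); db1 = dA.mean(0)
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1
```

The reconstruction error shrinks as the hidden layer learns a compressed code for x.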
Parameter Learning
• 10×10 images with 100 pixels
• x ∈ R^100: a huge space of possible configurations
• H hidden units
– H = 100? H = 50? (cf. PCA)
• Too computationally intensive to learn w
Learning Algorithm
• Suppose ϕ (or h) is a set of hidden variables
• Model image x with k independent hidden features ϕ_i plus additive noise ν:
x = Σ_{i=1}^{k} a_i ϕ_i + ν(x)
• The goal is to find hidden features such that the posterior P(x|ϕ) is as close as possible to P*(x), i.e., to minimize the KL divergence between the two
…Learning Algorithm
• Minimize the KL divergence between the two distributions:
D(P*(x) || P(x|ϕ)) = ∫ P*(x) log [P*(x) / P(x|ϕ)] dx
• Since P*(x) is constant across choices of ϕ, minimizing KL ⇒ maximizing the log-likelihood:
ϕ* = argmax_ϕ log P(x|ϕ)
• This reduces (revisited later) to:
ϕ*, a* = argmin_{ϕ,a} Σ_{j=1}^{m} ‖x^(j) − Σ_{i=1}^{k} a_i^(j) ϕ_i‖² + λ Σ_{i=1}^{k} S(a_i^(j))
Lecture Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to Construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
General Priors in Real-World Data [Y. Bengio, et al., 2014]
• A hierarchical organization of factors
• Smoothness
– x ≈ y ⇒ f(x) ≈ f(y)
– Nearest-neighbor assumption
• Local manifold
– Clustered, low degrees of freedom (e.g., PCA)
• Distributed representations
– Feature reuse; abstract & invariant representations
– Dynamic and partial
Smoothness: Nearest Neighbor Model
• Every learning model is a variant of the nearest-neighbor model
• Similar objects should reside in the neighborhood of a feature subspace
Local Low-Dimensional Manifolds
K. Yu and A. Ng, Tutorial: Feature Learning for Image Classification, Part 3: Image Classification using Sparse Coding: Advanced Topics, ECCV 2010.
Data manifold: locally linear
Smooth, Local, Sparse
K. Yu and A. Ng, Tutorial: Feature Learning for Image Classification, Part 3: Image Classification using Sparse Coding: Advanced Topics, ECCV 2010.
Data, basis, locally linear data manifold: each datum can be represented as a sparse combination of its neighboring anchors.
Sparse Coding [Olshausen & Field, 1996]
• Find a representation of data, unsupervised
– Traditionally PCA (too contrived; why?)
• Find over-complete bases in an efficient way
• x ≈ aϕ, where x ∈ R^n and ϕ ∈ R^m, m > n
• The coefficients a cannot be uniquely determined
• Thus, impose sparsity on the coefficients
• k-sparsity
Sparse Coding
[Diagram: x (N×1) ≈ ϕ (a fixed N×M dictionary) × a (M×1, K-sparse)]
What is Sparse Coding
min_{a,ϕ} Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
Sparse coding (Olshausen & Field, 1996) was originally developed to explain early visual processing in the brain (edge detection in V1).
Training: given a set of random patches x, learn a dictionary of bases [Φ1, Φ2, …]
Coding: for a data vector x, solve LASSO to find the sparse coefficient vector a
Sparse Coding: Training Time
Input: images x1, x2, …, xm (each in R^d)
Learn: dictionary of bases ϕ1, ϕ2, …, ϕk (also in R^d)
min_{a,ϕ} Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
Alternating optimization:
1. Fix the dictionary ϕ1, ϕ2, …, ϕk, optimize a (a standard LASSO problem)
2. Fix the activations a, optimize the dictionary ϕ1, ϕ2, …, ϕk (a convex QP problem)
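The alternating scheme above can be sketched in NumPy. An ISTA loop stands in for the LASSO solver in step 1, and a regularized least-squares refit with column normalization stands in for the QP in step 2; these solver choices and the hyperparameters are my assumptions, not the lecture's.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(X, Phi, lam, n_iter=100):
    """Step 1 (LASSO): fix the dictionary Phi (d x k), solve for codes A (k x m)."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    A = np.zeros((Phi.shape[1], X.shape[1]))
    for _ in range(n_iter):                   # ISTA iterations
        A = soft_threshold(A - Phi.T @ (Phi @ A - X) / L, lam / L)
    return A

def update_dictionary(X, A, eps=1e-8):
    """Step 2: fix the codes A, refit Phi by least squares, renormalize columns."""
    Phi = X @ A.T @ np.linalg.pinv(A @ A.T + eps * np.eye(len(A)))
    return Phi / np.maximum(np.linalg.norm(Phi, axis=0), eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 50))                 # 50 patches in R^16
Phi = rng.normal(size=(16, 32))               # over-complete: 32 > 16 bases
Phi /= np.linalg.norm(Phi, axis=0)
for _ in range(10):                           # alternate the two steps
    A = sparse_code(X, Phi, lam=0.5)
    Phi = update_dictionary(X, A)
```

The soft-thresholding step is what produces exact zeros in the codes, i.e., the sparsity the L1 penalty asks for.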
Sparse Coding: Testing Time
Input: an unseen image patch x_i (in R^d) and the previously learned ϕ_i's
Output: the representation [a_{i,1}, a_{i,2}, …, a_{i,k}] of image patch x_i
x_i ≈ 0.8 * … + 0.3 * … + 0.5 * … (basis images not shown)
Represent x_i as: a_i = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, …]
Coding solves the same LASSO objective with the dictionary held fixed:
min_a Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
Justifications & Examples
• Probabilistic interpretation
• Human visual cortex
– Not enforcing orthogonal bases like PCA
– Over-completeness preserves more features (scales, orientations)
Revisit Autoencoder's Probabilistic Interpretation
• Suppose ϕ (or h) is a set of hidden variables
• Model image x with k independent hidden features ϕ_i plus additive noise ν:
x = Σ_{i=1}^{k} a_i ϕ_i + ν(x)
• The goal is to find hidden features such that P(x|ϕ) is as close as possible to P*(x), i.e., to minimize the KL divergence between the two
…Probabilistic Interpretation
• Minimize the KL divergence between the two distributions:
D(P*(x) || P(x|ϕ)) = ∫ P*(x) log [P*(x) / P(x|ϕ)] dx
• Since P*(x) is constant across choices of ϕ, maximize the log-likelihood:
ϕ* = argmax_ϕ log P(x|ϕ)
…Probabilistic Interpretation
• We need the two terms P(x|a, ϕ) and P(a) because
P(x|ϕ) = ∫ P(x|a, ϕ) P(a) da
• Assume the white noise ν is Gaussian with variance σ²:
P(x|a, ϕ) = (1/Z) exp( −‖x − Σ_{i=1}^{k} a_i ϕ_i‖² / 2σ² )
• To determine P(x|ϕ), we need the prior P(a). Assume the source features are independent:
P(a) = Π_{i=1}^{k} p(a_i)
…Probabilistic Interpretation
• Add the sparsity assumption: every image is a combination of few features, so we would like the probability distribution of a_i to be peaked at zero with high kurtosis; S(a_i) controls the shape:
P(a_i) = (1/Z) exp(−β S(a_i))
P(x|ϕ) = ∫ P(x|a, ϕ) P(a) da,  P(a) = Π_{i=1}^{k} p(a_i)
…Probabilistic Interpretation
• Over all input data, the problem reduces to:
ϕ* = argmax_ϕ log P(x|ϕ)
Max Σ_{j=1}^{m} log ∫ P(x|a, ϕ) P(a) da
= Max Σ_{j=1}^{m} log ∫ exp( −‖x − Σ_i a_i ϕ_i‖² / 2σ² ) Π_i exp(−β S(a_i)) da
= Max Σ_{j=1}^{m} log ∫ exp( −‖x − Σ_i a_i ϕ_i‖² / 2σ² − β Σ_i S(a_i) ) da
→ Min Σ_{j=1}^{m} ‖x^(j) − Σ_{i=1}^{k} a_i^(j) ϕ_i‖² + λ Σ_{i=1}^{k} S(a_i^(j))
…Probabilistic Interpretation
• Maximizing the log-likelihood is equivalent to minimizing the energy function:
ϕ*, a* = argmin_{ϕ,a} Σ_{j=1}^{m} ‖x^(j) − Σ_{i=1}^{k} a_i^(j) ϕ_i‖² + λ Σ_{i=1}^{k} S(a_i^(j))
• The choices of S(·), L1 or log penalty, correspond to the use of the Laplacian and the Cauchy prior, respectively:
P(a_i) ∝ exp(−β |a_i|)   (Laplacian)
P(a_i) ∝ β / (1 + a_i²)   (Cauchy)
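The prior-to-penalty correspondence follows by taking negative logarithms; a one-line check of my own, in the slides' notation:

```latex
-\log P(a_i) = \beta\,\lvert a_i\rvert + \text{const}
\;\Rightarrow\; S(a_i) = \lvert a_i\rvert \quad (\text{L1 penalty}),
\qquad
-\log P(a_i) = \log\bigl(1 + a_i^2\bigr) + \text{const}
\;\Rightarrow\; S(a_i) = \log\bigl(1 + a_i^2\bigr) \quad (\text{log penalty}).
```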
Justifications & Examples
• Probabilistic interpretation
• Human visual cortex
– Not enforcing orthogonal bases like PCA
– Over-completeness preserves more features (scales, orientations)
Feature Invariance
• The human visual system works remarkably well
• "Mental" model (T. Serre, T. Poggio; MIT 2005)
– Ventral visual pathway
– Deep learning
Visual Pathway [Hubel & Wiesel, 1968]
Primary Visual Cortex (V1)
Extrastriate Cortex
Feedforward Path of the Ventral Stream
• Invariance (overcomplete)
– V1: starting with scale/position/orientation invariance over a restricted range
– Then invariance to viewpoints and other transformations
• Multi-layer, multi-area (deep)
– V2 and V3 (shape): increasing complexity of the optimal stimulus
• Feedforward
– First 150 milliseconds of perception
– No color information (in V4)
– Without feedback
Six Steps of HMAX [T. Serre, T. Poggio; MIT 2005]
Multi-layer Visual Pathway
• Edge detection, multi-scale, multi-direction (on/off, simple)
– Using multi-scale, multi-direction Gabor filters
• Edge pooling (max, invariance)
– Keep "strong" features
• Unsupervised clustering (or)
– Clustering edges into patches
V1-Like Bases
Multi-layer Visual Pathway
• Part detection (on/off, simple)
– Find matching patches in photos
• Part pooling (max, invariance)
– Identify useful patches/parts
• Supervised learning
– Object ← parts
Edges and Parts
Six Steps of HMAX [T. Serre, T. Poggio; MIT 2005]
• Edge detection, multi-scale/direction (on/off, simple)
– Using multi-scale, multi-orientation Gabor filters
• Edge pooling (max, invariance)
– Keep "strong" features
• Unsupervised clustering (or)
– Clustering edges into patches
• Part detection (on/off, simple)
– Find matching patches in photos
• Part pooling (max, invariance)
– Identify useful patches/parts
• Supervised learning
– Object ← parts
Revisit Challenges of Representation Learning
• Invariance affected by noise
– Environmental factors (e.g., lighting conditions, occlusion)
– Equipment factors (e.g., different camera brands yield different colors and gamma corrections)
– Aliasing (e.g., cars have different models, hence different features)
• Labeled data is tough to acquire
– Robust models require big data
• Selectivity requires good similarity functions
Lecture Outline
• Data Posteriors vs. Human Priors
• Learn p(x) from Big Data
– Use NN to Construct Autoencoder
– Sparse Coding
– Dynamic Partial
• Graphical Models
– CNN, MRF, & RBM
• Demo
Example of Sparse Models
• Because the 2nd and 4th elements of w are non-zero, these are the two selected features in x
• Globally-aligned sparse representation
x1 [ | | | | | | ]
x2 [ | | | | | | ]
xm [ | | | | | | ]
…
x3 [ | | | | | | ]
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
…
[ 0 | 0 | 0 0 ]
f(x) = <w,x>, where w=[0, 0.2, 0, 0.1, 0, 0]
Example of Sparse Activations
• Different x have different dimensions activated
• Locally-shared sparse representation: similar x's tend to have similar non-zero dimensions, but not all
a1 [ 0 | | | 0 … 0 ]
a2 [ | | | 0 0 … 0 ]
am [ 0 0 0 | | … 0 ]
…
a3 [ | 0 | | 0 … 0 ]
x1
x2 x3
xm
Example of Sparse Activations
• Preserving manifold structure (i.e., clusters, manifolds)
a1 [ | | | 0 0 … 0 ] a2 [ 0 | | | 0 … 0 ]
am [ 0 0 0 0 | … 0 ]
…
a3 [ 0 0 | | | … 0 ]
x1 x2 x3
xm
Similarity Theories
• Objects are similar in all respects (Richardson, 1928)
• Objects are similar in some respects (Tversky, 1977)
• Similarity is a process of determining respects, rather than using predefined respects (Goldstone, 1994)
Similarity Theories
• Objects are similar in all or some respects
• Minkowski function
– D = (Σ_{i=1..M} (p_i − q_i)^n)^{1/n}
• Weighted Minkowski function
– D = (Σ_{i=1..M} w_i (p_i − q_i)^n)^{1/n}
• The same w is imposed on all pairs of objects p and q
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
[ 0 | 0 | 0 0 ]
…
[ 0 | 0 | 0 0 ]
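The two distance functions above can be written out directly. Note that I take absolute differences, which matches the usual Minkowski definition and keeps odd exponents n well-behaved; that absolute value is my addition to the slide's formula.

```python
def minkowski(p, q, n=2):
    """D = (sum_i |p_i - q_i|^n)^(1/n); n = 2 gives Euclidean distance."""
    return sum(abs(pi - qi) ** n for pi, qi in zip(p, q)) ** (1.0 / n)

def weighted_minkowski(p, q, w, n=2):
    """Same, but dimension i carries a fixed weight w_i applied to all pairs."""
    return sum(wi * abs(pi - qi) ** n
               for wi, pi, qi in zip(w, p, q)) ** (1.0 / n)

d = minkowski([0, 0], [3, 4])  # the classic 3-4-5 Euclidean case
```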
DPF: Dynamic Partial Function [B. Li, E. Chang, et al., MM Systems 2013]
• Similarity is a process of determining respects, rather than using predefined respects (Goldstone, 1994)
a1 [ 0 | | | 0 … 0 ]
a2 [ | | | 0 0 … 0 ]
am [ 0 0 0 | | … 0 ]
…
a3 [ | 0 | | 0 … 0 ]
a1 [ | | | 0 0 … 0 ] a2 [ 0 | | | 0 … 0 ]
am [ 0 0 0 0 | … 0 ]
…
a3 [ 0 0 | | | … 0 ]
[Figures: Average Distance vs. Feature Number (1–144) between original images and their variants under GIF compression, scaling up/down, cropping, and rotation.]
DPF: Dynamic Partial Function [B. Li, E. Chang, et al., MM Systems 2013]
• Which place is similar to Kyoto?
• Partial
• Dynamic
• Dynamic Partial Function
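A sketch of the DPF idea as I read it from these slides: instead of weighting all dimensions with one fixed w, each comparison dynamically keeps only the m dimensions on which the two objects differ least, i.e., the "respects" are determined per pair. The exact formulation in the cited paper may differ; this is an illustrative reconstruction.

```python
def dpf(p, q, m, r=2):
    """Dynamic Partial Function: a Minkowski-style distance computed over
    only the m smallest per-dimension differences, chosen anew for each
    pair (partial: m < M dimensions; dynamic: the subset depends on p, q)."""
    diffs = sorted(abs(pi - qi) for pi, qi in zip(p, q))
    return sum(d ** r for d in diffs[:m]) ** (1.0 / r)

# The selected dimensions change with the pair being compared:
d1 = dpf([1, 0, 9, 0], [1, 0, 0, 0], m=3)  # the one large mismatch is ignored
d2 = dpf([5, 5, 5, 5], [5, 5, 5, 5], m=3)  # identical objects
```

With m = M (all dimensions kept) this reduces to the plain Minkowski function.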
Precision/Recall
Partial, Dynamic: Low-Dimensional Manifolds
K. Yu and A. Ng, Tutorial: Feature Learning for Image Classification, Part 3: Image Classification using Sparse Coding: Advanced Topics, ECCV 2010.
Data manifold: locally linear
Part #1 Summary
• Overcomplete representation
• Sparse weighting vector a for x
• Autoencoders & sparse coding
– Equivalent models: one with implicit and one with explicit f(x)
Autoencoders
– also involve activation and reconstruction
– but have an explicit f(x), e.g., a sigmoid function
– do not necessarily enforce sparsity on a
– but if sparsity is put on a, often get improved results [e.g., sparse RBM, Lee et al., NIPS 2008]
[Diagram: x → f(x) (encoding) → a → g(a) (decoding) → x′]
Sparse Coding
min_{a,ϕ} Σ_{i=1}^{m} ‖x_i − Σ_{j=1}^{k} a_{i,j} ϕ_j‖² + λ Σ_{i=1}^{m} Σ_{j=1}^{k} |a_{i,j}|
– a is sparse
– a is often of higher dimension than x
– the activation a = f(x) is a nonlinear implicit function of x
– the reconstruction x′ = g(a) is linear & explicit
[Diagram: x → f(x) (encoding) → a → g(a) (decoding) → x′]
Hierarchical Sparse Coding
Sparse Coding → Pooling → Sparse Coding → Pooling
Learning from unlabeled data
Yu, Lin, & Lafferty, CVPR 11; Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 11
DEEP MODELS CNN, MRF & RBM
Recap: NN
• Other network architectures: how the different neurons are connected to each other
• Layer 1 → Layer 2 → Layer 3 → Layer 4: in a traditional NN, neurons in a layer are fully connected to all neurons in the next layer.
CNN: NN Considers Sparse Coding
The Replicated Feature Approach (Hinton: the dominant approach for neural networks)
• Use many different copies of the same feature detector with different positions
– Could also replicate across scale and orientation (tricky and expensive)
– Replication greatly reduces the number of free parameters to be learned
• Use several different feature types, each with its own map of replicated detectors
– Allows each patch of image to be represented in several ways → overcomplete
The red connections all have the same weight.
CNN Architecture: Convolutional Layers
Spatially-local correlation
– Spatial information is encoded in the network
– Sparse connectivity
[Diagram: Layer 1 → Layer 2, partial convolutional layer]
Pooling the Outputs of Replicated Feature Detectors
Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.
– This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.
– Taking the maximum of the four (like HMAX) works slightly better (G. Hinton).
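Both pooling variants just described, averaging and taking the maximum of four neighboring detectors, amount to pooling non-overlapping 2×2 windows of a feature map:

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """Pool non-overlapping 2x2 windows of a feature map (H and W even)."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)  # group each 2x2 neighborhood
    if mode == "max":
        return blocks.max(axis=(1, 3))           # HMAX-style max pooling
    return blocks.mean(axis=(1, 3))              # average pooling

fmap = np.array([[1., 2., 0., 0.],
                 [3., 4., 0., 1.],
                 [0., 0., 5., 6.],
                 [0., 2., 7., 8.]])
pool2x2(fmap, "max")    # [[4., 1.], [2., 8.]]
pool2x2(fmap, "mean")   # [[2.5, 0.25], [0.5, 6.5]]
```

Either way, the next layer receives one output per 2×2 neighborhood, a quarter as many inputs.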
Convolutional Networks [LeCun 97]
• Convolution (feature detection)
• Sub-sampling (multi-scale)
• Perform C & S iteratively to form a deep-learning network
• Learn weights from data
• Location information (where an object is) is lost
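One C + S stage can be sketched as follows. The 3×3 vertical-edge kernel and stride-2 sub-sampling are illustrative choices of mine, not LeNet's actual parameters, and the "convolution" is implemented as correlation, as is common in CNN practice.

```python
import numpy as np

def convolve2d(img, kernel):
    """Valid-mode 2D sliding-window correlation (CNN-style 'convolution')."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

def subsample(fmap, s=2):
    """Sub-sampling: keep every s-th activation (the multi-scale step)."""
    return fmap[::s, ::s]

img = np.zeros((8, 8)); img[:, 4:] = 1.0   # image with one vertical edge
kernel = np.array([[-1., 0., 1.]] * 3)      # responds to vertical edges
fmap = subsample(convolve2d(img, kernel))   # one convolution + sub-sampling stage
```

Stacking such stages, with learned kernels, is exactly the iterative C & S structure listed above.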
The 82 Errors Made by LeNet5
Notice that most of the errors are cases that people find quite easy. The human error rate is probably 20 to 30 errors, but nobody has had the patience to measure it.
Hinton, NIPS 2013
Ciresan's Brute-Force Approach
• LeNet uses knowledge about the invariances in its design:
– the local connectivity
– the weight-sharing
– the pooling
• Achieves about 80 errors
– This can be reduced to about 40 errors by using many different transformations of the input and other tricks (Ranzato 2008)
• Ciresan et al. (2010) inject knowledge of invariances by creating a huge amount of carefully designed extra training data:
– For each training image, they produce many new training examples by applying many different transformations.
– They can then train a large, deep, dumb net on a GPU without much overfitting.
• Improves to 35 errors
The Errors Made by the Ciresan et al. Net
The top printed digit is the right answer. The bottom two printed digits are the network's best two guesses. The right answer is almost always in the top 2 guesses. With model averaging they can now get about 25 errors.
From Hand-Written Digits to 3-D Objects
• Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:
– A hundred times as many classes (1,000 vs. 10)
– A hundred times as many pixels (256 × 256 color vs. 28 × 28 gray)
– Two-dimensional images of three-dimensional scenes
– Cluttered scenes requiring segmentation
– Multiple objects in each image
• Will the same type of CNN work?
The ILSVRC-2012 Competition on ImageNet
• The dataset has 1.2 million high-resolution training images.
• The classification task:
– Get the "correct" class in your top 5 bets. There are 1,000 classes.
• The localization task:
– For each bet, put a box around the object. Your box must have at least 50% overlap with the correct box.
Examples
Error Rates in the ILSVRC-2012 Competition
(classification / classification & localization)
• University of Tokyo: 26.1% / 53.6%
• Oxford University Computer Vision Group: 26.9% / 50.0%
• INRIA (French national research institute in CS) + XRCE (Xerox Research Centre Europe): 27.0%
• University of Amsterdam: 29.5%
• University of Toronto (Alex Krizhevsky): 16.4% / 34.1%
A Neural Network for ImageNet
• Alex Krizhevsky (NIPS 2012) developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its architecture was:
– 7 hidden layers, not counting some max-pooling layers
– The early layers were convolutional
– The last two layers were globally connected
• The activation functions were:
– Rectified linear units in every hidden layer, f(x) = max(0, x). These train much faster and are more expressive than logistic units.
– Competitive normalization to suppress hidden activities when nearby units have stronger activities. This helps with variations in intensity.
Tricks That Significantly Improve Generalization
• Bagging: train on random 224×224 patches from the 256×256 images to get more data. Also use left-right reflections of the images. At test time, combine the opinions from ten different patches: the four 224×224 corner patches plus the central 224×224 patch, plus the reflections of those five patches.
• Dropout (sparsification): use "dropout" to regularize the weights in the globally connected layers (which contain most of the parameters). Dropout means that half of the hidden units in a layer are randomly removed for each training example. This stops hidden units from relying too much on other hidden units.
Dropout: An efficient way to average many large neural nets (http://arxiv.org/abs/1207.0580)

• Consider a neural net with one hidden layer of H units.
• Each time we present a training example, we randomly omit each hidden unit with probability 0.5.
• So we are randomly sampling from 2^H different architectures. All architectures share weights.
Dropout as a form of model averaging: an extreme Bagging

• We sample from 2^H models, so only a few of the models ever get trained, and each gets at most one training example.
  – This is as extreme as Bagging can get.
• The sharing of the weights means that every model is very strongly regularized.
  – It's a much better regularizer than L2 or L1 penalties that pull the weights towards zero.
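A minimal sketch of the mechanism (the function name and the activation-scaling variant are mine; the paper halves the outgoing weights at test time, which is equivalent):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5, train=True):
    """Dropout on a hidden layer h.
    Training: zero each unit independently with probability p_drop.
    Test: keep all units but scale by (1 - p_drop), which approximates
    averaging the predictions of all 2^H thinned networks."""
    if train:
        mask = rng.random(h.shape) >= p_drop
        return h * mask
    return h * (1.0 - p_drop)

h = np.ones(8)
print(dropout_forward(h, train=True))   # roughly half the units zeroed
print(dropout_forward(h, train=False))  # all units scaled to 0.5
```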
DEEP MODELS CNN, MRF & RBM
Russ S. KDD 04 Tutorial
Directed Graphs: Bayesian Networks

• General factorization: p(x) = ∏_k p(x_k | pa_k), where pa_k denotes the parents of x_k.
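The general factorization can be made concrete on a tiny chain A → B → C of binary variables (the probability tables below are made-up illustrative numbers):

```python
# Joint p(a, b, c) = p(a) * p(b | a) * p(c | b), following the
# general factorization p(x) = prod_k p(x_k | pa_k).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    # Each factor conditions only on the node's parents.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The joint sums to 1 over all 2^3 configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0
```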
"Explaining Away"
• Causal inference for directed graphs has one subtlety.
• Illustration: the pixel colour in an image depends on both the surface colour and the lighting colour; once the image colour is observed, its two causes become dependent, and one cause being likely "explains away" the other.
[Figure: image colour node with parent nodes surface colour and lighting colour]
C. Bishop, ECCV tutorial
Shortcomings of Back-propagation

• It requires labeled training data.
  – Almost all data is unlabeled.
• The learning time does not scale well.
  – It is very slow in networks with multiple hidden layers.
  – Backward pass: the signal dE/dy diminishes as the number of layers increases.
• It can get stuck in poor local optima.
  – These are often quite good, but for deep nets they are far from optimal.
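The diminishing backward signal can be seen numerically: with logistic units, each layer multiplies dE/dy by a local derivative of at most 0.25, so the signal shrinks geometrically with depth. A small illustrative sketch (the best-case setup is mine):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Backpropagate a unit error signal through 10 logistic layers with
# pre-activation 0.0, where the derivative is at its maximum (0.25).
signal = 1.0
for layer in range(10):
    s = logistic(0.0)
    signal *= s * (1.0 - s)  # multiply by the local derivative
    print(f"after layer {layer + 1}: {signal:.2e}")

# Even in this best case the signal decays as 0.25^n, so after 10
# layers it is already ~1e-6: early layers learn extremely slowly.
```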
MRF & RBM: Directed → Undirected Graphs
Markov Random Field (MRF) Components

• A set of sites or pixels P = {1, …, m}: each pixel is a site.
• Each pixel p has a neighborhood, N = {N_p | p ∈ P}.
• A set of random variables (a random field), one per pixel: X = {X_p | p ∈ P}, denoting the label at each pixel.
• Each random variable takes a value x_p from the set of labels L = {l_1, …, l_k}.
• A joint event {X_1 = x_1, …, X_m = x_m} is called a configuration, abbreviated X = x.
• The joint probability of such a configuration is p(X = x), or p(x) for short.
• There are k^m possible configurations.

From slides by S. Seitz, University of Washington
Markov Random Field: Hammersley–Clifford Theorem

• The joint distribution p(x) is a product of non-negative functions over the cliques (neighbourhoods) of the graph:

  p(x) = (1/Z) ∏_{c ∈ C} ψ_c(x_c)

• where the ψ_c are the clique potentials, and Z is the normalization constant (partition function) that makes the distribution sum to 1.
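The clique-potential factorization can be checked by hand on a tiny 3-pixel chain with binary labels, where the cliques are the two neighbouring pairs (the smoothing potential below is a made-up example):

```python
import itertools

# Pairwise clique potential: favour neighbouring pixels with equal labels.
def psi(xi, xj):
    return 2.0 if xi == xj else 1.0

labels = (0, 1)
configs = list(itertools.product(labels, repeat=3))  # k^m = 2^3 = 8

# Z sums the unnormalized clique products over all configurations.
Z = sum(psi(x[0], x[1]) * psi(x[1], x[2]) for x in configs)

def p(x):
    return psi(x[0], x[1]) * psi(x[1], x[2]) / Z

print(Z)             # 18.0
print(p((0, 0, 0)))  # all-equal configurations are the most likely
```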
Equilibrium Interpretation

• E_{P_θ}[x_i x_j]: expected value of the product of states at thermal equilibrium when nothing is clamped.
• E_{P_data}[x_i x_j]: expected value of the product of states at thermal equilibrium when the training data is clamped on the visible units.
• The log-likelihood gradient is their difference:

  ∂L(θ)/∂θ_ij = E_{P_data}[x_i x_j] − E_{P_θ}[x_i x_j]
Model Learning (Similar to MRF)

• The model expectation E_{P_θ}[v_i h_j] is expensive to compute: there is an exponential number of configurations (over all possible images), so use MCMC.
• The data expectation E_{P_data}[v_i h_j] is simple to compute.
• The log-likelihood gradient:

  ∂L(θ)/∂θ_ij = E_{P_data}[v_i h_j] − E_{P_θ}[v_i h_j]
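The slide only says "use MCMC" for the model term; a common practical shortcut (not spelled out here) is one-step contrastive divergence, CD-1. A minimal sketch for a binary RBM, with made-up sizes and biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v_data, W):
    """One step of contrastive divergence for a binary RBM.
    Approximates E_data[v h] - E_model[v h] for the weight update."""
    # Positive phase: training data clamped on the visible units.
    h_prob = sigmoid(v_data @ W)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step stands in for the equilibrium term.
    v_recon = sigmoid(h_sample @ W.T)
    h_recon = sigmoid(v_recon @ W)
    positive = v_data.T @ h_prob
    negative = v_recon.T @ h_recon
    return (positive - negative) / v_data.shape[0]

# Toy example: 4 visible units, 3 hidden units, batch of 5.
W = 0.01 * rng.standard_normal((4, 3))
v = (rng.random((5, 4)) < 0.5).astype(float)
grad = cd1_gradient(v, W)
print(grad.shape)  # (4, 3); used as W += learning_rate * grad
```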
Latest ImageNet Competition Update
Key References

• Deep Learning video lectures: http://videolectures.net/Top/Computer_Science/Machine_Learning/Deep_Learning/
• A Data-Driven Study on Image Feature Extraction and Fusion, Zhiyu Wang, Fangtao Li, Edward Y. Chang, and Shiqiang Yang, Google Technical Report, April 2012.
• Foundations of Large-Scale Multimedia Information Management and Retrieval, E. Y. Chang, Springer, 2011.
• Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng, In Proceedings of the Twenty-Sixth International Conference on Machine Learning, 2009.
• Robust Object Recognition with Cortex-like Mechanisms, T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426, 2007.
• Object Recognition from Local Scale-Invariant Features, D. G. Lowe, In IEEE International Conference on Computer Vision (ICCV), 1999.
…Key References

• A Tutorial on Energy-Based Learning, Yann LeCun et al., Predicting Structured Data, MIT Press, 2006.
• Dropout: A Simple Way to Prevent Neural Networks from Overfitting, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Journal of Machine Learning Research, 2014.
• A Fast Learning Algorithm for Deep Belief Nets, G. Hinton, S. Osindero, and Y.-W. Teh, Neural Computation, 2006.
• Representation Learning Tutorial, Yoshua Bengio, ICML 2012.
• Representation Learning: A Review and New Perspectives, Y. Bengio, A. Courville, and P. Vincent, April 2014.
• Convolutional Networks for Images, Speech, and Time Series, Y. LeCun and Y. Bengio, The Handbook of Brain Theory and Neural Networks, 3361, 310, 1995.
• Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?, Olshausen & Field, Vision Research, 37(23), pp. 3311–3325, 1997.
• Deep Learning Tutorial, R. Salakhutdinov, KDD, 2014.
APPENDIX