Update on Collaboration with Business Units and CRLtopic.it168.com/factory/adc2013/doc/yukai.pdf ·...

Deep Learning: The New Wave of Machine Learning. Kai Yu, Deputy Managing Director, Baidu IDL

Transcript of: Update on Collaboration with Business Units and CRL (topic.it168.com/factory/adc2013/doc/yukai.pdf)

Page 1: Update on Collaboration with Business Units and CRLtopic.it168.com/factory/adc2013/doc/yukai.pdf · 2013-07-23  · Natural Language Processing (Almost) from Scratch, Collobert et

深度学习:机器学习的新浪潮

Kai Yu

Deputy Managing Director, Baidu IDL

Page 2:

Machine Learning

7/23/2013 2

“…the design and development of algorithms that allow computers to evolve behaviors based on empirical data, …”

Page 3:

All Machine Learning Models in One Page


Shallow Models Deep Models

Page 4:

Shallow Models Since the Late 80's

Neural Networks

Boosting

Support Vector Machines

Maximum Entropy


Given good features, how to do classification?
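The slide's question (given good features, how to do classification?) can be made concrete with a minimal sketch: a shallow linear model, here logistic regression trained by plain gradient descent. The two-class features below are synthetic and linearly separable by construction; all values are illustrative.

```python
import numpy as np

# A shallow model on already-good features: logistic regression.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),   # class-0 features
               rng.normal(2.0, 1.0, (50, 2))])   # class-1 features
y = np.array([0] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # sigmoid predictions
    w -= 0.1 * X.T @ (p - y) / len(y)            # log-loss gradient step
    b -= 0.1 * float(np.mean(p - y))

accuracy = float(np.mean(((X @ w + b) > 0) == y))
```

With well-separated features the decision problem is easy, which is exactly the point of the slide: the hard part handled elsewhere is getting the features.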

Page 5:

Since 2000 – Learning with Structures

Kernel Learning

Transfer Learning

Semi-supervised Learning

Manifold Learning

Sparse Learning

Matrix Factorization


Given good features, how to do classification?

Page 6:

Mission Not Yet Accomplished


Mining for Structure

Massive increase in both computational power and the amount of data available from the web, video cameras, laboratory measurements.

Images & Video; Relational Data / Social Network; Speech & Audio; Gene Expression; Text & Language; Geological Data; Product Recommendation; Climate Change. Mostly unlabeled.

• Develop statistical models that can discover underlying structure, cause, or statistical correlation from data in an unsupervised or semi-supervised way.
• Multiple application domains.

Slide Courtesy: Russ Salakhutdinov

Page 7:

The pipeline of machine visual perception


Low-level sensing → Pre-processing → Feature extraction → Feature selection → Inference: prediction, recognition

• Most critical for accuracy
• Accounts for most of the computation at test time
• Most time-consuming in the development cycle
• Often hand-crafted in practice

Most Efforts in Machine Learning

Page 8:

Computer vision features

SIFT, Spin image, HoG, RIFT, GLOH

Slide Courtesy: Andrew Ng

Page 9:

Deep Learning: learning features from data


Low-level sensing → Pre-processing → Feature extraction → Feature selection → Inference: prediction, recognition

Feature Learning

Machine Learning

Page 10:

Convolutional Neural Networks

Coding → Pooling → Coding → Pooling


Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
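The coding/pooling pipeline above can be sketched in a few lines of numpy: one 'valid' convolution followed by non-overlapping max pooling. The image and filter below are toy values, not the trained weights of LeCun et al.'s network.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2-D convolution (cross-correlation, strictly)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)  # toy "image"
edge = np.array([[1.0, -1.0]])                  # toy horizontal-difference filter
feat = max_pool(conv2d_valid(img, edge))        # coding, then pooling
```

Stacking such coding/pooling pairs is what gives the network its depth; pooling buys a degree of translation invariance at each stage.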

Page 11:

“Winter of Neural Networks” Since the 90s


Non-convex

Requires many tricks to get working

Hard to analyze theoretically

Page 12:

The Paradigm of Deep Learning

Page 13:

Deep Learning Since 2006


Neural networks enjoy a second spring!

Page 14:

Race on ImageNet (Top 5 Hit Rate)


72%, 2010

74%, 2011

Page 15:

The Best System on ImageNet (by 2012.10)

Local gradients (e.g., SIFT, HOG) → Pooling → Variants of sparse coding (LLC, Super-vector) → Pooling → Linear classifier

- This is a moderately deep model
- The first two layers are hand-designed

Page 16:

Challenge to Deep Learners


Key questions:

What if we use no hand-crafted features at all?

What if we use much deeper neural networks?

-- By Geoff Hinton

Page 17:

Answer from Geoff Hinton, 2012.10


72%, 2010

74%, 2011

85%, 2012

Page 18:

The Architecture

Slide Courtesy: Geoff Hinton

Page 19:

The Architecture


– 7 hidden layers not counting max pooling.

– Early layers are conv., last two layers globally connected.

– Uses rectified linear units in every layer.

– Uses competitive normalization to suppress hidden activities.

Slide Courtesy: Geoff Hinton
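Two ingredients named in the bullets, rectified linear units and competitive normalization across feature channels, can be sketched as follows. The normalization here follows the commonly quoted local-response-normalization recipe; the constants are illustrative, not necessarily the exact values used in the network.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(x, 0.0)

def local_response_norm(acts, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """Competitive normalization: each channel's activity is divided by a
    term that grows with the squared activities of nearby channels, so
    strong channels suppress their neighbors. Constants are illustrative."""
    c = acts.shape[0]
    out = np.empty_like(acts)
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        out[i] = acts[i] / (k + alpha * np.sum(acts[lo:hi] ** 2, axis=0)) ** beta
    return out

# Toy activations: 3 channels, 2 spatial positions each.
acts = relu(np.array([[-1.0, 2.0], [3.0, -4.0], [0.5, 0.5]]))
normed = local_response_norm(acts)
```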

Page 20:

Revolution on Speech Recognition


Word error rates from MSR, IBM, and the Google speech group

Slide Courtesy: Geoff Hinton

Page 21:

Deep Learning for NLP


Natural Language Processing (Almost) from Scratch, Collobert et al., JMLR 2011

Page 22:

Biological & Theoretical Justification

Page 23:

Why Hierarchy?

Theoretical: “…well-known depth-breadth tradeoff in circuits design [Hastad 1987]. This suggests many functions can be much more efficiently represented with deeper architectures…” [Bengio & LeCun 2007]

Biological: the visual cortex is hierarchical (Hubel-Wiesel model, 1981 Nobel Prize) [Thorpe]

Page 24:

Sparse DBN: Training on face images

pixels → edges → object parts (combinations of edges) → object models

Note: sparsity is important for these results
[Lee, Grosse, Ranganath & Ng, 2009]

Deep Architecture in the Brain

Retina: pixels
Area V1: edge detectors
Area V2: primitive shape detectors
Area V4: higher level visual abstractions

Page 25:

Sensor representation in the brain

[Roe et al., 1992; BrainPort; Welsh & Blasch, 1997]

Seeing with your tongue

Auditory cortex learns to see. (The same rewiring process also works for touch / somatosensory cortex.)

Page 26:

Deep Learning in Industry

Page 27:

Microsoft

First successful deep learning models for speech recognition, by MSR in 2010


Page 28:

“Google Brain” Project


Led by Google Fellow Jeff Dean

Published two papers, ICML 2012 and NIPS 2012

Company-wide large-scale deep learning infrastructure

Significant success on images, speech, …

Page 29:

Deep Learning @ Baidu

Started working on deep learning in summer 2012

Big success in speech recognition and image recognition; 6 products have been launched so far

Recently, very encouraging results on CTR

Meanwhile, efforts are under way in areas like LTR, NLP, …


Page 30:

Deep Learning Ingredients

Building blocks
– RBMs, autoencoder neural nets, sparse coding, …

Go deeper: layerwise feature learning
– Layer-by-layer unsupervised training
– Layer-by-layer supervised training

Fine tuning via backpropagation
– If data are big enough, direct fine tuning is enough

Sparsity on hidden layers is often helpful.
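The layerwise recipe above can be sketched with two tied-weight linear autoencoder layers trained greedily, the first on the raw input and the second on the first layer's codes. This is a toy illustration (linear units, made-up sizes and rates), not the full pretrain-plus-finetune pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_ae_layer(X, n_hidden, lr=0.01, epochs=300):
    """Train one tied-weight linear autoencoder layer by gradient descent
    on ||X - X W W^T||^2 and return (weights, codes). Purely illustrative."""
    W = rng.normal(0.0, 0.1, (X.shape[1], n_hidden))
    for _ in range(epochs):
        H = X @ W                      # encode
        E = H @ W.T - X                # reconstruction error matrix
        W -= lr * 2.0 * (X.T @ E @ W + E.T @ X @ W) / len(X)
    return W, X @ W

X = rng.normal(size=(64, 10))          # toy data
W1, H1 = train_ae_layer(X, 6)          # layer 1: unsupervised, on raw input
W2, H2 = train_ae_layer(H1, 3)         # layer 2: trained on layer-1 codes
err1 = np.mean((X - H1 @ W1.T) ** 2)   # layer-1 reconstruction error
```

In the full recipe the stacked layers would then be fine-tuned jointly with backpropagation on a supervised objective.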


Page 31:

Building Blocks of Deep Learning

Page 32:

CVPR 2012 Tutorial: Deep Learning for Vision

09.00am: Introduction Rob Fergus (NYU)

10.00am: Coffee Break

10.30am: Sparse Coding Kai Yu (Baidu)

11.30am: Neural Networks Marc’Aurelio Ranzato (Google)

12.30pm: Lunch

01.30pm: Restricted Boltzmann Machines Honglak Lee (Michigan)

02.30pm: Deep Boltzmann Machines Ruslan Salakhutdinov (Toronto)

03.00pm: Coffee Break

03.30pm: Transfer Learning Ruslan Salakhutdinov (Toronto)

04.00pm: Motion & Video Graham Taylor (Guelph)

05.00pm: Summary / Q & A All

05.30pm: End

Page 33:

Building Block 1 - RBM

Page 34:

Restricted Boltzmann Machine

Slide Courtesy: Russ Salakhutdinov

Page 35:

Restricted Boltzmann Machine


Restricted:

Slide Courtesy: Russ Salakhutdinov

Page 36:

Model Parameter Learning

Approximate maximum likelihood learning (i.i.d. data):

Slide Courtesy: Russ Salakhutdinov

Page 37:

Contrastive Divergence

Slide Courtesy: Russ Salakhutdinov
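Contrastive divergence can be sketched for a small binary RBM as follows: a CD-1 update takes one Gibbs step from the data and moves the parameters toward the data statistics and away from the reconstruction statistics. Data, sizes, and the learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM (toy sketch, not tuned).
    V: batch of visible vectors; W: weights; b, c: visible/hidden biases."""
    ph0 = sigmoid(V @ W + c)                      # P(h = 1 | v) on the data
    h0 = (rng.random(ph0.shape) < ph0) * 1.0      # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                   # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)
    W += lr * (V.T @ ph0 - pv1.T @ ph1) / len(V)  # positive - negative phase
    b += lr * np.mean(V - pv1, axis=0)
    c += lr * np.mean(ph0 - ph1, axis=0)
    return np.mean((V - pv1) ** 2)                # reconstruction error

proto = (rng.random((4, 6)) < 0.5) * 1.0          # four made-up binary patterns
V = proto[rng.integers(0, 4, 40)]                 # toy training set
W = rng.normal(0.0, 0.01, (6, 4))
b, c = np.zeros(6), np.zeros(4)
errs = [cd1_step(V, W, b, c) for _ in range(300)]
```

Reconstruction error is only a proxy (CD does not follow the exact likelihood gradient), but it is the quantity one typically monitors in practice.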

Page 38:

RBMs for Images

Slide Courtesy: Russ Salakhutdinov

Page 39:

Layerwise Pre-training

Slide Courtesy: Russ Salakhutdinov

Page 40:

DBNs for MNIST Classification

Slide Courtesy: Russ Salakhutdinov

Page 41:

Deep Autoencoders for Unsupervised Feature Learning

Slide Courtesy: Russ Salakhutdinov

Page 42:

Image Retrieval using Binary Codes


Searching Large Image Databases using Binary Codes

• Map images into binary codes for fast retrieval.

• Small Codes: Torralba, Fergus, Weiss, CVPR 2008
• Spectral Hashing: Y. Weiss, A. Torralba, R. Fergus, NIPS 2008
• Kulis and Darrell, NIPS 2009; Gong and Lazebnik, CVPR 2011
• Norouzi and Fleet, ICML 2011

Slide Courtesy: Russ Salakhutdinov

Page 43:

Building Block 2 – Autoencoder Neural Net

Page 44:

Autoencoder Neural Network

input → encoder → code → decoder → prediction

- input is higher dimensional than the code
- error: ||prediction - input||^2
- training: back-propagation

Slide Courtesy: Marc'Aurelio Ranzato

Page 45:

Sparse Autoencoder

input → encoder → code → decoder → prediction

- sparsity penalty: ||code||_1
- error: ||prediction - input||^2
- training: back-propagation
- loss: sum of squared reconstruction error and sparsity penalty

Slide Courtesy: Marc'Aurelio Ranzato

Page 46:

Sparse Autoencoder

A concrete tied-weight form (Le et al. “ICA with reconstruction cost..” NIPS 2011):

- input: x, code: h = W^T x
- loss: L(x; W) = ||W h - x||^2 + sum_j |h_j|

Slide Courtesy: Marc'Aurelio Ranzato
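The tied-weight sparse autoencoder objective above (h = W^T x, squared reconstruction error plus an L1 penalty on the code) can be written directly. The data, shapes, and the penalty weight `lam` below are illustrative additions, not values from the paper.

```python
import numpy as np

def sparse_ae_loss(W, X, lam=0.1):
    """Tied-weight sparse autoencoder objective in the spirit of
    Le et al., NIPS 2011: h = W^T x, loss = ||W h - x||^2 + lam * sum|h_j|.
    Rows of X are data vectors x^T; lam weighs the sparsity penalty."""
    H = X @ W                         # encode: h = W^T x
    recon = H @ W.T                   # decode: W h
    rec_err = np.mean(np.sum((recon - X) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(H), axis=1))
    return rec_err + lam * sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))          # toy inputs
W = rng.normal(0.0, 0.1, (8, 4))      # tied encoder/decoder weights
loss_sparse = sparse_ae_loss(W, X, lam=0.1)
loss_plain = sparse_ae_loss(W, X, lam=0.0)
```

Setting `lam=0` recovers the plain autoencoder of the previous slide; the penalty only adds a nonnegative term.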

Page 47:

Building Block 3 – Sparse Coding

Page 48:

Sparse coding

Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection).

Training: given a set of random patches x, learn a dictionary of bases [Φ1, Φ2, …].

Coding: for a data vector x, solve a LASSO problem to find the sparse coefficient vector a.
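The LASSO coding step can be sketched with iterative soft-thresholding (ISTA), one standard solver for this problem. The dictionary and signal below are synthetic, and the penalty `lam` is an arbitrary illustrative choice.

```python
import numpy as np

def ista_code(x, B, lam=0.1, n_iter=200):
    """Coding step: solve min_a 0.5*||x - B a||^2 + lam*||a||_1 by
    iterative soft-thresholding (ISTA)."""
    L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = a - B.T @ (B @ a - x) / L      # gradient step on the smooth part
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

rng = np.random.default_rng(0)
B = rng.normal(size=(16, 32))
B /= np.linalg.norm(B, axis=0)             # unit-norm bases
a_true = np.zeros(32)
a_true[[3, 10, 25]] = [0.8, 0.3, 0.5]      # a 3-sparse ground-truth code
x = B @ a_true                             # synthetic data vector
a = ista_code(x, B, lam=0.01)
```

The soft-threshold is what zeroes out most coefficients, producing the sparse code a.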

Page 49:

Sparse coding: training time

Input: images x1, x2, …, xm (each in R^d)

Learn: a dictionary of bases f1, f2, …, fk (also in R^d)

Alternating optimization:
1. Fix the dictionary f1, …, fk and optimize the codes a (a standard LASSO problem)
2. Fix the activations a and optimize the dictionary f1, …, fk (a convex QP problem)
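The alternating scheme above can be sketched end to end: step 1 codes all patches with ISTA, and step 2 refits the dictionary by least squares and renormalizes the bases (a simplification of the norm-constrained QP). Sizes, penalties, and iteration counts are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_codes(X, B, lam=0.1, n_iter=50):
    """Step 1: fix the dictionary B and solve the LASSO for all codes at
    once with ISTA (columns of X are data, columns of A are codes)."""
    L = np.linalg.norm(B, 2) ** 2
    A = np.zeros((B.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = soft(A - B.T @ (B @ A - X) / L, lam / L)
    return A

def update_dict(X, A, eps=1e-8):
    """Step 2: fix the codes A and refit B by least squares, then
    renormalize the bases (a simplification of the constrained QP)."""
    B = X @ A.T @ np.linalg.pinv(A @ A.T + eps * np.eye(A.shape[0]))
    return B / np.maximum(np.linalg.norm(B, axis=0), eps)

X = rng.normal(size=(16, 100))                 # toy "patches", one per column
B = rng.normal(size=(16, 24))
B /= np.linalg.norm(B, axis=0)
for _ in range(10):                            # the alternating optimization
    A = sparse_codes(X, B)
    B = update_dict(X, A)
err = float(np.mean((X - B @ sparse_codes(X, B)) ** 2))
```

On real image patches (rather than Gaussian noise) this kind of alternation is what produces the edge-like bases shown on the next slides.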

Page 50:

Sparse coding: testing time

Input: a novel image patch x (in R^d) and the previously learned fi's

Output: representation [ai,1, ai,2, …, ai,k] of image patch xi

xi ≈ 0.8 * … + 0.3 * … + 0.5 * … (basis images not shown)

Represent xi as: ai = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, …]

Page 51:

Sparse coding illustration

Natural Images Learned bases (f1 , …, f64): “Edges”

x ≈ 0.8 * f36 + 0.3 * f42 + 0.5 * f63

[a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0] (feature representation)

Test example

Compact & easily interpretable

Slide credit: Andrew Ng

Page 52:

RBM & autoencoders

- also involve activation and reconstruction
- but have an explicit f(x)
- do not necessarily enforce sparsity on a
- but if sparsity is imposed on a, results often improve
  [e.g. sparse RBM, Lee et al. NIPS 08]

encoding: a = f(x); decoding: x' = g(a)

Page 53:

Sparse coding: A broader view

Any feature mapping from x to a, i.e. a = f(x), where
- a is sparse (and often of higher dimension than x)
- f(x) is nonlinear
- a reconstruction x' = g(a) exists such that x' ≈ x

encoding: a = f(x); decoding: x' = g(a)

Therefore sparse RBMs, sparse auto-encoders, and even VQ can be viewed as forms of sparse coding.

Page 54:

Example of sparse activations (sparse coding)

• different x's have different dimensions activated
• locally-shared sparse representation: similar x's tend to have similar non-zero dimensions

a1 [ 0 | | 0 0 … 0 ]
a2 [ | | 0 0 0 … 0 ]
a3 [ | 0 | 0 0 … 0 ]
am [ 0 0 0 | | … 0 ]

Page 55:

Example of sparse activations (sparse coding)

• another example: preserving manifold structure
• more informative in highlighting richer data structures, e.g. clusters and manifolds

a1 [ | | 0 0 0 … 0 ]
a2 [ 0 | | 0 0 … 0 ]
a3 [ 0 0 | | 0 … 0 ]
am [ 0 0 0 | | … 0 ]

Page 56:

Sparsity vs. Locality

sparse coding vs. local sparse coding

• Intuition: similar data should get similar activated features
• Local sparse coding:
  • data in the same neighborhood tend to have shared activated features;
  • data in different neighborhoods tend to have different features activated.

Page 57:

Sparse coding is not always local: example

Case 1: independent subspaces
• Each basis is a “direction”
• Sparsity: each datum is a linear combination of only several bases.

Case 2: data manifold (or clusters)
• Each basis is an “anchor point”
• Sparsity: each datum is a linear combination of neighboring anchors.
• Here sparsity is caused by locality.

Page 58:

Two approaches to local sparse coding

Approach 1: coding via local anchor points (local coordinate coding)
Approach 2: coding via local subspaces (super-vector coding)

- Nonlinear learning using local coordinate coding, Kai Yu, Tong Zhang, and Yihong Gong. In NIPS 2009.
- Learning locality-constrained linear coding for image classification, Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang. In CVPR 2010.
- Image Classification using Super-Vector Coding of Local Image Descriptors, Xi Zhou, Kai Yu, Tong Zhang, and Thomas Huang. In ECCV 2010.
- Large-scale Image Classification: Fast Feature Extraction and SVM Training, Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, Liangliang Cao, Thomas Huang. In CVPR 2011.

Page 59:

Two approaches to local sparse coding

Approach 1: coding via local anchor points (local coordinate coding)
Approach 2: coding via local subspaces (super-vector coding)

- Sparsity achieved by explicitly ensuring locality
- Sound theoretical justifications
- Much simpler to implement and compute
- Strong empirical success

Page 60:

Hierarchical Sparse Coding

Setup: data x_1, …, x_n ∈ R^d, stacked as X = [x_1 x_2 ⋯ x_n] ∈ R^{d×n}; first-layer dictionary B ∈ R^{d×p}; second-layer dictionary Φ = (φ_1 φ_2 ⋯ φ_q) ∈ R_+^{p×q}. Jointly optimize the codes W = (w_1 w_2 ⋯ w_n) ∈ R^{p×n} and the second-layer weights α ∈ R^q:

  (W, α) = argmin_{W, α} L(W, α) + (λ_1/n) ||W||_1 + γ ||α||_1,  subject to α ≥ 0,

where

  L(W, α) = (1/n) Σ_{i=1}^n [ (1/2) ||x_i − B w_i||^2 + λ_2 w_i^T Ω(α) w_i ],

  Ω(α) ≡ ( Σ_{k=1}^q α_k diag(φ_k) )^{-1}.

The ℓ1 penalty makes each w_i sparse; with λ_2 = 0 the model reduces to ordinary sparse coding, while λ_2 > 0 couples the codes through α, giving w_i a Gaussian-like prior with covariance Σ(α) = Ω(α)^{-1}. With the empirical second moment

  S(W) ≡ (1/n) Σ_{i=1}^n w_i w_i^T ∈ R^{p×p},

the loss can be rewritten as

  L(W, α) = (1/(2n)) ||X − B W||_F^2 + (λ_2/n) tr( S(W) Ω(α) ).

Yu, Lin, & Lafferty, CVPR 11

Page 61:

Hierarchical Sparse Coding on MNIST

HSC vs. CNN: HSC provides even better performance than CNN; more amazingly, HSC learns its features in an unsupervised manner!

Yu, Lin, & Lafferty, CVPR 11

Page 62:

Second-layer dictionary

A hidden unit in the second layer is connected to a unit group in the 1st layer: invariance to translation, rotation, and deformation

Yu, Lin, & Lafferty, CVPR 11

Page 63:

Adaptive Deconvolutional Networks for Mid and High Level Feature Learning

Hierarchical convolutional sparse coding.
Trained with respect to the image from all layers (L1-L4).
Pooling both spatially and amongst features.
Learns invariant mid-level features.

Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 2011

[figure: image → L1-L4 feature maps, with selected feature groups at each layer]

Page 64:

Recap: Deep Learning Ingredients

Building blocks
– RBMs, autoencoder neural nets, sparse coding, …

Go deeper: layerwise feature learning
– Layer-by-layer unsupervised training
– Layer-by-layer supervised training

Fine tuning via backpropagation
– If data are big enough, direct fine tuning is enough

Sparsity on hidden layers is often helpful.


Page 65:

Layer-by-layer unsupervised training + fine tuning

Semi-Supervised Learning

Loss = supervised_error + unsupervised_error

Weston et al. “Deep learning via semi-supervised embedding”, ICML 2008

Slide Courtesy: Marc'Aurelio Ranzato
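The combined objective on this slide, Loss = supervised_error + unsupervised_error, can be written as one function. The concrete terms below are a simplified stand-in for the embedding criterion of Weston et al.: a squared error for the supervised part and a reconstruction error through shared features for the unsupervised part; all shapes, data, and the weight `alpha` are illustrative.

```python
import numpy as np

def semi_supervised_loss(W, v, X_lab, y, X_unlab, alpha=0.5):
    """Loss = supervised_error + alpha * unsupervised_error.
    W: shared feature weights; v: predictor on top of the features."""
    supervised = np.mean((X_lab @ W @ v - y) ** 2)   # on labeled data
    H = X_unlab @ W                                  # shared feature layer
    unsupervised = np.mean((H @ W.T - X_unlab) ** 2) # on unlabeled data
    return supervised + alpha * unsupervised

rng = np.random.default_rng(0)
X_lab, y = rng.normal(size=(10, 5)), rng.normal(size=10)
X_unlab = rng.normal(size=(40, 5))      # the unlabeled pool is larger
W, v = rng.normal(0.0, 0.1, (5, 3)), rng.normal(size=3)
total = semi_supervised_loss(W, v, X_lab, y, X_unlab)
```

The point of the combined loss is that the plentiful unlabeled data regularize the shared features that the scarce labeled data alone would overfit.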

Page 66:

Layer-by-Layer Supervised Training


Page 67:

Geoff Hinton, ICASSP 2013

Page 68:

Large-scale training: Stochastic vs. batch

Page 69:

The Challenge

A large-scale problem has:
– lots of training samples (>10M)
– lots of classes (>10K)
– lots of input dimensions (>10K)

The challenge:
– the best optimizer in practice is online SGD, which is naturally sequential and hard to parallelize
– layers cannot be trained independently and in parallel, so training is hard to distribute
– the model can have lots of parameters that may clog the network, hard to distribute across machines

Slide Courtesy: Marc'Aurelio Ranzato

Page 70:

A solution by model parallelism

MODEL PARALLELISM: split the model across machines (1st, 2nd, 3rd machine), each holding one slice of the network.

MODEL PARALLELISM + DATA PARALLELISM: distributed deep nets run several model replicas on different inputs (input #1, #2, #3) at once.

Le et al. “Building high-level features using large-scale unsupervised learning”, ICML 2012

Slide Courtesy: Marc'Aurelio Ranzato

Page 71:

Distributed Deep Nets: MODEL PARALLELISM + DATA PARALLELISM

Le et al. “Building high-level features using large-scale unsupervised learning”, ICML 2012

Slide Courtesy: Marc'Aurelio Ranzato

Page 72:

Asynchronous SGD

PARAMETER SERVER, with 1st, 2nd, and 3rd replicas. A replica pushes its gradient ∂L/∂θ_1 to the server…

Slide Courtesy: Marc'Aurelio Ranzato
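Asynchronous SGD with a parameter server can be sketched in a single process with threads: each replica repeatedly pulls possibly stale parameters, computes a gradient on its own data shard, and pushes an update without waiting for the others. This is a toy, not the multi-machine system of Le et al.; the regression problem and all constants are synthetic.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(300, 3))
y = X @ w_true                       # noiseless targets for the toy problem

params = np.zeros(3)                 # the "parameter server" state
lock = threading.Lock()              # keeps individual updates atomic

def replica(Xs, ys, seed, lr=0.05, steps=300):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        w = params.copy()            # pull (possibly stale) parameters
        i = r.integers(len(ys))
        grad = 2.0 * (Xs[i] @ w - ys[i]) * Xs[i]   # one-sample gradient
        with lock:
            params[:] = params - lr * grad         # push update to server

shards = np.array_split(np.arange(len(y)), 3)      # one data shard per replica
threads = [threading.Thread(target=replica, args=(X[s], y[s], k))
           for k, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
err = float(np.linalg.norm(params - w_true))
```

The staleness (replicas acting on slightly outdated parameters) is exactly the trade-off asynchronous SGD accepts in exchange for never blocking.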

Page 73:

Asynchronous SGD (continued): …the parameter server applies the update and sends refreshed parameters θ_1 back to the replica…

Slide Courtesy: Marc'Aurelio Ranzato

Page 74:

Asynchronous SGD (continued): …another replica pushes its gradient ∂L/∂θ_2 to the parameter server…

Slide Courtesy: Marc'Aurelio Ranzato

Page 75:

Asynchronous SGD (continued): …and receives updated parameters θ_2 back.

Slide Courtesy: Marc'Aurelio Ranzato

Page 76:

Training A Model with 1B Parameters: Unsupervised Learning

Deep Net:
– 3 stages
– each stage consists of local filtering, L2 pooling, and LCN
  - 18x18 filters
  - 8 filters at each location
  - L2 pooling and LCN over 5x5 neighborhoods
– training jointly the three layers by:
  - reconstructing the input of each layer
  - sparsity on the code

Le et al. “Building high-level features using large-scale unsupervised learning”, ICML 2012

Slide Courtesy: Marc'Aurelio Ranzato
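The per-stage operations named above, L2 pooling and local contrast normalization (LCN), can be sketched in numpy over a single feature map. The window size follows the slide's 5x5 neighborhoods; the LCN variant and constants here are a simplification, not necessarily the exact form used in the paper.

```python
import numpy as np

def l2_pool(fmap, size=5):
    """L2 pooling: the square root of the sum of squares over each local
    window (stride 1, 'valid')."""
    oh, ow = fmap.shape[0] - size + 1, fmap.shape[1] - size + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sqrt(np.sum(fmap[i:i+size, j:j+size] ** 2))
    return out

def lcn(fmap, size=5, eps=1e-4):
    """Local contrast normalization: subtract the local mean, then divide
    by the (floored) local standard deviation."""
    r = size // 2
    padded = np.pad(fmap, r, mode="reflect")
    out = np.empty_like(fmap)
    for i in range(fmap.shape[0]):
        for j in range(fmap.shape[1]):
            win = padded[i:i+size, j:j+size]
            out[i, j] = (fmap[i, j] - win.mean()) / max(win.std(), eps)
    return out

fmap = np.random.default_rng(0).normal(size=(12, 12))  # toy feature map
stage_out = lcn(l2_pool(fmap))                         # one pooling + LCN stage
```

L2 pooling gives a smooth, slightly invariant summary of local activity, and LCN removes local mean and contrast so that later layers see normalized inputs.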

Page 77:

Concluding Remarks

We see the promise of deep learning

Big success on speech, vision, and many other areas

Large-scale training is the key challenge

More success stories are coming …
