Modeling deep structures for using high …Deep learning MIT Tech Review Top 10 Breakthroughs 2013...

Modeling deep structures for using high performance images Wanli Ouyang (欧阳万里) The University of Sydney


Page 1:

Modeling deep structures for using high performance images

Wanli Ouyang (欧阳万里)

The University of Sydney

Page 2:

Outline

• Introduction
• Effectively using high performance images
– Feature fusion: Pedestrian detection (CVPR’17)
– Structured features: Scene parsing and depth estimation (CVPR’18)
– Structured samples: 3D human pose estimation (CVPR’18)
• Back-bone model design
• Conclusion

Page 3:

Outline

• Introduction
• Effectively using high performance images
– Pedestrian detection (CVPR’17)
– Scene parsing and depth estimation (CVPR’18)
– 3D human pose estimation (CVPR’18)
• Back-bone model design
– Image/video classification (CVPR’18, NIPS’18)
• Conclusion

Page 4:

Deep learning

• MIT Tech Review Top 10 Breakthroughs 2013, ranking No. 1
• Hinton won the ImageNet competition
– Classify 1.2 million images into 1,000 categories
– Beating existing computer vision methods by 20+%
– Surpassing human performance

Hold records on most of the computer vision problems

Web-scale visual search, self-driving cars, surveillance, multimedia …

Simulate brain activities and employ millions of neurons to fit billions of training samples. Deep neural networks are trained with GPU clusters with tens of thousands of processors

Page 5:

Performance vs practical need

Conventional model

Deep model Very Deep model

Very deep structured learning

Many other applications

Face recognition

Page 6:

Structure in neurons

• Conventional neural networks

– Neurons in the same layer have no connections

– Neurons in adjacent layers are fully connected, at least within a local region

Structure exists in brain

Page 7:

Structure in data

Page 8:

Structure in data

Page 9:

Structure in data

Correlation

Page 10:

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
• Conclusion

Page 11:

Effectively using high performance imaging data

• Training: multi-modal data
• Deployment: RGB only

Image from https://research.csiro.au/data61/high-performance-imaging/

Page 12:

Outline

• Introduction
• Effectively using high performance images
– Feature fusion: Pedestrian detection (CVPR’17)
– Structured features: Scene parsing and depth estimation (CVPR’18)
– Structured samples: 3D human pose estimation (CVPR’18)
• Conclusion

Page 13:

Outline

• Introduction
• Effectively using high performance images
– Feature fusion: Pedestrian detection (CVPR’17)
• Conclusion

Page 14:

Motivation

• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations

[Figure: RGB and thermal images of hard positive samples and hard negative samples]

Dan Xu, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, Nicu Sebe, “Learning Cross-Modal Deep Representations for Robust Pedestrian Detection”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Hawaii, USA

Page 15:

Motivation

• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations
• Can we transfer the learned cross-modal representations when thermal data is missing?

[Figure: cross-modality transfer. Deep reconstruction: a reconstruction network takes RGB data and reconstructs thermal data. Deep detection: a detection network takes RGB data only.]

Page 16:

Approach

• RRN (Region Reconstruction Network)
– RGB domain to thermal domain
– weakly supervised reconstruction
– region-based instead of frame-level based
• MSDN (Multi-Scale Detection Network)
– cross-modal multi-scale feature fusion
– the parameters of the subnetwork in the yellow box are transferred from RRN

Page 17:

Results - Caltech

• Demonstrated the effectiveness of the learned cross-modal representations
• Achieved superior detection performance

Method        Average miss rate
CMT-CNN-SA    13.76%
CMT-CNN       10.69%

Page 18:

Results - KAIST

[Bar chart, values read from the data labels; the assignment of the three series to All, Day and Night is inferred from the label order]

Method                      All      Day      Night
CMT-CNN-SA                  54.26%   52.44%   58.97%
CMT-CNN-SA-SB (Random)      56.76%   54.83%   61.24%
CMT-CNN-SA-SB (ImageNet)    52.15%   50.71%   57.65%
CMT-CNN                     49.55%   47.30%   54.78%

• Demonstrated the effectiveness of the learned cross-modal representations
• Achieved superior detection performance

Page 19:

Qualitative results

ACF

w/o reconstruction network

With reconstruction network

Page 20:

Effectively using high performance images

Feature fusion, Structured features

Page 21:

Motivation

[Figure: an RGB image as input, with segmentation and depth estimation as the outputs to be predicted]

Dan Xu, Wanli Ouyang, Xiaogang Wang, Nicu Sebe, “PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018)

Page 22:

Motivation

Multi-task learning?

- Directly optimizing multiple tasks given input training data does not guarantee consistent gain on all the tasks

[Figure: RGB input, with segmentation and depth estimation outputs]

Page 23:

Motivation

- Multi-modal input data improves the training of deep networks
- Can we facilitate the final tasks by leveraging multiple intermediate predictions while requiring only single-modal input data?

[Figure: RGB → CNN → intermediate predictions (depth, surface normal, semantic, contour) → CNN → final tasks (segmentation, depth estimation)]

Page 24:

Approach

Illustration of the proposed multi-task distillation network for simultaneous depth estimation and scene parsing

[Figure: RGB → Front-End Encoder → Multi-Task Predictions → Multi-Task Distillation → Decoder (DECONV layers) → Output, with labels L1 to L6]

Page 25:

Approach

• Different multi-task distillation modules:
– Naive implementation via feature concatenation
– Passing messages between feature maps
– Attention-guided message passing module
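The three distillation modules listed above (feature concatenation, message passing, attention-guided message passing) can be sketched on toy feature vectors. This is only an illustration under stated assumptions, not the PAD-Net implementation: the function names are hypothetical, a scalar `weight` stands in for the learned convolutions, and the sigmoid gate computed from the main-task feature is an assumed form of the attention.

```python
import math


def distill_concat(main, others):
    """Feature concatenation: stack the main feature with all other task features."""
    fused = list(main)
    for feat in others:
        fused.extend(feat)
    return fused


def distill_message(main, others, weight=0.5):
    """Message passing: add a weighted message from every other task feature."""
    out = list(main)
    for feat in others:
        out = [o + weight * f for o, f in zip(out, feat)]
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def distill_attention(main, others, weight=0.5, gate_weight=1.0):
    """Attention-guided message passing: a gate in (0, 1), computed from the
    main feature, scales each incoming message element-wise."""
    gate = [sigmoid(gate_weight * m) for m in main]
    out = list(main)
    for feat in others:
        out = [o + g * weight * f for o, g, f in zip(out, gate, feat)]
    return out


# Toy features for one pixel (hypothetical values).
depth_feat = [0.0, 2.0]
normal_feat = [1.0, 1.0]
contour_feat = [4.0, -2.0]

fused_concat = distill_concat(depth_feat, [normal_feat, contour_feat])
fused_message = distill_message(depth_feat, [normal_feat, contour_feat])
fused_attention = distill_attention(depth_feat, [normal_feat, contour_feat])
```

The gated variant lets the network suppress messages that do not help the target task, which is one plausible reading of why attention helps.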

Page 26:

Results for scene parsing on Cityscapes

[Bar chart, values read from the data labels]

Method                                 IoU
Front-end + DE (baseline)              0.725
Front-end + DE + SP (baseline)         0.723
PAD-Net (Distillation A + SP)          0.736
PAD-Net (Distillation B + SP)          0.748
PAD-Net (Distillation C + SP)          0.756
PAD-Net (Distillation C + DE + SP)     0.761

• Multi-task learning results in a decrease of IoU accuracy
• Distillation improves accuracy
• Message passing with attention performs better

Page 27:

Results with Different Front-End CNNs on NYUD-V2

Results

Page 28:

Results with Different Front-End CNNs on NYUD-V2

Results

[Figure: qualitative results, with rows labeled Estimation and GT]

Page 29:

Results

[Figure: qualitative results, with rows labeled Estimation and GT]

Page 30:

Effectively using high performance images

Feature fusion, Structured features, Structured samples

Page 31:

Challenges: No Annotation

[Figure: constrained scenes vs. in-the-wild scenes]

• No annotation
• Domain discrepancy

Page 32:

Which one is more plausible?

Discriminator

Page 33:

Weakly Supervised Adversarial Learning

[Figure: a 3D dataset and images without ground truth (w/o GT) are fed to the generator G, the 3D human pose estimator. The multi-source discriminator D receives predictions and ground truth and labels them as real or fake.]

Page 34:

Adversarial Learning

• Generator: tries to fool the discriminator; Loss_G includes a Euclidean loss
• Discriminator: tries to tell predictions from ground truth; Loss_D is a classification loss
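The interplay of the two losses can be sketched as follows. This is a minimal illustration, not the authors' training code: the binary cross-entropy form of the adversarial terms, the `adv_weight` value, and the convention of passing `target=None` for images without ground truth are all assumptions made for the example.

```python
import math


def euclidean_loss(pred, target):
    """Mean squared difference between predicted and ground-truth values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob, label):
    """Binary cross-entropy for a single probability in (0, 1)."""
    eps = 1e-12
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


def discriminator_loss(d_real, d_fake):
    """'Tell': ground-truth poses should score 1, predicted poses 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(pred, target, d_fake, adv_weight=0.1):
    """'Fool': Euclidean loss where ground truth exists, plus a term that is
    small when the discriminator scores the prediction as real."""
    adversarial = bce(d_fake, 1.0)
    if target is None:  # image without ground truth: adversarial term only
        return adv_weight * adversarial
    return euclidean_loss(pred, target) + adv_weight * adversarial


# Hypothetical discriminator outputs and pose values.
loss_d = discriminator_loss(d_real=0.9, d_fake=0.2)
loss_g_labeled = generator_loss([1.0, 2.0], [1.0, 4.0], d_fake=0.2)
loss_g_unlabeled = generator_loss([1.0, 2.0], None, d_fake=0.2)
```

Under this sketch, unannotated in-the-wild images still provide a training signal to the generator through the adversarial term alone.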


Page 35:

Generator

[Figure: generator architecture. A 256x256 input passes through Conv and Residual blocks (128 × 128, 64 × 64) into stacked Hourglass modules (Stack 1 to Stack n). The 2D module outputs 2D score maps and the depth module outputs depth, which together give the 3D poses.]

Page 36:

Discriminator

Page 37:

Multi-Source Discriminator

[Figure: three input sources, each processed by a CNN: the image I; the geometric descriptor computed from the pose P, made of [Δx, Δy, Δz] and [Δx², Δy², Δz²]; and the raw poses as 2D heatmaps and depth maps. The resulting features are concatenated and passed through fully connected layers, which classify the sample as real or fake. Dimension labels on the figure: 64, 64, 256.]
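The geometric descriptor built from [Δx, Δy, Δz] and [Δx², Δy², Δz²] can be sketched as pairwise joint differences. This is an illustration under assumptions: the function name is hypothetical, and using each unordered joint pair once (rather than a full joint-by-joint matrix) is a simplification chosen for the example.

```python
def geometric_descriptor(pose):
    """Compute pairwise relative offsets and their squares for a 3D pose.

    pose: list of (x, y, z) joint coordinates.
    Returns one 6-tuple (dx, dy, dz, dx^2, dy^2, dz^2) per unordered joint pair.
    """
    descriptor = []
    for i in range(len(pose)):
        for j in range(i + 1, len(pose)):
            dx = pose[i][0] - pose[j][0]
            dy = pose[i][1] - pose[j][1]
            dz = pose[i][2] - pose[j][2]
            descriptor.append((dx, dy, dz, dx * dx, dy * dy, dz * dz))
    return descriptor


# Toy pose with three joints (hypothetical coordinates).
toy_pose = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (1.0, 0.0, 0.0)]
toy_descriptor = geometric_descriptor(toy_pose)
```

Relative offsets of this kind do not change when the whole pose is translated, so the discriminator can judge body configuration independently of where the person stands.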

Page 38:

Effectiveness of Adversarial Learning

Page 39:

Ablation Study on H36M Dataset

MPJPE (mean per joint position error, in mm) on H36M

[Bar chart, values read from the data labels]

Method                                            MPJPE (mm)
State-of-art* (Zhou et al. ICCV’17)               64.9
Baseline (fix 2D, finetune depth)                 65.2
Baseline (jointly learn 2D + depth)               64.8
Ours, Pose (Image+Pose)                           61.3
Ours, Geo (Image+Geo)                             60.3
Ours, Full (Image+Pose+Geo)                       59.7

8% less error

*Zhou et al. ICCV’17
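As a reminder of what the metric measures, a minimal sketch of mean per joint position error follows. The toy joints are hypothetical, and the last line assumes that 64.9 mm and 59.7 mm on the slide are the state-of-the-art and full-model results, which is how the "8% less error" note reads.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between matching joints.

    pred, gt: lists of (x, y, z) joints in the same unit (e.g. mm).
    """
    distances = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(distances) / len(distances)


# Toy example with two joints: the first is off by 5 mm, the second is exact.
pred = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
gt = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
error_mm = mpjpe(pred, gt)

# Relative reduction between the two values quoted on the slide.
reduction = (64.9 - 59.7) / 64.9
```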

Page 40:

Results on Images in the Wild

[Figure: qualitative comparison, with rows labeled baseline and Ours]

Page 41:

Multi-view Results

Page 42:

Outline

• Introduction
• Effectively using high performance images
– Feature fusion: Pedestrian detection (CVPR’17)
– Structured features: Scene parsing and depth estimation (CVPR’18)
– Structured samples: 3D human pose estimation (CVPR’18)
• Conclusion

Page 43:

Does structure exist only for specific tasks?

Page 44:

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
– FishNet (NIPS’18)
– Optical flow guided feature (CVPR’18)

Page 45:

Low-level and high-level features

Image from Andrew Ng’s slides

Page 46:

Current CNN Structures

Image classification: summarize high-level semantic information of the whole image.

Detection/segmentation: high-level semantic meaning with high spatial resolution. These structures are called U-Net, Hourglass, or Conv-deconv.

Page 47:

Architectures designed for different granularities are DIVERGING

Page 48:

Unify the advantages of networks for pixel-level, region-level, and image-level tasks

Page 49:

Hourglass for Classification

Directly applying hourglass for classification? Poor performance.

So what is the problem?

• Different tasks require different resolutions of features

Our design

Features with high-level semantics and high resolution are good

• Down-sample high-level features with high resolution

Page 50:

Hourglass for Classification

[Figure: three blocks side by side.
Normal Res-Block: 1 × 1, c_in → 3 × 3, c_in → 1 × 1, c_out; output c_out.
Res-Block for down/up sampling: 1 × 1, c_in → 3 × 3, c_in → 1 × 1, c_out, plus a further 1 × 1, c_out convolution, stride = 2; output c_out.
Our design: 1 × 1, c_in → 3 × 3, c_in → 1 × 1, c_in, with up/down sampling and concatenation (C) with low-level features of c_out − c_in channels; output c_out.]

The 1 × 1 convolution layer in yellow indicates the isolated convolution.

• Hourglass may bring more isolated convolutions than ResNet
• Different tasks require different resolutions of features

Page 51:

Observation and design

Our observation

1. Diverged structures for tasks requiring different resolutions.
2. Isolated convolution blocks the direct back-propagation.
3. Features with different depths are not fully explored, or are mixed but not preserved.

Solution

1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks.
2. Design a network that does not need isolated convolution.
3. Features from varying depths are preserved and refined by each other.

Bharath Hariharan, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR’15.

Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." ECCV’16.

Page 52:

Difference between mix and preserve and refine

[Figure: two panels. Mixed features: high-level and low-level features are summed (+). Preserve and refine: high-level and low-level features are both kept and refined through message generation (M).]

Page 53:

FishNet: Overview

[Figure: the network is divided into Fish Tail, Fish Body and Fish Head. Feature resolutions along the network: 224x224 … 56x56, 28x28, 14x14, 7x7, 14x14, 28x28, 56x56, 28x28, 14x14, 7x7, 1x1. Legend: features in the tail part, features in the body part, features in the head part, residual blocks, Concat.]

Page 54:

FishNet: Preservation & Refinement

[Figure: Fish Tail, Fish Body and Fish Head linked by regular connections, transferring blocks T(⋅), UR blocks and DR blocks.]

• Up-sampling and Refinement (UR) blocks: features from the tail and from the body are concatenated (Concat) and pass through M(⋅); r(⋅) sums up every k adjacent channels; up(⋅) is nearest neighbor up-sampling
• Down-sampling and Refinement (DR) blocks: features from the body and from the head are concatenated (Concat) and pass through M(⋅); down(⋅) is 2 × 2 max-pooling
• Features from varying depths refine each other here
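The parameter-free operations named on this slide (summing every k adjacent channels, nearest neighbor up-sampling, 2 × 2 max-pooling, and concatenation) can be sketched on nested lists. This is a toy illustration with hypothetical function names, not the FishNet code; it leaves out the learned M(⋅) and T(⋅) blocks and does not claim the exact order in which the real blocks compose these steps.

```python
def reduce_channels(feature, k):
    """r(.): sum every k adjacent channels. feature is a list of 2D channels."""
    assert len(feature) % k == 0, "channel count must be divisible by k"
    reduced = []
    for start in range(0, len(feature), k):
        group = feature[start:start + k]
        height, width = len(group[0]), len(group[0][0])
        reduced.append([[sum(ch[y][x] for ch in group) for x in range(width)]
                        for y in range(height)])
    return reduced


def upsample_nearest(channel, factor=2):
    """up(.): nearest neighbor up-sampling of one 2D channel."""
    return [[channel[y // factor][x // factor]
             for x in range(len(channel[0]) * factor)]
            for y in range(len(channel) * factor)]


def max_pool_2x2(channel):
    """down(.): 2 x 2 max-pooling with stride 2 on one 2D channel."""
    return [[max(channel[y][x], channel[y][x + 1],
                 channel[y + 1][x], channel[y + 1][x + 1])
             for x in range(0, len(channel[0]) - 1, 2)]
            for y in range(0, len(channel) - 1, 2)]


def concat(features_a, features_b):
    """Concat: keep both sets of channels side by side instead of adding them."""
    return features_a + features_b


# Four 1x1 channels reduced with k = 2, then a 2x2 channel up- and down-sampled.
four_channels = [[[1]], [[2]], [[3]], [[4]]]
two_channels = reduce_channels(four_channels, k=2)
small = [[1, 2], [3, 4]]
large = upsample_nearest(small)
pooled = max_pool_2x2(large)
merged = concat(two_channels, [[[9]]])
```

Because none of these operations has weights, changing resolution or channel count this way adds no convolution on the path between features of different depths.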

Page 55:

FishNet: Performance-ImageNet

[Figure: two plots of Top-1 error on ImageNet, one against parameters (× 10^6) and one against FLOP (× 10^9), comparing FishNet and ResNet. Data labels, with the values in parentheses as shown on the slide:
FishNet: 22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
ResNet: 23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)]

Code: https://github.com/kevin-ssy/FishNet

Page 56:

FishNet: Performance-ImageNet

[Figure: Top-1 error on ImageNet against parameters (× 10^6), comparing FishNet, DenseNet and ResNet. Data labels, with the values in parentheses as shown on the slide:
FishNet: 22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
DenseNet: 22.58% (6.35%), 22.20% (6.20%), 22.15% (6.12%), 21.20%
ResNet: 23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)]

Code: https://github.com/kevin-ssy/FishNet

Page 57:

FishNet: Performance on COCO Detection

[Figure: four bar charts (AP, AP-small, AP-medium, AP-large) comparing the backbones R-50, RX-50 and Fish-150]

Code: https://github.com/kevin-ssy/FishNet

Page 58:

FishNet: Performance on COCO Instance Segmentation

[Figure: four bar charts (AP, AP-small, AP-medium, AP-large) comparing the backbones R-50, RX-50 and Fish-150]

Code: https://github.com/kevin-ssy/FishNet

Page 59:

Winning COCO 2018 Instance Segmentation Task

Page 60:

Visualization

Page 61:

Visualization

Page 62:

Visualization

Page 63:

Visualization

Page 64:

Visualization

Page 65:

Codebase

• Comprehensive
– RPN
– Fast/Faster R-CNN
– Mask R-CNN
– FPN
– Cascade R-CNN
– RetinaNet
– More …
• High performance
– Better performance
– Optimized memory consumption
– Faster speed
• Handy to develop
– Written with PyTorch
– Modular design

GitHub: mmdet

Page 66:

FishNet: Advantages

1. Better gradient flow to shallow layers

2. High-resolution features contain rich low-level and high-level semantics

3. Features from varying depths are preserved and refined by each other

Page 67:

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
– FishNet (NIPS’18)
– Optical flow guided feature (CVPR’18)

Page 68:

Action Recognition

• Recognize action from videos

Page 69:

Optical flow in Action Recognition

• Motion is important information
• Optical flow
– Effective
– Time consuming

Modality              Acc.     Speed (fps)
RGB                   85.5%    680
RGB + Optical Flow    94.0%    14

We need a better motion representation

Page 70:

Optical flow guided feature

Page 71:

Optical flow guided feature

{vx , vy} = optical flow

Coefficient for optical flow:

Optical flow:

Intuitive Inspiration

Page 72:

Optical flow guided feature

Feature flow:

{ } = feature flow
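The equations on these two slides did not survive extraction. As a reminder of the standard reasoning they appear to follow, the block below writes out the brightness constancy constraint and its feature-level analogue. It is a sketch of the textbook derivation, under the assumption that the slides use this form; the symbols I (image intensity) and f (a feature map) are introduced here for illustration.

```latex
\begin{align}
% Brightness constancy: a point keeps its intensity as it moves.
I(x, y, t) &= I(x + \Delta x,\; y + \Delta y,\; t + \Delta t) \\
% First-order expansion, divided by \Delta t, with (v_x, v_y) the optical flow.
\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y
  + \frac{\partial I}{\partial t} &= 0 \\
% The same constraint applied to a feature map f instead of the image.
\frac{\partial f}{\partial x} v_x + \frac{\partial f}{\partial y} v_y
  + \frac{\partial f}{\partial t} &= 0
\end{align}
```

In this form the vector of spatial and temporal derivatives is orthogonal to (v_x, v_y, 1), so the derivatives carry motion information without the flow itself having to be computed.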

Page 73:

Optical flow guided feature

[Figure: network overview. Frame t and Frame t + ∆t each pass through a feature generation sub-network. Feature t and Feature t + ∆t at resolutions K*K, K/2*K/2 and K/4*K/4 are fed to OFF units, whose outputs are concatenated and passed to the classification sub-network.
OFF unit: 1 × 1 Conv on each of the two features, followed by Sobel (S) and Subtract operations, then Concat (C).]
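The Sobel, Subtract and Concat steps of the OFF unit can be sketched on a single feature channel. This is a toy illustration with hypothetical function names, not the released implementation: it leaves out the 1 × 1 convolutions, assumes zero padding at the borders, and applies the Sobel filters to the frame-t feature only.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(channel, kernel):
    """Apply a 3x3 kernel (cross-correlation) with zero padding."""
    height, width = len(channel), len(channel[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[ky][kx] * channel[yy][xx]
            out[y][x] = acc
    return out


def off_unit(feat_t, feat_t_dt):
    """Return [spatial gradient x, spatial gradient y, temporal difference].

    feat_t, feat_t_dt: one 2D feature channel from frame t and frame t + dt.
    """
    grad_x = filter3x3(feat_t, SOBEL_X)          # Sobel
    grad_y = filter3x3(feat_t, SOBEL_Y)          # Sobel
    diff_t = [[b - a for a, b in zip(row_t, row_dt)]
              for row_t, row_dt in zip(feat_t, feat_t_dt)]  # Subtract
    return [grad_x, grad_y, diff_t]              # Concat


# A horizontal ramp that brightens by 1 between the two frames.
feat_t = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
feat_t_dt = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
off = off_unit(feat_t, feat_t_dt)
```

All three steps are cheap element-wise or small-kernel operations, which fits the slide's point that motion cues can be obtained from RGB features without running a separate optical flow estimator.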

Page 74:

Optical Flow Guided Feature (OFF): Experimental results

1. OFF with only RGB inputs is comparable with the other state-of-the-art methods using optical flow as input.

[Figure: two bar charts, speed (FPS) and accuracy (%), comparing RGB + Optical flow + I3D, RGB + OFF, and RGB + OFF + Optical Flow]

Code:

Page 75:

Not only for action recognition

• Also effective for
– Video object detection
– Video compression artifact removal

[Figure: two bar charts. Detection (mAP): resnet+rfcn vs. resnet+rfcn+OFF. Compression artifact removal (PSNR) at q40: DnCNN vs. DnCNN+OFF.]

Page 76:

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
– FishNet (NIPS’18)
– Optical flow guided feature (CVPR’18)

Page 77:

Take home message

• Structured deep learning is effective
• High performance imaging can be used effectively as privileged information by exploring structured information at the
– Sample level
– Feature level
• End-to-end joint training bridges the gap between structure modeling and feature learning

Page 78:

Joint work

Xiaogang Wang Nicu Sebe Elisa Ricci

Wei Yang Dan Xu Shuyang Sun
