
Modeling deep structures for using high performance images

Wanli Ouyang (欧阳万里)

The University of Sydney

Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
  – Structured features: Scene parsing and depth estimation (CVPR'18)
  – Structured samples: 3D human pose estimation (CVPR'18)
• Back-bone model design
• Conclusion

Outline

• Introduction
• Effectively using high performance images
  – Pedestrian detection (CVPR'17)
  – Scene parsing and depth estimation (CVPR'18)
  – 3D human pose estimation (CVPR'18)
• Back-bone model design
  – Image/video classification (CVPR'18, NIPS'18)
• Conclusion

Deep learning

• Ranked No. 1 in MIT Tech Review's Top 10 Breakthroughs of 2013
• Hinton's team won the ImageNet competition: classifying 1.2 million images into 1,000 categories, beating existing computer vision methods by more than 20%, and later surpassing human performance
• Holds records on most computer vision problems
• Applications: web-scale visual search, self-driving cars, surveillance, multimedia, …
• Deep models simulate brain activities and employ millions of neurons to fit billions of training samples; deep neural networks are trained on GPU clusters with tens of thousands of processors

Performance vs. practical need

Conventional model → Deep model → Very deep model → Very deep structured learning

Many other applications, e.g. face recognition


Structure in neurons

• Conventional neural networks

– Neurons in the same layer have no connections

– Neurons in adjacent layers are fully connected, at least within a local region

Structure exists in the brain

Structure in data

Correlation

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
• Conclusion

Effectively using high performance imaging data

• Training: multi-modal data
• Deployment: RGB only

Image from https://research.csiro.au/data61/high-performance-imaging/

Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
  – Structured features: Scene parsing and depth estimation (CVPR'18)
  – Structured samples: 3D human pose estimation (CVPR'18)
• Conclusion

Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
• Conclusion

Motivation

• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations

[Figure: RGB and thermal image pairs; hard positive samples and hard negative samples]

Dan Xu, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, Nicu Sebe, "Learning Cross-Modal Deep Representations for Robust Pedestrian Detection", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Hawaii, USA.

Motivation

• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations
• Can we transfer the learned cross-modal representations when thermal data is missing at deployment time?

Cross-modality transfer

• Deep reconstruction: a reconstruction network is trained to map RGB data to thermal data
• Deep detection: a detection network operates on RGB data alone, reusing the cross-modal representations learned by the reconstruction network

Approach

• RRN (reconstruction network)
  – RGB domain to thermal domain
  – weakly supervised reconstruction
  – region-based instead of frame-level based
• MSDN (detection network)
  – cross-modal multi-scale feature fusion
  – the parameters of the sub-network in the yellow box are transferred from RRN (a parameter-transfer sketch follows)
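A minimal PyTorch-style sketch of this transfer idea, assuming a shared RGB backbone, a hypothetical ReconstructionHead that regresses thermal regions, and a detection backbone initialised from the reconstruction weights; layer choices are illustrative, not the paper's exact RRN/MSDN architecture.

```python
import torch
import torch.nn as nn
import torchvision

# Stage 1 components: an RGB backbone plus a reconstruction head (RRN-like).
backbone = torchvision.models.vgg16(weights=None).features   # RGB feature extractor

class ReconstructionHead(nn.Module):
    """Regresses a thermal region from RGB features (weak, region-level supervision)."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 3, padding=1))    # single-channel thermal output

    def forward(self, feat):
        return self.decode(feat)

recon_head = ReconstructionHead()
recon_loss = nn.MSELoss()

def reconstruction_step(rgb_regions, thermal_regions, optimizer):
    """One training step of the reconstruction branch on paired RGB/thermal regions."""
    feat = backbone(rgb_regions)
    pred = nn.functional.interpolate(recon_head(feat), size=thermal_regions.shape[-2:])
    loss = recon_loss(pred, thermal_regions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Stage 2: the detection network (MSDN-like) reuses the cross-modal backbone weights,
# so at deployment time only RGB input is needed.
detection_backbone = torchvision.models.vgg16(weights=None).features
detection_backbone.load_state_dict(backbone.state_dict())    # parameter transfer
```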

Results - Caltech

• Demonstrated the effectiveness of the learned cross-modal representations

• Achieved superior detection performance

Average miss rate (lower is better):

CMT-CNN-SA   13.76%
CMT-CNN      10.69%

Results - KAIST

Average miss rate (lower is better):

Method                     All      Day      Night
CMT-CNN-SA                 54.26%   52.44%   58.97%
CMT-CNN-SA-SB(Random)      56.76%   54.83%   61.24%
CMT-CNN-SA-SB(ImageNet)    52.15%   50.71%   57.65%
CMT-CNN                    49.55%   47.30%   54.78%

• Demonstrated the effectiveness of the learned cross-modal representations

• Achieved superior detection performance

Qualitative results

[Figure: detection results from ACF, the proposed model without the reconstruction network, and the proposed model with the reconstruction network]

Effectively using high performance images

Structured features / Feature fusion

Motivation

From a single RGB image: segmentation? depth estimation?

Dan Xu, Wanli Ouyang, Xiaogang Wang, Nicu Sebe, "PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018).

Motivation

• Multi-task learning? Directly optimizing multiple tasks (e.g. segmentation and depth estimation) on the input RGB training data does not guarantee a consistent gain on all the tasks

Motivation

• Multi-modal input data improve the training of deep networks
• Can we facilitate the final tasks (segmentation and depth estimation) by leveraging intermediate multi-task predictions (depth, surface normals, semantic labels, contours) from a CNN, while requiring only a single modality (RGB) as input?

Approach

Illustration of the proposed multi-task distillation network for simultaneous depth estimation and scene parsing:

[Figure: RGB → front-end encoder → multi-task predictions via deconvolution heads (losses L1–L4) → multi-task distillation → decoder → outputs (losses L5, L6)]

Approach

• Different multi-task distillation modules:
  – A: naive implementation via feature concatenation
  – B: passing messages between feature maps
  – C: attention-guided message passing (sketched below)
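A hedged sketch of the attention-guided variant (module C), assuming two task-specific feature maps of the same shape; the convolution sizes and the single-direction fusion are illustrative rather than the exact PAD-Net module.

```python
import torch
import torch.nn as nn

class AttentionMessagePassing(nn.Module):
    """Fuses a message from task B into task A's features, gated by an
    attention map predicted from task A's own features."""
    def __init__(self, channels):
        super().__init__()
        self.message = nn.Conv2d(channels, channels, 3, padding=1)      # message from task B
        self.attention = nn.Sequential(                                 # gate predicted from task A
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        gate = self.attention(feat_a)               # values in (0, 1)
        return feat_a + gate * self.message(feat_b)

# Example: refine scene-parsing features with a gated message from depth features.
fuse = AttentionMessagePassing(channels=64)
seg_feat = torch.randn(1, 64, 60, 80)
depth_feat = torch.randn(1, 64, 60, 80)
refined = fuse(seg_feat, depth_feat)                # same shape as seg_feat
```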

Results for scene parsing on Cityscapes

Scene parsing IoU:

Front-end + DE (baseline)             0.725
Front-end + DE + SP (baseline)        0.723
PAD-Net (Distillation A + SP)         0.736
PAD-Net (Distillation B + SP)         0.748
PAD-Net (Distillation C + SP)         0.756
PAD-Net (Distillation C + DE + SP)    0.761

• Plain multi-task learning results in a decrease of IoU accuracy
• Distillation improves accuracy; message passing with attention performs best

Results with Different Front-End CNNs on NYUD-V2

[Figures: qualitative results on NYUD-V2 with different front-end CNNs, comparing estimations with ground truth (GT)]

Effectively using high performance images

Structured samples / Structured features / Feature fusion


Challenges: No Annotation

• Constrained scenes: 3D datasets with ground-truth annotation
• In-the-wild scenes: no annotation, and a domain discrepancy from the constrained data

Which predicted pose is more plausible? Let a discriminator judge.

Weakly Supervised Adversarial Learning

• Training data: a 3D dataset with ground truth, plus images without ground truth
• Generator G: the 3D human pose estimator, producing pose predictions
• Discriminator D: a multi-source discriminator that separates ground-truth poses (real) from predictions (fake)

Adversarial Learning

• Generator: tries to fool the discriminator; Loss_G combines a Euclidean (pose regression) loss with the adversarial term
• Discriminator: tries to tell ground truth from predictions; Loss_D is a classification loss (a training-loop sketch follows)
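A minimal sketch of this adversarial training loop, assuming a `generator` that maps images to poses and a `discriminator` that scores (image, pose) pairs; the network definitions, the weighting `w_adv`, and the handling of unannotated in-the-wild images are illustrative assumptions rather than the exact training schedule.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # discriminator classification loss
mse = nn.MSELoss()             # generator Euclidean (regression) loss

def adversarial_step(images, gt_pose, images_wild,
                     generator, discriminator, opt_g, opt_d, w_adv=0.1):
    # --- Discriminator: tell ground-truth poses (real) from predictions (fake) ---
    with torch.no_grad():
        fake = generator(images)
        fake_wild = generator(images_wild)          # in-the-wild images without 3D labels
    d_real = discriminator(images, gt_pose)
    d_fake = discriminator(images, fake)
    d_wild = discriminator(images_wild, fake_wild)
    loss_d = bce(d_real, torch.ones_like(d_real)) \
           + bce(d_fake, torch.zeros_like(d_fake)) \
           + bce(d_wild, torch.zeros_like(d_wild))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator: regress the annotated poses and fool the discriminator ---
    pred = generator(images)
    pred_wild = generator(images_wild)
    d_out = discriminator(images_wild, pred_wild)
    loss_g = mse(pred, gt_pose) + w_adv * bce(d_out, torch.ones_like(d_out))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```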


Generator

• A 256 × 256 input image passes through convolution and residual blocks (256 × 256 → 128 × 128 → 64 × 64), then through stacked hourglass modules (Stack 1 … Stack n)
• 2D module: predicts 2D score maps
• Depth module: predicts per-joint depth, yielding 3D poses

Multi-Source Discriminator

• Source 1 – image: a CNN encodes the input image I
• Source 2 – raw poses: a CNN encodes the predicted 2D heatmaps and depth maps
• Source 3 – geometric descriptor: pairwise joint offsets [Δx, Δy, Δz] and their squares [Δx², Δy², Δz²], computed from the pose P and encoded by fully connected layers (a descriptor sketch follows)
• The three sources are concatenated and classified as real (ground truth) or fake (prediction)
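A small sketch of the pairwise geometric descriptor suggested by the slide, computed from a (J, 3) joint tensor; the exact descriptor in the paper (e.g. normalisation and joint pairing) may differ.

```python
import torch

def geometric_descriptor(pose):
    """Pairwise offsets [dx, dy, dz] and their squares for a 3D pose.

    pose: tensor of shape (J, 3) holding (x, y, z) joint locations.
    Returns a flattened descriptor over all unique joint pairs.
    """
    diff = pose.unsqueeze(0) - pose.unsqueeze(1)        # (J, J, 3) pairwise offsets
    j = pose.shape[0]
    idx = torch.triu_indices(j, j, offset=1)            # unique joint pairs
    pairs = diff[idx[0], idx[1]]                        # (J*(J-1)/2, 3)
    return torch.cat([pairs, pairs ** 2], dim=-1).reshape(-1)

# Example: 16 joints -> descriptor of length 16*15/2 * 6 = 720
descriptor = geometric_descriptor(torch.randn(16, 3))
print(descriptor.shape)   # torch.Size([720])
```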

Effectiveness of Adversarial Learning


Ablation Study on H36M Dataset

MPJPE (mean per joint position error, in mm) on H36M; lower is better:

State-of-the-art (Zhou et al., ICCV'17)    64.9
Baseline, fix 2D, finetune depth           65.2
Baseline, jointly learn 2D + depth         64.8
Ours, Image + Pose                         61.3
Ours, Image + Geo                          60.3
Ours, Full (Image + Pose + Geo)            59.7

The full model gives 8% less error than the state of the art (Zhou et al., ICCV'17).

Results on Images in the Wild

[Figure: qualitative comparison on in-the-wild images, baseline vs. ours]

Multi-view Results


Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
  – Structured features: Scene parsing and depth estimation (CVPR'18)
  – Structured samples: 3D human pose estimation (CVPR'18)
• Conclusion

Does structure only exist for a specific task?

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS'18)
  – Optical flow guided feature (CVPR'18)

Low-level and high-level features

Image from Andrew Ng’s slides

Current CNN Structures

• Image classification: summarize high-level semantic information of the whole image
• Detection/segmentation: high-level semantic meaning with high spatial resolution; the corresponding architectures are called U-Net, Hourglass, or conv-deconv
• Architectures designed for different granularities are DIVERGING

Goal: unify the advantages of networks for pixel-level, region-level, and image-level tasks

Hourglass for Classification

• Directly applying an hourglass network to classification? Poor performance. So what is the problem?
• Different tasks require features at different resolutions
• Features with high-level semantics and high resolution are good
• Our design: down-sample the high-resolution, high-level features for classification

Hourglass for Classification

• Normal Res-Block: 1 × 1 (c_in) → 3 × 3 (c_in) → 1 × 1 (c_out), with an identity shortcut
• Res-Block for down/up sampling: the shortcut needs a 1 × 1 (c_out) convolution with stride 2 (or up-sampling); this 1 × 1 convolution on the shortcut is an isolated convolution, and the hourglass may bring more isolated convolutions than ResNet
• Our design: 1 × 1 (c_in) → 3 × 3 (c_in) → 1 × 1 (c_in) on the convolution path, while up/down-sampled low-level features (c_out − c_in channels) are concatenated to reach c_out channels, avoiding the isolated convolution (a block sketch follows)
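A hedged sketch contrasting the two shortcut choices, assuming a simple bottleneck body; the channel split and the parameter-free max-pooling are one possible way to realise the concatenation idea, not the exact FishNet block.

```python
import torch
import torch.nn as nn

class StridedShortcutBlock(nn.Module):        # ResNet-style: has an isolated convolution
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_in, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1))
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2)   # isolated convolution
    def forward(self, x):
        return self.body(x) + self.shortcut(x)

class ConcatShortcutBlock(nn.Module):         # concatenation keeps low-level features intact
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_in, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out - c_in, 1))
        self.pool = nn.MaxPool2d(2)           # parameter-free down-sampling
    def forward(self, x):
        x = self.pool(x)
        return torch.cat([self.body(x), x], dim=1)   # low-level features preserved

x = torch.randn(1, 64, 56, 56)
print(StridedShortcutBlock(64, 128)(x).shape)   # torch.Size([1, 128, 28, 28])
print(ConcatShortcutBlock(64, 128)(x).shape)    # torch.Size([1, 128, 28, 28])
```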

Observation and design

Our observations:
1. Diverged structures for tasks requiring features at different resolutions
2. Isolated convolutions block direct back-propagation
3. Features from different depths are not fully explored, or are mixed but not preserved

Our solutions:
1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks
2. Design a network that does not need isolated convolutions
3. Preserve features from varying depths and let them refine each other

Bharath Hariharan, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR'15.
Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." ECCV'16.

Difference between mixing and preserve-and-refine

• Mixed features: high-level and low-level features are simply added (+), so their identities are lost
• Preserve and refine: high-level and low-level features are both kept, and generated messages (M) let them refine each other

FishNet: Overview

• Fish tail: residual blocks down-sample the 224×224 input through 56×56, 28×28, and 14×14 to 7×7 feature maps
• Fish body: up-samples back through 14×14 and 28×28 to 56×56, concatenating features from the tail part
• Fish head: down-samples again through 28×28, 14×14, and 7×7 to a 1×1 output, concatenating features from the body part

FishNet: Preservation & Refinement

• The fish tail, fish body, and fish head are connected by transferring blocks T(·) and regular connections
• UR (Up-sampling and Refinement) blocks in the body: features from the tail are concatenated with the current features, refined by M(·) and r(·) (a channel-wise reduction that sums every k adjacent channels), and up-sampled with nearest-neighbour up-sampling up(·)
• DR (Down-sampling and Refinement) blocks in the head: features from the body are concatenated with the current features, refined by M(·), and down-sampled with 2 × 2 max-pooling down(·)
• Features from varying depths refine each other here (a UR-block sketch follows)
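A hedged sketch of the UR-block idea, assuming body and tail features of the same spatial size; the refinement M(·) and channel reduction r(·) are simplified and channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class URBlock(nn.Module):
    """Concatenates tail features with body features, applies a refinement
    convolution M(.), reduces channels by summing every k adjacent channels
    r(.), and up-samples with nearest-neighbour interpolation up(.)."""
    def __init__(self, channels, k=2):
        super().__init__()
        self.k = k
        self.message = nn.Sequential(            # M(.): residual-style refinement
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def channel_reduce(self, x):                 # r(.): sum every k adjacent channels
        n, c, h, w = x.shape
        return x.view(n, c // self.k, self.k, h, w).sum(dim=2)

    def forward(self, body_feat, tail_feat):
        x = torch.cat([body_feat, tail_feat], dim=1)    # preserve both feature sets
        x = self.channel_reduce(x + self.message(x))    # refine, then reduce channels
        return F.interpolate(x, scale_factor=2, mode="nearest")   # up(.)

# Example: body (N, 256, 14, 14) + tail (N, 256, 14, 14) -> (N, 256, 28, 28)
ur = URBlock(channels=512, k=2)
out = ur(torch.randn(1, 256, 14, 14), torch.randn(1, 256, 14, 14))
print(out.shape)   # torch.Size([1, 256, 28, 28])
```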

FishNet: Performance-ImageNet

Top-1 error (top-5 error) on ImageNet, plotted against parameters (×10^6) and FLOPs (×10^9):

FishNet: 22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
ResNet:  23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)

FishNet reaches lower top-1 error than ResNet at comparable parameter counts and FLOPs.

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance-ImageNet

Top-1 error (top-5 error) on ImageNet, plotted against parameters (×10^6):

FishNet:  22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
DenseNet: 22.58% (6.35%), 22.20% (6.20%), 22.15% (6.12%), 21.20%
ResNet:   23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance on COCO Detection

[Bar charts: AP, AP-small, AP-medium, and AP-large on COCO detection with R-50, RX-50, and Fish-150 backbones]

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance on COCO Instance Segmentation

[Bar charts: AP, AP-small, AP-medium, and AP-large on COCO instance segmentation with R-50, RX-50, and Fish-150 backbones]

Code: https://github.com/kevin-ssy/FishNet

Winning COCO 2018 Instance Segmentation Task

Visualization


Codebase

• Comprehensive

RPN, Fast/Faster R-CNN, Mask R-CNN, FPN, Cascade R-CNN, RetinaNet, and more

• High performance

Better performance

Optimized memory consumption

Faster speed

• Handy to develop

Written with PyTorch

Modular design

GitHub: mmdet

FishNet: Advantages

1. Better gradient flow to shallow layers

2. High-resolution features contain rich low-level and high-level semantics

3. Features from varying depths are preserved and refined by each other

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS'18)
  – Optical flow guided feature (CVPR'18)

Action Recognition

• Recognize action from videos

Optical flow in Action Recognition

• Motion is important information for recognizing actions

• Optical flow

– Effective

– Time consuming

Modality              Accuracy   Speed (fps)
RGB                   85.5%      680
RGB + Optical Flow    94.0%      14

We need a better motion representation

Optical flow guided feature

Intuitive inspiration: the brightness constancy assumption

I(x, y, t) = I(x + Δx, y + Δy, t + Δt)

leads to the optical flow constraint

(∂I/∂x) v_x + (∂I/∂y) v_y + ∂I/∂t = 0,

where {v_x, v_y} is the optical flow and [∂I/∂x, ∂I/∂y, ∂I/∂t] are the coefficients for the optical flow.

Applying the same constraint to a feature map f(I) instead of the raw image gives the feature flow

(∂f/∂x) ṽ_x + (∂f/∂y) ṽ_y + ∂f/∂t = 0,

where {ṽ_x, ṽ_y} is the feature flow. The optical flow guided feature (OFF) is the coefficient vector [∂f/∂x, ∂f/∂y, ∂f/∂t], which encodes motion at the feature level.

OFF unit and architecture

• Inputs: the features of frame t and frame t + Δt, each passed through a 1 × 1 convolution
• Spatial gradients are obtained with Sobel operators; the temporal gradient is the element-wise subtraction of the two feature maps
• The gradient maps are concatenated to form the OFF
• OFF units are applied at several feature resolutions (K × K, K/2 × K/2, K/4 × K/4), and their outputs feed a classification sub-network (an OFF-unit sketch follows)
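A hedged sketch of an OFF unit following the description above, with fixed Sobel kernels for the spatial gradients and a feature subtraction for the temporal gradient; the 1 × 1 reduction width and the lack of kernel normalisation are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OFFUnit(nn.Module):
    """Spatial gradients via depthwise Sobel filters plus a temporal gradient
    via feature subtraction, concatenated into the OFF."""
    def __init__(self, channels, reduced=32):
        super().__init__()
        self.reduce_t0 = nn.Conv2d(channels, reduced, 1)   # 1x1 conv on frame t
        self.reduce_t1 = nn.Conv2d(channels, reduced, 1)   # 1x1 conv on frame t + Δt
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel_x.view(1, 1, 3, 3).repeat(reduced, 1, 1, 1))
        self.register_buffer("ky", sobel_x.t().contiguous().view(1, 1, 3, 3).repeat(reduced, 1, 1, 1))
        self.reduced = reduced

    def forward(self, feat_t, feat_t1):
        f0 = self.reduce_t0(feat_t)
        f1 = self.reduce_t1(feat_t1)
        gx = F.conv2d(f0, self.kx, padding=1, groups=self.reduced)   # ∂f/∂x (Sobel)
        gy = F.conv2d(f0, self.ky, padding=1, groups=self.reduced)   # ∂f/∂y (Sobel)
        gt = f1 - f0                                                  # ∂f/∂t (subtract)
        return torch.cat([gx, gy, gt], dim=1)                         # the OFF

off = OFFUnit(channels=256)
f_t, f_t1 = torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28)
print(off(f_t, f_t1).shape)   # torch.Size([1, 96, 28, 28])
```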

Optical Flow Guided Feature (OFF): Experimental results

1. OFF with only RGB inputs is comparable with the other state-of-the-art methods using optical flow as input.

[Bar charts: speed (FPS) and accuracy (%) for RGB + Optical flow + I3D, RGB + OFF, and RGB + OFF + Optical Flow]

Code:

Not only for action recognition

• Also effective for

– Video object detection

– Video compression artifact removal

[Bar charts: video object detection mAP for resnet+rfcn vs. resnet+rfcn+OFF, and compression artifact removal PSNR at quality q40 for DnCNN vs. DnCNN+OFF]

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS'18)
  – Optical flow guided feature (CVPR'18)

Take home message

• Structured deep learning is effective
• High performance imaging can be used effectively as privileged information by exploring structured information at
  – the sample level
  – the feature level

• End-to-end joint training bridges the gap between structure modeling and feature learning

Joint work with

Xiaogang Wang Nicu Sebe Elisa Ricci

Wei Yang Dan Xu Shuyang Sun