Modeling deep structures for using high performance images
Wanli Ouyang (欧阳万里)
The University of Sydney
Outline
• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR’17)
  – Structured features: Scene parsing and depth estimation (CVPR’18)
  – Structured samples: 3D human pose estimation (CVPR’18)
• Back-bone model design
• Conclusion
Outline
• Introduction
• Effectively using high performance images
  – Pedestrian detection (CVPR’17)
  – Scene parsing and depth estimation (CVPR’18)
  – 3D human pose estimation (CVPR’18)
• Back-bone model design
  – Image/video classification (CVPR’18, NIPS’18)
• Conclusion
Deep learning
• MIT Tech Review Top 10 Breakthroughs 2013: ranked No. 1
• Hinton’s team won the ImageNet competition: classifying 1.2 million images into 1,000 categories, beating existing computer vision methods by 20+%; deep models have since surpassed human performance
• Holds records on most computer vision problems
• Web-scale visual search, self-driving cars, surveillance, multimedia …
• Simulates brain activity, employing millions of neurons to fit billions of training samples; deep neural networks are trained on GPU clusters with tens of thousands of processors
Performance vs. practical need
[Figure: performance grows from the conventional model to the deep model, the very deep model, and very deep structured learning]
Many other applications: face recognition, …
[Diagram: network with input x, hidden units h, output y]
Structure in neurons
• Conventional neural networks
  – Neurons in the same layer have no connections
  – Neurons in adjacent layers are fully connected, at least within a local region (see the sketch below)
• Structure exists in the brain
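A minimal PyTorch sketch of this conventional connectivity (the layer sizes here are arbitrary, illustrative choices): adjacent layers are wired x → h → y by fully connected modules, and nothing connects neurons within the same layer; a convolution would be the "fully connected within a local region" variant.

```python
import torch
import torch.nn as nn

# Conventional feed-forward network: x -> h -> y.
# Adjacent layers are (fully) connected; neurons within
# the same layer have no connections to each other.
class ConventionalNet(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=256, out_dim=10):
        super().__init__()
        self.input_to_hidden = nn.Linear(in_dim, hidden_dim)    # x -> h
        self.hidden_to_output = nn.Linear(hidden_dim, out_dim)  # h -> y

    def forward(self, x):
        h = torch.relu(self.input_to_hidden(x))  # hidden layer h
        return self.hidden_to_output(h)          # output y

y = ConventionalNet()(torch.randn(8, 784))  # -> [8, 10]
```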
Structure in data
[Figures: missing entries (?) in data can be inferred from correlated observations]
Correlation
Outline
• Introduction
• Effectively using high performance images
• Back-bone model design
• Conclusion
Effectively using high performance imaging data
• Training: multi-modal data
• Deployment: RGB only
Image from https://research.csiro.au/data61/high-performance-imaging/
Outline
• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR’17)
  – Structured features: Scene parsing and depth estimation (CVPR’18)
  – Structured samples: 3D human pose estimation (CVPR’18)
• Conclusion
Outline
• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR’17)
• Conclusion
Motivation
• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations
[Examples: RGB and thermal crops of hard positive and hard negative samples]
Dan Xu, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, Nicu Sebe, “Learning Cross-Modal Deep Representations for Robust Pedestrian Detection”, CVPR 2017, Hawaii, USA.
Motivation
• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations
• Can we transfer the learned cross-modal representations when thermal data are missing at test time?
[Diagram: a reconstruction network learns deep reconstruction from RGB data to thermal data; through cross-modality transfer, its representations are passed to a detection network that performs deep detection from RGB data alone]
Approach
• RRN (region reconstruction network)
  – RGB domain to thermal domain
  – weakly supervised reconstruction
  – region-based instead of frame-level
• MSDN (multi-scale detection network)
  – cross-modal multi-scale feature fusion
  – the parameters of the subnetwork in the yellow box are transferred from RRN (a transfer sketch follows)
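A minimal PyTorch sketch of the cross-modality transfer idea; the module names, layer sizes, and toy losses below are illustrative assumptions, not the paper's exact architecture. A reconstruction network learns RGB → thermal, and its trunk is then reused as a cross-modal branch inside the detection network, which needs only RGB at test time.

```python
import torch
import torch.nn as nn

# RRN-style reconstruction network: predicts a thermal map from RGB
# regions, so its trunk learns cross-modal representations.
class ReconstructionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Conv2d(64, 1, 3, padding=1)  # 1-channel thermal map

    def forward(self, rgb):
        return self.decoder(self.trunk(rgb))

# MSDN-style detection network (heavily simplified): fuses a plain RGB
# branch with the transferred cross-modal branch.
class DetectionNet(nn.Module):
    def __init__(self, transferred_trunk):
        super().__init__()
        self.rgb_branch = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.cross_modal_branch = transferred_trunk   # parameters from RRN
        self.head = nn.Conv2d(128, 2, 1)              # toy detection head

    def forward(self, rgb):
        fused = torch.cat([self.rgb_branch(rgb), self.cross_modal_branch(rgb)], 1)
        return self.head(fused)

# Step 1: train the RRN with a reconstruction loss against thermal data.
rrn = ReconstructionNet()
rec_loss = nn.MSELoss()(rrn(torch.randn(4, 3, 64, 64)), torch.randn(4, 1, 64, 64))

# Step 2: transfer the trunk; at deployment only RGB is required.
detector = DetectionNet(transferred_trunk=rrn.trunk)
scores = detector(torch.randn(4, 3, 64, 64))
```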
Results - Caltech

Method        Average miss rate
CMT-CNN-SA    13.76%
CMT-CNN       10.69%

• Demonstrated the effectiveness of the learned cross-modal representations
• Achieved superior detection performance
Results - KAIST (average miss rate)

Method                     All       Day       Night
CMT-CNN-SA                 54.26%    52.44%    58.97%
CMT-CNN-SA-SB(Random)      56.76%    54.83%    61.24%
CMT-CNN-SA-SB(ImageNet)    52.15%    50.71%    57.65%
CMT-CNN                    49.55%    47.30%    54.78%

• Demonstrated the effectiveness of the learned cross-modal representations
• Achieved superior detection performance
Qualitative results
[Detections compared: ACF, ours without the reconstruction network, and ours with the reconstruction network]
Effectively using high performance images
Structured features: Scene parsing and depth estimation (CVPR’18)
Motivation
[Diagram: can a single RGB input drive both segmentation and depth estimation?]
Dan Xu, Wanli Ouyang, Xiaogang Wang, Nicu Sebe, “PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing”, CVPR 2018.
Motivation
• Multi-task learning? Directly optimizing multiple tasks given the input training data does not guarantee a consistent gain on all the tasks
[Diagram: RGB → segmentation and depth estimation]
Motivation
• Multi-modal input data improve the training of deep networks
• Can we facilitate the final tasks (segmentation, depth estimation) by leveraging multiple intermediate predictions (depth, surface normals, semantics, contours) while requiring only single-modal RGB data?
[Diagram: RGB → CNN → intermediate predictions → CNN → final tasks]
Approach
Illustration of the proposed multi-task distillation network for simultaneous depth estimation and scene parsing:
[Diagram: RGB → front-end encoder → multi-task predictions via deconvolution heads (losses L1-L4) → multi-task distillation → decoder → outputs (losses L5, L6)]
Approach
• Three multi-task distillation modules:
  – A: naive implementation via feature concatenation
  – B: passing messages between feature maps
  – C: attention-guided message passing (a sketch of module C follows this list)
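A minimal PyTorch sketch of the attention-guided message passing idea (module C); the channel sizes, the per-task gating, and the single-conv messages are illustrative assumptions rather than the paper's exact design. Each task's features receive messages from the other tasks, gated by an attention map predicted from the receiving task's own features.

```python
import torch
import torch.nn as nn

class AttentionGuidedDistillation(nn.Module):
    """Exchange messages between task feature maps, gated by attention."""
    def __init__(self, num_tasks=4, channels=64):
        super().__init__()
        # One message-generation conv per sending task, and one
        # attention predictor per receiving task.
        self.message = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_tasks))
        self.attention = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
            for _ in range(num_tasks))

    def forward(self, task_feats):  # list of [B, C, H, W], one per task
        refined = []
        for i, f_i in enumerate(task_feats):
            gate = self.attention[i](f_i)          # [B, 1, H, W], in (0, 1)
            msg = sum(gate * self.message[j](f_j)  # gated messages from others
                      for j, f_j in enumerate(task_feats) if j != i)
            refined.append(f_i + msg)              # refine, keep the original
        return refined

feats = [torch.randn(2, 64, 32, 32) for _ in range(4)]
out = AttentionGuidedDistillation()(feats)
```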
Results for scene parsing on Cityscapes (DE = depth estimation, SP = scene parsing)

Model                                 mIoU
Front-end + SP (baseline)             0.725
Front-end + DE + SP (baseline)        0.723
PAD-Net (Distillation A + SP)         0.736
PAD-Net (Distillation B + SP)         0.748
PAD-Net (Distillation C + SP)         0.756
PAD-Net (Distillation C + DE + SP)    0.761

• Naive multi-task learning results in a decrease of IoU accuracy
• Distillation improves accuracy; message passing with attention performs best
Results with Different Front-End CNNs on NYUD-V2
[Qualitative results: estimation vs. ground truth]
Effectively using high performance images
Structured samples: 3D human pose estimation (CVPR’18)
Challenges: No Annotation
• Constrained scenes vs. in-the-wild scenes
• No annotation for in-the-wild images
• Domain discrepancy
• Which predicted 3D pose is more plausible?
Weakly Supervised Adversarial Learning
[Diagram: a generator G (the 3D human pose estimator) is trained on a 3D dataset and on images without ground truth; a multi-source discriminator D classifies ground-truth poses as real and predictions as fake]
Adversarial Learning
• Generator: trained with a Euclidean loss on annotated data plus Loss_G, the adversarial loss for fooling the discriminator
• Discriminator: trained with Loss_D, a classification loss for telling real poses from fake ones
(A training-loop sketch follows.)
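A minimal sketch of this adversarial training loop, assuming a pose generator `G` and a discriminator `D` (ending in a sigmoid) are defined elsewhere; the batching and the way `D` consumes poses are simplified assumptions. The Euclidean loss applies only where 3D ground truth exists, while the adversarial terms also exploit unannotated in-the-wild images.

```python
import torch
import torch.nn as nn

bce, mse = nn.BCELoss(), nn.MSELoss()

def train_step(G, D, opt_G, opt_D, img_labeled, pose_gt, img_wild):
    # Discriminator: tell ground-truth poses (real) from predictions (fake).
    fake = torch.cat([G(img_labeled), G(img_wild)]).detach()
    real_score, fake_score = D(pose_gt), D(fake)
    loss_D = bce(real_score, torch.ones_like(real_score)) + \
             bce(fake_score, torch.zeros_like(fake_score))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: Euclidean loss on annotated data + fool the discriminator.
    pred = G(img_labeled)
    fool_score = D(torch.cat([pred, G(img_wild)]))
    loss_G = mse(pred, pose_gt) + \
             bce(fool_score, torch.ones_like(fool_score))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_G.item(), loss_D.item()
```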
Generator
[Diagram: stacked hourglass network (stack 1 … stack n); a 256×256 input passes through convolution and residual modules down to 64×64; the 2D module outputs 64×64 2D score maps and the depth module outputs depth, yielding 3D poses]
Multi-Source Discriminator
Three sources of information are encoded by CNNs and concatenated before fully connected layers (256 units) that classify samples as real or fake:
• the image I
• the raw poses: 2D heatmaps and depth maps (64×64)
• a geometric descriptor P computed from pairwise joint differences [Δx, Δy, Δz] and their squares [Δx², Δy², Δz²] (see the sketch after this list)
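A minimal sketch of the geometric descriptor under the stated construction; the flattening into a single vector and the joint count are illustrative assumptions.

```python
import torch

def geometric_descriptor(joints_3d):
    """joints_3d: [B, J, 3] 3D joint locations (predicted or ground truth).
    Returns pairwise joint differences [dx, dy, dz] and their element-wise
    squares [dx^2, dy^2, dz^2], flattened per sample."""
    diff = joints_3d.unsqueeze(2) - joints_3d.unsqueeze(1)  # [B, J, J, 3]
    desc = torch.cat([diff, diff ** 2], dim=-1)             # [B, J, J, 6]
    return desc.flatten(1)                                  # [B, J*J*6]

desc = geometric_descriptor(torch.randn(4, 16, 3))  # e.g. 16 joints
print(desc.shape)  # torch.Size([4, 1536])
```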
Effectiveness of Adversarial Learning
Ablation study on the H36M dataset: mean per joint position error (MPJPE, in mm)

Method                                    MPJPE (mm)
Zhou et al. ICCV’17 (state of the art)    64.9
Baseline: fix 2D, finetune depth          65.2
Baseline: jointly learn 2D + depth        64.8
Ours: Image + Pose                        61.3
Ours: Image + Geo                         60.3
Ours: Image + Pose + Geo (full)           59.7

8% less error than the state of the art (Zhou et al. ICCV’17)
Results on Images in the Wild
[Qualitative comparison: baseline vs. ours]
Multi-view Results
Outline
• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR’17)
  – Structured features: Scene parsing and depth estimation (CVPR’18)
  – Structured samples: 3D human pose estimation (CVPR’18)
• Conclusion
Does structure only exist for a specific task?
Outline
• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS’18)
  – Optical flow guided feature (CVPR’18)
Low-level and high-level features
Image from Andrew Ng’s slides
Current CNN Structures
• Image classification: summarize high-level semantic information of the whole image
• Detection/segmentation: high-level semantic meaning with high spatial resolution; the networks are called U-Net, hourglass, or conv-deconv
Architectures designed for different granularities are DIVERGING.
Goal: unify the advantages of networks for pixel-level, region-level, and image-level tasks.
Hourglass for Classification
• Directly applying the hourglass to classification? Poor performance. So what is the problem?
• Different tasks require different feature resolutions
• Features with high-level semantics and high resolution are good
• Our design: down-sample the high-level features that have high resolution
Hourglass for Classification
[Diagram, three blocks: a normal res-block (1×1, c_in → 3×3, c_in → 1×1, c_out, with identity shortcut); a res-block for down/up sampling (stride 2, with a 1×1, c_out convolution on the shortcut); and our design (up/down sample, then concatenate c_out − c_in channels of low-level features to reach c_out channels)]
The 1×1 convolution layer on the shortcut (in yellow) is the isolated convolution.
• The hourglass may bring more isolated convolutions than ResNet (a sketch contrasting the two block types follows)
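A minimal PyTorch sketch contrasting the two down-sampling blocks; channel sizes and the exact layer ordering are illustrative assumptions. The standard res-block needs a strided 1×1 convolution (an isolated convolution) on its shortcut to reach c_out channels, while the concat-based design keeps a parameter-free shortcut and reaches c_out by appending low-level feature channels.

```python
import torch
import torch.nn as nn

# Standard down-sampling res-block: the shortcut needs a strided 1x1
# convolution ("isolated convolution") to match c_out channels,
# blocking direct back-propagation through an identity path.
class ResDownBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_in, 1), nn.ReLU(),
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_in, c_out, 1),
        )
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2)  # isolated conv

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

# Concat-based design: identity-preserving residual path, pooling for
# down-sampling, and extra channels taken from low-level features.
class ConcatDownBlock(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_in, 1), nn.ReLU(),
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_in, c_in, 1),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x, low_level):
        # low_level: [B, c_out - c_in, H, W], taken from an earlier stage
        y = self.pool(x + self.body(x))       # shortcut stays parameter-free
        return torch.cat([y, self.pool(low_level)], dim=1)  # c_out channels

out = ConcatDownBlock(64)(torch.randn(2, 64, 56, 56),
                          torch.randn(2, 64, 56, 56))  # -> [2, 128, 28, 28]
```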
Observation and design

Our observations:
1. Diverged structures for tasks requiring different resolutions
2. Isolated convolution blocks direct back-propagation
3. Features at different depths are not fully explored, or are mixed but not preserved

Solution:
1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks
2. Design a network that does not need isolated convolution
3. Preserve features from varying depths and refine them with each other

Bharath Hariharan, et al. “Hypercolumns for object segmentation and fine-grained localization.” CVPR’15.
Alejandro Newell, Kaiyu Yang, and Jia Deng. “Stacked hourglass networks for human pose estimation.” ECCV’16.
Difference between “mix” and “preserve and refine”
[Diagram: mixing sums high-level and low-level features into one map; preserve-and-refine keeps both feature sets and exchanges information through message-generation modules M]
FishNet: Overview
[Diagram: the fish tail down-samples with residual blocks (224×224 → 56×56 → 28×28 → 14×14 → 7×7); the fish body up-samples (7×7 → 14×14 → 28×28 → 56×56) while concatenating features from the tail; the fish head down-samples again (56×56 → 28×28 → 14×14 → 7×7 → 1×1) while concatenating features from the body]
FishNet: Preservation & Refinement
……
…………
……
……
TransferringBlocks T (⋅)
DR Blocks
UR Blocks
Regular Connections
……
FishTail
FishBody
FishHead
M(⋅)
𝑑𝑜𝑤𝑛(⋅)
𝑢𝑝(⋅)
𝑟(⋅)M(⋅)
……
……
Up-sampling and Refinement (UR) Blocks
Down-sampling and Refinement (DR) Blocks
Feature from varying depth refines each other here
Sum up every𝒌 adjacent
channelsConcat
Concat
Nearest neighbor up-sampling
𝟐 × 𝟐 Max-Pooling
From Tail
From Body
From Body
From Head
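A minimal sketch of a UR block under the stated design; the channel counts, the form of M(·), and the block wiring are illustrative assumptions. Note that r(·), summing every k adjacent channels, reduces channels without any convolution, so no isolated convolution is introduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_reduction(x, k):
    """r(.): sum every k adjacent channels -- a parameter-free reduction."""
    b, c, h, w = x.shape
    return x.view(b, c // k, k, h, w).sum(dim=2)

class URBlock(nn.Module):
    """Up-sampling & Refinement block (fish-body side)."""
    def __init__(self, channels, k=2):
        super().__init__()
        self.k = k
        self.message = nn.Sequential(  # M(.): residual message on the concat
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, body_feat, tail_feat):
        x = torch.cat([body_feat, tail_feat], dim=1)    # preserve both
        x = x + self.message(x)                         # refine
        x = channel_reduction(x, self.k)                # r(.)
        return F.interpolate(x, scale_factor=2, mode="nearest")  # up(.)

body, tail = torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7)
out = URBlock(channels=128)(body, tail)  # -> [2, 64, 14, 14]
```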
FishNet: Performance - ImageNet
Top-1 error (top-5 in parentheses) vs. parameters (×10⁶) and FLOPs (×10⁹):
• FishNet: 22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
• ResNet: 23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)
FishNet reaches lower top-1 error than ResNet at comparable parameters and FLOPs.
Code: https://github.com/kevin-ssy/FishNet
FishNet: Performance - ImageNet
Top-1 error (top-5 in parentheses) vs. parameters (×10⁶):
• FishNet: 22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
• DenseNet: 22.58% (6.35%), 22.20% (6.20%), 22.15% (6.12%)
• ResNet: 23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)
Code: https://github.com/kevin-ssy/FishNet
FishNet: Performance on COCO Detection
[Bar charts: AP, AP-small, AP-medium, and AP-large for backbones R-50 (ResNet-50), RX-50 (ResNeXt-50), and Fish-150 (FishNet-150); Fish-150 gives the best results]
Code: https://github.com/kevin-ssy/FishNet
FishNet: Performance on COCO Instance Segmentation
[Bar charts: AP, AP-small, AP-medium, and AP-large for R-50, RX-50, and Fish-150; Fish-150 gives the best results]
Code: https://github.com/kevin-ssy/FishNet
Winning COCO 2018 Instance Segmentation Task
Visualization
Codebase
• Comprehensive: RPN, Fast/Faster R-CNN, Mask R-CNN, FPN, Cascade R-CNN, RetinaNet, and more
• High performance: better performance, optimized memory consumption, faster speed
• Handy to develop: written with PyTorch, modular design
GitHub: mmdet
FishNet: Advantages
1. Better gradient flow to shallow layers
2. High-resolution features contain rich low-level and high-level semantics
3. Features from varying depths are preserved and refined by each other
Outline
• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS’18)
  – Optical flow guided feature (CVPR’18)
Action Recognition
• Recognize actions from videos
Optical flow in Action Recognition
• Motion is the important information
• Optical flow: effective, but time consuming

Modality             Acc.    Speed (fps)
RGB                  85.5%   680
RGB + Optical Flow   94.0%   14

We need a better motion representation.
Optical flow guided feature
−
Optical flow guided feature
{vx , vy} = optical flow
Coefficient for optical flow:
Optical flow:
Intuitive Inspiration
Optical flow guided feature
Feature flow:
{ } = feature flow
Optical flow guided feature: OFF unit
[Diagram: features of frame t and frame t + Δt at resolution K×K each pass through a 1×1 convolution; Sobel filtering (S) produces the spatial gradients and element-wise subtraction produces the temporal gradient; the results are concatenated (C). OFF units run at resolutions K×K, K/2×K/2, and K/4×K/4, each concatenated with the coarser unit’s output and fed to a classification sub-network. A code sketch of one unit follows.]
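A minimal sketch of an OFF-style unit under the stated design; the channel sizes, the fixed Sobel weights, and the fusion with the coarser unit are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OFFUnit(nn.Module):
    def __init__(self, in_channels, mid_channels=32):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, 1)  # 1x1 conv
        sobel = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel.view(1, 1, 3, 3))
        self.register_buffer("ky", sobel.t().view(1, 1, 3, 3))

    def spatial_grad(self, f):
        # Depth-wise Sobel filtering: one gradient map per channel.
        c = f.shape[1]
        gx = F.conv2d(f, self.kx.expand(c, 1, 3, 3), padding=1, groups=c)
        gy = F.conv2d(f, self.ky.expand(c, 1, 3, 3), padding=1, groups=c)
        return gx, gy

    def forward(self, feat_t, feat_t_dt, coarser=None):
        f1, f2 = self.reduce(feat_t), self.reduce(feat_t_dt)
        gx, gy = self.spatial_grad(f1)   # spatial terms of the OFF
        gt = f2 - f1                     # temporal term, by subtraction
        off = torch.cat([gx, gy, gt], dim=1)
        if coarser is not None:          # fuse the coarser-resolution OFF
            off = torch.cat([off, F.interpolate(coarser, size=off.shape[-2:])],
                            dim=1)
        return off

unit = OFFUnit(in_channels=64)
off = unit(torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28))
```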
Optical Flow Guided Feature (OFF): experimental results
1. OFF with only RGB inputs is comparable with the other state-of-the-art methods that use optical flow as input.
[Bar charts: speed (FPS) and accuracy (%) for RGB + optical flow + I3D, RGB + OFF, and RGB + OFF + optical flow]
Code:
Not only for action recognition
• Also effective for:
  – Video object detection: ResNet + R-FCN vs. ResNet + R-FCN + OFF [bar chart, mAP; OFF improves detection]
  – Video compression artifact removal: DnCNN vs. DnCNN + OFF [bar chart, PSNR at q40; OFF improves PSNR]
Outline
• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS’18)
  – Optical flow guided feature (CVPR’18)
Take home message
• Structured deep learning is effective
• Effectively use high performance imaging as the privileged information by exploring structured information at
  – the sample level
  – the feature level
• End-to-end joint training bridges the gap between structure modeling and feature learning
Joint work with Xiaogang Wang, Nicu Sebe, Elisa Ricci, Wei Yang, Dan Xu, and Shuyang Sun.