論文紹介(Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification)
Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition...
Transcript of Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition...
![Page 1: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/1.jpg)
Modeling deep structures for 3D scene understanding
Wanli Ouyang (欧阳万里)
The Chinese University of Hong Kong The University of Sydney
![Page 2: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/2.jpg)
Outline
2
Back-bone model design
Conclusion
Introduction
Structured featuresStructured output
Structured deep learning
Depth Estimation (TPAMI’18)
Scene Graph Generation (ICCV’17, ECCV’18)
![Page 3: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/3.jpg)
Outline
3
Image/video classification(CVPR’18, NIPS’18)
Structured featuresStructured output
Structured deep learning
Depth Estimation (TPAMI’18)
Scene Graph Generation (ICCV’17, ECCV’18)
Back-bone model design
Conclusion
Introduction
![Page 4: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/4.jpg)
Outline
4
Image/video classification(CVPR’18, NIPS’18)
Structured featuresStructured output
Structured deep learning
Depth Estimation (TPAMI’18)
Scene Graph Generation (ICCV’17, ECCV’18)
Back-bone model design
Conclusion
Introduction
Codes available!
![Page 5: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/5.jpg)
Deep learning
MIT Tech ReviewTop 10 Breakthroughs 2013Ranking No. 1
Hinton won ImageNetcompetition Classify 1.2 million images into 1,000 categoriesBeating existing computer vision methods by 20+% Surpassing human performance
Hold records on most of the computer vision problems
Web-scale visual search, self-driving cars, surveillance, multimedia …
Simulate brain activities and employ millions of neurons to fit billions of training samples. Deep neural networks are trained with GPU clusters with tens of thousands of processors
![Page 6: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/6.jpg)
Performance vs practical need
Conventional model
Deep model Very Deep model
Very deep structured learning
Many other applications
Face recognition
![Page 7: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/7.jpg)
h
x
y
Structure in neurons
• Conventional neural networks
– Neurons in the same layer have no connection
– Neurons in adjacent layers are fully connected, at least within a local region
Structure exists in brain
![Page 8: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/8.jpg)
Structure in data
?
![Page 9: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/9.jpg)
Structure in data
?
?
Correlation
![Page 10: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/10.jpg)
Outline
10
Introduction
Structured featuresStructured output
Structured deep learning
Depth Estimation (TPAMI’18)
Scene Graph Generation (ICCV’17, ECCV’18)
![Page 11: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/11.jpg)
Outline
11
Introduction
Structured output
Structured deep learning
Depth Estimation (TPAMI’18)
![Page 12: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/12.jpg)
Monocular depth estimation
RGB-Input GT Ours
![Page 13: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/13.jpg)
Motivation•Deep structured dense pixel-level prediction:
In single scale with patch-level refinement due to the O(n3)
complexity of closed-form solution
CNN coarse output CRF-modeling InferenceRepresentative works:
• CRF-RNN:
. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as
recurrent neural networks. In ICCV, 2015.
• Deep convolutional neural field:
. F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE
TPAMI, 38(10):2024–2039, 2016.
In Discrete Domain
Ours: In Multi-scale with pixel-level dense refinement with O(n) complexity
![Page 14: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/14.jpg)
Approach
Multi-Scale Deep Structured Fusion & Prediction + In Continuous Domain +
Within a Joint CNN-CRF Framework
Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, Nicu Sebe, “Multi-Scale Continuous CRFs as Sequential Deep
Networks for Monocular Depth Estimation”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2017, Spotlight Oral, TPAMI18)
Front-End C onvolutional N eural N etw ork
C -M F C -M F C -M F C -M FC -M F
d ?
s1
C -M F C -M F C -M F C -M FC -M F
M ulti-Scale Fusion w ith C ontinuous C R Fs
Side O utputs
s2 s3 s4 s5
...
...
...
...
...
r
r
![Page 15: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/15.jpg)
Message passing using CNN-CRF
• 20 minutes to illustrate CRF …
![Page 16: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/16.jpg)
Results with Different Front-End CNNs on NYUD-V2
Results
Qualitative results on NYUD-V2: significant improvement
over the pretrained front-end CNNs
Code:
![Page 17: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/17.jpg)
Results
Results on KITTI: ours achieved the best performance
compared with the state-of-the-art
Code:
![Page 18: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/18.jpg)
More effective than the classic multi-scale fusion schemes
RGB-Input GT Ours
Qualitative results on Make3DAchieved the best performance on most of the metrics.
Results
![Page 19: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/19.jpg)
Outline
19
Introduction
Structured featuresStructured output
Structured deep learning
Depth Estimation (TPAMI’18)
Scene Graph Generation (ICCV’17, ECCV’18)
![Page 20: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/20.jpg)
Object detection
Scene graph generation
Mom and her cute baby are brushing teeth Region captioning
Woman
Child
Tooth brush
Tooth brush
Use
Use
Watch
![Page 21: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/21.jpg)
Outline
21
Introduction
Structured featuresStructured output
Structured deep learning
Depth Estimation (TPAMI’18)
Scene Graph Generation (ICCV’17, ECCV’18)
![Page 22: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/22.jpg)
Why structured features?
Contains rich visual information
![Page 23: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/23.jpg)
Woman
Tooth brush
?
![Page 24: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/24.jpg)
Woman
Tooth brush
?Use
![Page 25: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/25.jpg)
Overview our proposed Multi-level Scene Description Network (MSDN)
25
![Page 26: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/26.jpg)
Methodology: Dynamic Graph Construction
26
![Page 27: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/27.jpg)
Methodology: Feature Refining
27
(b) Phrase Updating
(a) Object Updating
(c) Region Updating
Region nodes
phrase nodes
object nodes
![Page 28: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/28.jpg)
Methodology: Object feature updating
28
• Phrase feature merge: Since the features from different phrases have different importance factors for refining objects, we use a gate function to determine weights.
The gate function is defined as:
• Refine object features: For the i-th object, there are two merged features:
![Page 29: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/29.jpg)
Overview our proposed Multi-level Scene Description Network (MSDN)
29
Code:
![Page 30: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/30.jpg)
Quantitative Results
30
Comparison with existing works:• LP: Visual Relationship detection using word embeddings as
language prior (Lu, Cewu, et al., ECCV 2016)• ISGG: Scene graph generation using iterative message
passing (Xu, Danfei, et al. arXiv:1701.02426)
Experiment on object detection & captioning:• FRCNN: Faster R-CNN (Girshick, Ross., ICCV 2015) with the same
number of potential object proposals as used at our MSDN.• Baseline-3-bran.: the baseline model with 3 branches but the
feature refining structure removed.
![Page 31: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/31.jpg)
Qualitative Results
Top-1 region captioning results with detected objects and corresponding relationships are visualized.
![Page 32: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/32.jpg)
32
Slow inference speed due to large number of phrase proposals
2.45s/img
![Page 33: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/33.jpg)
33
Factorizable NetAn Efficient Subgraph-based Framework for Scene Graph Generation
![Page 34: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/34.jpg)
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
(person-play-snowboard) (snowboard-under-person)
(person-wear-pants) (pants-on-person)
(person-wear-helmet)
(helmet-on-person)
Object nodes Predicate nodes
![Page 35: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/35.jpg)
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
(Shared interacting regions)Object nodes Predicate nodes Subgraph nodes
![Page 36: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/36.jpg)
Factorizable Net
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
(1) Image and RPN
proposals (2) Fully-connected Graph(3) Subgraph-based
Representation
person
wear hold
helmet bat
(4) ROI-pooling and
Feature Preparation
(6) Object and Relation
Recognition
Predicate inference (SRI)(5) Spatial-weighted
Message Passing
(SMP)
Object inference
Object feature vectors
Subgraph feature maps
![Page 37: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/37.jpg)
SMP: Spatial-weighted Message Passing
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
(k × 512 × 1 × 1)
subgraph feature
(1 × 512 × 5 ×5)
merged features
(1 × 512 × 5 × 5)
object features
(k × 512)attention maps
(k × 1 × 5 × 5)
refined features
(1 × 512 × 5 × 5)
SMP : Subgraph Feature Refining
subgraph features
(m × 512 × 5 × 5)
avg-pooled features
(m × 515 × 1 × 1)
object feature
(1 × 512)
merged features
(1 × 512)
refined features
(1 × 512)
attention vector
(m× 1)
SMP: Object Feature Refining
![Page 38: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/38.jpg)
SMP: subgraph to object
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
subgraph features
(m × 512 × 5 × 5)
avg-pooling or attention
(m × 515 × 1 × 1)
object feature
(1 × 512)
merged features
(1 × 512)
refined features
(1 × 512)
attention vector
(m× 1)
![Page 39: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/39.jpg)
SMP: object to subgraph
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
(k × 512 × 5 × 5)
subgraph feature
(1 × 512 × 5 × 5)
object features
(k × 512)attention maps (k ×
1 × 5 × 5)
refined features
(1 × 512 × 5 × 5)
![Page 40: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/40.jpg)
SRI: Spatial-sensitive Relation Inference
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
conv kernel (1 ×512)
Concatenate
subgraph feature
(1 × 512 × 7 × 7)
(1 × 512 × 7 × 7)
(1 × 512 × 7 × 7)
(1 × 1536 × 7 × 7)
Spatial-sensitive Relation Inference (SRI)
predicate
(1 × 512)
Subject feature
(1 × 512)
Object feature
(1 × 512)
![Page 41: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/41.jpg)
Comparison with Existing Methods
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” ECCV 2018.
![Page 42: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/42.jpg)
Evaluation on Object Detection
[4] Li, Yikang, et al. “Scene graph generation from objects, phrases and region captions.” ICCV 2017.
[8] Li, Yikang, et al. “Factorazable Net: An Efficient Subgraph-based framework for Scene Graph Generation .” 2018.
[9] Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." NIPS 2015.
• FRCNN-64: Faster RCNN with 64 object proposals (experiment settings in [7])
• FRCNN-300: Faster RCNN with 300 object proposals (experiment settings in [9])
• MSDN: Our proposed Multilevel Scene Description Network in [4]
• Ours-w/o-Rel: Adopt the subgraph-based framework but without relationship supervision
• Ours: Our Factorizable Net with 1 SMP (model 5 in Ablation Study)
![Page 43: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/43.jpg)
Does structure only exist for specific task?
![Page 44: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/44.jpg)
Outline
44
Back-bone model design
Introduction
FishNet(NIPS 18)
Optical guided feature (CVPR18)
Structured deep learning
![Page 45: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/45.jpg)
Current CNN Structures
Image Classification: Summarize high-level semantic information of the whole image.
Detection/Segmentation:High-level semantic meaning with high spatial resolution
Called U-Net, Hourglass, or Conv-deconv
![Page 46: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/46.jpg)
Architectures designed for different granularities are
DIVERGING
![Page 47: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/47.jpg)
Unify the advantages of networks for pixel-level, region-level, and image-level tasks
![Page 48: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/48.jpg)
Hourglass for Classification
Poor performance.
So what is the problem?
• Different tasks require different resolutions of feature
Directly applying hourglass for classification?
Features with high-level semantics and high resolution is good
• Down sample high-level features with high resolution
Our design
![Page 49: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/49.jpg)
Hourglass for Classification
1 × 1, 𝑐𝑖𝑛
3 × 3, 𝑐𝑖𝑛
1 × 1, 𝑐𝑜𝑢𝑡
1 × 1, 𝑐𝑜𝑢𝑡
𝑐𝑜𝑢𝑡
𝑆𝑡𝑟𝑖𝑑𝑒 = 2
1 × 1, 𝑐𝑖𝑛
3 × 3, 𝑐𝑖𝑛
1 × 1, 𝑐𝑖𝑛
CLow-level features
𝑐𝑜𝑢𝑡 − 𝑐𝑖𝑛 𝑐𝑜𝑢𝑡
C Concat
𝑢𝑝/𝑑𝑜𝑤𝑛 𝑠𝑎𝑚𝑝𝑙𝑒
The 𝟏 × 𝟏 convolution layer in yellowindicates the Isolated convolution.
• Hourglass may bring more isolated convolutions than ResNet
1 × 1, 𝑐𝑖𝑛
3 × 3, 𝑐𝑖𝑛
1 × 1, 𝑐𝑜𝑢𝑡
𝑐𝑜𝑢𝑡
Normal Res-BlockRes-Block for
down/up samplingOur design
• Different tasks require different resolutions of feature
![Page 50: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/50.jpg)
Observation and design
Solution1. Unify the advantages of
networks for pixel-level, region-level, and image-level tasks.
2. Design a network that does not need isolated convolution
3. Features from varying depths are preserved and refined from each other.
Bharath Hariharan, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR’15.Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." ECCV’16.
Our observation
1. Diverged structures for tasks requiring different resolutions.
2. Isolated Conv blocks the direct back-propagation
3. Features with different depths are not fully explored, or mixedbut not preserved
![Page 51: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/51.jpg)
Difference between mix and preserve and refine
High level
Low level
+ High level
Low level
Mixed features Preserve and refine
M
M
+
+
M Message generation
![Page 52: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/52.jpg)
FishNet: Overview
224x224 … 56x56 28x28 14x14 7x7 14x14 28x28 56x56 28x28 14x14 7x7 1x1
… … … … … ……
Features in the tail part
Features in the body part
Residual Blocks
Features inthe head part
Concat
Fish Tail Fish Body Fish Head
… … …
![Page 53: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/53.jpg)
FishNet: Preservation & Refinement
……
…………
……
……
TransferringBlocks T (⋅)
DR Blocks
UR Blocks
Regular Connections
……
FishTail
FishBody
FishHead
M(⋅)
𝑑𝑜𝑤𝑛(⋅)
𝑢𝑝(⋅)
𝑟(⋅)M(⋅)
……
……
Up-sampling and Refinement (UR) Blocks
Down-sampling and Refinement (DR) Blocks
Feature from varying depth refines each other here
Sum up every𝒌 adjacent
channelsConcat
Concat
Nearest neighbor up-sampling
𝟐 × 𝟐 Max-Pooling
From Tail
From Body
From Body
From Head
![Page 54: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/54.jpg)
FishNet: Performance-ImageNet
22.59%
21.93%(5.92%)
21.55%(5.86%)
21.25%(5. 76%)
23.78%(7.00%)
22.30%(6.20%)
21.69%(5.94%)
21.00%
21.50%
22.00%
22.50%
23.00%
23.50%
24.00%
10 20 30 40 50 60 70
FishNet
ResNet
Parameters, × 106
22.59%
21.93%
21.55%21.25%
23.78%
22.30%
21.69%
21.00%
21.50%
22.00%
22.50%
23.00%
23.50%
24.00%
2 4 6 8 10 12
FishNet
ResNet
FLOP, × 109
To
p-1
Erro
r
Codehttps://github.com/kevin-ssy/FishNet
![Page 55: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/55.jpg)
FishNet: Performance-ImageNet
22.59%
21.93%(5.92%)
21.55%(5.86%)
21.25%(5. 76%)
22.58%(6.35%)
22.20%(6.20%)
22.15%(6.12%)
21.20%
23.78%(7.00%)
22.30%(6.20%)
21.69%(5.94%)
21.00%
21.50%
22.00%
22.50%
23.00%
23.50%
24.00%
10 20 30 40 50 60 70
FishNet
DenseNet
ResNetTo
p-1
Erro
r
Parameters, × 106
Codehttps://github.com/kevin-ssy/FishNet
![Page 56: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/56.jpg)
FishNet: Performance on COCO Detection
38.00%
38.50%
39.00%
39.50%
40.00%
40.50%
41.00%
41.50%
42.00%
AP
R-50
RX-50
Fish-150
18.00%
18.20%
18.40%
18.60%
18.80%
19.00%
19.20%
19.40%
19.60%
19.80%
20.00%
AP-small
36.00%
36.50%
37.00%
37.50%
38.00%
38.50%
39.00%
39.50%
40.00%
40.50%
41.00%
AP-medium
47.00%
47.50%
48.00%
48.50%
49.00%
49.50%
50.00%
50.50%
AP-large
Codehttps://github.com/kevin-ssy/FishNet
![Page 57: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/57.jpg)
FishNet: Performance on COCO Instance Segmentation
34.00%
34.50%
35.00%
35.50%
36.00%
36.50%
37.00%
37.50%
AP
R-50
RX-50
Fish-150
18.00%
18.20%
18.40%
18.60%
18.80%
19.00%
19.20%
19.40%
19.60%
19.80%
20.00%
AP-small
37.00%
37.50%
38.00%
38.50%
39.00%
39.50%
40.00%
40.50%
AP-medium
47.00%
47.50%
48.00%
48.50%
49.00%
49.50%
50.00%
50.50%
AP-large
Codehttps://github.com/kevin-ssy/FishNet
![Page 58: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/58.jpg)
Winning COCO 2018 Instance Segmentation Task
![Page 59: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/59.jpg)
Visualization
![Page 60: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/60.jpg)
Visualization
![Page 61: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/61.jpg)
Visualization
![Page 62: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/62.jpg)
Visualization
![Page 63: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/63.jpg)
Visualization
![Page 64: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/64.jpg)
Codebase
• Comprehensive
RPN Fast/Faster R-CNN
Mask R-CNN FPN
Cascade R-CNN RetinaNet
More … …
• High performance
Better performance
Optimized memory consumption
Faster speed
• Handy to develop
Written with PyTorch
Modular design
√
√
√
√
√
√
√
√
√
GitHub: mmdet√
√
![Page 65: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/65.jpg)
FishNet: Advantages
• Better gradient flow to shallow layers
• High-resolution features contain rich low-level and high-level semantics
• Build up correlation among features with different semantic information
– They are preserved and refined from each other
![Page 66: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/66.jpg)
Outline
66
Back-bone model design
Introduction
FishNet(NIPS 18)
Optical guided feature (CVPR18)
Structured deep learning
![Page 67: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/67.jpg)
Action Recognition
• Recognize action from videos
![Page 68: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/68.jpg)
Optical flow in Action Recognition
• Motion is the important information
• Optical flow
– Effective
– Time consuming
Modality Acc. Speed(fps)
RGB 85.5% 680
RGB+Optical Flow 94.0% 14
We need a better motion representation
![Page 69: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/69.jpg)
Optical flow guided feature
−
![Page 70: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/70.jpg)
Optical flow guided feature
{vx , vy} = optical flow
Coefficient for optical flow:
Optical flow:
Intuitive Inspiration
![Page 71: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/71.jpg)
Optical flow guided feature
Feature flow:
{ } = feature flow
![Page 72: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/72.jpg)
Optical flow guided feature
1 × 1 Conv 1 × 1 Conv
S
C
Sobel Subtract
Concat
OOFF Unit
Fram
et
Fram
et
+ ∆
t
Cla
ssif
ica
tio
nSu
b-n
etw
ork
...
...
ResolutionK*K
ResolutionK/2*K/2
ResolutionK/4*K/4
CO
CO
...
...
CO
Feat
ure
tFe
atu
re t
+ ∆
t
...
...
OCOFF UnitConcat
C C C
![Page 73: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/73.jpg)
Optical Flow Guided Feature (OFF): Experimental results
1. OFF with only RGB inputs is comparable with the other state-of-the-art methods using optical flow as input.
0 50 100 150 200 250
RGB + Optical flow + I3D
RGB + OFF
RGB + OFF + Optical Flow
FPS
92.0 92.5 93.0 93.5 94.0 94.5 95.0 95.5 96.0
RGB + Optical flow + I3D
RGB + OFF
RGB + OFF + Optical Flow
Accuracy (%)Code:
![Page 74: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/74.jpg)
Not only for action recognition
• Also effective for
– Video object detection
– Video compression artifact removal
71.5 72 72.5 73 73.5 74 74.5 75 75.5 76
resnet+rfcn
resnet+rfcn+OFFDetection (mAP)
34.6 34.8 35 35.2 35.4 35.6 35.8 36 36.2
DnCNN
DnCNN+OFFCompression Artifact Removal (PSNR)
q40
![Page 75: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/75.jpg)
Outline
75
Back-bone model design
Introduction
FishNet(NIPS 18)
Optical guided feature (CVPR18)
Structured deep learning
![Page 76: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/76.jpg)
Take home message
76
• Structured deep learning is
– effective
– for output, features
– from observation
• End-to-end joint training bridges the gap between structure modeling and feature learning
![Page 77: Modeling deep structures for 3D scene understanding...Ranking No. 1 Hinton won ImageNet competition Classify 1.2 million images into 1,000 categories Beating existing computer vision](https://reader034.fdocument.pub/reader034/viewer/2022050113/5f4a113e91bb81620f672812/html5/thumbnails/77.jpg)