
Modeling deep structures for using high performance images

Wanli Ouyang (欧阳万里)

The University of Sydney

Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
  – Structured features: Scene parsing and depth estimation (CVPR'18)
  – Structured samples: 3D human pose estimation (CVPR'18)
• Back-bone model design
• Conclusion

Outline

• Introduction
• Effectively using high performance images
  – Pedestrian detection (CVPR'17)
  – Scene parsing and depth estimation (CVPR'18)
  – 3D human pose estimation (CVPR'18)
• Back-bone model design
  – Image/video classification (CVPR'18, NIPS'18)
• Conclusion

Deep learning

• Ranked No. 1 in MIT Tech Review's Top 10 Breakthroughs of 2013
• Hinton's team won the ImageNet competition: classifying 1.2 million images into 1,000 categories, beating existing computer vision methods by more than 20%, and later surpassing human performance
• Holds records on most computer vision problems
• Applications: web-scale visual search, self-driving cars, surveillance, multimedia, …
• Deep models simulate brain activities and employ millions of neurons to fit billions of training samples; deep neural networks are trained on GPU clusters with tens of thousands of processors

Performance vs. practical need

Conventional model → Deep model → Very deep model → Very deep structured learning

Many other applications, e.g. face recognition


Structure in neurons

• Conventional neural networks

– Neurons in the same layer have no connections

– Neurons in adjacent layers are fully connected, at least within a local region

Structure exists in the brain

Structure in data

Correlation

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
• Conclusion

Effectively using high performance imaging data

• Training: multi-modal data
• Deployment: RGB only

Image from https://research.csiro.au/data61/high-performance-imaging/

Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
  – Structured features: Scene parsing and depth estimation (CVPR'18)
  – Structured samples: 3D human pose estimation (CVPR'18)
• Conclusion

Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
• Conclusion

Motivation

• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations

[Figure: RGB and thermal image pairs; hard positive samples and hard negative samples]

Dan Xu, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, Nicu Sebe, "Learning Cross-Modal Deep Representations for Robust Pedestrian Detection", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Hawaii, USA.

Motivation

• Challenging open issues in pedestrian detection: illumination variation, shadows, background clutter, and low external light
• Exploiting thermal data in addition to RGB data for learning cross-modal representations
• Can we transfer the learned cross-modal representations when thermal data is missing at deployment time?

Cross-modality transfer

• Deep reconstruction: a reconstruction network is trained to map RGB data to thermal data
• Deep detection: a detection network operates on RGB data alone, reusing the cross-modal representations learned by the reconstruction network

Approach

• RRN (reconstruction network)
  – RGB domain to thermal domain
  – weakly supervised reconstruction
  – region-based instead of frame-level based
• MSDN (detection network)
  – cross-modal multi-scale feature fusion
  – the parameters of the sub-network in the yellow box are transferred from RRN (a parameter-transfer sketch follows)
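A minimal PyTorch-style sketch of this transfer idea, assuming a shared RGB backbone, a hypothetical ReconstructionHead that regresses thermal regions, and a detection backbone initialised from the reconstruction weights; layer choices are illustrative, not the paper's exact RRN/MSDN architecture.

```python
import torch
import torch.nn as nn
import torchvision

# Stage 1 components: an RGB backbone plus a reconstruction head (RRN-like).
backbone = torchvision.models.vgg16(weights=None).features   # RGB feature extractor

class ReconstructionHead(nn.Module):
    """Regresses a thermal region from RGB features (weak, region-level supervision)."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 3, padding=1))    # single-channel thermal output

    def forward(self, feat):
        return self.decode(feat)

recon_head = ReconstructionHead()
recon_loss = nn.MSELoss()

def reconstruction_step(rgb_regions, thermal_regions, optimizer):
    """One training step of the reconstruction branch on paired RGB/thermal regions."""
    feat = backbone(rgb_regions)
    pred = nn.functional.interpolate(recon_head(feat), size=thermal_regions.shape[-2:])
    loss = recon_loss(pred, thermal_regions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Stage 2: the detection network (MSDN-like) reuses the cross-modal backbone weights,
# so at deployment time only RGB input is needed.
detection_backbone = torchvision.models.vgg16(weights=None).features
detection_backbone.load_state_dict(backbone.state_dict())    # parameter transfer
```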

Results - Caltech

• Demonstrated the effectiveness of the learned cross-modal representations

• Achieved superior detection performance

Average miss rate (lower is better):

CMT-CNN-SA   13.76%
CMT-CNN      10.69%

Results - KAIST

Average miss rate (lower is better):

Method                     All      Day      Night
CMT-CNN-SA                 54.26%   52.44%   58.97%
CMT-CNN-SA-SB(Random)      56.76%   54.83%   61.24%
CMT-CNN-SA-SB(ImageNet)    52.15%   50.71%   57.65%
CMT-CNN                    49.55%   47.30%   54.78%

• Demonstrated the effectiveness of the learned cross-modal representations

• Achieved superior detection performance

Qualitative results

[Figure: detection results from ACF, the proposed model without the reconstruction network, and the proposed model with the reconstruction network]

Effectively using high performance images

Structured features / Feature fusion

Motivation

From a single RGB image: segmentation? depth estimation?

Dan Xu, Wanli Ouyang, Xiaogang Wang, Nicu Sebe, "PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018).

Motivation

• Multi-task learning? Directly optimizing multiple tasks (e.g. segmentation and depth estimation) on the input RGB training data does not guarantee a consistent gain on all the tasks

Motivation

• Multi-modal input data improve the training of deep networks
• Can we facilitate the final tasks (segmentation and depth estimation) by leveraging intermediate multi-task predictions (depth, surface normals, semantic labels, contours) from a CNN, while requiring only a single modality (RGB) as input?

Approach

Illustration of the proposed multi-task distillation network for simultaneous depth estimation and scene parsing:

[Figure: RGB → front-end encoder → multi-task predictions via deconvolution heads (losses L1–L4) → multi-task distillation → decoder → outputs (losses L5, L6)]

Approach

• Different multi-task distillation modules:
  – A: naive implementation via feature concatenation
  – B: passing messages between feature maps
  – C: attention-guided message passing (sketched below)
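A hedged sketch of the attention-guided variant (module C), assuming two task-specific feature maps of the same shape; the convolution sizes and the single-direction fusion are illustrative rather than the exact PAD-Net module.

```python
import torch
import torch.nn as nn

class AttentionMessagePassing(nn.Module):
    """Fuses a message from task B into task A's features, gated by an
    attention map predicted from task A's own features."""
    def __init__(self, channels):
        super().__init__()
        self.message = nn.Conv2d(channels, channels, 3, padding=1)      # message from task B
        self.attention = nn.Sequential(                                 # gate predicted from task A
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        gate = self.attention(feat_a)               # values in (0, 1)
        return feat_a + gate * self.message(feat_b)

# Example: refine scene-parsing features with a gated message from depth features.
fuse = AttentionMessagePassing(channels=64)
seg_feat = torch.randn(1, 64, 60, 80)
depth_feat = torch.randn(1, 64, 60, 80)
refined = fuse(seg_feat, depth_feat)                # same shape as seg_feat
```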

Results for scene parsing on Cityscapes

Scene parsing IoU:

Front-end + DE (baseline)             0.725
Front-end + DE + SP (baseline)        0.723
PAD-Net (Distillation A + SP)         0.736
PAD-Net (Distillation B + SP)         0.748
PAD-Net (Distillation C + SP)         0.756
PAD-Net (Distillation C + DE + SP)    0.761

• Plain multi-task learning results in a decrease of IoU accuracy
• Distillation improves accuracy; message passing with attention performs best

Results with Different Front-End CNNs on NYUD-V2

[Figures: qualitative results on NYUD-V2 with different front-end CNNs, comparing estimations with ground truth (GT)]

Effectively using high performance images

Structured samples / Structured features / Feature fusion


Challenges: No Annotation

• Constrained scenes: 3D datasets with ground-truth annotation
• In-the-wild scenes: no annotation, and a domain discrepancy from the constrained data

Which predicted pose is more plausible? Let a discriminator judge.

Weakly Supervised Adversarial Learning

• Training data: a 3D dataset with ground truth, plus images without ground truth
• Generator G: the 3D human pose estimator, producing pose predictions
• Discriminator D: a multi-source discriminator that separates ground-truth poses (real) from predictions (fake)

Adversarial Learning

• Generator: tries to fool the discriminator; Loss_G combines a Euclidean (pose regression) loss with the adversarial term
• Discriminator: tries to tell ground truth from predictions; Loss_D is a classification loss (a training-loop sketch follows)
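A minimal sketch of this adversarial training loop, assuming a `generator` that maps images to poses and a `discriminator` that scores (image, pose) pairs; the network definitions, the weighting `w_adv`, and the handling of unannotated in-the-wild images are illustrative assumptions rather than the exact training schedule.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # discriminator classification loss
mse = nn.MSELoss()             # generator Euclidean (regression) loss

def adversarial_step(images, gt_pose, images_wild,
                     generator, discriminator, opt_g, opt_d, w_adv=0.1):
    # --- Discriminator: tell ground-truth poses (real) from predictions (fake) ---
    with torch.no_grad():
        fake = generator(images)
        fake_wild = generator(images_wild)          # in-the-wild images without 3D labels
    d_real = discriminator(images, gt_pose)
    d_fake = discriminator(images, fake)
    d_wild = discriminator(images_wild, fake_wild)
    loss_d = bce(d_real, torch.ones_like(d_real)) \
           + bce(d_fake, torch.zeros_like(d_fake)) \
           + bce(d_wild, torch.zeros_like(d_wild))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator: regress the annotated poses and fool the discriminator ---
    pred = generator(images)
    pred_wild = generator(images_wild)
    d_out = discriminator(images_wild, pred_wild)
    loss_g = mse(pred, gt_pose) + w_adv * bce(d_out, torch.ones_like(d_out))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```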


Generator

• A 256 × 256 input image passes through convolution and residual blocks (256 × 256 → 128 × 128 → 64 × 64), then through stacked hourglass modules (Stack 1 … Stack n)
• 2D module: predicts 2D score maps
• Depth module: predicts per-joint depth, yielding 3D poses

Multi-Source Discriminator

• Source 1 – image: a CNN encodes the input image I
• Source 2 – raw poses: a CNN encodes the predicted 2D heatmaps and depth maps
• Source 3 – geometric descriptor: pairwise joint offsets [Δx, Δy, Δz] and their squares [Δx², Δy², Δz²], computed from the pose P and encoded by fully connected layers (a descriptor sketch follows)
• The three sources are concatenated and classified as real (ground truth) or fake (prediction)
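A small sketch of the pairwise geometric descriptor suggested by the slide, computed from a (J, 3) joint tensor; the exact descriptor in the paper (e.g. normalisation and joint pairing) may differ.

```python
import torch

def geometric_descriptor(pose):
    """Pairwise offsets [dx, dy, dz] and their squares for a 3D pose.

    pose: tensor of shape (J, 3) holding (x, y, z) joint locations.
    Returns a flattened descriptor over all unique joint pairs.
    """
    diff = pose.unsqueeze(0) - pose.unsqueeze(1)        # (J, J, 3) pairwise offsets
    j = pose.shape[0]
    idx = torch.triu_indices(j, j, offset=1)            # unique joint pairs
    pairs = diff[idx[0], idx[1]]                        # (J*(J-1)/2, 3)
    return torch.cat([pairs, pairs ** 2], dim=-1).reshape(-1)

# Example: 16 joints -> descriptor of length 16*15/2 * 6 = 720
descriptor = geometric_descriptor(torch.randn(16, 3))
print(descriptor.shape)   # torch.Size([720])
```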

Effectiveness of Adversarial Learning


Ablation Study on H36M Dataset

MPJPE (mean per joint position error, in mm) on H36M; lower is better:

State-of-the-art (Zhou et al., ICCV'17)    64.9
Baseline, fix 2D, finetune depth           65.2
Baseline, jointly learn 2D + depth         64.8
Ours, Image + Pose                         61.3
Ours, Image + Geo                          60.3
Ours, Full (Image + Pose + Geo)            59.7

The full model gives 8% less error than the state of the art (Zhou et al., ICCV'17).

Results on Images in the Wild

[Figure: qualitative comparison on in-the-wild images, baseline vs. ours]

Multi-view Results


Outline

• Introduction
• Effectively using high performance images
  – Feature fusion: Pedestrian detection (CVPR'17)
  – Structured features: Scene parsing and depth estimation (CVPR'18)
  – Structured samples: 3D human pose estimation (CVPR'18)
• Conclusion

Does structure only exist for a specific task?

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS'18)
  – Optical flow guided feature (CVPR'18)

Low-level and high-level features

Image from Andrew Ng’s slides

Current CNN Structures

• Image classification: summarize high-level semantic information of the whole image
• Detection/segmentation: high-level semantic meaning with high spatial resolution; the corresponding architectures are called U-Net, Hourglass, or conv-deconv
• Architectures designed for different granularities are DIVERGING

Goal: unify the advantages of networks for pixel-level, region-level, and image-level tasks

Hourglass for Classification

• Directly applying an hourglass network to classification? Poor performance. So what is the problem?
• Different tasks require features at different resolutions
• Features with high-level semantics and high resolution are good
• Our design: down-sample the high-resolution, high-level features for classification

Hourglass for Classification

• Normal Res-Block: 1 × 1 (c_in) → 3 × 3 (c_in) → 1 × 1 (c_out), with an identity shortcut
• Res-Block for down/up sampling: the shortcut needs a 1 × 1 (c_out) convolution with stride 2 (or up-sampling); this 1 × 1 convolution on the shortcut is an isolated convolution, and the hourglass may bring more isolated convolutions than ResNet
• Our design: 1 × 1 (c_in) → 3 × 3 (c_in) → 1 × 1 (c_in) on the convolution path, while up/down-sampled low-level features (c_out − c_in channels) are concatenated to reach c_out channels, avoiding the isolated convolution (a block sketch follows)
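A hedged sketch contrasting the two shortcut choices, assuming a simple bottleneck body; the channel split and the parameter-free max-pooling are one possible way to realise the concatenation idea, not the exact FishNet block.

```python
import torch
import torch.nn as nn

class StridedShortcutBlock(nn.Module):        # ResNet-style: has an isolated convolution
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_in, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1))
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2)   # isolated convolution
    def forward(self, x):
        return self.body(x) + self.shortcut(x)

class ConcatShortcutBlock(nn.Module):         # concatenation keeps low-level features intact
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_in, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out - c_in, 1))
        self.pool = nn.MaxPool2d(2)           # parameter-free down-sampling
    def forward(self, x):
        x = self.pool(x)
        return torch.cat([self.body(x), x], dim=1)   # low-level features preserved

x = torch.randn(1, 64, 56, 56)
print(StridedShortcutBlock(64, 128)(x).shape)   # torch.Size([1, 128, 28, 28])
print(ConcatShortcutBlock(64, 128)(x).shape)    # torch.Size([1, 128, 28, 28])
```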

Observation and design

Our observations:
1. Diverged structures for tasks requiring features at different resolutions
2. Isolated convolutions block direct back-propagation
3. Features from different depths are not fully explored, or are mixed but not preserved

Our solutions:
1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks
2. Design a network that does not need isolated convolutions
3. Preserve features from varying depths and let them refine each other

Bharath Hariharan, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR'15.
Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." ECCV'16.

Difference between mixing and preserve-and-refine

• Mixed features: high-level and low-level features are simply added (+), so their identities are lost
• Preserve and refine: high-level and low-level features are both kept, and generated messages (M) let them refine each other

FishNet: Overview

• Fish tail: residual blocks down-sample the 224×224 input through 56×56, 28×28, and 14×14 to 7×7 feature maps
• Fish body: up-samples back through 14×14 and 28×28 to 56×56, concatenating features from the tail part
• Fish head: down-samples again through 28×28, 14×14, and 7×7 to a 1×1 output, concatenating features from the body part

FishNet: Preservation & Refinement

• The fish tail, fish body, and fish head are connected by transferring blocks T(·) and regular connections
• UR (Up-sampling and Refinement) blocks in the body: features from the tail are concatenated with the current features, refined by M(·) and r(·) (a channel-wise reduction that sums every k adjacent channels), and up-sampled with nearest-neighbour up-sampling up(·)
• DR (Down-sampling and Refinement) blocks in the head: features from the body are concatenated with the current features, refined by M(·), and down-sampled with 2 × 2 max-pooling down(·)
• Features from varying depths refine each other here (a UR-block sketch follows)
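A hedged sketch of the UR-block idea, assuming body and tail features of the same spatial size; the refinement M(·) and channel reduction r(·) are simplified and channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class URBlock(nn.Module):
    """Concatenates tail features with body features, applies a refinement
    convolution M(.), reduces channels by summing every k adjacent channels
    r(.), and up-samples with nearest-neighbour interpolation up(.)."""
    def __init__(self, channels, k=2):
        super().__init__()
        self.k = k
        self.message = nn.Sequential(            # M(.): residual-style refinement
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def channel_reduce(self, x):                 # r(.): sum every k adjacent channels
        n, c, h, w = x.shape
        return x.view(n, c // self.k, self.k, h, w).sum(dim=2)

    def forward(self, body_feat, tail_feat):
        x = torch.cat([body_feat, tail_feat], dim=1)    # preserve both feature sets
        x = self.channel_reduce(x + self.message(x))    # refine, then reduce channels
        return F.interpolate(x, scale_factor=2, mode="nearest")   # up(.)

# Example: body (N, 256, 14, 14) + tail (N, 256, 14, 14) -> (N, 256, 28, 28)
ur = URBlock(channels=512, k=2)
out = ur(torch.randn(1, 256, 14, 14), torch.randn(1, 256, 14, 14))
print(out.shape)   # torch.Size([1, 256, 28, 28])
```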

FishNet: Performance-ImageNet

Top-1 error (top-5 error) on ImageNet, plotted against parameters (×10^6) and FLOPs (×10^9):

FishNet: 22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
ResNet:  23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)

FishNet reaches lower top-1 error than ResNet at comparable parameter counts and FLOPs.

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance-ImageNet

Top-1 error (top-5 error) on ImageNet, plotted against parameters (×10^6):

FishNet:  22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%)
DenseNet: 22.58% (6.35%), 22.20% (6.20%), 22.15% (6.12%), 21.20%
ResNet:   23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%)

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance on COCO Detection

[Bar charts: AP, AP-small, AP-medium, and AP-large on COCO detection with R-50, RX-50, and Fish-150 backbones]

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance on COCO Instance Segmentation

[Bar charts: AP, AP-small, AP-medium, and AP-large on COCO instance segmentation with R-50, RX-50, and Fish-150 backbones]

Code: https://github.com/kevin-ssy/FishNet

Winning COCO 2018 Instance Segmentation Task

Visualization


Codebase

• Comprehensive

RPN, Fast/Faster R-CNN, Mask R-CNN, FPN, Cascade R-CNN, RetinaNet, and more

• High performance

Better performance

Optimized memory consumption

Faster speed

• Handy to develop

Written with PyTorch

Modular design

GitHub: mmdet

FishNet: Advantages

1. Better gradient flow to shallow layers

2. High-resolution features contain rich low-level and high-level semantics

3. Features from varying depths are preserved and refined by each other

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS'18)
  – Optical flow guided feature (CVPR'18)

Action Recognition

• Recognize action from videos

Optical flow in Action Recognition

• Motion is important information for recognizing actions

• Optical flow

– Effective

– Time consuming

Modality              Accuracy   Speed (fps)
RGB                   85.5%      680
RGB + Optical Flow    94.0%      14

We need a better motion representation

Optical flow guided feature

Intuitive inspiration: the brightness constancy assumption

I(x, y, t) = I(x + Δx, y + Δy, t + Δt)

leads to the optical flow constraint

(∂I/∂x) v_x + (∂I/∂y) v_y + ∂I/∂t = 0,

where {v_x, v_y} is the optical flow and [∂I/∂x, ∂I/∂y, ∂I/∂t] are the coefficients for the optical flow.

Applying the same constraint to a feature map f(I) instead of the raw image gives the feature flow

(∂f/∂x) ṽ_x + (∂f/∂y) ṽ_y + ∂f/∂t = 0,

where {ṽ_x, ṽ_y} is the feature flow. The optical flow guided feature (OFF) is the coefficient vector [∂f/∂x, ∂f/∂y, ∂f/∂t], which encodes motion at the feature level.

OFF unit and architecture

• Inputs: the features of frame t and frame t + Δt, each passed through a 1 × 1 convolution
• Spatial gradients are obtained with Sobel operators; the temporal gradient is the element-wise subtraction of the two feature maps
• The gradient maps are concatenated to form the OFF
• OFF units are applied at several feature resolutions (K × K, K/2 × K/2, K/4 × K/4), and their outputs feed a classification sub-network (an OFF-unit sketch follows)
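A hedged sketch of an OFF unit following the description above, with fixed Sobel kernels for the spatial gradients and a feature subtraction for the temporal gradient; the 1 × 1 reduction width and the lack of kernel normalisation are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OFFUnit(nn.Module):
    """Spatial gradients via depthwise Sobel filters plus a temporal gradient
    via feature subtraction, concatenated into the OFF."""
    def __init__(self, channels, reduced=32):
        super().__init__()
        self.reduce_t0 = nn.Conv2d(channels, reduced, 1)   # 1x1 conv on frame t
        self.reduce_t1 = nn.Conv2d(channels, reduced, 1)   # 1x1 conv on frame t + Δt
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel_x.view(1, 1, 3, 3).repeat(reduced, 1, 1, 1))
        self.register_buffer("ky", sobel_x.t().contiguous().view(1, 1, 3, 3).repeat(reduced, 1, 1, 1))
        self.reduced = reduced

    def forward(self, feat_t, feat_t1):
        f0 = self.reduce_t0(feat_t)
        f1 = self.reduce_t1(feat_t1)
        gx = F.conv2d(f0, self.kx, padding=1, groups=self.reduced)   # ∂f/∂x (Sobel)
        gy = F.conv2d(f0, self.ky, padding=1, groups=self.reduced)   # ∂f/∂y (Sobel)
        gt = f1 - f0                                                  # ∂f/∂t (subtract)
        return torch.cat([gx, gy, gt], dim=1)                         # the OFF

off = OFFUnit(channels=256)
f_t, f_t1 = torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28)
print(off(f_t, f_t1).shape)   # torch.Size([1, 96, 28, 28])
```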

Optical Flow Guided Feature (OFF): Experimental results

1. OFF with only RGB inputs is comparable with the other state-of-the-art methods using optical flow as input.

[Bar charts: speed (FPS) and accuracy (%) for RGB + Optical flow + I3D, RGB + OFF, and RGB + OFF + Optical Flow]

Code:

Not only for action recognition

• Also effective for

– Video object detection

– Video compression artifact removal

[Bar charts: video object detection mAP for resnet+rfcn vs. resnet+rfcn+OFF, and compression artifact removal PSNR at quality q40 for DnCNN vs. DnCNN+OFF]

Outline

• Introduction
• Effectively using high performance images
• Back-bone model design
  – FishNet (NIPS'18)
  – Optical flow guided feature (CVPR'18)

Take home message

• Structured deep learning is effective
• High performance imaging can be used effectively as privileged information by exploring structured information at
  – the sample level
  – the feature level

• End-to-end joint training bridges the gap between structure modeling and feature learning

Joint work with

Xiaogang Wang Nicu Sebe Elisa Ricci

Wei Yang Dan Xu Shuyang Sun