Download - Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

Transcript

Detection: From R-CNN to Fast R-CNN

Reporter: Liliang Zhang

第 1页 | 共 25 页

Object Detection: Intuition

Detection ≈ Localization + Classification

第 2页 | 共 25 页

Page 3: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

Outline

• R-CNN• SPP-Net• Fast R-CNN

第 3页 | 共 25 页

Page 4: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

Outline

• R-CNN• SPP-Net• Fast R-CNN

第 4页 | 共 25 页

Page 5: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

R-CNN: Pipeline Overview

Step1. Input an imageStep2. Use selective search to obtain ~2k proposalsStep3. Warp each proposal and apply CNN to extract its featuresStep4. Adopt class-specified SVM to score each proposalStep5. Rank the proposals and use NMS to get the bboxes. Step6. Use class-specified regressors to refine the bboxes’ positions.Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR14

第 5页 | 共 25 页

Page 6: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

R-CNN: Performance in PASCAL VOC07

• AlexNet(T-Net): 58.5 mAP

• VGG-Net(O-Net): 66.0 mAP

第 6页 | 共 25 页

Page 7: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

R-CNN: Limitation

• TOO SLOWWWW !!! (13s/image on a GPU or 53s/image on a CPU, and VGG-Net 7x slower)

• Proposals need to be warped to a fixed size.

第 7页 | 共 25 页

Page 8: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

Outline

• R-CNN• SPP-Net• Fast R-CNN

第 8页 | 共 25 页

Page 9: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

SPP-Net: Motivation

• Cropping may loss some information about the object

• Warpping may change the object’s appearance

He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, TPAMI15

第 9页 | 共 25 页

Page 10: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

SPP-Net: Spatial Pyramid Pooling (SPP) Layer

• FC layer need a fixed-length input while conv layer can be adapted to arbitrary input size.

• Thus we need a bridge between the conv and FC layer.• Here comes the SPP layer.

第 10页 | 共 25 页

Page 11: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

SPP-Net: Training for Detection(1)

Conv5 feature

map

Conv5 feature

map

Conv5 feature

map

Image Pyramid FeatMap Pyramids

conv

Step1. Generate a image pyramid and exact the conv FeatMap of the whole image

第 11页 | 共 25 页

Page 12: Detection: From R-CNN to Fast R-CNN Reporter: Liliang Zhang 第 1 页 | 共 25 页.

SPP-Net: Training for Detection(2)

• Step 2, For each proposal, walking the image pyramid and find a project version that has a number of pixels closest to 224x224. (For scaling invariance in training.)

• Step 3, find the corresponding FeatMap in Conv5 and use SPP layer to pool it to a fix size.

• Step 4, While getting all the proposals’ feature, fine-tune the FC layer only.

• Step 5, Train the class-specified SVM

第 12页 | 共 25 页