Detection: From R-CNN to Fast R-CNN
Reporter: Liliang Zhang
Object Detection: Intuition
Detection ≈ Localization + Classification
Outline
• R-CNN
• SPP-Net
• Fast R-CNN
R-CNN: Pipeline Overview
Step 1. Input an image.
Step 2. Use selective search to obtain ~2k proposals.
Step 3. Warp each proposal to a fixed size and apply a CNN to extract its features.
Step 4. Score each proposal with a class-specific SVM.
Step 5. Rank the proposals and apply NMS to get the bboxes.
Step 6. Refine the bboxes' positions with class-specific regressors.
Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR14
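Step 5's non-maximum suppression can be sketched in a few lines. This is a pure-NumPy sketch; the function name and the 0.3 IoU threshold are illustrative choices, not taken from the paper:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS sketch: keep the highest-scoring box, drop boxes that
    overlap it too much, repeat.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) SVM scores.
    Returns indices of kept boxes, highest-scoring first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # rank proposals by score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # suppress boxes whose IoU with the kept box exceeds the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```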
R-CNN: Performance in PASCAL VOC07
• AlexNet(T-Net): 58.5 mAP
• VGG-Net(O-Net): 66.0 mAP
R-CNN: Limitation
• TOO SLOW! (13s/image on a GPU or 53s/image on a CPU, and VGG-Net is ~7x slower)
• Proposals need to be warped to a fixed size.
Outline
• R-CNN
• SPP-Net
• Fast R-CNN
SPP-Net: Motivation
• Cropping may lose some information about the object.
• Warping may change the object's appearance.
He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, TPAMI15
SPP-Net: Spatial Pyramid Pooling (SPP) Layer
• FC layers need a fixed-length input, while conv layers adapt to arbitrary input sizes.
• Thus we need a bridge between the conv and FC layers.
• Here comes the SPP layer.
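The bridge can be sketched as follows: whatever the feature map's spatial size, pooling over a fixed pyramid of grids yields a fixed-length vector. This is a minimal max-pooling sketch assuming 1x1, 2x2, and 4x4 pyramid levels; the bin-edge rounding is my own choice:

```python
import numpy as np

def spp(feat, levels=(1, 2, 4)):
    """Spatial pyramid pooling sketch (pure NumPy, max pooling).

    feat: (C, H, W) conv feature map of arbitrary H, W.
    Returns a fixed-length vector of C * sum(l*l for l in levels) values,
    independent of H and W.
    """
    C, H, W = feat.shape
    out = []
    for l in levels:
        # split H and W into l roughly equal bins and max-pool each cell
        h_edges = np.linspace(0, H, l + 1).round().astype(int)
        w_edges = np.linspace(0, W, l + 1).round().astype(int)
        for i in range(l):
            for j in range(l):
                cell = feat[:, h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)
```

Two feature maps of different sizes both come out as vectors of length C * (1 + 4 + 16), which is what lets a fixed FC layer follow conv layers of any input size.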
SPP-Net: Training for Detection(1)
[Figure: image pyramid → conv layers → pyramid of Conv5 feature maps]
Step 1. Generate an image pyramid and extract the conv feature map of the whole image at each scale.
SPP-Net: Training for Detection(2)
• Step 2. For each proposal, walk the image pyramid and find the projected version whose pixel count is closest to 224x224. (For scale invariance in training.)
• Step 3. Find the corresponding region in the Conv5 feature map and use the SPP layer to pool it to a fixed size.
• Step 4. After extracting all the proposals' features, fine-tune the FC layers only.
• Step 5. Train the class-specific SVMs.
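Step 2's scale selection can be sketched as below. Only the "closest to 224x224" rule comes from the slide; the list of scale factors is illustrative, not the paper's exact pyramid:

```python
def best_scale(w, h, factors=(0.5, 0.75, 1.0, 1.5, 2.0), target=224):
    """For a proposal of size w x h at the original resolution, pick the
    image-pyramid scale factor whose projected proposal area is closest
    to target x target pixels (illustrative factor list)."""
    return min(factors, key=lambda f: abs((w * f) * (h * f) - target * target))
```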
SPP-Net: Testing for Detection
• Almost the same as R-CNN, except Step 3.
SPP-Net: Performance
• Speed: 64x faster than R-CNN using one scale, and 24x faster using a five-scale pyramid.
• mAP: +1.2 mAP vs. R-CNN.
SPP-Net: Limitation
1. Training is a multi-stage pipeline.
2. Training is expensive in space and time.
[Figure: Conv layers → FC layers → SVM / regressor, with features stored to disk in between]
Outline
• R-CNN
• SPP-Net
• Fast R-CNN
Fast R-CNN: Motivation
Ross Girshick, Fast R-CNN, Arxiv tech report
JOINT TRAINING!!
Fast R-CNN: Joint Training Framework
Train the feature extractor, classifier, and regressor jointly in a unified framework.
Fast R-CNN: RoI pooling layer
≈ a single-scale SPP layer
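A minimal sketch of RoI pooling as a single-level SPP: each RoI on the shared feature map is divided into a fixed grid and max-pooled, so every RoI yields the same-sized output. Pure NumPy; the 7x7 grid and bin rounding are illustrative:

```python
import numpy as np

def roi_pool(feat, roi, out=7):
    """RoI pooling sketch: one pyramid level over one region.

    feat: (C, H, W) conv feature map; roi: (x1, y1, x2, y2) in feature-map
    coordinates. Returns (C, out, out) regardless of the RoI's size.
    """
    x1, y1, x2, y2 = roi
    region = feat[:, y1:y2, x1:x2]
    C, h, w = region.shape
    # split the region into an out x out grid of roughly equal bins
    ys = np.linspace(0, h, out + 1).round().astype(int)
    xs = np.linspace(0, w, out + 1).round().astype(int)
    pooled = np.empty((C, out, out))
    for i in range(out):
        for j in range(out):
            pooled[:, i, j] = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return pooled
```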
Fast R-CNN: Regression Loss
A smooth L1 loss, which is less sensitive to outliers than the L2 loss.
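Per regression coordinate x, the loss is 0.5x² when |x| < 1 and |x| − 0.5 otherwise: quadratic near zero, linear for outliers so large errors don't dominate the gradient. A minimal sketch:

```python
def smooth_l1(x):
    """Fast R-CNN smooth L1 loss for one regression offset x:
    0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5
```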
Fast R-CNN: Scale Invariance
• Two options: image pyramids (multi-scale) vs. brute force (single scale).
[Figure: single conv feature map shared by all RoIs]
• In practice, a single scale is good enough. (The main reason it is ~10x faster than SPP-Net.)
Fast R-CNN: Other tricks
• SVD on FC layers: 30% speed-up at test time with a small performance drop.
• Which layers to fine-tune? Fixing the shallow conv layers reduces training time with a small performance drop.
• Data augmentation: using VOC12 as an additional training set boosts mAP by ~3%.
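The SVD trick factorizes an FC weight matrix W (u x v) into two smaller matrices, replacing one big matmul with two cheap ones and cutting parameters from u·v to t·(u + v). A sketch with illustrative shapes and truncation rank:

```python
import numpy as np

def svd_compress(W, t):
    """Truncated SVD of an FC weight matrix W (u x v): return factors
    W2 (u x t) and W1 (t x v) with W ~= W2 @ W1, so the layer y = W @ x
    becomes two smaller layers y = W2 @ (W1 @ x)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :t] * s[:t], Vt[:t]
```

When W is (effectively) low rank, the factorization is nearly lossless, which is why the test-time performance drop is small.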
Fast R-CNN: Performance
• Without data augmentation, mAP improves by just +0.9 on VOC07.
• But training and testing have been greatly sped up. (training 9x, testing 213x vs. R-CNN)
• Without data augmentation, mAP improves by +2.3 on VOC12.
Fast R-CNN: Discussion on #proposals
Are more proposals always better ?
NO!
Thanks