R-CNN

@inproceedings{girshick2014,
    title     = {Rich feature hierarchies for accurate object detection and semantic segmentation},
    author    = {R. Girshick, J. Donahue, T. Darrell, J. Malik},
    booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
    pages     = {580--587},
    year      = {2014}
}

R-CNN, for Region with CNN features, is a deep learning architecture used for object detection proposed by Girshick et al [1] in 2014. This architecture is composed of three modules: region proposal, feature extraction and object classification.

Architecture

Region Proposal

The region proposal module aim to localize and segment objects into bounding boxes. As explained by the authors [1], “While R-CNN is agnostic to the particular region proposal method, we use selective search”. This method will not be developed here and let the reader referred to the article [4]. For each image, around 2000 bounding boxes are extracted.

Feature extraction

For each bounding box extracted by the previous module, a 4096-d feature vector is computed by a CNN. The authors have chosen to use AlexNet to achieve this task. Since the boudning boxes have non-fixed size, they are warp into 227x227 size because the CNN requires input data of a fixed size.

The neural network is trained in two stages. First, the dataset ILSVRC 2012 [4] is used for pre-training. This dataset contains only image-level annotations, no bounding box labels. Then the network is fine tune with on warped region proposals on images from the VOC dataset. In order to adapt the CNN to this new dataset, the last layer which ouput 1000 classes (for ImageNet) is replaced by a 21 classes output layer (20 classes from VOC dataset + 1 for the background class). Once the training is finished, the last layer is removed and the SVMs are used to select the corresponding class for each bounding box.

Object Classification

Each bounding box is classified with one of the 21 classes (20 classes from VOC dataset + 1 for the background class) by using SVMs. There is one SVM per class. Finally, the bounding boxes are cleaned with a greedy non-maximum suppression applied on each class independently. A bounding box is removed it has intersection-over-union (IoU) overlap with another bounding box and the latter has a higher scoring selected region larger than a learned threshold. The image below presents an example of IoU computation between two bounding boxes.

IoU

The IoU is a number between 0 and 1 where a value near to 0 means the bounding boxes overlap slightly, and a value near to 1 means an almost perfect overlap.

According to the error analysis done by authors on their method (see section 3.3 of the original paper [1] for more information), a lot of errors come from a poor bounding box localization, which is defined by a detection with an IoU between 0.1 and 0.5 with the correct class or a duplicate. In order to reduce these errors, the authors introduced a bounding box regression which is used after the SVMs classification. It uses the feature map after the last CNN pooling layer to predict a new bounding box. More information about the bounding box regression can be found in the supplementary material.

Training

As we have seen in the previous sections, R-CNN is composed of several modules. These modules are trained one after the others in this order: CNN, SVMs and bouding box regressors.

CNN

First, the CNN used for the features extraction is “pre-trained the CNN on a large auxiliary dataset (ILSVRC 2012) with image-level annotations (i.e., no bounding box labels).” [1]. Here, the original architecture of AlexNet is used. Then the CNN is fined-tuned for the new task (detection) and the new domain (warped VOC windows). Since the fine-tuning is done on the PASCAL VOC dataset, the last 1000-way fully-connected layer is replaced by a 21-way fully-connected layer. Region proposals with $\text{IoU} \geq 0.5$ are used as positive examples and the others as negatives examples.

Pre-training parameters:

  • on the ILSVRC 2012 dataset (1000 classes).
  • SGD with learning rate at 0.01.

Fine-tuning parameters:

  • on the PASCAL VOC dataset (20 classes + 1 background class).
  • SGD with learning rate at 0.001.
  • mini-batch of size 128 (32 positie windows + 96 background windows).
  • use the log loss.

SVMs

SVMs are trained with the standard hard negative mining method. To train them, the ground truth boxes are used as positive examples while proposed boxes with $\text{IoU} < 0.3$ with a ground truth box are used as negatives examples. This choice of positive/negative examples is different from the one used to train the CNN. More information about this choice can be found in the supplementary material [2].

Bounding Box Regression

The bounding box regressors learn “a transformation that maps a proposed box $P$ to a ground-truth box $G$” [2]. The transformation is composed of four functions: two for the $x$ and $y$ translations of the box center, and two log-space translations of the width and height.

Limitations

  • the training is a multi-stage pipeline.
  • the features are written on the disk: “For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.” [5].

Bibliography

results matching ""

    No results matching ""