SPPNet
SPPNet, for Spatial Pyramid Pooling, is not a neural network architecture but a “layer” which can be added in any CNN in order to remove the fixed-size of the network [1]. The authors have also proposed to use their new “layer” to improve R-CNN object detection.
Spatial Pyramid Pooling
This technique is used to removed the fixed-size input image constraint on a network. A convolutional network is composed of two parts: the convolutional layers and the fully-connected layers. As explained by the authors [1], the convolutional layers do not require a fixed image size and can generate feature maps of any sizes. On the other hand, the fully-connected layers need to have fixedsize/length input by their definition.
We will use a simple neural network to illustrate the problem. The figure below presents the ZF-5 [2] network used with 224x224 images on the left and 180x180 images on the right.
We can see differences in the sizes of the feature maps between the left and the right but the problem is visible when we try to implement the network. Let’s try coding the network for 224x224 input images in PyTorch.
# Convolutional part
self.conv1 = nn.Conv2d(3, 96, 7, 2, 3)
self.pool1 = nn.MaxPool2d(3, 2)
self.conv2 = nn.Conv2d(96, 256, 5, 2, 2)
self.pool2 = nn.MaxPool2d(3, 2)
self.conv3 = nn.Conv2d(256, 384, 3, 1)
self.conv4 = nn.Conv2d(384, 384, 3, 1)
self.conv5 = nn.Conv2d(256, 256, 3, 1)
self.pool3 = nn.MaxPool2d(3, 2)
# Fully-connected part
self.fc1 = nn.Linear(9216, 4096)
self.fc2 = nn.Linear(4096, 4096)
self.fc3 = nn.Linear(4096, 1000)
The implementation of the convolutional layers does not involve the image size, but only the number of channels/number of feature maps. On the contrary, the fully-connected layers , and in particular the first one, takes as a first parameter the size of the previous layer. On the scheme of the network, the size of the feature map after the last pooling layer is $[1, 256, 6, 6]$ which is $256 \times 6 \times 6 = 9216$, the input of self.fc1
.
Now, if we want to use the code but with a 180x180 image, there is going to be a problem: the input size of self.fc1
is $256 \times 5 \times 5 = 6400$.
With these type of architecture it is not possible to train a network with different image size, so the authors propose to add a Spatial Pyramid Pooling (SPP) layer between the last convolutional layer and the first fully-connected layer. This SPP layer is composed of pooling layers which pool features extracted at different scales. The pooling layers do not contain weights, which means that they can adapt their parameters during the training or inference stage to the size of the input, so that the output size is always the same. The figure below presents a 3-level pyramid pooling.
The parameters of the SPP layer are:
- the level of the pyramid, which is the number of pool used, here 3.
- the output size of each pooling layer, here 4x4, 2x2 and 1x1.
The “classical” parameters of the pooling layers, which are the window size and the stride, will be automatically computed based on the input feature map size. You can find the equations below:
\[win = \lceil a/n \rceil\] \[stride = \lfloor a/n \rfloor\]for a $a \times a$ input feature map and a $n \times n$ output feature map. $n$ $\lceil . \rceil$ and $\lfloor . \rfloor$ designed ceil and floor.
Let’s apply the equations above to calculate the parameters of the three pooling layers in the SPP layer for a 224x224 input image:
- 4x4 output: $win = \lceil 6/4 \rceil = 4$, $stride = \lfloor 5/4 \rfloor = 3$
- 2x2 output: $win = \lceil 6/2 \rceil = 7$, $stride = \lfloor 5/2 \rfloor = 6$
- 1x1 output: $win = \lceil 6/1 \rceil = 13$, $stride = \lfloor 13/1 \rfloor = 13$
The same for a 180x180 input image:
- 4x4 output: $win = \lceil 5/4 \rceil = 4$, $stride = \lfloor 13/4 \rfloor = 3$
- 2x2 output: $win = \lceil 5/2 \rceil = 7$, $stride = \lfloor 13/2 \rfloor = 6$
- 1x1 output: $win = \lceil 5/1 \rceil = 13$, $stride = \lfloor 13/1 \rfloor = 13$
Object Detection
The authors of the paper [1] also proposed to use SPPNet to improve the execution time of the R-CNN testing stage. R-CNN extract feature maps for each region proposals in an image. With about 2000 regions per image, the features extraction is time-consuming but it can be reduced since these regions come from the same image. The authors proposed to extract a feature map from the entire image and then apply a spatial pyramid pooling on each region. Here, ZF-5 [2] is used rather than AlexNet in R-CNN.
Limitations
There are several limitations of SPPNet (that can be found in the Fast R-CNN paper [3]). Some are shared with R-CNN, while the last one is specific to this network:
- the training is a multi-stage pipeline.
- the features are written on the disk.
- only the fully-connected layers can be fine-tuned, and not the convolutional layers above the spatial pyramid pooling. Actually, this statement should be taken with a grain of salt. Here are some statements in the SPPNet [1] paper and the Fast R-CNN [3] paper that suggest that is not totally impossible to fine-tune the convolutional layers:
- “[…] for simplicity we only fine-tune the fully-connected layers” [1].
- “[…] back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained.” [3].
- “For the less deep networks considered in the SPPnet paper [11], fine-tuning only the fully connected layers appeared to be sufficient for good accuracy.” [3].