VGGNet
VGGNet, also called VGG (after the Visual Geometry Group at the University of Oxford, where it was developed), is a convolutional neural network that is deeper than AlexNet. VGG-16 is the most popular variant of this network.
You can find the variants of the VGGNet on this page.
Network features:
- 138M parameters
- 16 layers (13 convolutional + 3 fully connected)
- uses ReLU as the activation function
Architecture
Parameters
Layer | Activation Shape | Filter Shape | Weights | Biases | Parameters |
---|---|---|---|---|---|
Input | 3x224x224 | - | 0 | 0 | 0 |
Conv1 | 64x224x224 | 3x3 | 1,728 | 64 | 1,792 |
Conv2 | 64x224x224 | 3x3 | 36,864 | 64 | 36,928 |
Pool1 | 64x112x112 | - | 0 | 0 | 0 |
Conv3 | 128x112x112 | 3x3 | 73,728 | 128 | 73,856 |
Conv4 | 128x112x112 | 3x3 | 147,456 | 128 | 147,584 |
Pool2 | 128x56x56 | - | 0 | 0 | 0 |
Conv5 | 256x56x56 | 3x3 | 294,912 | 256 | 295,168 |
Conv6 | 256x56x56 | 3x3 | 589,824 | 256 | 590,080 |
Conv7 | 256x56x56 | 3x3 | 589,824 | 256 | 590,080 |
Pool3 | 256x28x28 | - | 0 | 0 | 0 |
Conv8 | 512x28x28 | 3x3 | 1,179,648 | 512 | 1,180,160 |
Conv9 | 512x28x28 | 3x3 | 2,359,296 | 512 | 2,359,808 |
Conv10 | 512x28x28 | 3x3 | 2,359,296 | 512 | 2,359,808 |
Pool4 | 512x14x14 | - | 0 | 0 | 0 |
Conv11 | 512x14x14 | 3x3 | 2,359,296 | 512 | 2,359,808 |
Conv12 | 512x14x14 | 3x3 | 2,359,296 | 512 | 2,359,808 |
Conv13 | 512x14x14 | 3x3 | 2,359,296 | 512 | 2,359,808 |
Pool5 | 512x7x7 | - | 0 | 0 | 0 |
FC1 | 4096x1x1 | - | 102,760,448 | 4096 | 102,764,544 |
FC2 | 4096x1x1 | - | 16,777,216 | 4096 | 16,781,312 |
FC3 | 1000x1x1 | - | 4,096,000 | 1000 | 4,097,000 |
Total | - | - | 138,344,128 | 13,416 | 138,357,544 |
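
The per-layer counts in the table can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the names `VGG16_CONFIG`, `FC_SIZES`, and `count_vgg16_parameters` are hypothetical helpers introduced here, and it assumes 3x3 kernels everywhere, a 3-channel input, and a 7x7 feature map after the last pooling layer, as listed in the table.

```python
# VGG-16 layer layout as listed in the table: an integer is the number of
# output channels of a 3x3 convolution, "M" marks a max-pooling layer.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
# Output sizes of the three fully connected layers (FC1, FC2, FC3).
FC_SIZES = [4096, 4096, 1000]


def count_vgg16_parameters(in_channels=3, kernel=3, final_map=7):
    """Return (weights, biases) summed over all convolutional and FC layers."""
    weights = 0
    biases = 0
    channels = in_channels
    for item in VGG16_CONFIG:
        if item == "M":
            continue  # pooling layers have no learnable parameters
        weights += kernel * kernel * channels * item  # k*k*C_in*C_out
        biases += item                                # one bias per filter
        channels = item
    # FC1 takes the flattened 512x7x7 output of Pool5 as input.
    features = channels * final_map * final_map
    for out_features in FC_SIZES:
        weights += features * out_features
        biases += out_features
        features = out_features
    return weights, biases


weights, biases = count_vgg16_parameters()
print(f"weights={weights:,} biases={biases:,} total={weights + biases:,}")
```

The total agrees with the 138,357,544 parameters of the table. Most of them (about 102.8M) sit in FC1 alone, because it connects the flattened 512x7x7 feature map to 4096 units.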
Stack of Convolutional Layers
Unlike AlexNet, which uses large filters in the convolutional layers at the beginning of the network (11x11 and 5x5), VGG uses only 3x3 filters. Stacking several convolutional layers without spatial pooling in between increases the effective receptive field of the stack with respect to its input. The figure below presents a stack of two 3x3 convolutional layers and the respective receptive fields (in blue and red).
As we can observe, the effective receptive field of two stacked 3x3 convolutions is the same as that of a single 5x5 convolution. The figure below presents the same technique with a stack of three 3x3 convolutions. In that case, the effective receptive field is the same as that of a single 7x7 convolution.
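
The growth of the receptive field can also be checked numerically. The following is a minimal sketch, assuming identical layers with no pooling in between; the function name `stacked_receptive_field` is a hypothetical helper, not part of any library.

```python
def stacked_receptive_field(num_layers, kernel=3, stride=1):
    """Effective receptive field (in input pixels) of a stack of identical
    convolutional layers with no pooling in between."""
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance, in input pixels, between neighbouring outputs
    for _ in range(num_layers):
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # stays 1 for the stride-1 convolutions used here
    return rf


for n in (1, 2, 3):
    size = stacked_receptive_field(n)
    print(f"{n} stacked 3x3 convolution(s): {size}x{size} receptive field")
```

With stride 1, each additional 3x3 layer adds 2 pixels to the field, which gives 3x3, 5x5, and 7x7 for one, two, and three layers.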
The authors explain that using several small convolutions instead of a single large one decreases the number of parameters. Assuming that the input and output of every layer have $C$ channels, and ignoring biases, a single 7x7 convolutional layer has $7^2C^2 = 49C^2$ parameters, while a stack of three 3x3 convolutional layers has $3(3^2C^2) = 27C^2$ parameters. Moreover, a stack of three layers applies three ReLUs rather than one, which makes the decision function more discriminative.
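
As a quick numeric check of this comparison, the sketch below evaluates both counts for a concrete channel width. The value `C = 256` is an arbitrary choice for illustration, biases are ignored as in the formulas, and `conv_weights` and `stack_weights` are hypothetical helper names.

```python
def conv_weights(kernel, channels):
    """Weights of one kernel x kernel convolution with `channels` input
    and `channels` output channels (biases ignored)."""
    return kernel * kernel * channels * channels


def stack_weights(num_layers, kernel, channels):
    """Weights of a stack of identical convolutional layers."""
    return num_layers * conv_weights(kernel, channels)


C = 256  # illustrative channel count, not a value taken from the text
single_7x7 = conv_weights(7, C)      # 49 * C^2
three_3x3 = stack_weights(3, 3, C)   # 27 * C^2
print(f"single 7x7: {single_7x7:,}")
print(f"three 3x3:  {three_3x3:,}")
print(f"ratio:      {three_3x3 / single_7x7:.3f}")
```

The ratio is 27/49 regardless of $C$, so the stack of three 3x3 layers needs roughly 45% fewer weights than the single 7x7 layer with the same receptive field.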
For more information about the receptive field, see [3].