VGGNet

VGGNet, also called VGG (Visual Geometry Group), is a convolutional neural network, deeper than AlexNet. VGG-16 is the most popular variant of this network.

You can find the variants of the VGGNet on this page.

Network features:

  • 138M parameters
  • 16 layers (13 convolutional + 3 fully connected)
  • use ReLU as activation function

Architecture

Parameters

Layer Activation Shape Filter Shape Weights Biases Parameters
Input 3x224x224 - 0 0 0
Conv1 64x224x224 3x3 1,728 64 1,792
Conv2 64x224x224 3x3 36,864 64 36,928
Pool1 64x112x112 - 0 0 0
Conv3 128x112x112 3x3 73,728 128 73,856
Conv4 128x112x112 3x3 147,456 128 147,584
Pool2 128x56x56 - 0 0 0
Conv5 256x56x56 3x3 294,912 256 295,168
Conv6 256x56x56 3x3 589,824 256 590,080
Conv7 256x56x56 3x3 589,824 256 590,080
Pool3 256x28x28 - 0 0 0
Conv8 512x28x28 3x3 1,179,648 512 1,180,160
Conv9 512x28x28 3x3 2,359,296 512 2,359,808
Conv10 512x28x28 3x3 2,359,296 512 2,359,808
Pool4 512x14x14 - 0 0 0
Conv11 512x14x14 3x3 2,359,296 512 2,359,808
Conv12 512x14x14 3x3 2,359,296 512 2,359,808
Conv13 512x14x14 3x3 2,359,296 512 2,359,808
Pool5 512x7x7 - 0 0 0
FC1 4096x1x1 - 102,760,448 4096 102,764,544
FC2 4096x1x1 - 16,777,216 4096 16,781,312
FC3 1000x1x1 - 4,096,000 1000 4,097,000
Total         138,357,544

Stack of Convolutional Layers

Unlike AlexNet which uses big filters in the convolution layers at the beginning of the network (11x11 and 5x5), VGG uses only 3x3 filters. Using a stack several convolutional layers, without spatial pooling in between, increase the receptive field in the input of the stack. The figure below presents a stack of two 3x3 convolutional layers and the respective receptive fields (in blue and red).

Receptive field of two 3x3 conv

As we can observed, the effective receptive field of two 3x3 convolutions is the same than using a single 5x5 convolution. The figure below presents the same technique with a stack of three 3x3 convolutions. In that case, the effective receptive field is the same than using a single 7x7 convolution.

Receptive field of three 3x3 conv

The authors explained that the use of several small convolutions decreases the number of parameters. For a single 7x7 convolutional layer there are $49C^2$ parameters (with $C$ the number of channels) while a stack of three 3x3 convolutional layers has $27C^2$ parameters. Moreover, the use of three ReLU rather than one (for a stack of three) makes the decision function more discriminative.

For more information about the receptive field, see [3].

Bibliography

results matching ""

    No results matching ""