YOLO
YOLO, for You Only Look Once, is a convolutional neural network for object detection. Object detection consists of two tasks: identifcation and locatation of objects within images.
Network features:
- 24 layers (actually 26 = 24 convolutional + 2 fully-connected)
- use Leaky ReLU as activation function
Architecture
Grid & Prediction Vector
The authors explained that the input image is divided into an SxS grid [1]. It is important to understand here that the image is not divided before being used in the neural network. In fact, the grid is created at the output of the network in the prediction vector.
The prediction vector is a $S \times S \times (B * 5 + C)$ tensor, where $S$ is the grid size, $B$ is the number of bouding boxes per grid cell and $C$ is the number of classes. In the paper [1], the authors have chosen $S = 7$, $B = 2$ and $C = 20$ (the number of class comes from the PASCAL VOC dataset), which corresponds to the 30x7x7 output in the figure of the network achitecture. The decomposition of the prediction vector is presented in the figure below.
Limitations
The first version of YOLO has several limitations:
- struggles with small objects that appear in groups, such as flocks of birds because each grid cell only predicts two boxes and can only have one class. [1]
- struggles to generalize to objects in new or unusual aspect ratios or configurations. [1]
- the loss function treats errors the same in small bounding boxes versus large bouding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations. [1]