The SSD algorithm is called single shot because it predicts the bounding box and the class simultaneously as it processes the image in the same deep learning model. Basically, the architecture is summarized in the following steps:
- A 300 x 300 image is input into the architecture.
- The input image is passed through multiple convolutional layers, obtaining different features at different scales.
- For each feature map obtained in 2, we use a 3 x 3 convolutional filter to evaluate small set of default bounding boxes.
- For each default box evaluated, the bounding box offsets and class probabilities are predicted.
The model architecture looks like this:
SSD is used for predicting multiple classes similar to that in YOLO, ...