Ignoring the classification problem for a moment and focusing only on localization, we can treat localization as the problem of regressing the four coordinates of the bounding box that contains the subject of the input image.
In practice, there is not much difference between training a CNN to solve a classification task and training it to solve a regression task: the architecture of the feature extractor remains the same, while the classification head is replaced by a regression head. In the end, this only means changing the number of output neurons from the number of classes to 4, one neuron per coordinate of the bounding box.
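To make the head swap concrete, here is a minimal sketch in plain Python rather than in any deep learning framework. It assumes the feature extractor has already produced a flat feature vector; the names `make_head` and `apply_head`, the feature values, and the choice of 10 classes are all hypothetical and chosen only for illustration.

```python
import random


def make_head(in_features, out_units, seed=0):
    """Build a dense (fully connected) head as a (weights, biases) pair.

    Hypothetical helper: one weight row and one bias per output neuron.
    """
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(in_features)]
        for _ in range(out_units)
    ]
    biases = [0.0] * out_units
    return weights, biases


def apply_head(head, features):
    """Compute one linear output per neuron: dot(weights_row, features) + bias."""
    weights, biases = head
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


# Assumed output of the shared feature extractor (made-up values).
features = [0.5, -1.2, 0.3, 0.8, 0.0, 1.1, -0.4, 0.9]

NUM_CLASSES = 10  # assumption: a 10-class problem

# Same input features, two different heads: only the number of output neurons changes.
classification_head = make_head(len(features), NUM_CLASSES)
regression_head = make_head(len(features), 4)

class_scores = apply_head(classification_head, features)  # one score per class
box = apply_head(regression_head, features)  # one value per box coordinate

print(len(class_scores), len(box))
```

Both heads consume the same features; the only structural difference is the number of output neurons. In a real model the heads would be trained with different losses, which this sketch does not cover.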
The idea is that the regression head should learn to output the correct ...