First, let's think about how we can make a localizer using the output of the base model.
As mentioned previously, the output tensor of the base model has a shape of (None, 7, 7, 1280). The output tensor represents features obtained using a convolutional network. We can suppose that some spatial information is encoded in the spatial indexes (7,7).
Let's try to reduce the dimensionality of our feature map using a couple of convolutional layers and create a regressor that should predict the corner coordinates of the pets' head bounding boxes provided by the dataset.
Our convolutional layers will have several options that are the same:
conv_opts = dict( activation='relu', padding='same', kernel_regularizer="l2")
First of ...