Earlier, we took care of initializing our weights to make the job of gradient descent optimizers easier. However, those benefits are only seen during the early stages of training and do not guarantee improvements in the later stages. That is where we turn to another great invention called the batch norm layer. The effect of using a batch norm layer in a CNN model is much the same as the input normalization we saw in Chapter 2, Deep Learning and Convolutional Neural Networks; the only difference now is that the normalization happens at the outputs of all the convolution and fully connected layers in your model.
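To make the analogy with input normalization concrete, here is a rough sketch in plain NumPy (not the book's code) of what a batch norm layer computes on a layer's output: each feature is normalized to zero mean and unit variance over the batch, then rescaled and shifted by the learned parameters gamma and beta.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, num_features), e.g. the output of a dense layer
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learned scale and shift

x = np.random.randn(64, 128) * 3.0 + 5.0      # un-normalized layer outputs
y = batch_norm(x, gamma=np.ones(128), beta=np.zeros(128))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # roughly 0 and 1 per feature
```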
The batch norm layer will usually be attached to the end of every fully connected or convolution layer, but before the activation function.
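The following is a minimal sketch of that placement, assuming TensorFlow/Keras (the exact API used by the book may differ): a batch norm layer is inserted after each convolution or fully connected layer and before its nonlinearity.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, 3, padding="same", use_bias=False,
                  input_shape=(32, 32, 3)),
    layers.BatchNormalization(),   # normalize the conv output over the batch
    layers.Activation("relu"),

    layers.Flatten(),
    layers.Dense(128, use_bias=False),
    layers.BatchNormalization(),   # same idea for the fully connected layer
    layers.Activation("relu"),

    layers.Dense(10, activation="softmax"),
])
```

Note that the layers preceding batch norm are built with `use_bias=False`: the learned shift parameter (beta) inside the batch norm layer makes a separate bias term redundant.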