The final layer is the fully connected layer. It merges the outputs of the preceding layers into a single flattened vector, and each value in that vector casts a weighted vote for every class.
The output is the category that receives the highest total vote.
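The voting step can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `fully_connected_vote`, the two 2x2 feature maps, the weight values, and the use of softmax to turn vote totals into probabilities are all assumptions made for the example.

```python
import numpy as np

def fully_connected_vote(feature_maps, weights, bias):
    """Flatten the feature maps, total the votes per class, and pick the winner."""
    # Merge all previous outputs into one flat vector.
    x = np.concatenate([fm.ravel() for fm in feature_maps])
    # Each row of `weights` holds one class's vote strengths for every input value.
    scores = weights @ x + bias
    # Softmax (an assumption here) turns the vote totals into probabilities.
    exp = np.exp(scores - scores.max())  # shifted for numerical stability
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs

# Hypothetical inputs: two 2x2 feature maps and three classes.
feature_maps = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 1.0], [1.0, 0.0]]),
]
weights = np.array([
    [2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0],  # class 0 responds to the first map's diagonal
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],  # class 1 responds to the second map
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # class 2 weighs every value equally
])
bias = np.zeros(3)

predicted, probs = fully_connected_vote(feature_maps, weights, bias)
print(predicted, probs)
```

With these made-up weights, class 0 collects the largest vote total, so it is returned as the predicted category.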
By stacking the layers from steps 1, 2, and 3, we form the convolutional network. Its weights are adjusted with backpropagation to reduce the error term, which improves the prediction.
The layers can be repeated multiple times, and each layer's output forms the input to the next layer.
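To make the stacking concrete, here is a rough forward-pass sketch in NumPy. It assumes that steps 1, 2, and 3 are convolution, ReLU, and max pooling, which is a common arrangement but is not spelled out in this passage. The helper names, the random kernels, and the 16x16 input are hypothetical, and training with backpropagation is left out, so the weights are untrained.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Replace negative activations with zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows and columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def forward(image, kernels, fc_weights):
    """Run repeated conv -> ReLU -> pool blocks, then a fully connected layer."""
    x = image
    for kernel in kernels:  # each block's output is the next block's input
        x = max_pool(relu(conv2d(x, kernel)))
    scores = fc_weights @ x.ravel()  # fully connected layer on the flattened output
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
image = rng.random((16, 16))
kernels = [rng.standard_normal((3, 3)), rng.standard_normal((3, 3))]
# Sizes: 16 -> conv 14 -> pool 7 -> conv 5 -> pool 2, so 2 * 2 = 4 flattened values.
fc_weights = rng.standard_normal((3, 4))

label, scores = forward(image, kernels, fc_weights)
print(label, scores)
```

The loop in `forward` is where the repetition happens: adding another kernel to the list adds another convolution, ReLU, and pooling block, provided the fully connected weights are resized to match the new flattened length.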
A classical CNN architecture would look like this:
An example classification prediction using a CNN is shown in the following figure: ...