Even if you have a clear picture of how a CNN classifies an image, it may be less obvious how a neural network can localize multiple objects in an image by predicting a bounding box for each one (a rectangle enclosing the object). The first and simplest solution you might think of is to slide a window across the image and run the CNN on each window, but that is too computationally expensive for most real-world applications: the classifier has to run once for every window position and size, and if you are powering the vision of a self-driving car, you want it to recognize the obstacle and stop before hitting it.
You can find more about the sliding-window approach to object detection in this blog post by Adrian Rosebrock: ...
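To make the cost of the sliding-window idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than a reference implementation: the function names (`sliding_windows`, `detect`, `count_classifier_calls`), the window sizes, the stride, and the `classify` callable (a stand-in for cropping the window and running a CNN on it) are all hypothetical and not taken from any particular library.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(image_w: int, image_h: int, window: int, stride: int) -> Iterator[Box]:
    """Yield every square window of side `window` that fits fully inside the image."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, window, window)


def detect(
    image_w: int,
    image_h: int,
    classify: Callable[[Box], float],  # hypothetical: returns an "object" score for a window
    window_sizes: Iterable[int],
    stride: int,
    threshold: float = 0.5,
) -> List[Tuple[Box, float]]:
    """Run the classifier on every window and keep the boxes that score above the threshold."""
    detections = []
    for size in window_sizes:
        for box in sliding_windows(image_w, image_h, size, stride):
            score = classify(box)  # one full classifier call per window
            if score >= threshold:
                detections.append((box, score))
    return detections


def count_classifier_calls(image_w: int, image_h: int, window_sizes: Iterable[int], stride: int) -> int:
    """Count how many times the classifier would have to run for one image."""
    return sum(
        1
        for size in window_sizes
        for _ in sliding_windows(image_w, image_h, size, stride)
    )


if __name__ == "__main__":
    # Assumed example setup: a 640x480 frame, three window sizes, stride of 16 pixels.
    calls = count_classifier_calls(640, 480, window_sizes=(64, 128, 256), stride=16)
    print(f"classifier calls for a single frame: {calls}")
```

Even with this coarse stride and only three window sizes, the classifier runs a couple of thousand times per frame, and the count grows quickly as the stride shrinks or more window sizes and aspect ratios are added. That repeated work is what makes the naive approach impractical for real-time use.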