Inception v1 is the first version of the network. An object in an image can appear at different sizes and in different positions. For example, look at the first image: the parrot, viewed up close, fills almost the whole image. In the second image, where the parrot is viewed from a distance, it occupies a much smaller region:
Thus, an object (here, a parrot) can appear in any region of the image, and it can be small or large: it might fill the whole image or occupy only a very small portion of it. Our network has to identify the object accurately in every case. But what's the problem ...