Chapter 10. Image Similarity Detection with Deep Learning and PySpark LSH
Whether you encounter them on social media or e-commerce stores, images are integral to our digital lives. In fact, it was an image dataset—ImageNet—which was a key component for sparking the current deep learning revolution. A remarkable performance by a classification model in the ImageNet 2012 challenge was an important milestone and led to widespread attention. It is no wonder then that you are likely to encounter image data at some point as a data science practitioner.
In this chapter, you will gain experience scaling a deep learning workflow for a visual task, namely, image similarity detection, with PySpark. The task of identifying images that are similar to each other comes intuitively to humans, but it is a complex computational task. At scale, it becomes even more difficult. In this chapter, we will introduce an approximate method for finding similar items called locality sensitive hashing, or LSH, and apply it to images. We’ll use deep learning to convert image data into a numerical vector representation. PySpark’s LSH algorithm will be applied to the resulting vectors, which will allow us to find similar images given a new input image.
On a high level, this example mirrors one of the approaches used by photo sharing apps such as Instagram and Pinterest for image similarity detection. This helps their users make sense of the deluge of visual data that exists on their platforms. This also depicts ...
Get Advanced Analytics with PySpark now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.