How to analyze 100 million images for $624
There's a lot of new ground to be explored in large-scale image processing.
Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.
Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap.
I use m1.xlarge servers on Amazon EC2, which are beefy enough to process two million Instagram-sized photos a day, and only cost $12.48! I’ve used some open source frameworks to distribute the work in a completely scalable way, so this works out to $624 for a 50-machine cluster that can process 100 million pictures in 24 hours. That’s just 0.000624 cents per photo! (I seriously do not have enough exclamation points for how mind-blowingly exciting this is.)
So, how do I do all that work so quickly and cheaply? The building blocks are the open source Hadoop, HIPI, and OpenCV projects. Unless you were frozen in an Arctic ice-cave for the last decade, you’ll have heard of Hadoop, but the other two are a bit less famous.
HIPI is a Java framework that lets you efficiently process images on a Hadoop cluster. It’s needed because HDFS can’t handle large numbers of files, so it provides a way of bundling images together into much bigger files, and unbundling them on the fly as you process them. It’s been growing in popularity in research and academia, but hasn’t had widespread commercial use yet. I had to fork it and add meta-data support so I could keep track of the files as they went through the system, along with some other fixes. It’s running like a champion now, though, and has enabled everything else we’re doing.
OpenCV is written in C++, but has a recently added Java wrapper. It supports a lot of the fundamental operations you need to implement image-processing algorithms, so I was able to write my advanced (and fairly cunning, if I do say so myself, especially for mustaches!) image analysis and object detection routines using OpenCV for the basic operations.
The first and most time-consuming step is getting your images downloaded. HIPI has a basic distributed image downloader as an example, but you’ll want to make sure the site you’re accessing won’t be overwhelmed. I was focused on large social networks with decent CDNs, so I felt comfortable running several servers in parallel. I did have to alter the HIPI downloader code to add a user agent string so the admins of the sites could contact me if the downloading was causing any problems.
If you have three different external services you’re pulling from, with four servers assigned to each service, you end up taking about 30 days to download 100 million photos. That’s $4,492.80 for the initial pull, which is not chump change, but not wildly outside a startup budget, especially if you plan ahead and use reserved instances to reduce the cost.
Now you have 100 million images, you need to do something useful with them. Photos contain a massive amount of information, and OpenCV has a lot of the building blocks you’ll need to extract some of the parts you care about. Think about the properties you’re interested in — maybe you want to exclude pictures containing people if you’re trying to create slideshows about places — and then look around at the tools that the library offers; for this example you could search for faces, and identify photos that are faceless as less likely to contain people. Anything you could do on a single image, you can now do in parallel on millions of them.
Before you go ahead and do a distributed run, though, make sure it works. I like to write a standalone command-line version of my image processing step, so I can debug it easily and iterate fast on the algorithm. OpenCV has a lot of convenience functions that make it easy to load images from disk, or even capture them from your webcam, display them, and save them out at the end. Their Java support is quite new, and I hit a few hiccups, but overall, it works very well. This made it possible to write a wrapper that is handed the images from HIPI’s decoding engine, does some processing, and then writes the results out in a text format, one line per image, all within a natively-Java Hadoop job.
Once you’re happy with the way the algorithm is working on your test data, and you’ve wrapped it in a HIPI wrapper, you’re ready to apply it to all the images you’ve downloaded. Just spin up a Hadoop job with your jar file, pointing at the HDFS folders containing the output of the downloader. In our experience, we were able to process around two million 700×700-sized JPEG images on each server per day, so you can use that as a rule of thumb for the size/speed tradeoffs you want to make when you choose how many machines to put in your cluster. Surprisingly, the actual processing we did within our algorithm didn’t affect the speed very much, apparently the object recognition and image-processing work ran fast enough that it was swamped by the time spent on IO.
I hope I’ve left you excited and ready to tackle your own large-scale image processing challenges. Whether you’re a data person who’s interested in image processing or a computer-vision geek who wants to widen your horizons, there’s a lot of new ground to be explored, and it’s surprisingly easy once you put together the pieces.