"The Encyclopedia Americana" color montage of a variety of unidentified butterflies and moths, 1920. (source: Wikimedia Commons).

After reading Pete Warden’s excellent TensorFlow for Poets, I was impressed by how easy it seemed to build a working deep learning classifier. It was so simple that I had to try it myself.

I have a lot of photos around, mostly of birds and butterflies. So, I decided to build a simple butterfly classifier. I chose butterflies because I didn’t have as many photos to work with, and because they were already fairly well sorted. I didn’t want to spend hours sorting a thousand or so bird pictures. According to Pete, that’s the most laborious, time-consuming part of the process: getting your training data sorted.

Sorting was relatively easy: while I thought I’d need a database, or some CSV file tagging the photos by filename, I only needed a simple directory structure: a top-level directory named butterflies, with a directory underneath for each kind of butterfly I was classifying. Here’s what I ended up with:

# ls ../tf_files/butterflies/
Painted Lady black swallowtail monarch tiger swallowtail

Only four kinds of butterflies? Unfortunately, yes. While you don’t need thousands of images, you do need at least a dozen or so in each directory. If you don’t, you’ll get divide-by-zero errors that will make you pull your hair out. Pete’s code randomly separates the images you provide into a training set and a validation set. If either of those sets ends up empty, you’re in trouble. (Pete, thanks for your help understanding this!) I ended up with a classifier that only knew about four kinds of butterflies because I had to throw out all the species where I only had six or seven (or one or two) photos. I may try adding some additional species back in later; I think I can find a few more.
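A quick sanity check before training can save some of that hair-pulling. The sketch below is not part of Pete’s code. It assumes the directory layout described above (one subdirectory per species), that the images are JPEG files, and a minimum of 12 images per class (my reading of “a dozen or so”). The helper names `count_images`, `too_small`, and `chance_of_empty_split` are made up for illustration.

```python
import os

# Assumed threshold: "a dozen or so" images per class, per the text above.
MIN_IMAGES = 12
# Assumption: training images are JPEG files.
IMAGE_EXTENSIONS = (".jpg", ".jpeg")


def count_images(image_dir):
    """Return {class_name: image_count} for each subdirectory of image_dir."""
    counts = {}
    for name in sorted(os.listdir(image_dir)):
        class_dir = os.path.join(image_dir, name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        counts[name] = sum(
            1 for f in os.listdir(class_dir)
            if f.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


def too_small(counts, minimum=MIN_IMAGES):
    """Return the class names that have fewer than `minimum` images."""
    return sorted(name for name, n in counts.items() if n < minimum)


def chance_of_empty_split(n_images, split_fraction):
    """Probability that a split receives none of n_images, assuming each
    image lands in it independently with probability split_fraction."""
    return (1.0 - split_fraction) ** n_images
```

Calling `too_small(count_images("/tf_files/butterflies"))` would list the directories that need more photos (or need to be dropped). The last function shows why small classes are risky: if each image independently had a 10% chance of landing in the validation set (an illustrative figure, not a claim about Pete’s script), a class with 6 photos would end up with an empty validation set about 53% of the time; with 24 photos, about 8%.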

I’ll skip the setup (see Pete’s article for that). By using VirtualBox and Docker, he eliminates the pain and agony of building and installing the software, particularly if you’re using OS X. If you run into strange errors, try going back to the git steps and rebuilding. TensorFlow (TF) won’t survive Docker disconnecting from the VM, so if that happens (for example, if you restart the VM), you’ll need to rebuild.

Here’s what I did to create the classifier; it’s straight from Pete’s article, except for the name of the image directory. You can ignore the “No files found” for the top-level directory (butterflies), but if you see this message for any of the subdirectories, you’re in trouble:

# bazel-bin/tensorflow/examples/image_retraining/retrain \
> --bottleneck_dir=/tf_files/bottlenecks \
> --model_dir=/tf_files/inception \
> --output_graph=/tf_files/retrained_graph.pb \
> --output_labels=/tf_files/retrained_labels.txt \
> --image_dir /tf_files/butterflies
Looking for images in 'butterflies'
No files found
Looking for images in 'black swallowtail'
Looking for images in 'monarch'
Looking for images in 'Painted Lady'
Looking for images in 'tiger swallowtail'
100 bottleneck files created.
2016-03-14 01:46:08.962029: Step 0: Train accuracy = 31.0%
2016-03-14 01:46:08.962241: Step 0: Cross entropy = 1.311761
2016-03-14 01:46:09.137622: Step 0: Validation accuracy = 18.0%
… (Lots of output deleted…)
Final test accuracy = 100.0%
Converted 2 variables to const ops.

And here’s what happens when you actually do some classifying. Here’s the image I’m trying to classify: an Eastern Tiger Swallowtail.

Figure 1. Eastern Tiger Swallowtail

And here’s the result:

# bazel build tensorflow/examples/label_image:label_image && \
> bazel-bin/tensorflow/examples/label_image/label_image \
> --graph=/tf_files/retrained_graph.pb \
> --labels=/tf_files/retrained_labels.txt \
> --output_layer=final_result \
> --image=/tf_files/sample/IMG_5941-e.jpg
(Lots of output)
INFO: Elapsed time: 532.630s, Critical Path: 515.99s
I tensorflow/examples/label_image/main.cc:206] tiger swallowtail (1): 0.999395
I tensorflow/examples/label_image/main.cc:206] black swallowtail (2): 0.000338286
I tensorflow/examples/label_image/main.cc:206] monarch (0): 0.000144585
I tensorflow/examples/label_image/main.cc:206] painted lady (3): 0.000121789

The classifier is 99.9% confident that the picture shows a Tiger Swallowtail. Not bad. Was I just lucky, or did it really work? Here’s another image, this time a trickier photo of a pair of Monarchs:

Figure 2. Pair of Monarchs

# bazel build tensorflow/examples/label_image:label_image && \
> bazel-bin/tensorflow/examples/label_image/label_image \
> --graph=/tf_files/retrained_graph.pb \
> --labels=/tf_files/retrained_labels.txt \
> --output_layer=final_result \
(Not quite as much output)
INFO: Elapsed time: 16.717s, Critical Path: 11.43s
I tensorflow/examples/label_image/main.cc:206] monarch (0): 0.875138
I tensorflow/examples/label_image/main.cc:206] painted lady (3): 0.117698
I tensorflow/examples/label_image/main.cc:206] tiger swallowtail (1): 0.0054633
I tensorflow/examples/label_image/main.cc:206] black swallowtail (2): 0.00170112

TF isn’t as confident, but it still thinks the image is a Monarch with a probability of about 87%.
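If you want to do anything with these scores beyond eyeballing them, you have to pull them out of the log text. Here is a minimal sketch that does so, assuming the log lines look exactly like the ones shown above (the format may differ in other TensorFlow versions). `parse_scores` and `top_label` are hypothetical helper names, not part of TensorFlow.

```python
import re

# Matches the tail of a label_image log line such as
# "... main.cc:206] tiger swallowtail (1): 0.999395"
# (format assumed from the output shown in this article).
SCORE_LINE = re.compile(
    r"\]\s+(?P<label>.+?)\s+\((?P<index>\d+)\):\s+(?P<score>[0-9.eE+-]+)\s*$"
)


def parse_scores(output):
    """Return {label: score} parsed from label_image log text."""
    scores = {}
    for line in output.splitlines():
        match = SCORE_LINE.search(line)
        if match:
            scores[match.group("label")] = float(match.group("score"))
    return scores


def top_label(scores):
    """Return the (label, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])
```

Lines that don’t match the pattern, such as the bazel timing line, are simply skipped.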

I was surprised that TF worked so well. First, I thought that a successful classifier would need to be trained on thousands of photos, and I only had a hundred or so. You’d need thousands (or millions) if you’re building an app for Google or Facebook, but I had at most a couple dozen in each of the four categories. That proved to be enough for my purposes. Second, the Monarch is tricky; the butterflies are at a bad angle, and one is blurry because it was moving. I don’t know why I didn’t delete this image after shooting it, but it made a nice test case.

Pete pointed out that, if you don’t have many images, you can improve the accuracy by using the --random_crop, --random_scale, and --random_brightness options. These make training run much slower. In effect, they’re creating more images by distorting the images you’ve provided.
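To make “creating more images by distorting” concrete, here is a toy illustration in plain Python. It is not how the retraining script implements these options: it works on a grayscale image stored as a list of rows of pixel values, leaves out scaling, and the function names and parameter values (a 20% crop margin, a 10% brightness change) are invented for the example.

```python
import random


def random_crop(image, max_crop_fraction, rng):
    """Trim a random margin (up to max_crop_fraction of each dimension)
    and return the remaining window of the image."""
    height, width = len(image), len(image[0])
    crop_h = height - int(height * rng.uniform(0, max_crop_fraction))
    crop_w = width - int(width * rng.uniform(0, max_crop_fraction))
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_brightness(image, max_delta, rng):
    """Scale every pixel by a random factor in [1 - max_delta, 1 + max_delta],
    clamping the result to the 0-255 range."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return [
        [min(255.0, max(0.0, pixel * factor)) for pixel in row]
        for row in image
    ]


def distorted_copies(image, count, seed=0):
    """Return `count` randomly cropped and brightened variants of one image."""
    rng = random.Random(seed)
    return [
        random_brightness(random_crop(image, 0.2, rng), 0.1, rng)
        for _ in range(count)
    ]
```

Each call to `distorted_copies` turns one photo into several slightly different ones, which is the sense in which a hundred images can stand in for a larger training set. The price is that every distorted copy has to be processed from scratch.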

Deep learning isn’t magic, and playing with it will get you thinking about its limits. TensorFlow doesn’t know anything about what it’s classifying; it’s just trying to find similar images. If you ask TensorFlow to classify something, classify it will, whether or not that classification makes any sense. It doesn’t know the meaning of “I don’t know.” When I gave the butterfly classifier a Skipper, one of a large and confusing family of small butterflies that doesn’t look remotely like anything in the training set, TF classified it as a Black Swallowtail with 80% confidence:

Figure 3. Skipper butterfly

Of all the butterflies in the training set, the Black Swallowtail is probably the least similar (Black Swallowtails are, well, black). If I gave my classifier a snapshot of someone walking down the street, it would helpfully determine the set of butterfly photos to which the photo was most similar. Can that be fixed? More training images, and more categories, would make classification more accurate, but wouldn’t deal with the “don’t know” problem. A larger training set might help identify a Skipper (with enough photos, it could possibly even identify the type of Skipper), but not a photo that’s completely unexpected. Setting some sort of lower bound for confidence might help. For example, returning “don’t know” if the highest confidence is under 50% might be useful for building commercial applications. But that leaves behind a lot of nuance: “I don’t know, but it might be...” Pete suggests that you can solve the “don’t know” problem by adding a random category that consists of miscellaneous photos unrelated to the photos in the “known” categories; this trick doesn’t sound like it should work, but it’s surprisingly effective.
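Both ideas, the confidence floor and the catch-all category, are easy to sketch on top of the classifier’s scores. The function below is an illustration, not anything from Pete’s code: `classify_or_unknown` is a made-up name, the 50% default comes from the example above, and “random” is an assumed label for the directory of miscellaneous photos.

```python
def classify_or_unknown(scores, threshold=0.5, reject_labels=("random",),
                        unknown="don't know"):
    """Return the top-scoring label, or `unknown` if the top score is below
    `threshold` or the top label is one of the catch-all `reject_labels`.

    `scores` maps label -> score, e.g. {"monarch": 0.875, ...}.
    """
    if not scores:
        return unknown
    label, score = max(scores.items(), key=lambda item: item[1])
    if score < threshold or label in reject_labels:
        return unknown
    return label
```

With the Monarch scores shown earlier, this returns “monarch” at the default threshold and “don’t know” if the threshold is raised to 0.9. Note that the Skipper, at 80% confidence, would still sail past a 50% floor, which is one reason a threshold alone doesn’t solve the problem.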

Since TensorFlow doesn’t really know anything about butterflies, or flowers, or birds, it might not be classifying based on what you think. It’s easy to think that TF is comparing butterflies, but it’s really just trying to find similar pictures. I don’t have many pictures of Swallowtails sitting still on the pavement (I suspect this one was dying). But I have many pictures of butterflies feeding on those purple flowers. Maybe TF identified the Monarch correctly for the wrong reason. Maybe TF classified the Skipper as a Black Swallowtail because it was also sitting on a purple flower, like several of the Swallowtails.

Likewise, TensorFlow has no built-in sense of scale, nor should it. Photography is really good at destroying information about size, unless you’re really careful about context. But for a human trying to identify something, size matters. Swallowtails and Monarchs are huge as butterflies go (Tiger Swallowtails are among the largest butterflies in North America). There are many butterflies that are tiny, and many situations in which it’s important to know whether you’re looking at a butterfly with a wingspan of 2 centimeters or 3, or 15. A Skipper is much smaller than any Swallowtail, but my Skipper was cropped so that it filled most of the image, and thus looked like a giant among butterflies. I doubt that there’s any way for a deep learning system to recover information about scale, aside from a very close analysis of the context.

How does TensorFlow deal with objects that look completely different from different angles? Many butterflies look completely different top and bottom (dorsal and ventral, if you know the lingo): for example, the Painted Lady. If you’re not used to thinking about butterflies, you’re seeing the bottom when the wings are folded and pointing up; you’re seeing the top when the wings are open and out to the sides. Can TF deal with this? Given enough images, I suspect it could; it would be an interesting experiment. Obviously, there would be no problem if you built your training set with “Painted Lady, dorsal” and “Painted Lady, ventral” as separate categories.
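If you did train with separate dorsal and ventral categories, you would probably still want a single answer per species. One way to get it, sketched here under assumptions, is to add up the scores of the categories that belong to the same species; that is reasonable as long as the scores behave like probabilities over mutually exclusive categories, as the outputs above appear to. The label convention (“species, view”) and the name `merge_views` are hypothetical.

```python
from collections import defaultdict


def merge_views(scores, separator=", "):
    """Sum the scores of labels such as "painted lady, dorsal" and
    "painted lady, ventral" into a single score per species.

    Labels without the separator are kept under their own name.
    """
    merged = defaultdict(float)
    for label, score in scores.items():
        species = label.split(separator)[0]
        merged[species] += score
    return dict(merged)
```

For example, scores of 0.25 for “painted lady, dorsal” and 0.5 for “painted lady, ventral” would combine into 0.75 for “painted lady”.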

Finally, a thought about photography. The problem with butterflies (or birds, for that matter) is that you need to take dozens of pictures to get one good one. The animals won’t stay still for you. I save a lot of my pictures, but not all of them: I delete the images that aren’t focused, where the subject is moving, where it’s too small, or just doesn’t “look nice.” We’re spoiled by National Geographic. For a classifier, I suspect that these bad shots are as useful as the good ones, and that human aesthetics make classification more difficult. Save everything? If you’re planning on building a classifier, that’s the way to go.

Playing with TF was fun; I certainly didn’t build anything that could be used commercially, but I did get surprisingly good results with surprisingly little effort. Now, on to the birds... can I beat the Cornell Lab of Ornithology’s Merlin?
