Solving real-world business problems with computer vision
Applications of CNNs for real-time image classification in the enterprise.
The process of data integration has traditionally been done using structured and semistructured data in batch-oriented use cases. In the last few years, real-time data has become the new frontier for many enterprises, and real-time streaming of unstructured or binary data has been a particularly tough nut to crack. In fact, many enterprises have large volumes of binary data that are not used to their full potential because of the inherent complexity of ingesting and processing such data.
Here are a few examples of how one might work with binary data :
- Performing speech-to-text recognition of audio files, recognizing individual speakers, and automatically cataloging files with enriched metadata so that audio recorded in interactive voice response systems is indexed and searchable.
- Automatically classifying image files based on the actual content of the image, such as recognizing products, faces, or other objects in the scene.
Of course, there are many other use cases. The good news is that working with binary data does not have to be that complicated. In this post, we’ll show how companies are using advances in computer vision, integrated with modern data ingestion technologies, to solve real-world business problems.
Applications of computer vision and deep learning in enterprise
The enterprise’s interest in machine vision techniques has ramped up sharply in the last few years due to the increased accuracy in competitions such as ImageNet. Computer vision methods have been around for decades, but it takes a certain level of accuracy for some use cases to move beyond the lab into real-world production applications. The advances seen in the ImageNet competition showed the world what was possible, and also harkened the rise of convolutional neural networks as the method of choice in computer vision.
Convolutional neural networks have the ability to learn location invariant features automatically by leveraging a network architecture that learns image features, as opposed to having them hand-engineered (as in traditional engineering). This aspect highlights a key property of deep learning networks—the ability of data scientists to choose the right architecture for the input data type so the network can automatically learn features. All of this is also directly dependent on having enough quality data that is properly labeled and appropriate for the problem at hand.
We’re seeing applications of computer vision across the spectrum of the enterprise:
- Financial Services
- Health care
In insurance, we see companies such as Orbital Insights analyzing satellite imagery to count cars and oil tank levels automatically to predict such things as mall sales and oil production, respectively. We are also seeing insurance companies leveraging computer vision to analyze the damage on assets under policy to better decide whom should be offered coverage.
The automotive industry has embraced computer vision (and deep learning) aggressively in the past five years with applications such as scene analysis, automated lane detection, and automated road sign reading to set speed limits.
The media world is leveraging computer vision to recognize images on social media to identify brands so companies can better position their brands around relevant content. Ebay recently used computer vision to let users visually search for items with photos.
In health care, we see the classic application of detecting disease in MRI scans, where companies like Arterys are now FDA-cleared to use deep learning to model medical imagery data. We’re also seeing this with partnerships, such as the relationship between Google, Nvidia, and Massachusetts General Hospital to leverage deep learning on radiology tasks.
In retail, we see companies interested in analyzing the shopping carts of in-store shoppers to detect items and make recommendations in store about what else they might want to buy. Think of this as a recommendation engine for a brick-and-mortar situation. We also see retailers using even more complex cameras taking more complex pictures (hyper-spectral imagery) that are modeled with convolutional neural networks.
These are but a few examples of computer vision ideas that are in development or already in production across the Global 2000 enterprise. It seems like this deep learning stuff may be around for awhile.
Beyond convolutional neural networks, the automotive industry has leveraged deep learning and long-short-term memory networks to analyze sensor data to automatically detect other cars and objects around the car. On newer cars, if you try and change lanes on the highway without setting your turn signal, the car will correct you, automatically directing you back into your lane. James Long shared with us this anecdote on how he sees integrated machine learning as a force multiplier, as opposed to job replacement:
My father had auto-steer on his tractor for years. It allowed him to cover more ground and do a better job at higher speed—so maybe 20% more productive. That’s how robots will permeate.
It’s small examples like this that show how latent integrated intelligence in vehicles is slowly making them “progressively automated”—as opposed to the idea that all cars will be self-driving tomorrow. Deep learning is quickly becoming the standard platform for integrating automation and intelligence into the environment around us. We probably won’t turn on a complete self-driving car tomorrow; it will likely be a slow transition, to the point where the system progressively autocorrects more and more aspects of driving, and we just naturally stop wanting to drive manually.
Challenges of production deep learning
Computer vision and deep learning present challenges when going into production. These challenges include:
- Getting enough data of good quality
- Managing executives’ expectations about model performance
- Being pragmatic about how bleeding-edge we really need our network to be
- Planning data ingest, storage, security, and overall infrastructure
- Understanding how machine learning differs from software engineering, to avoid misaligned expectations
Most organizations do not collect enough quality data to produce the model their line of business wants in terms of accuracy (e.g., “Our model has an F1 of .80, but the line of business says the F1 has to be .95 to be financially viable to them”). The computer vision practitioner needs to understand the dynamics of model evaluation and how F1 scores, precision, and recall work in practice. This knowledge will allow the practicing data scientist to better communicate realistic expectations about the model performance to management and not set the project up for failure out of the gate.
Building off the concept of model training, we want to further delineate the training phase of machine learning from the inference phase of machine learning. In training, we are performing a batch-class operation, where we typically make multiple passes over a data set to build up the weights (or “parameters”) on the connections in the neural network model. This operation tends to happen on a single machine (with CPU or GPU, depending on situation) or on a cluster of machines (e.g., Hadoop with Spark). The training process can take anywhere from a few minutes to days to complete, and sometimes we’ll build the model multiple times to get the most accurate model for our input data. Making predictions (“inference”) based on the model produced from the training phase is different in terms of how we manage its execution. Sending a new record to a saved model and getting a prediction (e.g., “classification” or “regression”) output is a transactional class operation. We call this phase out separately in the context of an article on real-time streaming applications, as we want to make sure the reader understands that models are rarely trained inside a streaming system. Most of the time, the model is produced offline based on saved training data and then set up later in a way that a streaming system can make predictions transactionally as data flows into the system.
Another challenge for the enterprise is getting machine learning teams trained correctly to understand how to leverage the latest methods in convolutional network tuning and application. Most education sources are too academic for enterprise practitioners and are meant for a college classroom. While that is a good way to teach grad school students, enterprise software training courses often approach teaching material from a practitioner’s point of view.
Another tip for enterprises is to focus on leveraging good, tried-and-true convolutional architectures from the past few years, as opposed to trying to implement the “hot new ICML paper of the week.” Twitter is great for discovering new papers as they come out, but it can also encourage folks to jump from one hot idea to the next before they can actually leverage real production value from new networks. A pragmatic computer vision approach focuses on using networks that have good results and that are implemented on well-known deep learning libraries, such as deeplearning4j, TensorFlow, Keras, and Theano. Once you have established a baseline convolutional model that performs decently, deploy it to users/applications and then, while they are working against that model, you can try out newer architectures in parallel.
Data ingestion has long been a challenge for the enterprise. While it may seem simple on the surface, getting image data from here to there consistently and stored correctly is more work than it seems. Hurdles include the structure of the data, the rate of data ingest, and the overall infrastructure needs relative to the incoming data. Some marketing literature even uses the term “unstructured data,” which is a misnomer. Image data, and all data, has structure. Data that has no structure is unparseable and therefore unusable in a processing system. Most of the time, what people mean when they say “unstructured data” is that “it doesn’t look like a CSV file or a RDBMS table.” Ingest systems can also involve real-time tagging of images as they are ingested, helping us to understand if we have certain images as soon as they are ingested or serving an image detection system. Beyond ingest, companies should also consider their storage options, parallelization, GPU strategy, model serving, workflow management, and security implications. These factors are largely infrastructure-based but have direct impacts on our ability to take a computer vision model to production, regardless of how accurate the model is.
So often we hear customers talk about a fear of failure of data science projects because there is a large element of “the unknown” involved. Data science and deep learning are exploratory in nature, and it is hard to predict just how accurate a model can be on the front end by the input data we have. Many folks tend to conflate the idea of software engineering being fairly (within reason) deterministic (e.g., “We built a house out of these materials”) and data science having a wider range of outcomes with the same labor (e.g., “We mined for gold as long as the other team, but only found half as much gold on our land”). A best practice is to invest in the best possible infrastructure that builds, secures, and deploys our model in a way that IT can consume, then let the data science team focus on building as many models as possible to find the best one for the task at hand.
In this post, we’ve discussed the concepts of streaming technology and enterprise applications of computer vision. To learn in more detail how to implement convolutional neural networks into enterprise applications, see our post “Integrating convolutional neural networks into enterprise pplications.” And, to hear more about applied machine learning in the context of streaming data infrastructure, attend our session Real-time image classification: Using convolutional neural networks on real-time streaming data” at the Strata Data Conference in New York City, Sept. 25-28, 2017.