Artificial intelligence and advances in data warehousing
The O'Reilly Podcast: Gary Orenstein on developing a data infrastructure that enables the latest applications in machine learning and AI.
In this podcast episode, I speak with Gary Orenstein, chief marketing officer at MemSQL, a platform for real-time analytics that combines a database, a data warehouse, and streaming workloads into one system. We discuss trends that are driving advancements in data warehousing, how related technologies are changing as machine learning and AI evolve, and example use cases across industries.
Here are some highlights from our chat:
Data warehousing mega-trends
I think when we look at modern data warehousing, which is a critical part of the landscape, we’re seeing what I refer to as mega-trends—things like the Internet of Things, the drive to do more machine learning and artificial intelligence, and the desire to move more to the cloud. These mega-trends are causing people to reevaluate how they have collected data; how they process data; and, more specifically, how they make use of it and build applications.
The infrastructure necessary for machine learning
There is a lot of talk about what’s possible with machine learning algorithms, but not as much talk about how to enable it. Ultimately, machine learning is a way to use software to crunch through data, and crunch through numbers in a way that human beings could never do in a million years—but in order to make that happen, you need the analytics and you need the data infrastructure.
I think being effective with machine learning requires having a healthy understanding of the data infrastructure to support it. … It’s very helpful if you have a data infrastructure that can maintain transactions, because once you’re maintaining transactions, you are recording the state of the business, which changes over time. To implement machine learning well, and in a way that brings you to the present, you want those machine learning algorithms to be running on fresh data that accurately reflects the business today, as opposed to reflecting it some time ago. Another element of the data infrastructure that you want to make sure you have in place is the ability to rapidly ingest data, and that fits perfectly in line with what’s happening with the Internet of Things, or what’s happening with people wanting to gather information from mobile applications or web applications.
Improving image recognition with a real-time data warehouse
We did some work in real-time image recognition with an organization called Thorn. They are a non-profit dedicated to protecting children against sexual exploitation on the internet. … The goal at Thorn is to monitor what ads are going up on the internet, to see ads showing particular faces and identify those faces more quickly, to help law enforcement perhaps track down and protect these children. Thorn had been working on a more traditional data pipeline for real-time image recognition, but was having challenges keeping up with the volume of fresh imagery that was coming in every day, and matching that against a large volume of images that they have in their database. What we were able to do with them is to implement a machine learning function called dot_product. … It essentially allows you to compare the similarity of two vectors, and you can compare in different ways, sometimes with cosine similarity, sometimes with Euclidean distance.
In the facial recognition example, you’re taking the image of the face and you’re taking the points on the face (the eyeballs, the corners of the mouth, the ear lobes) and creating a numerical vector. By implementing that function in a real-time data warehouse like MemSQL, that is optimized with a SQL engine to crunch through this data extremely quickly, we were able to help Thorn improve the performance of their image recognition by up to 1,000-fold. So, that’s one example in real-time image recognition where machine learning coupled closely to the real-time data warehouse makes a lot of sense.
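To make the vector comparison concrete, here is a minimal Python sketch of the three measures mentioned above: dot product, cosine similarity, and Euclidean distance. It is an illustration only, not MemSQL's or Thorn's implementation. The function names, the `best_match` helper, and the three-element feature vectors are all hypothetical; real face vectors would have many more dimensions.

```python
import math


def dot_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product scaled by both vector lengths; 1.0 means same direction.

    Assumes neither vector is all zeros (that would divide by zero).
    """
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors; 0.0 means identical."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query, stored):
    """Return (label, score) for the stored vector most similar to query."""
    scored = ((label, cosine_similarity(query, vec)) for label, vec in stored.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical feature vectors standing in for facial landmark data.
stored = {
    "face_a": [0.9, 0.1, 0.3],
    "face_b": [0.2, 0.8, 0.5],
}
query = [0.85, 0.15, 0.35]

label, score = best_match(query, stored)
print(label, round(score, 3))
```

When the stored vectors are normalized to unit length ahead of time, the dot product alone gives the same value as cosine similarity, which is one reason a single dot product function can serve as the core of a similarity search.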
This post is a collaboration between O’Reilly and MemSQL. See our statement of editorial independence.