Here’s a simple recipe for solving crazy-hard problems with machine intelligence. First, collect huge amounts of training data — probably more than anyone thought sensible or even possible a decade ago. Second, massage and preprocess that data so the key relationships it contains are easily accessible (the jargon here is “feature engineering”). Finally, feed the result into ludicrously high-performance, parallelized implementations of pretty standard machine-learning methods like logistic regression, deep neural networks, and k-means clustering (don’t worry if those names don’t mean anything to you — the point is that they’re widely available in high-quality open source packages).
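As a toy, self-contained sketch of the recipe (synthetic data stands in for the web-scale corpus, plain-Python gradient descent stands in for the "ludicrously high-performance, parallelized implementations," and every name here is invented for illustration), here is logistic regression run through the three steps:

```python
import math
import random

random.seed(0)

# Step 1: "collect" data -- 200 synthetic 2-D points stand in for the
# huge corpus; the true rule is simply x + y > 0.
raw = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x + y > 0 else 0 for x, y in raw]

# Step 2: feature engineering -- here just prepending a bias term; in
# practice, whatever transformations expose the key relationships.
features = [(1.0, x, y) for x, y in raw]

# Step 3: a bog-standard method -- logistic regression, trained by
# batch gradient descent.
def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

w = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(500):
    grad = [0.0, 0.0, 0.0]
    for f, y in zip(features, labels):
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
        for j in range(3):
            grad[j] += (p - y) * f[j]
    for j in range(3):
        w[j] -= lr * grad[j] / len(features)

preds = [1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f))) > 0.5 else 0
         for f in features]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

Nothing in the method itself is exotic; the leverage in the real recipe comes from the scale of step one and the engineering muscle behind step three.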
Google pioneered this formula, applying it to ad placement, machine translation, spam filtering, YouTube recommendations, and even the self-driving car, creating billions of dollars of value in the process. The surprising thing is that Google isn’t made of magic. Instead, mirroring the surprised conclusion Bruce Schneier reached about the NSA in the wake of the Snowden revelations, “its tools are no different from what we have in our world; it’s just better funded.”
Google’s success is astonishing not only in its scale and diversity, but also in the degree to which it exploded the accumulated conventional wisdom of the artificial intelligence and machine learning fields. Smart people with carefully tended arguments and closely held theories about how to build AI were proved wrong, and not for the first time. So was born the unreasonable effectiveness of data: the discovery that simple models fed with very large datasets simply crushed the sophisticated theoretical approaches that were all the rage before the era of big data.
In many cases, Google has succeeded by reducing problems that were previously assumed to require strong AI — that is, reasoning and problem-solving abilities generally associated with human intelligence — into narrow AI, solvable by matching new inputs against vast repositories of previously encountered examples. This alchemy rests critically on step one of the recipe above: namely, acquisition of data at scales previously rejected as absurd, if such collection was even considered before centralized cloud services were born.
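A minimal illustration of that reduction, with `sin(x)` as a hypothetical stand-in for whatever behavior is being observed at scale (all names here are invented): a system that answers queries purely by matching against memorized examples becomes hard to distinguish from one that "understands" the underlying function, once the repository is large enough.

```python
import bisect
import math
import random

random.seed(1)

# The "vast repository": 100,000 previously encountered (input, answer)
# pairs, where the observed behavior is just sin(x) in disguise.
xs = sorted(random.uniform(0, 10) for _ in range(100_000))
memory = [(x, math.sin(x)) for x in xs]

def answer(query):
    # Narrow AI by lookup: return the answer attached to the nearest
    # remembered example -- no model of the underlying function at all.
    i = bisect.bisect_left(xs, query)
    candidates = [memory[j] for j in (i - 1, i) if 0 <= j < len(memory)]
    return min(candidates, key=lambda m: abs(m[0] - query))[1]

# With enough memorized examples, lookup error is tiny everywhere.
queries = [random.uniform(0, 10) for _ in range(1000)]
worst = max(abs(answer(q) - math.sin(q)) for q in queries)
```

The trick works precisely because the example density overwhelms the problem's variability, which is why it depends so heavily on data acquisition at previously absurd scales.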
Now the company’s motto makes a bit more sense: “Google’s mission is to organize the world’s information and make it universally accessible and useful.” Yes, to machines. The company’s ultimate success relies on transferring the rules and possibilities of the online world to our physical surroundings, and its approach to machine learning and AI reflects this underlying drive.
But is it the only viable approach? With Google (and other tech giants) buying robotics and AI companies at a manic clip — systematically moving into areas where better machine learning will provide a compelling advantage and employing “less than 50% but certainly more than 5%” of ML experts — it’s tempting to declare game over. But, with the caveat that we know little about the company’s many unannounced projects (and keeping in mind that I have approximately zero insider info), we can still make some good guesses about areas where the company, and others that have adopted its model, are unlikely to dominate.
I think this comes down to situations that have one or more of the following properties:
- The data is inherently small (for the relevant definition of small) and further collection is illegal, prohibitively expensive, or even impossible. Note that this is a high bar: sometimes a data collection scheme that seems out of reach is merely waiting for the appropriate level of effort and investment, such as driving down every street on earth with a specially equipped car.
- The data really cannot be interpreted without a sophisticated model. This is tricky to judge, of course: the unreasonable effectiveness of data is exactly that it exposes just how superfluous models are in the face of simple statistics computed over large datasets.
- The data cannot be pooled across users or customers, whether for legal, political, contractual, or other reasons. This results in many “small data” problems, rather than one “big data” problem.
My friend and colleague Eric Jonas points out that genomic data is a good example of properties one and two. While it might seem strange to call gene sequencing data “small,” keep in mind that there are “only” a few billion human genomes on earth, each comprising a few billion letters. This means that the vast majority of possible genomes, including many perfectly good ones, will never be observed; on the other hand, the genomes that do exist contain enough letters that plenty of the patterns we find will turn out to be misleading: the product of chance rather than a meaningfully predictive signal (a problem called over-fitting). The disappointing results of genome-wide association studies (the relatively straightforward statistical analyses of gene sequences that represented the first efforts to identify genetic predictors of disease) reinforce the need for approaches that incorporate more knowledge about how the genetic code is read and processed by cellular machinery to produce life.
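A tiny numerical caricature of this small-n, large-p trap (pure noise stands in for real sequence data, and the sizes are arbitrary): with thousands of candidate "letters" and only dozens of "genomes," some feature will correlate strongly with the outcome by chance alone.

```python
import random

random.seed(0)

# Far more features (letters) than samples (genomes) -- all pure noise.
n_samples, n_features = 50, 5000
X = [[random.choice((-1, 1)) for _ in range(n_features)]
     for _ in range(n_samples)]
y = [random.choice((-1, 1)) for _ in range(n_samples)]

def corr_with_labels(j):
    # Pearson correlation of noise feature j with the noise labels.
    f = [X[i][j] for i in range(n_samples)]
    mf = sum(f) / n_samples
    my = sum(y) / n_samples
    cov = sum((a - mf) * (b - my) for a, b in zip(f, y))
    var_f = sum((a - mf) ** 2 for a in f)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_f * var_y) ** 0.5

best = max(abs(corr_with_labels(j)) for j in range(n_features))
```

At these sizes, the strongest noise feature typically correlates with the labels at well over 0.4: exactly the kind of "signal" that a naive association study, lacking any model of the underlying biology, would flag as a discovery.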
Another favorite example of mine is perception and autonomous navigation in unknown environments. Remember that Google’s cars would be completely lost anywhere without a pre-existing high-resolution map. While this approach might scale up to handle everyday driving in many parts of the developed world, many autonomous vehicle and robot applications will require the system to recognize and understand its environment from scratch, and to adapt to novel challenges in real time. What about autonomous vehicles that are exploring new territory for the first time (think of an independent Mars rover, at one extreme), or that face rapidly shifting or even adversarial situations in which a static map, however detailed, simply can’t capture the essential aspects of the situation? The bottom line is that there are many environments that can’t be measured or instrumented sufficiently to be rendered legible to Google-style machines.
Other candidates include the interpretation and prediction of company performance from financial and other public data (properties 1 and 2); understanding manufacturing performance and other business processes directly from sensor data, and suggesting improvements thereon (2 and 3); and mapping and optimizing the real information and decision-making flows within organizations, an area that’s seen far more promise than delivery (1, 2, and 3).
This is a long way from coherent advice, but it’s in areas like these where I see the opportunities. It’s not that the large Internet companies can’t go after these applications; it’s that these kinds of problems fit poorly with their ingrained assumptions, modes of organization, existing skill sets, and internal consensus about the right way to go about things. Maybe that’s not much daylight, but it’s all you’re going to get.