Challenges facing predictive APIs

Solutions to a number of problems must be found to unlock PAPI value.

October 5, 2014

Telephone switchboard (source: Christopher Brown)

In November, the first International Conference on Predictive APIs and Apps will take place in Barcelona, just ahead of Strata Barcelona. This event will bring together those who are building intelligent web services (sometimes called Machine Learning as a Service) with those who would like to use these services to build predictive apps, which, as defined by Forrester, deliver “the right functionality and content at the right time, for the right person, by continuously learning about them and predicting what they’ll need.”

This is a very exciting area. Machine learning of various sorts is revolutionizing many areas of business, and predictive services like the ones at the center of predictive APIs (PAPIs) have the potential to bring these capabilities to an even wider range of applications. I co-founded one of the first companies in this space (acquired by Salesforce in 2012), and I remain optimistic about the future of these efforts. But the field as a whole faces a number of challenges, for which the answers are neither easy nor obvious, that must be addressed before this value can be unlocked.

In the remainder of this post, I’ll enumerate what I see as the most pressing issues. I hope that the speakers and attendees at PAPIs will keep these in mind as they map out the road ahead.

Data gravity

It’s widely recognized now that for truly large data sets, it makes a lot more sense to move compute to the data rather than the other way around, which conflicts with the basic architecture of cloud-based analytics services such as predictive APIs. It’s worth noting, though, that after transformation and cleaning, many machine learning data sets are actually quite small, not much larger than a hefty spreadsheet. Data gravity remains a real issue, however, for the truly big data needed to train, say, deep learning models.
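To make the trade-off concrete, here is a rough back-of-the-envelope sketch of how long it takes to ship data to a remote service. The data set sizes and the 100 Mbps link are illustrative assumptions, not measurements, and the function name is made up for this example.

```python
def transfer_hours(size_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes over a bandwidth_mbps link."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    seconds = size_megabits / bandwidth_mbps
    return seconds / 3600


# Assumed, illustrative sizes: a cleaned training table vs. raw source data.
examples = [("cleaned feature table", 0.5), ("raw event logs", 50_000)]

for label, size_gb in examples:
    hours = transfer_hours(size_gb, bandwidth_mbps=100)
    print(f"{label}: {size_gb} GB takes about {hours:.2f} hours at 100 Mbps")
```

Under these assumptions the cleaned table moves in well under a minute, while the raw data would take weeks, which is why the size of the data after preparation matters so much here.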

Workflow

The data gravity problem is just the most basic example of a number of issues that arise from the development process for data science and data products. The Strata conferences right now are flooded with proposals from data science leaders who stress the iterative and collaborative nature of this work. And it’s now widely appreciated that the preparatory (data preparation, cleaning, transformation) and communication (visualization, presentation, storytelling) phases usually consume far more time and energy than model building itself. The most valuable toolsets will directly support (or at least not disrupt) the whole process, with machine learning and model building closely integrated into the overall flow. So, it’s not enough for a predictive API to have solid client libraries and/or a slick web interface: instead, these services will need to become upstanding, fully assimilated citizens of the existing data science stacks.
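One way to picture what being a good citizen of an existing stack might mean is a thin adapter that gives a predictive service the fit/predict interface that scikit-learn and many other Python tools follow. The sketch below is only an illustration: `HypotheticalPapiClient` and its `train` and `score` methods are invented stand-ins for a remote service, and here they run a toy model in-process rather than calling any real API.

```python
from typing import List, Sequence

Row = Sequence[float]


class HypotheticalPapiClient:
    """Invented stand-in for a remote predictive service (runs locally here)."""

    def __init__(self) -> None:
        self._threshold = 0.0

    def train(self, rows: List[Row], labels: List[int]) -> None:
        # Toy "model": threshold on the first feature, midway between class means.
        pos = [r[0] for r, y in zip(rows, labels) if y == 1]
        neg = [r[0] for r, y in zip(rows, labels) if y == 0]
        self._threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def score(self, row: Row) -> int:
        return int(row[0] > self._threshold)


class PapiEstimator:
    """Adapter exposing fit/predict so the service slots into familiar workflows."""

    def __init__(self, client: HypotheticalPapiClient) -> None:
        self.client = client

    def fit(self, X: List[Row], y: List[int]) -> "PapiEstimator":
        self.client.train(list(X), list(y))
        return self  # returning self allows chaining, as with fit() in scikit-learn

    def predict(self, X: List[Row]) -> List[int]:
        return [self.client.score(row) for row in X]


X_train = [[1.0], [2.0], [8.0], [9.0]]
y_train = [0, 0, 1, 1]
model = PapiEstimator(HypotheticalPapiClient()).fit(X_train, y_train)
predictions = model.predict([[3.0], [7.5]])
print(predictions)
```

An interface like this is only the first step; the surrounding data preparation and communication phases would still need to work smoothly with the service.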

Crossing the development/production divide

Executing a data science project is one thing; delivering a robust and scalable data product entails a whole new set of requirements. In a nutshell, project-based work thrives on flexible data munging, tight iteration loops, and lightweight visualization; productization emphasizes reliability, efficient resource utilization, logging and monitoring, and solid integration with other pieces of distributed architecture. A predictive API that supports one of these endeavors won’t necessarily shine in the other setting. These limitations might be acceptable if expectations are set correctly; it’s fine for a tool to support, say, exploratory work, with the understanding that production use will require re-implementation and hardening. But I do think this reality conflicts with some of the marketing in the space.

Users and skill sets

Sometimes it can be hard to tell at whom, exactly, a predictive service is aimed. Sophisticated and competent data scientists — those familiar with the ins and outs of statistical modeling and machine learning methods — are typically drawn to high-quality open source libraries, like scikit-learn, which deliver a potent combination of control and ease of use. For these folks, predictive APIs are likely to be viewed as opaque (if the methods aren’t transparent and flexible) or of questionable value (if the same results could be achieved using a free alternative). Data analysts, skilled in data transformation and manipulation but often with limited coding ability, might be better served by a more integrated “workbench” (such as those provided by legacy vendors like SAS and SPSS). In this case, the emphasis is on the overall experience rather than the API. Finally, application developers probably just want to add predictive capabilities to their products, and need a service that doesn’t force them to become de facto (and probably subpar) data scientists along the way.

These needs conflict, and it takes clear thinking to design products for the different personas. But even that’s not enough: the real challenge arises from the fact that developing a single data product or predictive app will often require all three kinds of effort. Even a service that perfectly addresses one set of needs is therefore at risk of being marginalized.

Horizontal vs vertical

In a sense, all of these challenges come down to the question of value. What aspects of the total value chain does a predictive service address? Does it support ideation, experimentation and exploration, core development, production deployment, or the final user experience? Many of the developers of predictive services that I’ve spoken with gravitate naturally toward the horizontal aspect of their services. No surprise there: as computer scientists, they are at home with abstraction, and they are intellectually drawn to, even entranced by, the underlying similarities between predictive problems in fields as diverse as finance, health care, marketing, and e-commerce. But this perspective is misleading if the goal is to deliver a solution that carries more value than free libraries and frameworks. From one field to the next, seemingly trivial distinctions in language, as well as more fundamental issues such as appetite for risk, loom ever larger.

As a result, predictive API providers will face increasing pressure to specialize in one or a few verticals. At this point, elegant and general APIs become not only irrelevant, but a potential liability, as industry- and domain-specific feature engineering increases in importance and it becomes crucial to present results in the right parlance. Sadly, these activities are not thin adapters that can be slapped on at the end, but instead are ravenous time beasts that largely determine the perceived value of a predictive API. No single customer cares about the generality and wide applicability of a platform; each is looking for the best solution to the problem as they conceive it.

As I said, I am hopeful that these issues can be addressed — if they are confronted squarely and honestly. The world is badly in need of more accessible predictive capabilities, but I think we need to enlarge the problem before we can truly solve it.
