Chapter 1. Cognitive Augmentation

We address the theme of cognitive augmentation first because this is where the rubber hits the road: we build machines to make our lives better, to bring us capacities that we don’t otherwise have, or that only some of us would have. This chapter opens with Beau Cronin’s thoughtful essay on predictive APIs, the services behind predictive apps that deliver the right functionality and content at the right time, for the right person. The API is the interface that tackles the challenge that Alistair Croll defined as “Designing for Interruption.” Ben Lorica then discusses graph analysis, an increasingly prevalent way for humans to gather information from data. Graph analysis is one of the many building blocks of cognitive augmentation; the way that tools interact with each other (and with us) is a rapidly developing field with huge potential.

Challenges Facing Predictive APIs

Solutions to a number of problems must be found to unlock PAPI value

by Beau Cronin

In November, the first International Conference on Predictive APIs and Apps will take place in Barcelona, just ahead of Strata Barcelona. This event will bring together those who are building intelligent web services (sometimes called Machine Learning as a Service) with those who would like to use these services to build predictive apps, which, as defined by Forrester, deliver “the right functionality and content at the right time, for the right person, by continuously learning about them and predicting what they’ll need.”

This is a very exciting area. Machine learning of various sorts is revolutionizing many areas of business, and predictive services like the ones at the center of predictive APIs (PAPIs) have the potential to bring these capabilities to an even wider range of applications. I co-founded one of the first companies in this space (acquired by Salesforce in 2012), and I remain optimistic about the future of these efforts. But the field as a whole faces a number of challenges, for which the answers are neither easy nor obvious, that must be addressed before this value can be unlocked.

In the remainder of this post, I’ll enumerate what I see as the most pressing issues. I hope that the speakers and attendees at PAPIs will keep these in mind as they map out the road ahead.

Data Gravity

It’s now widely recognized that for truly large data sets, it makes far more sense to move compute to the data than the other way around, which conflicts with the basic architecture of cloud-based analytics services such as predictive APIs. It’s worth noting, though, that after transformation and cleaning, many machine learning data sets are actually quite small, not much larger than a hefty spreadsheet. Data gravity remains a real issue, however, for the truly big data needed to train, say, deep learning models.
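A back-of-the-envelope calculation makes the contrast concrete. The sketch below is only illustrative: the data set sizes and the link bandwidth are hypothetical numbers chosen for the example, not measurements from any real service.

```python
def transfer_hours(size_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes over a bandwidth_mbps link."""
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits (decimal units)
    seconds = size_megabits / bandwidth_mbps
    return seconds / 3600


# Hypothetical values, assumed purely for illustration.
SPREADSHEET_LIKE_GB = 0.05  # a cleaned, "hefty spreadsheet" data set
DEEP_LEARNING_GB = 10_000   # a raw 10 TB training corpus
LINK_MBPS = 100             # assumed sustained upload bandwidth

small = transfer_hours(SPREADSHEET_LIKE_GB, LINK_MBPS)
large = transfer_hours(DEEP_LEARNING_GB, LINK_MBPS)

print(f"small data set: {small * 3600:.0f} seconds")
print(f"large data set: {large / 24:.1f} days")
```

Under these assumptions the small data set uploads in a few seconds, while the large one takes more than a week, which is the regime where shipping the computation to the data starts to look like the only practical option.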


The data gravity problem is just the most basic example of a number of issues that arise from the development process for data science and data products. The Strata conferences right now are flooded with proposals from data science leaders who stress the iterative and collaborative nature of this work. And it’s now widely appreciated that the preparatory (data preparation, cleaning, transformation) and communication (visualization, presentation, storytelling) phases usually consume far more time and energy than model building itself. The most valuable toolsets will directly support (or at least not disrupt) the whole process, with machine learning and model building closely integrated into the overall flow. So, it’s not enough for a predictive API to have solid client libraries and/or a slick web interface: instead, these services will need to become upstanding, fully assimilated citizens of the existing data science stacks.

Crossing the Development/Production Divide

Executing a data science project is one thing; delivering a robust and scalable data product entails a whole new set of requirements. In a nutshell, project-based work thrives on flexible data munging, tight iteration loops, and lightweight visualization; productization emphasizes reliability, efficient resource utilization, logging and monitoring, and solid integration with other pieces of distributed architecture. A predictive API that supports one of these endeavors won’t necessarily shine in the other setting. These limitations are acceptable if expectations are set correctly: a tool can reasonably support, say, exploratory work, with the understanding that production use will require re-implementation and hardening. But I do think the reality conflicts with some of the marketing in the space.

Users and Skill Sets

Sometimes it can be hard to tell at whom, exactly, a predictive service is aimed. Sophisticated and competent data scientists—those familiar with the ins and outs of statistical modeling and machine learning methods—are typically drawn to high-quality open source libraries, like scikit-learn, which deliver a potent combination of control and ease of use. For these folks, predictive APIs are likely to be viewed as opaque (if the methods aren’t transparent and flexible) or of questionable value (if the same results could be achieved using a free alternative). Data analysts, skilled in data transformation and manipulation but often with limited coding ability, might be better served by a more integrated “workbench” (such as those provided by legacy vendors like SAS and SPSS). In this case, the emphasis is on the overall experience rather than the API. Finally, application developers probably just want to add predictive capabilities to their products, and need a service that doesn’t force them to become de facto (and probably subpar) data scientists along the way.

These different needs are conflicting, and clear thinking is needed to design products for the different personas. But even that’s not enough: the real challenge arises from the fact that developing a single data product or predictive app will often require all three kinds of effort. Even a service that perfectly addresses one set of needs is therefore at risk of being marginalized.

Horizontal versus Vertical

In a sense, all of these challenges come down to the question of value. What aspects of the total value chain does a predictive service address? Does it support ideation, experimentation and exploration, core development, production deployment, or the final user experience? Many of the developers of predictive services that I’ve spoken with gravitate naturally toward the horizontal aspect of their services. No surprise there: as computer scientists, they are at home with abstraction, and they are intellectually drawn to—even entranced by—the underlying similarities between predictive problems in fields as diverse as finance, health care, marketing, and e-commerce. But this perspective is misleading if the goal is to deliver a solution that carries more value than free libraries and frameworks. Seemingly trivial distinctions in language, as well as more fundamental issues such as appetite for risk, loom ever larger.

As a result, predictive API providers will face increasing pressure to specialize in one or a few verticals. At this point, elegant and general APIs become not only irrelevant, but a potential liability, as industry- and domain-specific feature engineering increases in importance and it becomes crucial to present results in the right parlance. Sadly, these activities are not thin adapters that can be slapped on at the end, but instead are ravenous time beasts that largely determine the perceived value of a predictive API. No single customer cares about the generality and wide applicability of a platform; each is looking for the best solution to the problem as he conceives it.

As I said, I am hopeful that these issues can be addressed—if they are confronted squarely and honestly. The world is badly in need of more accessible predictive capabilities, but I think we need to enlarge the problem before we can truly solve it.

There Are Many Use Cases for Graph Databases and Analytics

Business users are becoming more comfortable with graph analytics

by Ben Lorica


The rise of sensors and connected devices will lead to applications that draw from network/graph data management and analytics. As the number of devices surpasses the number of people—Cisco estimates 50 billion connected devices by 2020—one can imagine applications that depend on data stored in graphs with many more nodes and edges than the ones currently maintained by social media companies.

This means that researchers and companies will need to produce real-time tools and techniques that scale to much larger graphs (measured in terms of nodes and edges). I previously listed tools for tapping into graph data, and I continue to track improvements in accessibility, scalability, and performance. For example, at the just-concluded Spark Summit, it was apparent that GraphX remains a high-priority project within the Spark[1] ecosystem.

Another reason to be optimistic is that tools for graph data are getting tested in many different settings. It’s true that social media applications remain natural users of graph databases and analytics. But there are a growing number of applications outside the “social” realm. In his recent Strata Santa Clara talk and book, Neo Technology’s founder and CEO Emil Eifrem listed other use cases for graph databases and analytics:

  • Network impact analysis (including root cause analysis in data centers)
  • Route finding (going from point A to point B)
  • Recommendations
  • Logistics
  • Authorization and access control
  • Fraud detection
  • Investment management and finance (including securities and debt)
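To make one of these use cases concrete, route finding on an unweighted graph can be sketched with a breadth-first search. This is a minimal illustration rather than how any particular graph database implements it; the `roads` network and the `shortest_route` helper are hypothetical names invented for the example.

```python
from collections import deque


def shortest_route(graph, start, goal):
    """Return the fewest-hops path from start to goal, or None if unreachable."""
    previous = {start: None}  # maps each reached node to its predecessor
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical road network stored as an adjacency list.
roads = {
    "A": ["C", "D"],
    "C": ["A", "E"],
    "D": ["A", "E", "B"],
    "E": ["C", "D", "B"],
    "B": ["D", "E"],
}

route = shortest_route(roads, "A", "B")
print(route)
```

Breadth-first search only minimizes the number of hops. A real routing application would likely weight edges by distance or travel time and use an algorithm such as Dijkstra's instead.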

The widening number of applications means that business users are becoming more comfortable with graph analytics. In some domains network science dashboards are beginning to appear. More recently, analytic tools like GraphLab Create make it easier to unlock and build applications with graph[2] data. Various applications that build upon graph search/traversal are becoming common, and users are beginning to be comfortable with notions like “centrality” and “community structure.”
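For readers new to these notions, here is a small sketch in plain Python. It computes degree centrality (the simplest centrality measure) and connected components, which are only the crudest stand-in for community structure; real community detection typically relies on more refined methods such as modularity optimization. The edge list and function names are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def connected_components(adjacency):
    """Group nodes into sets that are mutually reachable."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical friendship graph.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
         ("bob", "cat"), ("eve", "fay")]
graph = build_graph(edges)
centrality = degree_centrality(graph)
components = connected_components(graph)

print(max(centrality, key=centrality.get))  # the best-connected node
print(len(components))                      # number of separate groups
```

In this toy graph, "ann" is the most central node and the network splits into two groups that share no edges.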

A quick way to immerse yourself in the graph analysis space is to attend the third GraphLab conference in San Francisco, a showcase of the best tools[3] for graph data management, visualization, and analytics, as well as interesting use cases. For instance, MusicGraph will be on hand to give an overview of their massive graph database from the music industry, Ravel Law will demonstrate how they leverage graph tools and analytics to improve search for the legal profession, and Lumiata is assembling a database to help improve medical science using evidence-based tools powered by graph analytics.

Graphistry graph
Figure 1-1. Interactive analyzer of Uber trips across San Francisco’s micro-communities

Network Science Dashboards

Network graphs can be used as primary visual objects with conventional charts used to supply detailed views

by Ben Lorica

With Network Science well on its way to being an established academic discipline, we’re beginning to see tools that leverage it.[4] Applications that draw on this discipline make heavy use of visual representations and come with interfaces aimed at business users. For business analysts used to consuming bar and line charts, network visualizations take some getting used to. But with enough practice, and for the right set of problems, they are an effective visualization model.

In many domains, network graphs can be the primary visual objects, with conventional charts used to supply detailed views. I recently got a preview of some dashboards built using Financial Network Analytics (FNA). In the example below, the primary visualization represents correlations among assets across different asset classes[5] (the accompanying charts are used to provide detailed information for individual nodes):

Financial Network Analytics

Using the network graph as the centerpiece of a dashboard works well in this instance. And with FNA’s tools already being used by a variety of organizations and companies in the financial sector, I think “Network Science dashboards” will become more commonplace in financial services.
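A correlation network of this kind can be sketched in a few lines. The example below is not FNA’s method; it is a minimal illustration that assumes each asset is a node and that an edge is drawn whenever the absolute Pearson correlation of two return series exceeds a threshold. The return series, the asset names, and the 0.7 cutoff are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_edges(returns, threshold=0.7):
    """Return (asset_a, asset_b, r) for every pair with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(returns), 2):
        r = pearson(returns[a], returns[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical periodic returns for four assets.
returns = {
    "stock_a": [0.01, -0.02, 0.03, 0.00],
    "stock_b": [0.02, -0.04, 0.06, 0.00],
    "bond":    [-0.01, 0.02, -0.03, 0.00],
    "gold":    [0.01, 0.01, -0.01, -0.01],
}

network = correlation_edges(returns)
for a, b, r in network:
    print(f"{a} -- {b}: r = {r}")
```

The sign of each edge weight distinguishes assets that move together from assets that move in opposite directions, which a dashboard might encode as edge color.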

Network Science dashboards only work to the extent that network graphs are effective (network graphs tend to get harder to navigate and interpret when the number of nodes and edges gets large[6]). One workaround is to aggregate nodes and visualize communities rather than individual objects. New ideas may also come to the rescue: the rise of networks and graphs is leading to better techniques for visualizing large networks.
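The aggregation workaround can be sketched as follows. The example assumes that a community label has already been assigned to every node (for instance by a separate community detection step) and collapses the node-level edge list into weighted community-level edges. The node names, labels, and the `aggregate_by_community` helper are hypothetical.

```python
from collections import Counter


def aggregate_by_community(edges, community_of):
    """Collapse node-level edges into community sizes and inter-community weights."""
    sizes = Counter(community_of.values())
    weights = Counter()
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        if ca != cb:  # edges inside a community disappear in the summary view
            weights[tuple(sorted((ca, cb)))] += 1
    return sizes, weights


# Hypothetical node-to-community assignment and edge list.
community_of = {
    "n1": "north", "n2": "north", "n3": "north",
    "s1": "south", "s2": "south",
}
edges = [("n1", "n2"), ("n2", "n3"), ("n1", "s1"), ("n3", "s2"), ("s1", "s2")]

sizes, weights = aggregate_by_community(edges, community_of)
print(dict(sizes))
print(dict(weights))
```

A visualization would then draw one node per community, sized by member count, with edge thickness proportional to the number of connections between communities.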

This fits one of the themes we’re seeing in Strata: cognitive augmentation. The right combination of data/algorithm(s)/interface allows analysts to make smarter decisions much more efficiently. While much of the focus has been on data and algorithms, it’s good to see more attention paid to effective interfaces and visualizations.

[1] Full disclosure: I am an advisor to Databricks, a startup commercializing Apache Spark.

[2] As I noted in a previous post, GraphLab has been extended to handle general machine learning problems (not just graphs).

[3] Exhibitors at the GraphLab conference will include creators of several major graph databases, visualization tools, and Python tools for data scientists.

[4] This post is based on a recent conversation with Kimmo Soramäki, founder of Financial Network Analytics.

[5] Kimmo is an experienced researcher and policy-maker who has consulted and worked for several central banks. Thus FNA’s first applications are aimed at financial services.

[6] Traditional visual representations of large networks are pejoratively referred to as “hairballs.”
