8

MANAGING A WORKFORCE OF DJINNS

In 2016, MIT’s Sloan Management Review asked me to contribute a short essay on the future of management. At first, I told them I had nothing much to say, or at least nothing that hadn’t long ago been said. But then I realized that I was responding to the question using an old map.

If you think with a twentieth-century factory mindset, you might believe that the tens of thousands of software engineers at companies like Google, Amazon, and Facebook spend their days grinding out products just like their industrial forebears, only today they are producing software rather than physical goods. If, instead, you step back and view these companies with a twenty-first-century mindset, you realize that a large part of what they do—delivering search results, news and information, social network status updates, relevant products for purchase, and drivers on demand—is done by software programs and algorithms. These programs are workers, and the programmers who create them are their managers. Each day, these “managers” take in feedback about their workers’ performance, as measured in real-time data from the marketplace, and if necessary, they give feedback to the workers in the form of minor tweaks and updates to the program or the algorithm.

The tasks performed by these software workers reflect the operational workflow of the digital organization. At an e-commerce site, you can imagine how one electronic worker helps the user find possible products that might match his or her search. Another shows information about the products. Yet another suggests alternative choices. Once the customer has chosen to buy a product, a digital worker presents a web form requesting payment and validates the input (for example, checking whether the credit card number provided is valid or whether the password presented matches the one that is stored). Another worker creates an order and associates it with the customer’s record. Yet another constructs a warehouse pick list to be executed by a human or a robot. One more stores data about that transaction in the company’s accounting system, and another sends out an email acknowledgment to the customer.
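To make the idea of software workers concrete, here is a minimal sketch in Python of a checkout flow decomposed into small, single-purpose functions. The function names, the sample card number, and the simple checksum test are illustrative assumptions, not any real company’s implementation.

```python
def validate_payment(card_number: str) -> bool:
    """A standard Luhn checksum sanity check on the card number."""
    digits = [int(d) for d in card_number if d.isdigit()][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return len(digits) >= 12 and total % 10 == 0

def create_order(customer_id: str, sku: str) -> dict:
    """Record the purchase and associate it with the customer."""
    return {"customer": customer_id, "sku": sku, "status": "paid"}

def build_pick_list(order: dict) -> list:
    """Produce the warehouse pick list, to be executed by a human or a robot."""
    return [order["sku"]]

def send_acknowledgment(order: dict) -> str:
    """Draft the confirmation email for the customer."""
    return f"Email to {order['customer']}: your order for {order['sku']} is confirmed."

# Each step could run as its own service; here they are simply chained together.
if validate_payment("4242 4242 4242 4242"):
    order = create_order("cust-42", "sku-123")
    pick_list = build_pick_list(order)
    print(send_acknowledgment(order))
```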

In an earlier generation of computing, these actions might be taken by a single monolithic application responding to the requests of a single user. But modern web applications may well be servicing millions of simultaneous users, and their functions have been decomposed into what are now called “microservices”—collections of individual functional building blocks that each do one thing, and do it very well. If a traditional monolithic application like Microsoft Word were reimplemented as a set of microservices, you could easily swap out the spell-checker for a better one, or add a new service that would turn web links into footnotes, or the reverse.

Microservices are an evolution of the communications-oriented design pattern that we saw in the design of Unix and the Internet, and in Jeff Bezos’s platform memo. Microservices are defined by their inputs and outputs—how they communicate with other services—not by their internal implementation. They can be written in different languages, and run cooperatively on multiple machines; if designed correctly, any one of them can be swapped out for an improved component that performs the same function without requiring the rest of the application to be updated. This is what allows for continuous deployment, in which new features can be rolled out on a constant basis rather than in one big splash, and for A/B testing, in which alternate versions of the same feature can be tested on subsets of the user population.
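Here is a minimal sketch of what “defined by inputs and outputs” means in practice: two interchangeable spell-check services behind one interface, plus the kind of deterministic hash-based bucketing commonly used for A/B tests. The class names, the toy correction rules, and the experiment name are invented for illustration.

```python
import hashlib
from typing import Protocol

class SpellChecker(Protocol):
    """The contract: any implementation that maps text to corrected text will do."""
    def correct(self, text: str) -> str: ...

class BasicSpellChecker:
    def correct(self, text: str) -> str:
        return text.replace("teh", "the")

class ImprovedSpellChecker:
    def correct(self, text: str) -> str:
        return text.replace("teh", "the").replace("recieve", "receive")

def in_treatment_group(user_id: str, experiment: str, share: float = 0.5) -> bool:
    """Deterministically assign a fixed share of users to the new version."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < share * 10_000

def spell_check_service(user_id: str, text: str) -> str:
    # Callers never see which implementation ran; only the interface matters.
    if in_treatment_group(user_id, "spellcheck-v2"):
        checker = ImprovedSpellChecker()
    else:
        checker = BasicSpellChecker()
    return checker.correct(text)

print(spell_check_service("user-123", "teh quick brown fox"))
```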

THE UNREASONABLE EFFECTIVENESS OF DATA

As the scale and speed of Internet applications have grown, the nature of many of the software workers has also changed. It’s a bit like the shift in aeronautics from propellers to jet engines. You can only go so fast with a motor that relies on mechanical pistons and rotating parts. A radically different approach was required, one that burns the fuel more directly. For a large class of applications, that jet engine has come in the form first of applied statistics and probability theory, then of machine learning and increasingly sophisticated AI algorithms.

In 2006, Roger Magoulas, O’Reilly Media’s VP of research, first used the term big data to describe the new tools for managing data at the scale that enables the services of companies like Google. Former Bell Labs researcher John Mashey had used the term as early as 1998, but to describe the increasing scale of data that was being collected and stored, not the kind of data-driven services based on statistics, nor the software engineering breakthroughs and business processes that make these services possible.

Big data doesn’t just mean a larger-scale version of a relational database like Oracle. It is something profoundly different. In their 2009 paper “The Unreasonable Effectiveness of Data” (a homage in its title to Eugene Wigner’s classic 1960 talk, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”), Google machine learning researchers Alon Halevy, Peter Norvig, and Fernando Pereira explained the growing effectiveness of statistical methods in solving previously difficult problems such as speech recognition and machine translation.

Much of the previous work had been grammar based. Could you construct what was in effect a vast piston engine that used its knowledge of grammar rules to understand human speech? Success had been limited. But that changed as more and more documents came online. A few decades ago, researchers relied on carefully curated corpora of human speech and writings that, at most, contained a few million words. But eventually, there was so much content available online that the game changed profoundly. In 2006, Google assembled a trillion-word corpus for use by language researchers, and developed a jet engine to process it. Progress since then has been swift and decisive.

Halevy, Norvig, and Pereira noted that in many ways, this corpus, taken from the web, was far inferior to the curated versions used by previous researchers. It was full of incomplete sentences, grammatical and spelling errors, and was not annotated and tagged with grammatical constructs. But the fact that it was a million times larger outweighed all those drawbacks. “A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior,” they wrote. Instead of building ever-more-complex language models, researchers began to “make use of the best ally we have: the unreasonable effectiveness of data.” Complex rule-based models were not the path to language understanding; they should just use statistical analysis and let the data itself tell them what the model should be.
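The “let the data tell you what the model should be” approach can be seen in miniature in the simplest statistical language model: count which word follows which in a corpus and predict the most frequent continuation. The tiny corpus below is obviously a placeholder for the trillion-word one; nothing about grammar is coded anywhere.

```python
from collections import Counter, defaultdict

# Placeholder corpus; the real power comes from doing this over ~10^12 words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# No grammar rules at all: just count which word follows which.
following = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    following[w1][w2] += 1

def predict_next(word: str) -> str:
    """Return the most frequent continuation observed in the data."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))   # 'cat' -- twice as common after 'the' as 'mat' or 'fish'
```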

While this paper was focused on language translation, it summed up the approach that has been essential to the success of Google’s core search service. Its insight, that “simple models and a lot of data trump more elaborate models based on less data,” has been fundamental to progress in field after field, and is at the heart of many Silicon Valley companies. It is even more central to the latest breakthroughs in artificial intelligence.

In 2008, D. J. Patil at LinkedIn and Jeff Hammerbacher at Facebook coined the term data science to describe their jobs, naming a field that a few years later was dubbed by Harvard Business Review as “the sexiest job of the 21st century.” Understanding the data science mindset and approach and how it differs from older methods of programming is critical for anyone who is grappling with the challenges of the twenty-first century.

How Google deals with search quality provides important lessons. Early on, Google made a commitment to build search results with statistical methods, with a strong bias against manual overrides to correct problems. A search for “Peter Norvig” should have things like his Wikipedia page and official company bio near the top. If some inferior page comes out on top, one way to fix it would be to add a rule “for the search ‘Peter Norvig,’ don’t allow this inferior URL in the top 10.” Google decided not to do that, but instead to always look for the underlying cause. In a case like this, the fix might be something like “On a search for any well-known person, give a lot of credit to high-quality encyclopedic sources (such as Wikipedia).”

The fitness function of Google’s Search Quality team has always been relevance: Does the user appear to find what he or she was looking for? One of the signals Google now uses, which makes the concept very clear, is that of “the long click” versus “the short click.” If a user clicks on the first search result and doesn’t come back, she was presumably satisfied with the result. If the user clicks on the first search result, spends a modest amount of time away, and then comes back to click on the second result, he was likely not completely satisfied. If users come back immediately, that’s a signal that what they found was not at all what they were looking for, and so on. If the long click happens on the second or third or fifth result more often than it does on the first, perhaps that result is the most relevant. When one person does this, it might be an accident. When millions of people make the same choice, it surely tells you something important.
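A rough sketch of how a dwell-time signal like the long click might be aggregated from click logs. The log format and the sixty-second threshold are assumptions made for illustration; the actual signal Google computes is not public.

```python
from collections import defaultdict

# Hypothetical click log: (query, result position clicked, seconds before returning).
# None means the user never came back -- the strongest satisfaction signal.
clicks = [
    ("peter norvig", 1, None), ("peter norvig", 1, 4),
    ("peter norvig", 2, None), ("peter norvig", 2, None),
    ("peter norvig", 1, 90),
]

LONG_CLICK_SECONDS = 60   # assumed threshold

tallies = defaultdict(lambda: [0, 0])          # position -> [long clicks, total clicks]
for _query, position, dwell in clicks:
    long_click = dwell is None or dwell >= LONG_CLICK_SECONDS
    tallies[position][0] += int(long_click)
    tallies[position][1] += 1

for position, (long_clicks, total) in sorted(tallies.items()):
    print(f"result #{position}: long-click rate {long_clicks / total:.0%} over {total} clicks")
# If result #2 consistently outscores result #1, that is evidence it is the more relevant one.
```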

Statistical methods are not only increasingly powerful; they are swifter and more subtle. If our software workers were once clanking robotic mechanisms, they are now becoming more like djinns, the powerful, independent spirits from Arabian mythology who can be coerced into fulfilling our wishes, but who so often artfully reinterpret the wish to their master’s maximum disadvantage. Like the broom in Disney’s version of The Sorcerer’s Apprentice, algorithmic djinns do whatever it is that we ask them to do, but they are likely to be very single-minded and obtuse in interpreting it, with unintended and sometimes frightening results. How do we ensure that they do what we ask of them?

Managing them is a process of comparing the result of the programs and algorithms to some ideal target and testing to see what changes get you closer to that target. In the case of some work, such as Google’s web crawl, the key functions to evaluate might be speed, completeness, and freshness. In 1998, when Google started, the crawl and the computed index of web pages were updated every few weeks. Today it happens nearly instantaneously. In the case of determining relevance, it is a matter of comparing the results of the program to what an informed user might expect. In the first implementation of Google, this practice was fairly primitive. In their original paper on Google Search, published while they were still at Stanford, Larry and Sergey wrote: “The ranking function has many parameters. . . . Figuring out the right values for these parameters is something of a black art.”

Google says that the number of signals used to calculate relevance has grown to over 200, and search engine marketing guru Danny Sullivan estimates that there may be as many as 50,000 subsignals. Each of these signals is measured and calculated by a complex of programs and algorithms, each with its own fitness function it is trying to optimize. The output of these functions is a score that you can think of as the target of a master fitness function designed to optimize relevance.
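Conceptually, the master fitness function folds many per-signal scores into one relevance score for each candidate page. A weighted sum is the simplest way to picture it; the signal names and weights below are invented, and the real system is far more elaborate (and increasingly learned rather than hand-tuned).

```python
# Invented signal scores for one query and one candidate page (all scaled to [0, 1]).
signals = {"pagerank": 0.8, "anchor_text_match": 0.6, "freshness": 0.3, "long_click_rate": 0.7}

# Invented weights standing in for the hand-tuned (or learned) parameters.
weights = {"pagerank": 0.4, "anchor_text_match": 0.3, "freshness": 0.1, "long_click_rate": 0.2}

relevance = sum(weights[name] * score for name, score in signals.items())
print(round(relevance, 3))   # 0.67 -- the number the ranking is ultimately optimized around
```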

Some of these functions, like PageRank, have names, and even research papers explaining them. Others are trade secrets known only to the engineering teams that create and manage them. Many of them represent fundamental improvements in the art of search. For example, Google’s addition of what it called “the Knowledge Graph” allowed it to build on known associations between various kinds of entities, such as dates, people, places, and organizations, understanding for instance that a person might be “born on,” an “employee of,” a “daughter of” or “mother of,” “living in,” and so on. This work was based on a database created by a company called Metaweb, which Google acquired in 2010. When Metaweb unveiled its project in March 2007, I wrote enthusiastically, “They are building new synapses for the global brain.”

Other components of the overall search algorithm were created in response to changing conditions in that global brain, the collective expression of billions of connected humans. For example, Google at first struggled to adapt to the real-time stream of consciousness coming from Twitter; the algorithms also had to be adjusted as smartphones made video and images as common on the Internet as text; as more and more searches were being made from mobile phones, devices whose precise location is known, local results became far more important; with the advent of speech interfaces, search queries became more conversational.

Google constantly tests new ideas that might give better results. In a 2009 interview, Google’s then VP of search, Udi Manber, noted that they’d run more than 5,000 experiments in the previous year, with “probably 10 experiments for every successful launch.” Google would launch a tweak to the algorithms or a new ranking factor on the order of 100 to 120 times a quarter, or an average of once a day. Since then, that speed has only accelerated. There were even more experiments on the advertising side.

How do they know that a change improves relevance? One way to evaluate a change is short-term user response: What are users clicking on? Another is long-term user response: Do they come back to Google for more? Another is talking to actual users one-on-one and asking them what they think.

Google also has a team of human evaluators check the results of a standardized list of common queries that are run automatically on a continuous basis. In the earliest days of Google, both the list of queries and the evaluation were done by the engineers themselves. By 2003 or 2004, Google had built a separate Search Quality team devoted to this effort. This team includes not just the search engineers but a statistically significant panel of external users who work Mechanical Turk–style, to give a thumbs-up or thumbs-down to a broad range of search results. In 2015, Google actually published the manual that they provide to their Search Quality raters.

It’s important to remember, though, that when the raters find a problem, Google doesn’t manually intervene to push the rank of a site up or down. When they find an anomaly—a case where the result the algorithm produces doesn’t match what the human testers expect—they ask themselves, “What additional factors or different weighting can we apply in the algorithm that will produce the result we believe users are looking for?”

It’s not always immediately obvious how to solve some search problems with pure ranking. At one point, the best algorithmically determined result for “Glacier Bay” turned up the Glacier Bay brand of faucets and sinks rather than the US national park of the same name. The algorithm was correct that more people were linking to and searching for Glacier Bay plumbing products, but users would be very surprised if the park didn’t show up at the top of search results.

My own company, O’Reilly Media, was the subject of a similar problem. O’Reilly Media (at the time still called O’Reilly & Associates) was one of the earliest sites on the web and we published a lot of content—rich, high-quality pages that were especially relevant to the web’s early adopters—so we had many, many inbound links. This gave us a very high page rank. At one point early in Google’s history, someone published “the Google alphabet”—the top result for searching on a single letter. My company owned the letter o. But what about O’Reilly Auto Parts, a Fortune 500 company? They didn’t even show up on the first page of search results.

For a brief time, until they came up with a proper algorithmic fix, Google divided pages like these into two parts. In the case of Glacier Bay, the national park occupied the top half of the search results page, with the bottom half given over to sinks, toilets, and faucets. In the case of O’Reilly, Bill O’Reilly and I came to share the top half while O’Reilly Auto Parts got the lower half. Eventually, Google improved the ranking algorithms sufficiently to interleave the results on the page.

One factor requiring constant adjustment to the algorithms is the efforts of the publishers of web pages to adapt to the system. Larry and Sergey foresaw this problem in their original search paper:

Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulate search engines for profit become a serious problem.

That was an understatement. Entire companies were created to game the system. Many of Google’s search algorithm changes were responses to what came to be called “web spam.” Even when web publishers weren’t using underhanded tactics, they were increasingly struggling to improve their ranking. “Search engine optimization,” or SEO, became a new field. Consultants with knowledge of best practices advised clients on how to structure their web pages, how to make sure that keywords relevant to the search were present in the document and properly emphasized, why it was important to get existing high-quality sites to link to them, and much more.

There was also “black hat SEO”—creating websites that intentionally deceive, and that violate the search engine’s terms of service. Black hat SEO techniques included stuffing a web page with invisible text readable by a search engine but not by a human, and creating vast web “content farms” containing algorithmically generated low-quality content that included all the right search terms but little useful information that the user actually might want, with pages cross-linked to each other to provide the appearance of human activity and interest. Google introduced numerous search algorithm updates specifically to deal with this kind of spam. The battle against bad actors is unrelenting for any widely used online service.

Google had one enormous advantage in this battle, though: its focus on the interests of the user, as expressed through measurable relevance. In his 2005 book, The Search, John Battelle called Google “the database of intentions.” Web pages might use underhanded techniques to try to improve their standing—and many did—but Google was constantly working toward a simple gold standard: Is this what the searcher wants to find?

When Google introduced its pay-per-click ad auction in 2002, what had started out as an idealistic quest for better search results became the basis of a hugely successful business. Fortunately, unlike other advertising business models, which can pit the interests of advertisers against the interests of users, pay-per-click aligns the interests of both.

In the pay-per-impression model that previously dominated online advertising, and continues to dominate print, radio, and television, advertisers pay for the number of times viewers see or hear an ad (or in the case of less measurable media, how often they might see or hear it, based on estimates of readership or viewership), usually expressed as CPM (cost per thousand). But in the pay-per-click model, introduced by a small company called GoTo (later renamed Overture) in 1998, the same year Google was founded, advertisers pay only when a viewer actually clicks on an ad and visits the advertised website.

A click on an ad thus becomes similar to a click on a search result: a sign of user intention. In Overture’s pay-per-click model, ads were sold to the highest bidder, with the company willing to pay the most to have their ad appear on a popular page of relevant search results getting the coveted spots. The company had achieved modest success with the model, but it didn’t really take off till Google took the idea further. Google’s insight was that the actual revenue from a pay-per-click ad was the combination of its price and the probability that the ad would actually be clicked on. An ad costing only $3 but twice as likely to be clicked on as a $5 ad would generate an additional dollar in expected revenue. Measuring the probability of an ad click and using it to rank the placement of an ad is obvious in retrospect, but like Amazon’s 1-Click shopping or Uber’s automatic payment, it was unthinkable to people wrapped in the coils of the prevailing paradigm for how advertising was sold.
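The core insight can be sketched in a few lines: rank ads by expected revenue, bid times estimated click probability, rather than by bid alone. The numbers are invented, and this ignores quality scores, second-price mechanics, and everything else in the real auction.

```python
# (advertiser, bid in dollars per click, estimated probability of a click)
ads = [
    ("A", 5.00, 0.10),
    ("B", 3.00, 0.20),   # lower bid, but twice as likely to be clicked
]

# Ranking by bid alone would put A first; ranking by expected revenue puts B first.
ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)

for name, bid, ctr in ranked:
    print(f"Ad {name}: expected revenue per impression = ${bid * ctr:.2f}")
# Ad B: $0.60, Ad A: $0.50 -- an extra dollar for every ten impressions served.
```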

This is a vast oversimplification of how Google’s ad auction actually works, but it highlights the alignment of Google’s search business model with its promise to users to help them find the most relevant results.

Facebook was not so lucky in finding alignment between the goals of its users and those of its advertisers.

Why? People don’t just turn to social media for facts. They turn to it for connection with their friends, breaking news, entertainment, and the latest memes. In an attempt to capture these user goals, Facebook chose as its fitness function measures of what it believes users find “meaningful.” Like Google, Facebook uses many signals to determine what its users find most meaningful in their feeds, but one of the strongest is what we might call “engagement.” The omnipresent “Like” button on every post is one measure of engagement; users look for the endorphin rush that comes when their friends pay attention and give approval to the content they share. Facebook also measures clicks, just like Google, but the clicks it values most are not the ones that send people away, but the ones that keep them on the site, searching for more like what they just saw.

The Facebook News Feed was originally a strict timeline of updates from the friends you’d chosen to follow. It was a neutral platform. But once Facebook realized that it could get higher engagement by promoting the most liked pages and the most clicked-on links to the top of the News Feed, sometimes showing them again and again, it became something like the television shopping channels of old.

In the early days of Internet commercialization, I had the opportunity to visit QVC, the granddaddy of television shopping, which was looking to build an online equivalent. Three rotating soundstages held products and the hosts who sold them to viewers by describing them in glowing terms. Immediately facing the stage was an analyst with a giant computer workstation, monitoring call volume and sales from each of the company’s call centers in real time, giving the signal to switch to the next product only when attention and sales fell off. I was told that hosts were hired for their ability to talk nonstop about the virtues of a pencil for at least fifteen minutes.

That’s the face of social media with engagement as its fitness function. Millions of nonstop hosts. Billions of personalized shopping channels for content.

And as was the case with Google, both legitimate players and bad actors soon were playing to the strengths and weaknesses of the algorithm. As Father John Culkin so aptly summarized the ideas of Marshall McLuhan, “We shape our tools, and thereafter our tools shape us.” You choose the fitness function of your algorithms, and in turn, they shape your company, its business model, its customers, and ultimately our entire society. We’ll explore some of the downsides of Facebook’s fitness function in Chapter 10, and of financial markets in Chapter 11.

FROM JET ENGINES TO ROCKETS

If the introduction of probabilistic big data was like replacing a piston engine with a jet, the introduction of machine learning is like moving to a rocket. A rocket can go where a jet cannot, since it carries with it not only its own combustible fuel, but its own oxygen. This is a poor analogy, but it hints at the profound change that machine learning is bringing to the practices of even a company like Google.

Sebastian Thrun, the self-driving-car pioneer who led Google’s early efforts in that area and is now the CEO of Udacity, an online learning platform, described how much the practice of software engineering is changing. “I used to create programs that did exactly what I told them to do, which forced me to think of every possible contingency and make a rule for every contingency. Now I build programs, feed them data, and teach them how to do what I want.”

Using the old approach, a software engineer working on Google’s search engine might have a hypothesis about a signal that would improve search results. She’d code up the algorithm, test it on some subset of search queries, and if it improved the results, it might go into deployment. If it didn’t, the developer might modify her code and rerun the experiment. Using machine learning, the developer starts out with a hypothesis, just like before, but instead of producing a handcrafted algorithm to process the data, she collects a set of training data reflecting that hypothesis, then feeds the data into a program that outputs a model—a mathematical representation of features to be looked for in the data. This cycle is repeated again and again, with the program making minute adjustments to the model, gradually modifying the hypothesis using a technique such as gradient descent until it more perfectly matches the data. In short, the refined model is learned from the data. That model can then be turned loose on real-world data similar to that in the training data set.
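A minimal sketch of the shift Thrun describes, using scikit-learn as a stand-in: instead of hand-writing the rule, you collect labeled examples and let the fitting procedure find the model. The spam-detection task, the two features, and the tiny data set are all invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# The old way: think of every contingency and write the rule yourself.
def is_spam_rule(num_links: int, num_exclamations: int) -> bool:
    return num_links > 3 and num_exclamations > 2

# The new way: gather labeled examples and fit a model to them.
X = [[0, 0], [1, 1], [5, 4], [7, 6], [2, 0], [6, 5]]   # (num_links, num_exclamations)
y = [0, 0, 1, 1, 0, 1]                                 # 1 = spam, labeled by humans

model = LogisticRegression().fit(X, y)
print(model.predict([[4, 3], [1, 0]]))                 # decisions learned from the data
```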

Yann LeCun, a pioneer in a breakthrough machine learning technique called deep learning and now the head of the Facebook AI Research lab, uses the following analogy to explain how a model is trained to recognize images:

A pattern recognition system is like a black box with a camera at one end, a green light and a red light on top, and a whole bunch of knobs on the front. The learning algorithm tries to adjust the knobs so that when, say, a dog is in front of the camera, the red light turns on, and when a car is put in front of the camera, the green light turns on. You show a dog to the machine. If the red light is bright, don’t do anything. If it’s dim, tweak the knobs so that the light gets brighter. If the green light turns on, tweak the knobs so that it gets dimmer. Then show a car, and tweak the knobs so that the red light gets dimmer and the green light gets brighter. If you show many examples of the cars and dogs, and you keep adjusting the knobs just a little bit each time, eventually the machine will get the right answer every time. . . . The trick is to figure out in which direction to tweak each knob and by how much without actually fiddling with them. This involves computing a “gradient,” which for each knob indicates how the light changes when the knob is tweaked. Now, imagine a box with 500 million knobs, 1,000 light bulbs, and 10 million images to train it with. That’s what a typical Deep Learning system is.
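LeCun’s knob analogy maps directly onto gradient descent. The toy sketch below tunes two “knobs” by estimating, for each one, how the error changes when it is nudged (a finite-difference stand-in for the gradient, which real systems compute analytically with backpropagation), then turning each knob slightly in the direction that reduces the error.

```python
def error(knobs):
    """How wrong the machine is; here the 'right' knob settings happen to be (3, -2)."""
    a, b = knobs
    return (a - 3) ** 2 + (b + 2) ** 2

knobs = [0.0, 0.0]
step, nudge = 0.1, 1e-4

for _ in range(200):
    gradient = []
    for i in range(len(knobs)):
        nudged = knobs.copy()
        nudged[i] += nudge
        gradient.append((error(nudged) - error(knobs)) / nudge)   # how the error responds to this knob
    knobs = [k - step * g for k, g in zip(knobs, gradient)]       # turn each knob a little

print([round(k, 2) for k in knobs])   # approaches [3.0, -2.0]
```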

Deep learning uses layers of recognizers. Before you can recognize a dog, you have to be able to recognize shapes. Before you can recognize shapes, you have to be able to recognize edges, so that you can distinguish a shape from its background. These successive stages of recognition each produce a compressed mathematical representation that is passed up to the next layer. Getting the compression right is key. If you try to compress too much, you can’t represent the richness of what is going on, and you get errors. If you try to compress too little, the network will memorize the training examples perfectly, but will not generalize well to novel inputs.
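A sketch of the “layers of recognizers” idea, using PyTorch here as one convenient framework (the text does not name one): each layer maps its input to a smaller representation, so the network is forced to compress. The layer sizes are arbitrary choices for illustration.

```python
import torch
from torch import nn

# Edges -> shapes -> objects: each stage is a smaller, compressed representation.
recognizer = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # raw 28x28 pixels -> edge-like features
    nn.Linear(256, 64), nn.ReLU(),    # edges -> shape-like features
    nn.Linear(64, 10),                # shapes -> scores for 10 object classes
)

image = torch.rand(1, 784)            # stands in for one flattened image
scores = recognizer(image)
print(scores.shape)                   # torch.Size([1, 10])
```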

Machine learning takes advantage of the ability of computers to do the same thing, or slight variations of the same thing, over and over again very fast. Yann once waggishly remarked, “The main problem with the real world is that you can’t run it faster than real time.” But computers do this all the time. AlphaGo, the AI-based Go player created by UK company DeepMind that defeated one of the world’s best human Go players in 2016, was first trained on a database of 30 million Go positions from historical games played by human experts. It then played millions of games against itself in order to refine its model of the game even further.

Machine learning has become a bigger part of Google Search. In 2016, Google announced RankBrain, a machine learning model that helps to identify pages that are about the subject of a user’s query but that might not actually contain the words in the query. This can be especially helpful for queries that have never been seen before. According to Google, RankBrain’s opinion has become the third most important among the more than two hundred factors that it uses to rank pages.
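Google has not published how RankBrain works, but the general idea of matching by meaning rather than by shared keywords can be illustrated with embeddings: the query and each page are mapped to vectors, and nearby vectors count as a match even when no words overlap. The three-dimensional vectors below are made up purely for illustration; real embeddings are learned from data and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Similarity of two vectors, ignoring their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up embedding vectors; a real system would learn these from data.
query_vec = [0.9, 0.1, 0.4]                                   # "how do I fix a dripping tap"
pages = {
    "guide to repairing a leaky faucet": [0.85, 0.15, 0.5],   # no shared words, similar meaning
    "history of the roman aqueducts":    [0.10, 0.90, 0.2],
}

for title, vec in sorted(pages.items(), key=lambda kv: -cosine(query_vec, kv[1])):
    print(f"{cosine(query_vec, vec):.2f}  {title}")
```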

Google has also applied deep learning to language translation. The results were so startlingly better that after a few months of testing, the team stopped all work on the old Google Translate system discussed earlier in this chapter and replaced it entirely with the new one based on deep learning. It isn’t yet quite as good as human translators, but it’s close, at least for everyday functional use, though perhaps not for literary purposes.

Deep learning is also used in Google Photos. If you have tried Google Photos, you’ve seen how it can recognize objects in your photos. Type “horse” and you will turn up pictures of horses, even if they are completely unlabeled. Type castle or fence, and you will turn up pictures of castles or fences. It’s magical.

Remember that Google Photos is doing this on demand for the photos of more than 200 million users, photos that it’s never seen before, hundreds of billions of them.

This is called supervised learning, because, while Google Photos hasn’t seen your photos before, it has seen a lot of other photos. In particular, it’s seen what’s called a training set. In the training set, the data is labeled. Amazon’s Mechanical Turk, or services like it, are used to send out pictures, one at a time, to thousands of workers who are asked to say what each contains, or to answer a question about some aspect of it (such as its color), or, as in the case of the Google Photos training set, simply to write a caption for it.

Amazon calls these microtasks HITs (Human Intelligence Tasks). Each one asks a single question, perhaps even using multiple choice: “What color is the car in this picture?” “What animal is this?” The same HIT is sent to multiple workers; when many workers give the same answer, it is presumably correct. Each HIT may pay as little as a penny, using a distributed “gig economy” labor force that makes driving for Uber look like a good middle-class job.
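A sketch of the aggregation step: send the same HIT to several workers and keep the majority answer as the label. The image names and answers are invented; real pipelines also weight workers by their track record, which this ignores.

```python
from collections import Counter

# Hypothetical answers from five workers to the HIT "What animal is this?"
answers = {
    "image-001": ["dog", "dog", "dog", "wolf", "dog"],
    "image-002": ["cat", "cat", "dog", "cat", "cat"],
}

labels = {}
for image, worker_answers in answers.items():
    answer, votes = Counter(worker_answers).most_common(1)[0]
    labels[image] = answer if votes >= 3 else None   # require a clear majority of the five
print(labels)   # {'image-001': 'dog', 'image-002': 'cat'} -- labels for the training set
```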

The role of Amazon’s Mechanical Turk in machine learning is a reminder of just how deeply humans and machines are intertwined in the development of next-generation applications. Mary Gray, a researcher at Microsoft who has studied the use of Mechanical Turk, noted to me that you can trace the history of AI research by looking at how the HITs used to build training data sets have changed over time. (An interesting example is the update to Google’s Site Rater Guidelines early in 2017, which was made, according to Paul Haahr, a Google search ranking engineer, in order to produce training data sets for the algorithmic detection of fake news.)

The holy grail in AI is unsupervised learning, in which an AI learns on its own, without being carefully trained. Popular excitement was inflamed by DeepMind’s creators’ claim that their algorithms “are capable of learning for themselves directly from raw experience or data.” Google purchased DeepMind in 2014 for $500 million, after it demonstrated an AI that had taught itself to play various older Atari computer games, given nothing but the raw pixels on the screen and the game score.

The highly publicized victory of AlphaGo over Lee Sedol, one of the top-ranked human Go players, represented a milestone for AI, because of the difficulty of the game and the impossibility of using brute-force analysis of every possible move. But DeepMind cofounder Demis Hassabis wrote, “We’re still a long way from a machine that can learn to flexibly perform the full range of intellectual tasks a human can—the hallmark of true artificial general intelligence.”

Yann LeCun also blasted those who oversold the significance of AlphaGo’s victory, writing, “most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI.”

At this point, humans are always involved, not only in the design of the model but also in the data that is fed to the model in order to train it. This can result in unintended bias. Possibly the most important questions in AI are not the design of new algorithms, but how to make sure that the data sets with which we train them are not inherently biased. Cathy O’Neil’s book Weapons of Math Destruction is essential reading on this topic. For example, if you were to train a machine learning model for predictive policing on a data set of arrest records without considering whether police arrest blacks but tell whites “don’t let me catch you doing that again,” your results are going to be badly skewed. The characteristics of the training data are much more important to the result than the algorithm. Failure to grasp that is itself a bias that those who have studied a lot of pre–machine learning computer science will have trouble overcoming.
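The arrest-record example can be made concrete with a tiny simulation: two groups with identical underlying behavior, but different probabilities that an offense becomes an arrest record. Every number here is invented; the point is only that the labels, not the behavior, are what the model sees.

```python
import random
random.seed(0)

N = 100_000
OFFENSE_RATE = 0.05                       # identical true behavior in both groups
ARREST_GIVEN_OFFENSE = {"group A": 0.9,   # invented enforcement rates
                        "group B": 0.3}

for group, p_arrest in ARREST_GIVEN_OFFENSE.items():
    arrests = sum(
        1 for _ in range(N)
        if random.random() < OFFENSE_RATE and random.random() < p_arrest
    )
    print(f"{group}: arrest rate in the training data = {arrests / N:.3f}")

# A model trained on these labels will score group A as roughly three times riskier,
# even though the simulated behavior in the two groups is exactly the same.
```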

This unfortunate example also provides insight into how machine learning models work. There are many feature vectors in any given model, creating an n-dimensional space into which the classifier or recognizer places each new item it is asked to process. While there is fundamental research going on to develop entirely new machine learning algorithms, most of the hard work in applied machine learning involves identifying the features that might be most predictive of the desired result.
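What “feature vectors in an n-dimensional space” looks like in practice: hand-picked properties of each raw record are turned into a fixed-length list of numbers, and the classifier sees only those numbers. The fraud-detection features below are invented examples of the kind of thing a data scientist might craft.

```python
def to_feature_vector(transaction: dict) -> list:
    """Map one raw record to a point in a 4-dimensional feature space."""
    return [
        float(transaction["amount"]),                             # numeric, used as-is
        1.0 if transaction["country"] != transaction["card_country"] else 0.0,
        float(transaction["purchases_last_24h"]),
        1.0 if transaction["hour"] < 6 else 0.0,                  # engineered: late-night purchase
    ]

record = {"amount": 742.50, "country": "BR", "card_country": "US",
          "purchases_last_24h": 9, "hour": 3}
print(to_feature_vector(record))   # [742.5, 1.0, 9.0, 1.0]
```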

I once asked Jeremy Howard, formerly the CTO of Kaggle, a company that carries out crowdsourced data science competitions, what distinguished the winners from the losers. (Jeremy himself was a five-time winner before joining Kaggle.) “Creativity,” he told me. “Everyone is using the same algorithms. The difference is in what features you choose to add to the model. You’re looking for unexpected insights about what might be predictive.” (Peter Norvig noted to me, though, that the frontier where creativity must be exercised has already moved on: “This was certainly true back when random forests and support vector machines were the winning technologies on Kaggle. With deep networks, it is more common to use every available feature, so the creativity comes in picking a model architecture and tuning hyperparameters, not so much in feature selection.”)

Perhaps the most important question for machine learning, as for every new technology, though, is which problems we should choose to tackle in the first place. Jeremy Howard went on to cofound Enlitic, a company that is using machine learning to review diagnostic radiology images, as well as scanning many other kinds of clinical data to determine the likelihood and urgency of a problem that should be looked at more closely by a human doctor. Given that more than 300 million radiology images are taken each year in the United States alone, you can guess at the power of machine learning to bring down the cost and improve the quality of healthcare.

Google’s DeepMind too is working in healthcare, helping the UK National Health Service to improve its operations and its ability to diagnose various conditions. Switzerland-based Sophia Genetics is matching 6,000 patients to the best cancer treatment each month, with that number growing monthly by double digits.

Tellingly, Jeff Hammerbacher, who worked on Wall Street before leading the data team at Facebook, once said, “The best minds of my generation are thinking about how to make people click ads. That sucks.” Jeff left Facebook and now plays a dual role as chief scientist and cofounder at big data company Cloudera and faculty member of the Icahn School of Medicine at Mount Sinai, in New York, where he runs the Hammer Lab, a team of software developers and data scientists trying to understand how the immune system battles cancer.

The choice of the problems to which we apply the superpowers of our new digital workforce is ultimately up to us. We are creating a race of djinns, eager to do our bidding. What shall we ask them to do?
