Real-World Active Learning

Applications and strategies for human-in-the-loop machine learning.

By Ted Cuzzillo

February 28, 2015

The Bean, Downtown Chicago (source: Rlobes (Pixabay))

The online world has blossomed with machine-driven riches. We don’t send letters; we email. We don’t look up a restaurant in a guide book; we look it up on OpenTable. When a computer that makes any of this possible goes wrong, we even search for a solution online. We thrive on the multitude of “signals” available.

But where there’s signal, there’s “noise”—inaccurate, inappropriate, or simply unhelpful information that gets in the way. For example, in receiving email, we also fend off spam; while scouting for new employment, we receive automated job referrals with wildly inappropriate matches; and filters made to catch porn may confuse it with medical photos.

We can filter out all of this noise by hand, but at some point that becomes more trouble than it’s worth; that is when machines and their algorithms can make things much easier. To filter spam, for example, we can give an algorithm a set of known-good and known-bad emails as examples so that it can make educated guesses while filtering mail.

Even with solid examples, though, algorithms fail and block important emails, filter out useful content, and cause a variety of other problems. As we’ll explore throughout this report, the point at which algorithms fail is precisely where there’s an opportunity to insert human judgment to actively improve the algorithm’s performance.

In a recent article on Wired (“The Huge, Unseen Operation Behind the Accuracy of Google Maps,” 12/08/14), we caught a glimpse of the massive active-learning operation behind the management of Google Maps. During a visit to Google, reporter Greg Miller got a behind-the-scenes look at Ground Truth, the team that refines Google Maps using machine-learning algorithms and manual labor. The algorithms collect data from satellite, aerial, and Google’s Street View images, extracting data like street numbers, speed limits, and points of interest. Yet even at Google, algorithms get you to a certain point, and then humans need to step in to manually check and correct the data. Google also takes advantage of help from citizens—a different take on “crowdsourcing”—who give input using Google’s Map Maker program and contribute data for off-road locations where Street View cars can’t drive.

Active learning, a relatively new strategy, gives machines a guiding hand, nudging the accuracy of algorithms into a tolerable range and often toward perfection. In crowdsourcing, a closely related trend made possible by the Internet, humans make up a “crowd” of contributors (or “labelers,” “workers,” or “turkers,” after Amazon Mechanical Turk) who give feedback and label content; those labels are fed back into the algorithm; and in a short time, the algorithm improves to the point where its results are usable.

Active learning is a strategy that, while not hard to deploy, is hard to perfect. For practical applications and tips, we turned to several experts in the field and bring you the knowledge they’ve gained through various projects in active learning.

“Gold Standard” Data: A Best Practice Method for Assessing Labels

Patrick Philips, a crowdsourcing expert and data scientist at Euclid Analytics, describes a best practice method in active learning: formulating “gold standard” data. Before any crowd of contributors sees the data for a job, Philips spends one to two hours scoring a small subset of the data by hand. He adds that “gold standard” data can also be extracted from contributor-labeled data when there is strong agreement (on the labels) among the contributors. Creating and managing a set of “gold standard” data (e.g., four to five examples for each class of data) provides a standard for judging labels that come in from contributors, and it can be put to use in several ways. First, it’s an up-front filter: each worker’s contributions are automatically compared with the “gold standard” data to measure understanding, ability, and trustworthiness for the job. Second, using the “gold standard” data allows for ongoing monitoring and provides a means to train and retrain workers and to offer corrections to improve performance. Third, the “gold standard” data allows you to score each worker’s accuracy and automatically exclude work that falls below a certain percentage of accuracy; in addition, this provides the opportunity to discover problems with the data itself.

Elements of “Gold Standard” Data

“Gold standard” data is the standard by which all other data in one application can be measured.

The Benefits

Setting up your own “gold standard” data gives you an overview of your data and helps you decide what labels you need. It also helps you avoid designing unhelpful/bad labels, which can result in severe mislabeling and problems later on. Your early work on a subset of “gold standard” data can save you time and money later.

Tips

Start with just a small subset of your data, perhaps just four or five examples from each class.
Use your “gold standard” data to measure the performance of each contributor so you know when to retrain workers. When a contributor’s score falls below 70% accuracy, exclude that contributor’s work and retrain the contributor.
Continually review your “gold standard” data to ensure it’s as accurate and useful as possible so that it maintains its purpose.
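
The scoring step described in these tips can be sketched in a few lines of Python. This is a minimal illustration, not Philips’ actual tooling: the item IDs, contributor names, and helper functions are hypothetical, and only the 70% cutoff comes from the tips above.

```python
from collections import defaultdict

# Hypothetical gold-standard answers: item id -> correct label.
GOLD = {"email-1": "spam", "email-2": "ham", "email-3": "spam", "email-4": "ham"}

ACCURACY_THRESHOLD = 0.70  # the cutoff mentioned in the tips above


def score_contributors(labels, gold=GOLD):
    """Return each contributor's accuracy on the gold-standard items.

    `labels` is a list of (contributor, item_id, label) tuples. Items that
    are not in the gold set are ignored when scoring.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for contributor, item_id, label in labels:
        if item_id in gold:
            seen[contributor] += 1
            correct[contributor] += int(label == gold[item_id])
    return {c: correct[c] / seen[c] for c in seen}


def needs_retraining(scores, threshold=ACCURACY_THRESHOLD):
    """Contributors whose gold-standard accuracy falls below the threshold."""
    return sorted(c for c, acc in scores.items() if acc < threshold)


# Made-up submissions from two hypothetical contributors.
submitted = [
    ("ana", "email-1", "spam"), ("ana", "email-2", "ham"),
    ("ana", "email-3", "spam"), ("ana", "email-4", "ham"),
    ("bo", "email-1", "ham"), ("bo", "email-2", "ham"),
    ("bo", "email-3", "ham"), ("bo", "email-4", "ham"),
]
scores = score_contributors(submitted)
flagged = needs_retraining(scores)
print(scores, flagged)
```

Here the contributor who labels everything “ham” scores 50% on the gold items and is flagged for retraining, while the other contributor passes.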

Managing the Crowd

The “crowd” solves a problem and has definite usefulness in active learning, but it also has a flip side: “Humans will sometimes give you wrong answers,” explains Adam Marcus of GoDaddy. Mislabeled items sometimes result from boredom, perhaps even resentment. Also, some questions might be unintentionally misleading, or the contributors might have raced through with little attention to the questions and answers. Whatever the cause, wrong answers can badly skew training data and take hours to correct.

One simple solution, explains Marcus, is to ask questions not once but several times. Redundant questioning establishes confidence in the labeling. An item that’s labeled in one certain way by four different contributors is far more likely to be correctly classified than one that’s labeled by just one contributor.
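
As a rough sketch of redundant questioning, the snippet below collapses several contributors’ labels for one item into a majority label plus an agreement share. The function name and sample labels are assumptions made for illustration; the report does not describe a specific aggregation rule.

```python
from collections import Counter


def aggregate(labels):
    """Return (majority_label, agreement_share) for one item's labels.

    On a tie, Counter.most_common keeps the label that was seen first, so a
    real pipeline would want an explicit tie-breaking rule.
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


unanimous = aggregate(["restaurant", "restaurant", "restaurant", "restaurant"])
split = aggregate(["restaurant", "bar", "restaurant", "restaurant"])
print(unanimous, split)
```

The agreement share gives a simple confidence signal: four matching labels yield a share of 1.0, while one dissenting label lowers it to 0.75.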

A more complex but beneficial solution is the creation of worker hierarchies. A hierarchy allows more than simple, redundant labeling—it sends items with low certainty up the ladder to more trusted workers. Hierarchies rely on long-term relationships with contributors. To enable hierarchies, organizations can recruit through companies such as oDesk, Elance, and other online marketplaces with a plentiful supply of customer-rated candidates. As workers become known and trusted, they’re given more work and asked to review other workers. They might also receive recognition and bonuses, and they can even move up to more interesting tasks and even manage projects. “We have reviewers running jobs, doing way more interesting work than they were hired for,” says Marcus. “These incentives give contributors a clear sense of upward mobility.”

Contributors whose work falters, on the other hand, are given less work. The weaker their performance becomes, the more scrutiny they receive, and their work volume is incrementally reduced.

The hierarchy system also improves training. At GoDaddy’s Locu team, a new worker recruited through oDesk or a similar agency would have a week of training and practice; the new worker’s cleanup of the classifier’s output would go first to a trusted worker, whose review would then go back to the new worker. According to Marcus, within just a few weeks, the new recruit’s work improves.
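
One way to picture how low-certainty items climb a worker hierarchy is the sketch below. It is an assumption-laden illustration rather than a description of the Locu team’s system: the tier names, the 0.75 agreement threshold, and the `resolve` function are all hypothetical.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.75  # hypothetical cutoff for accepting a label


def agreement(labels):
    """Majority label and the share of workers who chose it."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


def resolve(tiers, threshold=AGREEMENT_THRESHOLD):
    """Walk up the hierarchy until one tier agrees strongly enough.

    `tiers` is a list of (tier_name, labels) pairs ordered from least to
    most trusted. Returns (label, tier_name), or (None, None) if no tier
    reaches the threshold.
    """
    for tier_name, labels in tiers:
        label, share = agreement(labels)
        if share >= threshold:
            return label, tier_name
    return None, None


# New workers split evenly, so the item is escalated to trusted reviewers.
result = resolve([
    ("new workers", ["cafe", "bar", "cafe", "bar"]),
    ("trusted reviewers", ["cafe", "cafe"]),
])
print(result)
```

In practice the higher tiers would only be asked for labels after a lower tier fails to agree; the list is filled in up front here to keep the example short.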

A Challenge: Overconfident Contributors

Redundant questioning and “gold standard” data both help address a common problem identified by crowdsourcing expert Patrick Philips: overconfidence among contributors. Confidence bias, a phenomenon well known to psychologists, is individuals’ systematic overconfidence in their own ability to complete objective tasks accurately.

In one experiment by Philips, as described in his 2011 blog post “Confidence Bias: Evidence from Crowdsourcing,” individuals were asked to answer a set of standardized verbal and math-related questions and to identify how confident they were in each answer. The difference between each individual’s average confidence and actual performance was an estimate of confidence bias. Of the 829 people who answered 10 or more questions, more than 75% overestimated their abilities.
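
The estimate described above, average stated confidence minus actual accuracy, can be written out as a short calculation. The answer data below is made up for illustration, and the function name is hypothetical.

```python
def confidence_bias(answers):
    """Average stated confidence minus actual accuracy.

    `answers` is a list of (confidence, was_correct) pairs, with confidence
    expressed as a fraction between 0 and 1. A positive result means the
    person overestimated their ability.
    """
    avg_confidence = sum(conf for conf, _ in answers) / len(answers)
    accuracy = sum(1 for _, ok in answers if ok) / len(answers)
    return avg_confidence - accuracy


# Ten made-up answers: 90% confident each time, but only 7 of 10 correct.
answers = [(0.9, True)] * 7 + [(0.9, False)] * 3
bias = confidence_bias(answers)
print(round(bias, 2))
```

A person who is 90% confident on every question but answers only 70% correctly has a bias of about 0.2, or 20 percentage points of overconfidence.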

Philips found that confidence bias rises with a person’s level of education and age, and also with the number of questions they answer accurately. In his experiment, US contributors were much more accurate and slightly more biased than the average. Individuals from India had average accuracy, but much higher confidence. In looking at gender, Philips found that women were more accurate and less biased than men.

More Tips for Managing Contributors

Pick your problems carefully. Think about the problem you’re trying to solve and structure it in a way that makes it easy to get meaningful feedback.
Pick a solvable problem (have your team try it first). According to Philips, “If you can’t do it on your team, it’s probably not a solvable problem.” If you find yourself with a seemingly unsolvable problem, consider using a parallel problem that can be solved more easily.
Make sure the task is clearly defined. Whatever you want your labelers to do—test it with your team first. If your team has trouble, the labelers certainly will; this gives you the chance to make sure your task is clearly defined.
Use objective labels that reasonable people agree on. In Philips’ previous role at LinkedIn, the company set out to classify content in its newsfeed by having crowdsourced workers label content using only a handful of descriptors, such as “exciting,” “exhilarating,” “insightful,” and “interesting.” The team sent out about 50,000 articles for labeling, and the task seemed easy enough until the labeled data came back. “It was a mess,” says Philips. No one on the team had agreed on what each of these labels meant before sending the task out, so they couldn’t agree whether the results were accurate when they came back, either. Results improved when the team switched to more objective labels, with a four-tier selection that accounted for overall quality based on coherence, spelling, and grammar; a second selection indicated the general content, such as nonfiction, fiction, and op-ed.

When to Skip the Crowd

You may discover that you need no crowd at all because the answer is right in front of you. In one project at LinkedIn, Philips’ team wanted to refer people to job postings appropriate for their level of experience. The team hoped to have crowdsourced workers classify members into one of three categories: individual contributor, manager, or executive manager. Though a seemingly straightforward task, it proved quite difficult. For starters, job titles vary wildly among companies, and even when the titles match, the size of the company affects the role itself; for example, a vice president at Google may not belong in the same seniority category as a vice president at a startup.

Other more indicative data, such as salary, wasn’t available, so the team tried proxies. They looked at the number of years since graduation, which was useful, though not enough. Other proxies included endorsements within a network, the seniority of immediate connections, and maps that show whose profiles members have viewed.

Eventually, the team at LinkedIn found a solution based on data they already had: when LinkedIn members write recommendations, they explicitly indicate their relationship to that person—peer, manager, or direct report. With help from millions of LinkedIn recommendations, the team developed a system to rank employees by seniority within a company.
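
To make the idea concrete, here is one hypothetical way that stated manager and direct-report relationships could be turned into a seniority ranking. The report does not describe LinkedIn’s actual system; the names, the `seniority_levels` function, and the “longest chain of reports” rule are assumptions for illustration only.

```python
from functools import lru_cache


def seniority_levels(manages):
    """Assign each person a level from (manager, report) pairs.

    Level 0 means nobody is recorded as reporting to the person; otherwise
    the level is one more than the highest level among their reports.
    Assumes the relationships contain no cycles.
    """
    reports = {}
    people = set()
    for manager, report in manages:
        reports.setdefault(manager, set()).add(report)
        people.update((manager, report))

    @lru_cache(maxsize=None)
    def level(person):
        subordinates = reports.get(person, ())
        if not subordinates:
            return 0
        return 1 + max(level(s) for s in subordinates)

    return {person: level(person) for person in people}


# Made-up relationships taken from hypothetical recommendations.
levels = seniority_levels([("dana", "eli"), ("eli", "fay"), ("eli", "gus")])
print(sorted(levels.items(), key=lambda pair: -pair[1]))
```

A real system would also have to weigh peer relationships and reconcile conflicting claims, which this sketch ignores.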

“Crowdsourcing is a great tool, but it’s not without its challenges,” says Philips. “Definitely look around first; you may already have the data that you need.”

Expert-level Contributors

In cases where active learning requires expert-level knowledge or educated judgment, the recruitment and management of labelers becomes much more complex. This occurs when the task graduates from simple, accurate assessments (such as spam or not, and human face or not), into tasks that only an expert crowd can perform.

In addition to finding the expert crowd, another challenge is that when you do find expert labelers, their labels can be wrong or random—and the non-expert would never know it. (Only specialists can distinguish, for example, an American tree sparrow from a white-crowned sparrow.)

Panos Ipeirotis, a leading researcher in crowdsourcing and associate professor at New York University, recalled one such instance when he asked contributors to give the name of Apollo astronaut Neil Armstrong’s wife. The choices included “Apollo,” “Gemini,” “Laika,” “None of the above,” and “I do not know.” Only one of these options (Laika) sounds like it could be a person’s name; it was actually the name of the dog sent into space by the Soviet Union, yet it was the answer chosen by some aspiring “experts.” In these cases, “contributors are choosing an answer that is plausible,” says Ipeirotis, “because they want to convey as much information as they can and don’t want to admit that they don’t know.” (Ipeirotis found in retrospect that replacing “I do not know” with “Skip” proved to be a much better choice.)

What complicates the matter is that a plausible-but-wrong answer can’t be easily detected by a machine algorithm. If five people give the same plausible-sounding answer, for example, the algorithm becomes confident based on that inaccurate data, resulting in a bad classification that’s reinforced by the workers’ collective agreement.

In short, the best labelers are those who admit when they don’t know the answer.

How to Find the Experts

For tasks that require expert knowledge, the usual crowdsource marketplaces offer little support because they usually cannot supply enough contributors with specialized knowledge. Ornithologists, historians, and fluent speakers of languages such as Swahili, Sicilian, or Southern Tujia all have to be recruited differently.

One promising method of expert recruitment, Quizz, is described in the research paper “Quizz: Targeted Crowdsourcing with a Billion (Potential) Users” by Panagiotis G. Ipeirotis and Evgeniy Gabrilovich. The authors found that the best way to find subject-matter experts was to lure them into demonstrating their knowledge.

Ipeirotis and Gabrilovich began their experiment with eight quizzes that they placed as ads on popular websites. Each quiz challenged passersby with a question; for example, one question might be, “What is a symptom of morgellons?” (Those with medical knowledge know that morgellons involves delusions of having things crawling on the skin.) Each quiz question offered several plausible choices, as shown in Figure 4, and anyone who offered an answer learned instantly whether it was correct.

In the background was an algorithm created by Ipeirotis and Gabrilovich that kept score and judged the expertise of each respondent. Participants who were judged to be sufficiently knowledgeable were invited to go further, and Quizz continued to measure their total contribution and the quality of their results.

In addition to scoring participants, the Quizz algorithm also used advertising targeting capabilities to score the websites where the ads appeared. Sites that produced too few qualified candidates were dropped, as a way to continually optimize results. The algorithm also recorded the “origin” sites of those who gave good answers and began recruiting on those sites more heavily. For example, the recruiting algorithm quickly learned that consumer-oriented medical websites, such as Mayo Clinic and Healthline, produced many qualified labelers with medical knowledge, while ads on medical websites with a professional audience did not manage to attract contributors with sufficient willingness to participate.

Each participant who clicked on an ad and answered the quiz questions constituted a “conversion” that was tallied by the algorithm. According to the paper Ipeirotis and Gabrilovich wrote in 2014, the Quizz application began with a 10–15% conversion rate, which rose over time to 50% simply by giving feedback to the advertising targeting algorithm.
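
The feedback loop described above, tallying conversions and dropping sites that yield too few qualified candidates, might look roughly like the sketch below. It is not the Quizz algorithm: the site names, the counts, the 5% cutoff, and the choice to measure rates per ad click are all assumptions.

```python
# Hypothetical tallies per site:
# site -> (ad_clicks, quiz_conversions, qualified_contributors)
SITE_STATS = {
    "consumer-health.example": (200, 60, 30),
    "clinician-journal.example": (150, 15, 3),
    "general-news.example": (400, 40, 4),
}

MIN_QUALIFIED_PER_CLICK = 0.05  # assumed cutoff, not a figure from the paper


def conversion_rate(clicks, conversions):
    """Share of ad clicks that led to answered quiz questions."""
    return conversions / clicks if clicks else 0.0


def sites_to_keep(stats, cutoff=MIN_QUALIFIED_PER_CLICK):
    """Keep sites whose qualified-contributor rate meets the cutoff.

    Returns (site, rate) pairs sorted from most to least productive.
    """
    keep = []
    for site, (clicks, _conversions, qualified) in stats.items():
        rate = qualified / clicks if clicks else 0.0
        if rate >= cutoff:
            keep.append((site, rate))
    return sorted(keep, key=lambda pair: pair[1], reverse=True)


kept = sites_to_keep(SITE_STATS)
print(kept)
```

With these made-up numbers, only the consumer-oriented site survives the cut, which mirrors the pattern the authors reported for medical quizzes.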

Managing Expert-level Contributors

A key consideration in managing experts is how to get the most out of each contributor. According to Ipeirotis, the trick is to balance two types of questions: one type (“calibration”) estimates the contributor’s knowledge, and the other type (“collection”) collects their knowledge. Balancing these two types of questions allows you to sustain the stream of collected knowledge as long as possible and explore the person’s potential to give more.

The optimal balance of these two types of questions (calibration versus collection) depends in part on each contributor’s recent behavior; it also depends on her expected behavior, which is based on that of other users. For example, the user who shows signs of dropping out is likely to be steered toward a proven “survival” mix: since contributors are motivated mainly by the contribution of good information, the “survival” mix lets them have more questions they are likely to answer correctly, followed by prompt acknowledgement of their work.
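
The balancing act can be illustrated with a deliberately simple, hand-written rule. This is not the policy used in Quizz, which the report does not spell out; every threshold, parameter name, and return value below is a hypothetical stand-in.

```python
def next_question_type(n_calibration, est_accuracy, dropout_risk,
                       min_calibration=5, min_accuracy=0.6, risk_cutoff=0.5):
    """Pick the kind of question to show a contributor next.

    n_calibration: calibration questions answered so far
    est_accuracy:  share of those answered correctly (0 to 1)
    dropout_risk:  estimated chance the contributor is about to leave (0 to 1)
    All thresholds are illustrative assumptions.
    """
    if dropout_risk >= risk_cutoff:
        # "Survival" mix: offer a question the contributor is likely to get
        # right, so that positive feedback arrives quickly.
        return "easy calibration"
    if n_calibration < min_calibration:
        # Not enough evidence yet to trust this contributor's answers.
        return "calibration"
    if est_accuracy >= min_accuracy:
        # Knowledge looks reliable, so start collecting new answers.
        return "collection"
    return "calibration"


print(next_question_type(n_calibration=2, est_accuracy=1.0, dropout_risk=0.1))
print(next_question_type(n_calibration=8, est_accuracy=0.9, dropout_risk=0.1))
print(next_question_type(n_calibration=8, est_accuracy=0.9, dropout_risk=0.8))
```

A production system would estimate dropout risk and accuracy from the behavior of other users rather than taking them as given inputs.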

Payment is another factor to consider in managing expert-level contributors. In their research, Ipeirotis and Gabrilovich found that paid workers not only cost more, they often produced poorer quality data and were less knowledgeable than those who were unpaid. Ipeirotis and Gabrilovich describe an experiment in which one group of contributors was paid piece rates, with bonuses based on scores; this group dropped out at a lower rate than a group of unpaid workers. However, although the paid workers stayed on longer, they submitted lower-quality answers than those who were unpaid. Interestingly, offering payment was not linked with high-quality answers; payment simply sustained workers, presumably in cases where unpaid workers, lacking the satisfaction of offering high-quality answers, would have given up.

A Real-World Example: Expert Stylists + Machine Learning

Expert contributors can do more than identify birds and medical symptoms. In one application, customers actually seem to trust the experts more than they trust themselves. Stitch Fix, an online personal styling and shopping service for women, relies on both expert contributors and machine learning to present customers with styles that are based on their own personal data.

The process at Stitch Fix begins with a basic model, an estimate of what customers will like based on their stated preferences for style and budget. Then, the model evolves based on information from actual purchases. Notably, a customer’s model may be disrupted. For example, the model may find that the customer who gave her size as 12 actually purchases items at a size 14 or that clothes she describes as “bohemian chic style” are actually what most people would call “preppy”; she may also buy clothes that reveal a higher budget than the one she gave.

Handling these types of disruptions and matching stated with actual preferences are the biggest challenges at Stitch Fix, says Chief Algorithms & Analytics Officer Eric Colson. In addition, the lack of an industry standard for clothing sizes adds to the problem; for example, a size 10 at one store could be a size 6 at another. Customers also give bad data, such as their aspirational size (i.e., one that anticipates weight loss), rather than their true size. They may also misunderstand industry terms, confusing size with fit, for example.

Aside from customer-based data, a second set of data describes each item of clothing in fine-grain detail. Stitch Fix’s expert merchandisers evaluate each new piece of clothing and encode its attributes, both subjective and objective, into structured data, such as color, fit, style, material, pattern, silhouette, brand, price, and trendiness. These attributes are then compared with a customer profile, and the machine produces recommendations based on the model.
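
As a toy illustration of comparing encoded item attributes with a customer profile, the sketch below scores each item by the share of stated preferences it matches and filters by budget. It is not Stitch Fix’s model; the data layout, attribute values, and functions are hypothetical.

```python
def match_score(profile, item):
    """Share of the customer's stated preferences that the item matches."""
    prefs = profile["preferences"]
    shared = [key for key in prefs if key in item["attributes"]]
    if not shared:
        return 0.0
    hits = sum(prefs[key] == item["attributes"][key] for key in shared)
    return hits / len(shared)


def recommend(profile, items, top_n=2):
    """Best-matching items that fit within the customer's budget."""
    affordable = [item for item in items if item["price"] <= profile["budget"]]
    ranked = sorted(affordable, key=lambda item: match_score(profile, item),
                    reverse=True)
    return ranked[:top_n]


# Made-up profile and merchandise records.
profile = {"budget": 100,
           "preferences": {"style": "preppy", "color": "navy", "fit": "relaxed"}}
items = [
    {"name": "tunic", "price": 60,
     "attributes": {"style": "bohemian", "color": "rust", "fit": "relaxed"}},
    {"name": "coat", "price": 250,
     "attributes": {"style": "preppy", "color": "navy", "fit": "relaxed"}},
    {"name": "scarf", "price": 25,
     "attributes": {"style": "preppy", "color": "navy", "fit": "loose"}},
    {"name": "blazer", "price": 90,
     "attributes": {"style": "preppy", "color": "navy", "fit": "relaxed"}},
]
picks = [item["name"] for item in recommend(profile, items)]
print(picks)
```

The coat matches every preference but exceeds the budget, so the blazer and scarf come out on top. A real model would learn weights from purchase history instead of counting exact matches.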

But when the time comes to recommend merchandise to the customer, the machine can’t possibly make the final call. This is where Stitch Fix stylists step in. Stitch Fix hands off final selection of recommendations to one of roughly 1,000 human stylists, each of whom serves a set of customers. Stylists assess unstructured data from images and videos of the merchandise and from all available customer comments (e.g., “I need clothes for a big meeting at work.”). They may even reach outside of the machine’s recommendations and use their own judgment to make final selections for which pieces will go to the customer. Before a shipment goes out, the stylist scrutinizes the pieces to see how they look together and may even explain the selections to the customer.

According to Colson, occasional “smart risk” is also built into the algorithm. Stitch Fix deliberately injects randomness to add value; to stay completely safe within a narrow range of customer preferences would truncate the possibilities. “On our 10th or 11th shipment,” says Colson, “that’s when you need to start mixing it up.” A school teacher who dresses conservatively during the week, for example, probably has enough conservative clothing. “What’s it going to take to create a meaningful relationship?” asks Colson. “It might be to take her on the next part of her journey.”
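
Deliberately injected randomness of this kind is often implemented as an “epsilon-greedy” choice, sketched below. That technique is swapped in here purely for illustration; the report does not say how Stitch Fix implements its “smart risk,” and the function, item names, and epsilon value are assumptions.

```python
import random


def pick_with_smart_risk(ranked_items, n, epsilon, rng):
    """Fill n slots from a best-first list, occasionally taking a risk.

    For each slot, with probability epsilon a random remaining item is
    chosen; otherwise the best remaining item is chosen. Items are never
    repeated.
    """
    remaining = list(ranked_items)
    picks = []
    while remaining and len(picks) < n:
        if rng.random() < epsilon:
            choice = rng.choice(remaining)
        else:
            choice = remaining[0]
        remaining.remove(choice)
        picks.append(choice)
    return picks


ranked = ["cardigan", "slacks", "blouse", "sequin top", "leather jacket"]
rng = random.Random(0)  # seeded so the run is repeatable

safe = pick_with_smart_risk(ranked, 3, epsilon=0.0, rng=rng)
adventurous = pick_with_smart_risk(ranked, 3, epsilon=0.3, rng=rng)
print(safe, adventurous)
```

With epsilon set to 0 the shipment is always the top-ranked items; raising epsilon for later shipments would mix in more surprises.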

Stylists work anywhere that has Internet access, and though they are paid hourly, they often report intangible benefits, such as the satisfaction of happy customers.

Machines and Humans Work Best Together

Futurists once dreamed of machines that did everything, all guided by an unseen autopilot. Little did these visionaries know that the autopilot can do so much more with help from a crowd.

Active learning has put machines hand in hand with humans, and the success so far hints at huge potential. If this duo can choose clothing, thwart email spammers, and classify subtly different images, what else could it do?
