If prejudice lurks among us, can our analytics do any better?

Technical and policy considerations in combating algorithmic bias.

By Andy Oram

December 12, 2016


A drumbeat is growing among journalists and policy-makers that something is amiss with some of the most promising and powerful tools the computing field has developed. As predictive analytics finds its way into more and more domains—serving up ads; developing new markets; and making key decisions, such as who gets a loan, who gets a new job, even such ethically fraught decisions as whom to send to jail and whom to assassinate—the evidence of bias against minority groups, women, and others is mounting.

This article highlights the technical and social aspects of this pervasive trend in analytics. I look at why analytics are difficult to carry out in a just and fair manner, and what this says about the societal context in which they operate. I gained some insights on this subject from a workshop I attended in October, held by the Association for Computing Machinery (ACM), and from research I did surrounding the workshop.

Analytics everywhere

Predictive analytics seem to affirm Marc Andreessen’s famous claim that “software is eating the world.” Industries are finding analytics critical to stay competitive, while governments find it critical to meet their obligations to constituents. These pressures drive the high salaries claimed by data scientists (data science is more than just statistics, although a strong background in statistics is a requirement) and Gartner’s prediction of a massive shortage of data scientists.

Analytics (or more precisely, simulations) even play a major role, mostly as villains, in the recent popular movie Sully. The movie illustrates one of the most uncomfortable aspects of society’s increasing dependence on algorithms: powerful decision-makers in high places sometimes turn their judgment over to the algorithm without understanding how it works or what its ramifications are. In Sully, the investigators fed one key incorrect piece of information into the system, and additionally trained it on unrealistic assumptions. When the victim of their calculations challenged the assumptions behind the simulations, the investigators kept crowing, “We ran 20 simulations!” They didn’t realize that 20 simulations run on the same wrong assumptions will return the same wrong conclusion. And when the victim requested details about the input, they denied the request on bureaucratic grounds. Although Sully probably massaged the facts behind the case, the movie offers many lessons that apply to the use of analytics in modern life.
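
The point about repeated simulations can be made concrete with a small sketch. The numbers are invented for illustration and have nothing to do with the actual investigation: TRUE_VALUE stands for whatever a simulation is supposed to estimate, and WRONG_OFFSET for an error built into every run. Averaging more runs smooths out random noise, but it cannot remove a mistake that all the runs share.

```python
import random

TRUE_VALUE = 10.0    # hypothetical quantity the simulation should estimate
WRONG_OFFSET = 5.0   # hypothetical error baked into every run's assumptions


def run_simulation(rng):
    """One run: the shared wrong assumption plus run-to-run noise in [-1, 1]."""
    return TRUE_VALUE + WRONG_OFFSET + rng.uniform(-1.0, 1.0)


def average_of_runs(n_runs, seed=0):
    """Average n_runs simulations that all share the same wrong assumption."""
    rng = random.Random(seed)
    return sum(run_simulation(rng) for _ in range(n_runs)) / n_runs


estimate_20 = average_of_runs(20)
estimate_2000 = average_of_runs(2000)

# Both averages stay about WRONG_OFFSET away from the truth,
# no matter how many runs are averaged.
print(estimate_20 - TRUE_VALUE, estimate_2000 - TRUE_VALUE)
```

Under these assumptions, every single run lands between 14 and 16, so no number of repetitions can bring the average back to 10.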

It’s important to remember, though, that analytics can contribute to good decision-making. Ironically, while I was attending the ACM workshop, analytics run by my credit card vendor discovered that someone had purloined my credit card information and was trying to use the card fraudulently. Thanks to their expert analytical system, they cut off access to the card right away and no money was lost. Although it was an inconvenience to find my card being declined during my travels, I appreciated the tools that protected both me and the bank.

Furthermore, most organizations that use analytics hope that they actually reduce bias, by reducing subjectivity. Discrimination has existed throughout time, and needs no computer. Growing research suggests that job interviews are ineffective in determining who will perform well at a job, due largely to the snap decisions we make evaluating people in person—which are highly susceptible to implicit bias. Research into brain functioning has shown that white and Asian people have deeply embedded, unconscious distrust of blacks—a reaction that complicates efforts to reform police practices (as one example). And bias affects the course of people’s lives very early. Because black students are punished more than white students for the same activities in school, old-fashioned human discrimination can warp lives at an early stage.

Unfortunately, predictive analytics often reproduce society’s prejudices because they are built by people with prejudice or because they are trained using historical data that reinforces historical stereotypes.

One well-publicized and undisputed example comes from research in 2013 by Latanya Sweeney. She is a leading privacy researcher, most famous for showing that public records could be used to uncover medical information about Massachusetts Governor William Weld. Her research there led to major changes in laws about health privacy. The 2013 research showed that a search on Google for a name commonly associated with African Americans (such as her own name) tended to turn up an ad offering arrest records for that person. Searches for white-sounding names tend not to turn up such ads. But any human resources manager, landlord, or other person doing a search on a potential candidate could easily be frightened off by the mere appearance of such an ad—especially when, among a field of job-seekers, only the African American candidate’s name returned one.

A large group of policy action groups signed a document called “Civil Rights Principles for the Era of Big Data,” urging fairness but not suggesting how it could be achieved. In the United States, drawing attention to this issue among policy-makers might be hard as a new crop comes to power openly embracing bias and discrimination, but ethical programmers and their employers will still seek solutions.

Let’s look now at what it really means to diagnose bias in analytics.

Becoming a discriminating critic

I remember a grade-school teacher telling her class that she wanted us to become “discriminating thinkers.” Discrimination can be a good thing. If someone has borrowed money and frittered it away on expensive purchases he can’t afford, it would benefit both banks and the public they serve not to offer him a loan. The question is what criteria you use to discriminate.

ACM workshop participants circled around for some time on a basic discussion of values. Should computing professionals erect certain specific values to control the use of analytics? Or should experts aim for transparency and to educate the public about the way decisions are made, without attempting to lay down specific values?

I believe the best course of action is to uphold widely accepted social standards. In the 1960s, for instance, the United States banned discrimination on the basis of race, ethnicity, and religion through civil rights legislation. Gender and disability were added later as protected classes, then (in 22 states) sexual orientation. More recently, gender identity (covering transgender and non-binary people) has started to join the list. The 1948 Universal Declaration of Human Rights by the United Nations calls in Article 2 for equal treatment regardless of “race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth, or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.” That “other status” phrase throws the statement into murky territory, but otherwise the list is pretty specific.

In short, it’s up to political bodies and policy-makers engaging in public discussions to decide what’s OK to discriminate on and what’s not OK. In some cases, computer algorithms may be using criteria such as race or gender, or proxies for race and gender, for decisions such as hiring, in which using these criteria is not even legal.

A key precept of computing—promoted for years by the historic but now-defunct Computer Professionals for Social Responsibility—is that computers don’t alter human responsibility. If something is illegal or unethical when done by a human, it is equally illegal or unethical when done by a computer program created by humans. But too many people hide behind the computer program. “We’re using analytics, so it’s OK”—this is the digital version of “I was just following orders.”

The same message comes through in the classic 1976 book, Computer Power and Human Reason: From Judgment To Calculation, by Joseph Weizenbaum. There, Weizenbaum offered an essential principle regarding artificial intelligence, of which he was one of the leading researchers of his time: he said the question is not whether computers could make decisions governing critical human activities, but whether they should make those decisions.

So, I submit that numerous laws and policy statements have identified the areas where we should be wary of bias. As we’ll see in the course of this article, such policy considerations can drive technical decisions.

Data scientist Cathy O’Neil makes several cogent points in a recent, well-received book, Weapons of Math Destruction. A few of these observations are:

Data collection and processing often takes place in tiers or layers, as one organization buys up data or analytics from another. The initial opacity of the algorithm gets multiplied as layers pile up—and so do the error factors introduced at each layer.
Algorithms that compare us to trends end up judging us by other people’s behavior, not by our own. We may get overcharged for auto insurance not because we’re bad drivers, but because other people with our purchase or credit histories were bad drivers in the past. This is the essence of prejudice: assuming that one person will behave like others in her category.
Once a person gets assigned to a poor track—labeled as being an unreliable worker, a potential criminal, or even a bad credit risk—discrimination by algorithms cuts off opportunities and pushes her further and further down into poverty and lack of opportunity.
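
The second observation, about being judged by other people’s behavior, can be sketched in a few lines. This is a toy model, not how any real insurer prices policies: the history records, the credit bands, and the pricing formula are all hypothetical.

```python
# Hypothetical past customers: (credit_band, had_accident)
HISTORY = [
    ("low", True), ("low", True), ("low", False), ("low", False),
    ("high", False), ("high", False), ("high", False), ("high", True),
]

BASE_PREMIUM = 1000.0  # hypothetical starting price


def group_accident_rate(band):
    """Share of past customers in this credit band who had an accident."""
    outcomes = [accident for b, accident in HISTORY if b == band]
    return sum(outcomes) / len(outcomes)


def premium(band):
    """Price set purely from the group's history; the applicant's own
    driving record never enters the calculation."""
    return BASE_PREMIUM * (1 + group_accident_rate(band))


# Two applicants with identical, spotless driving records
print(premium("low"), premium("high"))
```

With these made-up figures, the applicant in the "low" band pays 1500 and the one in the "high" band pays 1250, although neither has ever had an accident.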

O’Neil’s remedies go beyond mere testing for bias and encompass a broad social program for assessing our goals for society, factoring in fairness to combat the pressures of financial gain and coding to help disadvantaged people advance instead of punishing them.

The murkiness of transparency

Transparency is the great rallying cry of our time—let everyone see your decision-making processes! Seventy nations around the world have joined an Open Government Partnership and pledged to bring citizens into budgeting and regulatory decisions. Most of these countries are continuing with wars, corruption, and other business as usual.

Yet, let us not be too cynical about transparency. In many ways, it is increasing, aided by rising education levels and new communications media. The open source movement offers a great deal more transparency among programmers. Can open source software or other measures make predictive analytics more fair?

One of the worries of organizations carrying on analytics for classifying people is that the subjects of those analytics will game the system if they know the input measures. Many measures involve major lifestyle elements that are hard to change, such as income. But a surprising number seem to be simple proxies for more important traits, and thus might be gamed by savvy subjects.

In the Israeli TV comedy Arab Labor, the Arab protagonist is frustrated at being stopped regularly at check-points. He asks an Israeli friend how to avoid being stopped, and the friend advises him to buy a particular make and model of car. The Arab does so and, ludicrously, starts sailing through checkpoints without harassment. When predictive analytics are in play, many people are looking for that car that will take them through a tough situation.

Thus, those who have followed the use of analytics by online companies and other institutions admit that transparency is not always desirable. Some experts advise against using simple up-or-down criteria in the analytics, saying such criteria are too crude to contribute to good decisions. I think the experience of thousands of institutions has proven that such crude criteria can be eerily prescient. But the conditions of people being judged by analytics constantly change, so the criteria must be updated with them.

And that’s another drawback to transparency efforts: some organizations alter their analytics continually, as Google does with its ranking algorithm. Vetting every change with outsiders would be unfeasible. Machine learning techniques also tend to produce inexplicable decision-making trees that are essentially black boxes even to those who wrote the programs.

On the other hand, an algorithm that remains fixed will probably diverge from accurate predictions over time because the conditions of life that were part of the input data keep changing. This simple principle explains why the Dow Jones Industrial Average changes the companies it tracks from time to time: the companies that formed a major part of the U.S. economy in the 1880s have vanished or become insignificant, while key parts of the modern economy couldn’t even be imagined back then. (Of the original 12 firms on the DJIA, only General Electric remains on the index.) Analytics must be recalculated regularly with new, accurate input data for similar reasons. And here we come to another risk when analytics products are sold: they may drift away from reality over time and be left fighting the last war, with negative consequences both for the organizations depending on them and for the people misclassified by them.
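
Here is a minimal sketch of that drift, under invented assumptions: a lender learns an income cutoff from old data, then keeps using it after nominal incomes have doubled. The data, the cutoff rule, and the function names are hypothetical.

```python
def fit_cutoff(samples):
    """Learn a cutoff: the midpoint between the two classes' mean incomes."""
    repaid = [x for x, label in samples if label]
    defaulted = [x for x, label in samples if not label]
    return (sum(repaid) / len(repaid) + sum(defaulted) / len(defaulted)) / 2


def accuracy(cutoff, samples):
    """Share of samples where 'income >= cutoff' matches the real outcome."""
    hits = sum((x >= cutoff) == label for x, label in samples)
    return hits / len(samples)


# Hypothetical data: (income, repaid the loan?)
OLD = [(20, False), (25, False), (30, False), (40, True), (45, True), (50, True)]
# Years later, incomes have doubled but the repayment pattern is unchanged
NEW = [(2 * x, label) for x, label in OLD]

stale_cutoff = fit_cutoff(OLD)   # learned once and never updated
fresh_cutoff = fit_cutoff(NEW)   # recalculated on current data

print(accuracy(stale_cutoff, OLD))   # how the rule did when it was built
print(accuracy(stale_cutoff, NEW))   # how the same rule does today
print(accuracy(fresh_cutoff, NEW))   # how a recalculated rule does today
```

In this toy case the stale cutoff of 35 now approves everyone, so it is right only half the time, while the recalculated cutoff of 70 restores the original accuracy.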

The power imbalances at play are hugely important as well. As I’ll discuss below in the context of a famous article about criminal sentencing, challenging an algorithm from the outside is exceedingly difficult—the institutions deploying algorithms are massively more powerful than the individuals they classify. And an article by Alex Rosenblat, Tamara Kneese, and danah boyd points out that winning a discrimination lawsuit is difficult. The most effective road to fairness may be for companies to bring their analytics before approval boards, similar to an Institutional Review Board (IRB) used in academic research, composed of many stakeholders—including the people likely to be adversely affected by the analytics. Whether such boards can evaluate these subtle algorithms remains an unanswered question. At least they can inform the programmers when certain input data is inherently prejudiced.

Remedies

Outside of academia, critics of bias in predictive analytics have focused on uncovering such bias (one might note: using the same machine learning tools!), often with the implication that we should simply stop using these analytical tools for decisions that have far-reaching human impacts. The expected impact of the analytics is one criterion that organizations could use to decide whether to rely on the analytics. Using A/B testing to determine whether web site visitors click on green icons more than blue icons seems pretty innocuous. On the other hand, Facebook’s experiment to affect users’ moods through the postings displayed seems to be universally considered unethical.

So, society has not yet learned the proper roles for analytics or become adept at identifying the bad outcomes—what I’ve heard technologist Meng Weng Wong call malgorithms. Analytics are too powerful to be willfully rejected, however—and also too useful.

One remedy is to offer users the chance to challenge the results of analytics, a remedy in line with the Fair Information Practice Principles (FIPPs) promulgated by the Federal Trade Commission and adopted, in varying forms, by many institutions decades ago. The idea here is that the institution can use any means it wants to make a decision, but should be transparent about it and grant individuals the right to challenge the decision. The principle has been enshrined in the European Union by its April 2016 General Data Protection Regulation, which replaces a 1995 directive concerning privacy. Additionally, a guide for programmers wishing to build fair algorithms has been created by a group from Princeton University.

The guiding assumption governing analytics is that institutions using those analytics can provide a trail or auditable record of their decisions. The EU rules require information processors to provide any individual with the reasoning that led to some decision, such as denying a request for a loan.

This principle is high-minded but hard to carry out. It requires all the following to be in place:

To start, the individual has to understand that analytics were used to arrive at a decision, has to know what institution made that decision, has to appreciate that she has the right to challenge the decision, has to have access to a mechanism for issuing that challenge, and has to feel safe doing so.
All those conditions are missing in many circumstances. For instance, if a woman is not shown a job ad for which she is qualified because the ad algorithm is biased toward men, she will never know she was the victim of this discrimination. It’s also hard to know who is responsible for the decision. And if the institution using the analytics is an entity with power over your life—such as your employer or your insurer—you might well play it safe and not request an investigation.
The analytics have to be transparent. Sometimes this is easy to achieve. For instance, I’m impressed that Wolfram Alpha will show you the rules it used to return a result to your query. Some analytics are rules-based and open to display.
But many artificial intelligence programs, such as those using genetic algorithms or deep learning, are not transparent (unless designed to be). They change and refine themselves outside of human intervention. They are very powerful and can be very accurate—but by the time they reach a conclusion, they have followed a course too complex for most humans to follow.
Remedy by individual challenge is not generalizable: requirements that apply to intrepid individuals calling for the reversal of a decision do not affect the overall fairness of the system. An institution may review its decision for one individual, but there is no reason for it to stop the practice that may be hurting thousands of people. In fact, the EU rules are not set up to reflect the shared needs of large communities: their members are treated as isolated individuals, and no one has the personal clout to change an unjust system.

With all these caveats in mind, it seems worthwhile to demand various forms of transparency from institutions doing analytics.

First, they need to identify themselves and reveal that they have been employed to make decisions affecting individuals.

They should open up discussions with stakeholders—and especially the populations of people affected by the systems—to talk about what is fair and whether they are accurately reflecting the realities of people’s lives.

They also need to provide audits or traces on all systems that do predictive analytics with serious human consequences. Systems that cannot provide audits are like voting machines that don’t produce paper output: they just aren’t appropriate for the task.
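
For a rules-based system, an audit trail can be as simple as recording which rules an applicant failed alongside every decision. The sketch below is one possible shape for such a record, not a description of any real lender’s system: the rules, thresholds, and field names are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Decision:
    applicant_id: str
    approved: bool
    reasons: list = field(default_factory=list)  # why the request was denied


# Hypothetical rules: (name, check on the applicant, explanation if it fails)
RULES = [
    ("min_income", lambda a: a["income"] >= 30000, "income below 30000"),
    ("max_debt", lambda a: a["debt"] <= 0.4 * a["income"],
     "debt exceeds 40% of income"),
]


def decide(applicant, audit_log):
    """Apply every rule, then store the decision and its reasons in the log."""
    failed = [msg for _name, check, msg in RULES if not check(applicant)]
    decision = Decision(applicant["id"], approved=not failed, reasons=failed)
    audit_log.append(json.dumps(asdict(decision)))  # one JSON line per decision
    return decision


log = []
denied = decide({"id": "A1", "income": 25000, "debt": 15000}, log)
granted = decide({"id": "A2", "income": 50000, "debt": 10000}, log)
print(denied.reasons)
```

Each log entry can later be handed to the individual or to an auditor. Systems built on deep learning have no such list of rules to print, which is exactly the difficulty described above.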

Well-known computer scientist Cynthia Dwork has done work on an interesting approach to fairness that she calls “Fairness Through Awareness.” Drawing on experiments in differential privacy, her team tries to use cryptography to prove that an algorithm is fair. The technology, unfortunately, is probably too sophisticated to be incorporated into the burgeoning analytical systems eating the world.

Another approach that brings testing for fairness into the development process has been suggested in an article by computer scientists. It raises an intriguing premise: we cannot just pretend to be blind to differences of race, gender, etc. We must be very conscious of these differences and must explicitly test for them. This “affirmative action” approach challenges statisticians and data scientists who believe they can be aloof from social impacts and that their technologies ensure objectivity.
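
Testing explicitly for group differences could look something like the following check, run as part of development. It is a sketch of one simple measure, not the method proposed in that article: it compares the rate at which each group is selected and reports the ratio of the lowest rate to the highest. The records are invented, and the 0.8 threshold is my assumption, borrowed from the "four-fifths" rule of thumb used in U.S. employment practice.

```python
from collections import defaultdict


def selection_rates(records):
    """records: (group, selected) pairs -> {group: fraction selected}."""
    totals = defaultdict(int)
    chosen = defaultdict(int)
    for group, selected in records:
        totals[group] += 1
        chosen[group] += int(selected)
    return {g: chosen[g] / totals[g] for g in totals}


def disparate_impact_ratio(records):
    """Lowest group selection rate divided by the highest (1.0 means parity)."""
    rates = selection_rates(records)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was selected in any group
    return min(rates.values()) / highest


# Hypothetical hiring outcomes: 6 of 10 selected in group A, 3 of 10 in group B
RECORDS = (
    [("A", True)] * 6 + [("A", False)] * 4
    + [("B", True)] * 3 + [("B", False)] * 7
)

ratio = disparate_impact_ratio(RECORDS)
needs_review = ratio < 0.8  # assumed threshold, see the four-fifths rule
print(selection_rates(RECORDS), ratio, needs_review)
```

A check like this requires recording the protected attribute in the test data, which is the article’s point: the system cannot be tested for fairness while pretending not to see the difference.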

Case study: Sentencing criminals

To wrap up this article, I’ll look at one of the most well-publicized studies of bias in analytics, and draw some new conclusions from it. I am referring to the famous article published last May in ProPublica about sentencing convicted criminals to jail. This article has played a major role in bringing the risks of predictive analytics to the public. Julia Angwin and her co-authors focus on one piece of software, COMPAS, used by many jurisdictions to decide whether to release criminals on probation or incarcerate them. They say such decisions are racially biased because COMPAS unfairly assigns black convicts to high-risk categories (meaning they are judged more likely to commit another crime when released) more often than whites.

Angwin and her co-authors could have pointed out that COMPAS is often wrong, although it is right a pretty good amount of the time. They could have suggested that, given the high error rate, it should be treated as merely one item of data among many by judges. But they went much further and propelled themselves into a hot controversy.

Everyone seems to agree on two things:

COMPAS’ algorithm is about equally accurate for whites and blacks when it predicts that someone will commit more crimes (re-offend).
COMPAS’ algorithm is wrong far more often for blacks than for whites, and wrong in the direction that hurts blacks: it claims they will re-offend when, in fact, they don’t.

So, what’s fair?

ProPublica’s analysis has not gone undisputed. Several critics said that ProPublica did not take into account another important disparity: black convicts are much more likely to be convicted of a second offense than white convicts. The American Conservative printed an explanation of why this leads to the results ProPublica found. A Washington Post article made the same point (although less clearly and less persuasively, in my opinion). Basically, these articles argue that the over-classification of blacks as high-risk criminals is dictated by the input data, and cannot be fixed.

The creators of COMPAS software, Northpointe, also hammered on this idea in their rebuttal of the ProPublica article. Countering ProPublica’s core assertion that blacks are much more likely than whites to be incorrectly classified as high-risk, Northpointe answers, “This pattern does not show evidence of bias, but rather is a natural consequence of using unbiased scoring rules for groups that happen to have different distributions of scores.” (Page 8) They cite an unrelated study (page 7) to claim that they can’t help giving blacks a higher false high-risk rating.
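
Northpointe’s claim about "different distributions" can be checked with simple arithmetic on a confusion matrix. The sketch below uses invented figures, not numbers from ProPublica or Northpointe. It assumes two groups for which a tool is equally reliable in two senses: the same share of people it flags as high-risk go on to re-offend (positive predictive value, PPV), and it catches the same share of actual re-offenders (true positive rate, TPR). The only difference between the groups is the base rate, the share of each group that re-offends.

```python
from fractions import Fraction


def false_positive_rate(base_rate, ppv, tpr):
    """False positive rate implied by a group's base rate, PPV, and TPR.

    Working per person in the group:
      true positives  = tpr * base_rate
      false positives = true positives * (1 - ppv) / ppv
      FPR             = false positives / (1 - base_rate)
    """
    true_pos = tpr * base_rate
    false_pos = true_pos * (1 - ppv) / ppv
    return false_pos / (1 - base_rate)


# Hypothetical tool: identical PPV and TPR for both groups
PPV = Fraction(3, 5)
TPR = Fraction(3, 5)

# Hypothetical base rates: half of group A re-offends, two-fifths of group B
group_a = false_positive_rate(Fraction(1, 2), PPV, TPR)
group_b = false_positive_rate(Fraction(2, 5), PPV, TPR)

print(group_a, group_b)  # share of non-re-offenders wrongly flagged high-risk
```

With these figures, 2/5 of the people in group A who never re-offend are wrongly labeled high-risk, against 4/15 in group B. When base rates differ, a tool can equalize the accuracy of its high-risk labels or its false positive rates, but in general not both at once, which is why the two sides can each claim the numbers support them.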

Northpointe also says that the whites in the study tended to be older than the blacks (page 6), which makes them less likely to re-offend. The ProPublica study itself finds age to be closely correlated with crime (see the Analysis section about one-third of the way down the page). And they criticize ProPublica’s study on other grounds—but it seems to me that the tendency of blacks to be re-arrested more often is the core issue in all these critiques.

We can draw many general conclusions of interest from this controversy. First of all, data science is intrinsically controversial. Despite the field’s aim for objectivity, statisticians don’t always agree. Second, one must articulate one’s values when judging the impact of analytics. ProPublica assessed COMPAS along different ethical criteria from those used by Northpointe.

But the major lesson we should take from this is to ask: why do blacks have higher rates of recidivism? If that is the source of the bias claimed by ProPublica, why does it happen?

Here, we have to draw on research from social science, which is largely beyond the scope of this article. Basically, researchers have persuasively shown that black people lack the support that whites tend to have when coming away from criminal convictions. The well-known book by Michelle Alexander, The New Jim Crow, is a good place to start. Black people are less likely to have contacts on which they can draw to get a job, are less likely than whites to be hired (particularly after a criminal conviction), are less likely to have support for housing and other critical resources to fall back on, and generally are less likely than whites to have a social structure to keep them from falling back into crime.

Thus, disparities in the operations of predictive analytics help us face the disparities all around us in real life.

A similar conclusion comes from Sweeney’s work, cited earlier. She asks who is responsible for disproportionately offering “arrest records” for black-sounding names. Google and the company that offers the ads both disclaimed any intentional bias. I tend to believe them, because they would be at great risk if they built racial prejudice into ad displays. So, what’s the remaining alternative? End-user practices: ordinary Web users must be doing searches for arrest records more often on black people than on white people. This social bias gets picked up by the algorithms.

A similar conclusion, that bias from ordinary individuals passes into the contingent economy through rating systems, is aired in a review of studies by the MIT Technology Review. And so, we reach Pogo’s classic assessment: we have met the enemy and he is us, or in the formulation Mike Loukides used in his article on the O’Reilly site, “Our AIs are ourselves.”

Possible remedies

Data scientists reflexively check for accuracy in two ways: looking at the input data and looking at the model. When the real-life environment you’re drawing data from embodies unjust discrimination, the situation calls for active scrutiny—bending over backward to eliminate discrimination in the data, as suggested by the Dwork article I mentioned earlier. COMPAS, for instance, is clearly based on data that reduces to proxies for race. Conscious effort should be taken to reintroduce fairness.
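
One way to scrutinize input data is to ask how well each feature, on its own, reveals a protected attribute. The sketch below is a crude illustration of that idea and not the technique from the Dwork article: for each value of a feature it guesses the most common group, then compares the share of correct guesses with what a blind guess would achieve. The rows and column names are hypothetical.

```python
from collections import Counter, defaultdict


def proxy_strength(rows, feature, protected):
    """Share of rows whose protected attribute is guessed correctly by
    predicting the most common protected value for each feature value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[protected]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)


def baseline(rows, protected):
    """Share guessed correctly by always predicting the largest group."""
    counts = Counter(row[protected] for row in rows)
    return counts.most_common(1)[0][1] / len(rows)


# Hypothetical records: zip code A is mostly group x, zip code B mostly group y
ROWS = (
    [{"zip": "A", "group": "x"}] * 3 + [{"zip": "A", "group": "y"}] * 1
    + [{"zip": "B", "group": "x"}] * 1 + [{"zip": "B", "group": "y"}] * 3
)

print(baseline(ROWS, "group"))               # guessing without the feature
print(proxy_strength(ROWS, "zip", "group"))  # guessing from zip code alone
```

Here zip code alone identifies the group three times out of four, against one in two for a blind guess. A feature that scores far above the baseline deserves a second look before it goes into a model, even if the protected attribute itself has been dropped.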

Programmers and data scientists can take the lead in combating bias. But users in the industry or domain employing the algorithm, along with policy-makers who regulate the domain, can also take the lead by demanding that algorithms be submitted for review. Ideally, the analytics would be open to public review, but this is often unfeasible for reasons mentioned earlier (the need for trade secrets, the desire to avoid gaming the system, etc.). But a group of experts can be granted access under a strict license granting them the right only to evaluate the data and algorithm for the purpose of assessing potential bias.

The first step the public needs to take (after acknowledging that bias is bad—a principle that is often in doubt nowadays) is to understand that algorithms can introduce and reinforce bias. Then, we must recognize that the source of bias is not a programmer (even though he’s probably white, male, and high-income) or a program, but the same factors that have led to injustice for thousands of years. Algorithms are not objective, but they form an uncomfortably objective indictment of our sins.
