Reclaiming the stories that algorithms tell

Getting curious about the numbers attached to other people can help us to use data wisely—and to see others clearly.

By David G. Robinson
May 27, 2020

Algorithms tell stories about who people are. The first story an algorithm told about me was that my life was in danger. It was 7:53 pm on a clear Monday evening in September of 1981, at the Columbia Hospital for Women in Washington DC. I was exactly one minute old. The medical team scored me—as it does for nearly all of the 98% of American newborns who arrive in hospitals—using a ten-point scale known as the Apgar, a simple algorithm based on direct observations of newborn health. (You get two points for waving your arms and legs, for instance.) My exact score is lost to history, but one of the doctors in the room tells me it was probably a six or less out of ten. Numbers like that typically mean a baby needs help. Whether driven by my score, or by their own firsthand experience, the doctors sent me straight to the neonatal intensive care ward, where I spent my first few days. I lived in a clear incubator, basking under a warming light like a very well-oxygenated burrito.

Doctors and nurses have always cared about whether newborns are healthy. But before Virginia Apgar introduced her numerical scale in 1953, doctors varied widely in their treatment of vulnerable newborns. Using the new scores, Apgar and her colleagues proved that many infants who initially seemed lifeless could be revived, with success or failure in each case measured by the difference between an Apgar score at one minute after birth, and a second score taken at five minutes. Standard measures made systematic knowledge about infants’ welfare possible, and also simplified decision making about what to do in the urgent first moments after a difficult birth. The algorithm does have its limits: It’s partly subjective, and Apgar warned that because the doctor who delivers a baby is “inevitably emotionally involved,” someone else should do the scoring. More importantly, while a low score nearly always means the infant needs help, the converse isn’t true—some newborns who are in trouble nonetheless get high scores. An Apgar score is a tiny story, easily made and compared. It’s very often useful, but it isn’t always right.
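Apgar's rule is simple enough to write out directly. The sketch below follows the standard scale—five signs, each scored from zero to two, summed to a total out of ten, taken at one minute and again at five—but the function and the example observations are my own illustration, not a clinical tool.

```python
# A sketch of Apgar scoring: five signs, each worth 0-2 points,
# summed into a score out of ten. The criterion names follow the
# standard scale; the code itself is illustrative only.

APGAR_CRITERIA = ["appearance", "pulse", "grimace", "activity", "respiration"]

def apgar_score(observations):
    """Sum five 0-2 point observations into a 0-10 score."""
    for sign in APGAR_CRITERIA:
        if observations[sign] not in (0, 1, 2):
            raise ValueError(f"{sign}: each sign scores 0, 1, or 2")
    return sum(observations[sign] for sign in APGAR_CRITERIA)

# Scored at one minute and again at five minutes, as Apgar proposed;
# "activity": 2 is the two points for waving arms and legs.
one_minute = apgar_score({"appearance": 1, "pulse": 2, "grimace": 1,
                          "activity": 2, "respiration": 1})
five_minute = apgar_score({"appearance": 2, "pulse": 2, "grimace": 2,
                           "activity": 2, "respiration": 2})
print(one_minute, five_minute)  # 7 10 -- the change measures revival
```

The difference between the two numbers is exactly the comparison Apgar and her colleagues used to show that seemingly lifeless infants could be revived.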


Most algorithms in the news these days are calculated by software. But an algorithm is just a rule expressed in numbers, and there’s no hard line separating simple rules of thumb like the Apgar from the most complex mathematical formulas.

When an algorithm describes a human being—no matter how complex or simple the math may be—the goal is to distill something essential and true, something usable and standardized, out of the mess of unique circumstances that make up each human life. And yet a number or category label that describes a human life is not only machine-readable data. It is also a story, one that can live vividly in the mind and imagination of the person being scored, as well as the minds of others who sit in judgment. Sometimes, as with the Apgar, the score has a clear, limited purpose and does its job well.

But there’s often a gap between how much of a person’s story an algorithm can tell, and how much we want it to tell. The temptation to ignore that gap, or to jump across it through wishful thinking, can be overwhelming.

Reading the numbers in a California classroom

To see this up close, consider the scores we give to older kids in classrooms. A friend of mine, who I’ll call Audrey, teaches sixth grade on California’s central coast. I visited her brightly lit classroom recently over a school break. The chairs and tables were low and kid-sized, so that walking in at adult height, I had the disorienting feeling of becoming a slightly larger version of myself. Below the windows on one long side of the room sat books in colorful bins—hundreds of them, a class library.

Under school district policy, each of Audrey’s eleven- and twelve-year-old students is tested at least three times a year to determine his or her Lexile, a number between 200 and 1,700 that reflects how well the student can read. Books, in turn, get matching scores to reflect their difficulty. Some students ignore these numbers, Audrey tells me, but for others, their personal score can be a big deal: Some are proud to be officially scored as a precocious reader, and others feel bad when their score tags them as behind the curve. These scores go on student report cards, and are a frequent topic at parent-teacher conferences. Earlier this year, Audrey got permission from one of her students to tell the whole class that his score had improved by more than 200 points, and they all applauded.
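The matching at the heart of the system is, at bottom, a numeric comparison between a student's score and a book's. Here is a toy version; the band width, the book list, and the function name are all invented for illustration, not the Lexile Framework's actual recommendation rules.

```python
# Toy Lexile-style matching: each book carries a difficulty score, each
# student a reading score, and "matching" books fall in a band around the
# student's score. Band and book scores here are illustrative only.

def matching_books(student_lexile, library, band=(-100, 50)):
    """Return book titles within a band around the student's score."""
    low = student_lexile + band[0]
    high = student_lexile + band[1]
    return [title for title, lexile in library if low <= lexile <= high]

library = [("Charlotte's Web", 680), ("Holes", 660),
           ("The Giver", 760), ("Hatchet", 1020)]

print(matching_books(700, library))  # ["Charlotte's Web", "Holes"]
```

A comparison this mechanical is precisely what worries Audrey: nothing in it knows whether a student will love the book, or be ready to discuss it.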

But Audrey tells me that these Lexile numbers don’t really tell the whole story of who’s a good reader. They test each student’s grasp of a particular sentence or paragraph—but not of a whole story. Some of the students who ace this test still struggle to discuss a book in class, while others who prepare well for discussion, and think deeply about what they’ve read, still earn low scores.

Sixth graders are on the cusp of becoming truly independent learners. “This is kind of the last shot I’ve got,” Audrey tells me, “to get [the students] to feel like they can own books, and discuss and analyze books, and really think deeply about them. And that, to my mind, is more important than just reading a short amount of text and answering a question.” She wants to build confidence and skill in her students, especially the ones who struggle—and she worries that a low score could become a self-fulfilling, discouraging prophecy. Each of the classroom’s library books has a color-coded sticker on its spine reflecting its Lexile score—a visual announcement of its official complexity level, and thus of which students might be officially ready to read it. Audrey knows it would be easy for kids to feel ashamed to be in the same simple category as the easiest books, so she makes sure to add some thick, imposing chapter books at the less advanced levels, and some thinner books in the more advanced categories.

This whole scoring system also changes the story about who librarians and teachers are. In 2001, just as the Lexile system was rolling out statewide, a professor of education named Stephen Krashen took to the pages of the California School Library Journal to raise an alarm. It’s a core role of librarians and teachers to know their students and to recommend interesting books, he wrote. The best book for a given student may be very easy for them to read, or might be more advanced than they find comfortable, depending on topic. The larger problem is that many students simply don’t have “good books and a comfortable place to read them.” But the Lexile system’s inventor, A. Jackson Stenner, disagreed, and cast the role of educators in far more mechanical terms. He suggested that “inattention to targeting” books based on reading level “is the single best explanation for why students don’t read more.” His system was needed because “beginning teachers and librarians” were less expert at “forecasting comprehension rates” than the algorithm was. But one might equally wonder: what made this computational task, at which the new algorithm happened to excel, a good yardstick for judging teachers? If you accept this mechanism for judging reading ability, you’re implicitly accepting a much more mechanical role for teachers: hand out books according to the numbers.

Heart surgeons, by the numbers

Other stories get distorted in similar ways, even when the people being described by an algorithm are a small and elite group. There are about 150 cardiac surgeons in New York State, for instance. Ever since 1989, the state has periodically published a report card that rates each surgeon, by name, based on how many of that surgeon’s patients died in hospital or within 30 days after coronary artery bypass surgery. Of course, these mortality numbers depend in part on each surgeon’s patient mix: those who operate on sicker patients can expect more deaths, even if they are equally (or more) skilled. So the state calculates and publishes a “Risk Adjusted Mortality Ratio”—a comparison between the actual number of observed deaths and the number that would be statistically expected, on average, for patients medically similar to those each doctor actually operated on. This process controls for prior heart attacks, age, and several other factors, though, of course, it can’t cover everything. The report has pages of careful caveats, but in the end it treats these risk-adjusted ratios as a good measure of a surgeon’s performance. Ratios much less than one mean “the provider has significantly better performance than the state as a whole,” and conversely, ratios larger than one mean worse performance.
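The arithmetic behind the ratio can be sketched directly: the state's model assigns each patient an expected probability of death, and the ratio compares the deaths that actually occurred to the sum of those probabilities. The patients and risk numbers below are invented for illustration; New York's real adjustment comes from a fitted statistical model, not hand-assigned probabilities.

```python
# Sketch of a risk-adjusted mortality ratio (RAMR): observed deaths
# divided by the deaths statistically expected given each patient's
# risk profile. All patient data below is invented for illustration.

def risk_adjusted_ratio(patients):
    """patients: list of (died, expected_probability_of_death) pairs."""
    observed = sum(1 for died, _ in patients if died)
    expected = sum(prob for _, prob in patients)
    return observed / expected

# A surgeon taking sicker patients can lose more of them, yet still
# score better, because more deaths were expected in the first place:
routine_cases = [(False, 0.01)] * 98 + [(True, 0.01)] * 2  # 2 died, 1 expected
hard_cases    = [(False, 0.05)] * 97 + [(True, 0.05)] * 3  # 3 died, 5 expected

print(round(risk_adjusted_ratio(routine_cases), 2))  # 2.0 -- worse than expected
print(round(risk_adjusted_ratio(hard_cases), 2))     # 0.6 -- better than expected
```

The catch, as the next paragraphs show, is that any risk factor the model leaves out of `expected` counts against the surgeon who takes it on.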

These report cards have changed the way that cardiac surgeons in New York do their jobs, but thirty years on, it’s still not clear whether the metrics make things better. A few surgeons, who did few heart surgeries and had below-average performance, stopped doing these surgeries when reporting began, which is perhaps a credit to the system. But report cards also seem to make even great surgeons more cautious than they think is best for their patients. In a 1997 survey of all New York heart surgeons, most respondents said they had “refused to operate on at least one high risk patient” within the last year “primarily due to public reporting.” The expected fatality rate after cardiac surgery is low—just 1.67% at last count—which is good, but leaves surgeons with little room to take chances. Inevitably, patients with risk factors that are excluded from the model’s adjustments present a threat to each surgeon’s statistics. At the Cleveland Clinic in Ohio, 110 miles west of the New York state line, scholars noted a 31% increase in heart surgery patients coming over from New York after the report cards began. They concluded that “out-of-state risk-shifting appears to be a significant by-product of report-card medicine,” and warned that if Ohio adopted a similar system, patients might find it even harder to get needed surgeries. Surgeons also worry that the system discourages new approaches to hard cases, which carry more risk but could also be good for patients.

Mass-produced and farm-to-table

Looking at these three examples side by side—algorithms that judge newborns, young readers, and cardiac surgeons—I find myself reminded of those dehydrated meals you can take on a camping trip. There might have been a bunch of complex fresh ingredients at the beginning, but they get reduced to something small and portable and shelf-stable, something easy to manage. “Just add hot water,” say the instructions. But when you heat it up again, you never quite get the same meal back. Like hot water applied to a dehydrated meal, algorithms applied to data about people can quickly give us something that’s simple, consistent, and easy to use. The score an algorithm calculates about a person isn’t their real story, any more than a foil bag of reconstituted noodles is a gourmet feast.

But that isn’t the whole story about algorithms; the analogy is useful, but incomplete. Yes, algorithms can distort our beliefs about who’s doing well in the classroom, or what it even means to be a good student or teacher. But “dehydrated stories” also produce the kind of quick comparisons that may have helped to save my newborn life. And when used well, they can do much more.

Tell people’s stories through numbers, and collective identities and trends can swim into view. An unemployment rate comes from the dehydrated story of people who are looking for work, and haven’t found it yet. Likewise, the civil rights groups who fight against racially biased data in courtroom algorithms, for instance, aren’t opposed to all algorithms. They’re ardently in favor of a comprehensive and accurate census count. Census data gives a map of unmet needs, and can also point out discriminatory patterns. Stories rendered into data make discrimination visible, and make remedies possible.

In the run-up to the 2008 financial crisis, for instance, Wells Fargo bank staff were “steering” some black borrowers toward costly subprime mortgages, even when those borrowers had sterling credit, and would have qualified for a mortgage on far more favorable terms. How did investigators know that black borrowers really were getting worse loans than their histories should have earned them? Credit scores. Federal prosecutors showed that black borrowers got worse loan terms than white borrowers with the same scores—even though they posed the same risk of default for the bank. Wells Fargo eventually settled the case for $175 million, much of it earmarked to go back to the black borrowers whose scores showed they had been saddled with overpriced loans.

Stories that come out of an algorithm depend on simplified, numerical reflections of the endless variety of human experience. These stories will never be as rich or real as the ones we learn firsthand. In some sense, this means that criticizing algorithms will always be easy. There will always be some newborns who need intensive care despite scoring high on the Apgar, some sixth graders who do well on the reading test without really mastering the skills that the test is meant to measure, and some driven cardiac surgeons whose high post-surgical fatality rates reflect an intrepid willingness to take on the hardest cases. Every time we let an algorithm tell a person’s story by distilling it down to numbers, we’re losing much of what is best, most engaging, and most human about that story.

Yet the simplicity and predictability of algorithm-based stories can also be radically empowering. People (and organizations) often need to understand each other beyond the scale of a village—to understand something important about a distant, unfamiliar stranger, without the benefit of first-hand interaction. Economic, social, and political opportunity can also be conveyed at a distance, as the present moment of pandemic-driven remote work is forcing many of us to discover. The world would be a narrower, more parochial, and less appetizing place if the only food that we could eat were home-cooked or farm-to-table.

We need both kinds of stories

We need both kinds of stories—the gourmet farm-to-table kind, and the shelf-stable, industrialized, comparison-ready kind.

Admitting this opens up a raft of harder questions. When and why is this algorithmic bargain of simplification and standardization really worth its cost? How can those costs be minimized? If we have to choose between mechanical and personal stories, how should that be done—mechanically, with numerical pros and cons, or personally, with a holistic sense of what’s best in a situation?

There’s also a human and personal challenge here for each of us. We’ve got to learn to mind the gap between real stories and the ones told by data—to learn when, for both personal and organizational reasons, it’s necessary to see what the algorithms obscure. The more we work with algorithms, the more urgent and important our complementary, direct access to each other’s human stories becomes.

If you only have the numbers, you’re likely to be missing something important. Researchers have long sought ways to record and share the human context that surrounds high-stakes algorithms. Google’s Model Cards, for instance, include discussion in plain language about the tradeoffs engineers had to make when designing a system. In the one for their Perspective algorithm—a tool for deciding which comments in an online discussion are “toxic”—they warn people not to use the system for character judgment. This seems like a useful direction, but careful labeling is at best an incomplete solution. Unless decisionmakers build up a healthy habit of questioning people-judging algorithms, the labels and warnings may fall on deaf ears.

Maybe what we need—more than flawless data—is data whose flaws are known and appreciated by everyone involved. Appreciating the limits of people-judging algorithms won’t force us to reject such systems outright. It will empower us to use them responsibly.
