Reading Qualitative Research

Having described what qualitative methods are, we now turn to a discussion of how to read qualitative studies like the ones that appear throughout this book. For example, what does a particular study teach? When can you trust a study’s results? When can you generalize a study’s results to the larger world? To discuss these issues, let’s consider The Errors of TeX, published in 1989 by the Turing Award winner Donald Knuth [Knuth 1989].

In this classic article, Knuth analyzes more than 850 errors he logged while writing the TeX software. The study’s purpose, as Knuth described it, was “to present a list of all the errors that were corrected in TeX while it was being developed, and to attempt to analyse those errors.” Knuth describes the rationale for his approach as overcoming the limitations of quantitative methods:

The concept of scale cannot easily be communicated by means of numerical data alone; I believe that a detailed list gives important insights that cannot be gained from statistical summaries.

What did Knuth discover in this study? He presents 15 categories of errors, gleaned from a much larger catalog, and then describes them with examples from his log. For example, Knuth describes the “blunder or blotch” category, which included program statements that were syntactically correct but semantically wrong. The root cause of these errors was variable names that were closely related conceptually but led to very different program semantics (e.g., reversing variables named before and after, or next_line and new_line). Knuth goes on to describe the other error categories, the history behind the TeX software project, his personal experiences in writing the software, and how he recorded errors in his log.
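
To make this category concrete, consider the following sketch. It is a hypothetical illustration in Python, not code from TeX or from Knuth’s log; the function and scenario are invented, though the confusable names echo Knuth’s new_line/next_line example:

    # A hypothetical "blunder or blotch": the code is syntactically
    # correct but semantically wrong, because two conceptually related
    # names were confused.
    def insert_line(lines, new_line, index):
        """Insert new_line into lines just before position index."""
        next_line = lines[index]  # the line that new_line will displace
        # Blunder: the programmer meant new_line but typed next_line.
        # Both names fit the context, so nothing crashes -- the buffer
        # simply gains a duplicate line instead of the new one.
        lines.insert(index, next_line)  # should be: lines.insert(index, new_line)
        return lines

Nothing here fails a syntax check; only the program’s behavior reveals the slip, which is what makes this category of error so easy to commit and so hard to spot.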

At the end of the article, he concludes:

What have I really learned then? I think I have learned, primarily, to have a better sense of balance and proportion. I now understand the complexities of a medium-size software system, and the ways in which it can be expected to evolve. I now understand that there are so many kinds of errors, we cannot stamp them out by systematically eliminating everything that might be ‘considered harmful.’ I now understand enough about my propensity to err that I can accept it as an act of life; I can now be convinced more easily of my fallacy when I have made a mistake.

Now let us step back and reflect on the merits of this work: what did we learn, as readers? The first time I read this article, in the mid-1990s, I learned a great deal: I had never written a medium-sized software system, and the rich details, both contextual and historical, helped me understand the experience of undertaking such a large system by oneself. I recognized many of the error categories Knuth described in my own programming, but I also learned to spot new ones, which made me better at generating possible explanations for why my code wasn’t working. It also taught me, as a researcher, that the human factors behind software development—how we think, how our memory works, how we plan and reason—are powerful forces behind software quality. This was one of just a few articles that compelled me to pursue a career in understanding these human factors and exploiting them to improve software quality through better languages, tools, and processes.

But few of these lessons came immediately after reading. I only started to notice Knuth’s categories in my own work over a period of months, and the article was just one of many that inspired my interest in research. And this is a key point about reading reports of qualitative research critically: not only do the implications of their results take time to set in, but you have to be open to reflecting on them. If you dismiss an article entirely because of some flaw you notice or a conclusion you disagree with, you’ll miss out on all of the other insights you might gain through careful, sustained reflection on the study’s results.

Of course, that’s not to say you should trust Knuth’s results in their entirety. But rather than just reacting to studies emotionally, it’s important to read them in a more systematic way. I usually focus on three things about a study: its inputs, its execution, and its outputs. (Sounds like software testing, doesn’t it?) Let’s discuss these in the context of Knuth’s study.

First, do you trust the inputs into Knuth’s study? For example, do you think TeX is a representative program? Do you think Knuth is a representative programmer? Do you trust Knuth himself? All of these factors might affect whether you think Knuth’s 15 categories are comprehensive and representative, and whether they still occur in practice, decades after his report. If you think that Knuth isn’t a representative programmer, how might the results have changed if someone else had conducted the study? For example, let’s imagine that Knuth, like many academics, was an absent-minded professor. Perhaps that would explain why so many of the categories have to do with forgetting or lack of foresight (such as the categories a forgotten function, a mismatch between modules, a surprising scenario, etc.). Maybe a more disciplined individual, or one working in a context where code was the sole focus, would not have had these issues. None of these potential confounding factors is damning to the study’s results, but they ought to be considered carefully before generalizing from them.

Do you trust Knuth’s execution of his study? In other words, did Knuth follow the method that he described, and when he did not, how might these deviations have affected the results? Knuth used a diary study methodology, which is often used today to understand people’s experiences over long periods of time without the direct observation of a researcher. One key to a good diary study is that you don’t tell the participants of the study what you expect to find, lest you bias what they write and how they write it. But Knuth was both the experimenter and the participant in his study. What kinds of expectations did he have about the results? Did he already have categories in mind before starting the log? Did he categorize the errors throughout the development of TeX, or retrospectively after TeX was done? He doesn’t describe any of these details in his report, but the answers to these questions could significantly change how we interpret the results.

Diary studies also have inherent limitations. For example, they can invoke a Heisenberg-style problem, where the act of observation may compel the diary writer to reflect on the work being captured to such a degree that the nature of the work itself changes. In Knuth’s study, this might have meant that by logging errors, Knuth was reflecting so much on the causes of errors that he subconsciously averted whole classes of them, and thus never observed them. Diary studies can also be difficult for participants to maintain consistently over time. For example, there was a period when Knuth halted his study temporarily, noting, “I did not keep any record of errors removed during the hectic period when TeX82 was being debugged....” What kinds of errors would Knuth have found had he logged during this period? Would they be different from those he found in calmer, less stressful periods?

Finally, do you trust the outputs of the study, its implications? It is standard practice in academic writing to separate the discussion of results and implications, to enable readers to decide whether they would draw the same conclusions from the evidence that the authors did. But Knuth combines these two throughout his article, providing both rich descriptions of the faults in TeX and the implications of his observations. For example, after a series of fascinating stories about errors in his Surprises category (which Knuth describes as global misunderstandings), he reflects:

This experience suggests that all software systems be subjected to the meanest, nastiest torture tests imaginable; otherwise they will almost certainly continue to exhibit bugs for years after they have begun to produce satisfactory results in large applications.

When results and implications appear side by side, it can be easy to forget that they are two separate things, to be evaluated independently. I trust Knuth’s memory of the stories that inspired the implication quoted here because he explained his process for recording them. However, I think Knuth over-interpreted his stories in forming his recommendation. Would Knuth have finished TeX if he had spent so much time on torture tests? I trust his diary, but I’m skeptical of his resulting advice.

Of course, it’s important to reiterate that every qualitative study has limitations, but most studies have valuable insights. To be an objective reader of qualitative research, one has to accept this fact and meticulously identify the two. A good report will do this for you, as do the chapters in this book.
