Chapter 3. Fair Data

The choice of what data to use when developing algorithms and machine learning models matters now and will only matter more in the future. This chapter considers what it means for a data set to be fair and how to identify aspects of a data set that can be problematic for fairness.

Two main themes will run throughout our discussion. First is the concern about garbage in, garbage out, which I’ll call the fidelity concern. Imagine building a college admissions algorithm from a data set in which students’ grades and names had been mixed up: the data would convey false information and would likely yield an unrealistic, faulty algorithm. Second is the concern about whether data was obtained in a way consistent with fair play, which I’ll call the provenance concern. An example is a psychiatrist who sells the names of depressed patients to a marketing agency after learning those names through their own practice of medicine. If you were a data scientist working at a marketing firm, you probably wouldn’t (and shouldn’t) feel comfortable using such data once you knew where it had come from.
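To make the fidelity concern concrete, here is a minimal sketch of the kind of sanity check one might run before training on admissions data. The data, the field names, and the `flag_suspect_records` helper are all hypothetical, invented for illustration: the idea is simply that when the same student name appears with wildly different grades, some records may have been mixed up and deserve a look before any model sees them.

```python
# Hypothetical fidelity check: flag names whose recorded GPAs disagree
# by more than a plausible margin, suggesting mixed-up records.
from collections import defaultdict


def flag_suspect_records(records, max_gpa_spread=1.5):
    """Return names whose GPA values vary by more than max_gpa_spread."""
    gpas = defaultdict(list)
    for name, gpa in records:
        gpas[name].append(gpa)
    return sorted(
        name
        for name, values in gpas.items()
        if max(values) - min(values) > max_gpa_spread
    )


records = [
    ("Ana Silva", 3.8),
    ("Ana Silva", 1.9),   # large spread: likely a mix-up
    ("Ben Okafor", 3.1),
    ("Ben Okafor", 3.2),  # small spread: plausible duplicate
]
print(flag_suspect_records(records))  # ['Ana Silva']
```

A check like this won’t catch every mix-up (two students can legitimately share a name), but it illustrates the point: fidelity problems are often detectable with simple rules applied before modeling, whereas provenance problems rarely show up in the data itself.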

The fidelity concern is what most people think about when terms like data integrity are thrown around. The concern need not have anything to do with fairness. For reasons related to the bottom line, most businesses care about data quality because they want accurate algorithmic products, and accurate algorithms generally require ...
