Chapter 2. A Risk-Based De-Identification Methodology
Before we describe how we de-identified specific health data sets, we need to lay out some basic methodology. It’s a necessary evil, but we’ll keep the math to a bare minimum. The complete methodology and its justifications have been provided in detail elsewhere; here we’ll just give a high-level description of the key steps. The case studies in subsequent chapters illustrate how each of these steps applies to real data, which should give you a concrete sense of how de-identification works in practice.
Some important basic principles guide our methodology for de-identification. These principles are consistent with existing privacy laws in multiple jurisdictions.
- The risk of re-identification can be quantified
  Having some way to measure risk allows us to decide whether it’s too high, and how much de-identification needs to be applied to a data set. This quantification is really just an estimate under certain assumptions. The assumptions concern data quality and the type of attack that an adversary will likely launch on a data set. We start by assuming ideal conditions about data quality for the data set itself and the information that an adversary would use to attack the data set. This assumption, although unrealistic, actually results in conservative estimates of risk (i.e., setting the risk estimate a bit higher than it probably is) because the better the data is, ...