Chapter 13. De-Identification and Data Quality: A Clinical Data Warehouse

As is evident from the case studies we’ve presented, anonymization results in some distortion of the original data. In this chapter we’ll discuss the amount of distortion that can be introduced and how it can be effectively managed. We’ll focus on de-identification, not masking, because it’s de-identification that distorts the variables we might want to use for analysis. The amount of distortion is referred to as “information loss” or, viewed from the opposite direction, “data utility.”

Data utility is important for those using anonymized data, because the results of their analyses are critical for informing major care, policy, and investment decisions. Also, the cost of getting access to data is not trivial, which makes it important to ensure the quality of the data received. We don’t want to be wasteful, spending time and money collecting high-quality data only to watch that quality deteriorate through anonymization practices meant to prepare the data for secondary use. What we really want to know is whether the inferences drawn from de-identified data are reliable: are they the same inferences we would draw from the original data?

Useful Data from Useful De-Identification

Although it may be obvious, it’s worth repeating that poor de-identification techniques result in less data utility. In fact, the utility of the resulting data is one key way to evaluate the quality of a de-identification method. Many of the de-identification techniques that we’ve described are essentially ...
