Chapter 11. Masking: Oncology Databases

When we need to remove all useful data from a field, we turn to data masking, the second of our pillars discussed in The Two Pillars of Anonymization. Usually this means replacing real values with entirely random ones, possibly drawn from a large database of substitutes (for things like names). Obviously, this isn’t something we do to fields we need for analytics. Rather, we apply it to things like names, Social Security numbers, and ID fields. De-identification protects fields we need for analytics, and is a trade-off between privacy and utility; masking protects fields we don’t need for analytics, and is meant to completely hide the original data.
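To make the idea concrete, here is a minimal sketch of random-replacement masking in Python. It is an illustration under stated assumptions, not the method used by any particular system: the field names (`name`, `ssn`, `patient_id`), the tiny `REPLACEMENT_NAMES` pool, and the helper functions are all hypothetical. A real deployment would draw from a much larger name database.

```python
import secrets

# Hypothetical replacement pool; a real one would hold many thousands of names.
REPLACEMENT_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Patel", "Chris Novak", "Taylor Kim"]


def mask_name(_original: str) -> str:
    """Return a random name from the pool; the original value is ignored."""
    return secrets.choice(REPLACEMENT_NAMES)


def mask_digits(original: str) -> str:
    """Replace each ASCII digit with a random digit, keeping length and separators."""
    return "".join(
        secrets.choice("0123456789") if ch in "0123456789" else ch
        for ch in original
    )


def mask_record(record, name_fields=("name",), digit_fields=("ssn", "patient_id")):
    """Return a masked copy of the record; fields not listed pass through unchanged."""
    masked = dict(record)
    for field in name_fields:
        if field in masked:
            masked[field] = mask_name(masked[field])
    for field in digit_fields:
        if field in masked:
            masked[field] = mask_digits(masked[field])
    return masked


# Example with invented data: the diagnosis code is kept for analytics,
# while the name and SSN are replaced with values unrelated to the originals.
record = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "C50.9"}
print(mask_record(record))
```

Because the replacements are chosen independently of the original values, nothing about the real name or number survives in the masked field. Keeping the length and separators of the number is a design choice in this sketch, so that downstream systems expecting a particular format keep working.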

To understand the reasons for masking and its trade-offs, we’ll take a short look at a real database. The American Society of Clinical Oncology (ASCO) has launched an ambitious project to build tools on top of oncology electronic health record (EHR) data collected from sites across the country. Its goal is to improve the quality of care by having millions of patients essentially participate in a large clinical trial, pooling all of their data in a system called CancerLinQ.[104]

Schema Shmema

Before we discuss data masking, let’s look at an example database that the ASCO CancerLinQ system may come across. This will give us examples to think about when we go through approaches to masking. Figure 11-1 is a schema for our invented database. Direct identifiers include the names, address (although ...
