Chapter 12. Secure Linking

Much of the most useful health data is linked data: a data set built by joining records about the same individuals from multiple source data sets. But anonymization, by design, makes linking difficult. How do you accurately link records between two data sets when they have next to nothing in common after anonymization? We see this challenge often. Data sets are collected at different points in time, or by different organizations, and they need to be linked together to run the desired analytics.

To be useful for analysis, data sets must be linked before anonymization. This means that the different organizations holding the constituent data sets need to share information to allow this linking to happen. The best fields to use for linking data sets are almost always identifiers, which means that the organizations have to share personally identifying information in order to link their data. We’re back to the original problem now—how can these organizations share personal identifiers if they do not have appropriate authority or consent?
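To make the problem concrete, here is a minimal sketch of why linking has to come before anonymization. It is an illustration only, not the secure linking protocol this chapter goes on to describe. The two organizations, the `hcn` (health card number) field, the records, and the helper names `equi_join` and `pseudonymize` are all hypothetical. Each organization is assumed to replace the identifier with a keyed hash under its own secret, which stands in for any anonymization step that is applied independently at each site.

```python
import hashlib
import hmac


def equi_join(left, right, key):
    """Return merged records whose `key` values are exactly equal."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def pseudonymize(rows, key, secret):
    """Replace the identifier with a keyed hash under one organization's secret."""
    out = []
    for row in rows:
        digest = hmac.new(secret.encode(), row[key].encode(), hashlib.sha256)
        out.append({**row, key: digest.hexdigest()})
    return out


# Hypothetical records held by two different organizations.
clinic = [
    {"hcn": "1001", "diagnosis": "asthma"},
    {"hcn": "1002", "diagnosis": "diabetes"},
]
pharmacy = [
    {"hcn": "1002", "drug": "metformin"},
    {"hcn": "1003", "drug": "salbutamol"},
]

# Linking on the raw identifier finds the one patient present in both sets.
linked_before = equi_join(clinic, pharmacy, "hcn")

# After each site pseudonymizes with its own secret, the keys no longer agree.
linked_after = equi_join(
    pseudonymize(clinic, "hcn", "clinic-secret"),
    pseudonymize(pharmacy, "hcn", "pharmacy-secret"),
    "hcn",
)

print(len(linked_before), len(linked_after))
```

The join on raw identifiers succeeds only because both parties can see those identifiers, which is exactly the sharing the organizations may not be authorized to do. The join on independently pseudonymized keys discloses nothing, but it also links nothing.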

Only once they’re linked can the data sets be de-identified and used and disclosed for secondary purposes. We’ve seen many useful projects die because it wasn’t possible to link the needed data sets together. But this problem can be solved: this chapter describes a way to securely link data sets together without disclosing personal information. The mechanism is an equi-join process for two tables that ...
