Anonymizing Health Data

Preface

Although there is plenty of research into the areas of anonymization (masking and de-identification), there isn’t much in the way of practical guides. As we tackled one anonymization project after another, we got to thinking that more of this information should be shared with the broader public. Not an academic treatise, but something readable that was both approachable and applicable. What better publisher, we thought, than O’Reilly, known for their fun technical books on how to get things done? Thus the idea of an anonymization book of case studies and methods was born. (After we convinced O’Reilly to come along for the ride, the next step was to convince our respective wives and kids to put up with us for the duration of this endeavor.)

Audience

Everyone working with health data, and anyone interested in privacy in general, could benefit from reading at least the first couple of chapters of this book. Hopefully by that point the reader will be caught in our net, like a school of Atlantic herring, and be interested in reading the entire volume! We’ve identified four stakeholders who are likely to be specifically interested in this work:

Executive management looking to create new revenue streams from data assets, but with concerns about releasing identifiable information and potentially running afoul of the law
IT professionals who are hesitant to implement data anonymization solutions due to integration and usability concerns
Data managers and analysts who are unsure about their current methods of anonymizing data and whether they’re compliant with regulations and best practices
Privacy and compliance professionals who need to implement defensible and efficient anonymization practices that are consistent with the relevant regulations in their jurisdiction

Conventions Used in this Book

The following typographical conventions are used in this book:

Italic: Used for emphasis, new terms, and URLs.

Tip

This element signifies a tip, suggestion, or a general note.

Warning

This element indicates a trap or pitfall to watch out for, typically something that isn’t immediately obvious.

Safari® Books Online

Note

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/anonymizing-health-data.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Watch us on YouTube: http://www.youtube.com/oreillymedia

Content Updates

August 2014

Chapter 1, Introduction: This chapter includes a new section on automating the anonymization of data sets, and describes how this will increase the number of anonymization professionals by making the methods accessible to a broader and less specialized audience.
Chapter 2, A Risk-Based De-Identification Methodology: We have provided new guidance on selecting direct and indirect identifiers, including a decision tree to simplify the process.
Chapter 13, De-Identification and Data Quality: A Clinical Data Warehouse: Here, we consider the before and after effects of anonymizing a clinical data warehouse—specifically, two study protocols that are of interest to researchers and a closer look at date shifting.

Acknowledgements

Everything accomplished in this book, and in our anonymization work in general, would not have been possible without the great teams we work with at the Electronic Health Information Lab at the CHEO Research Institute, and Privacy Analytics, Inc. As the saying goes, surround yourself with great people and great things will come of it. A few specific contributors to this book should get a high five: Andrew Baker, an expert in algorithms, for his help with covering designs and geoproxy risk; Abdulaziz Dahir, a stats co-op, who helped us with some of the geospatial analysis; Aleksander Essex, a wizard in cyber security and applied cryptography, for helping develop the secure linking protocol; Ben Eze and his team of merry developers who put code to work; Youssef Kadri, an expert in natural language processing, for helping us with text anonymization; and Ann Waldo, a legal expert on privacy, information security, and health care issues.

Of course, a book of case studies wouldn’t be possible without data sets to work with. So we need to thank the many people we have worked with to anonymize the data sets discussed in this book: BORN Ontario (Ann Sprague and her team), the Healthcare Cost and Utilization Project, Heritage Provider Network (Jonathan Gluck) and Kaggle (Jeremy Howard and team, who helped organize the Heritage Health Prize), the Clinical Center of Excellence at Mount Sinai (Lori Stevenson and her team, in particular Cornelia Dellenbaugh, sadly deceased and sorely missed), Informatics for Integrating Biology and the Bedside (i2b2), the State of Louisiana (Lucas Tramontozzi, Amy Legendre, and everyone else who helped) and organizers of the Cajun Code Fest, the American Society of Clinical Oncology (Joshua Mann and Andrej Kolacevski), IMS Brogan (Neil Corner and his team) for their help with the data quality analysis, and the Public Health Agency of Canada (Tom Wong and his team) as well as Jay Mercer at the Bruyere Hospital and family clinic for working with us on the chlamydia protocol.

Finally, thanks to the poor souls who slogged through our original work, catching typos and helping to clarify a lot of the text and ideas: Andy Oram, technical editor extraordinaire; Jean-Louis Tambay, an expert statistician with a great eye for detail; Bradley Malin, a leading researcher in health information privacy; David Paton, an expert methodologist in clinical standards for health information; and Darren Lacey, an expert in information security. It’s no exaggeration to say that we had great people review this book! We consider ourselves fortunate to have received their valuable feedback.


Anonymizing Health Data by Khaled El Emam, Luk Arbuckle