Anonymizing Health Data by Luk Arbuckle, Khaled El Emam

Although there is plenty of research into the areas of anonymization (masking and de-identification), there isn’t much in the way of practical guides. As we tackled one anonymization project after another, we got to thinking that more of this information should be shared with the broader public. Not an academic treatise, but something readable that was both approachable and applicable. What better publisher, we thought, than O’Reilly, known for their fun technical books on how to get things done? Thus the idea of an anonymization book of case studies and methods was born. (After we convinced O’Reilly to come along for the ride, the next step was to convince our respective wives and kids to put up with us for the duration of this endeavor.)


Everyone working with health data, and anyone interested in privacy in general, could benefit from reading at least the first couple of chapters of this book. Hopefully by that point the reader will be caught in our net, like a school of Atlantic herring, and be interested in reading the entire volume! We’ve identified four groups of stakeholders who are likely to be specifically interested in this work:

  • Executive management looking to create new revenue streams from data assets, but with concerns about releasing identifiable information and potentially running afoul of the law
  • IT professionals who are hesitant to implement data anonymization solutions due to integration and usability concerns
  • Data managers and analysts who are unsure about their current methods of anonymizing data and whether they’re compliant with regulations and best practices
  • Privacy and compliance professionals who need to implement defensible and efficient anonymization practices that comply with the HIPAA Privacy Rule when disclosing sensitive health data

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Used for emphasis, new terms, and URLs.


This element signifies a tip, suggestion, or a general note.


This element indicates a trap or pitfall to watch out for, typically something that isn’t immediately obvious.

Safari® Books Online


Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/anonymizing-health-data.

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia


Everything accomplished in this book, and in our anonymization work in general, would not have been possible without the great teams we work with at the Electronic Health Information Lab at the CHEO Research Institute, and Privacy Analytics, Inc. As the saying goes, surround yourself with great people and great things will come of it. A few specific contributions to this book are worth a high five: Ben Eze and his team of merry developers who put code to work; Andrew Baker, an expert in algorithms, for his help with covering designs and geoproxy risk; Abdulaziz Dahir, a stats co-op, who helped us with some of the geospatial analysis; and Youssef Kadri, an expert in natural language processing, for helping us with text anonymization.

Of course, a book of case studies wouldn’t be possible without data sets to work with. So we need to thank the many people we have worked with to anonymize the data sets discussed in this book: BORN Ontario (Ann Sprague and her team), the Health Care Cost and Utilization Project, Heritage Provider Network (Jonathan Gluck) and Kaggle (Jeremy Howard and team, who helped organize the Heritage Health Prize), the Clinical Center of Excellence at Mount Sinai (Lori Stevenson and her team, in particular Cornelia Dellenbaugh, sadly deceased and sorely missed), Informatics for Integrating Biology and the Bedside (i2b2), the State of Louisiana (Lucas Tramontozzi, Amy Legendre, and everyone else who helped) and the organizers of the Cajun Code Fest, and the American Society of Clinical Oncology (Joshua Mann and Andrej Kolacevski).

Finally, thanks to the poor souls who slogged through our original work, catching typos and helping to clarify a lot of the text and ideas in this book: Andy Oram, technical editor extraordinaire; Jean-Louis Tambay, an expert statistician with a great eye for detail; Bradley Malin, a leading researcher in health information privacy; David Paton, an expert methodologist in clinical standards for health information; and Darren Lacey, an expert in information security. It’s no exaggeration to say that we had great people review this book! We consider ourselves fortunate to have received their valuable feedback.
