Anonymizing Health Data

Book Description

Updated as of August 2014, this practical book demonstrates proven methods for anonymizing health data so your organization can share meaningful datasets without exposing patient identity. Leading experts Khaled El Emam and Luk Arbuckle walk you through a risk-based methodology, using case studies from their efforts to de-identify hundreds of datasets.

Clinical data is valuable for research and other types of analytics, but making it anonymous without compromising data quality is tricky. This book demonstrates techniques for handling different data types, based on the authors’ experiences with a maternal-child registry, inpatient discharge abstracts, health insurance claims, electronic medical record databases, and the World Trade Center disaster registry, among others.

  • Understand different methods for working with cross-sectional and longitudinal datasets
  • Assess the risk of adversaries who attempt to re-identify patients in anonymized datasets
  • Reduce the size and complexity of massive datasets without losing key information or jeopardizing privacy
  • Use methods to anonymize unstructured free-form text data
  • Minimize the risks inherent in geospatial data, without omitting critical location-based health information
  • Look at ways to anonymize medical codes (diseases, procedures, and drugs) in health data
  • Learn about the challenges of anonymously linking related datasets

Table of Contents

  1. Preface
    1. Audience
    2. Conventions Used in this Book
    3. Safari® Books Online
    4. How to Contact Us
    5. Content Updates
      1. August 2014
    6. Acknowledgements
  2. 1. Introduction
    1. To Anonymize or Not to Anonymize
      1. Consent, or Anonymization?
      2. Penny Pinching
      3. People Are Private
    2. The Two Pillars of Anonymization
      1. Masking Standards
      2. De-Identification Standards
        1. Lists
        2. Heuristics
        3. Risk-based methodology
    3. Anonymization in the Wild
      1. Organizational Readiness
      2. Making It Practical
      3. Making It Automated
      4. Use Cases
    4. Stigmatizing Analytics
    5. Anonymization in Other Domains
    6. About This Book
  3. 2. A Risk-Based De-Identification Methodology
    1. Basic Principles
    2. Steps in the De-Identification Methodology
      1. Step 1: Selecting Direct and Indirect Identifiers
      2. Step 2: Setting the Threshold
      3. Step 3: Examining Plausible Attacks
      4. Step 4: De-Identifying the Data
      5. Step 5: Documenting the Process
    3. Measuring Risk Under Plausible Attacks
      1. T1: Deliberate Attempt at Re-Identification
      2. T2: Inadvertent Attempt at Re-Identification
      3. T3: Data Breach
      4. T4: Public Data
    4. Measuring Re-Identification Risk
      1. Probability Metrics
      2. Information Loss Metrics
    5. Risk Thresholds
      1. Choosing Thresholds
      2. Meeting Thresholds
    6. Risky Business
  4. 3. Cross-Sectional Data: Research Registries
    1. Process Overview
      1. Secondary Uses and Disclosures
      2. Getting the Data
      3. Formulating the Protocol
      4. Negotiating with the Data Access Committee
    2. BORN Ontario
      1. BORN Data Set
    3. Risk Assessment
      1. Threat Modeling
      2. Results
      3. Year on Year: Reusing Risk Analyses
    4. Final Thoughts
  5. 4. Longitudinal Discharge Abstract Data: State Inpatient Databases
    1. Longitudinal Data
      1. Don’t Treat It Like Cross-Sectional Data
    2. De-Identifying Under Complete Knowledge
      1. Approximate Complete Knowledge
      2. Exact Complete Knowledge
      3. Implementation
      4. Generalization Under Complete Knowledge
    3. The State Inpatient Database (SID) of California
      1. The SID of California and Open Data
    4. Risk Assessment
      1. Threat Modeling
      2. Results
    5. Final Thoughts
  6. 5. Dates, Long Tails, and Correlation: Insurance Claims Data
    1. The Heritage Health Prize
    2. Date Generalization
      1. Randomizing Dates Independently of One Another
      2. Shifting the Sequence, Ignoring the Intervals
      3. Generalizing Intervals to Maintain Order
      4. Dates and Intervals and Back Again
      5. A Different Anchor
      6. Other Quasi-Identifiers
      7. Connected Dates
    3. Long Tails
      1. The Risk from Long Tails
      2. Threat Modeling
      3. Number of Claims to Truncate
      4. Which Claims to Truncate
    4. Correlation of Related Items
      1. Expert Opinions
      2. Predictive Models
      3. Implications for De-Identifying Data Sets
    5. Final Thoughts
  7. 6. Longitudinal Events Data: A Disaster Registry
    1. Adversary Power
      1. Keeping Power in Check
      2. Power in Practice
      3. A Sample of Power
    2. The WTC Disaster Registry
      1. Capturing Events
      2. The WTC Data Set
      3. The Power of Events
    3. Risk Assessment
      1. Threat Modeling
      2. Results
    4. Final Thoughts
  8. 7. Data Reduction: Research Registry Revisited
    1. The Subsampling Limbo
      1. How Low Can We Go?
      2. Not for All Types of Risk
      3. BORN to Limbo!
    2. Many Quasi-Identifiers
      1. Subsets of Quasi-Identifiers
      2. Covering Designs
      3. Covering BORN
    3. Final Thoughts
  9. 8. Free-Form Text: Electronic Medical Records
    1. Not So Regular Expressions
    2. General Approaches to Text Anonymization
    3. Ways to Mark the Text as Anonymized
    4. Evaluation Is Key
      1. Appropriate Metrics, Strict but Fair
      2. Standards for Recall, and a Risk-Based Approach
      3. Standards for Precision
    5. Anonymization Rules
    6. Informatics for Integrating Biology and the Bedside (i2b2)
      1. i2b2 Text Data Set
    7. Risk Assessment
      1. Threat Modeling
      2. A Rule-Based System
      3. Results
    8. Final Thoughts
  10. 9. Geospatial Aggregation: Dissemination Areas and ZIP Codes
    1. Where the Wild Things Are
    2. Being Good Neighbors
      1. Distance Between Neighbors
      2. Circle of Neighbors
      3. Round Earth
      4. Flat Earth
    3. Clustering Neighbors
      1. We All Have Boundaries
      2. Fast Nearest Neighbor
    4. Too Close to Home
      1. Levels of Geoproxy Attacks
      2. Measuring Geoproxy Risk
      3. Accounting for Geoproxy Risk
    5. Final Thoughts
  11. 10. Medical Codes: A Hackathon
    1. Codes in Practice
    2. Generalization
      1. The Digits of Diseases
      2. The Digits of Procedures
      3. The (Alpha)Digits of Drugs
    3. Suppression
    4. Shuffling
    5. Final Thoughts
  12. 11. Masking: Oncology Databases
    1. Schema Shmema
    2. Data in Disguise
      1. Field Suppression
      2. Randomization
      3. Pseudonymization
      4. Frequency of Pseudonyms
    3. Masking On the Fly
    4. Final Thoughts
  13. 12. Secure Linking
    1. Let’s Link Up
    2. Doing It Securely
      1. Don’t Try This at Home
      2. The Third-Party Problem
      3. Basic Layout for Linking Up
    3. The Nitty-Gritty Protocol for Linking Up
      1. Bringing Paillier to the Parties
      2. Matching on the Unknown
    4. Scaling Up
      1. Cuckoo Hashing
      2. How Fast Does a Cuckoo Run?
    5. Final Thoughts
  14. 13. De-Identification and Data Quality: A Clinical Data Warehouse
    1. Useful Data from Useful De-Identification
    2. Degrees of Loss
    3. Workload-Aware De-Identification
      1. Questions to Improve Data Utility
    4. A Clinical Data Warehouse
      1. GI Protocol
      2. Chlamydia Protocol
      3. Date Shifting
    5. Final Thoughts
  15. Index
  16. Colophon
  17. Copyright

Product Information

  • Title: Anonymizing Health Data
  • Author(s): Khaled El Emam, Luk Arbuckle
  • Release date: December 2013
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449363079