Chapter 5. Dates, Long Tails, and Correlation: Insurance Claims Data

Insurance claims data represents a treasure trove of health information for all manner of analytics and research. In it you can find diagnostic and treatment information, details on providers used, and a great deal regarding finance and billing. The data is very precise, because medical charges and reimbursements must be accounted for exactly, but it’s primarily administrative data, which can pose all kinds of data cleaning and formatting problems. More important for our purposes, however, is that it can also present some unique challenges to de-identification that you wouldn’t necessarily see in data collected primarily for research.

The Heritage Provider Network (HPN) presented us with a unique challenge: to de-identify a large claims data set, with longitudinal patient profiles spanning up to three years, for an essentially public data release. This work motivated a lot of the methods we present here. The methods are useful whenever you deal with longitudinal data, but especially when each patient has a large number of observations.

The Heritage Health Prize

The Heritage Provider Network is a provider of health care services in California. They initiated the Heritage Health Prize (HHP) competition to develop a predictive algorithm that can “identify patients who will be admitted to a hospital within the next year using historical claims data.”[57] The data provided to competitors consisted of up to three years’ ...
