Chapter 8. Free-Form Text: Electronic Medical Records

It might seem like a fairly trivial task to find and anonymize all personal health information in a document, and there’s research in the academic literature about this, but here we’re interested in where the rubber meets the road. It’s not enough for a system to catch 80% of the names in a medical document; it has to catch all of them, or the result is considered a breach. That means our standards have to be much higher than what you’d typically find in that literature.


You’ll notice that we’re dealing with both direct and indirect identifiers when processing free-form text. Both types of identifiers are extracted from the text and dealt with in the same way (e.g., tagging or redaction). So does that make text de-identification a form of masking or a form of de-identification, as we defined them earlier in this book? In fact, it’s both. That’s why we call it text anonymization.
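To make the difference between tagging and redaction concrete, here’s a minimal sketch in Python. Everything in it is an assumption made for illustration: the two patterns, the `NAME` and `DATE` labels, and the `find_identifiers` and `anonymize` helpers are hypothetical and aren’t taken from any particular tool. It treats a titled name as a direct identifier and a date as an indirect one, then handles both through the same code path.

```python
import re

# Hypothetical patterns, for illustration only. Two regular expressions
# are nowhere near enough to catch every identifier in real clinical text.
PATTERNS = {
    "NAME": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+"),  # direct identifier (assumed)
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),               # indirect identifier (assumed)
}


def find_identifiers(text):
    """Return (start, end, label) spans for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    return spans


def anonymize(text, mode="tag"):
    """Tag each identifier in place, or redact it by replacing it with its label."""
    pieces = []
    cursor = 0
    for start, end, label in find_identifiers(text):
        if start < cursor:
            continue  # skip a span that overlaps one already handled
        pieces.append(text[cursor:start])
        if mode == "tag":
            pieces.append(f"<{label}>{text[start:end]}</{label}>")
        else:
            pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Seen by Dr. Smith on 2012-03-14 for follow-up."
print(anonymize(note, mode="tag"))
# Seen by <NAME>Dr. Smith</NAME> on <DATE>2012-03-14</DATE> for follow-up.
print(anonymize(note, mode="redact"))
# Seen by [NAME] on [DATE] for follow-up.
```

Notice that the direct and the indirect identifier go through exactly the same steps, extraction followed by tagging or redaction, which is the point made above. A note that mentions “Smith” without a title would slip straight past this sketch, and that’s exactly the kind of miss we can’t afford.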

Not So Regular Expressions

Many modern clinical systems have free-form text: nurses’ notes, consultation letters, radiology reports, pathology reports, and so on. This text gets into electronic systems through health care providers who input the data. Even though electronic medical records allow the entry of structured information, sometimes using these systems takes longer than just writing in the information (for example, choosing a diagnosis from a long drop-down list), and if no analytics will be performed on this data that will directly inform their practice (which is commonly ...
