Unraveling Möbius strips of edge-case data

Connecting data scientists with domain expertise makes more of the unknowns known.

By Rachel Shadoan
May 26, 2016
Top of Mobius Tower, Shanghai (turned horizontally). (source: Carsten Ullrich on Flickr)

It was a deceptively simple request: create a data set of mothers and their babies.

I was working with a research hospital, providing support for a research project investigating the impact of a mother’s health and economic factors on outcomes for their babies—a wonderful opportunity to use data tools to make a positive impact in the world. Having painstakingly exported 25 years’ worth of electronic medical records (from nearly 200,000 patients) from the hospital’s database and persisted them in Elasticsearch, I thought the hardest part was behind me.

That was, of course, before I realized that nowhere in 26,000 tables were mothers directly linked to their babies.

Historically, large data sets that try to match moms and babies get it right about 80% of the time. And that 20% of edge cases often include the most vulnerable populations. —Patrick Myers, MD

Worse, the records didn’t even consistently identify patients who were born in the hospital. Sometimes, this was recorded in the “Employer” field of the patient record, using text like “Newborn Born Here” or “Minor Born Here.” However, once the patient was old enough to have an employer, that data was overwritten with actual employment details. I was only able to identify around 7,000 babies using the “Employer” field—far fewer than expected.

Next, I attempted to identify babies using ICD9 diagnosis codes. Any time a patient is seen at the hospital, they are assigned one or more ICD9 codes describing the purpose of their visit. When a baby is born, their “first visit” is assigned an ICD9 code in the V30-V39 (“Liveborn Infants According To Type Of Birth”) range. I located 15,008 patients with “purpose of visit” in that range.
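As a rough illustration of the code-range filter described above, the following sketch selects patients with at least one visit coded in the V30-V39 range. The record layout (`patient_id`, `icd9_codes`) and the function names are assumptions made for this example; the real export was stored in Elasticsearch and its schema is not shown here.

```python
import re

# Matches V30 through V39, optionally followed by a decimal suffix such as ".00".
LIVEBORN_CODE = re.compile(r"^V3[0-9](\.\d+)?$")


def is_liveborn_code(code: str) -> bool:
    """True if the ICD9 code falls in the V30-V39 liveborn range."""
    return bool(LIVEBORN_CODE.match(code.strip().upper()))


def find_babies(visits):
    """Return IDs of patients with at least one liveborn-range visit code.

    `visits` is a hypothetical list of dicts with `patient_id` and
    `icd9_codes` keys; these field names are assumptions.
    """
    return {
        v["patient_id"]
        for v in visits
        if any(is_liveborn_code(c) for c in v["icd9_codes"])
    }


# Toy data, invented for illustration only.
visits = [
    {"patient_id": "p1", "icd9_codes": ["V30.00"]},
    {"patient_id": "p2", "icd9_codes": ["V27.0"]},
    {"patient_id": "p3", "icd9_codes": ["765.10", "V31.01"]},
]
babies = find_babies(visits)
```

The same pattern, with a different code prefix, would cover any other "select patients by diagnosis code" step.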

Having identified the babies born in the hospital, I hoped to make short work of identifying their mothers. After all, Elasticsearch is specifically tailored for text search; fuzzy matching will catch a lot of misspellings and typos.
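For readers unfamiliar with fuzzy matching in Elasticsearch, a query body of roughly this shape is one way to ask for it. This is a sketch only: the field name `next_of_kin_name` and the helper function are hypothetical, and the queries actually used in the project are not shown in this article.

```python
def fuzzy_name_query(field: str, name: str) -> dict:
    """Build an Elasticsearch `match` query body that tolerates small typos.

    "fuzziness": "AUTO" lets Elasticsearch choose an allowed edit distance
    based on the length of each term.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": name,
                    "fuzziness": "AUTO",
                }
            }
        }
    }


# Hypothetical field name and search text, for illustration only.
body = fuzzy_name_query("next_of_kin_name", "Jane Smiht")
```

The resulting dict would be passed as the search body to whichever Elasticsearch client is in use.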

Then I realized that the patient records are only guaranteed to list the mother’s first name and maiden name. Mothers are sometimes listed as the next of kin using their full legal names, but the only surname definitely present is the mother’s maiden name.

Worse, nowhere in a patient’s record is their own maiden name listed.

It was then that the true horror of Western naming conventions became clear. Women’s names change constantly. Even in cases where the mother’s legal name was listed in the baby’s record, the mother’s legal name as listed in the baby’s record is not necessarily the legal name listed in her own record! Divorce and remarriage really confound identifying people by name.

While I strongly considered abandoning the project to lobby for assigning every human a universally unique identifier at birth (perhaps an IPv6 address), I pressed onward, trying to match mothers with babies by searching for patients whose names best matched the names listed in the babies’ records.

It didn’t make sense to do computationally expensive name comparisons on patients who could definitely not be the mother, however; I needed to narrow the search space to patients who had actually given birth in the hospital. By selecting only patients who had a visit with a V27 (Outcome of Delivery) ICD9 code, I reduced the field to 10,021 potential mothers.

But the search approach had a problem: it provided me with the best match to a particular query, but made it difficult to assess where that match fit in the distribution of possible matches. It worked well for simple cases, like when the mother’s full legal name was listed in the “next of kin” field in the baby’s record and hadn’t changed since, but in more complicated cases it was difficult to assess whether the best match was actually the baby’s mother, or just the best match for the query.

Frustrated by this and tired of writing clauses to catch edge cases (this data, as it turns out, is entirely composed of edge cases, like a Möbius strip of irritation), I changed approaches. Instead of searching for the mother by name and assuming the best match to the query was the baby’s mother, I would compare every baby to every mother and calculate a match score for that pairing. That would allow me not only to characterize the best matches, but also those that didn’t match. I hoped that this approach would simplify verification. I was expecting the matching algorithms to be imperfect, but I needed to be able to answer the question, “If this person isn’t the baby’s mother (in spite of being the best match according to our criteria), who is the baby’s mother?”

This pairwise match score calculation included a zillion variations of: “Do the names listed for the mother in the baby’s record match the name listed in the mother’s record?” But, determined as I was to write this set of queries only one more time, I included other information as well: looking for the baby’s father’s name anywhere in the mother’s record; or verifying that the mother’s city of birth listed in the baby’s record is the same as the city of birth listed in the mother’s record, for example.
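A minimal sketch of what such a pairwise score might look like follows. It is not the scoring used in the project: the field names, the equal weighting of each signal, and the use of Python's `difflib` for name similarity are all assumptions chosen to keep the example self-contained.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; 0 if either is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(baby, mother):
    """Sum several weak signals into one score (equal weights are an assumption)."""
    score = 0.0
    # Mother's first name as listed in the baby's record vs. the mother's record.
    score += similarity(baby.get("mother_first_name"), mother.get("first_name"))
    # Maiden name from the baby's record vs. the surname on the mother's record.
    score += similarity(baby.get("mother_maiden_name"), mother.get("last_name"))
    # Father's name appearing anywhere in the mother's record text.
    father = (baby.get("father_name") or "").lower()
    if father and father in (mother.get("record_text") or "").lower():
        score += 1.0
    # Mother's city of birth agreeing between the two records.
    baby_city, mother_city = baby.get("mother_birth_city"), mother.get("birth_city")
    if baby_city and mother_city and baby_city.lower() == mother_city.lower():
        score += 1.0
    return score


def score_all_pairs(babies, mothers):
    """Score every (baby, mother) pairing; this is the expensive all-pairs step."""
    return {(b["id"], m["id"]): match_score(b, m) for b in babies for m in mothers}


# Toy records, invented for illustration only.
baby = {"id": "b1", "mother_first_name": "Ann", "mother_maiden_name": "Lee",
        "father_name": "Tom Ray", "mother_birth_city": "Tulsa"}
mothers = [
    {"id": "m1", "first_name": "Ann", "last_name": "Lee",
     "record_text": "Spouse: Tom Ray", "birth_city": "Tulsa"},
    {"id": "m2", "first_name": "Zoe", "last_name": "Kim",
     "record_text": "", "birth_city": "Omaha"},
]
scores = score_all_pairs([baby], mothers)
```

Keeping every pair's score, rather than only the top hit from a search, is what makes it possible to inspect the runners-up when a best match looks wrong.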

Finally, I added in a date comparison. If there was a large gap in time between the baby’s date of birth and the closest date to that date of birth for which the mother had a visit with an Outcome of Delivery, it was a safe guess that the mother was unlikely to have given birth to that baby. Some discrepancy—days or weeks—was to be expected, but months or years of distance between the mother’s Outcome of Delivery date and the baby’s date of birth made a match highly unlikely. I weighted the date distance accordingly.
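The date comparison could be sketched along these lines. The grace window of 14 days and the per-day penalty are invented values for illustration; the article does not state the weighting that was actually used.

```python
from datetime import date


def nearest_delivery_gap_days(baby_dob, delivery_dates):
    """Days between the baby's date of birth and the mother's closest
    Outcome of Delivery visit, or None if she has no such visit."""
    if not delivery_dates:
        return None
    return min(abs((d - baby_dob).days) for d in delivery_dates)


def date_penalty(gap_days, grace_days=14, per_day=0.01):
    """Penalty subtracted from a match score.

    Gaps within `grace_days` cost nothing; beyond that the penalty grows
    linearly. Both parameters are hypothetical tuning values.
    """
    return max(0, gap_days - grace_days) * per_day


# Toy dates, invented for illustration only.
gap = nearest_delivery_gap_days(date(2010, 5, 1), [date(2010, 5, 3), date(2012, 1, 1)])
penalty = date_penalty(gap)
```

Note that an uncapped penalty like this one can outweigh all of the name evidence when a recorded visit date is wrong, so the weight needs to be chosen with care.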

Twenty-seven hours of runtime later (n^2 calculations are expensive), it became apparent that some component of my date assumption was very, very wrong. In the simple case, the match calculation worked fine. But the date distance sometimes swamped the match signal; if the actual mother’s Outcome of Delivery POV date was very far from the baby’s date of birth, the best match would be someone whose name and other details didn’t match well at all but who gave birth in the hospital close to the baby’s date of birth. There were many reasons a mother might have a wildly inaccurate date listed for the visit with the Outcome of Delivery, data entry errors being the most likely (I thought). I could have simply adjusted the weighting on the date distance, but because of the long run time and my incomplete understanding of the problems around the dates, I shelved that part of the calculation.

Re-running the pairwise match calculation provided a single best-match mother for 14,136 of the 15,008 babies; the remaining 872 babies were ambiguous, having multiple mothers with (comparatively bad) best match scores. This was a fantastic result, and 100% of the babies in the random sample of 100 that I hand-verified were correctly matched to their mothers.
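One way to separate unique best matches from ambiguous ones, given a table of pairwise scores, is sketched below. The input shape (a dict keyed by `(baby_id, mother_id)`) and the rule that "ambiguous" means an exact tie for the top score are assumptions for this example.

```python
from collections import defaultdict


def best_matches(pair_scores):
    """Split babies into those with one top-scoring mother and those with ties.

    `pair_scores` maps (baby_id, mother_id) -> score. Returns two dicts:
    unique[baby_id] = mother_id, and ambiguous[baby_id] = [mother_id, ...].
    """
    by_baby = defaultdict(list)
    for (baby_id, mother_id), score in pair_scores.items():
        by_baby[baby_id].append((score, mother_id))

    unique, ambiguous = {}, {}
    for baby_id, scored in by_baby.items():
        top = max(score for score, _ in scored)
        # Exact equality is used for ties here; real scores may need a tolerance.
        tied = sorted(m for score, m in scored if score == top)
        if len(tied) == 1:
            unique[baby_id] = tied[0]
        else:
            ambiguous[baby_id] = tied
    return unique, ambiguous


# Toy scores, invented for illustration only.
pair_scores = {
    ("b1", "m1"): 4.0, ("b1", "m2"): 1.5,
    ("b2", "m1"): 0.5, ("b2", "m2"): 0.5,
}
unique, ambiguous = best_matches(pair_scores)
```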

While assembling the other components of the data set, however, I made a troubling discovery. For a given mother, the ratio of instances of the Outcome of Delivery ICD9 Code to babies matched to that mother should be 1:1 for singleton pregnancies—i.e., one Outcome of Delivery ICD9 code per actual delivery. Instead, I found that many mothers correctly matched with multiple singleton babies had fewer Outcome of Delivery ICD9 Codes than they had babies. That meant that some deliveries were not assigned Outcome of Delivery ICD9 Codes[1].
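The consistency check described above can be sketched as a simple count comparison. The inputs are hypothetical: `matches` maps each baby ID to its matched mother ID, and `delivery_visits` lists one mother ID per visit carrying an Outcome of Delivery code. The sketch also assumes every pregnancy is a singleton, which the real data would not guarantee.

```python
from collections import Counter


def mothers_missing_delivery_codes(matches, delivery_visits):
    """Find mothers with more matched babies than Outcome of Delivery visits.

    Returns {mother_id: (babies_matched, delivery_codes_found)} for each
    mother whose delivery-code count falls short (singletons assumed).
    """
    babies_per_mother = Counter(matches.values())
    codes_per_mother = Counter(delivery_visits)
    return {
        mother_id: (n_babies, codes_per_mother[mother_id])
        for mother_id, n_babies in babies_per_mother.items()
        if codes_per_mother[mother_id] < n_babies
    }


# Toy data, invented for illustration only.
matches = {"b1": "m1", "b2": "m1", "b3": "m2"}
delivery_visits = ["m1", "m2"]  # m1 has two babies but only one coded delivery
shortfalls = mothers_missing_delivery_codes(matches, delivery_visits)
```

Any mother returned by a check like this has at least one delivery that was coded some other way.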

That meant I hadn’t identified all the mothers. On the one hand, this was good news: I still had 872 babies who might be missing mothers, and until that point they had been a complete puzzle. On the other hand, how else could mothers who had given birth be identified?

It was past time to call in the experts. Patrick Myers, a neonatologist at Northwestern University Feinberg School of Medicine, provided the domain knowledge that I lacked. The Outcome of Delivery diagnosis codes would likely only be assigned in uncomplicated deliveries; any delivery with complications would be coded based on the complication. The same was true for babies; the V30-V39 diagnosis codes would often be missing in cases of babies requiring special treatment. Not only was I missing mothers, then—I was missing babies, too.

Dr. Myers generously offered to curate a list of ICD9 codes that might be used to indicate delivery for a mom or birth for a baby, which I will be incorporating into the next iteration of the mothers and babies identification. He also suggested additional layers of verification for mother-to-baby matches—babies almost invariably start out in the same room that their mother was in, for instance.

The project is ongoing; there’s likely a hundred more hours of work to bring the data set up to something I am comfortable handing over for research, but there’s much to be learned from the process so far:

  • Names are a terrible way to uniquely identify people. All data about people should be designed and collected with this in mind.
  • There’s no such thing as a “safe” assumption with human-entered data. All of your assumptions are wrong.
  • Consult with domain experts, early and often. Doing so will not only save you time, it will also reduce your general suffering.

With those takeaways in mind, may your own Möbius strips of edge-case data become much easier to unravel!

