Chapter 4. Deep Dive: Electronic Health Records Data
We have been spending a lot of time on foundational ideas, so I’m sure you are ready to start playing around with some data. In this chapter, we will dive into electronic health record data. As we discussed in Chapter 1, I use electronic health record and electronic medical record somewhat interchangeably. For the purpose of this book, EHR data is that data collected by hospitals and clinics during the course of transacting healthcare delivery. As such, some of the primary motivations are to capture data for billing, legal compliance, or communication with other caregivers. This is important to keep in mind since it can affect the quality and meaning of the data, depending on the use case.
Now, we’ll go back to a dataset that I introduced in Chapter 3 in the EHR case study. The MIMIC dataset is a publicly accessible dataset from a hospital intensive care unit. Though most simply refer to it as “the MIMIC data,” it is the Medical Information Mart for Intensive Care dataset. Though this dataset represents a specific subset of EHR data, it is real data and reflects many of the challenges we have as an industry when working with real-world data. I will also discuss the Synthea synthetic dataset though we won’t be working with it in any detail.
Before we get too deep into the MIMIC dataset, I’d like to take a step back and look at publicly accessible data in general, particularly in the context of de-identification and anonymization.
Publicly Accessible Data
From regulatory requirements to privacy concerns to complexity of the data, there are many challenges in making data available to the general public. In the United States, one common regulatory hurdle is the Health Insurance Portability and Accountability Act (HIPAA), though many states also have their own requirements in addition to HIPAA (e.g., the California Consumer Privacy Act) that may or may not be healthcare specific. In the European Union, there is the General Data Protection Regulation (GDPR) in addition to country-specific regulations.
Even if organizations are fully compliant with legal requirements, there is still the risk of privacy breaches and the ensuing political and public relations challenges. For example, in 2006, Netflix released a de-identified dataset with which researchers were able to subsequently re-identify 68% of reviewers.1 Additionally, there are moral and ethical considerations to protecting the privacy of patients in the healthcare system.
In this section, we will discuss two datasets. One is a real dataset that has undergone de-identification. The other is a synthetic dataset generated in a variety of data models.
De-identification Versus Anonymization
De-identification, anonymization, and pseudonymization are different approaches to mitigating privacy concerns in datasets. While there are generally accepted definitions, these terms are also defined by statute, case law, and regulatory frameworks. So, it is important to be sure that you understand what exactly is required of each.
With that out of the way, let’s look at the MIMIC dataset in a bit more detail.
Medical Information Mart for Intensive Care
The MIMIC dataset is an NIH-funded project that started as a collaboration between Harvard and MIT. It began in 2003 as a multiparameter dataset of ICU patient data and was first publicly released in 2011 to the broader community.
Since then, it has gone through several major iterations and is currently available as MIMIC-IV. Each of the high-level releases (MIMIC-II, MIMIC-III, MIMIC-IV) is essentially a different dataset that happens to have overlapping patient populations. Within each release, there is additional versioning—the latest version of MIMIC-III is v1.4, while the latest version of MIMIC-IV is v1.0.
The MIMIC data is a great source of data and contains everything from demographics to medications to lab tests to free-text clinical notes. In some cases, there are associated datasets of electrocardiograms (ECG/EKG) and other bedside monitoring and physiological waveform data. The MIMIC data has been popular with data scientists looking to develop digital biomarkers, create clinical natural language processing algorithms, and even conduct more traditional clinical research.
For this book, we will be working with MIMIC-III since MIMIC-IV was only recently released. I will provide a short overview of the schema and of a few tables we will reference in this book. For more information, there is fairly comprehensive documentation available on the MIMIC website.
MIMIC-III schema
The MIMIC-III schema is fairly straightforward and reflects the delivery of inpatient care. In this section, we will talk through several tables, noting some key considerations for those who may be less familiar with inpatient patient care.
Patients
The MIMIC-III dataset is a patient-level dataset, and it is easiest to anchor our discussion around the PATIENTS table. Every unique patient within the dataset has an entry in PATIENTS, which also contains basic demographics such as date of birth/death and genotypical sex (though referred to as "gender") of the patient. One thing to note is that there are multiple dates of death that are tracked—the date from the hospital system and the date from the Social Security database. There is also a generic date of death column that merges the two dates, giving priority to the hospital system.
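If you want to verify that merging behavior yourself, the following is a minimal pandas sketch (my own, not part of the official MIMIC code) that reproduces the precedence rule from the demo PATIENTS.csv file; I am assuming the demo CSVs use lowercase column names (dod, dod_hosp, dod_ssn), so adjust if your copy differs.

```python
import pandas as pd

# Assumes the demo PATIENTS.csv with lowercase column names (dod, dod_hosp, dod_ssn).
patients = pd.read_csv("PATIENTS.csv", dtype=str)

# Recreate the merged date of death: the hospital date takes priority,
# falling back to the Social Security date when the hospital date is missing.
merged_dod = patients["dod_hosp"].fillna(patients["dod_ssn"])

# Compare against the precomputed dod column to confirm the precedence rule.
mismatches = patients[merged_dod.notna() & (merged_dod != patients["dod"])]
print(f"{len(mismatches)} rows differ from the provided dod column")
```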
Admissions
Next, each patient is connected to rows within the ADMISSIONS table, one row per hospital admission. This table is sourced from what is typically referred to as the ADT system, or the admission, discharge, and transfer system. This includes the date/time a patient was admitted to and discharged from the hospital and the location of admission/discharge (e.g., they were admitted via the emergency department but discharged to a skilled nursing facility or to home health). Other information typically considered demographic, such as religion, language, and marital status, is also contained within this table. As a result, it is important to note that a patient might be admitted to the hospital multiple times and have differing data for each of these fields.
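For example, a quick pandas sketch (my own, not from the MIMIC documentation) can surface patients whose recorded marital status changes from one admission to the next:

```python
import pandas as pd

# Assumes the demo ADMISSIONS.csv with lowercase column names.
admissions = pd.read_csv("ADMISSIONS.csv", dtype=str)

# Count the distinct marital_status values recorded for each patient
# across all of their admissions.
status_counts = admissions.groupby("subject_id")["marital_status"].nunique()

inconsistent = status_counts[status_counts > 1]
print(f"{len(inconsistent)} patients have more than one marital status recorded")
```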
ICU stays
One thing to note is that while admitted to the hospital, a patient may be admitted to the intensive care unit multiple times. For example, a patient may be on the medical-surgical floor of the hospital, develop sepsis, be moved to the ICU, improve and be moved back to their original unit, and then deteriorate and be moved to the ICU again. In this case, there is a single hospital admission with multiple ICU stays. Each ICU stay is an entry in the ICUSTAYS table. Similar to the ADMISSIONS table, there are timestamps indicating when the patient was transferred to/from the ICU.

As with most hospitals, there are multiple types of ICUs, such as the medical ICU, surgical ICU, and coronary care unit. Patients may also be transferred from one type of ICU to another; this table tracks the first and last units, with additional details of transfers captured in the TRANSFERS table. Abbreviations and descriptions of the different care units can be found in Table 4-1.
Abbreviation | Description
---|---
CCU | Coronary care unit
CSRU | Cardiac surgery recovery unit
MICU | Medical intensive care unit
NICU | Neonatal intensive care unit
NWARD | Neonatal ward
SICU | Surgical intensive care unit
TSICU | Trauma/surgical intensive care unit
Transfers
The TRANSFERS table sources its data from the ADT system and contains a listing of each transfer of a patient between care units. While the ICUSTAYS table tracks the first and last units, intermediate units can be found within this table. Within the dataset, if a patient is transferred to an ICU within 24 hours of a previous ICU stay, it is considered to be the same stay, thus resulting in only a single entry in the ICUSTAYS table.
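If you want to sanity-check this rule, here is a small pandas sketch (assuming the demo ICUSTAYS.csv with lowercase hadm_id, intime, and outtime columns) that computes the gap between consecutive ICU stays within the same admission:

```python
import pandas as pd

# Assumes the demo ICUSTAYS.csv with lowercase column names.
icustays = pd.read_csv("ICUSTAYS.csv", parse_dates=["intime", "outtime"])
icustays = icustays.sort_values(["hadm_id", "intime"])

# Hours between the end of one ICU stay and the start of the next stay
# within the same hospital admission.
icustays["gap_hours"] = (
    icustays["intime"] - icustays.groupby("hadm_id")["outtime"].shift()
).dt.total_seconds() / 3600

# Per the merging rule, gaps under 24 hours should not show up as separate stays.
print(icustays.loc[icustays["gap_hours"].notna(), "gap_hours"].describe())
```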
One point of note is that data for patients from 2001 through 2008 is sourced from the CareVue system, while data from patients from 2008 through 2012 is sourced from the MetaVision system. These are two different systems for tracking ICU patients and may have an impact on the underlying data (e.g., querying data regarding input events).2
Prescriptions
Now that we have a bit of understanding of the tables that contain more general data on patients and their stays in the hospital, let’s dive a bit into the more clinically focused tables. Since we will spend some time discussing EHR versus claims data, one obvious point of focus is medication/prescription data.
The PRESCRIPTIONS table contains medication orders that are entered within the hospital's order entry system (also known as the computerized provider order entry, or CPOE, system). Of course, there are typical columns you would expect, such as timestamps for starting or discontinuing the medication, or the ICU stay or hospital admission with which the order is associated.

The more interesting columns, however, are those that capture the medication name, dosage information, and links back to standard coding systems. For example, the PROD_STRENGTH column contains a free-text string that captures information typically seen on product packaging (e.g., "25mcg/0.5mL Vial" or "10mg Suppository"). There are also columns DOSE_VAL_RX and DOSE_UNIT_RX that break this information down for easier computation. Another interesting column is ROUTE, which captures the route of administration (e.g., "PO" for oral medications, or "SC" for injections under the skin).
Dose Versus Dosage
Dose is the amount of a medication taken at one time. For example, a common dose for ibuprofen is 200mg, though it may be available with a prescription as an 800mg dose.
Dosage contains more information such as the amount, number of pills, frequency, and duration.
One of the challenges we often have working with data such as medications is linking it to other data such as claims. This table also contains NDC and GSN codes to facilitate these linkages. NDC codes are national drug codes that are used in the United States. One point of note is that the NDC code is a combination of labeler (manufacturer, repacker, relabeler, or distributor), product information (strength, dosage, and formulation), and package information (size and type of packaging).
As a result, the NDC code often provides too much granularity for many use cases. For example, say we are trying to link our EHR dataset with the FDA’s Adverse Event Reporting System (FAERS) to see if certain combinations of drugs correlate with length of stay or transfers to the ICU, with a particular focus on acetaminophen (Tylenol).
Acetaminophen is found as an active ingredient in many drugs and, consequently, will be buried among many NDC codes. This is where something such as the UMLS can come in handy. While not perfect, it contains many mappings and relationships that can help us find all NDC codes that contain acetaminophen as an active ingredient; then we can link this set of codes against the NDC codes contained within this table.
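As a sketch of that workflow, assume we have already used the UMLS to produce a file of acetaminophen-containing NDC codes (the file name acetaminophen_ndcs.csv below is hypothetical); the linkage itself then becomes a simple set intersection:

```python
import pandas as pd

# Hypothetical file of NDC codes for products containing acetaminophen,
# e.g., derived from UMLS/RxNorm mappings as described above.
acetaminophen_ndcs = set(pd.read_csv("acetaminophen_ndcs.csv", dtype=str)["ndc"])

prescriptions = pd.read_csv("PRESCRIPTIONS.csv", dtype=str)

# Keep only prescriptions whose NDC appears in the ingredient-level set.
acetaminophen_rx = prescriptions[prescriptions["ndc"].isin(acetaminophen_ndcs)]
print(f"{len(acetaminophen_rx)} prescription rows contain acetaminophen")
```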
Procedure events
Similar to the PRESCRIPTIONS table, the dataset also tracks procedures that are ordered for patients. Although there are two different systems (CareVue and MetaVision), there is only a single table containing procedure events, the PROCEDUREEVENTS_MV table.

This table contains a listing of a variety of procedures that a patient might undergo, ranging from starting an IV to imaging to dressing changes. It is structured to contain all procedures for all patients across all admissions and ICU stays. As a result, it relies heavily on primary-foreign key relationships with the D_ITEMS table.
There are also quite a few fields to provide additional context and information regarding a specific procedure. While some are generally self-explanatory (e.g., start/end time), most require some knowledge of clinical practice or anatomy and physiology.
D_ITEMS
There are several tables that begin with D_, one of which is D_ITEMS. These are known as definition tables and store definitions of sets of codes (though not necessarily standardized codes). D_ITEMS stores definitions for items in either the CareVue or MetaVision databases; the DBSOURCE column contains a string that indicates which system.

Because this table is linked to a variety of different event tables, it is a very generic table and contains two columns (UNITNAME and PARAM_TYPE) that are used to interpret the specific values in the corresponding event table. In some cases, the unit name is also captured in the linked table (e.g., the VALUEUOM column in PROCEDUREEVENTS_MV also captures the unit of measure).
One point of note is that while this table may seem like it contains a controlled vocabulary, it most certainly does not! There might be multiple items (i.e., rows) within this table that capture the same concept but with spelling mistakes or synonyms. This occurs because the data is sourced from two different systems (CareVue versus MetaVision) and also as a result of free-text entry in CareVue.
This will be one area where graph databases shine as we work to link synonymous or equivalent concepts across the dataset. For example, the items with ITEMIDs of 211 and 220045 both represent “heart rate” as taken from CareVue and MetaVision. In a SQL-based system, linking these would be done at the level of the SQL query. This requires that the data engineers and scientists track and document this within their queries. In a graph-based system, we can easily link these with an edge that captures synonymy or equivalence so that others can benefit from the mapping as well.
D_CPT, D_ICD_DIAGNOSES, D_ICD_PROCEDURES, and D_LABITEMS
As with D_ITEMS, the D_CPT, D_ICD_DIAGNOSES, D_ICD_PROCEDURES, and D_LABITEMS tables all contain definitions that are referenced in other tables. However, they all contain definitions that are aligned with controlled vocabularies. The columns within each table are tailored to the nuances that come with the associated controlled vocabulary.

The D_CPT table contains definitions for items that are part of the Current Procedural Terminology (CPT), which is a set of codes maintained by the American Medical Association. CPT codes are primarily used as part of the billing and claims process between healthcare providers and payers (both public and private).

The D_ICD_DIAGNOSES table contains ICD-9 (not ICD-10) codes related to patient diagnoses. Given that the MIMIC-III dataset contains data only from 2001–2012, it does not contain any ICD-10 codes since the mandate to move from ICD-9 to ICD-10 did not take effect until 2015.

Similar to D_ICD_DIAGNOSES, the D_ICD_PROCEDURES table contains ICD-9 codes related to hospital procedures pre-2015.
CPT Versus ICD-9 Procedure Codes
While both CPT and ICD-9 (and ICD-10) procedure codes capture medical procedures, you may be wondering what the difference is.
CPT codes are primarily used to bill for physician services, while ICD procedure codes are used to bill for hospital services. While not commonly known, physician services and hospital services are billed separately and use different sets of codes to manage the reimbursement process.
If you’ve ever dealt with the claims process for a hospital admission, this is one reason it gets so confusing! You get a series of different bills because no one organization is behind all of the services that are provided. One exception to this is the integrated health system where the people (physicians, nurses, pharmacists, etc.) and the hospital itself are all part of the same organization.
As a point of note, while there was an industry shift (in the United States) to move to ICD-10 for billing and reimbursement in 2015, there was a window of time where many institutions were performing “dual coding” as part of the transition. During this time, organizations often used both ICD-9 and ICD-10 codes simultaneously.
However, not all datasets that span the 2015 implementation date contain both sets of codes. When working with datasets that span the implementation date, either all pre-2015 codes need to be mapped to their ICD-10 equivalent or all post-2015 codes need to be mapped to their ICD-9 equivalent. This is often accomplished through the use of “crosswalks,” a table of mappings from one coding system to another.
Mapping ICD-9 and ICD-10
There are many more ICD-10 codes than ICD-9 codes, so it’s natural to assume that ICD-10 codes are more granular and detailed. For example, there may be two different codes to handle laterality (e.g., left arm versus right arm) of a particular diagnosis. The natural conclusion is that a particular ICD-9 code maps to one or more ICD-10 codes (1:1 or 1:many mappings). However, this isn’t always true. Given the intended uses of ICD, there were some codes where multiple ICD-9 codes were compressed into a single ICD-10 code.
For example, both ICD-9 codes 493.00 (Extrinsic asthma, unspecified) and 493.10 (Intrinsic asthma, unspecified) get mapped to a single ICD-10 code J45.50 (Severe persistent asthma, uncomplicated). This is because ICD-9 asthma codes are organized by intrinsic versus extrinsic asthma, while ICD-10 asthma codes are organized by severity.3,4
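To make the mechanics concrete, here is a small, illustrative crosswalk sketch; the two mapping rows simply restate the asthma example above, and a real project would load the full CMS General Equivalence Mappings (GEMs) or another licensed crosswalk instead:

```python
import pandas as pd

# Illustrative crosswalk rows only; a real project would load the full
# CMS General Equivalence Mappings (GEMs) or another licensed crosswalk.
crosswalk = pd.DataFrame(
    [
        {"icd9": "493.00", "icd10": "J45.50"},
        {"icd9": "493.10", "icd10": "J45.50"},
    ]
)

# Hypothetical ICD-9-coded diagnosis rows from a pre-2015 dataset.
diagnoses = pd.DataFrame([{"subject_id": 1, "icd9_code": "493.00"}])

# Map each ICD-9 code to its ICD-10 equivalent; note that several ICD-9
# codes can collapse into the same ICD-10 code (a many-to-one mapping).
mapped = diagnoses.merge(crosswalk, left_on="icd9_code", right_on="icd9", how="left")
print(mapped[["subject_id", "icd9_code", "icd10"]])
```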
The last table we will cover in this section is D_LABITEMS, which contains definitions for laboratory and other observational tests and is linked to LOINC codes. However, while most concepts extracted from the hospital's lab system were mapped to LOINC codes, not all have been.
That was a whirlwind tour of the schema and was intended as more of an orientation than anything. The more time you spend with the data, the more intuitive it will become. Another element of experience is being able to recognize when a dataset will not work for a particular use case. Though it’s great to be excited when we get our hands on data, it is important for us to know of any limitations. While the MIMIC datasets are tremendously valuable, let’s spend a little bit of time looking at some of the limitations.
Limitations
The biggest limitation to the MIMIC-III dataset is that it is focused on patients who have had an ICU stay during their admission. As a result, there is a huge bias in the dataset toward higher-acuity patients (that is, patients who have more complex medical situations). So, any insights generated from this dataset will likely not apply to the majority of the population.
For example, let’s say we wanted to look at diabetics within this dataset and correlate blood sugar levels to hospital readmissions or adverse events involving insulin. It would be difficult to remove confounding factors such as interactions with in-patient medications such as sedatives that are not typically administered outside the hospital. So, as with any dataset, it is important to think through the potential biases; however, the biases within this dataset may be buried deep in the clinical context and not otherwise evident.
For example, this dataset is captured at an academic medical center that is affiliated with a top-tier medical school. Buried deep in the clinical workflow is a direct connection to cutting-edge research that may not make it to community hospitals for years. So, even if you are working on an inpatient analytics project and your patient populations match those within the MIMIC dataset, you need to further consider if your clinical practices are closely aligned enough.
Sometimes, you need a dataset that you can freely share and you are not interested in extracting any actual insights. Synthetic data is an active area of research and has significant potential to help us break through the privacy concerns that often limit researchers and innovators. In this next section, we will discuss Synthea, one of the first synthetic datasets that was made available to the public.
Synthea
Mainly used in the context of data engineering, another common dataset is the Synthea dataset. It is a fully synthetic dataset, so it does not contain any actual patient data. This makes it easy to work with the data, especially if you want to include data in a variety of formats or data models as part of unit and integration tests.
As a project, Synthea is an open source data simulator/generator. That is, it generates realistic healthcare data (as opposed to just creating data that may be syntactically correct but is semantically meaningless). However, there are a few pregenerated datasets that you can download from the Synthea website as well.
Synthea is a powerful framework that uses a modular approach for generating different aspects of healthcare data. This allows data engineers and scientists to tune the system to match local patient populations as necessary. Their documentation does a good job of providing examples and guides for creating your own modules.
As with any dataset, one of the first things we need to consider is the schema or data dictionary and how well it aligns with our needs. So, let’s take a look at the Synthea schema.
Schema
Unlike the MIMIC dataset, the data from SyntheticMass is available in a variety of data models/formats. This makes SyntheticMass a very rich source of realistic data, especially when testing data engineering and mapping pipelines. However, like any dataset, there are trade-offs between the various pregenerated datasets, so you may find that you need to create a custom dataset to meet your needs and requirements.
Some of the datasets are available in both CSV and FHIR formats, making them great options for testing data engineering pipelines that convert to/from FHIR. However, it is important to note that such tests are more about the technology of your pipeline. Given the synthetic nature, it won’t help you validate your data mappings or semantic harmonization tasks.
The CSV version of the dataset is a normalized, relational structure such that each CSV file could be loaded as a separate table in a relational database. Universally unique identifiers (UUIDs) are used to link across files.
There is one CSV file for each of the following:
- Allergies
- Careplans
- Claims
- Claims Transactions
- Conditions
- Devices
- Encounters
- Imaging Studies
- Immunizations
- Medications
- Observations
- Organizations
- Patients
- Payer Transitions
- Payers
- Procedures
- Providers
- Supplies
Within each CSV file, the first line is a header line that identifies each of the columns. For example, the conditions.csv
file contains the following header:
START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
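As a quick sketch of how those UUIDs tie the files together, the following joins conditions to patients with pandas; I am assuming the pregenerated CSVs use Id as the patient identifier column, so verify that against your download:

```python
import pandas as pd

conditions = pd.read_csv("conditions.csv", dtype=str)
patients = pd.read_csv("patients.csv", dtype=str)

# The PATIENT column in conditions.csv holds the patient's UUID; the patient
# file is assumed to use "Id" as its identifier column.
merged = conditions.merge(patients, left_on="PATIENT", right_on="Id", how="left")

# Number of condition rows per patient, keyed by UUID.
print(merged.groupby("PATIENT")["CODE"].count().head())
```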
In the FHIR version of the generated dataset, each JSON file corresponds to a single patient and follows the FHIR schema. We will discuss this later in this chapter in the section “Fast Healthcare Interoperability Resources”.
Though I mentioned some potential limitations of the Synthea data earlier, I wanted to quickly discuss the most obvious limitation when working with Synthea.
Limitations
Obviously, the main limitation to the Synthea dataset is that it’s not a real dataset. The modules used to generate different aspects are tunable and often reflect our current understanding (mainly, from observational research studies). However, this assumes that the underlying research used to tune the modules is accurate and reflective of the patient population we are trying to study. Additionally, because these observational studies tend to focus on descriptive statistics of observed variables, they may not reflect higher-order interactions between variables (especially if this higher-order interaction is not yet understood or even evident to researchers).
We have looked at the MIMIC-III and Synthea data, each containing its own data model. When every data source comes with its own data model, we struggle with adapting or rebuilding our data pipelines to account for the new data model. In this next section, we will dive into two commonly discussed data models when working with EHR data. Both of these attempt to work toward a common data model though one is focused on storing data for analytics, while the other is focused on the communication and transport of data.
Data Models
When working with data of any kind, one of the first considerations is the data model. Most of the time, as data scientists, we do not have much say in which data model is used since that decision is made much further upstream. In the case of healthcare data, the data model is often decided upon by the EHR manufacturer or the team that developed the clinical data warehouse. When we get access to datasets in the life sciences, we are often at the mercy of the data vendors and whatever data model they choose to use, frequently a proprietary one. Regardless, it is important that we look at the underlying goals or intention of the data model, which we will discuss next.
Goals
There is a slow but increasing trend to focus on the use of “common” data models. The idea is to create an ecosystem of tools and data sources that all operate on a single data model, minimizing the need to constantly reinvent the wheel. One of the best analogies I have heard is that we are all thirsty for insights, yet we are all digging our own wells to the same sources of data.
We will cover a couple examples of projects that are attempting to do this, but before we do, it’s important to note that there will never be a silver bullet. If you talk to most data and information architects (myself included), one of the first points they will make is that the model is dependent on the use case—where did the data come from and what sorts of questions are being asked of the data? Depending on the requirements of the project, sometimes the data model might depend on other requirements such as write latency (as often seen in transactional, relational data models).
The most common challenge when trying to create a common data model for healthcare data (particularly RWD) is the variation in what data need to be collected as we go from one disease area to another. For example, in oncology, we need to track a patient’s staging (of which there are already two options, TNM staging and group staging), the morphology and histology of the cancer, and the body part/location.
Cancer staging data is a common source of headache for those working with data in the oncology space. Right off the bat, there is the need to harmonize “group staging” and “TNM staging”, two different approaches to determining the stage of a patient’s cancer. Group staging is the most generally recognized and is just a single number (e.g., stage 1, 2, 3, …). It captures the presence of cancer, the size of the tumor, and how much it has spread to nearby tissue or other parts of the body.
Given our increasing understanding of cancers, another form of staging is called TNM staging:
- T: Primary tumor—the size or extent of the main tumor
- N: Regional lymph nodes—whether lymph nodes are involved and, if so, the number and location
- M: Metastases—whether the cancer has spread to other parts of the body
In addition to the different types of staging, there is the consideration of the AJCC staging guidelines for TNM staging. The AJCC is the American Joint Committee on Cancer and has developed cancer staging guidelines for use by clinicians and scientists. While the guidelines were originally intended for human consumption and historically published as editions, they are evolving to become more computable, including API access as well as a shift (as of 2021) away from publishing “editions” to releasing “versions.”
This doesn’t even include the latest developments in genetic sequencing and biomarker data! These are all data elements that are specific to solid tumors and not applicable to other disease areas. So, how do we create a common data model that would enable a hospital to work with data across all disease areas and medical specialties?
Despite the (sometimes overwhelming!) quantity of nuances within RWD, why is there still such a push for common data models? For example, we see academics, biopharma companies, and data vendors investing a significant amount of time and money into such efforts.5
While there are certainly a lot of nuances between cancer and neurology, or between descriptive statistics for evidence generation and predictive analytics models for decision support, there are also many areas of overlap. For example, regardless of the type of analytics, one of the first steps is to define a cohort of patients and perform some high-level exploratory analysis. Before spending the time and effort diving into a dataset, we first want to know roughly how many patients actually meet our inclusion/exclusion criteria or what the distributions of key variables and data elements might be.
So, perhaps we can use a common data model to help facilitate this early phase of analytics projects. Once the data scientists are happy with the cohort they have defined, they can extract the same subset of patients from the original dataset. This allows us to speed up the identification of the appropriate dataset and patient cohorts while also providing data scientists with the full depth and nuance available within the original dataset.
So, let’s take a look at two examples of common data models that are gaining traction across all levels of the healthcare industry.
Examples of Data Models
There are quite a few “common” data models out there, most coming from the academic side of healthcare. However, there is no single, authoritative body, so adoption is organic and bottom-up. We will cover two of the most commonly discussed data models (among many others), ones that are quickly gaining popularity across both academic medicine as well as on the industry side.
Observational Health Data Sciences and Informatics (OHDSI) OMOP Data Model
OHDSI (pronounced “odyssey”) is a collaborative that is centrally coordinated through a research group at Columbia University, though it has numerous collaborators and sponsors from industry. In addition to providing a community and forum, concrete artifacts include the specifications for a common data model (OMOP, pronounced “oh mop”), as well as tools for mapping and interacting with the data and a curated set of terminologies.
The OHDSI documentation is quite good, including third-party materials such as those provided by the European Health Data Evidence Network (EHDEN) Academy. My goal here isn’t to make you an expert in OMOP but to give you a basic introduction as we compare and contrast with FHIR and other approaches. Please check out the Book of OHDSI and other OHDSI resources if you would like to dive more deeply into the world of OHDSI.
We will discuss the high-level approach that OHDSI has taken with OMOP and considerations from our perspective as data engineers and data scientists, particularly as it relates to the different types of databases we are comparing.
Working with OMOP
The latest version of the OMOP CDM is v5.4 and was just released in September 2021. If you poke around the OMOP documentation, you will also see mention of v6.0. Although this is a “released” version, OHDSI specifically recommends that new projects continue to use the 5.x branch of the CDM. Version 6.0 makes some breaking changes, and the current set of released tools (as well as the community overall) have not yet begun to support v6.0.
As you can see in Figure 4-1, the CDM is patient-centric, with everything stemming from the Person table. Similar to the schema from the MIMIC dataset, patients are immediately associated with visits. Since OMOP is not specific to inpatient data, the idea of a visit can be either in- or outpatient. From there, there are tables to connect the patient to a variety of observational data ranging from conditions (i.e., diagnoses) to medications to procedures, etc.
One thing you will notice is that the tables are grouped into several categories:
- Clinical data
- Health system data
- Health economics data
- Derived elements
- Metadata
- Vocabularies
In addition to the clinical focus that we would expect, OMOP was also designed to integrate data from the broader healthcare ecosystem, particularly claims data as well as data on a single patient from multiple different providers.
Lastly, you will also notice that nearly every clinical table has connections to a Concept table. OMOP is closely aligned with the use of controlled vocabularies, particularly industry-standard terminologies such as SNOMED-CT and LOINC. Please see the Athena documentation for a complete list of supported terminologies. Because many terminologies support a richer set of relationships beyond simple hierarchy, the other vocabulary tables allow us to capture these relationships in a SQL format. This makes it extremely convenient for us to work with these relationships, embedding them directly within the SQL query (versus needing to interact with a separate terminology management service). The trade-off, however, is that the SQL queries themselves can become quite complex and difficult to read.
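To give a flavor of what that looks like, here is a hedged sketch of a descendant-expanding cohort query; the table and column names follow the OMOP CDM vocabulary tables, but the SQLite file name and the concept ID are placeholders you would substitute for your own environment:

```python
import sqlite3
import pandas as pd

# Sketch only: assumes an OMOP CDM instance loaded into a local SQLite file
# named omop.db (hypothetical). The join to concept_ancestor pulls in every
# descendant of the chosen concept, so the hierarchy lives in the query
# itself rather than in a separate terminology service.
conn = sqlite3.connect("omop.db")

ancestor_id = 201826  # placeholder; substitute the standard concept ID for your condition

query = """
SELECT co.person_id, c.concept_name
FROM condition_occurrence co
JOIN concept c
  ON c.concept_id = co.condition_concept_id
JOIN concept_ancestor ca
  ON ca.descendant_concept_id = co.condition_concept_id
WHERE ca.ancestor_concept_id = ?
"""

cohort = pd.read_sql_query(query, conn, params=(ancestor_id,))
print(cohort.head())
```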
Fast Healthcare Interoperability Resources
Another common data model that is gaining traction is FHIR (pronounced “fire”). That said, calling it a “data model” is a bit of a misnomer. Typically, when we talk about data models, we do so in the context of data storage and persistence, namely, in a database. FHIR was not designed to be a storage model. However, since it is available as both JSON and XML, many have started to save the JSON directly into a document-oriented database or any other database that natively supports JSON documents (e.g., PostgreSQL with JSON support).
FHIR was designed as an interoperability standard in order to facilitate the exchange of information between different clinical information systems. As a result, it is verbose and often duplicates data in multiple places rather than efficiently normalizing the data. On the other hand, as an interoperability standard, it provides some powerful features such as implementation guides and profiles.
Resources
At the core of FHIR is the resource, a definition of the structure and contents of a payload of data to be shared between two healthcare information systems. In many ways, a resource is similar to a table within a traditional database schema. For example, there is a Patient resource to which an Observation may be linked, similar to the Person and Observation tables within OMOP.
For each key-value pair within a resource, there is a rigorous definition of cardinality, data type, and flags that dictate how the resource is to be defined and used.
Resource definitions tend to be fairly generic and tend to capture the absolute minimum specificity. This allows FHIR to be a great starting point that users can extend and add additional restrictions. To do this, we will use profiles.
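To make the resource idea concrete, here is a minimal Patient resource expressed as a Python dictionary; the field names follow the base FHIR Patient definition, while the values are invented:

```python
import json

# A minimal, invented Patient resource; an Observation would refer back to it
# with a reference such as "Patient/example-patient-1".
patient = {
    "resourceType": "Patient",
    "id": "example-patient-1",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1970-01-01",
}

print(json.dumps(patient, indent=2))
```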
Profiles
So, when we look at FHIR resources, we see something akin to typical schema definitions of a data model. FHIR profiles are where we start to see FHIR differentiate itself as an interoperability and information exchange standard versus a data or information model.
Profiles are a way for specific implementations to create and publish a specific set of resource definitions that contains constraints or extensions to the base set of resources. The key feature within FHIR is that these profiles are published and consumable by systems in a computable manner. This allows implementations within electronic health records or other software systems to be able to process and, more importantly, validate data that conforms to particular profiles and resource definitions at runtime.
Implementation guides
Another key element of the FHIR approach is the implementation guide, essentially a set of rules and constraints about how specific resources/profiles are to be used within a particular context. While creating implementation guides is outside the scope of this book, there are tools to facilitate their creation, such as FHIR Shorthand (FSH, pronounced “fish”) and SUSHI (SUSHI Unshortens ShortHand Inputs).
Of course, OMOP and FHIR are not the only common data models out there. The following are several other ones that pop up often.
Other data models
There are many other data models that you might encounter as you work with RWD. Here are some that I have encountered or used in the past:
- Analysis Data Model (ADaM): The ADaM model comes from the Clinical Data Interchange Standards Consortium (CDISC) and sits downstream of SDTM. ADaM focuses on supporting the analysis process, again with a regulatory perspective. This includes facilitating replication of studies and traceability of results.
- Informatics for Integrating Biology and the Bedside (i2b2): The i2b2 data model is another research-focused data model that uses a star schema to represent facts and modifiers of those facts. This makes it an extremely flexible and powerful data model. The trade-off, however, is that it is quite complex and can be challenging for newcomers or organizations that lack the resources to effectively manage data in i2b2, though it is used by more than 200 institutions.
- The National Patient Centered Outcomes Research Network (PCORnet): As the name implies, the PCORnet data model is used to support patient-centered outcomes research. Similar to OMOP, PCORnet is intended to facilitate the consolidation of data within a single institution from a variety of clinical information systems and then enable comparison of queries and analyses across multiple institutions.
- Study Data Tabulation Model (SDTM): The SDTM model is another standard from CDISC and sits upstream of ADaM. It is one of the required data standards for submitting data to the FDA and other regulatory agencies. The intent is to facilitate the aggregation and warehousing of study data that will eventually be used as part of the regulatory submission process.
There are also plenty of proprietary data models that are controlled by vendors that produce datasets. Since those typically require licenses to the data and nondisclosure agreements, we can’t really discuss them.
Similar to terminologies or other data standards, the key consideration is what the intended use cases are and how to balance that against operational considerations. For example, it may make sense to adopt a particular data model that isn’t perfect, but it is also one that all of your collaborators are already using or one for which you already have tooling and infrastructure in place. No data model is perfect, and there will always be trade-offs. Some may be focused entirely on syntactic and structural considerations, while others attempt to handle some level of semantics.
Some are quick to say that they are using graphs as the data model. While this may be true in a very technical sense, even the use of graph databases requires that we clearly define some sort of schema and structure for our data beyond simply “graphs.” As we will see later in this chapter, even though we persist the data in a graph database, we are essentially maintaining the MIMIC-III data model.
Regardless of what data model you choose, there are always going to be limitations. The most obvious is that there will be some number of use cases that your data model simply doesn’t support or that would be too cumbersome. One common challenge (though not always obvious) is that translation of data from one data model to another naturally results in some amount of information loss or shift. For example, if working with the OMOP data model, it has very strict rules on what data to include or remove (e.g., as a result of missing values). As you consider what data models to use, it is important to engage your data science teams early in the process to conduct feasibility studies. This should generally involve translating a subset of your data (or all of it if you have the time and resources) to the new data model and then conducting the same analysis on both datasets. As you compare the results, you would need to determine what accounts for any differences. This can include the data getting filtered, bugs in the translation process, or variations in structured codes.
Now that we have a sense of data models, some that are closer to EHR data than others, let’s start our deep dive into EHR data in the context of a use case around medication data.
Case Study: Medications
As we discussed earlier, a common problem of working with medication data is the variety of coding systems out there, assuming that a coding system was used in the first place. For now, we will ignore the free-text medication problem where the medication names are just free-text strings. Instead, we will focus on the problem of linking data when a coding system such as NDC is being used.
For this case study, we will use the MIMIC-III dataset, focusing on the use of medication data within the PRESCRIPTIONS table. There is a column, ndc, that contains the NDC code corresponding to the string name provided in the column drug. As I discussed earlier, the NDC code is highly specific such that there would be a different code for the tablet form versus the capsule form of a drug. There would even be a different code for the 12-pack versus the 100-pill bottle. So, as data scientists, how do we analyze this data such that we can ask broad questions about an active ingredient?
One solution would be to include every possible NDC code within a SQL query as we extract the data. This creates a problem when managing our queries, especially if new data or new mappings are added in the future. This also makes it harder for data scientists and engineers to interact with the queries because they would be dealing with extremely long lists of seemingly random numbers.
Another solution would be to map all of these together during the ETL phase when loading the data into our analytics system. The challenge with this approach is that we assume all requirements are known during the ETL phase. But, what happens if we decide the drugs are equivalent after we have already ETL’ed all the data? We could default back to embedding this in the SQL query, but now we need to manage this knowledge of equivalent drugs in two different places.
The Medication Harmonization Problem
Heparin is a commonly used medication within the inpatient setting since it helps reduce blood clotting in a variety of situations (e.g., following surgery). That said, if the patient is already on another medication that also results in reduced clotting, such as aspirin and other nonsteroidal anti-inflammatory drugs (NSAIDs), this could result in a dangerous situation. Of course, clinicians are aware of this potential interaction between medications. However, we have been tasked to create a decision support tool that will help flag potentially dangerous situations.
From a high level, we will need to do a few things with the data:
- Find all instances of heparin and determine all of the ways it may be captured or represented within the dataset.
- Determine all other anticoagulants (or medications that reduce clotting factors).

With the MIMIC dataset, we can do one of two things (or some combination thereof):

- Search the drug, drug_name_poe, and drug_name_generic columns within the PRESCRIPTIONS table for case-insensitive “heparin.”
- Look up all the NDC codes for formulations that contain heparin.
While these might be the simplest (and fastest) ways to get some immediate results, they come with some major trade-offs. If we just search through the few columns that contain free-text strings, we will naturally overfit to the data that we currently have access to. Can we be sure that future data will conform—what if there are misspellings or abbreviations?
More critically, the string “heparin” might also match “heparin flush,” which is used to flush IV catheters to help prevent clots from forming in the tubing and catheter. Heparin flushes are not used to treat patients and will likely be excluded from most analyses looking at patients being prescribed multiple anticoagulants. To address this, we might consider excluding mentions of “flush” though we are now getting into the realm of increasingly complex (and fragile) data pipelines.
The latter option is the cleaner of the two but, as we discussed earlier, imparts some implicit clinical knowledge into your SQL query and analysis code. For many use cases, this may not be a problem but runs counter to the idea of separation of concerns.6 Do we really want to be in a situation where we need to update our analytics code every time there is an update to the NDC database, or a clarification of the scientific question?
Given the complexity of biomedical data and especially in combination with clinical care, it’s important that we build robust data pipelines. However, it also highlights that we need to have a good communication channel between our data teams and our clinical teams.
This whole discussion highlights one of the main differences in perspective when working with data. If a clinician data scientist (i.e., someone with dual training as a clinician and a data scientist) were to approach the previous problem, they would immediately realize the differences between “heparin” and “heparin flush.” So, for simple analyses and with the appropriate training, this may be the simplest approach.
However, as organizations scale their approach to data science, it would be very difficult to staff an entire team of clinician data scientists. If we start to integrate data scientists who lack clinical experience, how do we ensure that such nuances are captured during analysis? Additionally, if we are building more complex analytics (e.g., machine learning), how do we ensure that future data pipelines (e.g., in the context of ML ops) can account for these nuances?
This is where proper data engineering at the semantic level is critical! We need ways to scale the organization’s approach to such complex problems while also ensuring that we capture the subtle nuance that can make or break a machine learning model. And, more critically, how do we ensure that production deployments of such models account for these nuances?
While we won’t answer these questions directly, in the next section, we jump into loading the MIMIC data into a variety of databases and then query that data in the context of our heparin example. The approaches we will discuss are tools for your toolbox, a toolbox that will help you design data pipelines that mitigate many of these concerns.
Technical Deep Dive
In this section, we explore the process of:
- Loading MIMIC data into a database
- Handling some basic harmonization tasks
- Extracting/querying the data from the database
We will walk through each of the previous using a variety of databases—SQL, property graph, and hypergraph. The SQL database is a pretty straightforward load using the existing schema of the MIMIC dataset. For the NoSQL databases, we keep the schema fairly simple while focusing on the key opportunities and challenges of the particular database. The schemas presented have not been optimized for any particular use case. So, while you are free to adapt them to your projects, please keep in mind that your project will likely need a different schema.
We will be using the MIMIC-III Clinical Database Demo, a subset of the full MIMIC-III database containing 100 patients. Access to the full dataset requires accepting a data use agreement (DUA).
The code in this section assumes that you can run the necessary containers in Docker and that you have configured your environment appropriately. Our first stop is to look at MIMIC-III in a relational database.
Relational database with SQLite
If you are using PostgreSQL, I recommend using the repo available here since we won’t be loading the MIMIC data any differently than using the scripts provided by the MIMIC team. My fork contains some very minor changes from the upstream version. Namely, it sets several environment variables given the directory structure used in the book.
To start the containers, you just need to run ./build.sh
from the
terminal. The script will start the containers, wait for them to
initialize, and then start the process of copying the MIMIC data into its
respective tables. This will start both the database server as well as
the web interface. There is also a GraphQL implementation.
That said, SQLite is a very lightweight SQL database that uses a single file to store and manage the database. It does not require setting up a database server or fiddling with network settings and IP addresses. Though many developers use SQLite because it is easy to embed within applications, it is also great for small experiments or working locally on your machine. So, for this section, we will walk through the heparin example by loading the MIMIC data into SQLite.
Importing data into SQLite
Since this is a very small dataset, to keep things simple we will not be dealing with foreign key constraints or indexes. We will simply import each CSV file as a different table using some Python code, as shown in Example 4-1.
Example 4-1. Importing MIMIC-III files into SQLite
```python
import os
import pandas as pd
import numpy as np
import sqlite3

mimic_path = "mimic-iii-clinical-database-demo-1.4"
# In case we are using ~ for path
mimic_path = os.path.expanduser(mimic_path)

db_path = "mimic-iii.db"
db_path = os.path.expanduser(db_path)
conn = sqlite3.connect(db_path)

csv_files = [f for f in os.listdir(mimic_path) if f.endswith(".csv")]
# Sort to make it easier to debug since tables are in predictable order
csv_files.sort()
print(f"Found {len(csv_files)} files, expected 26")

for f in csv_files:
    path = os.path.join(mimic_path, f)
    table_name = f.split(".")[0]
    print(f"Importing: {path} into table: {table_name}")

    data = pd.read_csv(path, dtype=str)
    data.to_sql(table_name, conn, if_exists="replace")
```
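Once the loop finishes, a quick sanity check is to list the tables that were created and spot-check a row count; this short snippet simply reuses the conn object (and pandas import) from Example 4-1:

```python
# Reuses `conn` and pandas (pd) from Example 4-1.
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn
)
print(tables)

# Spot-check that the PRESCRIPTIONS table made it in.
print(pd.read_sql_query("SELECT count(*) AS n FROM PRESCRIPTIONS", conn))
```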
Querying data in SQLite
Now that we have data loaded, let’s go ahead and query the database in the context of our heparin example. Example 4-2 contains a simple SQL query that does a case-insensitive search for drugs that contain “heparin” within the drug, drug_name_generic, or drug_name_poe columns. SQLite’s LIKE is case-insensitive by default; if your build is configured otherwise, simply run PRAGMA case_sensitive_like=OFF; in the SQLite command-line utility.
Example 4-2. Querying PRESCRIPTIONS for “heparin”
```sql
SELECT count(*) FROM `PRESCRIPTIONS`
WHERE drug LIKE "%heparin%"
OR drug_name_generic LIKE "%heparin%"
OR drug_name_poe LIKE "%heparin%"
```
However, as noted, this will return all drugs containing heparin, both those used therapeutically as well as to maintain IVs and catheters. To investigate this further, Example 4-3 gives us a distinct list.
Example 4-3. Distinct entries for heparin
```sql
SELECT DISTINCT ndc, drug FROM `PRESCRIPTIONS`
WHERE drug LIKE "%heparin%"
OR drug_name_generic LIKE "%heparin%"
OR drug_name_poe LIKE "%heparin%"
ORDER BY ndc
```
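If you prefer to stay in Python rather than the SQLite command-line utility, the same query can be run through pandas; this sketch again reuses the conn object from Example 4-1:

```python
distinct_query = """
SELECT DISTINCT ndc, drug FROM PRESCRIPTIONS
WHERE drug LIKE '%heparin%'
   OR drug_name_generic LIKE '%heparin%'
   OR drug_name_poe LIKE '%heparin%'
ORDER BY ndc
"""

# Reuses `conn` and pandas (pd) from Example 4-1.
heparin_entries = pd.read_sql_query(distinct_query, conn)
print(heparin_entries)
```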
Harmonizing data with SQLite
We can see there are 11 entries of interest, many with the same name but all with different NDCs. Remember that NDCs are very granular and change with different types of packaging, vendors, size/volume, etc. Example 4-4 shows a listing of the NDC and drug name, making it easier for us to decide which values we want to embed in our next query.
Example 4-4. Distinct entries for heparin (results)
```
+-------------+--------------------------------------+
| ndc         | drug                                 |
+-------------+--------------------------------------+
| 63323026201 | Heparin                              |
| 00641040025 | Heparin                              |
| 17191003500 | Heparin CRRT                         |
| 17191003500 | Heparin Flush                        |
| 08290036005 | Heparin Flush (10 units/ml)          |
| 00409115170 | Heparin Flush (10 units/ml)          |
| 64253033335 | Heparin Flush (100 units/ml)         |
| 63323026201 | Heparin Flush (5000 Units/mL)        |
| 00641040025 | Heparin Flush CRRT (5000 Units/mL)   |
| 63323026201 | Heparin Flush CRRT (5000 Units/mL)   |
| 64253033335 | Heparin Flush CVL (100 units/ml)     |
| 64253033335 | Heparin Flush Hickman (100 units/ml) |
| 64253033335 | Heparin Flush PICC (100 units/ml)    |
| 00409115170 | Heparin Lock Flush                   |
| 00338055002 | Heparin Sodium                       |
| 00074779362 | Heparin Sodium                       |
| 00264958720 | Heparin Sodium                       |
| 00409779362 | Heparin Sodium                       |
| 63323054207 | Heparin Sodium                       |
+-------------+--------------------------------------+
```
Based on these results, we decide we want to keep seven of the entries:
63323026201 00641040025 00338055002 00074779362 00264958720 00409779362 63323054207
So, as we see in Example 4-5, we now update our query to only pull prescriptions containing these NDC codes instead of doing a string-based search.
Example 4-5. Prescriptions by NDC
```sql
SELECT ndc, drug FROM `PRESCRIPTIONS`
WHERE ndc in (
    "63323026201",
    "00641040025",
    "00338055002",
    "00074779362",
    "00264958720",
    "00409779362",
    "63323054207"
)
```
We can now use this query within nested SELECT statements or in JOIN statements to analyze patients, admissions, or other items of interest that focus solely on heparin prescriptions that were not simply flushes. As you can see, we would need to carry this list throughout our code and different queries. What if we made a mistake and need to add or remove another NDC from this list? It would become quite problematic to manage and maintain, particularly if we end up sharing this code with other data scientists.
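For example, a join against ADMISSIONS to pull the hospital stays associated with therapeutic heparin orders might look like the following sketch (again reusing conn and pandas from Example 4-1); notice that the hardcoded NDC list has already started to spread into a second query:

```python
join_query = """
SELECT a.hadm_id, a.admittime, a.dischtime, p.ndc, p.drug
FROM PRESCRIPTIONS p
JOIN ADMISSIONS a ON a.hadm_id = p.hadm_id
WHERE p.ndc IN (
    '63323026201', '00641040025', '00338055002', '00074779362',
    '00264958720', '00409779362', '63323054207'
)
"""

# Reuses `conn` and pandas (pd) from Example 4-1.
heparin_admissions = pd.read_sql_query(join_query, conn)
print(heparin_admissions.head())
```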
In the next section, we will walk through the same example but using a graph-based approach instead. This will highlight how we can use the database itself to mitigate the need to copy and paste a list of NDC codes across a bunch of different analyses/projects.
Property graph with Neo4j
Using graphs to harmonize data is where the magic and fun happens! One of the key benefits of a graph database is the ability to connect/join individual data points. For our particular use case, we will define a new concept, representing medications that contain heparin but that are used to treat conditions (thus excluding flushes). Once we have created this concept, we can use it in future queries. This is something that we can’t easily do in a relational database—create nodes that we will connect to the actual prescription data within the MIMIC dataset.
One thing to note is that if we were to do this at scale (creating hundreds or thousands of these new concepts), we would want to add some properties to the new nodes so that we can track and manage them. For example, we might include a description of the concept and why we created it, a date/timestamp of when it was created and last modified, and tags or other metadata to help organize the concepts. Otherwise, you might end up with a pile of concepts with no way to manage and maintain them.
Unfortunately, Neo4j doesn’t neatly embed into our Python code like SQLite did. So, the first thing we need to do is create a Docker instance of our database. We can use the same command to start a Docker container as we used in Example 3-9, or we can just reuse the same container.
Importing data into Neo4j
To load the data into Neo4j, we need to follow pretty much the same process as with SQLite. Example 4-6 should look quite familiar since it’s nearly identical to what we did before except we are now switching from the sqlite3 Python module to Neointerface. It is a wrapper of the Neo4j Python library and also provides functionality around pandas dataframes.
Example 4-6. Importing MIMIC-III into Neo4j
```python
import os
import pandas as pd
import numpy as np
import neointerface

mimic_path = "mimic-iii-clinical-database-demo-1.4"
# In case we are using ~ for path
mimic_path = os.path.expanduser(mimic_path)

db = neointerface.NeoInterface(
    host="bolt://localhost",
    credentials=("neo4j", "test"),
    apoc=True
)

csv_files = [f for f in os.listdir(mimic_path) if f.endswith(".csv")]
# Sort to make it easier to debug since tables are in predictable order
csv_files.sort()
print(f"Found {len(csv_files)} files, expected 26")

for f in csv_files:
    path = os.path.join(mimic_path, f)
    table_name = f.split(".")[0]
    print(f"Importing: {path} into table: {table_name}")

    data = pd.read_csv(path, dtype=str)
    data = data.replace({np.nan: None})
    # This is the only line different compared to SQLite
    db.load_df(data, table_name)
```
Optimizing Imports
For the MIMIC-III demo data, the code in Example 4-6 works well enough and takes only a minute or two to load the data. If you are attempting to load the entire MIMIC-III dataset, you will want to follow an approach similar to what we used for loading the UMLS in Chapter 3 since the previous code will take quite a long time.
Now that we have the base concepts loaded, we will want to connect all of the nodes per the MIMIC-III schema. While there are certainly opportunities to optimize the overall data model, we will stick to replicating the existing structure we find within the MIMIC schema to keep things consistent and simple. In Example 4-7, I’ve created a single dictionary that contains the relationships we will be creating. I’ve trimmed it for the book, but the complete version is available in the associated repo.
Example 4-7. MIMIC-III relationships in Neo4j
relationships = {
    ('PATIENTS', 'subject_id'): [
        ('has_admission', 'ADMISSIONS'),
        ('has_callout', 'CALLOUT'),
        ('has_chartevent', 'CHARTEVENTS'),
        ('has_cptevent', 'CPTEVENTS'),
        ('has_datetimeevent', 'DATETIMEEVENTS'),
        ('has_diagnoses_icd', 'DIAGNOSES_ICD'),
        ('has_drgcode', 'DRGCODES'),
        ('has_icustay', 'ICUSTAYS'),
        ('has_inputevents_cv', 'INPUTEVENTS_CV'),
        ('has_inputevents_mv', 'INPUTEVENTS_MV'),
        ('has_labevent', 'LABEVENTS'),
        ('has_microbiologyevent', 'MICROBIOLOGYEVENTS'),
        #('has_noteevent', 'NOTEEVENTS'),
        ('has_outputevent', 'OUTPUTEVENTS'),
        ('has_prescription', 'PRESCRIPTIONS'),
        ('has_procedureevents_mv', 'PROCEDUREEVENTS_MV'),
        ('has_procedures_icd', 'PROCEDURES_ICD'),
        ('has_service', 'SERVICES'),
        ('has_transfer', 'TRANSFERS')
    ],
    ...
    ('MICROBIOLOGYEVENTS', 'spec_itemid'): [
        ('specimen_item', 'D_ITEMS', 'itemid')
    ],
    ('MICROBIOLOGYEVENTS', 'org_itemid'): [
        ('organism_item', 'D_ITEMS', 'itemid')
    ],
    ('MICROBIOLOGYEVENTS', 'ab_itemid'): [
        ('antibody_item', 'D_ITEMS', 'itemid')
    ],
    ('OUTPUTEVENTS', 'itemid'): [
        ('item', 'D_ITEMS')
    ],
    ('PROCEDUREEVENTS_MV', 'itemid'): [
        ('item', 'D_ITEMS')
    ]
}
For this, I set up the dictionary such that the keys are tuples consisting of the source node label and the attribute within the source node: (SOURCE_NODE_LABEL, ATTR). The values are lists of all the relationships from that source node, as tuples consisting of the relationship name and the target node label: (RELATIONSHIP, TARGET_NODE_LABEL). It is assumed that the name of the attribute to be matched in the target node is the same as in the source node. As you can see with the microbiology events, there is sometimes a third item in the relationship tuple; in that situation, the third item is the attribute/property name within the target node.
So, the very first relationship we will be creating is (PATIENTS)-[has_admission]->(ADMISSIONS), "joined" by the subject_id property. Similarly, for the microbiology events, we have the relationship (MICROBIOLOGYEVENTS)-[specimen_item]->(D_ITEMS), "joined" by spec_itemid in MICROBIOLOGYEVENTS and itemid in D_ITEMS.
To execute this and actually create the relationships, I have created a simple function that constructs and executes the query against the database as shown in Example 4-8.
Example 4-8. Function for creating Neo4j relationships
def merge_relationship(source, sattr, rel, target, tattr):
    print(f"Creating: ({source}:{sattr})-[{rel}]->({target}:{tattr})")
    query = f"""
    CALL apoc.periodic.iterate(
        "MATCH (s:{source}), (t:{target}) WHERE s.{sattr} = t.{tattr} RETURN s, t",
        "MERGE (s)-[r:{rel}]->(t) RETURN s,r,t",
        {{batchSize:1000, parallel: true}}
    )
    """
    print(query)
    result = db.query(query)
As you can see, I am using one of the APOC functions, apoc.periodic.iterate, which requires that we break the query into two parts—the query and the action. We also specify the batch size to avoid building up one massive transaction; we will end up creating several million edges/relationships, and trying to do that in a single transaction can be problematic. Lastly, we also ask Neo4j to parallelize the operation using a Java ThreadPoolExecutor.
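To actually create all of the edges, we still need a small loop that walks the relationships dictionary and calls merge_relationship() for each entry. A minimal sketch is shown here; the handling of the optional third tuple element (the explicit target attribute) simply follows the convention described earlier.

for (source, sattr), rels in relationships.items():
    for rel in rels:
        if len(rel) == 3:
            # Explicit target attribute, e.g., ('specimen_item', 'D_ITEMS', 'itemid')
            rel_name, target, tattr = rel
        else:
            # Otherwise the target attribute shares the source attribute's name
            rel_name, target = rel
            tattr = sattr
        merge_relationship(source, sattr, rel_name, target, tattr)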
Now that we have our data loaded into Neo4j, let’s start querying it.
Querying data in Neo4j
So, back to our heparin example—we can query for all prescriptions that contain “heparin” (case-insensitive), and then we can immediately connect that to individual patients, as shown in Example 4-9.
Example 4-9. Cypher queries for Heparin
MATCH (p:PRESCRIPTIONS)
WHERE toLower(p.drug) CONTAINS "heparin"
   OR toLower(p.drug_name_generic) CONTAINS "heparin"
   OR toLower(p.drug_name_poe) CONTAINS "heparin"
RETURN DISTINCT p.drug

MATCH (pt:PATIENTS)-[r:has_prescription]->(p:PRESCRIPTIONS)
WHERE toLower(p.drug) CONTAINS "heparin"
   OR toLower(p.drug_name_generic) CONTAINS "heparin"
   OR toLower(p.drug_name_poe) CONTAINS "heparin"
RETURN DISTINCT pt
The Cypher queries look very similar to their SQL counterparts, which isn't surprising given that these particular queries aren't especially deep or complex. Next, we will take the queries one step further, focusing on those NDCs that we are actually interested in.
Harmonizing data with Neo4j
Now, let’s look at how we handle this situation where we want to filter the prescriptions to ignore heparin flushes. We follow the same process as in the previous section. However, we will add the additional step of capturing this within the database itself, as shown in Example 4-10.
Example 4-10. Creating a new heparin concept
CREATE (parent:Drug:Knowledge {
    date_created: "2022-01-01",
    drug: "Heparin (non-flush)",
    description: "MIMIC-III, Heparin, excluding flushes",
    purl: "http://some-ontology.org/12345"
})
WITH parent
MATCH (p:PRESCRIPTIONS)
WHERE p.ndc IN ["63323026201", "00641040025", "00338055002",
                "00074779362", "00264958720", "00409779362",
                "63323054207"]
MERGE (p)-[r:for_drug {derived: true}]->(parent)
RETURN DISTINCT parent, p
There are essentially three parts to this query:
- Create the new concept to which we will link prescriptions.
- Match the prescriptions of interest.
- Link the matched prescriptions to the new heparin concept.
While this is the simplest and most straightforward way to link the new concept for “Heparin (non-flush)” to the prescriptions, the downside is that we don’t have intermediate concepts for each of the seven individual NDCs. However, this is a theoretical downside; that is, academically, we may want to track concepts for each NDC, but the reality is that we may never use that information or knowledge. This is an example of the sort of decision we need to make when working with graphs—do we want to track things we may not ever use? Generally, the answer is no, but it’s not always obvious to us what we might need in the future. On the other hand, how much should we invest in a future that may not exist?
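For comparison, if we did decide to track an intermediate concept per NDC, a sketch of that alternative might look like the following. The has_child relationship name and the reuse of the Drug:Knowledge labels for the per-NDC nodes are assumptions for illustration only.

# A sketch of the alternative: one intermediate concept per NDC, linked to the parent.
ndcs = ["63323026201", "00641040025", "00338055002",
        "00074779362", "00264958720", "00409779362", "63323054207"]

for ndc in ndcs:
    query = f"""
    MATCH (parent:Drug:Knowledge {{purl: "http://some-ontology.org/12345"}})
    MERGE (child:Drug:Knowledge {{ndc: "{ndc}"}})
    MERGE (parent)-[:has_child]->(child)
    WITH child
    MATCH (p:PRESCRIPTIONS {{ndc: "{ndc}"}})
    MERGE (p)-[:for_drug {{derived: true}}]->(child)
    """
    db.query(query)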
Another point of note is the use of the purl
property. PURL is a persistent
uniform resource locator and is a way to uniquely identify resources
accessible via the web. Many use PURLs simply as a way to create unique
identifiers. However, the intent is that the PURL is a valid URL that
leverages HTTP response codes to communicate to consumers the type and status
of a particular resource.7 The Open Biological and Biomedical Ontology (OBO) Foundry is another example of how PURLs are used for identifiers. Here, we are using it simply to create an identifier for the purpose of the example. The hostname used is not real, and the PURL will not actually resolve to anything.
So, given this new concept, how can we query our database for all patients on heparin, excluding flushes? We can basically follow our nodes and relationships using Cypher as we see in Example 4-11.
Example 4-11. All patients connected to heparin concept (Neo4j)
MATCH (pt:PATIENTS)-[rx:has_prescription]->(p:PRESCRIPTIONS)
      -[:for_drug]->(d:Drug {purl: "http://some-ontology.org/12345"})
RETURN pt
As you can see, our query is quite simple and intuitive. Depending on our internal governance process, we can add as many properties as we need to the new concepts we create, as well as to the relationships linking them. This will allow us to filter and follow only those nodes and relationships pertinent to our given question. For example, I included a date_created property that captures when the relationship was actually created in the database. This would allow us to track how the relationship might change over time (if we create a new relationship each time) but use only the latest one when performing queries such as the previous one.
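As a sketch of what that might look like, the query below assumes we also stamp each for_drug relationship with its own date_created property (Example 4-10 only set it on the concept node) and keeps just the most recently created mapping per prescription.

latest_heparin_query = """
MATCH (pt:PATIENTS)-[:has_prescription]->(p:PRESCRIPTIONS)
      -[r:for_drug]->(d:Drug {purl: "http://some-ontology.org/12345"})
WITH pt, p, r ORDER BY r.date_created DESC
// collect() preserves the ordering, so element 0 is the newest mapping
WITH pt, p, collect(r)[0] AS latest_mapping
RETURN DISTINCT pt
"""
result = db.query(latest_heparin_query)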
Given this ability to connect patients to prescriptions, and prescriptions to
our new drug concept, you might be wondering why we don’t have other drug
concepts. If we look at the PRESCRIPTIONS
table within the demo database, we
would actually see that there are 2,489 unique entries if we consider only the
columns with drug information (excluding the prescription dates and links back
to patients, admissions, or ICU stays).
We would want to introduce a new type of node to capture just the drug details and move the start and end dates, admission, and ICU stay details to the has_prescription relationship. If we continue remodeling the data, we may also want to connect any given prescription to a hospital admission or ICU stay. This is where the property graph approach begins to show its limitations: if we model a prescription as a relationship, there is no direct way to link that relationship to a hospital admission node. Given the details of the MIMIC data, we could work around this by storing the hospital admission ID (hadm_id) in both the admission node and the prescription relationship, and making sure that our Cypher query filters on both. In Example 4-12, we query all prescriptions connected to emergency department admissions for the patient with subject_id 12345.
Example 4-12. Connecting an admission to a prescription relationship
MATCH (pt:PATIENTS {subject_id: 12345})-[:has_admission]->
      (a:ADMISSIONS {admission_type: "EMERGENCY"}),
      (pt2:PATIENTS)-[rx:has_prescription]->(d:Drug)
WHERE a.hadm_id = rx.hadm_id
RETURN pt, rx, d
While this certainly makes the queries easier compared to SQL, the trade-off is that you now need to manage the knowledge stored within the database and make sure that your queries are using the right concepts. What if someone changes the underlying mappings that we created in the previous section? How would this affect your query? How would you even know that such changes had been made so that you could retest your queries to make sure they still made sense? While this approach is "clean" in a semantic sense, it does highlight some of the key challenges around knowledge management, regardless of your chosen technological solution.
In the next section, we will walk through the heparin example again with TypeDB. In addition to the hypergraph aspects, we will also look at the use of an inferencing or reasoning engine. This adds another element to how we can maintain data harmonization mappings through the use of rules.
Hypergraph with TypeDB
We continue our heparin example with TypeDB, but we will also look at a feature specific to TypeDB—its built-in reasoning engine—which allows us to manage our data harmonization mappings with two different approaches:
- Similar to property graphs and RDF triple stores, we can create new relationships that connect the various heparin concepts together, embedding this knowledge directly in the graph.
- We can take advantage of the reasoning engine and create rules that handle the mappings, loading and unloading the rules based on our query needs.
There is no clear “best approach” between our two options, and the choice will largely depend on the details of the specific use case. For example, if your mappings are unlikely to change often and will be frequently queried, it is probably best to just write them directly into the database and manage them as you would other ETL jobs for modifying the contents of the database.
On the other hand, if your mappings are likely to change frequently, constantly updating your database will be challenging to manage and can increase the probability of inconsistent knowledge. These types of “bugs” can be difficult to track down because they are difficult to reproduce. In this situation, you could create the mappings as rules, managing them as you would any other code. Once loaded into the database, your queries would benefit from the most up-to-date mappings.
Before we get into the nuts and bolts of working with the data, we need to create and load the TypeDB schema, just as we did in Chapter 3 when working with the UMLS. Example 4-13 shows the schema for Patient
, which corresponds to the PATIENTS
table definition within MIMIC-III.
Example 4-13. TypeDB Schema for Patients
define

patient sub entity,
    plays prescription:patient,
    owns row_id,
    owns subject_id,
    owns gender,
    owns dob,
    owns dod,
    owns dod_hosp,
    owns dod_ssn,
    owns expire_flag;

row_id sub attribute, value long;
subject_id sub attribute, value long;
gender sub attribute, value string;
dob sub attribute, value string;
dod sub attribute, value string;
dod_hosp sub attribute, value string;
dod_ssn sub attribute, value string;
expire_flag sub attribute, value long;
We have defined each column of the PATIENTS table as an attribute within TypeDB along with an associated data type. I have made everything a string to keep the code and parsing simple, though I kept the various _id fields as long data types. You can also see that the patient entity corresponds to the PATIENTS table and links each of the columns via the owns keyword. While the syntax differs from other database definition languages, it is conceptually equivalent. The one key difference is the plays prescription:patient line, which highlights the fact that a patient "plays" the role of a patient within the prescription relation.8
Example 4-14 shows another snippet of the schema definition
specific to how we will be modeling prescriptions and drug instances.
Example 4-14. TypeDB schema for prescriptions
prescription sub relation,
    relates patient,
    relates prescribed_drug,
    owns startdate,
    owns enddate;

druginstance sub entity,
    plays prescription:prescribed_drug,
    owns startdate,
    owns enddate,
    owns drug_type,
    owns drug,
    owns drug_name_poe,
    owns drug_name_generic,
    owns formulary_drug_cd,
    owns gsn,
    owns ndc,
    owns prod_strength,
    owns dose_val_rx,
    owns dose_unit_rx,
    owns form_val_disp,
    owns form_unit_disp,
    owns route;
I’ve split the original definition of the PRESCRIPTIONS
table into an entity
and a relation. The druginstance
entity captures all of the drug information
(e.g., name, dose, route, etc.), while the prescription
relation captures the
details of the prescription itself. Within the relation, we define two
roles: patient and prescribed_drug. Roles allow TypeDB to capture and
model an additional layer of context—the role that an entity plays within a
particular relation.
Import data into TypeDB
As I’m sure you are used to by now, we need to make sure our Docker container is running. You can use the same configuration as shown in Example 3-14 from the previous chapter.
The overall skeleton of our code remains the same as with the previous examples for SQLite and Neo4j. This is pretty straightforward though TypeDB does require the explicit use of transactions to manage our interactions. Before we start importing the actual data, we need to first load our TypeDB schema as we saw in Example 3-16.
Now that our database is primed and loaded, we will need to iterate through each of the CSV files and load them into TypeDB as we have done previously. Unfortunately, there is not yet a library that provides convenient loading of dataframes into TypeDB. Additionally, we need to do some processing of the data first so that we can extract unique drug instances that we can then connect using prescription relations. Example 4-15 contains our top-level code that iterates through the CSV files and loads them.
Example 4-15. Loading the MIMIC-III data into TypeDB
csv_files = [f for f in os.listdir(mimic_path) if f.endswith(".csv")]
csv_files.sort()
print(f"Found {len(csv_files)} files, expected 26")

db = "test2"

with TypeDB.core_client("localhost:1729") as client:
    if client.databases().contains(db):
        client.databases().get(db).delete()
    client.databases().create(db)

with TypeDB.core_client("localhost:1729") as client:
    with client.session(db, SessionType.SCHEMA) as session:
        with session.transaction(TransactionType.WRITE) as tx:
            with open("../data/mimic-schema.tql") as f:
                q = f.read()
                tx.query().define(q)
                tx.commit()

with TypeDB.core_client("localhost:1729") as client:
    with client.session(db, SessionType.DATA) as session:
        for f in csv_files:
            path = os.path.join(mimic_path, f)
            table_name = f.split(".")[0]
            print(f"Importing: {path} into table: {table_name}")

            data = pd.read_csv(path, dtype=str)
            data = data.replace({np.nan: None})
            queries = processors[table_name](data)

            with session.transaction(TransactionType.WRITE) as tx:
                for q in queries:
                    tx.query().insert(q)
                tx.commit()
We have wrapped most of the import in transactions and other TypeDB-specific
code compared to our previous examples. The other notable differences compared
to Example 3-17 are the lack of batching and the use of more
generic “processors.” Each of our processor functions takes a dataframe as
input and returns a list of query strings to be executed against the database.
For straightforward tables such as PATIENTS
, the processor function does a
straight conversion. For prescriptions, we do a bit more processing, as we can
see in Example 4-16. I have left out the definitions of the
template functions from the example since you saw some in
Example 3-18.
Example 4-16. Patient processor function
def process_patients(df):
    queries = [patient_template(x).strip() for x in df.itertuples()]
    return queries

def process_prescriptions(df):
    queries = []
    # Extract unique drug instances
    drugs = df[[x[0] for x in drug_fields]].drop_duplicates()
    queries.extend([druginstance_template(x).strip() for x in drugs.itertuples()])
    queries.extend([prescription_template(x).strip() for x in df.itertuples()])
    return queries

processors = {
    "PATIENTS": process_patients,
    ...
    "PRESCRIPTIONS": process_prescriptions
}
However, since we are separating our prescriptions into a druginstance
entity
and a prescription
relation, we will need to do a bit more processing. First,
we need to extract only those columns needed for a druginstance
, and then we
need to construct two sets of queries—one to insert druginstances
and
another to insert the prescription
relations.
Example 4-17 highlights the processor function as well as the associated template functions that generate the query strings.
Example 4-17. Prescription processor function
drug_fields = [
    ("drug_type", str),
    ("drug", str),
    ("drug_name_poe", str),
    ("drug_name_generic", str),
    ("formulary_drug_cd", str),
    ("gsn", str),
    ("ndc", str),
    ("prod_strength", str),
    ("dose_val_rx", str),
    ("dose_unit_rx", str),
    ("form_val_disp", str),
    ("form_unit_disp", str),
    ("route", str)
]

def druginstance_template(p):
    return f"""
    insert $p isa druginstance,
    {", ".join(filter(None, [has_clause(p, f) for f in drug_fields]))};
    """

def prescription_template(p):
    return f"""
    match
    $pt isa patient, has subject_id {getattr(p, 'subject_id')};
    $drug isa druginstance,
    {", ".join(filter(None, [has_clause(p, f) for f in drug_fields]))};
    insert $prescription (patient: $pt, prescribed_drug: $drug) isa prescription,
    has row_id {getattr(p, 'row_id')},
    has startdate "{getattr(p, 'startdate')}",
    has enddate "{getattr(p, 'enddate')}";
    """

def process_prescriptions(df):
    queries = []
    # Extract unique drug instances
    drugs = df[[x[0] for x in drug_fields]].drop_duplicates()
    queries.extend([druginstance_template(x).strip() for x in drugs.itertuples()])
    queries.extend([prescription_template(x).strip() for x in df.itertuples()])
    return queries
The most important difference is the match/insert query in
prescription_template()
. Since relations assume that the associated entities
are already loaded in the graph, we start with a match
clause to find the
patient and drug that form the prescription. Given the MIMIC-III data, we are
matching the patient by subject_id
and the drug by all the drug fields and then
creating the relation in which the start and end dates of the prescription are
captured.
Now that we have our data loaded into TypeDB, let’s run some of our standard queries.
Querying data in TypeDB
First, we want to see which drugs contain heparin using basic string matching. Note that, unlike previous examples, we are querying for our newly created drug instances and not prescriptions as we did for SQLite and Neo4j. Then, all we would need to do is add an additional clause to match prescription relations if we wanted to return all prescriptions related to our drugs containing the string “heparin”. Both of these queries are shown in Example 4-18.
Example 4-18. TypeQL queries for heparin
# Query just the drug instances
match
$drug isa druginstance;
{$drug has drug contains "heparin";} or
{$drug has drug_name_generic contains "heparin";} or
{$drug has drug_name_poe contains "heparin";};
get $drug;

# Add clause for prescriptions
match
$drug isa druginstance;
{$drug has drug contains "heparin";} or
{$drug has drug_name_generic contains "heparin";} or
{$drug has drug_name_poe contains "heparin";};
$rx (prescribed_drug: $drug) isa prescription;
get $rx;
Now, let’s move on to how we might harmonize our drug instances with our heparin concept.
Harmonizing data in TypeDB
Now that we have our data loaded into TypeDB, the next step in our heparin example is to create a new drug concept for the subset of heparin in which we are interested. There are two approaches we can take when harmonizing the data. The first (Example 4-19) follows the same pattern as we used with property graphs—to just directly update the database.
Example 4-19. Directly inserting the heparin hierarchy relation
match
$d isa druginstance, has ndc "63323026201";
$c isa concept, has purl "http://some-ontology.org/12345";
insert
(parent: $c, child: $d) isa hierarchy;
The second approach we will take involves creating rules. As you can see in the following query, it looks similar to the TypeQL query for inserting directly into the database. When using rules, the TypeDB reasoning engine dynamically “adds” new data to the database upon rule execution without actually persisting anything. This allows queries to interact with the data as if the data was in the database, essentially providing a mechanism that inserts data, runs the query, and then removes the data.
Example 4-20 shows a single rule that matches a particular druginstance (indexed by NDC) to our "Heparin (non-flush)" concept. Rules can infer relations and attributes but not new entities, so we would have previously inserted this concept into the database. Once the rule is triggered, the graph is effectively the same as if we had inserted the new relation as in the previous example.
Example 4-20. Example TypeDB rule
define

rule heparin-non-flush-subset-ndc-63323026201:
when {
    $d isa druginstance, has ndc "63323026201";
    $c isa concept, has purl "http://some-ontology.org/12345";
} then {
    (parent: $c, child: $d) isa hierarchy;
};
This rule matches all drugs that have the specified NDC code, as well as the concept to which we want to link them, and creates a hierarchical relation. As you can see, we are specifying only a single NDC. We would need to create a separate rule for each NDC we want to match because TypeDB rule conditions are conjunctive; disjunctive conditions are simply created as separate rules.
OR Versus AND Conditions
TypeDB condition clauses (the “when” section) are conjunctive—they are combined using “AND” so all conditions must be met for the rule to trigger. If you need to specify disjunctive conditions (using “OR”), you would need to create separate rules for each condition.
So, to create the seven rules we need for the different NDCs, we can either manually create the rules as we did earlier or write a little block of code to generate the rules for us, loading them the same way we loaded schemas. We use the same template approach as before, so our code is pretty straightforward, as we see in Example 4-21.
Example 4-21. Autogenerate TypeDB rules
def heparin_ndc_rule_template(ndc):
    return f"""
    define
    rule heparin-non-flush-subset-ndc-{ndc}:
    when {{
        $d isa druginstance, has ndc "{ndc}";
        $c isa concept, has purl "http://some-ontology.org/12345";
    }} then {{
        (parent: $c, child: $d) isa hierarchy;
    }};
    """

ndcs = [
    "63323026201",
    "00641040025",
    "00338055002",
    "00074779362",
    "00264958720",
    "00409779362",
    "63323054207"
]

with TypeDB.core_client("localhost:1729") as client:
    with client.session(db, SessionType.SCHEMA) as session:
        with session.transaction(TransactionType.WRITE) as tx:
            for rule in [heparin_ndc_rule_template(ndc) for ndc in ndcs]:
                tx.query().define(rule)
            tx.commit()
Now that we have loaded our rules to connect particular drugs to our new heparin concept, let's query our database. As we see in Example 4-22, the syntax is a bit more complicated than what we saw with Cypher (in Example 4-11), but the conceptual approach is the same.
Example 4-22. All patients connected to heparin (TypeDB)
match
# Find all drug instances linked to http://some-ontology.org/12345
$p isa concept, has purl "http://some-ontology.org/12345";
$c isa druginstance;
(parent: $p, child: $c) isa hierarchy;
# Find all prescriptions related to those drugs
(prescribed_drug: $c, patient: $pt) isa prescription;
get $pt;
The first block of the match statement finds our new heparin concept (using the PURL) and all connected drug instances; the second block then finds all of the prescriptions connected to those drug instances. As you can see, we achieve the same functionality as we did with property graphs. However, the link between our new heparin concept and the existing drug instances is inferred by the reasoning engine and not persisted within the database. So, our database contains the facts as taken from the electronic health record, while our use-case-specific grouping of the drugs is inferred in real time.
The main benefit of such an approach is that we don't clutter our database with mappings that are specific to a subset of use cases. We can load and unload rules as our needs change and limit the database to storing our facts. Of course, while this may sound appealing, we still need to balance this against our business case and whether this technical solution actually fits our needs. TypeDB has become quite performant (compared to earlier versions, when it was still Grakn), but making inferences in real time may become a bottleneck. Ultimately, whether you load knowledge as rules in the reasoner or directly into the graph will largely depend on how often you plan on using the inference, how likely it is to change, and how performant the particular rule(s) are.
As we saw with each previous solution, the medication normalization problem can be solved with any database. The choice of approach largely depends on other requirements, many of which may be difficult to determine when a project starts. It is easy to over-engineer a solution, especially if we are trying to account for poorly defined requirements. So, it is important that we spend the time up front to really understand the context in which we are building our solution.
For example, are we working with data from a disease area that is being updated frequently? In other words, is our understanding of the underlying science changing on the order of months or years? Aside from the disease itself, we also need to look at the associated clinical practice. We may understand the underlying pathophysiology of the disease process, but clinical management may be highly complex or nuanced.
Generally speaking, anything that increases the frequency or complexity of the mapping process will increase the chance that we will need to remap the data at some point. We then need to weigh this against how long it would take for us to remap the data and then apply those mappings to the underlying data. This is a great example of the intersection of data and engineering.
We need to balance working with the data (creating and maintaining the mappings) with engineering and operational considerations (applying the mappings to the data). It may be that the rule-based approach discussed earlier is the best approach from a data perspective. However, what if it makes our query so slow that it is unusable? We may need to persist these mappings simply to make queries usable, thus forcing us to find ways to better manage the migrations of the mappings.
So, this concludes our whirlwind tour through our heparin example in the context of different types of databases. As we discussed briefly in Chapter 2, RDF triple stores are a form of graph database based on semantic web standards. We did not go into specific implementation details of RDF triple stores and SPARQL because the modeling approach would be conceptually the same as with property graphs and hypergraphs. However, unlike property graph databases, RDF triple stores treat everything as a triple of subject-predicate-object. As a result, they can appear a bit more complex. Please visit the GitLab repo for specific implementation examples.
Similar pros and cons exist with RDF graphs as with property graphs. However, given that RDF triple stores follow an established standard, this approach can make it easier for us to share our knowledge with other systems. For example, instead of creating triples capturing the relationship between our new concept and the data directly as represented by MIMIC, we could use standardized PURLs for the medication concepts. We could then send these triples to others for inclusion in their systems.
In the next section, we quickly discuss how we could connect the previous approaches with terminologies such as the UMLS or specific sources such as SNOMED CT.
Connecting to the UMLS
As you saw in the previous set of examples, we were able to connect the data from an electronic health record to structured concepts. Though our example focused on a new concept that we created and to which we assigned our own definition and semantics, this could easily have been a concept within the UMLS or another terminology/ontology.
In the SQL example, if we don’t already have structured codes or concepts, we generally add this as a new column. Data scientists typically handle this within their code (e.g., R or Python) after they’ve exported the data from the database. An alternative would be to update the database schema and persist the structured concepts as a new column in the database or via more complex join tables (as we see with OMOP and the integration of the Athena vocabularies).
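As a rough illustration of that first pattern, the following sketch adds a concept column to an exported prescriptions dataframe by mapping our seven NDCs to the "Heparin (non-flush)" identifier; the concept_purl column name is just an assumption for the example.

import pandas as pd

# The same seven NDCs we identified earlier in the chapter
HEPARIN_NON_FLUSH_NDCS = {
    "63323026201", "00641040025", "00338055002",
    "00074779362", "00264958720", "00409779362", "63323054207",
}

def tag_heparin_concept(prescriptions: pd.DataFrame) -> pd.DataFrame:
    # Add a hypothetical concept_purl column alongside the exported EHR columns
    out = prescriptions.copy()
    out["concept_purl"] = out["ndc"].map(
        lambda ndc: "http://some-ontology.org/12345"
        if ndc in HEPARIN_NON_FLUSH_NDCS else None
    )
    return out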
However, when working with graphs, you saw how quickly we could create a new connection between the data point and a structured concept that we created. If we have terminologies loaded into the same graph as our data, we can perform the same linkage with existing concepts from the terminology. When working with UMLS, we could look up concepts by CUI instead of a PURL. If we are integrating with the NCBO BioPortal, we could use the BioPortal PURL. Even if we are linking directly with a single terminology, we would simply use the source’s unique identifier (e.g., RXCUI, LOINC code, SNOMED CT identifier).
To make it a bit more concrete, we can combine our example in this chapter with the heparin example from Chapter 3, where we found the concepts corresponding to the heparin pharmacologic substance. We also identified the therapeutic or preventive procedure that was "heparin therapy." With the MIMIC data, instead of creating a new heparin concept and linking the drugs, we could also just link the patient or admission directly to the "heparin therapy" concept. Not only would this cleanly link us to the UMLS, but it may also be more semantically correct. After all, it's not the drug itself that is therapeutic or not; it's the idea that it was used as a therapeutic to treat the patient.
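As a sketch of what that direct linkage could look like in the Neo4j version of our graph, the query below merges a hasConcept relationship between one admission and a UMLS concept node keyed by CUI. The Concept label, the hasConcept relationship, and both identifier values are placeholders; you would substitute the actual "heparin therapy" CUI from Chapter 3 and whatever labels your UMLS load uses.

# Placeholders only: substitute a real hadm_id and the real "heparin therapy" CUI
hadm_id = "100001"
cui = "C0000000"

query = f"""
MATCH (a:ADMISSIONS {{hadm_id: "{hadm_id}"}})
MATCH (c:Concept {{cui: "{cui}"}})
MERGE (a)-[r:hasConcept {{derived: true}}]->(c)
RETURN a, r, c
"""
result = db.query(query)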
After quite a bit of a technical deep dive, let’s take a step back and, in the next section, discuss some of the difficulties and challenges we have when normalizing structured medical data.
Difficulties Normalizing Structured Medical Data
The medication normalization scenario discussed in this chapter is a relatively simple situation. There are certainly some complexities and potential ambiguity in mapping something like heparin, but medications are generally objective data. Things become much more challenging when working with data that may be subjective or where there is no clear consensus.
For example, as we discussed earlier in this chapter, cancer staging is a complex topic. In addition to different approaches to staging (group versus TNM), there are also the AJCC staging guidelines, which are routinely updated. Though the guidelines were originally created for human consumption (i.e., for oncologists to read, understand, and apply in their practice), they are also evolving to reflect the shift to a data- and computation-centric future.
So, how does this affect us from a data harmonization perspective? When we work with oncology data that has staging guidelines, we need to identify which edition/version was used and whether we will need to account for differences between the versions. In other words, a cancer staged using one version of the AJCC guidelines might actually result in a different stage using a later version of the guideline. So, if we are combining datasets or analyzing patients across many years (during which time there was an update to the guidelines), we need to be careful about how the data were coded. Of course, there are too many examples to list, but similar nuances exist across all medical specialties, so it is critical that we engage appropriate domain experts to help us identify such pitfalls.
Conclusion
In this chapter, we dove into working with electronic health record data, using the MIMIC dataset to illustrate many of the challenges. We also looked at Synthea, a synthetic data generator. The MIMIC dataset is a real dataset and thus preferable in many ways over synthetic data. However, it focuses on intensive care data and so is not reflective of the majority of data you are likely to encounter when working with EHR data. Synthea provides more generic data, though the data are synthetically generated. With Synthea, we can also get data in different formats, allowing us to test data transformation pipelines.
The most obvious trade-off between the two is that we can use the MIMIC data both to develop data pipelines and to generate actual insights—as long as our use case aligns with intensive care data from an academic medical center. On the other hand, the Synthea data allows us to create and test data pipelines but can't be used for any actual analysis.
The MIMIC data also reflects true data quality issues that creep into EHR data because it was collected during actual patient care. Though Synthea generates pseudorandom data, the algorithms that generate the data are those that were engineered into the underlying software. So, while Synthea data can approximate many of the real-world data issues we see, it may still have many gaps.
Within the context of EHR data, we looked at some of the methods for storing and querying the data in the context of a medication harmonization example. This allowed us to compare and contrast SQL- and graph-based approaches for representing patient-level data, and Table 4-2 highlights the pros and cons to each type of database.
Database | Pros | Cons
---|---|---
PostgreSQL (SQL) | |
Neo4j/ArangoDB (property graph) | |
GraphDB (RDF triple store) | |
TypeDB (strongly typed hypergraph) | |
By now, you should have a foundational understanding of the many issues that arise when working with real-world data. There is no way to capture all of the many nuances and challenges that you might encounter, but I chose the medication harmonization challenge since it is representative of many common issues.
In the next chapter, we continue our deep dive into RWD but we shift gears a bit and look at claims data that are generated as part of the reimbursement process. You may think that claims data are specific to the United States given our heavy focus on private insurers. However, even those countries with single payer or nationalized health systems still generate claims data—essentially, the government becomes the sole insurance company.
1 Arvind Narayanan and Vitaly Shmatikov, “Robust De-anonymization of Large Datasets (How To Break Anonymity of the Netflix Prize Dataset),” University of Texas at Austin, 2008. https://arxiv.org/abs/cs/0610105.
2 See the Open Data Stack Exchange, “MIMIC-III inputevents MV or CV”.
3 ACDIS, “Q&A: Is ‘backward mapping’ from ICD-10-CM/PCS to ICD-9-CM appropriate?”
4 “Asthma Data Analysis Guidance: ICD-9-CM to ICD-10-CM Conversion,” US Centers for Disease Control and Prevention.
6 Wikipedia, “Separation of Concerns”.
7 The DOI System uses PURLs and they provide a short discussion regarding how they use PURLs.
8 Please see the TypeDB documentation for additional information on entities, attributes, roles, and relations.