Chapter 1. Introduction

Data is recognized as an important driver of innovation in economic and research activities, and is used to improve services and derive new insights. Services are delivered more efficiently, at a lower cost, and with increased usability, based on an analysis of relevant data regarding how a service is provided and used. Insights improve outcomes in many facets of our lives, reducing the likelihood of fatal accidents (in travel, work, or leisure), getting us better returns from financial investments, or improving health-related outcomes by allowing us to understand disease progression and environmental influences, to name but a few examples. Sharing and using data responsibly is at the core of all these data-driven activities.

The focus of this book is on implementing and deploying solutions to reduce identifiability within a data pipeline, and it’s therefore important to establish context around the technologies and data flows that will be used in production. Example applications include everything from structured data collection to Internet of Things (IoT) and device data (smart cities, telco, medical). In addition to the advantages and limitations of particular technologies, decision makers need to understand where these technologies apply within a deployed data pipeline so that they can best manage the spectrum of identifiability. Identifiability is more than just a black-and-white concept, as we will see when we explore a range of data transformations and disclosure contexts.

Before we delve into the concepts that will drive the selection of solutions and how they’re deployed, we need to appreciate some concepts of privacy and data protection. These will help frame the scope of this book, and in particular the scope of reducing identifiability. While this is a book about anonymization, we divide the book up by different categories of identifiability that have been established by privacy and data protection laws and regulations. We will also demonstrate how to support proper anonymization through the concepts of these laws and regulations, and provide examples of where things went wrong because proper anonymization was not employed. Anonymization should, in practice, involve more than just removing people’s names from data.

Identifiability

Best practice recognizes that data falls on a spectrum of identifiability,1 and that this spectrum can be leveraged to create various pipelines to anonymization. This spectrum is managed through technology-enabled processes, including security and privacy controls, but more specifically through data transformations and monitoring. We will explain how to objectively compare data sharing options for various data collection use cases to help the reader better understand how to match their problems to privacy solutions, thereby enabling secure and privacy-preserving analytics. There is a range of permutations in how to reduce identifiability, including where and when to provide useful data while meaningfully protecting privacy in light of broader benefits and needs.

While technology is an important enabler of anonymization, technology is not the end of the story. Accounting for risks in an anonymization process is critical to achieving the right level of data transformations and resulting data utility, which influences the analytic outcomes. Accordingly, to maintain usable outcomes, an organization must have efficient methods of measuring, monitoring, and assuring the controls associated with each disclosure context. Planning and documenting are also critical for any regulated area, as auditors and investigators need to review implementations to ensure the right balance is met when managing risks.

And, ultimately, anonymization can be a catalyst for responsibly using data, as it is privacy enhancing. There is a security component to responsibly using data that comes from limiting the ability to identify individuals, as well as an ethical component that comes from deriving insights that are broader than single individuals. Conceptually, we can think of this as using “statistics” (that is, numerical pieces of information) rather than single individuals, and using those statistics to leverage insights into broader populations and application areas to increase reach and impact. Let’s discuss some of the other terms you’ll need to know next.

Getting to Terms

Before we can dig in and describe anonymization in any more detail, there are some terms it would be best to introduce at the outset, for those not familiar with the privacy landscape. We will describe a variety of privacy considerations and data flows in this book based on potential data pipelines, and we will simply describe this as data sharing. Whether the data is released, as in a copy of the data is provided to another party, or access is granted to an external user of a repository or system internal to an organization, it’s all sharing to us! Sometimes the term disclosure is also used for sharing data, and in a very broad sense. In an attempt to keep things simple, we will make no distinction between these terms.

We will use the terms data custodian to refer to the entity (meaning person or company) sharing data, and data recipient to refer to the entity receiving data. For internal data sharing scenarios, the data custodian is the organization as an entity, and the data recipient is a functional unit within that organization. The organization maintains oversight of the data sharing to the functional unit, and ensures that the functional unit operates separately enough that it can be assessed and treated as a legitimate data recipient. We will discuss this scenario in more detail later in the book.

Note

In this book we have chosen to use the term identifiability, which pairs well with privacy laws and regulations that describe identifiable information, rather than speak of “re-identification risk.” Although our measures are probabilistic, nonexperts sometimes find this approach to be both daunting and discouraging due to the focus on “risk.” We hope that this change in language will set a more reasonable tone, and put the focus on more important aspects of building data pipelines that reduce identifiability and provide reasonable assurance that data is nonidentifiable.

We would struggle to describe anonymization, and privacy in general, without explaining that personal data is information about an identifiable individual. You may also come across the terms personal information (as it’s referred to in Canada), personally identifying information (used in the US), or protected health information (identifiable health information defined for specific US health organizations). Personal data is probably the broadest of these terms (and due to EU privacy regulations, also of high impact globally), and since our focus is on data for analytics, we will use this term throughout this book. In legal documentation, the term used will depend on which law applies. For example, personally identifying information mixed with protected health information would simply be called protected health information.

When personal data is discussed, an identifiable individual is often referred to as a data subject. The data subject is not necessarily the “thing under study” (that is, the “unit of analysis,” a term commonly used in scientific research to mean the person or thing under study). If data is collected about births, the thing under study may be the actual births, the infants, or the mothers. That is, the statistical analysis can focus on any one of these, and changing the thing under study can change how data is organized and how the statistical tools are used. For example, an analysis of mothers could be hierarchical, with infants at a different structural level. We will describe simple data structures with regard to statistical analysis in the next chapter.

For the purposes of this book, and most privacy laws and regulations, any individual represented in the data is considered a data subject. The thing under study could be households, where the adult guardians represent the individuals that are of primary interest to the study. Although the number of children a person has (as parent or guardian) is personal, children are also data subjects in their own right. That being said, laws and regulations vary, and there are exceptions. Information about professional activities may be confidential but not necessarily private. We will ignore these exceptions and instead focus on all individuals in the data as data subjects whose identity we endeavor to protect.

Laws and Regulations

Many of the terms that can help us understand anonymization are to be found in privacy laws and regulations.2 Data protection, or privacy laws and regulations (which we will simply call laws and regulations, or privacy laws and regulations), and subsequent legal precedents, define what is meant by personal data. This isn’t a book about law, and there are many laws and regulations to consider (including national, regional, sectorial, even cultural or tribal norms, depending on the country). However, there are two that are notable for our purposes, as they have influenced the field of anonymization in terms of how it is defined and its reach:

Health Insurance Portability and Accountability Act (HIPAA)

Specific to US health data (and a subset at that),3 HIPAA includes a Privacy Rule that provides the most descriptive definition of anonymization (called de-identification in the act). Known as Expert Determination, this approach requires someone familiar with generally accepted statistical or scientific principles and methods to anonymize data such that the risk of identifying an individual is “very small.”4

General Data Protection Regulation (GDPR)

This very comprehensive regulation of the European Union has had far-reaching effects, in part due to its extraterritorial scope (applying to residents of the EU, regardless of where their data is processed, when a service intentionally targets the EU), and in part due to the severity of the fines it introduced, which are based on an organization’s global revenue. The regulation is “risk based” (or contextual), with many references to risk analysis or assessments.5

As technology evolves, so do the emerging threats to anonymized data: more information may become publicly available, new techniques and methods become available to scrape and combine public information, and new methods emerge to launch attacks on data. Meanwhile, the technology that protects data, both cybersecurity and anonymization, will age and need updates and improvements. This means that the assessments of identifiability need periodic reviews and continuous oversight to ensure the circumstances under which data was rendered nonidentifiable remain in place.6 Similar to cybersecurity, typically the assessments need to be redone every 12 to 24 months, on top of continuous monitoring.

Since we’ve introduced US and EU privacy regulations, we should also clarify some of the terms used in each of these jurisdictions to refer to similar concepts. We’re focusing on the two regulations mentioned above, although in truth there are also state-level privacy laws in the US (such as the California Consumer Privacy Act, and LD 946 in Maine), as well as member-level privacy laws in the EU that add additional layers. For our purposes the terms in Table 1-1 should be sufficient. And, yes, you may notice that we’ve repeated the definition of personal data for the sake of completeness. The definitions are only basic interpretations in an attempt to bring the US and EU terms into alignment. This is only meant to provide some guidance on aligning the terms; be sure to discuss your particular situation with your legal and privacy team.

Table 1-1. Basic definitions based on the similarities between US and EU terms
US HIPAA | EU GDPR | Common definition
Protected health information | Personal data | Information about an identifiable individual
De-identification | Anonymization | Process that removes the association between the identifying data and the data subject
Covered entity | Data controller | Entity that determines the purposes and means of processing of personal data
Business associate | Data processor | Entity that processes personal data on behalf of the data controller
Data recipient | Data processor (for personal data) | Entity to which data is disclosed by the data custodian
Limited data set | Pseudonymized data | Personal data that can no longer be attributed to a specific data subject without the use of additional information

Now that you’re familiar with the terms used in these regulatory acts, let’s get back to the subject of what makes personal data identifiable, and how we can interpret the term “identifiable” for the purpose of defining anonymization. Guidance from authorities is almost exclusively contextual and driven by risk assessments, attempting to balance the benefits of sharing data against an interpretation of anonymization that will sufficiently reduce identifiability to appropriately manage risks. We won’t go through the various guidance documents available. Our previous work has helped influence guidance, and this book has been influenced by that guidance as well. We’re all in this together! Let’s consider various interpretations that have been put forward on what constitutes identifiable information, as shown in Table 1-2.

Table 1-2. Conditions on identifiability from various authorities (in alphabetical order)
Authority | Definition of identifiability
California Consumer Privacy Act (US) | Directly or indirectly relates to or could reasonably be linked to a particular consumer or household
Federal Court (Canada) | Serious possibility that an individual could be identified through the use of that information, alone or in combination with other information
GDPR (EU) | Identifiability is defined by the “means reasonably likely to be used” to identify a data subject, taking into consideration objective factors (such as the cost and time required to identify)
HIPAA (US) | Reasonable basis to believe the information can be used to identify an individual; not identifiable if an expert certifies that the risk of re-identification is “very small”
Illinois Supreme Court (US) | Not identifiable if it requires a highly skilled person to perform the re-identification
Office of the Privacy Commissioner of Canada | “Serious possibility” means something more than a frivolous chance and less than a balance of probabilities

As you can see from the table, authorities don’t usually provide explicit measures of identifiability. It’s more typical to find legal language than scientific norms in privacy laws and regulations, even when these terms are less than clear.7 Thankfully, guidance and scientific norms are available from experts, some of which we will reference and draw on. We can at least divide identifiability into three well-known states.

States of Data

We mentioned the identifiability spectrum in the Preface. It is influenced by how authorities define personal data, as well as by various sections in regulations and their interpretations and guidance. The identifiability spectrum is determined by accounting for:8

  • The identity of the data recipient (so that we know who is accessing the shared data)

  • Contractual controls (so that the data recipient knows their legal obligations)

  • Privacy and security controls (so that limits are imposed on accessing the shared data, and on the data recipients themselves)

  • Transformations of identifying information (which limit re-identifications even if the data recipients attempted to do so)

This book has been organized around a few points along the identifiability spectrum based on three main states of data: identified, pseudonymized, and anonymized. These are shown in Figure 1-1, and described in detail below.

Figure 1-1. The well-established states of data used to build anonymization pipelines.
Identified

We use this term to mean that there is directly identifying information in the data, such as names or addresses. We make a slight distinction between identified and identifiable. An individual in a data set is identifiable if it is reasonable to expect that the individual could be identified, either with the data already immediately available or in combination with other information (external or known to the attacker). Many points along the spectrum will be considered identifiable, and therefore personal. But identified means the identity is known and associated with the data, which is often the case when delivering a service to an exact person. Identified data carries the most risk, and the most privacy and data protection obligations.

Pseudonymized

The term pseudonymization was popularized with the introduction of the GDPR. Technically speaking, when pseudonymizing, the directly identifying information doesn’t need to be replaced with a pseudonym; it could just as well be replaced with a token or fake data, or even suppressed entirely. The legal term pseudonymization simply means that direct identifiers have been removed in some way, as a data protection mechanism. Any additional information required to re-identify is kept separate and is subject to technical and administrative (or organizational) controls. This is how we will use the term pseudonymized in this book, while considering additional data transformations or controls that can reduce the legal obligations of working with personal data. Although pseudonymized data is therefore no longer identified data, it is still identifiable data.
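To make this concrete, here is a minimal sketch of pseudonymization using Python’s standard library, with a hypothetical record and field names. The direct identifier is replaced with a keyed token, and the key needed to reverse the mapping is assumed to be stored separately, under its own technical and administrative controls.

```python
import hashlib
import hmac
import secrets

# Assumption: the secret key is generated once and stored separately
# from the pseudonymized data (e.g., in a key management service),
# subject to its own technical and administrative controls.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymize(direct_identifier: str) -> str:
    """Replace a direct identifier with a keyed token (pseudonym)."""
    return hmac.new(SECRET_KEY, direct_identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

record = {"name": "Alice Example", "age": 34, "county": "Shire"}

# The direct identifier is removed; indirect identifiers (age, county)
# remain untouched.
pseudonymized_record = {
    "subject_token": pseudonymize(record["name"]),
    "age": record["age"],
    "county": record["county"],
}
print(pseudonymized_record)
```

The age and county are left as-is, which is exactly why the result is still identifiable, and therefore still personal, data.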

Anonymized

Anonymization is the process of removing direct and indirect identifiers for a given data sharing model, providing reasonable assurance that data is nonidentifiable. Anonymized data is therefore considered in the context of a data sharing scenario. Anonymization must be legally defensible—that is, it needs to meet the standards of current legal frameworks, and be presentable as evidence to governing bodies and regulatory authorities (i.e., data protection and privacy commissioners), to mitigate exposure and demonstrate that you have taken your responsibility toward data subjects seriously. Technically, “removing” indirect identifiers can mean various forms of generalization, suppression, or randomization, all of which will be determined by the relevant threats and preferred mitigation strategy to ensure data remains useful for the analytic needs.
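As an illustration only, not a complete anonymization method, the following sketch shows what generalization and suppression of indirect identifiers can look like on a small, hypothetical table, using pandas.

```python
import pandas as pd

# Hypothetical pseudonymized records; age and county are indirect
# identifiers that still need to be transformed.
df = pd.DataFrame({
    "subject_token": ["a1", "b2", "c3", "d4"],
    "age": [34, 37, 62, 89],
    "county": ["Shire", "Shire", "Bree", "Bree"],
})

# Generalization: replace the exact age with a five-year band.
df["age_band"] = pd.cut(df["age"], bins=range(0, 105, 5), right=False)

# Suppression: drop the exact value once the generalized one exists.
df = df.drop(columns=["age"])

# Randomization would go further still (e.g., adding noise); the choice
# depends on the relevant threats and the analytic needs.
print(df)
```

In practice, the choice of bands, and whether to suppress or randomize instead, would be driven by the identifiability measures and threshold discussed later in this chapter.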

Warning

The terms anonymization and de-identification are used interchangeably by some people, organizations, or even jurisdictions, but be careful as de-identification is sometimes used interchangeably with pseudonymization as well! Interpretations of each will also vary, in some cases in very substantial ways. We will consider a variety of general considerations throughout this book that should help explain most of these nuances, at least when definitions and guidance are considered more closely.

Other terms will be introduced throughout the book, but these are the ones that you need to start reading. Many of the terms we’ve just introduced necessarily include some discussion of regulations, so this section has served to introduce terms and regulations, at least to some degree. We will describe regulations where needed as you move through the book, in order to explain a concept or consideration. The next section will delve deeper into regulatory considerations as they relate to the process of anonymization.

Anonymization as Data Protection

There has been, and will continue to be, considerable debate around the term “anonymous,” often focusing on the output of anonymization alone (i.e., can someone be reasonably identified in the data, and what’s considered “reasonable”).9 We shouldn’t lose sight of the fact that anonymization is a form of data protection, and thus is privacy enhancing. To be effective at enhancing privacy, anonymization needs to be used, and that means it also needs to be practical and produce useful data. Barriers that discourage or limit the use of anonymization technology will simply drive organizations to use identified data, or simply not innovate at all. There are many benefits that can be extracted from sharing and using data, so let’s make sure that it’s done responsibly.

We keep mentioning the need to produce “useful data” from the process of anonymization. There is a reality here that we can’t escape, something we called the Goldilocks Principle in our previous book. The Goldilocks Principle is the idea that we need to balance risk against benefits, and in this case the benefits are the utility of the data and what insights may be drawn from it. It is possible to achieve a win-win situation by producing data that both serves a purpose and protects the identity of data subjects. But as data geeks, we have to be up front in saying that there is no such thing as zero risk. When we cross the road and look both ways, we are taking a measured risk. The risk we take when we cross the road can be quantified, but it’s statistical in nature and never zero unless we never cross the road. Yet, we cross roads every day of our lives. We consider probable risks, and aim to achieve very low probabilities.

Consider the rock and hard place we are caught between. In a data sharing scenario in which we wish to achieve private data analysis, there will always be a sender (the data custodian) and a recipient (the data analyst). But the recipient is also deemed an eavesdropper or adversary (using standard security language, in this case referring to an entity or individual that may re-identify data, whether intentionally or not, thus adding risk to the process). Compare this with encryption, in which the recipient gets to decrypt and gain access to the original data shared by the sender. The recipient in the encryption example is not considered an adversary, because the intended recipient is supposed to decrypt the data. Not so with anonymization. The intended recipient of anonymized data should be unable to re-identify the data, and if they can, it’s a problem.

Note

Our goal in anonymizing data is to balance the needs of the recipient by providing them with useful data, while minimizing the ability of an adversary, including the recipient, to extract personal information from the data. The dual role the recipient plays, as both an eventual user of the data and a potential adversary, is what distinguishes anonymization from encryption (in which the adversary and recipient are mutually exclusive), and what makes producing useful and safe data so challenging.

A more practical and realistic approach than striving for zero risk is to focus on the process of minimizing risk, considering anonymization as a risk-management process. This is the approach taken, for example, by the HITRUST Alliance, which provides a framework allowing organizations to meet the privacy requirements of multiple regulations and standards.10 This is also the approach taken in data security, which is largely process based and contextual. We call this risk-based anonymization, which in our work has always included process- and harm-based assessments to provide a holistic approach to anonymization.11 This approach informs the statistical estimators of identifiability and data transformations that are applied directly to data. Guidance on the topic of anonymization is almost always risk based, providing a scalable and proportionate approach to compliance.

Warning

If personal data is pseudonymized, or falls short of being considered anonymized, subsequent uses of the data must still be compatible with the original purpose for the data collection, and may require an additional legal basis for processing. Either way, pseudonymization reduces identifiability in data. We will therefore also consider methods to reduce identifiability that may fall short of anonymization, because they are both useful in their own right and are likely to build toward anonymization. We need to understand all the tools at our disposal.

We’ll explore the idea of risk-based anonymization a little later in the chapter, but first we need to understand what data subject approval or consent involves and why laws or regulations don’t typically require them for secondary uses of data.

Approval or Consent

As a form of data protection, anonymization itself does not normally require the approval of data subjects, although transparency is recommended and possibly required in some jurisdictions. As with other forms of data protection, anonymization is being done on behalf of data subjects, to remove the association between them and the data. We use the term approval here rather than consent because under the GDPR, consent is more restrictive than in other jurisdictions (i.e., it must be “freely given, specific, informed, and unambiguous,” with additional details and guidance around the interpretation).

Getting approval of data subjects can be extremely difficult and impractical. Imagine asking someone going to a hospital for treatment whether they would allow their data to be anonymized for other purposes. Is it even appropriate to be asking them when they are seeking care? Would some people feel pressured or coerced, or answer in a reactive way out of frustration or spite? It would be different in other scenarios, where the stakes aren’t as high and the information not as harmful or sensitive. But timing and framing are important.

At the other extreme, approval to anonymize could be sought days, months, perhaps even years later. This could make for awkward situations when data subjects have moved on and acquaintances are asked for contact information. These acquaintances may not be on speaking terms with the data subjects or may be reluctant to share their contact information. Or the data subjects concerned may even be deceased. Contacting thousands of individuals for their approval is likely to be impractical, and unlikely to be fruitful.

But let’s assume data subjects are reachable. Some privacy scholars have argued that approval can be meaningless, either because the approval request is presented in impenetrable legalese, or because data subjects don’t understand the implications or simply don’t want to be bothered. Depending on how the approval is structured, they may give approval just to get access to something being offered, or elect not to be found and select the opt-out option. How is this preserving privacy?

In contrast, imagine a process in which approval is entirely voluntary and not required in exchange for a service. Government and the private sector would be forced to issue a potentially endless stream of requests to anonymize data for every use case and every service, hoping to improve operations or innovate. They would burden individuals with requests, to the point where individuals would simply ignore all requests. The concept of priming also suggests that even when cool heads prevail, people often only think about privacy when it’s brought to their attention. They become sensitive to the topic because they are now thinking of it, and perhaps unnecessarily so. Opt-in would be rare, even when opting in would benefit the data subjects themselves or a broader population.

The reality is that specific sectors or use cases may see different rates of approval. Certain socioeconomic groups may be more sensitive to privacy concerns, and services and insights would become biased toward specific groups. Allowing data to be anonymized without requiring opt-in, provided the process meets guidance or standards, ensures that nonpersonal data is available to improve services and derive new insights. This is why regulations offer alternatives to approval, and focus on much more than the process of reducing identifiability. Which leads us to a discussion of purpose specification, which is of critical importance to regulators.

Purpose Specification

Debate regarding anonymization usually arises when data is shared for purposes other than those for which the data was originally collected, especially since approval by data subjects is not normally required once the data is anonymized. Although the process of anonymization is important, the uses of anonymized data are what concern people. There have been too many examples of data misuse, in which people felt discriminated against or harmed in some way, although interestingly most of these cases probably involved identified data. Anonymization will not solve data misuse, although it can help mitigate concerns.

Personal data may, for example, be collected from banking transactions, but that personal data is then anonymized and used to generate insights, e.g., to determine age groups that use a banking app versus an ATM, and at what times and on what weekdays. Such data-driven insights from nonpersonal data can improve services based on current usage patterns for different age groups. Some people may take issue with this form of targeting, even when the intent is to improve services by age group. All organizations have to make decisions to ensure the return on investment is reasonable, otherwise they will cease to exist, and this will inevitably mean making trade-offs. However, if the targeting touches on sensitive demographic groups, it will enter the realm of ethical considerations, even for anonymized data. This is especially true with sensitive data in general, such as health data.

If data is to be used for other purposes, for which approval of data subjects is not explicitly sought, the organization using the data should reflect carefully to ensure that its use of the data is appropriate. Specifically, harms should be considered in the broader context of ethical uses of data, which we’ll discuss in more detail in later chapters. Although this may be deemed unrelated to anonymization, the reality is that it could set the tone for how a risk management approach to anonymization is evaluated. We consider framing anonymization within the broader context of data protection.

Reducing identifiability to a level in which it becomes nonpersonal is, by its very nature, technical, using a blend of statistics, computer science, and risk assessment. In order to engender trust, we must also look beyond the technical, and use best practice in privacy and data protection more broadly. Consider making the case for using anonymized data based on the purposes for which it will be used. For example, we can take a page from EU privacy regulations and consider “legitimate interests” as a way to frame anonymization as a tool to support the lawful and ethical reuse of data. That is, a data sharing scenario can consider how reusing the data (called “processing” in the regulatory language of GDPR) is legitimate, necessary, and balanced, so that it’s found to be reasonable for the specified purposes.

Legitimate

Data reuse should be something that is done now or in the very near future. The interests in reusing the data can be commercial, individual, or societal, but the reuse should avoid causing harm. It should also be possible to explain those interests clearly, and the reuse should seem reasonable in the hypothetical case explained to individuals.

Necessary

Data reuse should be somewhat specific and targeted to the use case, and minimized to what is required to meet the objectives that have been laid out in advance. Overcollection will be frowned upon by the public, so it’s best to ensure that needs are well laid out. Again, imagine the hypothetical case of trying to explain the reuse of all that data to individuals.

Balanced

Data reuse should have well-articulated benefits that outweigh residual risks, or data protection or privacy obligations. Consider potential negative impacts and how they can be mitigated. A form of risk–benefit analysis can help inform and support the choice of mitigation strategies. Hint: reduce identifiability!

Anonymization can help address two of the three requirements listed above: it can more clearly limit the data to what is necessary, at least in terms of information that may be identifiable, and it can make the data more favorably balanced toward the beneficial side by reducing the risks of reusing the data. This leaves the legitimacy of reuse to be explained. Anonymization will help ensure that only necessary data is used and will help the benefits of reuse outweigh the potential harms. But how the anonymized data is used needs to be considered to ensure it is appropriate.

Now this isn’t to say that we need to make the case for “legitimate interests” to use anonymized data, since being anonymized means that data protection laws and regulations no longer apply. What we are suggesting is that the privacy considerations above can help “legitimize” that use. We are simply drawing from some best practices to help frame the conversation and, ultimately, the reporting that takes place to explain anonymization.

Re-identification Attacks

To better understand the need for proper anonymization methods, let’s consider a few well-known examples of re-identification attacks in which the anonymity of data subjects was compromised. There is a small set of such attacks that are repeated at conferences, in academic publications, and by the media, often in an attempt to raise awareness around the field of anonymization. As in any scientific discipline, these data points serve as evidence to inform and evolve the field (and where there isn’t evidence, the field relies on scientific plausibility). They are what we call demonstration attacks, because they serve to demonstrate a potential vulnerability, although not its likelihood or impact. Demonstration attacks target the most “re-identifiable” individual to prove the possibility of re-identification. They are a risk in public data sharing, since there are no controls, and the attacker can gain notoriety for a successful attempt.

These well-known and publicized re-identification attacks were not attacks on what we consider to be anonymized data; the data would also not have been considered anonymized by experts in the field of statistical disclosure control (the field defined by decades of expert advice at national statistical organizations). Although the methods of statistical disclosure control have existed for decades, they were predominantly applied to national statistics and in government data sharing. Let’s consider a handful of demonstration attacks and the lessons we can extract.

AOL search queries

In 2006, a team at AOL thought it would be of value to researchers in natural language processing (a field that develops algorithms in computer science to understand language) to share three months of web searches—around 20 million searches by 657,000 pseudonymous users. AOL made the data publicly available, and it can still be found on the computers of researchers around the world and probably on peer-to-peer networks, even though AOL removed the search data from its site shortly after the release when a New York Times reporter published a story after having identified user 4417749.12

User 4417749’s searches included “tea for good health,” “numb fingers,” “hand tremors,” “dry mouth,” “60 single men,” “dog that urinates on everything,” “landscapers in Lilburn, GA,” and “homes sold in Shadow Lake subdivision Gwinnett County Georgia.” Pay close attention to the last two searches. Geographic information narrows the population in a very obvious way, in this case allowing a reporter to visit the user’s neighborhood and find a potential match. And this is how Thelma was found from the search queries.13

What’s more, others claimed they were able to identify people in the search data. Many search queries contained identifying information in the form of names based on vanity searches (in which you search for yourself to see what’s publicly available), or searches of friends and neighbors, place-names associated with a home or place of work, and other identifiers that could be used by pretty much anyone since the search data was public. And of course the searches also included sensitive personal information that people expected would be kept private. It’s a good example of the risks associated with sharing pseudonymous data publicly.

Netflix Prize

Again in 2006, Netflix launched a data analytics competition to predict subscribers’ movie ratings based on their past movie ratings. Better algorithms could, in theory, be used to provide Netflix users with targeted film recommendations so that users stay engaged and keep using the service. The competition was open to pretty much anyone, and by joining, participants would gain access to a training set of 100,480,507 ratings for 17,770 movies by 480,189 subscribers. Each rating in the training set included a pseudonym in place of the subscriber name, the movie name, the date of the rating, and the rating itself.14

A group of researchers demonstrated how they could match a few dozen ratings to the Internet Movie Database (IMDb), using a robust algorithm that would attempt to optimize the matches.15 They were limited to a few dozen ratings due to a limit imposed by the IMDb terms of service. They hypothesized that when Netflix users also rated movies on IMDb, the two sets of ratings would strongly agree with each other. The researchers claimed that subscribers in the Netflix dataset were unique based on a handful of ratings outside the top 500 movies and approximate rating dates (+/-1 week), and that they had found two especially strong candidates for re-identification. Based on the matching between the public IMDb movie ratings and the Netflix movie ratings, the researchers claimed to be able to infer political affiliation and religious views of these re-identification candidates by considering the nonpublic movies viewed and rated in the Netflix data.

Whether an adversary could know this level of detail, and confirm that their target was in the sample dataset, is debatable. However, given an appropriate database with names and overlapping information, the algorithm developed may be effective at matching datasets. It’s hard to know from a demonstration attack alone if this is the case. However, in the case of mobility traces, in which geolocation points are connected to create a path, researchers found the matching algorithm to have a precision of about 20%, given their overlapping data from the same population, even though they had found that 75% of trajectories were unique from 5 data points.16
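The researchers’ algorithm is considerably more robust than anything we can show here; the following is only a toy sketch of the general matching idea, using made-up ratings and hypothetical user tokens. Candidate records are scored by how many of a target’s public ratings they match on movie, score, and approximate date.

```python
from datetime import date, timedelta

# Hypothetical public ratings attributed to a known person (e.g., from a
# public review site), as (movie, stars, date) tuples.
public_ratings = [("Obscure Film A", 5, date(2005, 3, 2)),
                  ("Obscure Film B", 2, date(2005, 6, 18))]

# Hypothetical pseudonymized ratings from a released dataset.
released = {
    "user_0000001": [("Popular Film", 4, date(2005, 1, 1))],
    "user_0000002": [("Obscure Film A", 5, date(2005, 3, 4)),
                     ("Obscure Film B", 2, date(2005, 6, 20))],
}

def score(candidate, target, date_tolerance=timedelta(days=7)):
    """Count target ratings matched on movie, stars, and approximate date."""
    hits = 0
    for movie, stars, when in target:
        for c_movie, c_stars, c_when in candidate:
            if (movie == c_movie and stars == c_stars
                    and abs(when - c_when) <= date_tolerance):
                hits += 1
                break
    return hits

best = max(released, key=lambda user: score(released[user], public_ratings))
print(best, score(released[best], public_ratings))
```

Even this toy version shows why ratings of obscure titles, combined with approximate dates, can single out a pseudonymous record when overlapping public data exists.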

State Inpatient Database

Both the AOL and Netflix examples involved data sets in which pseudonyms had replaced usernames. Let’s consider a different example, in which not only were names removed, but some information was also generalized, e.g., the user’s date of birth was replaced by their age. For this we can turn to the Healthcare Cost and Utilization Project (HCUP), which shares databases for research and policy analysis. In 2013, the State Inpatient Database (SID) of Washington State from 2011 was subject to a demonstration attack using publicly available news reports. Privacy experts had warned that these databases required additional protection, and since the demonstration attack, multiple improvements have been introduced.

In this attack, a team searched news archives for stories about hospital encounters in Washington State. One included a 61-year-old man, Raymond, from Soap Lake, who was thrown from his motorcycle on a Saturday and hospitalized at Lincoln Hospital. Raymond was re-identified in the SID based on this publicly available information, and from this, all his other hospital encounters in the state that year were available since the database was longitudinal.

A total of 81 news reports from 2011 containing the word “hospitalized” were collected from news archives, and 35 patients were uniquely identified in the SID of 648,384 hospitalizations.17 On the one hand, you could argue that 35 individuals out of 81 news reports is a significant risk, provided there’s public reporting of the hospitalization; on the other hand, you could argue that 35 individuals out of 648,384 hospitalizations is a very small number given the benefits provided from sharing the data. Regardless, public sharing is challenging given the risks of a demonstration attack, whereas controls can dramatically reduce the likelihood of such incidents. More important, however, is what we learn about information that can potentially be used to identify an individual, and how this information can be used to properly measure identifiability.

Lessons learned

We need to distinguish between what is possible and what is probable, otherwise we would spend our lives crossing the street in fear of a plane falling on our heads (possible, but not probable). Demonstration attacks are important to understand what is possible, but they don’t always scale or make sense outside of a very targeted attack on public data, where there are no controls on who has access and what they can do with the data. Our focus in this book is primarily on nonpublic data sharing, and how we can assess identifiability based on the context of that sharing.

Let’s draw some lessons from these demonstration attacks.

  • Pseudonymized data, in which names and other directly identifying information have been removed, is vulnerable (which is why it is considered personal data).

  • Data shared publicly is at risk of demonstration attacks, the worst kind since it only takes one re-identification for attackers to claim success. Notoriety is an important motivator for attackers, leading them to publish their results.

  • Contractual controls can discourage attempts at a demonstration attack (e.g., the IMDb terms of service), but will not be sufficient to eliminate all attacks. Additional controls and data transformations will be required.

With these lessons in mind, we can now make a distinction between re-identification attacks and what should constitute proper anonymization. None of the previous examples were of anonymized data in the sense that regulators use the term. We are now better positioned to discuss anonymization as it should be practiced.

Anonymization in Practice

Let’s turn our attention to what we mean by the term risk based, since we’ve used this term a few times already. An evaluation of risk implicitly involves careful risk assessments, to understand more precisely where there is risk and what the impact of different mitigation strategies might be. This drives better decisions about how to prioritize and manage these risks. The process also means that risk is evaluated in an operational context, using repeatable and objective assessments to achieve our data sharing goals.

We take a very scientific approach to anonymization. Besides being evidence based, so that the approach is reasonable and adapts to a changing threat landscape, we determine a statistical tolerance using a threshold that is independent of how we measure identifiability, and use that threshold to provide reasonable assurance that data is nonidentifiable. Based on risk assessments that evaluate the context of a data sharing scenario, we compare identifiability measures against the threshold and transform identifying information until the statistical measure of identifiability meets the predefined threshold. We will describe this process in detail in the following chapters, but an overview of the process can be seen in Figure 1-2 (which can be iterative).

Figure 1-2. Quantitatively evaluating identifiability can be iterative and will drive the transformation of identifying data.

The threshold itself is a probability derived from benchmark cell-size rules, which determine the minimum number of individual contributions that must be included in any aggregation of data. Consider what happens when entries are grouped by identifying categories (known as a contingency table or cross tabulation in statistics). A simple example would be age and county, where the cell-size rule could be 10 (or a probability threshold of 1/10). This would mean that there need to be at least 10 people in any one combination of age and county in the identifying information. So for the tabulation of age 30 and county of Shire, there would need to be at least 10 people aged 30 in the Shire. Our measures of identifiability are more complex than this, taking into consideration the context of data sharing and also the complexity of data, but this gives you a conceptual understanding of what we mean.
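As a minimal sketch of this idea, with hypothetical data and a deliberately simplified measure, the following code groups records by age band and county, treats the smallest cell as the worst case, and keeps widening the age bands until the implied probability meets the 1/10 threshold, echoing the iterative loop in Figure 1-2.

```python
import numpy as np
import pandas as pd

# Hypothetical identifying information: exact age and county for 200 people.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=200),
    "county": rng.choice(["Shire", "Bree"], size=200),
})

THRESHOLD = 1 / 10  # a cell-size rule of 10, expressed as a probability

def max_identifiability(frame: pd.DataFrame, band_width: int) -> float:
    """1 / size of the smallest (age band, county) cell."""
    age_band = (frame["age"] // band_width) * band_width
    cell_sizes = frame.groupby([age_band, frame["county"]]).size()
    return 1 / cell_sizes.min()

# Iteratively generalize (widen the age bands) until the simplified
# measure meets the threshold.
band_width = 5
while max_identifiability(df, band_width) > THRESHOLD:
    band_width *= 2

print(f"Age bands of {band_width} years satisfy the cell-size rule of 10.")
```

A real measure of identifiability would also account for the context of sharing, not just the smallest cell, as discussed above.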

Contrast this benchmark approach with a fixed, list-based approach that prescribes the data transformations required. HIPAA, mentioned earlier in the chapter, includes in its Privacy Rule a method known as Safe Harbor that uses a fixed list of 18 identifiers that need to be transformed.18 This list includes many directly identifying pieces of information that need to be removed, such as name and Social Security number. Individual-level dates must be limited to year only, and the method also places limits on the accuracy of geographic information. Regardless of context, regardless of what data is being shared, the same approach is used.

The only saving grace to the HIPAA Safe Harbor approach is a “no actual knowledge” requirement that has been interpreted to be a catchall to verify that there are no obvious patterns in the data that could be used to identify someone, such as a rare disease. Although the Safe Harbor approach is simple, it does not provide very robust privacy protection and is only really useful for annual reporting. Also note that it’s only suitable under HIPAA, as it was derived using US census information, and no other jurisdictions have provisions in their regulations to use this specific list.
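For contrast with the benchmark approach, here is a minimal sketch of the kind of fixed, list-based transformation Safe Harbor prescribes, applied to a hypothetical record. It covers only a few of the 18 identifier categories; the HHS guidance cited earlier remains the authoritative reference.

```python
from datetime import date

# Hypothetical identified record.
record = {
    "name": "Alice Example",
    "ssn": "123-45-6789",
    "birth_date": date(1958, 7, 14),
    "zip_code": "98851",
    "diagnosis": "I10",
}

def safe_harbor_like(rec: dict) -> dict:
    """Apply a few Safe Harbor-style transformations (illustrative only)."""
    out = dict(rec)
    # Direct identifiers such as names and Social Security numbers are removed.
    out.pop("name", None)
    out.pop("ssn", None)
    # Individual-level dates are limited to the year.
    out["birth_year"] = out.pop("birth_date").year
    # Geographic accuracy is limited; the guidance allows only the first
    # three ZIP digits, with further conditions for sparsely populated areas.
    out["zip3"] = out.pop("zip_code")[:3]
    return out

print(safe_harbor_like(record))
```

Notice that nothing in this transformation depends on the context of sharing or on the rest of the data, which is precisely the limitation of a list-based approach.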

Another approach to anonymization involves heuristics, which are rules of thumb derived from past experience, such as what transformations to apply based on specific data or circumstances, and fixed cell-size rules. These tend to be more complicated than simple lists, and have conditions and exceptions. Buyer beware. The devil is in the details, and it can be hard to justify heuristics without defensible evidence or metrics. Heuristics may provide a subjective gut check that things make sense, but this will be insufficient in the face of regulatory scrutiny.

The purpose of a risk-based approach is to replace an otherwise subjective gut check with a more guided decision-making approach that is scalable and proportionate, resulting in solutions that ensure that data is useful while being sufficiently protected. This is why we described risk-based anonymization as a risk management approach. And one of the most important ways you can reduce risk in a repeatable way is through automation, as shown in Figure 1-3.

Figure 1-3. Automation means replacing a gut check with repeatable processes and auditable proof of what was done.

Creating automated risk management processes, in general, ensures that you capture all necessary information without missing anything, with auditable proof of what was done in case an issue arises that you need to correct for next time. This book will help you find areas for automation by introducing opportunities for technology-enabled processes in particular to reduce identifiability and build anonymization pipelines.

Final Thoughts

Recognizing that identifiability exists on a spectrum presents us with opportunities to meet privacy and data protection obligations in a multitude of use cases. Building an anonymization pipeline is not a linear path (pun intended!) from identified to anonymized. There are multiple points along this spectrum, as well as many criteria and constraints we need to consider to get to a solution that meets the needs of all parties and stakeholders involved. Whereas it is possible to have single-use anonymization of data to meet a specific need, this book takes a much broader view to consider how systems can be engineered with business and privacy needs in mind.

Whereas this isn’t a book about privacy laws and regulations, we need to understand the basics as they relate to anonymizing personal data. Hopefully, the brief introduction in this chapter will inspire you to learn more.19 But we will also highlight points of concern or confusion as they arise throughout the book. Importantly, we leverage three well-established states of data—identified, pseudonymized, and anonymized—to engineer anonymization pipelines.

There are many concerns with the practice of anonymization, so it pays to remember that anonymization is privacy preserving, and understand what those concerns are so that we can address them. By being more clear about the purposes for using anonymized data, we are better positioned to ensure responsible sharing and use. Whereas there have been many reported re-identifications, such “demonstration” attacks only serve to underline the importance of using generally accepted statistical or scientific principles and methods to properly anonymize data. And this book will provide strategies to properly anonymize data for a range of use cases, once we’ve understood the identifiability spectrum in more depth, and a practical risk-management framework.

1 For an excellent summary of the identifiability spectrum applied across a range of controls, see Kelsey Finch, “A Visual Guide to Practical De-Identification,” Future of Privacy Forum, April 25, 2016, https://oreil.ly/siE1D.

2 Generally speaking, laws are written by a legislative assembly to codify rules, and regulations are written by administrative agencies and departments to put these rules into practice. Both are enforceable.

3 HIPAA applies to health care providers, health care clearinghouses, and health plans (collectively known as covered entities), as well as their business associates. Health data that does not fall into these categories is not covered.

4 Details and expectations are provided by Office for Civil Rights, “Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule,” Department of Health and Human Services, 2015, https://oreil.ly/OGxxa.

5 To understand “risk based” in an EU context, see Article 29 Data Protection Working Party, “Statement on the Role of a Risk-Based Approach in Data Protection Legal Frameworks,” October 4, 2017, https://oreil.ly/A3Tpd.

6 See, for example, the motivation described in Simson Garfinkel, “De-Identification of Personal Information,” NISTIR-8053, National Institute of Standards and Technology, October 2015, https://oreil.ly/ebsSD.

7 An interesting example of how some of these terms are interpreted is provided in Andrew Mauboussin and Michael J. Mauboussin, “If You Say Something Is Likely, How Likely Do People Think It Is?” Harvard Business Review, July 3, 2018, https://oreil.ly/bdiIi.

8 Many of these are described in Khaled El Emam et al., “Seven States of Data: When Is Pseudonymous Data Not Personal Information?” Brussels Privacy Symposium: Policy and Practical Solutions for Anonymization and Pseudonymization, 2016, https://oreil.ly/Nn925.

9 For a good discussion of the debates about anonymization, and different viewpoints, we recommend everyone read Ira Rubinstein and Woodrow Hartzog, “Anonymization and Risk,” Washington Law Review 91, no. 2 (2016): 703–60, https://oreil.ly/Yrzj6.

10 “HITRUST De-Identification Framework,” HITRUST Alliance, accessed March 28, 2020, https://oreil.ly/wMxdF.

11 Sophie Stalla-Bourdillon and Alison Knight, “Anonymous Data v. Personal Data—A False Debate: An EU Perspective on Anonymization, Pseudonymization, and Personal Data,” Wisconsin International Law Journal 34, no. 2 (2017): 284-322, https://oreil.ly/wctgn.

12 Michael Barbaro and Tom Zeller Jr., “A Face Is Exposed for AOL Searcher No. 4417749,” The New York Times, August 9, 2006, https://oreil.ly/CnIBY.

13 AOL search data leak.

14 Netflix Prize.

15 The results are described in Arvind Narayanan and Vitaly Shmatikov, “Robust De-Anonymization of Large Sparse Datasets,” Proceedings of the 2008 IEEE Symposium on Security and Privacy (2008): 111-125, https://oreil.ly/y0oec.

16 Details can be found in Huandong Wang et al., “De-Anonymization of Mobility Trajectories: Dissecting the Gaps Between Theory and Practice,” 25th Annual Network and Distributed System Security Symposium (2018), https://oreil.ly/sq6NI.

17 Results are described in Latanya Sweeney, “Only You, Your Doctor, and Many Others May Know,” Technology Science, September 29, 2015, https://oreil.ly/0DTiH.

18 For details, see the de-identification guidance cited earlier by the Department of Health and Human Services.

19 An excellent resource to learn more about privacy and data protection, with newsletters, conferences, and courses, is the International Association of Privacy Professionals, https://iapp.org.
