Chapter 4. Identified Data

If you’re in the business of working with identified data, with people’s names, addresses, and other unique identifiers, you should already have the tools in place to protect that data. If you’re thinking of working with identified data, well, you have a lot of standards and the like to learn. We won’t be going through all of these, as that’s not the focus of this book. The Five Safes of risk-based anonymization we discussed in the last chapter have provided us with the contextual elements we will need to dig into to manage risk, in this case starting from the collection of identified data.

We want to provide you with some strategic privacy considerations in working with identifiable data, considerations that would fall within the realm of privacy engineering.1 It just so happens that identifiability will play a central role in that, as it does in privacy laws and regulations in general. There are other considerations we’ll explore, and we want to arm you with some basic tools and provide you with an understanding of how they interact with one another. Just being aware of them can help you in your design of systems that will process personal data (Privacy by Design!), or in updating systems to be more privacy friendly.

For many organizations, anonymization will start from their own store of identified data. Think of this as pushing data out, from identified to anonymized (with a detour through pseudonymization, but we’re keeping that for another chapter). This sharing of data may be to another department within the same organization or to an entirely different organization. These pose different challenges, which we’ll explore in this chapter. This should be a natural progression, from privacy engineering in general to anonymization more specifically.

Requirements Gathering

As with any engineering project, we start with requirements gathering, but, in our particular context, privacy-related requirements gathering. This will primarily involve three broad categories: use cases, data flow and data use, and data and data subjects. Evaluating these three categories will help tease out the wants and needs from a privacy perspective, and a series of probing questions can be used to better understand the details of those wants and needs. Not all these questions need be asked and answered, but they can form the baseline of what needs to be understood to gather privacy-related requirements and define privacy objectives.

Tip

Many of the privacy considerations we will work through as requirements engineering can be motivated by the process of a privacy impact assessment, or risk assessments in general. Although often done at the end of a design cycle, the criteria found in these assessments should make their way into design as more granular expectations of privacy design. This way we leverage accepted standards and frameworks to inform the design process.

Let’s work through this as though we were actually working to design a system (either from scratch or a retrofit). It could be our design or someone else’s; we just want to work through the steps to capture as much detail as needed in the project definition phase. We will not be delving into system design and development, and will leave the implementation to another book. Some aspects will certainly require a privacy policy or legal analysis, but we can’t account for all privacy laws and regulations the world over, or the shifting privacy landscape, and we consider some of this to be material for the implementation phase.

Use Cases

For our use case, we’re going to attempt to understand how a system will be used so that we can scope the privacy issues and possible solutions. Evaluating use cases is critical to that understanding, so this is where we most often start gathering requirements. Thinking back to our discussion of purpose specification in “Safe Projects”, which you can think of as a concept definition in an engineering project, we should have a general idea of what sort of system we’re envisioning. But at this stage we go deeper, as we want to understand the interactions with the system to determine where privacy protections could be put in place.

A use-case analysis in the area of privacy engineering can be focused around the three main objectives a system should strive for to demonstrate a desired level of trustworthiness (echoing what are known as the fair information practice principles):2

Predictable

It should be possible to predict how a system will behave. This means meeting accountability requirements by ensuring that interactions and outcomes are expected. This can be achieved through purpose specification and use limitation, and a degree of transparency through forms of notice that will be provided to, or approval sought from, data subjects so that they can also predict what will happen with data about them.

Manageable

All systems will require controls on how personal data is handled, from ingestion to internal processing and export. How manageable a system is will be determined by the granularity of administration in handling data in the defined use cases. This can be achieved by supporting alteration, deletion, or selective disclosure of personal data (and this can incorporate individual control, if desired).

Disassociated

The different use cases supported by a system will require different levels of identifiability and data minimization. Direct interaction with data subjects will require names or other directly identifying information, whereas in other use cases these may be replaced with tokens or pseudonyms. For analytic processing, identifiability may be further reduced to the point of being nonidentifiable (anonymized) data.

The purpose of these objectives, summarized in Figure 4-1, is to meet the needs of more detailed privacy principles, be they enshrined in privacy laws/regulations or not, with measurable controls. We provide the above objectives to get you on your way to understanding the basics of privacy engineering. Notice, however, that one of these objectives (to disassociate individuals or groups from the data) is that of identifiability! As we said, this is a core element of privacy. Although the objectives of the system being predictable and manageable are broader, they can also be seen as supporting the objective of having data subjects disassociated from the data.

images/privacy_triad.png
Figure 4-1. Our objectives in implementing measurable controls can be summarized by the privacy-engineering triad.

Of course, this is not enough information to drive a conversation and tease out privacy requirements, so we provide a series of “Probing Questions to Understand Use Cases”. Better yet, you can think of these as opportunities to integrate privacy into your systems. Whether you’re at the start or end of the design and development phase, or even revisiting a system in light of privacy and trust considerations, these questions can help you get to the bottom of what your system truly needs to operate successfully while bringing privacy to the forefront of your operations.

The privacy-engineering objectives should each contribute in some way to enhancing privacy, but it’s not all or nothing. The point is to find a balance between the objectives that is driven by the use case, as shown in Figure 4-2. In this example, the right balance was found with less need to be predictable but more need for manageable data even though the data is largely disassociated from data subjects. This is why use-case analysis is so important, to tease out wants and needs.

images/privacy_triad_balancing.png
Figure 4-2. Meeting the privacy-engineering objectives is a balancing act driven by the wants and needs identified by the use case.

Say the use case is to produce aggregate reports that are used internally. By their very nature, these reports have limited use since the data is just a summary of information. The aggregated information may be sufficiently disassociated from data subjects that there isn’t much need for a system using these aggregate reports to be predictable, by specifying purpose and use limitation, since there is little chance they can be misused. However, aggregate information can still result in unwanted disclosures unless there are certain rules in place, such as avoiding any small aggregate counts (directly or by overlapping pieces of information). Therefore there may still be a need to ensure that a system using this information is manageable.
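To make the idea of avoiding small aggregate counts concrete, here is a minimal sketch in Python, assuming a pandas DataFrame and an illustrative minimum cell size of 5 (the right threshold would come out of your own risk assessment). It only handles directly small counts; protecting against disclosures from overlapping totals would require complementary suppression on top of this.

    import pandas as pd

    def aggregate_with_suppression(df, group_cols, min_count=5):
        """Count records by group_cols and suppress any cell below min_count."""
        counts = df.groupby(group_cols).size().reset_index(name="count")
        # Blank out small counts rather than dropping the rows, so the report
        # doesn't reveal which combinations were rare simply by their absence.
        counts.loc[counts["count"] < min_count, "count"] = None
        return counts

    # Hypothetical usage: monthly visit counts by department and age band
    # report = aggregate_with_suppression(visits, ["department", "age_band"])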

Data Flows

To go even deeper, and drive that conversation forward, we can discuss data flows in detail. We suggested a review of data flows in the context of the “Safe Projects”, with the purpose of identifying the data sharing scenarios as they specifically relate to identifiability. Data flows are really a continuation of our use case analysis. In our discussion of Safe Projects, we were primarily concerned with the criteria and constraints of a system, whereas now we are concerned with legal and ethical boundaries. We need to plan for the possibility of our different sharing scenarios from a legal perspective but also in order to design the appropriate sharing mechanisms:

Mandatory sharing

If law enforcement or public health officials require access to or copies of data, how will this be provided to them? There are also privacy laws and regulations that require that data subjects themselves have access to personal data, to know what is collected about them and give them rights to make corrections or amendments. Access may be interpreted broadly, and does not necessarily mean that data subjects have the ability to directly go into a system and make changes, which could be impractical and even damaging in some cases. Data subjects may also have a right to get copies of data about them in the name of data portability. The data will need to be identified in these cases.

Internal sharing

The use cases that have been planned or developed may require various forms of internal access, or even that copies be transferred to a different department or unit of the organization. Here we are assuming that the sharing of identifiable data is permitted, as a primary purpose that supports the interactions with data subjects, or secondary purposes that privacy laws and regulations allow. We’ll need to know who will have access and for what purpose to ensure the sharing is truly permitted, and whether the stated levels of identifiability described in the use case analysis are truly required, or whether greater degrees of dissociation would be acceptable given the concerns of privacy and trust.

Permitted sharing

Just because it’s allowed doesn’t mean you will want to share identifiable data. If you are sharing with a third party for a secondary purpose, ask yourself if data subjects would be surprised or upset with third-party access to information about them. This will depend on the third party, cultural norms, and, ultimately, trust (between data subjects and the third party, but also between data subjects and your organization). Transparency and anonymization will greatly improve the trust relationship.

Other sharing

Every other scenario in which personal data sharing is not expressly permitted by privacy laws or regulations will need anonymization. Your anonymization pipeline may start with any of the above scenarios, but under this scenario we are referring to anonymization that will ensure the data is no longer identifiable, so that it can be responsibly shared. Understanding data flows is still critical here, so that we can apply a risk-based approach that ensures the most granular and useful data is made available.

We provide a list of “Probing Questions to Understand Data Flows” to help tease out the necessary details needed to understand legal and ethical boundaries. You’ll notice questions involving geographic considerations, because privacy laws and regulations vary across the world, and you will need to consider cross-border data transfers and data localization laws or regulations (which require that personal data be hosted and remain within the country, unless properly anonymized to meet the highest standard of privacy protection).

Data and Data Subjects

We’ve figured out the use cases and data flows. Now we need to consider the data itself, and who is represented in that data. The type and structure of data may define the practicality of solutions to mitigate privacy risks, and various properties of data need to be understood to determine both identifiability and the potential invasion of privacy. This evaluation will also focus on who the data subjects are, and the expectations associated with processing of data about these subjects. Again we provide a series of “Probing Questions to Understand Data and Data Subjects” to help you through this process.

Data subjects

Consider the parameters or criteria for individuals being included in the data, and any information about other individuals that comes along for the ride, such as their relatives or neighbors. Where data subjects are from can change the legal requirements, especially if data about them was collected intentionally. For example, if a product or service is targeting a country other than the one in which the data is actually being stored, the privacy laws and regulations of the data subject’s country of residence will apply.

Warning

While it’s true that it may be difficult for regulators to enforce the extraterritorial reach of privacy laws and regulations, they have done so, and will continue to do so, as deemed necessary. Some laws and regulations have been designed with this in mind (e.g., the GDPR in the EU), while others have been interpreted as such by the courts. Either way, it’s best to stay on the right side of the law and plan accordingly.

Be sure to document details of how personal data was collected, stored, protected, and used. To be auditable and defensible requires documented proof of data protections to demonstrate the reasonable measures taken to meet legal obligations and respect the expectations of data subjects. One aspect of meeting those obligations is data minimization, which means understanding the data to be shared.

Structure and properties of the data

In considering the structure and properties of the data that’s collected or shared, we again need to consider the stakeholder wants and needs for a system (the mantra of requirements gathering). In designing a system with privacy in mind, we need to repeatedly review and ensure that the collection and sharing supports purpose specification. Otherwise it’s all too easy to slip into the habit of getting all the data that’s possibly available, which runs counter to the basic privacy principle of data minimization.

Tip

Only collect or share what you need, when you need it, for as long as you need it, and for the purposes that were specified. In the spirit of transparency, you will most likely be letting data subjects know that you’re using data for a specific purpose, and your system should stick to that purpose. But even if you’re not letting them know directly, it should be easy for them to understand what data a system is using based on what it does.

You may recall from the “Safe Outputs” that we use a subjective criterion to select a threshold. This same approach can be used to consider data collection and sharing in general, since it defines identifiability tolerance. Ultimately, if you’re collecting personal data, you want to reduce the potential to invade the lives of the people whose data you’re entrusted with. We repeat the categories we consider in this subjective assessment of risk tolerance, and provide some additional detail to consider in defining what data you truly need. These can form part of a privacy impact assessment.

Data sensitivity

Consider the level of detail that’s needed: how many variables of information, the granularity and precision of that data, how many domains of information will be collected, and whether those domains need to be joined, etc. Also consider the sensitivity of the information collected. Certain privacy laws or regulations single out certain categories of data as being particularly sensitive, such as health information, genetic or biometric data, race or ethnicity, political opinions or religious beliefs, and a person’s sexual activity or orientation.

Potential injury

Breach notification laws or regulations can provide an indication of how regulators set the bar on potential injury to data subjects if the data is lost or stolen, or processed inappropriately. You will also want to consider how such incidents may cause direct and quantifiable damages, and measurable injury to the data subject. And consider your ability, as an organization, to enforce contracts or data sharing agreements, for internal or external data sharing.

Appropriateness of approval

Data subjects can provide approval to participate, implicitly or explicitly, in the collection and sharing of data. They should have a basic understanding of the data collected or shared about them based on the interaction with the organization or their systems. They may even have volunteered the data, or been consulted on how the data was to be used. However, their approval is not always required, as we’ve seen in discussing mandatory or permitted sharing.

With that in mind, we can consider how different categories of information can affect privacy or confidentiality. The question of what data is needed has to be put in this context.

Categories of information

We’ve already mentioned directly and indirectly identifying information in the previous chapters. But now that we’re working with identified data, we need to spell this out clearly and go the extra step of classifying data for the purposes of making decisions about it, so that we can determine what tools can be used to protect it accordingly. We’ll revisit tools and techniques later, and focus here on the types of data that may be collected and used.

Directly identifying

Attributes that can essentially be used alone to uniquely identify individuals or their households, such as names and known identifiers. These should only be kept for identified data, and even then you may choose to separate directly identifying attributes into a separate dataset that is linkable to the other personal data. When we want to reduce identifiability, these attributes are always removed and replaced with fake random data or with pseudonyms or tokens. The techniques used need to be robust and defensible. This is often called masking or pseudonymization (e.g., in the EU).

Indirectly identifying

Attributes that can be used in combination with one another to identify individuals, such as known demographics and events, may need to be modified or transformed to reduce risk. These are the attributes used to measure identifiability, and are not immediately removed from the shared data because they are extremely useful for analytics. This is where all the heavy lifting takes place in terms of anonymization, because we want to minimize information loss to maintain analytic utility. We can divide these into two classes, which generally have different levels of risk:

  • Knowable to the public, such as fixed demographics

  • Knowable to an acquaintance, such as encounter dates and longitudinal characteristics or events

Confidential or target data

Attributes that are not identifiable but would be learned from working with the data, such as behaviors and preferences. Target data may still be found in data that is anonymized, and can pose ethical considerations regarding its use. Often, when classifying personal data as identifying or not, everything that is not identifying is considered target data. Not everything is identifying, but most if not all personal data will be considered confidential or target data. There are some approaches to anonymization that will try to transform confidential data, but this can have a very negative impact on data utility, as this is the information where there is a lot to learn.

Nonpersonal data

Attributes that are not about the data subjects, such as machine data, and therefore not personal in nature. It’s worth classifying nonpersonal data as it is sometimes mixed with personal data and therefore incorrectly classified as target data. However, in the context of device data, for example, it’s worth separating this out (sometimes both literally and figuratively) from personal data. You are likely to want to better protect personal data in your care, given the potential impacts on trust and regulatory oversight. The nonpersonal data will still be of value, however, for analytical purposes. You just won’t need all the auditing and oversight for it.
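To tie these categories together, here is a minimal sketch of how the classification might be recorded during requirements gathering, so that downstream tooling knows how to treat each attribute. The column names and their assignments describe a hypothetical dataset, not a prescribed scheme.

    from enum import Enum

    class DataCategory(Enum):
        DIRECT_IDENTIFIER = "direct"        # mask or replace with pseudonyms/tokens
        INDIRECT_IDENTIFIER = "indirect"    # measure identifiability and transform
        CONFIDENTIAL = "confidential"       # target data, usually left intact
        NONPERSONAL = "nonpersonal"         # machine/device data, lighter oversight

    # Hypothetical classification of a dataset's columns
    COLUMN_CLASSIFICATION = {
        "name": DataCategory.DIRECT_IDENTIFIER,
        "email": DataCategory.DIRECT_IDENTIFIER,
        "date_of_birth": DataCategory.INDIRECT_IDENTIFIER,   # knowable to the public
        "admission_date": DataCategory.INDIRECT_IDENTIFIER,  # knowable to an acquaintance
        "diagnosis": DataCategory.CONFIDENTIAL,
        "device_firmware_version": DataCategory.NONPERSONAL,
    }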

Our focus in this chapter is on the project definition stage, namely requirements gathering and defining generic elements of a system architecture as it relates to privacy. Concept definition was captured in the Safe Projects of Chapter 3. We can use this knowledge in assisting with the transition from privacy to secondary uses of data, which moves us from privacy requirements gathering to privacy design and development.

From Primary to Secondary Use

Now that we have scoped out our project with various privacy considerations, but have also dug into specifics related to identifiability, we’re ready to plan out options for building an anonymization pipeline for secondary uses. We’ve touched on the differences between primary and secondary use in previous chapters. But in our experience this bears repeating.

Primary purpose

When you offer a service, people have expectations about what data you need to collect to effectively provide the service, and they have expectations that the collected data will only be used for the direct purpose of providing that service. It’s really as simple as that: a primary purpose is the main reason for the service. It defines the minimum data needed to offer the service, and the way the collected data should be used to provide that same service. You can use that data for those direct primary purposes, but not for anything else.

Secondary purpose

Everything that is not a primary purpose is a secondary purpose. Or, put differently, secondary purposes are the indirect uses of data that were collected for a primary purpose. Some may be mandatory (e.g., reporting to law enforcement), whereas some may be permitted (e.g., for the benefit of society). Building analytical models from data collected from several data subjects is, for example, generally considered a secondary purpose, whereas applying already built analytical models to subject data for the direct purpose of delivering an expected service to that individual is a primary purpose. Reducing identifiability is mostly applied to secondary purposes.

There are different ways to parse data from the primary to secondary purposes of collecting identified data:

  • A system that operates on top of identified data, providing a primary use (a form of access control through the use of pseudonymized data)

  • An analytics engine, although it might be better to have such a system operate on top of pseudonymized data

  • A separate pipeline that does not affect primary use

Note

We’ll discuss pseudonymized data in Chapter 5, when we look at how direct identifiers are removed or replaced with pseudonyms. Most anonymization pipelines will start from the production environment only for the purposes of extracting data, and will not operate directly on identified data. The last thing we want is to impact a primary use of data (i.e., we don’t want to impact the services provided to data subjects), or to have a leak of direct identifiers (the worst kind).

Since we will be anonymizing data starting from identified data, we’ll consider direct identifiers and indirect identifiers. We separate these out because the tools we use are different and, as you will remember from previous chapters, indirect identifiers are where the magic happens in terms of measuring identifiability.

We’ll also work through use cases that either start specifically with identified data, or involve identified data in some way. There’s a mixed bag of complications to work through, such as controlled re-identifications (?!), mixing anonymized data with identified data, or anonymized outputs with identified data. If everything in this space were easy, we wouldn’t have written a book! Hopefully, this highlights the importance of the project-definition phase we just worked through at the start of this chapter.

Dealing with Direct Identifiers

Ridding yourself of direct identifiers is the first (but far from only) step to producing anonymized data. It is far from sufficient on its own; in most cases you only need a linking variable to keep records and data sources connected so that you know what data belongs to what data subject (also known as referential integrity). This is why we will push this discussion to the chapter on collecting pseudonymized data, even though in many cases an agent of the data custodian may be engaged to produce anonymized data. However, if we are building a system from scratch, we really would prefer to anonymize from the pseudonymized data first.
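As a minimal sketch of what such a linking variable might look like, the following uses a keyed hash (HMAC) so that the same identifier always maps to the same token, preserving referential integrity without the identifier being recoverable from the token alone. This is only one common approach; the hardcoded key, the normalization step, and the example values are illustrative assumptions, and a production scheme would need proper key management and a documented, defensible design.

    import hashlib
    import hmac

    # Assumption: in practice this key would come from a managed secret store,
    # not be hardcoded, and would be rotated according to policy.
    SECRET_KEY = b"replace-with-a-managed-secret"

    def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
        """Replace a direct identifier with a consistent, non-reversible token."""
        normalized = value.strip().lower().encode("utf-8")
        return hmac.new(key, normalized, hashlib.sha256).hexdigest()

    # The same input always produces the same token, keeping records and data
    # sources connected without exposing the underlying identifier.
    token_a = pseudonymize("alex@example.com")
    token_b = pseudonymize("Alex@Example.com ")
    assert token_a == token_b

Whether the custodian retains a key or mapping, and under what controls, is exactly the controlled re-identification question we come back to later in this chapter.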

But there are two use cases we need to highlight when there is a need to create realistic-looking data from direct identifiers. Wait, what?! Rest assured that we will replace the direct identifiers with fake data, but that data should represent the variety of data originally collected.

Realistic direct identifiers

A very common use case for producing anonymized data is to conduct functional and performance tests of software. Organizations developing applications that process personal data need to get that data from production environments, but this data must be anonymized before being shared with a testing group. Not only is this a secondary purpose (i.e., it is not for the purpose of delivering the service that a data subject expected when the data was first collected), more often than not the data environment of the test group has fewer mitigating controls in place to protect data.

Another, although less common, use case is a design jam or hackathon, in which the use cases may include writing apps or software that would otherwise use identified data when deployed. This is actually very similar to the software testing use case, although it starts from a slightly different point of motivation. The concerns are similar, though, and perhaps even more extreme depending on the circumstances in which data will be shared (for example, participants may be able to copy data to personal computers, and maintain copies at the end of the exercise).

The reason we put these use cases here, in a chapter about collecting identified data rather than pseudonymized data, is that we actually need (masked) direct identifiers to produce realistic-looking data. If the use cases envisioned are not focused on analytics, our objective will only be to ensure that properties of the data, namely data quality, are similar to allow for robust testing of applications.

So, if the names collected were stored in a 256-character string, we will want to respect that field definition and fill it with realistic names. We wouldn’t match name lengths between the identified and anonymized data record by record, as that could leak information (especially for names of rare lengths, such as very short or very long names). But the fake names will still need to look and behave like the originals.

Masking of this sort has to be done correctly, as we do not want to leak any identifying information. One common way to break a privacy-preserving scheme is a frequency attack, in which the frequency of occurrence is used to extract information from a system or even reverse engineer results. The length of names would be one example. The distribution of name length could be used to match against external dictionaries of names by country to learn where the data was collected, or to find min/max lengths that narrow down possible names.
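As a minimal sketch, assuming the third-party Faker library is acceptable in your environment, the following generates realistic replacement names that are drawn independently of the originals, so that properties like name length reflect the fake generator rather than the real population. The field length, locale, and seed are illustrative choices.

    from faker import Faker

    Faker.seed(42)  # reproducible test data, unrelated to the original values
    fake = Faker()  # assumption: default locale; pick locales to suit your tests

    def fake_name(max_length: int = 256) -> str:
        """Generate a realistic replacement name, ignoring the original entirely.

        The original value is deliberately not an input, so properties like
        name length cannot leak from the identified data into the test data.
        """
        return fake.name()[:max_length]  # respect the column definition (e.g., a 256-character string)

    masked_names = [fake_name() for _ in range(3)]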

Dealing with Indirect Identifiers

Ridding yourself of indirect identifiers, in the same fashion as direct identifiers, would mean eliminating all risk (sounds good!) as well as all analytic utility from data rendered anonymous (oh my, that’s terrible!). We’ve described the methods of measuring identifiability in a previous book.3 And we’ve walked you through the basic concepts of measuring identifiability in Chapter 2. No matter which technological approach we use, these concepts will apply.

Rather than removing the indirect identifiers, we will transform the data/outputs to ensure the level of identifiability achieves a defensible threshold used to provide reasonable assurance that data is nonidentifiable. But we’ve already provided a framework for doing this in Chapter 3.
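To give a flavor of what such a measurement might look like, here is a minimal sketch that groups records by their indirect identifiers and compares simple identifiability estimates to a tolerance. The quasi-identifier columns, the 0.09 threshold, and the group-size metrics are illustrative assumptions; a real assessment would follow the framework from Chapter 3 and account for the full data-sharing context.

    import pandas as pd

    def identifiability_summary(df: pd.DataFrame, quasi_identifiers: list) -> dict:
        """Estimate identifiability from group sizes on the indirect identifiers."""
        class_sizes = df.groupby(quasi_identifiers).size()
        return {
            "max_risk": 1.0 / class_sizes.min(),     # worst case: the smallest group
            "avg_risk": (1.0 / class_sizes).mean(),  # a simple average across groups
        }

    # Hypothetical check against a tolerance chosen through the Safe Outputs assessment;
    # if the measure exceeds it, transform (e.g., generalize age to age bands) and remeasure.
    # summary = identifiability_summary(data, ["age_band", "sex", "region"])
    # needs_more_transformation = summary["max_risk"] > 0.09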

The Five Safes, operationalized through risk-based anonymization, are both a governance framework and the basis for evaluating identifiability in the context of sharing data. That’s because changes to any one of the Safes can change our assessment of identifiability. They are intimately linked! Consider all the factors that affect the data-sharing context, shown in Figure 4-3.

images/datasharing_context.png
Figure 4-3. There are many factors that affect the context in which data is shared, all of which should be factored into a rigorous assessment of identifiability.

We will still transform data to achieve the defined risk tolerance, as determined from our Safe Outputs. But we are saving this for the chapter on pseudonymized data, ultimately because our current chapter is focused on working with identified data. An anonymization layer should be applied to pseudonymized data whenever possible. Secondary purposes from the original data collection should not operate directly from identified data that is in a production environment.

Having considered different types of identifiers, we can consider how we work with both identified and anonymized data, starting with how we produce anonymized data from identified data.

From Identified to Anonymized

The subject of anonymizing data may seem straightforward, in the sense that you either do it, or you don’t. We will consider anonymizing for an external data recipient, the role of processors that may anonymize the data on the custodian’s behalf, and circumstances around re-identifying for legitimate purposes. As previously described in Chapter 1, we use the term “shared” broadly to mean sharing a copy of, or sharing access to, data/outputs.

  • Sharing a copy of data means that we assess identifiability when the anonymized data and outputs will be managed by another group. So it’s the recipient’s environment for the data that is being assessed (the Safe Settings at the recipient site), since that’s where the data will be used.

  • Sharing access to data means that we assess identifiability when the anonymized data/outputs will be managed by the data custodian, with controls around access by data recipients. In this case it’s the custodian’s environment for hosting the data that is assessed (the Safe Settings are always evaluated where the data will be hosted and used).

Tip

Anonymization should be separated from a production environment in which the primary purposes for data collection are carried out, regardless of whether you’re sharing a copy or access to data and outputs. The last thing anyone wants is for a failure in the anonymization to affect primary use, or for a security incident to occur in this environment. Rather, split these up. Either pipe the data out of the production environment and apply anonymization in this pipe, or pipe it into another production environment in which the anonymization will take place. In the latter case, the experts doing the anonymization will need permission to access identified (or preferably pseudonymized) data.

Once anonymized, data and outputs can be shared with the data recipients. The easy version of this is shown in Figure 4-4, in which the data recipients are external to the organization. We’ll get into more complicated pipelines in subsequent chapters.

images/datasharing_external.png
Figure 4-4. The original data used for primary purposes, and anonymized data used for secondary purposes, are managed by separate legal entities.

Data (anonymization) processors

Once data has been removed from the production environment, if it’s not anonymized in the pipeline itself (through the use of automated anonymization tools, be they transforming data or outputs), it will need to be anonymized somewhere. In some cases, this is done by a data processor, an agent acting on behalf of the data custodian, and the appropriate agreements will need to be in place to ensure they have legal authority to work with (process) personal data. This can also be thought of as a pipeline, with personal data going to the processor, and anonymized data/outputs coming from the processor.

Note

Data processing agreements are used to set up a legal relationship between the data custodian (the controller) and the data processor. The processor essentially becomes an extension of the data custodian, taking on the same responsibilities for a specified processing activity using personal data. They have no more rights than the custodian, but they do have requirements in that they are processing personal data. When the relationship ends, so does any use of that personal data, as it needs to be destroyed by the processor. These agreements should also specify if/how anonymized data/outputs (derived from personal data) may be used.

Tools and some training can certainly provide the means for a data custodian to anonymize data. The reason for using a data processor to anonymize data is that the expertise may not (yet) exist in-house, or anonymization may be a rare occurrence that doesn’t make the business case for the cost of training and certification required to anonymize data. Or the data custodian may simply want someone else to take responsibility for both anonymization and the sharing mechanism that is decided on (including managing the feeds to different organizations).

Controlled re-identification

Imagine that a data recipient learns something of interest from the data regarding an anonymized data subject. This could be something that will affect treatment or care of a patient, fraudulent activities, or any number of things that are learned from the confidential or target data. These insights could be shared with the data custodian, who may then have a desire or need to re-identify the anonymized data. For example, the data custodian may have kept a key to a pseudonymized linking variable that would allow them to tie those specific insights back to the original data subject.

A controlled re-identification would need to be compatible with the original purpose for which the identified data was originally collected, or some form of legally permissible secondary use. It could only be done by the data custodian (who already has the original identified data), in a secure environment, by individuals with permission to access identified data.
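A minimal sketch of what that might look like in practice, assuming the custodian retained a secured mapping from pseudonym to original identifier when the data was pseudonymized. The key store, approved purposes, and logging here are placeholders for the real organizational controls; the point is that reversal happens only at the custodian, as a deliberate, purpose-checked, and audited step.

    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)

    # Assumption: this mapping was retained at pseudonymization time and lives in a
    # secured store accessible only to authorized staff of the data custodian.
    PSEUDONYM_KEY_STORE = {"3f1a9c2d4e5b6a70": "patient-00123"}

    # Illustrative list of purposes deemed compatible or legally permissible.
    APPROVED_PURPOSES = {"patient_safety_follow_up"}

    def controlled_reidentify(pseudonym: str, purpose: str, requested_by: str) -> str:
        """Reverse a pseudonym only for an approved purpose, leaving an audit trail."""
        if purpose not in APPROVED_PURPOSES:
            raise PermissionError(f"purpose '{purpose}' is not approved for re-identification")
        original_id = PSEUDONYM_KEY_STORE[pseudonym]
        logging.info("re-identification: pseudonym=%s requested_by=%s purpose=%s at=%s",
                     pseudonym, requested_by, purpose,
                     datetime.now(timezone.utc).isoformat())
        return original_id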

Warning

Although a reasonableness argument is normally included in privacy laws and regulations when describing identifiability, guidance from some regulators has suggested that anonymization should be irreversible. Guidance is not law, and court rulings have described the reasonableness found in laws and regulations. However, that guidance does set expectations of what those regulators are looking for, and courts may turn to guidance if they feel it is relevant and reasonable. You can therefore consider controlled re-identifications to be a business risk in some jurisdictions, and decide how important it is to maintain the ability to reverse a pseudonym or token in your use cases, with the appropriate legal basis to support that activity.

Another, perhaps less controversial, option would be to share the analysis that led to the results of interest, in other words, the statistical methods that could be used by the data custodian on the identified data to yield the same outputs. This may not always be possible if those statistical methods, including AI/ML algorithms, are proprietary.

We started by considering how we would share anonymized data with external data recipients. And from that arose several considerations, namely the use of anonymization processors and controlled re-identifications. Now we consider a slightly more complex use case, in which we need to share data with internal data recipients.

Mixing Identified with Anonymized

Imagine a data custodian, such as an academic medical institution, that wants to share anonymized health data, collected from providing medical care to patients, with internal researchers. These researchers are not treating the patients in the health data, and the envisioned purposes are not in relation to the direct treatment of those patients. In other words they are considering secondary purposes only. This means the same organization will have identified data, used for treating patients, and anonymized data, used for research.

Note

Regulators want to encourage the responsible use of data, to drive efficiencies and innovation. But some struggle with scenarios in which there is identified data used for primary purposes on one side of a Chinese wall and its anonymized counterpart used for secondary purposes on the other side of the Chinese wall. It would seem the organization has the ability to step from one side of the wall to the other whenever they please. The concern is with the separation between the identified and anonymized data, since in theory it would be possible to mix the two and render the anonymized data identified.

In theory, it would be much easier to re-identify the anonymized data since the same organization has the identified data. However, in practice, the organization has no need or desire to re-identify when it has identified data. The motives are simply not there at an organizational level, and the analogy of having the key doesn’t really hold since the identified data is ever present and being used for those primary purposes. The separation between identified and anonymized data does, however, need to be real, demonstrable, and well documented with auditable proof and enforcement.

Functionally anonymized

There are advantages to the internal reuse of anonymized data, since the data custodian can in practice have more direct oversight of the controls and uses. That isn’t to say there aren’t risks, since making the case that data held within an organizational function is anonymized, and will remain so, means there needs to be a true separation between the anonymized and the identified data. And regulators recognize that anonymization is privacy preserving, above simply removing direct identifiers (i.e., pseudonymization).

There are obvious desires to drive efficiencies and innovate with data, while maintaining primary uses generally. But there are also (nonprivacy) regulatory requirements to maintain historical records with identified data. This serves to emphasize the point that there are practical reasons to maintain identified data and provide ways for the data custodian to serve both primary and secondary uses. For example:

  • Banks are required to maintain certain records for designated periods of time, and the designated periods can vary by type of information. This can include information needed to reconstruct transactions, loan information, and evidence of compliance for any disclosures or actions taken regarding loans, savings, and fund transfers. The required retention periods can span multiple years.

  • Government departments and public bodies need to comply with a variety of laws and regulations depending on the primary uses they serve. Types of data vary greatly, and can include information about civil rights, disabilities, employment, health, social services, etc. There may be requirements to maintain information for the purposes of reconstructing transactions or supporting decision making. Again, this will vary greatly based on the primary uses they serve.

  • Sponsors of clinical trials are required to retain trial records for multiple years after the completion of a trial. This is to ensure accurate reporting, interpretation, and verification. Trials that are used for “marketing authorization” (the process of evaluating and granting a license for a product to be sold) have much longer retention periods, and some information needs to be retained for as long as the product is authorized.

We give data anonymized in this scenario, where identified and anonymized exist under the same legal entity, a special name:

Functionally anonymized

Data which is transformed and protected with strong privacy, security, and contractual controls in place to ensure that identifiability is sufficiently low, within an organizational function that does not have access to the keys or additional data needed to re-identify.

There are situations in which it may be desirable, if not necessary, for an organization to work with functionally anonymized data while maintaining the keys to reverse the process (or at least the pseudonyms). The rules of engagement would need to follow the process of controlled re-identification described previously. Namely, intentional re-identification by the data custodian needs to be for a compatible purpose or a permissible secondary purpose.

Five Safes as an information barrier

To engender public and regulatory trust, we need to ensure that there is a clear separation between functionally anonymized and identified data. And this is especially true when the same organization is mixing both, as shown in Figure 4-5. We don’t want anyone to think the data custodian is having their cake and eating it too.

images/datasharing_internal.png
Figure 4-5. An information barrier between the original data used for primary purposes and functionally anonymized data used for secondary purposes.

Let’s consider the Five Safes we presented in Chapter 3 and see how we can engender that trust:

Safe Projects

A clear separation of purposes, and an ethics review, would certainly help set the project on the right path toward safe use.

Safe People

Consider that our data recipients work for the same organization. There will need to be a clear separation of those who work on the functionally anonymized data from those who work with identified data. Otherwise, the risk of them inadvertently recognizing someone would be much higher.

Safe Settings

The data environment for the functionally anonymized data will need to be independent of the identified data, with no mixing. This implies that the data recipients, including the administrators, should not have access to identified data, and that physical access to the functionally anonymized data should even be in a separate area from where other employees access identified data (i.e., to avoid accidentally looking over someone’s shoulder).

Safe Data

With the Safe People and Safe Settings clearly defined, so that we separate functionally anonymized from identified data, the usual threat modeling can take place to manage residual risk.

Safe Outputs

Risk tolerance would be the same, but there would be little to no room for excuses for misusing outputs. The trust of service users would be seriously eroded if any misuse impacted those same users from which the data was derived.

This may seem like overkill to some, and it may seem to fly in the face of our framework to evaluate how safe these constraints are. But do not take this lightly, as it is a serious concern of regulators. The use of data can have many benefits, and this is recognized, but trust can only be built and maintained by having clear boundaries.

Now we can summarize the above considerations into three constraints for creating a defensible information barrier between identified and functionally anonymized data:

  • Different people

  • In different physical and virtual areas

  • Supported by different system administrators

Warning

Some would go so far as to recommend creating separate legal entities as an option when there is a chance of mixing identified data with functionally anonymized data, to limit regulatory concerns and oversight. We have worked with organizations that have spun off new companies that would work only from anonymized data they would provide. That should tell you how serious a subject this is, but also the value that anonymized data can have (so much so that a company can turn a profit from the insights it will generate from said data, while reducing regulatory risks to ensure those profits are protected). That’s serious business.

As if anonymizing data wasn’t hard enough, we’ve now seen some of the many complicating factors to building a few, somewhat straightforward, pipelines. We’ve gone from identified to anonymized data, for external or internal data recipients, and considered how we can build the appropriate conditions to ensure we maintain appropriate oversight around the anonymized data and how it is used. We treated these data assets as distinct entities, completely separate from one another. But what happens when identified and anonymized data overlap in some way?

Applying Anonymized to Identified

Regardless of the provenance of the anonymized data, there will be circumstances in which you may want to mix it with identified data, or apply model outputs from the anonymized data to the identified data. This will obviously raise eyebrows, since it may seem like a form of re-identification (even if that’s not the case!), so let’s consider some possibilities. To do this in a meaningful way, we need to compare populations between the anonymized data and the identified data, as shown in Figure 4-6. We assume the defined population (based on identifiability) is the same for both. We’ve ordered these from least to most concerning.

images/overlapping_populations.png
Figure 4-6. Comparing the populations between identified data and anonymized data will help us work through possible privacy pitfalls.

No overlap in populations

In this case there are no concerns, as the insights from an anonymized group are being applied to an entirely different population. You can imagine having a consumer group in one sector that provides insights into buying patterns that can be applied elsewhere. There are no risks of re-identification when the population groups don’t overlap, but there are still interesting things to learn about behaviors and outcomes.

Some overlap in populations

Once we start mixing anonymized with identified data when there are data subjects that overlap, concerns may be raised about potential re-identification. In this case, however, the overlap is uncertain. We don’t know which data subjects overlap, just that they share some identifiable features in common, but these have already been managed in terms of clustering based on identifiability. There would be considerable uncertainty in attempting to re-identify, depending on the extent of the overlap between the two populations.

Subsample of populations

When the anonymized data is a subset of the identified data, the adversary will know there are matching data subjects, but not which ones. The same is true when the identified data is a subset of the anonymized data. There is less uncertainty than the previous case of overlapping populations. Concerns would arise if there are any overlapping nonidentifiable attributes, as these would now represent a potential risk to matching between anonymized and identified, especially as the sample size increases.

Complete population

At this point there is a significant risk of attribute disclosure, i.e., associating sensitive information with a group of individuals. The datasets must represent the defined population in its entirety; otherwise, sampling would prevent an adversary from knowing of this overlap with certainty. This is also a perfect example of a prosecutor attack (see “Safe Data” for a refresher), since it’s known who’s in the anonymized data (although not which records belong to them). That means that identifiability is definitely higher than in the previous examples, and you may want to consider the ethics of how these attributions will be used.

Exact data subject

This would occur if a linking variable was used to match anonymized data to the identified data subject. (Linking can be done in a privacy-preserving way, but that’s a different subject.) This would enhance the identified profile, but would likely raise significant concerns since the anonymized data is now re-identified, possibly by someone other than the original data custodian of the personal data that was anonymized.

The overlapping and subsample cases are somewhat common when you consider census data, even in aggregate form, and inferences. Outputs on specific geographic regions can be applied to identified data to enhance analytical modeling. For example, knowing that 80% of people in a region love chocolate cake would certainly be helpful if you were modeling consumption patterns. But there’s uncertainty since the populations don’t perfectly match, which means at best we can make inferences.
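As a minimal sketch of that kind of enrichment, assuming pandas and entirely made-up column names and values, region-level aggregates derived from anonymized data can be joined onto identified records as model features, so that only the inference, rather than any individual-level anonymized record, ever touches the identified data.

    import pandas as pd

    # Aggregate output derived from anonymized data (illustrative values)
    region_rates = pd.DataFrame({
        "region": ["north", "south"],
        "pct_loves_chocolate_cake": [0.80, 0.55],
    })

    # Identified data held for a primary purpose (illustrative)
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "region": ["north", "south", "north"],
    })

    # Join the inference, not the anonymized records themselves, onto the identified data
    enriched = customers.merge(region_rates, on="region", how="left")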

Tip

To reduce concerns over privacy risks and impacts, it would be best if models, outputs, and insights were applied to anonymized data rather than the identified data themselves. This isn’t strictly necessary, but would certainly be easier to explain to regulators. At the very least, we would advise avoiding potential attribute disclosures through the use of sampling or subsampling, and an impact assessment or ethics review.

Although the above considerations certainly seem to complicate matters, it’s actually the overlap between identified and anonymized data that creates these complications. It’s important to understand potential risks so that they can be mitigated, and to explain risks and mitigations to regulators. As mentioned previously, details of this nature need to be documented to ensure approaches to working with both types of data are auditable and defensible.

Final Thoughts

We started with the collection of identified data, and the concerns and considerations in designing privacy into systems that manage personal data. One of the most effective privacy tools is to disassociate data subjects from the data, or reduce identifiability, wherever and whenever possible. This chapter was meant to help you work through the project-definition phase, collecting as many requirements and concerns as possible while thinking through various use cases, starting with the collection of identified data.

There are many resources that work through the phases of privacy engineering, and this chapter was not intended to cover all of them. Our goal was to set you on the right path to building anonymization pipelines. For that we don’t need to consider every aspect of privacy related to personal data, since our goal is to eliminate identities from data in a manner that is comprehensive, repeatable, and defensible.

As we described, you are likely to create a pipeline from the identified data into a new feed of pseudonymized data (since building anything that operates directly on top of the identified data would put the primary data collection and services at risk). Since this was already a hefty chapter, that was our excuse for saving the discussion of anonymization technologies for the next chapter, in which we work from the perspective of collecting pseudonymized data.

1 Privacy engineering is systems engineering focused on integrating privacy objectives and privacy risk assessment into implementation requirements, in which it is understood that there is no such thing as zero risk.

2 These objectives can be found in Sean W. Brooks et al., “An Introduction to Privacy Engineering and Risk Management in Federal Information Systems,” NIST Interagency/Internal Report (NISTIR)-8062 (2017), https://oreil.ly/bM0kS.

3 El Emam and Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started.
