Chapter 3. Building Privacy into Data Pipelines

Now that you’ve evaluated different approaches to pseudonymization and anonymization, let’s explore how to integrate these approaches directly into normal data workflows. Pipelines and other large-scale data infrastructure offer a sustainable and extensible way to design privacy into your data architecture. When you scale privacy methods that are defined by a multidisciplinary team of experts and implemented by a group that understands not only the privacy technologies but also the company’s infrastructure, you move from piecemeal, one-off operations to maintainable Privacy by Design.

In this chapter, you’ll learn how to incorporate privacy technologies into data engineering infrastructure and software. You’ll also learn tips for working with data engineering teams (in case you aren’t in one already!). Finally, you’ll learn how to engineer privacy into your data collection methods and what differential privacy looks like as part of your data collection pipeline.

How to Build Privacy into Data Pipelines

In Chapter 1, you looked at data governance basics and how to apply basic privacy protections. In Chapter 2, you learned anonymization and differential privacy methods. Now that you understand the basic building blocks of privacy, it’s time to experiment with them and then automate and scale them into real data infrastructure.

Before you begin building privacy into these workflows, you need to have properly outlined the risks, ...
