Preface

Welcome to the wonderful world of data privacy! You might have some preconceived notions around privacy—that it is a nuisance, that it is administrative and therefore boring, or that it’s a topic that interests only lawyers. What this book will show you is just how technically challenging and interesting data privacy problems are and will continue to be for years to come. If you entered the field of data science because you liked challenging mathematical and statistical problems, you will love exploring data privacy in data science. The topics you’ll learn in this book will expand your understanding of probability theory, modeling, and even cryptography.

Learning how to solve data privacy problems is increasingly critical for data science practitioners today. You’ll be able to solve real-world problems in fields like cybersecurity, healthcare, and finance, and you’ll be able to advance your career in a patchwork world of privacy regulations, policies, and frameworks. Since 2018 when the General Data Protection Regulation (GDPR) went into effect in Europe, the global landscape has become more complicated, and that complexity will increase as regulatory agencies and lawmakers continue to change the rules about how, where, why, and when you store data. Building up your data privacy and data security skill set now is an investment in your career.

Additionally, taking the time to learn new privacy skills means you are contributing to the field of data science by enhancing trust, accountability, understanding, and social responsibility. Currently, there is fear and backlash against the use of machine learning to solve real-world problems. This response is based on real issues and actual deployments, where data, models, and systems were not used in a trustworthy manner and where justice and fairness come into question. For example, Clearview AI scrapes faces from social media sites and sells the facial recognition model built from those faces to law enforcement,1 raising questions regarding data ownership, privacy, and accountability. To help counter this reputational damage and to create pathways for responsible and trustworthy data, the industry needs data scientists and machine learning engineers who understand the tasks at hand and the risks involved, and who can competently address these issues when designing systems. Privacy can help guide you to fairer, more ethical, and more responsible systems, where the user has power and input and is at the center of your design. Use this book as you navigate these challenges, finding ways forward with practical, hands-on guidance.

I hope this book can contribute to new data science by expanding familiarity with how to appropriately implement privacy for sensitive data. Worldwide, apprehension around digitizing personal data—even for responsible government use—is so prevalent that it obstructs the use of data to provide assistance with social problems such as climate change, financial auditing, and global health crises. Building privacy into data science creates new pathways for data use in critical decisions for our societies and for our world.

What Is Data Privacy?

In a simple sense, data privacy protects data and people by enabling and guaranteeing more privacy for data via access, use, processing, and storage controls. Usually this data is people-related, but it applies to all types of processing. This definition, however, doesn’t fully cover the world of data privacy.

Data privacy is a complex concept—with aspects from many different areas of our world: legal, technical, social, cultural, and individual. Let’s explore these aspects and how they overlap so you get an idea of the vast implications of the topics and practices you will learn in this book.

In Figure P-1, you can see the different categories of definitions of privacy, and I’ve tried to represent their respective sizes in the figure. Let’s walk through them, starting with legal definitions.

In a legal context, data privacy involves the regulations, case law, and policies that declare what efforts are needed and what constitutes data privacy in a particular state or jurisdiction. As you’ll learn in Chapters 1 and 9, this is an ever-changing understanding and landscape that in recent years has changed dramatically. It is important to familiarize yourself with the legal aspects of data privacy because they can directly impact your work. For example, what happens when your organization is subject to an audit, data breach, or consumer complaint? These legal definitions also impact your personal life: what rights do you have as a data citizen?

A Venn diagram of privacy definitions, with three circles of different sizes that overlap. The largest circle is social and cultural definitions. The second largest circle is legal definitions. The smallest circle is scientific definitions. Inside the social and cultural definitions circle, there is a smaller circle labeled individual definitions.
Figure P-1. Privacy definitions

The scientific or technical definitions of privacy and their implementations in your daily work are the focus of this book. You will learn these definitions, how to deploy scientific privacy technologies at scale, and how to make technical decisions about privacy. With the tools in this book, you will learn state-of-the-art best practices that might not yet be well-known at your organization as they are only recently available in production systems. Staying up-to-date on these practices will be part of your job—should you decide to focus on this area. As a technical expert on the topic, you will be asked to support business and legal decisions on privacy and translate them into working software and systems. This is a significant role as many of the other stakeholders will not have a technical and up-to-date understanding of privacy.

The social and cultural aspects of privacy are best explained by danah boyd’s work in data privacy. She studied teenage girls and their interaction with social media to understand how technology impacted their understanding of concepts like privacy. Her definition is as follows:

Privacy is not about control over data nor is it a property of data. It’s about a collective understanding of a social situation’s boundaries and knowing how to operate within them. In other words, it’s about having control over a situation. It’s about understanding the audience and knowing how far information will flow. It’s about trusting the people, the situation, and the context.

danah boyd, “Privacy and Publicity in the Context of Big Data”

boyd shows us a new aspect of privacy in this definition that poses significant changes in how to design privacy into systems. In contrast to technical and legal definitions, boyd puts social and cultural understanding, context, and individual choice and understanding in the center. When you read her work or see her speak, you hear truths you have often felt but never clarified around how we as humans and as society understand privacy and information.

For example, when I lower my voice to a whisper and lean in to tell you something, you understand that information is not meant to be shared. When I shout it in a public square and ask people to listen, you understand that I want as many people to hear it as possible. How a person decides and changes the people they communicate with, and the way in which they communicate, are greatly influenced by how that person defines and views privacy, shown in Figure P-1 as the individual definition. The ability for someone to experiment with and shift their communication with others has significantly changed over time. Technology and the internet have allowed everyone to expand their communication and resulting privacy choices to contexts that are not physical. In doing so, you have new possibilities for connection, community, and information sharing, which are wonderful!

What this shift from the physical world to the online world has also done, however, is obfuscated our ability to reason about what context we are operating in. What are the rules of this space? Who can see me and hear me? Am I talking to you or to a group, and how big is that group? Helen Nissenbaum’s work on contextual integrity demonstrates that technology has changed how perceptible and transparent these lines are—not only via user interfaces but in the fundamental ways systems and software are designed. Choices for application defaults end up affecting privacy for potentially millions of people at once. Decisions on security and encryption make private conversations open for law enforcement and state surveillance. Data warehouses can take sensitive information meant for only one person and create access paths for employees and third-party data services. When the context is lost or obfuscated and the system design does not take the social and cultural definitions of privacy into account, the technology has essentially ignored the human aspect of privacy.

This book will show you opportunities to take these social understandings and build them into practical systems. There will be many difficult decisions you’ll need to make—but giving users ways to navigate their privacy context in digital spaces and safe defaults are invaluable gifts that the world needs more of. As you read through this book and learn more about the technical aspects of privacy, keep the social and legal definitions in mind—they are and will be forever entwined.

Who Should Read This Book

This book is for data scientists who want to upskill themselves with a focus on data privacy and security. You could have many reasons, such as:

  • You’d like to pursue a specialization (data privacy) that you care about, which has a long future in the industry.

  • You want to move into a more regulated industry like finance or healthcare, and these skills will set you up as a promising candidate in these sectors.

  • You work with research data, and you’d like to get faster approval from ethics board reviews and publications.

  • You are a data science freelancer or consultant and want to expand your customer base by ensuring that you know how to manage sensitive data.

  • You manage a data team and want to be able to design products and architect solutions with attention to data privacy.

  • You would like to use “AI for good” and think privacy is an important human right.

  • Your team has been told that privacy is important, but you aren’t sure of what that means or how to go about implementing it.

  • You work with sensitive data and want to ensure you are following best practices.

  • You’d like to become a privacy engineer and focus on engineering privacy into data products.

  • Privacy and security are neat topics, and you just enjoy learning more about them.

I could go on and on, actually, and I have met different folks from all of these backgrounds. One thing I can tell you with certainty is that demand for these skills is increasing rapidly, driven by much more than new regulations. Companies are investing in these skills so they can build a secure future for data management. By investing in privacy, companies not only avoid expensive incidents but also create a trusted brand and company culture when managing data, benefiting their recruitment, marketing, and liability.

Note

Familiarity with Python, Jupyter notebooks, math, and statistics will help you follow along with all sections, but this book can also be read without those deeper theoretical and implementation-focused sections as long as you understand the overarching concepts.

Don’t worry if you haven’t worked on math in a while. I’ve included information about each of the examples to help explain them—and reading through slowly will help.

In writing this book, I’ve gotten feedback from software engineers, security specialists, and even privacy lawyers who found this book useful. Although these people are not my target audience, I do hope this book can help anyone who has an interest in privacy and technology and their intersection in data systems.

As you read this book and work through exercises, you’ll see how aspects of data privacy highlight the wonders of data science you already know and love. As with other challenging areas of data science, this book will take you from simple methods for solving privacy into more difficult ones, some of which aren’t completely solved yet. Just like when linear regression “just works,” you want to start with simple and obvious solutions. But when you need something more than the simple solution, you will need to ask deeper questions that have technical and ethical implications. Finding these questions and exploring them and their answers will make you a better data scientist, technologist, statistician, and mathematician.

This book may be all you require to become a technologist with some extra skills around data privacy. That’s fine! You might also decide this book is the first of several in a path that takes you farther into the field. In case that’s enticing to you, let me introduce you to the concept of privacy engineering.

Privacy Engineering

In the next 10 years,2 I foresee that the field of privacy engineering will continue to grow in importance. The skills you gain in this book by working through the exercises and applying this new knowledge to your work will prepare you for this role.

At companies where data science is an important product, a privacy engineer is part data scientist and part engineer. This means that, unlike some roles in data science, you are actively engineering and architecting solutions rather than exploring data or testing an idea in a lab setting. This could mean working directly with the data engineering teams, the software or applications teams, or even the architects at your company to ensure privacy is built into the product as well as the internal applications. This covers all consumer and employee data flows, software used in data management, and internal and external data use cases. You’ll need to understand engineering and architecture basics as part of this work, especially as it pertains to designing systems and integrating systems with one another. Some related titles you can pick up on these topics are:

Determining what tooling and software works for an organization requires a sophisticated architecture, so simply implementing privacy policy via plug-and-play vendors is often too naive to address these problems. That said, the growing space of privacy technology companies means that you become a decision maker for evaluating technologies to build, or buy, and use for data privacy management. In doing so, you’ll be using concepts learned from this book to put together evaluation criteria, ask probing questions on the implementation, and analyze the flexibility, support, and product features. In this role, you will determine how well potential vendors can meet your company needs as the dependence on private, sensitive, and confidential data grows.

A privacy engineer is not just another data scientist or architect who cares about privacy but is given no authority, time, or budget to make decisions about privacy. Although it is great that advocacy has become part of the data science role, privacy engineering is about building and applying privacy techniques as data is ingested, collected, transformed, stored, and then used in data science applications. Advocacy is a nice side job, but implementation proves these technologies work.

Nor is a privacy engineer just a data engineer who thinks about privacy. While privacy engineers can work alongside data engineers and often might embed in a team for a project or a proof of concept, they must work with different parts of the organization and will be pulled into many projects where their expertise is relevant. They are specialists and are not locked into a single project or use case for too long. Instead, their knowledge is a tremendously valuable resource that should be applied to the most pressing business problems affected by data privacy.

The position of a privacy engineer is still being defined and continues to evolve. Although larger technology companies are actively hiring for these roles now, the role’s emergence reminds me of the rise of the term machine learning engineer in 2018. Privacy engineering as a practice is a relatively new skill set in data science that is emerging because of industry needs and demands. I am excited to see how privacy engineers shift 2 or 10 years from now, and I hope that this book inspires a few new people in the field.

Why I Wrote This Book

When I first became interested in data privacy, it felt like a maze. Most of the material was beyond my comprehension, and introductory guides were often written by folks trying to sell me software. Luckily, I knew a few folks in the data privacy community who helped shepherd me to a deeper and broader understanding of privacy. It took many hours of study and several helping hands to get me from curious data scientist to someone who had command of the topics you’ll find in this book, and I continue learning new things and diving deeper into the field every year.

I am convinced the skills you will learn in this book are essential for data scientists today and in the future. The steep learning curve I experienced is unnecessary, and that’s what this book will help you avoid. I wrote this book to provide a welcoming, fast-paced, and practical environment for you to learn, ask questions, find helpful advice, and begin to dive deeper into the challenging topics.

This book is meant to be a useful overview—leading you from zero knowledge to actively integrating data privacy into your work. You’ll learn popular strategies, like pseudonymization and anonymization methods, and newer approaches, like encrypted computation and federated data science. If this book acts as a springboard for your academic career or leads you to a research role, that would be terrific. The field needs intelligent and curious folks working on the unsolved problems in this space. But at its core, this book is a practical-minded overview providing pointers along the way should you want to learn more.

Data scientists and technologists who need to integrate data privacy and security topics as part of their daily work will find this book helpful. There are several chapters that work as quick references for you as you navigate data privacy. While a cover-to-cover read will help you create your knowledge base and teach you how to solve new and unknown data privacy challenges, a quick search provides straightforward advice on how to manage specific data privacy emergencies that come up in your day-to-day work.

Navigating This Book

This book is organized into chapters with a practical approach to data privacy and a mixture of theory, exercises, and use cases:

  • Chapter 10, “Frequently Asked Questions (and Their Answers!)”, summarizes frequently asked questions and use cases as a handy reference for data privacy emergencies, allowing you to confidently move forward and ensure data privacy is baked into each project and your normal workflow. It also opens up the social and personal aspects of privacy to integrate them into your life outside of work.

  • Chapter 11, “Go Forth and Engineer Privacy!”, is the book’s conclusion, and provides support and motivation for using your newly acquired data privacy skills to push the field and your own path forward!

Links in this book are shortened for your convenience to O’Reilly URLs. These URLs have minimal tracking and have been reviewed for GDPR compliance and privacy. Should you want to opt out of this minimal tracking, you can find the full list of URLs at https://practicaldataprivacybook.com.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs, to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/kjam/practical-data-privacy.

If you have a technical question or a problem using the code examples, please send email to .

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Practical Data Privacy by Katharine Jarmul (O’Reilly). Copyright 2023 Kjamistan, Inc., 978-1-098-12946-0.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

O’Reilly Online Learning

Note

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/practicalDataPrivacy.

Email to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media

Follow us on Twitter: https://twitter.com/oreillymedia

Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

I would like to first thank my partner, Aaron Glenn, for the long coffee walks, discussions, and daily support that led to the creation and writing of this book. If you want to learn about open source, community-driven, and software-defined computer networking, or you are just curious about how the internet actually works, please find his work at Predicted Paths.

My experience in privacy technology has exposed me to people who taught me more than I can imagine. Most prominently, not only was my time with the “PETs” team at Dropout Labs/Cape Privacy (Morten Dahl, Jason Mancuso, and Yann Dupis) one of the best working experiences of my life, but it is also where I learned everything I know about encrypted computation. Morten, thank you for your articles that inspired new thinking around encryption and machine learning, the countless hours of Jamboarding and answering questions, and generally being the best non-professor-but-actually-could-be-a-professor I’ve had a chance to learn from in my life. Jason, I miss hearing your thoughts about new breakthroughs in multitask learning and what is on your mind that will revolutionize Privacy Preserving Machine Learning (PPML) next. Yann, your pragmatic let’s-build-it-and-see approach and countless explanations showed me and our customers how these technologies can lead not only to better outcomes but also to true privacy guarantees. My time with you all is something I will always cherish.

My journey in building privacy technology started with cofounding KIProtect with Dr. Andreas Dewes. Andreas, thank you for being my sparring partner, business partner, and thinking partner in those years! I would not be where I am today without everything we built and learned.

A special thank-you to Damien Desfontaines, who put me through differential privacy boot camp when I started this book. Damien, thank you for the many conversations, for your contributions to the field, and for being a humble and awesome human. Your openness to share your knowledge, your work to make open source differential privacy usable in the field, and your amazing blog are invaluable. Keep up the good fight!

To the women technologists and good friends in my life who help keep me sane, motivated, and happy: Dr. Nakeema Stefflbauer, Dr. Carma Lüdtke, Ellen König, Christine Cheung, and Sandy Strong. I am so lucky to know you all. Thank you for being there with me through all the peaks and valleys of life in this crazy world. I wouldn’t have the chutzpah to write such a book if it weren’t for being inspired by your work.

To my mom and tireless unpaid editor, thank you for muddling through my words and spending your retirement time correcting my passive voice. I bet you never thought you’d still be correcting that 30 years later! Learning German didn’t really help; sorry about that. I could never put into words all the things I am thankful for, but here I can at least thank you for the book edits.

To my dad and Cathy, thank you for cheering me on and believing in my work. Sitting on the porch watching the river go by helped clear my mind while writing some of the most challenging sections of this book. Taking the appropriate breaks to play with the puppies, go for a walk, and have a glass of wine helped too!

To Dai and Rhys, you always are there to pump me up—both on social media and in real life! It’s so nice to have the positive energy during the times when projects like this book seem daunting.

To my editors at O’Reilly: Rita Fernando and Andy Kwan. Rita, thank you for so much input, guidance, and patience as I learned how to write this book and what this book was about. I will miss our check-ins, and I hope to get a chance to say hello sometime in real life. Andy, you were the first believer in this book—thank you for taking the chance on it!

To my technical reviewers: Natalie Beyer, Clarence Chio, and Timothy Yim. Natalie, thank you so much for giving me the data science lens and feedback. Your feedback helped make the unclear parts of this book easier and ideally will help many data scientists along the way. Clarence, I’ve been such a fan of your work on adversarial ML; it was an honor to have your thoughtful input and years of expertise in this book as well. Timothy, your expertise helped clarify early chapters’ advice on governance and consent workflows. Thank you!

To my fellow Thoughtworkers, who supported me by listening to me think out loud, kept me thinking via interesting questions and new ideas, helped me keep learning and working by giving me encouragement and feedback along the way, and helped me evolve my ideas into what they are in this book. Special thanks to Chris Ford, who was also a technical reviewer, and Enrico Massi and Lisa Junger, whose regular chats and expertise helped make the security concerns in this book real and accurate. Additional kudos to Clara Brünn, for such helpful feedback and interesting insights from your own data science work and experience, and Mitchell Lisle and Menghong Li, whose interest in privacy engineering sparked new ideas and led to the book repository’s database reconstruction attack—thank you! To my “nonboss” Emily Gorcenski, who gave me support and time to write and encouraged my thinking in how privacy and strategy intertwine. And the warmest of thanks to Sowmya Ganapathi Krishnan, Nimisha Asthagiri, and Erin Nicholson—whose own passion for security and privacy technology and truly amazing new friendships helped me on the long road to get this book from idea to print.

To my technical writers group, for motivating me and sharing your ideas, feedback, and own journeys—thank you! Although our crazy schedules meant that the group met only a few times, it helped me through the initial growing pains to get back into a normal writing flow.

To Freddie Hubbard and Beyoncé, whose tracks helped me through the early mornings and late nights.

To my niece Charlotte, to my godson Neorth, to Ragnar and Horik, I hope this book is one small drop in the wave of change. I hope you grow up to see a world where privacy is a fundamental right for everyone, no matter who they are or where they live.

1 For a full URL list without shorteners, please see https://practicaldataprivacybook.com.

2 Disclaimer: I generally avoid predictions, as they are often wrong; however, I am offering this one, based on hard-won experience in the industry for the past 6 years.
