Chapter 10. Differentially Private Synthetic Data
Privacy regulations often restrict how data can be accessed and used. As more valuable personal data is generated every day, researchers need ways to learn from it without running afoul of these regulations. Synthetic data (SD), generated by algorithms rather than collected through real-world measurement, offers a compelling solution. Differentially private SD aims to mimic the distribution of a sensitive data set while protecting the privacy of the individuals in it.
In this chapter, you will dive into SD, explore its unique advantages, and learn to apply it in a range of settings. You will also learn relevant algorithms for generating SD and understand the problems that may arise during the data generation process.
Defining Synthetic Data
Synthetic data sets and “real” data sets are distinguished by their origins. While real data is collected from measurements of the world (for example, human population data or records of an application's users), synthetic data sets are generated using algorithms. These algorithms aim to match the distribution of the sensitive real data closely, so that the SD provides similar insights to the real data while also protecting privacy.
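To make the idea of "matching the distribution" concrete, here is a minimal sketch of one simple way to generate differentially private synthetic records for a single categorical column: add Laplace noise to a histogram, then sample from the noisy histogram. This is an illustration under stated assumptions, not the chapter's own algorithm. The function names (`laplace_noise`, `dp_synthetic_sample`) are hypothetical, the category list is assumed to be public and fixed in advance rather than read from the data, and neighboring data sets are assumed to differ by adding or removing one record.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic_sample(records, categories, epsilon, n_synthetic, seed=None):
    """Sample synthetic records from a Laplace-noised histogram.

    Assumptions (illustrative only):
    - `categories` is a public, data-independent list of possible values.
    - Adding or removing one record changes one count by 1, so the
      histogram has L1 sensitivity 1 and the noise scale is 1 / epsilon.
    - `n_synthetic` is chosen by the analyst, not taken from the data.
    """
    rng = random.Random(seed)
    counts = Counter(records)

    # Noise every category, including those with a true count of zero,
    # then clamp negatives to zero (post-processing keeps the guarantee).
    noisy = [
        max(0.0, counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng))
        for c in categories
    ]

    # Fall back to uniform weights if every noisy count was clamped to zero.
    if sum(noisy) == 0:
        noisy = [1.0] * len(categories)

    # Sampling from the noisy histogram is also post-processing.
    return rng.choices(categories, weights=noisy, k=n_synthetic)


# Hypothetical sensitive column: 80 records of "a", 20 of "b", none of "c".
sensitive = ["a"] * 80 + ["b"] * 20
synthetic = dp_synthetic_sample(
    sensitive, ["a", "b", "c"], epsilon=1.0, n_synthetic=50, seed=0
)
```

The synthetic records contain no row copied from the sensitive data; they are drawn only from the noisy counts. Floating-point Laplace sampling like this is known to be unsafe for real releases, so a vetted differential privacy library is the better choice outside of a demonstration.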
Synthetic data sets are particularly valuable in scenarios involving microdata. Privatized microdata, which includes individual-level data, cannot be achieved with the techniques introduced in this book thus ...