Chapter 10. Differentially Private Synthetic Data
Privacy regulations often restrict how data can be accessed and used. Because more valuable personal data is generated every day, researchers need ways to learn from that data without running afoul of these regulations. Synthetic data (SD), generated through algorithms rather than real-world measurements, offers a compelling solution. Differentially private SD aims to mimic the distribution of sensitive data while protecting the privacy of the individuals represented in it.
In this chapter, you will dive into SD, explore its unique advantages, and learn to apply it in a range of applications. You will also learn relevant algorithms for generating SD and understand the problems that can arise during the data generation process.
Defining Synthetic Data
Synthetic data sets and “real” data sets are distinguished by their origins. Real data is collected from measurements of the world (for example, human population data or data about the users of an application), whereas synthetic data sets are generated using algorithms. These algorithms aim to closely match the distribution of the sensitive real data, so that the SD provides similar insights to the real data while also protecting privacy.
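To make the idea of "matching the distribution" concrete, here is a minimal sketch of one simple way to produce differentially private synthetic values for a single numeric column: build a histogram, add Laplace noise to the counts, and sample new values from the noisy histogram. This is an illustration under stated assumptions, not necessarily the algorithm this chapter goes on to present. The function name `dp_histogram_synthesizer` and its parameters are hypothetical. The sketch assumes that neighboring data sets differ by adding or removing one record (so each count has sensitivity 1), that the bin edges are chosen without looking at the data, and that the number of synthetic rows is public. It also uses ordinary floating-point noise, which is not safe for production use; a vetted differential privacy library should be used in practice.

```python
import numpy as np

def dp_histogram_synthesizer(values, bin_edges, epsilon, n_synthetic, rng):
    """Sample synthetic values from a Laplace-noised histogram (illustrative only).

    Assumes add/remove-one-record neighbors, so the histogram has L1
    sensitivity 1, and bin_edges that were fixed independently of the data.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)

    # Count how many real records fall into each fixed bin.
    counts, _ = np.histogram(values, bins=bin_edges)

    # Laplace mechanism: noise scale = sensitivity / epsilon = 1 / epsilon.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)

    # Post-processing (no extra privacy cost): clip negatives and normalize.
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    if total > 0:
        probs = noisy / total
    else:
        # Fall back to a uniform distribution if every noisy count is zero.
        probs = np.full(len(noisy), 1.0 / len(noisy))

    # Pick a bin for each synthetic record, then draw uniformly within that bin.
    chosen = rng.choice(len(probs), size=n_synthetic, p=probs)
    return rng.uniform(bin_edges[chosen], bin_edges[chosen + 1])
```

Because every step after the noise is added is post-processing, the synthetic values carry the same epsilon guarantee as the noisy histogram itself. A smaller epsilon adds more noise, so the synthetic distribution tracks the real one less closely.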
Synthetic data sets are particularly valuable in scenarios involving microdata. Privatized microdata, which includes individual-level data, cannot be achieved with the techniques introduced in this book thus ...