Chapter 7. Practical Data Synthesis

Real data is messy. When data has been cleaned and heavily curated, data synthesis (and, for that matter, any data analysis) becomes much easier. In practice, however, the requirement is to synthesize data that has not been curated.

This chapter presents a number of pragmatic considerations for handling real-world data, based on our experience delivering synthetic datasets and synthetic data generation technology. While our list is not comprehensive, it covers some of the more common issues you are likely to encounter. We highlight the challenges and offer suggestions for addressing them.

At this point, we do not make explicit assumptions about the scale of the data that will be synthesized. For example, some datasets, such as financial transactions or insurance claims, can have relatively few variables (tens or perhaps hundreds) but a very large number of records. Other datasets cover few individuals but have a large number of variables (thousands or tens of thousands). Narrow and deep datasets present different challenges from wide and shallow ones when they are processed for data synthesis. In some cases the challenges can be handled manually, and in other cases full automation is a necessity.

Managing Data Complexity

The first set of items that we want to cover concerns how to manage data complexity. If you work with data, you are used to handling data challenges. In the context of synthesis there ...
