Chapter 4. Evaluating Synthetic Data Utility
To achieve widespread use and adoption, synthetic data needs to have sufficient utility to produce analysis results similar to the original data’s.1 This is the trust-building exercise that was discussed in Chapter 1. If we know precisely how the synthetic data is going to be used (for example, the specific type of statistical analysis or regression model that will be run on it), we can synthesize the data to have high utility for that purpose. In practice, however, synthesizers often will not know a priori all of the analyses that will be performed with the synthetic data. The synthetic data therefore needs to have high utility for a broad range of possible uses.
This chapter outlines a data utility framework that can be used for synthetic data. A common data utility framework would be beneficial because it would allow for the following:
- Data synthesizers to optimize their generation methods to achieve high data utility
- Different data synthesis approaches to be consistently compared by users choosing among data synthesis methods
- Data users to quickly understand how reliable the results from the synthetic data would be
Three types of approaches have been used to assess the utility of synthetic data:
- Workload-aware evaluations
- Generic data utility metrics
- Subjective assessments of data utility
Workload-aware metrics look at specific feasible analyses that would be performed on the ...
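To make the idea of a workload-aware evaluation concrete, the following is a minimal sketch, assuming the known workload is an ordinary least squares regression. It fits the same model on the real data and on the synthetic data and reports how far apart the coefficients are. The helper names `fit_ols` and `coefficient_gap` are hypothetical, and the "synthetic" dataset here is only a stand-in (an independent draw from the same simulated process), not the output of an actual synthesizer.

```python
import numpy as np


def fit_ols(X, y):
    """Return OLS coefficients (intercept first) via least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def coefficient_gap(real_X, real_y, synth_X, synth_y):
    """Absolute difference between coefficients fit on real vs. synthetic data."""
    return np.abs(fit_ols(real_X, real_y) - fit_ols(synth_X, synth_y))


def simulate(rng, n=500):
    """Hypothetical data-generating process, used only for illustration."""
    X = rng.normal(size=(n, 2))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)
    return X, y


rng = np.random.default_rng(0)
real_X, real_y = simulate(rng)
# Assumption: a second draw from the same process stands in for synthetic data.
synth_X, synth_y = simulate(rng)

gap = coefficient_gap(real_X, real_y, synth_X, synth_y)
print("Per-coefficient absolute gap (intercept, x1, x2):", gap)
```

Small gaps suggest that, for this particular workload, conclusions drawn from the synthetic data would be close to those drawn from the real data. A fuller evaluation would likely also compare standard errors or confidence interval overlap, which this sketch omits.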