Chapter 4. The Future of Data Synthesis

While significant progress has been made over the last few years in making synthetic data generation practical and scalable, further work is needed to improve the current state of practice. This chapter summarizes the key issues to be addressed. It does not present a research and development agenda, but rather a set of items to consider when developing such an agenda.

We cover four main issues. First, we need to develop a data utility framework. Such a framework would make it easier to benchmark various data synthesis techniques. The second issue, which is coming up more frequently, is the need to remove certain relationships from synthetic data for commercial or security reasons. Third, data watermarking will become increasingly important as more synthetic data is generated and shared. Finally, simulators that can generate different types of synthetic data would provide powerful capabilities.

Creating a Data Utility Framework

As discussed in Chapter 1, data utility is important for the adoption of synthetic data. The higher the data utility of synthetic data, the greater the number of use cases where it would be a good tool to accelerate AIML efforts, and the more likely that analysts will be comfortable using it.

In practice, we are seeing a significant dataset, the 2020 decennial US census, being shared as synthetic data and as derivatives of synthetic data. The question of whether the utility of synthetic data is good enough may no longer be the right one to ask. We have entered the era of large-scale synthetic data, and the utility levels available today may be sufficient for many practical problems.

The question now is how to demonstrate this data utility to analysts and data users so that they are confident and comfortable using synthetic data. The answer has at least two parts:

  1. Data utility is defined as the ability to get substantively similar results on synthetic data as on real data.

  2. Data utility is defined relative to an alternative method of getting access to data, such as de-identified data.

It is good practice to perform a utility assessment for every synthesized dataset, and this is where a data utility framework would be of value. The availability of a validation server, which reruns an analysis on the real data to confirm results obtained on the synthetic data, would be a plus. Over time, using synthetic data as a proxy for real data will become more accepted, especially as synthesis methods continue to improve.
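To make the first definition concrete, here is a minimal sketch of one check such a framework might include: fit the same model on the real and the synthetic dataset and report whether the confidence intervals of corresponding coefficients overlap. The column layout, the choice of an ordinary least squares model, and the 95% interval overlap criterion are illustrative assumptions, not a prescribed standard.

```python
# A minimal sketch of a utility check (assumes both datasets have the same
# numeric columns, including the outcome): fit the same OLS model on the
# real and the synthetic data and report whether the 95% confidence
# intervals of corresponding coefficients overlap.
import pandas as pd
import statsmodels.api as sm

def ci_overlap(real_df: pd.DataFrame, synth_df: pd.DataFrame, outcome: str) -> pd.DataFrame:
    fits = {}
    for name, df in [("real", real_df), ("synthetic", synth_df)]:
        X = sm.add_constant(df.drop(columns=[outcome]))
        fits[name] = sm.OLS(df[outcome], X).fit()

    ci_real = fits["real"].conf_int()          # lower/upper bound per coefficient
    ci_synth = fits["synthetic"].conf_int()
    rows = []
    for coef in ci_real.index:
        lo = max(ci_real.loc[coef, 0], ci_synth.loc[coef, 0])
        hi = min(ci_real.loc[coef, 1], ci_synth.loc[coef, 1])
        rows.append({"coefficient": coef, "ci_overlap": bool(hi >= lo)})
    return pd.DataFrame(rows)
```

A fuller framework would repeat this kind of comparison over a battery of analyses and summarize the results, but the basic pattern of running the same workload on both datasets remains the same.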

De-identified data is generally considered a good proxy for real data. How does the utility of synthetic data compare to the utility of de-identified data? This remains an empirical question and, over time, evidence will be accumulated to inform this issue. However, we have argued earlier in this report that, practically, the economics of de-identification are potentially unfavorable compared to those of data synthesis.

The use cases that we discussed in this report can be expanded if it is possible to manipulate the synthetic data. This means that instead of generating data that has high fidelity to the real data, we deliberately make the generated data represent something different. In the next section, we consider the need to remove relationships or information from generated synthetic data.

Removing Information from Synthetic Data

Interesting applications emerge when we start looking at hybrid synthetic data. This data is generated from real data, but then is also manipulated to exhibit characteristics that were not in the original data. This section examines the removal of information from synthetic data to hide sensitive information.

In domains such as law enforcement and intelligence, there is a need to build AIML models, which means that there is a need to get access to data. These models can, for example, characterize determinants of crime and predict adversary activities. But the data owners may want to hide certain attributes or relationships to ensure that they are not exhibited in the generated data. These hidden attributes or relationships pertain to highly sensitive or classified information that should not be known more broadly; for example, those that reveal data surveillance capabilities or sources.

Another scenario requiring specific attributes or relationships to be hidden comes up in commercial settings. For example, a financial services company may want to create a synthetic version of a dataset but not reveal specific commercially sensitive information in that data. Therefore, there is a need to partially synthesize the data or mask parts of it after synthesis.
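As a minimal sketch of one way to mask information after synthesis, the example below breaks the association between a sensitive attribute and the rest of the records by shuffling that column, which preserves its marginal distribution while removing its relationships to other attributes. The function and column names are illustrative, and shuffling is only one of several possible masking approaches.

```python
# A minimal sketch of post-synthesis masking: break the association between
# a sensitive attribute and the rest of the data by shuffling that column.
# This preserves the column's marginal distribution but removes its
# relationships to other attributes. Dropping the column entirely is the
# more drastic alternative.
import pandas as pd

def remove_relationship(synth_df: pd.DataFrame, sensitive_col: str,
                        seed: int = 0) -> pd.DataFrame:
    masked = synth_df.copy()
    shuffled = masked[sensitive_col].sample(frac=1.0, random_state=seed)
    masked[sensitive_col] = shuffled.to_numpy()   # deliberately ignore index alignment
    return masked
```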

In the next section, we discuss how data watermarking can be a useful capability as the adoption of synthetic data grows. Watermarking of data has been used historically to establish data provenance; for example, in the case of a data breach. Establishing a synthetic data signature would be a new application of these capabilities.

Using Data Watermarking

Imagine a future in which synthetic data is everywhere and is commonly used as a key component of the data analytics and secondary processing ecosystem. One concern that has been expressed about such a future is the difficulty of telling real data apart from synthetic data.

Data watermarking methods can address this concern. One type of watermark would be a unique data pattern that is deliberately embedded within the synthetic data and that is recoverable. Alternatively, a watermark can be computed algorithmically from the existing patterns in the data, effectively being a signature characterizing the data.

Whenever there is a question about the status of a dataset, it would be compared to known watermarks to determine whether it is real or synthetic. Given that synthetic data is generated through a stochastic process, every instance of a dataset will have a unique pattern to it.

The difficulty with practical data watermarks is that they need to be invariant to data subsets. For example, would the watermark still be detectable for a subset of the variables or for a subset of the rows in the dataset?
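As an illustration of the embedded-pattern approach, the sketch below marks a keyed fraction of rows by forcing the least significant digit of a numeric column to a keyed value; detection counts how often the expected digit appears, so the check degrades gracefully on row subsets. The column names, the HMAC construction, and the reliance on a stable record identifier are all assumptions made for this sketch, not an established scheme.

```python
# A minimal sketch of an embedded, keyed watermark. A keyed fraction of rows
# is selected via an HMAC of a stable record identifier, and the least
# significant digit of an integer-valued column is set to a keyed value.
# Detection counts matches among the selected rows, so it still works (with
# reduced strength) on row subsets, provided the identifier and the marked
# column are retained. This is an illustration, not an established scheme.
import hashlib
import hmac

import pandas as pd

def _keyed_bytes(key: bytes, record_id: int) -> bytes:
    return hmac.new(key, str(record_id).encode(), hashlib.sha256).digest()

def embed_watermark(df: pd.DataFrame, id_col: str, value_col: str,
                    key: bytes, fraction: int = 10) -> pd.DataFrame:
    marked = df.copy()
    for i, row in marked.iterrows():
        digest = _keyed_bytes(key, int(row[id_col]))
        if digest[0] % fraction == 0:                 # keyed selection of ~1/fraction of rows
            digit = digest[1] % 10                    # keyed digit to embed
            marked.at[i, value_col] = (int(row[value_col]) // 10) * 10 + digit
    return marked

def detect_watermark(df: pd.DataFrame, id_col: str, value_col: str,
                     key: bytes, fraction: int = 10) -> float:
    """Return the fraction of selected rows whose embedded digit matches the key."""
    hits = total = 0
    for _, row in df.iterrows():
        digest = _keyed_bytes(key, int(row[id_col]))
        if digest[0] % fraction == 0:
            total += 1
            hits += int(int(row[value_col]) % 10 == digest[1] % 10)
    return hits / total if total else 0.0
```

Note that this particular sketch survives row subsetting but not removal of the marked column or the identifier, which illustrates the invariance problem described above.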

As our understanding of specific processes improves over time, it becomes easier to build plausible models and simulators of these processes. The simulators can act as data synthesizers as well. We discuss this topic in the next section.

Generating Synthetic Data from Simulators

Within the context of data synthesis, a simulator is a statistical model, a machine learning model, or a set of rules characterizing a particular process, embedded in a software application. When the application is executed, it generates data from these models or rules. We saw examples of this with the gaming engines in Chapter 3, which are used to generate data for training robots and for training and testing autonomous vehicle systems. In the same chapter, we looked at microsimulation as another example of a simulation capability. However, the concept can be applied more broadly and in other domains.

Generating data from simulators raises the possibility of setting the desired heterogeneity of the synthetic data. For example, a simulator can effectively oversample rare or catastrophic events to ensure that the trained models are robust against a larger domain of inputs. However, these events need to be somewhat plausible: when generating images for training autonomous vehicles, we would not want scenes with cars on top of buildings or floating in the air. There is also the question of how to validate the trained models in practice, since these situations occur rarely, if ever, in the real world.
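As a minimal sketch of what dialing up heterogeneity could look like, the rule-based generator below exposes a rare-event rate that can be raised well above its real-world frequency. The event types, field names, and distributions are hypothetical.

```python
# A minimal sketch of a rule-based simulator with a tunable rare-event rate.
# The event types, field names, and distributions are hypothetical; the point
# is that, unlike a synthesizer fit to real data, the simulator lets us dial
# up the frequency of rare but plausible scenarios.
import random

def simulate_trips(n: int, rare_event_rate: float = 0.01, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    trips = []
    for _ in range(n):
        is_rare = rng.random() < rare_event_rate
        trips.append({
            "speed_kmh": rng.gauss(90, 25) if is_rare else rng.gauss(60, 10),
            "visibility_m": rng.uniform(10, 50) if is_rare else rng.uniform(200, 1000),
            "event": "near_collision" if is_rare else "normal",
        })
    return trips

# Oversample rare events for robustness testing by raising the rate.
training_records = simulate_trips(100_000, rare_event_rate=0.10)
```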

Some domains are more amenable to simulators than others. As our understanding of health systems and biological systems improves, they can plausibly be modeled more accurately, and these models can be used to generate data. This will start at the macro level and increase in granularity over time.

In addition to being another source of synthetic data, simulators allow us to manipulate synthetic data. For example, if we want to test a new AIML technique to see if it can detect the genetic and other characteristics of patients who respond particularly well to a drug, we can use a simulator to create datasets with signals of different strengths.
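A minimal sketch of this idea, under the assumption of a single hypothetical genetic marker and a binary drug response, is shown below: one effect parameter controls the strength of the planted signal, so datasets can be generated across a range of effect sizes to see how small a signal a candidate AIML method can reliably detect.

```python
# A minimal sketch of planting a signal of known strength: a hypothetical
# genetic marker raises the probability of drug response by "effect".
# Generating datasets across a range of effect sizes lets us measure the
# smallest signal a candidate AIML method can reliably detect.
import numpy as np
import pandas as pd

def simulate_trial(n: int, effect: float, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    marker = rng.integers(0, 2, size=n)               # 1 = carrier of the marker
    base_rate = 0.30                                  # response rate for non-carriers
    response = rng.random(n) < (base_rate + effect * marker)
    return pd.DataFrame({"marker": marker, "response": response.astype(int)})

# Datasets with no signal, a weak signal, and progressively stronger signals.
datasets = {e: simulate_trial(5_000, effect=e) for e in (0.0, 0.05, 0.10, 0.20)}
```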

This list of items to consider for the future of data synthesis did not cover new techniques for data synthesis themselves. However, that is also an area of active development, with innovations in genetic algorithms and deep learning models. A deep dive into specific synthesis algorithms is a specialized topic for a different publication.

Conclusions

Synthetic data represents an exciting opportunity to solve some practical problems related to accessing realistic data for numerous significant use cases. The demand for data to drive AIML applications, the greater availability of large datasets, and the increasing difficulty in getting access to this data (because of data protection regulations and concerns about data sharing) have created a unique opening for data synthesis technologies to fill that gap.

As we discussed, data access problems span multiple industries, such as manufacturing and distribution, healthcare and health research, financial services, as well as transportation and urban planning (including autonomous vehicles). The techniques and methodologies that have been developed over the last few years have achieved substantial data utility milestones. The number of use cases for which data synthesis provides a good solution is increasing rapidly.

In this report, we have looked at industries in which synthetic data can be applied in practice to solve data access problems. Again, a characteristic of these use cases is their heterogeneity and the plethora of problems that synthesis can solve. This is not a comprehensive list of industries and applications, but it does highlight what early users are doing and illustrates the potential.

While we did not discuss the privacy benefits of synthetic data much in this report, it is important to highlight them in closing. The current evidence suggests that the risk of matching synthetic data to real people, and learning something new from that matching, is very small. This is an important factor when considering the adoption of data synthesis.

Once a decision has been made to adopt data synthesis, the implementation process must be considered. As data synthesis becomes more programmatic across the enterprise, a center of excellence becomes a more appropriate organizational structure than running individual projects. An architectural decision also needs to be made on the implementation of a synthesis pipeline and its integration within a data flow, depending on whether the demand for synthesis is discrete (one-off synthesis of specific datasets) or continuous (an ongoing data feed). A data pipeline architecture would help with implementing the synthesis technology.
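The skeleton below is a minimal sketch of how the same synthesis stages might be arranged for the discrete and the continuous case; the stage names and function signatures are assumptions made for illustration, not a reference architecture.

```python
# A minimal sketch of where synthesis sits in a data pipeline, for both the
# discrete (one-off dataset) and continuous (ongoing feed) cases. The stage
# names and signatures are assumptions for illustration.
from typing import Callable, Dict, Iterable

import pandas as pd

Stage = Callable[[pd.DataFrame], pd.DataFrame]

def run_discrete(extract: Callable[[], pd.DataFrame],
                 synthesize: Stage,
                 assess_utility: Callable[[pd.DataFrame, pd.DataFrame], Dict],
                 publish: Callable[[pd.DataFrame, Dict], None]) -> None:
    """One synthesis job: extract, synthesize, assess utility, publish."""
    real = extract()
    synthetic = synthesize(real)
    publish(synthetic, assess_utility(real, synthetic))

def run_continuous(batches: Iterable[pd.DataFrame],
                   synthesize: Stage,
                   assess_utility: Callable[[pd.DataFrame, pd.DataFrame], Dict],
                   publish: Callable[[pd.DataFrame, Dict], None]) -> None:
    """The same stages applied to each batch of an ongoing data feed."""
    for real in batches:
        synthetic = synthesize(real)
        publish(synthetic, assess_utility(real, synthetic))
```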

Exciting advances in synthetic data generation are in development today that will help with broader adoption of this approach and type of technology. It was already noted some time ago that the future of data sharing, data dissemination, and data access will utilize one of two methods: interactive analytics systems or synthetic data.1

1 Jerome P. Reiter, “New Approaches to Data Dissemination: A Glimpse into the Future (?),” CHANCE 17, no. 3 (June 2004): 11–15. https://oreil.ly/x89Vd.
