Chapter 1. Defining Synthetic Data
Interest in synthetic data has grown rapidly over the last few years. This growth has been driven by two simultaneous trends. The first is the demand for large amounts of data to train and build artificial intelligence and machine learning (AIML) models. The second is recent work demonstrating effective methods for generating high-quality synthetic data. Together, these trends have led to the recognition that synthetic data can solve some difficult problems quite effectively, especially within the AIML community. Teams and business units within companies like NVIDIA, IBM, and Alphabet, as well as agencies such as the US Census Bureau, have adopted different types of data synthesis to support model building, application development, and data dissemination.
This report provides a general overview of synthetic data generation, with a focus on business value and use cases, plus high-level coverage of techniques and implementation practices. We aim to answer the questions that business readers typically ask, while also giving analytics leaders some direction on the options available and where to look to get started.
We show how synthetic data can accelerate AIML projects. Some of the problems that synthetic data can tackle would be too costly or dangerous to solve with more traditional methods (for example, training the models that control autonomous vehicles), or simply could not be solved otherwise.
AIML projects run across many industries, and the industry use cases included in this report are intended to give you a flavor of the broad applications of data synthesis. We also define an AIML project quite broadly, to include, for example, the development of software applications that have AIML components.
The report is divided into four chapters. This introductory chapter covers basic concepts and presents the case for synthetic data. Chapter 2 presents the data synthesis process and pipelines, scaling implementation in the enterprise, and best practices. Chapter 3 contains a series of industry-specific case studies. Chapter 4 is forward-looking and considers where this technology is headed.
In this chapter, we start by defining the types of synthetic data. This is followed by a description of the benefits of using synthetic data—the types of problems that data synthesis can solve. Given the recent adoption of this approach into practice, building trust in analysis results from synthetic data is important. We therefore also present examples supporting the utility of synthetic data and discuss methods to build trust.
Alternatives to data synthesis exist, and we present these next with an assessment of strengths and weaknesses. This chapter then closes with an overview of methods for synthetic data generation.
What Is Synthetic Data?
At a conceptual level, synthetic data is not real data but is data that has been generated from real data and that has the same statistical properties as the real data. This means that an analyst who works with a synthetic dataset should get analysis results that are similar to those they would get with real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. Furthermore, we refer to the process of generating synthetic data as synthesis.
Data in this context can mean different things. For example, data can be structured data (i.e., rows and columns), as one would see in a relational database. Data can also be unstructured text, such as doctors’ notes, transcripts of conversations among people or with digital assistants, or online interactions by email or chat. Furthermore, images, videos, audio, and virtual environments are also types of data that can be synthesized. We have seen examples of fake images in the machine learning literature; for instance, realistic faces of people who do not exist in the real world can be created, and you can view the results online.
Synthetic data is divided into two types, based on whether it is generated from actual datasets or not.
The first type is synthesized from real datasets. The analyst has some real datasets and builds a model that captures their distributions and structure, where structure means the multivariate relationships and interactions in the data. The synthetic data is then sampled or generated from that model. If the model is a good representation of the real data, the synthetic data will have statistical properties similar to those of the real data.
For example, a data science group specializing in understanding customer behaviors would need large amounts of data to build its models. But because of privacy or other concerns, the process for getting access to that customer data is slow, and the data that does arrive is often not good enough because of extensive masking and redaction of information. Instead, a synthetic version of the production datasets can be provided to the analysts for building their models. The synthesized data carries fewer constraints on its use, allowing them to progress more rapidly.
The second type of synthetic data is not generated from real data. It is created by using existing models or by using background knowledge of the analyst. These existing models can be statistical models of a process (for example, developed through surveys or other data collection mechanisms) or they can be simulations. Simulations can be created, for instance, by gaming engines that create simulated (and synthetic) images of scenes or objects, or by simulation engines that generate shopper data with particular characteristics (say, age and gender) of people who walk past the site of a prospective store at different times of the day.
Background knowledge can be, for example, a model of how a financial market behaves based on textbook descriptions or based on the behaviors of stock prices under various historical conditions, or it can be knowledge of the statistical distribution of human traffic in a store based on years of experience. In such a case, it is relatively straightforward to create a model and sample from it to generate synthetic data. If the analyst’s knowledge of the process is accurate, the synthetic data will behave in a manner that is consistent with real-world data. Of course, this works only when the phenomenon of interest is truly well understood.
As a final example, when a process is new or not well understood by the analyst and there is no real historical data to use, an analyst can make some simple assumptions about the distributions and correlations among the variables involved in the process. For example, the analyst can make a simplifying assumption that the variables have normal distributions and “medium” correlations among them, and create data that way. This type of data will likely not have the same properties as real data but can still be useful for some purposes, such as debugging an R data analysis program or running some types of performance testing on software applications.
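To make this concrete, here is a minimal sketch of that last approach in Python using NumPy and pandas. The variable names, means, standard deviations, and the "medium" correlation of 0.5 are illustrative assumptions, not values taken from any real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Assumed means and standard deviations for three illustrative variables.
means = np.array([50.0, 100.0, 10.0])
stds = np.array([5.0, 20.0, 2.0])

# Assume a "medium" correlation of 0.5 between every pair of variables.
corr = np.full((3, 3), 0.5)
np.fill_diagonal(corr, 1.0)

# Convert the correlation matrix to a covariance matrix and sample.
cov = corr * np.outer(stds, stds)
data = rng.multivariate_normal(means, cov, size=10_000)

synthetic = pd.DataFrame(data, columns=["age", "spend", "visits"])
print(synthetic.corr().round(2))  # off-diagonal values should be close to 0.5
```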
For some use cases, having high utility will matter quite a bit. In other cases, medium or even low utility may be acceptable. For example, if the objective is to build AIML models to predict customer behavior and make marketing decisions based on that, high utility will be important. On the other hand, if the objective is to see if your software can handle a large volume of transactions, the data utility expectations will be considerably less. Therefore, understanding what data, models, simulators, and knowledge exist as well as the requirements for data utility will drive the specific approach to use for generating the synthetic data.
Table 1-1 provides a summary of the synthetic data types.
| Type of synthetic data | Utility |
|---|---|
| Generated from real (nonpublic) datasets | Can be quite high |
| Generated from real public data | Can be high, although limitations exist because public data tends to be de-identified or aggregated |
| Generated from an existing model of a process, which can also be represented in a simulation engine | Will depend on the fidelity of the existing generating model |
| Based on analyst knowledge | Will depend on how well the analyst knows the domain and the complexity of the phenomenon |
| Generated from generic assumptions not specific to the phenomenon | Will likely be low |
Now that you have an understanding of the types of synthetic data, we will look at the benefits of data synthesis overall and for some of these data types specifically.
The Benefits of Synthetic Data
In this section, we present several ways that data synthesis can solve practical problems with AIML projects. The benefits of synthetic data can be dramatic. It can make impossible projects doable, significantly accelerate AIML initiatives, or result in material improvement in the outcomes of AIML projects.
Improving Data Access
Data access is critical to AIML projects. The data is needed to train and validate models. More broadly, data is also needed for evaluating AIML technologies that have been developed by others, as well as for testing AIML software applications or applications that incorporate AIML models.
Typically, data is collected for a particular purpose with the consent of the individual; for example, for participating in a webinar or for participating in a clinical research study. If you want to use that same data for a different purpose, such as for building a model to predict what kind of person is likely to sign up for a webinar or who would participate in a study, then that is considered a secondary purpose.
Access to data for secondary analysis is becoming problematic. The US Government Accountability Office1 and the McKinsey Global Institute2 both note that accessing data for building and testing AIML models is a challenge to their broader adoption. A Deloitte analysis concluded that data access issues rank among the top three challenges companies face when implementing AI.3 A recent survey from MIT Technology Review reported that almost half of the respondents identified data availability as a constraint on the use of AI within their company.4 At the same time, the public is growing uneasy about how their data is used and shared, and privacy laws are becoming stricter. A recent survey by O’Reilly highlighted the privacy concerns of companies adopting machine learning models, with more than half of companies experienced with AIML checking for privacy issues.5 In the same MIT survey mentioned previously, 64% of respondents identified “changes in regulation or greater regulatory clarity on data sharing” as the development most likely to lead to more data sharing.
Contemporary privacy regulations, such as the US Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) in Europe, impose constraints and requirements on using personal data for a secondary purpose. An example is the requirement to obtain additional consent or authorization from individuals. In many cases, this is not practical, and it can introduce bias into the data because consenters and nonconsenters differ in important characteristics.6
Data synthesis can give the analyst, rather efficiently and at scale, realistic data to work with. Because synthetic data would not be considered identifiable personal data, privacy regulations would not apply, and additional consent to use the data for secondary purposes would not be required.7
Improving Data Quality
Given the difficulty of getting access to data, many analysts try to just use open source or public datasets. These can be a good starting point, but they often lack diversity and are not well matched to the problems the models are intended to solve. Open data may also lack sufficient heterogeneity for robust training of models; for example, it may not capture rare cases well enough.
Sometimes the real data that exists is not labeled. Labeling a large number of examples for supervised learning tasks can be time-consuming, and manual labeling is error prone. Again, synthetic labeled data can be generated to accelerate model development. The synthesis process can ensure high accuracy in the labeling.
Using Synthetic Data for Exploratory Analysis
Analysts can build models on synthetic data to validate their assumptions and demonstrate the kind of results they can obtain. In this way, the synthetic data can be used in an exploratory manner. Knowing that they have interesting and useful results, the analysts can then go through the more complex process of getting the real data (either raw or de-identified) to build the final versions of their models.
For example, an analyst who is a researcher could use their exploratory models on synthetic data to then apply for funding to get access to the real data, which may require a full protocol and multiple levels of approvals. In such an instance, work with synthetic data that does not produce good models or actionable results would still be beneficial because analysts would have avoided the extra effort required to get access to the real data for a potentially futile analysis.
Another valuable use of synthetic data is for training an initial model before the real data is accessible. Then when the analyst gets the real data, they can use the trained model as a starting point for training with the real data. This can significantly expedite the convergence of the real data model (hence reducing compute time), and can potentially result in a more accurate model. This is an example of using synthetic data for transfer learning.
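A minimal sketch of this pattern, assuming scikit-learn and using randomly generated stand-ins for the synthetic and real datasets, is to pretrain a model on the synthetic data and then continue training the same model once the real data arrives:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-ins for the two datasets; in practice these would be your synthetic
# and real feature matrices and labels, with identical columns and label sets.
X_synth, y_synth = rng.normal(size=(5000, 10)), rng.integers(0, 2, 5000)
X_real, y_real = rng.normal(size=(500, 10)), rng.integers(0, 2, 500)

# Pretrain on the larger, easier-to-obtain synthetic dataset.
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200,
                      warm_start=True,  # keep learned weights between fit() calls
                      random_state=0)
model.fit(X_synth, y_synth)

# Once the real data arrives, continue training from the pretrained weights
# rather than starting from a random initialization.
model.fit(X_real, y_real)
```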
Using Synthetic Data for Full Analysis
A validation server can be deployed to run the analysis code that worked on the synthetic data on the real data. An analyst would perform all of their analysis on the synthetic data, and then submit the code that worked on the synthetic data to a secure validation server that has access to the real data, as illustrated in Figure 1-1. Because the synthetic data would be structured in the same way as the original data, the code that worked on the synthetic data should work directly on the real data. The results are then sent back to the analyst to confirm their models.
This is not intended to be an interactive system. The output from the validation server needs to be checked to ensure that no revealing information is being sent out in the code output. Therefore, it is intended to be used once or twice by the analyst at the very end of their analysis. It does give analysts assurance that the synthesis results are replicable on the real data.
When the utility of the synthetic data is high enough, the analysts can get similar results with the synthetic data as they would have with the real data, and no validation server is required. In such a case, the synthetic data plays the role of a proxy for the real data. This scenario is playing out in more and more use cases: as synthesis methods improve over time, this proxy outcome is going to become more common.
Replacing Real Data That Does Not Exist
In some situations, real data may not exist. The analyst may be trying to model something completely new, or the creation or collection of a real dataset from scratch may be cost prohibitive or impractical. Synthesized data can cover edge or rare cases that are difficult, impractical, or unethical to collect in the real world.
Synthetic data can also be used to increase the heterogeneity of a training dataset, which can result in a more robust AIML model. For example, unusual cases in which data does not exist or is difficult to collect can be synthesized and included in the training dataset. In that case, the utility of the synthetic data is measured in the robustness increment it gives to the AIML models.
We have seen that synthetic data can play a key role in solving a series of practical problems. One critical factor for the adoption of data synthesis, however, is trust in the generated data. It has long been recognized that high data utility will be needed for the broad adoption of data synthesis methods.8 This is the topic we turn to next.
Learning to Trust Synthetic Data
Initial interest in synthetic data started in the early ’90s with proposals to use multiple imputation methods to generate synthetic data. Imputation in general is the process of replacing missing data values with estimates. Missing data can occur, for example, in a survey if some respondents do not complete a questionnaire.
Accurate imputed data requires the analyst to build a model of the phenomenon of interest by using the available data and then use that model to estimate what the imputed value should be. To build a valid imputation model, the analyst needs to know how the data will be eventually used. With multiple imputation, you create multiple imputed values to capture the uncertainty in these estimated values. This process can work reasonably well if you know how the data will be used.
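As a rough illustration of the mechanics (not the original multiple-imputation methodology itself), scikit-learn's IterativeImputer can be run several times with sample_posterior=True to produce multiple plausible completions of a dataset with missing values. The toy data here is randomly generated:

```python
import numpy as np
# IterativeImputer is still marked experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Multiple imputation: each run samples imputed values from the posterior
# predictive distribution, so the completed datasets differ, reflecting
# the uncertainty in the estimates.
completed = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
```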
In the context of using imputation for data synthesis, the real data is augmented with synthetic data by using the same type of imputation techniques. In such a case, the real data is used to build an imputation model that is then used to synthesize new data.
The challenge is that if your imputation models are different from the eventual uses of the data, the imputed values may not be very reflective of the real values, and this will introduce errors in the data. This risk of building the wrong synthesis model has led to historic caution in the application of synthetic data.
More recently, statistical machine learning models have been used for data synthesis. The advantage of these models is that they can capture the distributions and complex relationships among the variables quite well. In effect, they discover the underlying model in the data rather than having that model prespecified by the analyst. And now with deep learning data synthesis, these models can be quite accurate in that they can capture much of the signal in the data—even subtle signals.
Therefore, we are getting closer to the point where the generative models available today are producing datasets that are becoming quite good proxies for real data. There are also ways to assess the utility of synthetic data more objectively.
For example, we can compare the analysis results from synthetic data with the analysis results from the real data. If we do not know what analysis will be performed on the synthetic data, a range of possible analyses can be tried based on known uses of that data. Or an “all models” evaluation can be performed, in which all possible models are built from the real and synthetic datasets and compared.9
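A simple version of such a comparison, sketched here with illustrative model and metric choices rather than a prescribed evaluation protocol, is to train the same model on the real data and on the synthetic data and compare how each performs on a holdout of real data. To keep the example self-contained, the "synthetic" data below is just a copy of the real training data; in practice it would come from your synthesizer:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the real dataset.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_real, X_holdout, y_real, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_synth, y_synth = X_real.copy(), y_real.copy()  # placeholder synthetic data

real_model = RandomForestClassifier(random_state=0).fit(X_real, y_real)
synth_model = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)

# If utility is high, the two AUCs on the real holdout should be close.
print("train-on-real  AUC:",
      roc_auc_score(y_holdout, real_model.predict_proba(X_holdout)[:, 1]))
print("train-on-synth AUC:",
      roc_auc_score(y_holdout, synth_model.predict_proba(X_holdout)[:, 1]))
```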
The US Census Bureau has, at the time of writing, decided to leverage synthetic data for some of its most heavily used public datasets: the 2020 decennial census data. For its tabular data disseminations, the agency will create a synthetic dataset from the collected individual-level census data and then produce the public tabulations from that synthetic dataset. A mixture of formal and nonformal methods will be used in the synthesis process.10 We provide an overview of the synthesis process in Chapter 2. This, arguably, demonstrates the large-scale adoption of data synthesis for one of the most critical and heavily used datasets available today.
As organizations build trust in synthetic data, they will move from exploratory analysis use cases, to the use of a validation server, and then to using synthetic data as a proxy for real data.
A legitimate question is: besides data synthesis, what other approaches are available today for accessing data for AIML purposes? We discuss these approaches, along with their advantages and disadvantages relative to data synthesis, in the following section.
Other Approaches to Accessing Data
When real data exists, two practical approaches, in addition to data synthesis, are available today that can be used to get access to the data. The first is de-identification. The second is secure multiparty computation.
Practical risk-based de-identification involves applying transformations to the data and putting in place additional controls (security, privacy, and contractual) to manage overall re-identification risks. A transformation can be, for example, generalizing a date of birth to a year of birth or a five-year range. Another transformation to data is to add noise to dates of events. Examples of controls include access controls to data and systems, performing background checks and training of analysts on privacy, and the use of encryption for data in transit and at rest. This process has worked well historically with clearly defined methodologies.11
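The two transformations just mentioned might look something like the following pandas sketch; the column names, the example dates, and the amount of date noise are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1984-03-02", "1991-11-17", "1975-06-30"]),
    "admission_date": pd.to_datetime(["2023-01-05", "2023-02-14", "2023-03-09"]),
})

rng = np.random.default_rng(7)

# Generalize date of birth to year of birth.
df["year_of_birth"] = df["date_of_birth"].dt.year
df = df.drop(columns="date_of_birth")

# Add random noise (here, up to +/- 7 days) to event dates.
noise_days = rng.integers(-7, 8, size=len(df))
df["admission_date"] = df["admission_date"] + pd.to_timedelta(noise_days, unit="D")
```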
As the complexity of datasets that are being analyzed increases, more emphasis is being put on the use of controls to manage the risk. The reason is that additional transformation would reduce the value of the data. Therefore, to ensure that the overall risk is acceptable, more controls are being put in place. This makes the economics of this kind of approach more challenging.
Data synthesis requires less manual intervention than de-identification, and there is no hard requirement for additional controls to be implemented by the synthetic data users.
The second approach that can be applied to get access to the data is to use secure multiparty computation. This technology allows computations to be performed on encrypted or garbled data; typically, multiple independent entities perform the computation collaboratively without sharing or leaking any raw data among themselves. There are multiple ways to do this, such as using secret sharing techniques (the data is randomly split among the collaborating entities) or homomorphic encryption techniques (the data is encrypted, and computations are performed on the encrypted values).
In general, to use secure computation techniques, the analytics that will be applied need to be known in advance, and the security properties of each analysis protocol must be validated. A good example is in public health surveillance: the rate of infections in long-term care homes was aggregated without revealing any individual home’s rate.12 This works well in the surveillance case where the analysis is well defined and static, but setting up secure multiparty computation protocols in practice is complex.
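To give a flavor of the secret sharing idea, here is a toy sketch in which each party splits its private count into random additive shares, so that only the aggregate is ever reconstructed. It omits all of the networking, integrity checks, and security proofs that a real deployment requires, and the counts are made up:

```python
import secrets

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime


def share(value, n_parties):
    """Split a value into n additive shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


# Each long-term care home's infection count, kept private.
private_counts = [12, 3, 7]
n = len(private_counts)

# Home i sends share j of its count to party j; no single party ever
# sees a complete count.
all_shares = [share(c, n) for c in private_counts]
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]

total = sum(partial_sums) % PRIME
print(total)  # 22: the aggregate, with no individual count revealed
```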
Perhaps more of an issue is that few people understand secure computation technology and the methods underlying many of these techniques, or can perform the necessary security proofs. This creates key dependencies on very few skilled resources.
Once you have made a decision to generate and use synthetic data, you can turn to the next section for an overview of specific techniques to do so.
Generating Synthetic Data from Real Data
In this section, we consider methods for generating synthetic data from real data. Other approaches—for instance, using simulators—are discussed in Chapter 3 since they are more specific to the application domain.
At a general level, two classes of methods generate synthetic data from real data. Both have a generation component followed by a discrimination component. The generation component builds a model of the real data and generates synthetic data from that model. The discrimination component compares the generated data with the real data. If this comparison concludes that the generated data is very different from the real data, the generation parameters are adjusted and then new synthetic data is generated. The process iterates until acceptable synthetic data is produced.
An acceptable synthetic dataset is largely indistinguishable from the real data. However, we must be careful not to build a model that exactly replicates the original data. Such overfitting can create its own set of problems—the key problem being that the synthetic data can have nontrivial privacy problems.
The first approach to generating synthetic data is illustrated in Figure 1-2. Here the input to synthesis is real data. Various techniques can be used for the generator.
One set of techniques fits the distributions of all the variables in the real data (such as the type of distribution, the mean, and variance), and computes the correlations among the variables. With that information, it is then possible to sample synthetic data by using Monte Carlo simulation techniques while inducing the empirically observed correlations.
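A bare-bones version of this idea is a Gaussian copula: estimate each variable's marginal distribution and the correlation structure from the real data, sample correlated normal scores by Monte Carlo, and map them back through the empirical quantiles. The sketch below uses NumPy, SciPy, and pandas, with a randomly generated stand-in for the real data:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in for the real data: two skewed, correlated variables.
income = rng.lognormal(mean=10, sigma=0.5, size=5000)
spend = 0.3 * income + rng.normal(0, 2000, size=5000)
real = pd.DataFrame({"income": income, "spend": spend})

# 1. Transform each variable to normal scores via its empirical ranks.
ranks = real.rank() / (len(real) + 1)
normal_scores = pd.DataFrame(stats.norm.ppf(ranks), columns=real.columns)

# 2. Estimate the correlation of the normal scores.
corr = normal_scores.corr().to_numpy()

# 3. Monte Carlo: sample new correlated normal scores...
z = rng.multivariate_normal(np.zeros(len(real.columns)), corr, size=len(real))

# 4. ...and map them back through each variable's empirical quantiles.
u = stats.norm.cdf(z)
synthetic = pd.DataFrame(
    {col: np.quantile(real[col], u[:, i]) for i, col in enumerate(real.columns)}
)
```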
There are more advanced techniques that consider more complex interactions among the variables than just pairwise correlations (such as multiway interactions). For example, some studies have compared parametric, nonparametric, and artificial neural network techniques for data synthesis.13 These empirical evaluations generate many synthetic datasets and evaluate the data utility of these to determine the extent to which the synthetic data produces analytics results that are comparable to the real data.
These evaluations have generally concluded that, overall, nonparametric statistical machine learning methods, such as decision trees, produce the best results. They are also simple to use and tune.
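A minimal sequential, tree-based synthesis sketch along these lines (similar in spirit to, but far simpler than, established tools such as synthpop, and run here on made-up data) samples the first variable from its observed values and then, for each later variable, fits a decision tree on the variables already synthesized and samples from the real observations in the matching leaf:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Stand-in for a real numeric dataset with dependent columns.
n = 2000
age = rng.normal(45, 12, n)
income = 1000 * age + rng.normal(0, 8000, n)
real = pd.DataFrame({"age": age, "income": income})

cols = list(real.columns)
synth = pd.DataFrame(index=range(n))

# First column: sample with replacement from its observed values.
synth[cols[0]] = rng.choice(real[cols[0]].to_numpy(), size=n, replace=True)

# Each later column: fit a tree on the preceding columns of the real data,
# then, for each synthetic row, sample a value from the real observations
# in the leaf that the row falls into (CART-style synthesis).
for j in range(1, len(cols)):
    tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
    tree.fit(real[cols[:j]], real[cols[j]])

    real_leaves = tree.apply(real[cols[:j]])
    synth_leaves = tree.apply(synth[cols[:j]])

    values = real[cols[j]].to_numpy()
    synth[cols[j]] = [
        rng.choice(values[real_leaves == leaf]) for leaf in synth_leaves
    ]
```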
Deep learning synthesis techniques, such as autoencoders, have not been rigorously compared to nonparametric generators. However, they would be a good alternative to decision trees and can also work well in practice for data synthesis.
Other iterative techniques have been utilized, such as iterative proportional fitting (which is discussed in Chapter 3). These are suitable for certain types of real data, such as when the source consists of aggregate statistics rather than only individual-level or transactional data.
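As a small taste of the iterative proportional fitting idea ahead of the fuller discussion in Chapter 3, the following toy example repeatedly rescales a two-dimensional seed table until it matches assumed published row and column totals (all numbers here are made up):

```python
import numpy as np

# Target marginals (e.g., from published aggregate statistics).
row_totals = np.array([60.0, 40.0])        # e.g., two age bands
col_totals = np.array([30.0, 50.0, 20.0])  # e.g., three regions

# Seed table encoding an assumed interaction structure (here: uniform).
table = np.ones((2, 3))

for _ in range(100):
    table *= (row_totals / table.sum(axis=1))[:, None]  # match row margins
    table *= (col_totals / table.sum(axis=0))[None, :]  # match column margins
```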
The second approach to generating synthetic data is illustrated in Figure 1-3. Here, instead of real data being the input to the generator, random data is provided as input. This is the configuration of generative adversarial networks and similar architectures. The model learns how to convert the random input into an acceptable synthetic dataset that passes the discriminator test.
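The following is a deliberately tiny sketch of this configuration in PyTorch, trained on a toy two-column dataset. It illustrates the generator/discriminator loop rather than a production-quality tabular GAN, and every architectural choice (layer sizes, learning rates, number of steps) is an illustrative assumption:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in "real" tabular data: two correlated numeric columns.
n, noise_dim = 2000, 8
base = torch.randn(n, 1)
real = torch.cat([base, 0.5 * base + 0.1 * torch.randn(n, 1)], dim=1)

generator = nn.Sequential(
    nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, real.shape[1])
)
discriminator = nn.Sequential(
    nn.Linear(real.shape[1], 32), nn.ReLU(), nn.Linear(32, 1)
)

loss_fn = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    # Discriminator step: learn to tell real rows from generated rows.
    fake = generator(torch.randn(n, noise_dim)).detach()
    d_loss = (loss_fn(discriminator(real), torch.ones(n, 1)) +
              loss_fn(discriminator(fake), torch.zeros(n, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: produce rows the discriminator labels as real.
    g_loss = loss_fn(discriminator(generator(torch.randn(n, noise_dim))),
                     torch.ones(n, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# After training, random noise maps to rows that resemble the real data.
synthetic = generator(torch.randn(1000, noise_dim)).detach()
```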
Things start to get quite interesting when some of these methods are combined; for example, by creating ensembles to generate the synthetic data or by using the output of one method as the input to another method. An ensemble would have more than one data generation method, and, for example, would select the best synthesized records to be retained. Opportunities certainly exist for further experimentation and innovation in data synthesis methodologies.
Conclusions
This chapter provided an overview of what synthetic data is, its benefits, and how to generate it, as well as some of the trends driving the need for synthetic data. Both businesses and government alike are utilizing synthetic data, as you’ll see in the use cases later in the report. In the next chapter, we look at the processes, data pipelines, and structure within an enterprise for data synthesis.
1 Government Accountability Office, “Artificial Intelligence: Emerging Opportunities, Challenges, and Implications,” GAO-18-142SP (March 2018). https://oreil.ly/Cpyli.
2 McKinsey Global Institute, “Artificial Intelligence: The Next Digital Frontier?” (June 2017). https://oreil.ly/zJ8oZ.
3 Deloitte Insights, “State of AI in the Enterprise, 2nd Edition” (2018). https://oreil.ly/l07tJ.
4 MIT Technology Review Insights, “The Global AI Agenda: Promise, Reality, and a Future of Data Sharing” (March 2020). https://oreil.ly/FHg87.
5 Ben Lorica and Paco Nathan, The State of Machine Learning Adoption in the Enterprise (O’Reilly).
6 Khaled El Emam, et al., “A Review of Evidence on Consent Bias in Research,” American Journal of Bioethics 13, no. 4 (2013): 42–44. https://oreil.ly/SiG2N.
7 However, one should follow good practices, such as providing notice to individuals about how the data is used and disclosed, and having ethics oversight on the uses of data and AIML models.
8 Jerome P. Reiter, “New Approaches to Data Dissemination: A Glimpse into the Future (?),” CHANCE 17, no. 3 (June 2004): 11–15. https://oreil.ly/x89Vd.
9 A review of utility assessment approaches can be found in Khaled El Emam, “Seven Ways to Evaluate the Utility of Synthetic Data,” IEEE Security and Privacy (July/August 2020).
10 Aref N. Dajani, et al., “The Modernization of Statistical Disclosure Limitation at the U.S. Census Bureau,” Census Scientific Advisory Committee Meeting (2017). https://oreil.ly/OL4Oe.
11 Khaled El Emam and Luk Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started (O’Reilly, 2014).
12 Khaled El Emam, et al., “Secure Surveillance of Antimicrobial Resistant Organism Colonization or Infection in Ontario Long Term Care Homes,” PLOS ONE 9, no. 4 (2014). https://oreil.ly/9dzJ4.
13 Jörg Drechsler and Jerome P. Reiter, “An Empirical Evaluation of Easily Implemented, Nonparametric Methods for Generating Synthetic Datasets,” Computational Statistics & Data Analysis 55, no. 12 (December 2011): 3232–3243. https://oreil.ly/qHQK8; Ashish Dandekar, et al., “A Comparative Study of Synthetic Dataset Generation Techniques,” National University of Singapore, TRA6/18 (June 2018). https://oreil.ly/qLh0b.