Chapter 3. Getting Started: Distribution Fitting

A straightforward way to think about the process of data synthesis is that we are trying to model both the distributions of the real data and the structure of the real data. Based on that model we can then generate synthetic data that retains the characteristics of the original data. In this chapter we cover the first step in that process—modeling distributions. Once you know how to do that, we’ll move on to modeling the structure of the data in Chapter 5.

The starting point for modeling distributions is understanding how to fit individual variables to known distributions (or “classical” distributions, such as the normal and exponential). Once we are able to do that, we can generate data from these fitted distributions that has the same characteristics as the original data.1
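As a rough illustration of fitting a single variable to a classical distribution and then sampling from the fit, here is a minimal sketch using only the Python standard library. The variable names (`real_heights`, `real_waits`) and the simulated "real" data are assumptions made for this example, not data or code from the book. The normal fit estimates a mean and standard deviation, and the exponential fit uses the maximum-likelihood rate, which is one over the sample mean.

```python
import random
import statistics

random.seed(42)

# Hypothetical "real" data (an assumption for this sketch): one roughly
# normal variable and one roughly exponential variable, 5,000 records each.
real_heights = [random.gauss(170.0, 8.0) for _ in range(5000)]
real_waits = [random.expovariate(0.25) for _ in range(5000)]  # rate 0.25

# Fit a normal distribution by estimating its mean and standard deviation.
fitted_normal = statistics.NormalDist.from_samples(real_heights)

# Fit an exponential distribution: the maximum-likelihood estimate of the
# rate parameter is 1 / (sample mean).
fitted_rate = 1.0 / statistics.fmean(real_waits)

# Generate synthetic values from the fitted distributions.
synthetic_heights = fitted_normal.samples(5000, seed=7)
synthetic_waits = [random.expovariate(fitted_rate) for _ in range(5000)]

print(f"normal fit: mean={fitted_normal.mean:.2f}, stdev={fitted_normal.stdev:.2f}")
print(f"exponential fit: rate={fitted_rate:.3f}")
print(f"synthetic heights mean: {statistics.fmean(synthetic_heights):.2f}")
print(f"synthetic waits mean:   {statistics.fmean(synthetic_waits):.2f}")
```

The synthetic values share the fitted parameters of the original variable without copying any individual record. In practice one would also check how well the chosen distribution fits before sampling from it.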

The next step is to model nonclassical distributions. Some real-world data and phenomena do not follow a classical distribution, and we still want to be able to synthesize such data. Therefore, we outline how machine learning models can be used to fit unconventional data distributions.
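To make the idea of a nonclassical distribution concrete, the sketch below builds a bimodal variable that no single normal or exponential curve describes well, then generates synthetic values from a model fitted to the data itself. The text here does not name a specific model, so this example swaps in a Gaussian kernel density estimate as a simple stand-in. The data, the function name `sample_from_kde`, and the hand-picked `bandwidth` are all assumptions made for illustration.

```python
import random

random.seed(0)

# Hypothetical bimodal "real" variable (an assumption for this sketch):
# two clusters that a single classical distribution cannot capture.
real_values = (
    [random.gauss(20.0, 2.0) for _ in range(1500)]
    + [random.gauss(60.0, 5.0) for _ in range(1500)]
)


def sample_from_kde(values, n_samples, bandwidth, rng):
    """Draw from a Gaussian kernel density estimate of `values`.

    Sampling from a KDE is equivalent to picking one observed value at
    random and adding Gaussian noise with standard deviation `bandwidth`.
    """
    return [rng.choice(values) + rng.gauss(0.0, bandwidth) for _ in range(n_samples)]


rng = random.Random(1)
bandwidth = 1.0  # hand-picked smoothing width, not a tuned value
synthetic_values = sample_from_kde(real_values, 3000, bandwidth, rng)

# Compare the share of records in the lower cluster for real and synthetic data.
real_low = sum(v < 40.0 for v in real_values) / len(real_values)
synthetic_low = sum(v < 40.0 for v in synthetic_values) / len(synthetic_values)
print(f"share below 40: real={real_low:.3f}, synthetic={synthetic_low:.3f}")
```

The synthetic values reproduce both clusters because the model follows the shape of the data instead of assuming a shape in advance. The bandwidth controls the trade-off: a smaller value stays closer to the observed records, and a larger value smooths more.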

Framing Data

Any data analysis task begins with a pile of data that needs to be transformed into a data frame. A data frame is a table of data in which each row, also known as a record, is a complete, self-contained example of the data being represented. Each column, also known as a variable or field, is a detail about ...
