Appendix A. Synthetic Data Generation Tools

For domain-specific data generation tools, you have a variety of options.

Many projects follow the approach behind GANs and flow-based models: they train a generative model on real-world data and then use the trained generator as the source of synthetic data. Table A-1 lists several GAN-based methods.
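The fit-then-sample workflow described above can be sketched in a few lines of Python. This is a toy stand-in, not a GAN and not the API of any tool listed in this appendix: the "generator" only records the empirical distribution of each column and samples the columns independently. The function names `fit_marginals` and `sample`, and the example rows, are hypothetical and exist only for illustration.

```python
import random
from collections import Counter


def fit_marginals(rows):
    """'Train' a toy generator: record the value counts of each column."""
    columns = list(zip(*rows))  # transpose rows into columns
    return [Counter(column) for column in columns]


def sample(marginals, n, seed=0):
    """Draw n synthetic rows by sampling every column independently."""
    rng = random.Random(seed)
    synthetic_columns = [
        rng.choices(list(counts.keys()), weights=list(counts.values()), k=n)
        for counts in marginals
    ]
    return [tuple(row) for row in zip(*synthetic_columns)]


# Hypothetical "real" table with two columns: (age_band, smoker).
real_rows = [
    ("18-30", "no"),
    ("18-30", "yes"),
    ("31-50", "no"),
    ("31-50", "no"),
    ("51+", "yes"),
]

generator = fit_marginals(real_rows)         # step 1: fit on real data
synthetic_rows = sample(generator, n=1000)   # step 2: sample synthetic data
```

Because each column is sampled independently, this sketch discards any correlation between columns. Capturing the joint distribution of the columns is what learned generators such as the GAN-based tools in this appendix aim to do.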

Table A-1. Data-driven methods and tools
| Methods and tools | Description | Further reading | Type |
|---|---|---|---|
| CTGAN | A GAN-based data synthesizer that can generate synthetic tabular data with high fidelity | “Modeling Tabular Data Using Conditional GAN” | Tabular |
| TGAN | Outdated and superseded by CTGAN | | Tabular |
| gretel | Creates fake synthetic datasets with enhanced privacy guarantees | | Tabular |
| WGAN-GP | Recommended for training the GAN; suffers less from mode-collapse and has a more meaningful loss than other GAN-based data generation tools | “On the Generation and Evaluation of Synthetic Tabular Data Using GANs” | Tabular |
| DataSynthesizer | “Generates synthetic data that simulates a given dataset and applies DP techniques to achieve a strong privacy guarantee” | | Tabular |
| MedGAN | “[A] generative adversarial network for generating multilabel discrete patient records [that] can generate both binary and count variables (i.e., medical codes such as diagnosis codes, medication codes, or procedure codes)” | “Generating Multi-label Discrete Patient Records Using Generative Adversarial Networks” | Tabular |
| MC-MedGAN (multi-categorical GANs) | Produces synthetic data instances ... | | |
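
The "more meaningful loss" noted for WGAN-GP in Table A-1 can be made concrete with the critic objective from the original WGAN-GP paper ("Improved Training of Wasserstein GANs", Gulrajani et al., 2017). The formula is not stated in this appendix; the symbols follow that paper's notation, with D the critic, P_r the real data distribution, P_g the generator distribution, and \lambda the penalty weight.

```latex
L = \mathbb{E}_{\tilde{x} \sim P_g}\big[ D(\tilde{x}) \big]
  - \mathbb{E}_{x \sim P_r}\big[ D(x) \big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
    \Big[ \big( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \big)^2 \Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms estimate the (negated) Wasserstein distance between real and generated data, which is why the loss value tends to track sample quality during training. The last term is the gradient penalty: it pushes the norm of the critic's gradient toward 1 at points interpolated between real and generated samples, a soft way of enforcing the Lipschitz constraint that the Wasserstein formulation requires.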
