Log-synth is open source software for generating synthetic data that can mimic the performance of real data, which is especially useful when access to sensitive data is restricted. This chapter is a detailed technical description of the general purpose and implementation of log-synth, and it should be read as a how-to guide rather than a conceptual discussion. As such, the chapter overlaps somewhat with the technical descriptions of the specific use cases covered in Chapter 5 and Chapter 6, but it also goes beyond those examples.
For convenience, here is a link to the GitHub repository where we make log-synth freely available. The repository also contains pre-packaged samplers and some documentation: https://github.com/tdunning/log-synth.
As a package, log-synth has fairly simple goals:
Facilitate the creation of realistic random data by non-specialists
Be fast enough to generate big data–scale datasets quickly
Allow schemas to be defined that combine various building blocks flexibly
Make it easy to extend log-synth with new samplers
Keep the system and the user experience really simple
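To make the idea of a schema built from combinable building blocks more concrete, here is a minimal conceptual sketch in Python. It is not log-synth's code, and it does not use log-synth's schema syntax. Every name in it (id_sampler, choice_sampler, int_sampler, build_schema, generate) is hypothetical and chosen only for illustration. The sketch assumes that a sampler is a function that produces one value per call and that a schema is an ordered list of named samplers.

```python
import itertools
import random

# Hypothetical building blocks (not log-synth's API): each factory returns
# a zero-argument function that produces one value per call.
def id_sampler(start=0):
    counter = itertools.count(start)
    return lambda: next(counter)

def choice_sampler(rng, values):
    return lambda: rng.choice(values)

def int_sampler(rng, low, high):
    # random.Random.randint includes both endpoints.
    return lambda: rng.randint(low, high)

def build_schema(rng):
    # A schema here is an ordered list of (field name, sampler) pairs.
    return [
        ("id", id_sampler()),
        ("first_name", choice_sampler(rng, ["Ada", "Grace", "Alan", "Edsger"])),
        ("age", int_sampler(rng, 18, 90)),
    ]

def generate(schema, count):
    # Draw one value from every sampler for each record.
    return [{name: sample() for name, sample in schema} for _ in range(count)]

rng = random.Random(42)  # fixed seed so that runs are repeatable
records = generate(build_schema(rng), 3)
for record in records:
    print(record)
```

In this sketch, adding a new kind of field means writing one more small factory function and listing it in the schema, which is the kind of flexibility and extensibility the goals above describe.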
To meet these goals, log-synth has been designed to be minimalist in overhead and structure but very generous in the variety of built-in samplers. These goals have meant that while log-synth contains a wide variety of primitive generators for things like names, addresses, ...