Chapter 5. Synthetic Data
Chapter 4 covered strategies for collecting and cleaning real-world data. But what happens when you’ve exhausted those strategies and still don’t have enough coverage to train a model effectively? Maybe your domain is so specialized that no public datasets exist for it. Or perhaps the data you need, such as real patient questions or financial transactions, is restricted by privacy regulations, so you can’t use it for training even if you have it.
You’ve tried training on the data you have, and you’ve found that it just doesn’t work. Exploring a different architecture might help, but if you’re honest with yourself, the underlying problem is that you don’t have enough data. So what happens next?
This is where synthetic data enters the picture.
The core idea is simple: use a large, capable model (the “teacher”) to generate training examples that a smaller, cheaper model (the “student”) will learn from. The teacher already knows how to answer medical questions, write legal analyses, or debug code. ...
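To make the teacher/student idea concrete, here is a minimal sketch of the generation step: a teacher produces a response for each seed prompt, and the resulting pairs are written out as JSON Lines that a student model could later be fine-tuned on. Everything named here is an assumption for illustration. `stub_teacher` is a hypothetical placeholder for a call to a large model, and `generate_synthetic_examples`, `to_jsonl`, and the `prompt`/`response` field names are invented for this sketch rather than taken from any particular library.

```python
import json
from typing import Callable, Dict, Iterable, List


def stub_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a large teacher model.

    In practice this would call whatever model you use as the teacher;
    here it returns a placeholder string so the sketch runs offline.
    """
    return f"[teacher answer to: {prompt}]"


def generate_synthetic_examples(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Ask the teacher to answer each prompt and collect the pairs."""
    examples = []
    for prompt in prompts:
        response = teacher(prompt)
        # Skip empty responses so they never reach the student's training set.
        if response.strip():
            examples.append({"prompt": prompt, "response": response})
    return examples


def to_jsonl(examples: List[Dict[str, str]]) -> str:
    """Serialize examples as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


# Illustrative seed prompts (assumed, not from a real dataset).
seed_prompts = [
    "What are common side effects of ibuprofen?",
    "Why does this loop never terminate: while i < 10: pass",
]

dataset = generate_synthetic_examples(seed_prompts, stub_teacher)
print(to_jsonl(dataset))
```

Passing the teacher in as a plain callable keeps the generation loop independent of any specific model provider, so the stub can be swapped for a real model call without changing the rest of the sketch.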