Chapter 4. Strategies for Collecting Datasets
In Chapter 3, you built a complete DSPy-optimized program end to end. You saw how a small dataset, combined with an optimizer and an evaluation metric, can dramatically improve performance. Context engineering is not just about providing context to your AI program at inference time: often the biggest performance gains come from the context you provide in the training data used to optimize and evaluate your program.
But where do these example datasets come from? How many examples do you actually need? And what if you don’t have any labelled data? Data collection is where many AI projects stall. Creating datasets feels like a chicken-and-egg problem: you need data to know if your system works, but you need a working system to know what data is most valuable to collect.
I have often worked with clients who have built working prototypes they are proud of but who struggle to produce even a handful of curated examples of the correct outputs for a range of inputs. Building ...