Chapter 2. Data Readiness and Accessibility
In Chapter 1, we explored how GenAI applications represent a fundamentally different class of AI systems compared to traditional machine learning models for classification and regression tasks. The generative, probabilistic nature of GenAI systems introduces unique complexities across the entire lifecycle, and as we’ll see, this starts with the data.
“Wait, I thought LLMs were all about the models?” a CTO once asked Sarita after their team had spent months fine-tuning parameters but still couldn’t match their prototype’s performance in production. This is perhaps the most common misconception we encounter when working in the field. While the models get the spotlight, it’s really the quality, accessibility, and governance of your data that ultimately determine whether your GenAI application succeeds or fails in the real world.
When our team works with organizations transitioning from prototype to production, we typically find they’ve underestimated the data challenges by an order of magnitude. Industry analysis suggests that data preparation can consume up to 80% of the total effort in AI projects. With GenAI, this becomes even more pronounced, particularly when building systems that need to be reliable, accurate, and trustworthy.
The Amplified Importance of Data for GenAI
“Garbage in, garbage out” (GIGO) takes on an entirely new dimension with GenAI, as illustrated in Figure 2-1. For traditional ML models, the consequence of poor data ...