AI-Ready Data Blueprints

by Navnit Shukla, Kien Pham, Srikanth Sopirala, Harsha Tadiparthi

May 2026

Intermediate to advanced

292 pages

8h 5m

English

O'Reilly Media, Inc.


Foreword
Preface
  What You’ll Find Inside
  Who This Book Is For
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Introduction to AI-Ready Data Foundation
  Introduction and Market Context
  What Makes Generative AI Different
  The Transformer Architecture: The Technical Foundation of GenAI
  Enterprise Data as the Key Differentiator
  Contextual Intelligence: A Simple Example
  Representing Meaning in Vector Space
  Enterprise Example: Cross-Document Understanding in Customer Support
  The Evolution of GenAI Applications
  From Assistants to Agents: The Four Stages of GenAI Evolution
  The Increasing Complexity of Data Requirements
  What Leading Organizations Are Building Today
  The Production Reality Check
  Emerging GenAI Architectural Patterns
  From Patterns to Practice
  Development Cycles: ML Versus GenAI
  Traditional Machine Learning Development Cycle
  GenAI Development Cycle
  Data Infrastructure Implications
  Architecture Evolution in Practice: From Traditional ETL to GenAI-Ready Pipelines
  Key Differences Between Traditional and GenAI Data Foundations
  Real-World Example: Evolving from Kimball to Medallion to GenAI-Ready Architecture
  Preparing for the Agent-Driven Future
  “Jeeves Does the Shopping”: The Agent-Mediated Consumer Experience
  Search Engine Optimization: From Keywords to Agent Optimization
  Agents Have No Allegiance: Preparing for Radical Transparency
  Business Autopilot: Autonomous Operations and Decision Making
  Data Readiness Checklist for the Agent-Driven Future
  Blueprint: Implementation Guidance
  Where to Start
  The Decision Tree: Which Pattern First?
  Your Action Plan
  Summary
2. Data Framework for GenAI and Agentic AI Applications
  Introduction: Building the Foundation for AI-Ready Data
  The Evolution of Data Frameworks
  Early Days (2015–2019)
  Transition Period (2020–2022)
  GenAI Era (2022–Present)
  Agentic AI Era (2024–Present)
  The Need for a New Approach: Core Requirements for AI-Ready Data
  Capturing Business Logic and Context
  Ensuring Data Quality and Consistency
  Managing Complexity and Diversity
  Maintaining Security, Compliance, and Privacy
  Enabling Information Sharing and Collaboration
  Supporting Scale and Performance
  Managing Data as a Strategic Product
  Empowering Users with Documentation and Guidance
  A Core Framework for AI-Ready Data
  Capturing Business Logic and Context
  Ensuring Data Quality and Consistency
  Managing Complexity and Diversity
  Maintaining Security, Compliance, and Privacy
  Enabling Information Sharing and Collaboration
  Supporting Scale and Performance
  Managing Data as a Strategic Product
  Empowering Users with Documentation and Guidance
  AI-Ready Data Blueprints for the Data Framework: Practical Implementation Guide
  Blueprint 1: Business Context Intelligence Engine
  Blueprint 2: Adaptive Data Quality Orchestration
  Blueprint 3: Orchestrating Data Diversity and Complexity
  Blueprint 4: Security-First AI Data Platform
  Summary
3. Data Wrangling and Data Preparation for GenAI and Agentic AI Applications
  The Enterprise Challenge
  Purpose and Audience
  The Market Imperative
  The Strategic Advantage
  Transformation Journey Preview
  Understanding the Paradigm Shift
  The Established Machine Learning Data Pipeline
  The Tabular Data Paradigm and Its Constraints
  The Conceptual Revolution in Data Processing
  New Processing Paradigms and Technical Requirements
  The Business Impact of the Paradigm Shift
  Self-Assessment: Where Is Your Organization Today?
  Building Blocks of GenAI-Ready Data
  Semantic Understanding Fundamentals
  Knowledge Graphs and Relationship Modeling
  Real-Time Data Processing Requirements
  GenAI Data Preparation Maturity Model
  Case Study: Retail Organization Transformation
  The Semantic Layer Architecture
  Architectural Overview and Core Principles
  Data Foundation Layer: Building on Lakehouse Architecture
  Metadata and Ontology Management: Creating Semantic Understanding
  Transformation and Enrichment Pipeline: Adding Intelligence to Data
  Vectorization and Indexing Infrastructure: Enabling Semantic Search
  APIs and Reasoning Layer: Enabling AI Consumption
  Component Interactions and Data Flows
  Implementation Considerations and Common Patterns
  AWS Reference Architecture and Implementation
  AWS Ecosystem Overview for GenAI Data Preparation
  Key AWS Services and Capabilities
  Integration Patterns and Service Selection
  Governance, Quality, and Observability
  Expanded Quality Dimensions for GenAI Data
  Governance Framework Essentials
  A Governance and Quality Monitoring Framework
  Monitoring and Observability Approaches
  Regulatory Compliance and Assurance
  Case Study: Financial Services Governance Implementation
  Advanced Topics: Agentic AI Data Requirements
  Agentic AI Fundamentals and Data Implications
  Enterprise Implementation Examples
  The Future: Autonomous and AI-Assisted Data Wrangling
  AI-Powered Data Discovery and Intelligent Classification
  Conversational Data Preparation and Natural Language Interfaces
  Autonomous Quality Assurance and Self-Healing Systems
  Future Directions and Research Frontiers
  Action Plan
  Key Principles
  Immediate Next Steps by Maturity Level
  Long-Term Strategic Considerations
  Summary
4. Data Governance, Security, Compliance, and Orchestration for GenAI
  Data Governance and Data Security for AI Applications
  The AI Data Governance Operating Model
  Differences in Governance of Structured and Unstructured Data
  Data Stewardship and Metadata Management
  Making Data Accessible for Humans and Agentic AI
  Testing Our Tools
  Data Quality
  Responsible AI
  Fairness
  Transparency
  Accountability
  Privacy and Security
  Reliability
  Human Oversight
  Data Privacy and Security for GenAI Applications
  Sensitive Information Protection
  Topic Restriction and Word Filtering
  Filtering Tools
  End-to-End Data Protection
  LLMOps: AI Workflow Orchestration
  Orchestration Patterns for AI Applications
  Agentic Patterns
  Summary
5. Knowledge Bases and Vector Databases
  The GenAI Data Challenge
  Knowledge Bases: Data Organization and Storage
  Vector Databases: Data Representation and Similarity Search
  Retrieval-Augmented Generation: Data Retrieval and Context Assembly
  Why These Technologies Are Essential
  Knowledge Base Fundamentals: From Data to Knowledge
  Data Architecture for Knowledge Bases
  Data Preparation Pipeline for Knowledge Bases
  Data Types and Management Strategies
  Data Quality: The Make-or-Break Factor
  Vector Database Fundamentals: Data Representation
  Embeddings and Data Quality
  Vector Database Infrastructure
  Open Source Solutions
  Data Indexing Algorithm Comparison
  Data Compression Techniques
  Data Governance Feature Comparison
  Selecting Based on Data Requirements
  RAG: The Data Retrieval Layer
  Why RAG? The Data Access Problem
  The RAG Architecture: Data Flow
  Data Chunking: Critical for RAG Performance
  Data Efficiency and Cost Optimization
  End-to-End Data Flow: KB → Vector DB → RAG
  Summary
6. AI Application Optimization for Production Readiness
  The Journey from Prototype to Production
  Preparing Your AI Application for Production Success
  The Twin Pillars of AI Optimization
  Engineering Efficient Data Workflows for Production AI Systems
  Data Quality Assessment and Preprocessing Fundamentals
  Understanding Context Windows and Their Implications
  Handling Large Input Contexts
  Optimizing Inference Quality Using Automated Reasoning
  Metadata Readiness for Agentic AI Systems
  The Challenges of Scale and Complexity
  Pattern: Leveraging Raw Files Directly
  Building Data-Aware Agents Through Intelligent Semantic Metadata
  Example Implementation: Leveraging Raw Files Directly
  From Raw Content to Semantic Intelligence: Constructing the Metadata Layer
  Key Capabilities for Production Readiness for Agentic AI Platforms
  Leading AI Agent Platform Choices
  Core Runtime and Deployment Strategy
  Framework, Tool Integration, and Protocol Support
  Memory and Knowledge Management Capabilities
  Security and Compliance Features
  Observability and Quality Assurance Support
  Summary
Index
About the Authors

Content preview from AI-Ready Data Blueprints

Foreword

In my lab at the University of Rochester, we spent over a decade building AI systems that listen to a patient’s voice and watch their facial movements to detect early signs of Parkinson’s disease and autism, often before a clinician ever sees them. The models we built were sophisticated. The algorithms were sound. But the hardest problem was never the model. It was the data.

We learned this lesson the way most researchers do: painfully. Our early systems would perform beautifully on curated datasets and then fall apart in the real world—not because the neural networks were wrong, but because the data feeding them was incomplete, inconsistent, or stripped of the context that gave it meaning. A voice recording without metadata about the patient’s medication timing was just noise. A facial expression without the conversational context was ambiguous at best, misleading at worst. The signal was always there—the data just wasn’t ready to reveal it.

That experience, repeated across clinical studies, national-scale health AI deployments in Saudi Arabia, and advisory work with the National Academies, has given me a deep conviction: the organizations that will lead in the AI era are not the ones with the most powerful models. They are the ones with the most deliberately architected data.

This is precisely the argument that Navnit, Kien, Srikanth, and Harsha make in AI-Ready Data Blueprints, and they make it with a clarity and practical depth that is rare in technical writing.


Publisher Resources

ISBN: 9798341631786

