Chapter 3. Loading Data
About 80% of enterprise information is unstructured and distributed across presentations, documents, emails, and media files. Figure 3-1 shows some of the common data types.
Figure 3-1. A common distribution of data in companies
This chapter shows how to turn common data sources into text that you can embed and retrieve. Most RAG retrievers work with text embeddings, so a practical first step is to convert different formats into a consistent text representation.
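As a rough illustration of what "a consistent text representation" can mean in practice, here is a minimal sketch that dispatches on file extension and returns plain text plus the source path. It uses only the Python standard library. The names `load_as_text`, `CONVERTERS`, and the three converter functions are hypothetical and chosen for this example; the record shape (`source`, `text`) is likewise an assumption, not a fixed convention.

```python
import csv
import io
import json
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def html_to_text(raw):
    """Strip tags and return one line per text fragment."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def csv_to_text(raw):
    """Render each CSV row as a comma-separated line of trimmed cells."""
    rows = csv.reader(io.StringIO(raw))
    return "\n".join(", ".join(cell.strip() for cell in row) for row in rows)


def json_to_text(raw):
    """Pretty-print JSON so keys and values appear as readable text."""
    return json.dumps(json.loads(raw), indent=2, ensure_ascii=False)


# Hypothetical registry: file extension -> function that turns raw content into text.
CONVERTERS = {
    ".txt": lambda raw: raw,
    ".md": lambda raw: raw,
    ".html": html_to_text,
    ".htm": html_to_text,
    ".csv": csv_to_text,
    ".json": json_to_text,
}


def load_as_text(path):
    """Return {"source", "text"} for a supported file, assuming UTF-8 encoding."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"No converter registered for {suffix!r}")
    raw = path.read_text(encoding="utf-8")
    return {"source": str(path), "text": CONVERTERS[suffix](raw).strip()}
```

Binary formats such as PDF or slide decks need a dedicated parser and are deliberately left out of this sketch; the point is only that every format ends up as the same kind of record before embedding.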
Figure 3-2 summarizes the main components of a RAG system, showing both the indexing pipeline (loading and processing documents) and the runtime retrieval flow (retrieving relevant chunks and generating answers).
Figure 3-2. The components of a RAG system
The focus here is the loading step of the indexing pipeline, which this chapter covers for several document types.
Warning
This book builds core components from scratch to clarify the underlying concepts. In production, orchestration frameworks such as LangChain or LlamaIndex can accelerate development, but they also introduce moving parts like frequent breaking changes, fast-evolving APIs, and additional abstractions.
If you use these frameworks, pin dependency versions, follow their upgrade guides, and isolate framework-specific code behind small adapters. For the most stable foundation, ...
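One way to read "isolate framework-specific code behind small adapters" is sketched below. Everything here is hypothetical: `Document`, `Loader`, `FrameworkLoaderAdapter`, and `index_sources` are names invented for this example, and `fake_framework_load` stands in for whatever loading call a real framework exposes. The adapter is the only place that knows the framework's object shape, so a breaking upgrade should touch one class rather than the whole pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Mapping, Protocol


@dataclass
class Document:
    """Framework-neutral document used by the rest of the pipeline."""
    text: str
    metadata: dict = field(default_factory=dict)


class Loader(Protocol):
    """The only interface the pipeline depends on."""

    def load(self, source: str) -> list[Document]: ...


class FrameworkLoaderAdapter:
    """Wraps a framework loading call and converts its results to Document.

    The accessors are passed in because the shape of the framework's
    objects is an assumption that can change between versions.
    """

    def __init__(
        self,
        framework_load: Callable[[str], Iterable[Any]],
        get_text: Callable[[Any], str],
        get_metadata: Callable[[Any], Mapping],
    ):
        self._framework_load = framework_load
        self._get_text = get_text
        self._get_metadata = get_metadata

    def load(self, source: str) -> list[Document]:
        return [
            Document(text=self._get_text(item), metadata=dict(self._get_metadata(item)))
            for item in self._framework_load(source)
        ]


def index_sources(loader: Loader, sources: Iterable[str]) -> list[Document]:
    """Pipeline code: sees only Loader and Document, never the framework."""
    docs: list[Document] = []
    for source in sources:
        docs.extend(loader.load(source))
    return docs


# Stand-in for a real framework call (hypothetical, for demonstration only).
def fake_framework_load(source: str) -> list[dict]:
    return [{"content": f"text from {source}", "meta": {"source": source}}]


adapter = FrameworkLoaderAdapter(
    fake_framework_load,
    get_text=lambda item: item["content"],
    get_metadata=lambda item: item["meta"],
)
documents = index_sources(adapter, ["a.txt", "b.txt"])
```

Swapping frameworks, or replacing the framework with a from-scratch loader, then means writing another class that satisfies `Loader`; `index_sources` does not change.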