Chapter 4. Data Preparation
Before longer texts are fed into a RAG system, preprocessing breaks them into smaller chunks. Each chunk is converted into an embedding vector that captures its semantic meaning. During retrieval, the system measures similarity by calculating distances between these vectors to find relevant text.
This works best when each chunk contains exactly one piece of information and can be understood independently. The key challenge is preparing chunks that stand alone without relying on surrounding context. A robust processing pipeline cleans raw text and splits it at the right points to produce meaningful chunks.
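To make the distance-based retrieval step concrete, here is a minimal pure-Python sketch using cosine similarity and toy three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions; the vectors and names here are illustrative, not from the chapter):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for real model output.
query   = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # semantically close to the query
chunk_b = [0.0, 0.1, 0.9]   # unrelated content

# The chunk whose vector is closest to the query wins retrieval.
print(cosine_similarity(query, chunk_a) > cosine_similarity(query, chunk_b))  # True
```

The same comparison scales to a whole corpus by ranking every chunk vector against the query vector and returning the top matches.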
Figure 4-1 illustrates common preprocessing techniques:
- Text preparation: Replace abbreviations and clean the text.
- Metadata collection: Store page numbers, source, and author.
- Text splitting: Apply character, recursive, semantic, or agentic chunking.
The goal is producing clear, unambiguous chunks that don’t require surrounding context.
Figure 4-1. A simplified RAG indexing pipeline including potential data processing techniques
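Of the splitting strategies above, recursive chunking is the most common starting point. The sketch below is a simplified pure-Python illustration of the idea (the function name, separator order, and 200-character limit are assumptions for this example, not the book's implementation): try coarse separators first, and fall back to finer ones only when a piece is still too long.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text at the coarsest separator that works, recursing with
    finer separators whenever a piece still exceeds chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = (current + sep + part) if current else part
                if len(candidate) <= chunk_size:
                    current = candidate          # keep accumulating
                else:
                    if current:
                        chunks.append(current)   # flush what fits
                    if len(part) > chunk_size:
                        # This piece alone is too big: recurse with finer separators.
                        chunks.extend(recursive_split(part, chunk_size, separators))
                        current = ""
                    else:
                        current = part           # start a fresh chunk
            if current:
                chunks.append(current)
            return chunks
    # No separator matched: fall back to a hard character split.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Paragraph breaks are tried before sentence breaks, so chunks tend to follow the document's natural structure rather than cutting mid-thought.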
Store useful metadata with each chunk to enable filtering during retrieval, making the process both faster and more accurate.
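A minimal sketch of that pattern, using plain Python instead of a real vector store (the `Chunk` class, field names, and sample data are hypothetical; a production system would use a vector database's built-in metadata filters):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def filtered_search(chunks, query_embedding, **filters):
    """Drop chunks whose metadata doesn't match, then rank the rest by similarity."""
    candidates = [c for c in chunks
                  if all(c.metadata.get(k) == v for k, v in filters.items())]
    return sorted(candidates,
                  key=lambda c: dot(query_embedding, c.embedding),
                  reverse=True)

corpus = [
    Chunk("Intro to chunking", [0.9, 0.1], {"author": "Kim", "page": 3}),
    Chunk("Metadata basics",   [0.2, 0.8], {"author": "Kim", "page": 7}),
    Chunk("Unrelated notes",   [0.9, 0.1], {"author": "Lee", "page": 1}),
]

results = filtered_search(corpus, [1.0, 0.0], author="Kim")
print([c.text for c in results])  # ['Intro to chunking', 'Metadata basics']
```

Filtering first shrinks the candidate set before any distance computation runs, which is why metadata makes retrieval both faster and more accurate.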
You can find all the code examples for this chapter in the book’s GitHub repository.
4.1 Adding Metadata to Enable Metadata Filtering
Problem
You want to store metadata alongside the text ...