Chapter 2. Foundation Models
Foundation models—including LLMs and multimodal models—form the backbone of modern RAG systems. These models are used both to generate answers for users and to prepare content before it’s stored and retrieved.
In the generation step, foundation models analyze retrieved context and user questions to produce grounded responses. In the preparation step, foundation models extract text from images, transcribe audio, summarize long documents, and enrich content with metadata that improves retrieval quality.
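To make the generation step concrete, here is a minimal sketch of how retrieved context and a user question might be assembled into a grounded prompt before being sent to a language model. The function name and prompt wording are illustrative assumptions, not an API from any particular library:

```python
def build_grounded_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt that asks the model to answer only from retrieved context.

    This is a hypothetical helper for illustration; real systems vary the
    instructions, chunk formatting, and citation scheme.
    """
    # Number each chunk so the model (and the user) can refer back to sources.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "When was the warranty extended?",
    ["The warranty was extended to 3 years in 2023.", "Returns require a receipt."],
)
```

The resulting string would then be passed to whatever generation model the system uses; the grounding instruction is what steers the model toward the retrieved content rather than its parametric memory.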
This chapter covers the language and multimodal models used in both phases: the preparation step (also called the ingestion phase), where content is processed, transformed, and prepared for storage, and the generation step, where models analyze retrieved information and generate answers for users.
Figure 2-1 shows a typical multimodal workflow for processing video content:
- Use a vision model to analyze video frames.
- Use speech-to-text to transcribe audio.
- Embed the resulting text by using an embedding model.
- Retrieve relevant context when users ask questions.
- Generate answers with a language model.
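The five steps above can be sketched as a single pipeline. In this sketch every model call is a hypothetical stub (the function names, the toy one-dimensional "embeddings," and the word-overlap retrieval are all stand-ins for real vision, speech-to-text, embedding, and language models):

```python
# Minimal sketch of the Figure 2-1 workflow, assuming stub models throughout.

def analyze_frames(frames):
    # Stub for a vision model that captions each frame.
    return [f"caption for {frame}" for frame in frames]

def transcribe_audio(audio_file):
    # Stub for a speech-to-text model.
    return f"transcript of {audio_file}"

def embed(texts):
    # Stub for an embedding model: a toy 1-dimensional "vector" per text.
    return [(text, [float(len(text))]) for text in texts]

def retrieve(index, question, k=2):
    # Toy retrieval: rank stored texts by word overlap with the question.
    scored = sorted(
        index,
        key=lambda entry: -len(set(entry[0].split()) & set(question.split())),
    )
    return [text for text, _vector in scored[:k]]

def generate(question, context):
    # Stub for a language model that answers from the retrieved chunks.
    return f"Answer to {question!r} grounded in {len(context)} chunks"

# Ingestion phase: frames and audio become text, then embeddings.
texts = analyze_frames(["frame_001.png"]) + [transcribe_audio("clip.wav")]
index = embed(texts)

# Query phase: retrieve relevant chunks, then generate an answer.
question = "what is in frame_001.png"
context = retrieve(index, question)
answer = generate(question, context)
```

In a production system each stub would be replaced with a call to the corresponding model, and the index would live in a vector database rather than an in-memory list.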
Figure 2-1. Multimodal models can interpret and generate text, images, audio, and video
Every RAG system needs a generation model that interprets the retrieved content and generates the required output—whether that’s answering a user question, ...