Chapter 13. Design Patterns and System Architecture
Throughout this book, we have explored a variety of techniques to adapt LLMs to solve our tasks, including in-context learning, fine-tuning, RAG, and tool use. While these techniques can satisfy the performance requirements of your use case, deploying an LLM-based application in production imposes additional criteria like cost, latency, and reliability. Meeting them requires substantial software scaffolding and specialized components around the model.
To this end, in this chapter we will discuss various techniques for composing a production-level LLM system that can power useful applications. We will explore how to leverage multi-LLM architectures to balance cost and performance. We will also look into software frameworks like DSPy that integrate LLM application development into the conventional software programming paradigm.
A production-grade LLM application is more than a standalone model: it is a system made up of several software and model components that support the LLM and make it reliable, fast, and cost-effective. The way these components are composed and connected is referred to as the system architecture.
Let’s begin by discussing a specific type: multi-LLM architectures that leverage multiple LLMs to solve your task.
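To make the idea concrete, here is a minimal sketch of one common multi-LLM pattern: a router that sends easy queries to a cheap, fast model and hard ones to a stronger, more expensive model. The model tier names and the difficulty heuristic are hypothetical placeholders, not a real provider API; in practice the routing decision might come from a classifier or a lightweight LLM rather than a keyword heuristic.

```python
# Sketch of a cost-aware multi-LLM router. The tier names ("small-model",
# "large-model") and the keyword-based heuristic are illustrative
# assumptions, not part of any real API.

HARD_KEYWORDS = {"prove", "derive", "refactor", "multi-step", "plan"}

def estimate_difficulty(query: str) -> float:
    """Crude difficulty score: query length plus presence of 'hard' keywords."""
    words = query.lower().split()
    keyword_hits = sum(1 for w in words if w in HARD_KEYWORDS)
    return len(words) / 50.0 + keyword_hits

def route(query: str, threshold: float = 1.0) -> str:
    """Pick a model tier: cheap for easy queries, expensive for hard ones."""
    return "large-model" if estimate_difficulty(query) >= threshold else "small-model"
```

For example, `route("What is 2+2?")` falls below the threshold and returns `"small-model"`, while a query containing words like "prove" or "plan" is routed to `"large-model"`. The design choice here is that misrouting an easy query to the large model only wastes money, while misrouting a hard query to the small model hurts quality, so the threshold should be tuned against your quality requirements.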
Multi-LLM Architectures
Throughout this book, we ...