October 2024
Intermediate to advanced
522 pages
12h 55m
English
Deploying LLMs is challenging because of their significant compute and memory requirements. Running these models efficiently calls for specialized accelerators, such as GPUs or TPUs, that parallelize operations to achieve higher throughput. Some tasks, like document generation, can be processed in overnight batches, while others, such as code completion, demand low latency and fast generation. As a result, optimizing the inference process (how these models produce predictions from input data) is critical for many practical applications. This includes reducing the time it takes to generate the first token (latency), increasing the number of tokens generated per second (throughput), and minimizing ...
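The two metrics called out here, time to first token and tokens per second, can be measured around any streaming generation loop. The following is a minimal sketch, assuming a hypothetical stream_tokens(prompt) generator that yields one decoded token at a time; the timing logic is the point, not the specific inference backend.

```python
import time
from typing import Iterator


def stream_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference call.

    In practice this would wrap whatever serving stack is in use,
    e.g. a streaming API client or a local generate loop.
    """
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate per-token decode time
        yield token


def measure_inference(prompt: str) -> dict:
    """Measure time to first token (latency) and tokens per second (throughput)."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0

    for _token in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now - start  # latency: time to first token
        n_tokens += 1

    total = time.perf_counter() - start
    # Throughput is typically reported over the decode phase, after the first token.
    decode_time = total - (first_token_time or 0.0)
    if n_tokens > 1 and decode_time > 0:
        throughput = (n_tokens - 1) / decode_time
    else:
        throughput = float("nan")
    return {
        "time_to_first_token_s": first_token_time,
        "tokens_per_second": throughput,
    }


if __name__ == "__main__":
    print(measure_inference("Write a haiku about GPUs."))
```

In a real deployment, the same loop would wrap the production serving path, so latency and throughput are measured under realistic batching and load rather than in isolation.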