LLMs in Production

by Christopher Brousseau, Matthew Sharp

January 2025

Intermediate to advanced

456 pages

14h 39m

English

Manning Publications


Who should read this book
How this book is organized
About the code
liveBook Discussion Forum

1.1 Large language models accelerating communication
1.2 Navigating the build-and-buy decision with LLMs
1.2.1 Buying: The beaten path
1.2.2 Building: The path less traveled
1.2.3 A word of warning: Embrace the future now
1.3 Debunking myths
Summary
2.1 Language modeling
2.1.1 Linguistic features
2.1.2 Semiotics
2.1.3 Multilingual NLP
2.2 Language modeling techniques
2.2.1 N-gram and corpus-based techniques
2.2.2 Bayesian techniques
2.2.3 Markov chains
2.2.4 Continuous language modeling
2.2.5 Embeddings
2.2.6 Multilayer perceptrons
2.2.7 Recurrent neural networks and long short-term memory networks
2.2.8 Attention
2.3 Attention is all you need
2.3.1 Encoders
2.3.2 Decoders
2.3.3 Transformers
2.4 Really big transformers
Summary
3.1 Introduction to large language model operations
3.2 Operations challenges with large language models
3.2.1 Long download times
3.2.2 Longer deploy times
3.2.3 Latency
3.2.4 Managing GPUs
3.2.5 Peculiarities of text data
3.2.6 Token limits create bottlenecks
3.2.7 Hallucinations cause confusion
3.2.8 Bias and ethical considerations
3.2.9 Security concerns
3.2.10 Controlling costs
3.3 LLMOps essentials
3.3.1 Compression
3.3.2 Distributed computing
3.4 LLM operations infrastructure
3.4.1 Data infrastructure
3.4.2 Experiment trackers
3.4.3 Model registry
3.4.4 Feature stores
3.4.5 Vector databases
3.4.6 Monitoring system
3.4.7 GPU-enabled workstations
3.4.8 Deployment service
Summary
4.1 Models are the foundation
4.1.1 GPT
4.1.2 BLOOM
4.1.3 LLaMA
4.1.4 Wizard
4.1.5 Falcon
4.1.6 Vicuna
4.1.7 Dolly
4.1.8 OpenChat
4.2 Evaluating LLMs
4.2.1 Metrics for evaluating text
4.2.2 Industry benchmarks
4.2.3 Responsible AI benchmarks
4.2.4 Developing your own benchmark
4.2.5 Evaluating code generators
4.2.6 Evaluating model parameters
4.3 Data for LLMs
4.3.1 Datasets you should know
4.3.2 Data cleaning and preparation
4.4 Text processors
4.4.1 Tokenization
4.4.2 Embeddings
4.5 Preparing a Slack dataset
Summary
5.1 Multi-GPU environments
5.1.1 Setting up
5.1.2 Libraries
5.2 Basic training techniques
5.2.1 From scratch
5.2.2 Transfer learning (finetuning)
5.2.3 Prompting
5.3 Advanced training techniques
5.3.1 Prompt tuning
5.3.2 Finetuning with knowledge distillation
5.3.3 Reinforcement learning with human feedback
5.3.4 Mixture of experts
5.3.5 LoRA and PEFT
5.4 Training tips and tricks
5.4.1 Training data size notes
5.4.2 Efficient training
5.4.3 Local minima traps
5.4.4 Hyperparameter tuning tips
5.4.5 A note on operating systems
5.4.6 Activation function advice
Summary
6.1 Creating an LLM service
6.1.1 Model compilation
6.1.2 LLM storage strategies
6.1.3 Adaptive request batching
6.1.4 Flow control
6.1.5 Streaming responses
6.1.6 Feature store
6.1.7 Retrieval-augmented generation
6.1.8 LLM service libraries
6.2 Setting up infrastructure
6.2.1 Provisioning clusters
6.2.2 Autoscaling
6.2.3 Rolling updates
6.2.4 Inference graphs
6.2.5 Monitoring
6.3 Production challenges
6.3.1 Model updates and retraining
6.3.2 Load testing
6.3.3 Troubleshooting poor latency
6.3.4 Resource management
6.3.5 Cost engineering
6.3.6 Security
6.4 Deploying to the edge
Summary
7.1 Prompting your model
7.1.1 Few-shot prompting
7.1.2 One-shot prompting
7.1.3 Zero-shot prompting
7.2 Prompt engineering basics
7.2.1 Anatomy of a prompt
7.2.2 Prompting hyperparameters
7.2.3 Scrounging the training data
7.3 Prompt engineering tooling
7.3.1 LangChain
7.3.2 Guidance
7.3.3 DSPy
7.3.4 Other tooling is available but …
7.4 Advanced prompt engineering techniques
7.4.1 Giving LLMs tools
7.4.2 ReAct
Summary
8.1 Building an application
8.1.1 Streaming on the frontend
8.1.2 Keeping a history
8.1.3 Chatbot interaction features
8.1.4 Token counting
8.1.5 RAG applied
8.2 Edge applications
8.3 LLM agents
Summary
9.1 Implementing Meta’s Llama
9.1.1 Tokenization and configuration
9.1.2 Dataset, data loading, evaluation, and generation
9.1.3 Network architecture
9.2 Simple Llama
9.3 Making it better
9.3.1 Quantization
9.3.2 LoRA
9.3.3 Fully sharded data parallel–quantized LoRA
9.4 Deploy to a Hugging Face Hub Space
Summary
10.1 Our model
10.2 Data is king
10.2.1 Our VectorDB
10.2.2 Our dataset
10.2.3 Using RAG
10.3 Build the VS Code extension
10.4 Lessons learned and next steps
Summary
11.1 Setting up your Raspberry Pi
11.1.1 Pi Imager
11.1.2 Connecting to Pi
11.1.3 Software installations and updates
11.2 Preparing the model
11.3 Serving the model
11.4 Improvements
11.4.1 Using a better interface
11.4.2 Changing quantization
11.4.3 Adding multimodality
11.4.4 Serving the model on Google Colab
Summary
12.1 A thousand-foot view
12.2 The future of LLMs
12.2.1 Government and regulation
12.2.2 LLMs are getting bigger
12.2.3 Multimodal spaces
12.2.4 Datasets
12.2.5 Solving hallucination
12.2.6 New hardware
12.2.7 Agents will become useful
12.3 Final thoughts
Summary
A.1 Ancient linguistics
A.2 Medieval linguistics
A.3 Renaissance and early modern linguistics
A.4 Early 20th-century linguistics
A.5 Mid-20th century and modern linguistics

Overview

Learn how to put large language model-based applications into production safely and efficiently.

This practical book offers clear, example-rich explanations of how LLMs work, how you can interact with them, and how to integrate LLMs into your own applications. Find out what makes LLMs so different from traditional software and ML, discover best practices for working with them out of the lab, and dodge common pitfalls with experienced advice.

In LLMs in Production you will:

Grasp the fundamentals of LLMs and the technology behind them
Evaluate when to use a premade LLM and when to build your own
Efficiently scale up an ML platform to handle the needs of LLMs
Train LLM foundation models and finetune an existing LLM
Deploy LLMs to the cloud and edge devices using techniques like PEFT and LoRA
Build applications leveraging the strengths of LLMs while mitigating their weaknesses

LLMs in Production delivers vital insights into MLOps for LLMs so you can guide a model smoothly into production use. Inside, you’ll find practical guidance on everything from acquiring an LLM-suitable training dataset to building a platform and compensating for the models’ immense size. You’ll also get tips and tricks for prompt engineering, retraining and load testing, handling costs, and ensuring security.

About the Technology
Most business software is developed and improved iteratively, and can change significantly even after deployment. By contrast, because LLMs are expensive to create and difficult to modify, they require meticulous upfront planning, exacting data standards, and carefully executed technical implementation. Integrating LLMs into production products impacts every aspect of your operations plan, including the application lifecycle, data pipeline, compute cost, security, and more. Get it wrong, and you may have a costly failure on your hands.

About the Book
LLMs in Production teaches you how to develop an LLMOps plan that can take an AI app smoothly from design to delivery. You’ll learn techniques for preparing an LLM dataset, cost-efficient training hacks like LoRA and RLHF, and industry benchmarks for model evaluation. Along the way, you’ll put your new skills to use in three exciting example projects: creating and training a custom LLM, building a VS Code AI coding extension, and deploying a small model to a Raspberry Pi.

What's Inside

Balancing cost and performance
Retraining and load testing
Optimizing models for commodity hardware
Deploying on a Kubernetes cluster

About the Reader
For data scientists and ML engineers who know Python and the basics of cloud deployment.

About the Authors
Christopher Brousseau and Matt Sharp are experienced engineers who have led numerous successful large-scale LLM deployments.

Quotes
Covers all the essential aspects of how to build and deploy LLMs. It goes into the deep and fascinating areas that most other books gloss over.
- Andrew Carr, Cartwheel

A must-read for anyone looking to harness the potential of LLMs in production environments.
- Jepson Taylor, VEOX Inc.

An exceptional guide that simplifies the building and deployment of complex LLMs.
- Arunkumar Gopalan, Microsoft UK

A thorough and practical guide for running LLMs in production.
- Dinesh Chitlangia, AMD


Publisher Resources

ISBN: 9781633437203
