
LLMs in Production

Name: LLMs in Production
ISBN: 9781633437203

by Matthew Sharp, Christopher Brousseau

January 2025

Intermediate to advanced

456 pages

14h 39m

English

Manning Publications

LLMs in Production
copyright
dedication
contents
foreword
preface
acknowledgments
about the book
Who should read this book
How this book is organized
About the code
liveBook Discussion Forum
about the authors
about the cover illustration

1 Words’ awakening: Why large language models have captured attention
1.1 Large language models accelerating communication
1.2 Navigating the build-and-buy decision with LLMs
1.2.1 Buying: The beaten path
1.2.2 Building: The path less traveled
1.2.3 A word of warning: Embrace the future now
1.3 Debunking myths
Summary
2 Large language models: A deep dive into language modeling
2.1 Language modeling
2.1.1 Linguistic features
2.1.2 Semiotics
2.1.3 Multilingual NLP
2.2 Language modeling techniques
2.2.1 N-gram and corpus-based techniques
2.2.2 Bayesian techniques
2.2.3 Markov chains
2.2.4 Continuous language modeling
2.2.5 Embeddings
2.2.6 Multilayer perceptrons
2.2.7 Recurrent neural networks and long short-term memory networks
2.2.8 Attention
2.3 Attention is all you need
2.3.1 Encoders
2.3.2 Decoders
2.3.3 Transformers
2.4 Really big transformers
Summary
3 Large language model operations: Building a platform for LLMs
3.1 Introduction to large language model operations
3.2 Operations challenges with large language models
3.2.1 Long download times
3.2.2 Longer deploy times
3.2.3 Latency
3.2.4 Managing GPUs
3.2.5 Peculiarities of text data
3.2.6 Token limits create bottlenecks
3.2.7 Hallucinations cause confusion
3.2.8 Bias and ethical considerations
3.2.9 Security concerns
3.2.10 Controlling costs
3.3 LLMOps essentials
3.3.1 Compression
3.3.2 Distributed computing
3.4 LLM operations infrastructure
3.4.1 Data infrastructure
3.4.2 Experiment trackers
3.4.3 Model registry
3.4.4 Feature stores
3.4.5 Vector databases
3.4.6 Monitoring system
3.4.7 GPU-enabled workstations
3.4.8 Deployment service
Summary
4 Data engineering for large language models: Setting up for success
4.1 Models are the foundation
4.1.1 GPT
4.1.2 BLOOM
4.1.3 LLaMA
4.1.4 Wizard
4.1.5 Falcon
4.1.6 Vicuna
4.1.7 Dolly
4.1.8 OpenChat
4.2 Evaluating LLMs
4.2.1 Metrics for evaluating text
4.2.2 Industry benchmarks
4.2.3 Responsible AI benchmarks
4.2.4 Developing your own benchmark
4.2.5 Evaluating code generators
4.2.6 Evaluating model parameters
4.3 Data for LLMs
4.3.1 Datasets you should know
4.3.2 Data cleaning and preparation
4.4 Text processors
4.4.1 Tokenization
4.4.2 Embeddings
4.5 Preparing a Slack dataset
Summary
5 Training large language models: How to generate the generator
5.1 Multi-GPU environments
5.1.1 Setting up
5.1.2 Libraries
5.2 Basic training techniques
5.2.1 From scratch
5.2.2 Transfer learning (finetuning)
5.2.3 Prompting
5.3 Advanced training techniques
5.3.1 Prompt tuning
5.3.2 Finetuning with knowledge distillation
5.3.3 Reinforcement learning with human feedback
5.3.4 Mixture of experts
5.3.5 LoRA and PEFT
5.4 Training tips and tricks
5.4.1 Training data size notes
5.4.2 Efficient training
5.4.3 Local minima traps
5.4.4 Hyperparameter tuning tips
5.4.5 A note on operating systems
5.4.6 Activation function advice
Summary
6 Large language model services: A practical guide
6.1 Creating an LLM service
6.1.1 Model compilation
6.1.2 LLM storage strategies
6.1.3 Adaptive request batching
6.1.4 Flow control
6.1.5 Streaming responses
6.1.6 Feature store
6.1.7 Retrieval-augmented generation
6.1.8 LLM service libraries
6.2 Setting up infrastructure
6.2.1 Provisioning clusters
6.2.2 Autoscaling
6.2.3 Rolling updates
6.2.4 Inference graphs
6.2.5 Monitoring
6.3 Production challenges
6.3.1 Model updates and retraining
6.3.2 Load testing
6.3.3 Troubleshooting poor latency
6.3.4 Resource management
6.3.5 Cost engineering
6.3.6 Security
6.4 Deploying to the edge
Summary
7 Prompt engineering: Becoming an LLM whisperer
7.1 Prompting your model
7.1.1 Few-shot prompting
7.1.2 One-shot prompting
7.1.3 Zero-shot prompting
7.2 Prompt engineering basics
7.2.1 Anatomy of a prompt
7.2.2 Prompting hyperparameters
7.2.3 Scrounging the training data
7.3 Prompt engineering tooling
7.3.1 LangChain
7.3.2 Guidance
7.3.3 DSPy
7.3.4 Other tooling is available but …
7.4 Advanced prompt engineering techniques
7.4.1 Giving LLMs tools
7.4.2 ReAct
Summary
8 Large language model applications: Building an interactive experience
8.1 Building an application
8.1.1 Streaming on the frontend
8.1.2 Keeping a history
8.1.3 Chatbot interaction features
8.1.4 Token counting
8.1.5 RAG applied
8.2 Edge applications
8.3 LLM agents
Summary
9 Creating an LLM project: Reimplementing Llama 3
9.1 Implementing Meta’s Llama
9.1.1 Tokenization and configuration
9.1.2 Dataset, data loading, evaluation, and generation
9.1.3 Network architecture
9.2 Simple Llama
9.3 Making it better
9.3.1 Quantization
9.3.2 LoRA
9.3.3 Fully sharded data parallel–quantized LoRA
9.4 Deploy to a Hugging Face Hub Space
Summary
10 Creating a coding copilot project: This would have helped you earlier
10.1 Our model
10.2 Data is king
10.2.1 Our VectorDB
10.2.2 Our dataset
10.2.3 Using RAG
10.3 Build the VS Code extension
10.4 Lessons learned and next steps
Summary
11 Deploying an LLM on a Raspberry Pi: How low can you go?
11.1 Setting up your Raspberry Pi
11.1.1 Pi Imager
11.1.2 Connecting to Pi
11.1.3 Software installations and updates
11.2 Preparing the model
11.3 Serving the model
11.4 Improvements
11.4.1 Using a better interface
11.4.2 Changing quantization
11.4.3 Adding multimodality
11.4.4 Serving the model on Google Colab
Summary
12 Production, an ever-changing landscape: Things are just getting started
12.1 A thousand-foot view
12.2 The future of LLMs
12.2.1 Government and regulation
12.2.2 LLMs are getting bigger
12.2.3 Multimodal spaces
12.2.4 Datasets
12.2.5 Solving hallucination
12.2.6 New hardware
12.2.7 Agents will become useful
12.3 Final thoughts
Summary
appendix A History of linguistics
A.1 Ancient linguistics
A.2 Medieval linguistics
A.3 Renaissance and early modern linguistics
A.4 Early 20th-century linguistics
A.5 Mid-20th century and modern linguistics
appendix B Reinforcement learning with human feedback
appendix C Multimodal latent spaces
index

Content preview from LLMs in Production

appendix C Multimodal latent spaces

We haven’t yet had a good opportunity to dig into multimodal latent spaces, so we want to correct that here. One example of a multimodal model is Stable Diffusion, which turns a text prompt into an image. Diffusion refers to the process of comparing embeddings across two different modalities, and that comparison must be learned. A useful simplification is to imagine all of the text embeddings as a big cloud of points, similar to the embedding visualization we made in chapter 2 (section 2.3), but with billions of words represented. With that cloud in place, we can make a second cloud of embeddings in a different but related modality (images, for example).

We need to make sure ...
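
The two-cloud picture from the preview can be illustrated with a toy sketch. This is not how Stable Diffusion works and not the authors' code: the embeddings are random stand-ins for real text and image encoders, the sizes `text_dim` and `image_dim` and the `hidden_map` relation are hypothetical, and a plain least-squares projection is swapped in for whatever learned comparison a real model would use. It only shows the idea that points from one cloud can be compared with points from the other once a mapping between them has been learned.

```python
import numpy as np

rng = np.random.default_rng(0)
text_dim, image_dim = 16, 24  # hypothetical embedding sizes

# Hypothetical hidden relation between the two modalities (unknown to the learner).
hidden_map = rng.normal(size=(text_dim, image_dim))


def make_pairs(n):
    """Return n paired (text, image) embeddings; random stand-ins for real encoders."""
    text = rng.normal(size=(n, text_dim))
    image = text @ hidden_map + 0.01 * rng.normal(size=(n, image_dim))
    return text, image


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Learn the comparison: a least-squares projection from image space into text space.
train_text, train_image = make_pairs(200)
projection, *_ = np.linalg.lstsq(train_image, train_text, rcond=None)

# Compare held-out points from the two clouds in the shared (text) space.
test_text, test_image = make_pairs(5)
similarity = cosine_similarity(test_text, test_image @ projection)
best_match = similarity.argmax(axis=1)  # closest image index for each text
print(best_match)
```

In this toy setup each held-out text embedding should land closest to its own paired image, because the projection was fit on pairs drawn from the same relation. With real encoders the relation is not linear, which is why the comparison has to be learned by a model rather than solved in closed form.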
