Generative AI in Action

by Amit Bahree

November 2024

Intermediate to advanced

464 pages

14h 38m

English

Manning Publications

Who should read this book
How this book is organized: A road map
About the code
liveBook discussion forum

1.1 What is this book about?
1.2 What is generative AI?
1.3 What can we generate?
1.3.1 Entities extraction
1.3.2 Generating text
1.3.3 Generating images
1.3.4 Generating code
1.3.5 Ability to solve logic problems
1.3.6 Generating music
1.3.7 Generating videos
1.4 Enterprise use cases
1.5 When not to use generative AI
1.6 How is generative AI different from traditional AI?
1.7 What approach should enterprises take?
1.8 Architecture considerations
1.9 So your enterprise wants to use generative AI. Now what?
Summary

2.1 Overview of foundational models
2.2 Overview of LLMs
2.3 Transformer architecture
2.4 Training cutoff
2.5 Types of LLMs
2.6 Small language models
2.7 Open source vs. commercial LLMs
2.7.1 Commercial LLMs
2.7.2 Open source LLMs
2.8 Key concepts of LLMs
2.8.1 Prompts
2.8.2 Tokens
2.8.3 Counting tokens
2.8.4 Embeddings
2.8.5 Model configuration
2.8.6 Context window
2.8.7 Prompt engineering
2.8.8 Model adaptation
2.8.9 Emergent behavior
Summary

3.1 Model categories
3.1.1 Dependencies
3.1.2 Listing models
3.2 Completion API
3.2.1 Expanding completions
3.2.2 Azure content safety filter
3.2.3 Multiple completions
3.2.4 Controlling randomness
3.2.5 Controlling randomness using top_p
3.3 Advanced completion API options
3.3.1 Streaming completions
3.3.2 Influencing token probabilities: logit_bias
3.3.3 Presence and frequency penalties
3.3.4 Log probabilities
3.4 Chat completion API
3.4.1 System role
3.4.2 Finish reason
3.4.3 Chat completion API for nonchat scenarios
3.4.4 Managing conversation
3.4.5 Best practices for managing tokens
3.4.6 Additional LLM providers
Summary

4.1 Vision models
4.1.1 Variational autoencoders
4.1.2 Generative adversarial networks
4.1.3 Vision transformer models
4.1.4 Diffusion models
4.1.5 Multimodal models
4.2 Image generation with Stable Diffusion
4.2.1 Dependencies
4.2.2 Generating an image
4.3 Image generation with other providers
4.3.1 OpenAI DALLE 3
4.3.2 Bing image creator
4.3.3 Adobe Firefly
4.4 Editing and enhancing images using Stable Diffusion
4.4.1 Generating using image-to-image API
4.4.2 Using the masking API
4.4.3 Resize using the upscale API
4.4.4 Image generation tips
Summary

5.1 Code generation
5.1.1 Can I trust the code?
5.1.2 GitHub Copilot
5.1.3 How Copilot works
5.2 Additional code-related tasks
5.2.1 Code explanation
5.2.2 Generate tests
5.2.3 Code referencing
5.2.4 Code refactoring
5.3 Other code generation tools
5.3.1 Amazon CodeWhisperer
5.3.2 Code Llama
5.3.3 Tabnine
5.3.4 Check yourself
5.3.5 Best practices for code generation
5.4 Video generation
5.5 Audio and music generation
Summary

6.1 What is prompt engineering?
6.1.1 Why do we need prompt engineering?
6.2 The basics of prompt engineering
6.3 In-context learning and prompting
6.4 Prompt engineering techniques
6.4.1 System message
6.4.2 Zero-shot, few-shot, and many-shot learning
6.4.3 Use clear syntax
6.4.4 Making in-context learning work
6.4.5 Reasoning: Chain of Thought
6.4.6 Self-consistency sampling
6.5 Image prompting
6.6 Prompt injection
6.7 Prompt engineering challenges
6.8 Best practices
Summary

7.1 What is RAG?
7.2 RAG benefits
7.3 RAG architecture
7.4 Retriever system
7.5 Understanding vector databases
7.5.1 What is a vector index?
7.5.2 Vector search
7.6 RAG challenges
7.7 Overcoming challenges for chunking
7.7.1 Chunking strategies
7.7.2 Factors affecting chunking strategies
7.7.3 Handling unknown complexities
7.7.4 Chunking sentences
7.7.5 Chunking using natural language processing
7.8 Chunking PDFs
Summary

8.1 Advantages to enterprises using their data
8.1.1 What about large context windows?
8.1.2 Building a chat application using our data
8.2 Using a vector database
8.3 Planning for retrieving the information
8.4 Retrieving the data
8.4.1 Retriever pipeline best practices
8.5 Search using Redis
8.6 An end-to-end chat implementation powered by RAG
8.7 Using Azure OpenAI on your data
8.8 Benefits of bringing your data using RAG
Summary

9.1 What is model adaptation?
9.1.1 Basics of model adaptation
9.1.2 Advantages and challenges for enterprises
9.2 When to fine-tune an LLM
9.2.1 Key stages of fine-tuning an LLM
9.3 Fine-tuning OpenAI models
9.3.1 Preparing a dataset for fine-tuning
9.3.2 LLM evaluation
9.3.3 Fine-tuning
9.3.4 Fine-tuning training metrics
9.3.5 Fine-tuning using Azure OpenAI
9.4 Deployment of a fine-tuned model
9.4.1 Inference: Fine-tuned model
9.5 Training an LLM
9.5.1 Pretraining
9.5.2 Supervised fine-tuning
9.5.3 Reward modeling
9.5.4 Reinforcement learning
9.5.5 Direct policy optimization
9.6 Model adaptation techniques
9.6.1 Low-rank adaptation
9.7 RLHF overview
9.7.1 Challenges with RLHF
9.7.2 Scaling an RLHF implementation
Summary

10.1 Generative AI: Application architecture
10.1.1 Software 2.0
10.1.2 The era of copilots
10.2 Generative AI: Application stack
10.2.1 Integrating the GenAI stack
10.2.2 GenAI architecture principles
10.2.3 GenAI application architecture: A detailed view
10.3 Orchestration layer
10.3.1 Benefits of an orchestration framework
10.3.2 Orchestration frameworks
10.3.3 Managing operations
10.3.4 Prompt management
10.4 Grounding layer
10.4.1 Data integration and preprocessing
10.4.2 Embeddings and vector management
10.5 Model layer
10.5.1 Model ensemble architecture
10.5.2 Model serving
10.6 Response filtering
Summary

11.1 Challenges for production deployments
11.2 Deployment options
11.3 Managed LLMs via API
11.4 Best practices for production deployment
11.4.1 Metrics for LLM inference
11.4.2 Latency
11.4.3 Scalability
11.4.4 PAYGO
11.4.5 Quotas and rate limits
11.4.6 Managing quota
11.4.7 Observability
11.4.8 Security and compliance considerations
11.5 GenAI operational considerations
11.5.1 Reliability and performance considerations
11.5.2 Managed identities
11.5.3 Caching
11.6 LLMOps and MLOps
11.7 Checklist for production deployment
Summary

12.1 LLM evaluations
12.2 Traditional evaluation metrics
12.2.1 BLEU
12.2.2 ROUGE
12.2.3 BERTScore
12.2.4 An example of traditional metric evaluation
12.3 LLM task-specific benchmarks
12.3.1 G-Eval: A measuring approach for NLG evaluation
12.3.2 An example of LLM-based evaluation metrics
12.3.3 HELM
12.3.4 HEIM
12.3.5 HellaSWAG
12.3.6 Massive Multitask Language Understanding
12.3.7 Using Azure AI Studio for evaluations
12.3.8 DeepEval: An LLM evaluation framework
12.4 New evaluation benchmarks
12.4.1 SWE-bench
12.4.2 MMMU
12.4.3 MoCa
12.4.4 HaluEval
12.5 Human evaluation
Summary

13.1 GenAI risks
13.1.1 LLM limitations
13.1.2 Hallucination
13.2 Understanding GenAI attacks
13.2.1 Prompt injection
13.2.2 Insecure output handling example
13.2.3 Model denial of service
13.2.4 Data poisoning and backdoors
13.2.5 Sensitive information disclosure
13.2.6 Overreliance
13.2.7 Model theft
13.3 A responsible AI lifecycle
13.3.1 Identifying harms
13.3.2 Measure and evaluate harms
13.3.3 Mitigate harms
13.3.4 Transparency and explainability
13.4 Red-teaming
13.4.1 Red-teaming example
13.4.2 Red-teaming tools and techniques
13.5 Content safety
13.5.1 Azure Content Safety
13.5.2 Google Perspective API
13.5.3 Evaluating content filters
Summary

B.1 Model card
B.2 Transparency notes
B.3 HAX Toolkit
B.4 Responsible AI Toolbox
B.5 Learning Interpretability Tool (LIT)
B.6 AI Fairness 360
B.7 C2PA

Content preview from Generative AI in Action

12 Evaluations and benchmarks

This chapter covers

Understanding the significance of benchmarking and evaluating LLMs
Learning different evaluation metrics
Benchmarking model performance
Implementing comprehensive evaluation strategies
Best practices for evaluation benchmarks and key evaluation criteria to consider

Given the recent surge of interest in GenAI, and in large language models (LLMs) specifically, it’s crucial to approach these novel and still uncertain capabilities cautiously and responsibly. Many leaderboards and studies report that LLMs can match human performance on various tasks, such as taking standardized tests or creating art, which has sparked enthusiasm and attention. However, their novelty and uncertainties necessitate ...