AI Systems Performance Engineering

Name: AI Systems Performance Engineering
Author: Chris Fregly
ISBN: 9798341627789

November 2025

Intermediate to advanced

1062 pages

34h 20m

English

O'Reilly Media, Inc.

Preface
Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction and AI System Overview
The AI Systems Performance Engineer; Benchmarking and Profiling; Scaling Distributed Training and Inference; Managing Resources Efficiently; Cross-Team Collaboration; Transparency and Reproducibility; DeepSeek Scales to ~680-Billion Parameter Models Despite US Export Hardware Restrictions in China; Toward 100-Trillion-Parameter Models; NVIDIA’s “AI Supercomputer in a Rack”; Mechanical Sympathy: Hardware-Software Codesign; Measuring “Goodput” Useful Throughput; Book Roadmap and Methodology; Key Takeaways; Conclusion
2. AI System Hardware Overview
The CPU and GPU Superchip; NVIDIA Grace CPU; NVIDIA Blackwell “Dual-Die” GPU; NVIDIA GPU Tensor Cores and Transformer Engine; Streaming Multiprocessor, Threads, and Warps; Ultrascale Networking Treating Many GPUs as One; NVLink and NVSwitch; Multi-GPU Programming; In-Network Aggregations with NVIDIA SHARP; Multirack and Storage Communication; Preintegrated Rack Appliance; Co-Packaged Optics: Future of Networking Hardware; Compute Density and Power Requirements; Liquid Cooling Versus Air Cooling; Performance Monitoring and Utilization in Practice; Sharing and Scheduling; ROI of Upgrading Your Hardware; A Glimpse into the Future: NVIDIA’s Roadmap; Blackwell Ultra and Grace Blackwell Ultra; Vera Rubin Superchip (2026); Rubin Ultra and Vera Rubin Ultra (2027); Feynman GPU (2028) and Doubling Something Every Year; Key Takeaways; Conclusion
3. OS, Docker, and Kubernetes Tuning for GPU-Based Environments
Operating System; NVIDIA Software Stack; GPU Driver; CUDA Toolkit and Runtime; CUDA Forward and Backward Compatibility Across GPU Hardware Generations; C++ and Python CUDA Libraries; PyTorch and Higher-Level AI Frameworks; Configuring the CPUs and OS for GPU Environments; NUMA Awareness and CPU Pinning; NUMA-Friendly Memory Allocation and Memory Pinning; Transparent Hugepages; Scheduler and Interrupt Affinity; Virtual Memory and Swapping; Filesystem Caching and Write-Back; CPU Frequency and C-states; Tune Host CPU Memory Allocator; GPU Driver and Runtime Settings for Performance; GPU Persistence Mode; MPS; MIG; GPU Clock Speeds and ECC; GPU Memory Oversubscription, Fragmentation, and Out-of-Memory Handling; Container Runtime Optimizations for GPUs; NVIDIA Container Toolkit and CUDA Compatibility; NVIDIA Container Runtime; Avoiding Container Overlay Filesystem Overhead; Reduce Image Size for Faster Container Startup; Kubernetes for Topology-Aware Container Orchestration and Networking; Orchestrating Containers with Kubernetes Topology Manager; Job Scheduling with Kubernetes and SLURM; Slicing a GPU with MIG; Optimizing Network Communication for Kubernetes; Reducing Kubernetes Orchestration Jitter; Improving Resource Guarantees; Memory Isolation and Avoiding the OOM Killer; Dealing with I/O Isolation; Key Takeaways; Conclusion
4. Tuning Distributed Networking Communication
Overlapping Communication and Computation (Pipelining); Asynchronous Execution with Streams; Reducing Communication Frequency and Volume; Achieving Maximal Overlap in Practice; NVIDIA Magnum IO Optimization Stack; High-Speed, Low-Overhead Data Transfers with RDMA; Tuning Multinode Connectivity; Multinode Communication Pitfalls; NCCL for Distributed Multi-GPU Communication; Topology Awareness in NCCL; NCCL Communication Algorithms; Distributed Data Parallel Strategies; NCCL Communicator Lifecycle and Environment Gotchas; Profiling and Debugging NCCL; In-Network SHARP Aggregation; Persistent NCCL User Buffers and Zero-Copy Registration; NVIDIA’s NIXL and Disaggregated Inference; Separate Prefill and Decode Inference Stages; Intelligent Interconnect Routing for KV Cache Transfers; NIXL Asynchronous API with Callbacks; KV Cache Offloading with NIXL; NIXL and High-Performance Inference Systems Like NVIDIA Dynamo; NCCL Versus NIXL; Key Takeaways; Conclusion
5. GPU-Based Storage I/O Optimizations
Fast Storage and Data Locality; Sequential Versus Random Read Patterns; Tuning NVMe and Filesystem for Throughput; Using NVIDIA GDS; Checkpointing GPU State with cuda-checkpoint; Measuring GDS with gdsio; DeepSeek’s Fire-Flyer File System; Distributed, Parallel Filesystems and Object Stores; Tuning, Replicating, and Compressing Data; Monitoring Storage I/O; Tuning the Data Pipeline; Efficient Data Loading and Preprocessing; Scaling Out Workers as You Scale Out Number of GPUs; Multimodal Data Processing with NVIDIA DALI; Creating High-Quality LLM Datasets with NVIDIA NeMo Curator; Continuous Profiling and Tuning Workflow; Diagnosing Communication- Versus Compute-Bound Workloads; Key Takeaways; Conclusion
6. GPU Architecture, CUDA Programming, and Maximizing Occupancy
Understanding GPU Architecture; Threads, Warps, Blocks, and Grids; Choosing Threads-per-Block and Blocks-per-Grid Sizes; CUDA GPU Backward and Forward Compatibility Model; CUDA Programming Refresher; Configuring Launch Parameters: Blocks per Grid and Threads per Block; 2D and 3D Kernel Inputs; Asynchronous Memory Allocation and Memory Pools; Understanding GPU Memory Hierarchy; Unified Memory; Maintaining High Occupancy and GPU Utilization; Tuning Occupancy with Launch Bounds; Debugging Functional Correctness with NVIDIA Compute Sanitizer; Roofline Model: Compute-Bound or Memory-Bound Workloads; Key Takeaways; Conclusion
7. Profiling and Tuning GPU Memory Access Patterns
Coalesced Versus Uncoalesced Global Memory Access; Vectorized Memory Access; Tiling and Data Reuse Using Shared Memory; Avoid Shared-Memory Bank Conflicts; Warp Shuffle Intrinsics: Avoid Shared Memory and Explicit Synchronization; Read-Only Data Caches; Asynchronous Memory Prefetching and Tensor Memory Accelerator; Key Takeaways; Conclusion
8. Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism
Profiling and Diagnosing GPU Bottlenecks; Nsight Systems Timeline View; Profiling and Tuning the Data Pipeline; Nsight Compute and Roofline Analysis; PyTorch Profiler and Visualization Tools; Profiler-Guided Analysis; Analyzing Warp Stall Reasons with Nsight Compute; Memory-Related Stalls; Execution-Dependency Stalls; Execution Unit Contention; Other Stall Reasons; Inspecting Achieved Occupancy and GPU Utilization; Kernel Memory Throughput Versus Peak HBM Memory Bandwidth; Kernel Compute Throughput Versus Peak GPU FLOPS; Iteratively Profiling and Determining the Kernel Bottleneck; Optimizing the Kernel; Tuning Occupancy; Find the Right Occupancy for Your Workload; Techniques for Occupancy Tuning; Compiler Hints to Optimize Occupancy; Determine Optimal Launch Configuration with the Occupancy API; Tuning Occupancy with PyTorch; Improving Warp Execution Efficiency (Warp Divergence); Causes of Warp Divergence; Techniques to Avoid Warp Divergence; Profiling and Detecting Warp Divergence; Using Predication to Minimize Divergence; Efficient Intrawarp Communication with Warp Intrinsics; PyTorch Considerations for Warp-Level Efficiency; Exposing Instruction-Level Parallelism; Warp Scheduling and Dual Issue Instructions; ILP and Occupancy; Loop Unrolling, Interleaving, and Compiler Hinting; Profiling and Mitigating Register Pressure; Key Takeaways; Conclusion
9. Increasing CUDA Kernel Efficiency and Arithmetic Intensity
Multilevel Microtiling and Software Prefetching; Tiling with Thread Block Clusters; Kernel Fusion; Structured Sparsity; Recomputation Versus Memory Trade-Off; PyTorch and Arithmetic Intensity; Mixed Precision and Utilizing Tensor Cores; Feeding Tensor Cores with TMEM and TMA; TF32 and Automatic Mixed Precision (PyTorch); BF16/FP16, FP8, and FP4 Reduced Precision; INT8 Reduced Precision and DP4A Instructions for Inference; Transformer Engine and TMEM in Depth; Using CUTLASS for Optimal Arithmetic Intensity and Tensor Core Performance; Inline PTX and SASS Tuning for Microoptimizations; DeepSeek’s Use of Inline PTX for Memory Allocation Optimization; Key Takeaways; Conclusion
10. Intra-Kernel Pipelining, Warp Specialization, and Cooperative Thread Block Clusters
Intra-Kernel Pipelining Techniques; Cooperative Tiling and Double-Buffering with the CUDA Pipeline API; Warp Specialization and the Producer-Consumer Model; Using CUDA Pipeline API for Warp Specialization; PyTorch, CUDA Pipeline API, and Warp Specialization; Persistent Kernels and Megakernels; Common Workloads for Persistent Kernels; Megakernels for Inference; Persistent Kernels and Warp Specialization; Cooperative Groups; Cooperative Grid Synchronization and Persistent Kernels; When to Combine Persistent Kernels and Cooperative Groups; Thread Block Clusters and Distributed Shared Memory; Thread Block Swizzling; Distributed Shared Memory; Scratch Memory; Launching a Thread Block Cluster; Coordinating Thread Block Clusters with Cooperative Groups API; Thread Block Pair; Reducing Global Memory Traffic with Thread Block Clusters; Designing Efficient Algorithms with Thread Block Clusters; Warp Specialization with Thread Block Clusters; Key Takeaways; Conclusion
11. Inter-Kernel Pipelining, Synchronization, and CUDA Stream-Ordered Memory Allocations
Overlapping Kernel Execution with CUDA Streams; Using Streams to Overlap Compute with Data Transfers; Stream-Ordered Memory Allocator; Using CUDA Streams and Stream-Ordered Memory Allocator with LLMs; Legacy Default Stream; Modern Per-Thread Default Stream; Default Versus Explicit (Nondefault) Streams; Best Practices for Default Stream Usage; Fine-Grained Synchronization with Events and Callbacks; Using CUDA Events for Cross-Stream Synchronization; Pipelining with Warp Specialization (Intra-Kernel) and CUDA Streams (Inter-Kernel); Warp Specialization with Thread Block Clusters and CUDA Streams; Multi-GPU Compute and Data Transfer Overlap with CUDA Streams; Programmatic Dependent Launch; Combining PDL and Thread Block Clusters with Warp Specialization; Key Takeaways; Conclusion
12. Dynamic Scheduling, CUDA Graphs, and Device-Initiated Kernel Orchestration
Dynamic Scheduling with Atomic Work Queues; Atomic Counters; Atomic Queues; CUDA Graphs; PyTorch, Inference Engines, and CUDA Graphs; Memory Pools for CUDA Graphs; Capturing a CUDA Graph with a CUDA Stream; Dynamic Graph Update; Device-Initiated CUDA Graph Launch; Atomic Queues and Device-Initiated CUDA Graphs for In-Kernel Persistent Scheduling; Conditional Graph Nodes; Dynamic Parallelism; Orchestrate Across Multiple GPUs and Cluster Nodes (NVSHMEM); Fine-Grained GPU-to-GPU Memory Sharing with NVSHMEM; Capturing Multi-GPU Collectives with NCCL and CUDA Graphs; Pattern for N-GPU Scaling; Roofline-Guided Scheduling and Orchestration Decisions; Key Takeaways; Conclusion
13. Profiling, Tuning, and Scaling PyTorch
NVTX Markers and Profiling Tools; Profiling PyTorch to Identify Bottlenecks; Using PyTorch Profiler; System Profiling with Nsight Systems and NVTX Timelines; Kernel Roofline Analysis for General Matrix Multiply (GEMM); CPU and GPU Profiling with Linux perf; PyTorch Compiler (torch.compile); Using the PyTorch Compiler; Compiling Versus Writing Custom Kernels; Compilation Modes and Trade-Offs in Speed, Memory, and Compile Time; Regional Compilation; Profiling and Debugging Compiler Performance Issues; PyTorch Optimized Attention Mechanisms; PyTorch Architecture Optimization (torchao), Quantization, Sparsity, and Pruning; Concurrency with CUDA Streams; Overlapping Communication and Computation; Stream Synchronization with Events; Using CUDA Streams with MoE Models; Reducing Kernel Launch Overhead with CUDA Graphs; Capturing a CUDA Graph and Preallocating Memory; Replaying the Graph; Best Practices for CUDA Graphs; CUDA Graph Trees (PyTorch Compiler Internal); Profiling and Tuning Memory in PyTorch; Tuning the CUDA Memory Allocator; Activation Checkpointing for Memory Savings; Offloading Parameters to CPU and NVMe; SuperOffload: Optimized CPU-GPU Superchip Offload; FSDP Automatic Checkpointing and Offloading; Combining FSDP with Tensor Parallel and Pipeline Parallel; Pluggable Memory Allocators and Cross-GPU Data Transfers; Enabling Peer-to-Peer DMA and UCX; PyTorch Symmetric Memory; Optimizing the Data Input Pipeline; Scaling with PyTorch Distributed; DDP with torch.compile; FSDP with torch.compile; Tensor and Pipeline Parallelism with torch.compile; TorchTitan, AsyncTP, AutoParallel, and SimpleFSDP; Multi-GPU Profiling with HTA; Continuous Integration and Performance Benchmarking; PyTorch HUD Performance Dashboard; Performance Benchmarks and MLPerf Logging; Key Takeaways; Conclusion
14. PyTorch Compiler, OpenAI Triton, and XLA Backends
PyTorch Compiler Deep Dive; TorchDynamo for Bytecode Capture and Graph Extraction; AOT Autograd Fusion for Forward and Backward Passes; PrimTorch IR (Prims) Simplified Operator Set; TorchInductor Backend Code Generation; Autotuning with TorchInductor; Dynamic Shapes and Variable Sequence Lengths; Disabling the PyTorch Compiler and Reverting Back to Eager Mode; Performance Hints and Debugging Generated Code; Debugging Numerical Correctness and Accuracy; Explaining and Minimizing Graph Breaks; Graph Breaks and TorchDynamo explain(); Minimize Graph Recompilations; Mark Functions and Code Blocks as Safe with allow_in_graph; Tips for Handling Graph Breaks; Debugging Compiler Phases, Graph Breaks, and Performance; Writing Custom Kernels with OpenAI Triton; Triton Programming Model; Accessing Shared Memory in Triton; Registering Custom Kernels with PyTorch; Tuning Kernel-Launch Parameters; Autotuning Triton Kernels; Advanced Triton Kernel Implementations; Warp Specialization with Triton; Tiled and Persistent GEMM Kernel (Triton); Software Pipelining and Double Buffering with Triton; Profiling with Triton Proton Profiler; PyTorch XLA Backend; Key Takeaways; Conclusion
15. Multinode Inference, Parallelism, Decoding, and Routing Optimizations
Disaggregated Prefill and Decode Architecture; Prefill-Decode Interference; Scaling Prefill and Worker Nodes Independently; Impact on Latency (TTFT) and Throughput (TPOT); KV Cache Data Transfer and NIXL; Deploying Disaggregated Prefill and Decode with Kubernetes; Parallelism Strategies for Serving Massive MoE Models; Tensor Parallelism; Pipeline Parallelism; Expert Parallelism; Data Parallelism; Context (Sequence) Parallelism; Hybrid Parallelism; Speculative Decoding and Parallel Token Generation Techniques; Two-Model, Draft-Based Speculative Decoding and EAGLE; Single-Model Self-Speculative Decoding; Multitoken Decoding with Medusa’s Multiple Heads; Interleaving Decode Steps from Multiple Requests; Combining Decoding Techniques and Evaluating Complexity; Constrained Decoding Performance Implications; Dynamic Routing Strategies for MoE Inference; Expert Communication Optimization; Load Balancing, Capacity Factor, and Expert Replication; Adaptive Expert Routing and Real-Time Monitoring; Key Takeaways; Conclusion
16. Profiling, Debugging, and Tuning Inference at Scale
Profiling, Debugging, and Tuning Inference Performance; Monitoring System Metrics and Counters; Profiling with Nsight Systems and Nsight Compute; Inference Troubleshooting Recipes; Full-Stack Inference Optimizations; Debugging Correctness Issues; Dynamic Batching, Scheduling, and Routing; Dynamic Batching; Continuous Batching; Continuous Scheduling; Stall-Free Scheduling (Chunked Prefill); Latency-Aware Scheduling and Dynamic Routing; Systems-Level Optimizations; Overlapping Communication and Computation; Maximizing GPU Utilization and Throughput Versus Latency Trade-Offs; Power and Thermal Constraints; Error Handling; Memory; KV Cache Offloading and Memory Pool Allocation; Quantization Approaches for Real-Time Inference; Reducing Precision from FP16 to FP8 and FP4; Weight-Only Quantization (GPTQ, AWQ); Activation Quantization; Post-Training Quantization Workflow; Combining Weight and Activation Quantization; Fusing Quantization-Dequantization Steps into the Execution Graph; Application-Level Optimizations; Prompt Compression; Prompt Cleansing; Prefix Caching; Model Cascading and Tiered Model Deployment; Streaming Responses; Debouncing and Request Coalescing; Token Output Limits and Timeouts; Key Takeaways; Conclusion
17. Scaling Disaggregated Prefill and Decode for Inference
Why Prefill-Decode Disaggregation?; Advantages of Disaggregation; Disaggregated Prefill and Decode Cluster Pools; Disaggregated Routing and Scheduling Policies; Scalability of Disaggregated Prefill and Decode; Key Takeaways; Conclusion
18. Advanced Prefill-Decode and KV Cache Tuning
Optimized Decode Kernels; FlashMLA (DeepSeek); ThunderMLA (Stanford); FlexDecoding (PyTorch); Tuning KV Cache Utilization and Management; Disaggregated KV Cache Pool; KV Cache Reuse and Prefix Sharing; Optimized KV Cache Memory Layout; GPU and CPU-GPU Superchip Improvements; Fast KV Cache Transfer Between Prefill and Decode; KV Cache Size; Zero-Copy GPU-to-GPU Transfer; Connector and Data Path Design; Heterogeneous Hardware and Parallelism Strategies for Prefill and Decode; Compute-Optimized Versus Memory-Optimized Hardware; Hybrid Prefill with GPU-CPU Collaboration; SLO-Aware Request Management and Fault Tolerance; Early Rejection (Admission Control); Quality of Service; Fault Tolerance; Dynamic Scheduling and Load Balancing; Adaptive Resource Scheduling and Hotspot Prevention; Key Takeaways; Conclusion
19. Dynamic and Adaptive Inference Engine Optimizations
Adaptive Parallelism Strategies (TP Versus PP Versus Hybrid); Dynamic Precision Changes; Kernel Autotuning for Transformer Self-Attention and MLP Paths; Dynamic Shared-Memory Allocation and Occupancy-Aware Kernel Selection; Speculative KV Prefetching for Faster TTFT; Real-Time KV Cache Compression and Policy Switching; Reinforcement Learning Agents for Tuning AI Systems at Runtime; Dynamic Memory-Allocation Switching (Slab Versus Caching Versus Stream-Ordered); Runtime Kernel Performance Improvements and Hot-Swappable Implementations; Continuous Prewarming of CUDA Graphs and Caches Using Time-Series Prediction; Adaptive Batching and Chunked Prefill Scheduling; Congestion-Aware and Topology-Aware Scheduling with Multiple GPUs; NVLink/NVSwitch Topology and Bandwidth Constraints; Real-Time Link Telemetry and Monitoring; Adaptive Process-GPU Mapping; Optimizing Collective Communication with NCCL; Multinode and Multirack Communication with GPUDirect RDMA; MoE Expert Rebalancing and Regrouping; Dynamic Congestion-Aware Scheduling; Coordinating NVSwitch Transfers with Fine-Tuned Scheduling; Additional Adaptive and Dynamic Optimization Techniques; Dynamic Early-Exit Networks; Input-Aware Layer Skipping (DASH); Speculative MoE Expert Routing and Communication Reduction; Dynamic Token Pruning with LazyLLM; Edge-Oriented MoE Memory Budgeting; Dynamic Quantization and Activation Range Adjustment; Key Takeaways; Conclusion
20. AI-Assisted Performance Optimizations and Scaling Toward Multimillion GPU Clusters
AlphaTensor AI-Discovered Algorithms Boosting GPU Performance (Google DeepMind); Automated GPU Kernel Optimizations with DeepSeek-R1 (NVIDIA); Reinforcement Learning Approach to Generating Optimized GPU Kernels (Predibase); Self-Improving AI Agents (AI Futures Project); Smart Compilers and Automated Code Optimizations; AI-Assisted Real-Time System Optimizations and Cluster Operations; Scaling Toward Multimillion GPU Clusters and 100-Trillion-Parameter Models; Key Takeaways; Conclusion
Appendix. AI Systems Performance Checklist (175+ Items)
Performance Tuning and Cost Optimization Mindset; Reproducibility and Documentation Best Practices; System Architecture and Hardware Planning; Unified CPU-GPU “Superchip” Architecture; Multi-GPU Scaling and Interconnect Optimizations; Operating System and Driver Optimizations; GPU Resource Management and Scheduling; I/O Optimization; Data Processing Pipelines; Performance Profiling, Debugging, and Monitoring; GPU Programming and CUDA Tuning Optimizations; Kernel Scheduling and Execution Optimizations; Arithmetic Optimizations and Reduced/Mixed Precision; Advanced Tuning Strategies and Algorithmic Tricks; Distributed Training and Network Optimization; Efficient Inference and Serving; Multinode Inference and Serving; Power and Thermal Management; Conclusion
Index
About the Author

Content preview from AI Systems Performance Engineering

Chapter 9. Increasing CUDA Kernel Efficiency and Arithmetic Intensity

Even if you fully hide latency with massive parallelism and high ILP, a kernel’s performance may still be limited by how much useful work it does per memory access. Arithmetic intensity, also called operational intensity, measures how many floating-point operations (FLOPs) are performed per byte of data transferred to or from memory, expressed in FLOPs per byte.
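To make the definition concrete, here is a minimal Python sketch that estimates arithmetic intensity for two familiar operations. The function names are hypothetical, and the byte counts assume an idealized model in which every input and output element crosses the memory bus exactly once (no cache reuse, no redundant loads), so real kernels will differ.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to or from memory."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def vector_add_intensity(n: int, bytes_per_elem: int = 4) -> float:
    # c[i] = a[i] + b[i]: one add per element, two loads and one store.
    return arithmetic_intensity(n, 3 * n * bytes_per_elem)


def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    # C = A @ B for n x n matrices: 2 * n^3 FLOPs (one multiply, one add).
    # Assumed idealized traffic: A and B read once, C written once.
    return arithmetic_intensity(2 * n**3, 3 * n * n * bytes_per_elem)


if __name__ == "__main__":
    print(f"vector add (FP32): {vector_add_intensity(1 << 20):.4f} FLOPs/byte")
    print(f"matmul 4096 (FP32): {matmul_intensity(4096):.2f} FLOPs/byte")
```

Under these assumptions, vector addition stays at 1/12 FLOPs per byte regardless of size, while the matrix multiply's intensity grows linearly with `n` (it works out to `n / 6` for FP32), which is why the two operations behave so differently on the same hardware.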

Newer GPU generations are advancing compute throughput faster than memory bandwidth. This widening gap makes increasing arithmetic intensity more critical than ever. A kernel with higher arithmetic intensity does more computation for each byte fetched, which is essential for fully utilizing the GPU’s computational capabilities.

Arithmetic intensity is a key metric in the Roofline performance model, a visual tool that plots kernel performance (FLOPs/sec) against arithmetic intensity (FLOPs/byte). It shows hardware ceilings (roofs) for memory bandwidth and compute throughput, allowing us to see whether a kernel is memory bound (performance limited by memory transfers) or compute bound (performance limited by ALU throughput).
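The two roofs can be expressed in a few lines of Python. This is a sketch of the basic roofline formula only: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The function names are hypothetical, and the peak values in the example are round illustrative numbers, not the specifications of any real GPU.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the two roofs intersect."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float,
                     peak_bandwidth: float) -> float:
    """Upper bound on FLOPs/sec for a kernel with the given intensity."""
    # Memory roof: intensity * bandwidth. Compute roof: peak_flops.
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity: float, peak_flops: float,
             peak_bandwidth: float) -> str:
    """Label a kernel by which roof limits it."""
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory bound"
    return "compute bound"


if __name__ == "__main__":
    # Illustrative round numbers (assumptions, not real hardware specs).
    PEAK_FLOPS = 100e12      # 100 TFLOPs/sec
    PEAK_BANDWIDTH = 2e12    # 2 TB/sec

    for ai in (0.25, 10.0, 200.0):
        bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
        label = classify(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
        print(f"AI={ai:>6} FLOPs/byte -> {bound / 1e12:.1f} TFLOPs/sec ({label})")
```

With these example numbers the ridge point sits at 50 FLOPs per byte. Kernels to the left of it are capped by the sloped memory roof, and kernels to the right are capped by the flat compute roof.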

In practice, you can generate roofline charts using tools like Nsight Compute, which includes a Roofline analysis view. Using these tools, you can determine whether your kernel is initially memory bound or compute bound, then continue to profile and verify improvements as you make optimizations.

The goal is to push the kernel ...

Publisher Resources

ISBN: 9798341627772
