人工智能系统性能工程 (Chinese Edition)

by Chris Fregly
November 2025
Intermediate to advanced
1060 pages
14h 20m
Chinese
O'Reilly Media, Inc.

Chapter 3. OS, Docker, and Kubernetes Tuning for GPU Environments

This work has been translated using AI. We welcome your feedback and comments: translation-feedback@oreilly.com

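Even with highly optimized GPU code and libraries, system-level bottlenecks can still limit the performance of large-scale AI training. The fastest GPU is only as good as the environment that feeds it data and instructions. This chapter explores how to tune the operating system and container runtime so that GPUs can reach their full potential.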

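We start by examining the basic GPU software stack, then turn to key CPU and memory optimizations, such as NUMA affinity and huge pages, that keep data flowing efficiently from storage through the CPU to the GPU. We also cover key GPU driver settings, namely persistence mode, Multi-Process Service (MPS), and Multi-Instance GPU (MIG) partitioning, which maximize GPU utilization by reducing overhead and coordinating how the GPU is shared.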

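With the NVIDIA Container Toolkit, the container runtime, the Kubernetes Topology Manager, and the NVIDIA GPU Operator for Kubernetes, you can build a unified, highly optimized software stack for GPU environments. These tools enable efficient resource allocation and workload scheduling in both single-node and multinode GPU environments, so that GPU capacity is fully used.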

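Along the way, you will see why these optimizations matter: they reduce latency, raise throughput, and keep GPUs fed with data and running at peak performance. The result is a robust, scalable system that delivers significant performance gains and high goodput for both training and inference.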

Operating System

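The operating system (OS) is the foundation on which everything else runs. GPU servers typically run a Linux distribution, such as Ubuntu Server LTS or Red Hat, with an up-to-date kernel that supports the latest GPU hardware. The NVIDIA driver installs kernel modules and creates device files such as /dev/nvidia0, /dev/nvidia1, and /dev/nvidia2, one per GPU. The driver also creates /dev/nvidiactl for control operations, /dev/nvidia-uvm for unified virtual memory, and /dev/nvidia-modeset for mode setting and buffer management.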
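
The device-file layout described above can be inspected with a few lines of Python. This is a minimal sketch, not part of the book's text: the function name classify_nvidia_nodes and the returned dictionary keys are hypothetical, and it assumes that per-GPU nodes are exactly the paths of the form /dev/nvidia followed by an integer index.

```python
import glob
import re


def classify_nvidia_nodes(paths):
    """Split NVIDIA device paths into per-GPU indices and all other nodes.

    Assumption: a per-GPU node is named /dev/nvidia<N> with N an integer.
    Everything else (nvidiactl, nvidia-uvm, nvidia-modeset, ...) is "other".
    """
    gpu_indices, other_nodes = [], []
    for path in paths:
        match = re.fullmatch(r"/dev/nvidia(\d+)", path)
        if match:
            gpu_indices.append(int(match.group(1)))
        else:
            other_nodes.append(path)
    return {"gpu_indices": sorted(gpu_indices), "other_nodes": sorted(other_nodes)}


if __name__ == "__main__":
    # On a host without the NVIDIA driver loaded, both lists are simply empty.
    print(classify_nvidia_nodes(glob.glob("/dev/nvidia*")))
```

If the number of GPU indices found is lower than the number of GPUs installed in the server, the driver may not be fully loaded, which is worth checking before looking for performance problems higher in the stack.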

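The OS manages CPU scheduling, memory, networking, and storage, all of which should be tuned for high GPU throughput. The OS configuration should therefore avoid interfering with GPU work. For example, GPU nodes should disable swap or set vm.swappiness to 0 so that OS-initiated swapping does not disturb GPU workloads. As performance engineers, part of our job is to tune these OS settings so that the GPUs perform at their best.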
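
The swap guidance above can be checked on a Linux node by reading /proc/sys/vm/swappiness and /proc/swaps. The following is a rough sketch under stated assumptions: the helper names parse_swaps and swap_may_interfere are hypothetical, and the check simply encodes the rule from the text (swap disabled, or vm.swappiness set to 0).

```python
from pathlib import Path


def parse_swaps(text):
    """Return the swap devices listed in /proc/swaps-style text.

    The first line of /proc/swaps is a column header; each following line
    starts with the path of an active swap device or file.
    """
    lines = text.strip().splitlines()[1:]
    return [line.split()[0] for line in lines if line.strip()]


def swap_may_interfere(swappiness, swap_devices):
    """True when swap is active and vm.swappiness is not 0."""
    return bool(swap_devices) and swappiness != 0


if __name__ == "__main__":
    swappiness_file = Path("/proc/sys/vm/swappiness")
    swaps_file = Path("/proc/swaps")
    if swappiness_file.exists() and swaps_file.exists():
        swappiness = int(swappiness_file.read_text().strip())
        devices = parse_swaps(swaps_file.read_text())
        print(f"vm.swappiness={swappiness}, active swap: {devices or 'none'}")
        if swap_may_interfere(swappiness, devices):
            print("Swap is active and vm.swappiness is nonzero.")
    else:
        print("This check needs the Linux /proc filesystem.")
```

The script only reports the current state. Changing it is typically done with swapoff or sysctl by an administrator, and how to make the change persistent depends on the distribution.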

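GPU-focused servers may need to run additional daemons or background processes, such as the NVIDIA Persistence Daemon, which keeps the GPU driver and hardware context loaded and ready even when no GPU jobs are running. In addition, Fabric Manager manages the GPU interconnect topology, and NVIDIA Data Center GPU Manager (DCGM) monitors GPU health metrics.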
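
One way to confirm that these background services are running is to scan the process table. The sketch below is illustrative only. The executable names in EXPECTED_DAEMONS (nvidia-persistenced, nv-fabricmanager, nv-hostengine) are assumptions about a typical installation and should be verified on your own systems, and the helper names are hypothetical.

```python
import os

# Assumed executable names for each service; verify against your installation.
EXPECTED_DAEMONS = {
    "nvidia-persistenced": "NVIDIA Persistence Daemon",
    "nv-fabricmanager": "Fabric Manager",
    "nv-hostengine": "DCGM host engine",
}


def executable_names(cmdlines):
    """Return the basename of argv[0] for each NUL-separated command line."""
    names = set()
    for raw in cmdlines:
        argv0 = raw.split("\0", 1)[0]
        if argv0:
            names.add(os.path.basename(argv0))
    return names


def missing_daemons(cmdlines, expected=EXPECTED_DAEMONS):
    """Return labels of expected daemons not found among the command lines."""
    running = executable_names(cmdlines)
    return sorted(label for exe, label in expected.items() if exe not in running)


def read_cmdlines(proc_root="/proc"):
    """Yield the command line of every process visible under /proc (Linux)."""
    if not os.path.isdir(proc_root):
        return
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            try:
                with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                    yield fh.read().decode(errors="replace")
            except OSError:
                continue  # the process exited or is not readable


if __name__ == "__main__":
    print("Not running:", missing_daemons(read_cmdlines()) or "none")
```

A missing entry is not necessarily a problem. Which of these services a node needs depends on its hardware and on how it is monitored, so treat the output as a prompt for investigation.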

NVIDIA Software Stack

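Running a multi-petaFLOP GPU cluster involves more than writing high-level PyTorch, TensorFlow, or JAX code. Every layer of the software stack that supports GPU computation can affect performance. Figure 3-1 shows the common frameworks, libraries, compilers, runtimes, and tools used to develop and productionize modern LLM workloads, including PyTorch, cuDNN, cuBLAS, CUTLASS, CUDA C++, nvcc, and the CUDA runtime API (CUDA tools, the driver, and so on).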

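In addition, NVIDIA GPUs and the CUDA ecosystem support Python libraries, so CUDA kernels can be written in Python using tools such as OpenAI's Triton domain-specific language (DSL), NVIDIA's Warp framework, and the CUDA Python, cuTile, and CUTLASS libraries.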

Figure 3-1. ...

Publisher Resources

ISBN: 0642572281557