AI Systems Performance Engineering (Chinese Edition)

by Chris Fregly
November 2025
Intermediate to advanced
1060 pages
14h 20m
Chinese
O'Reilly Media, Inc.

Chapter 14. PyTorch Compiler, OpenAI Triton, and XLA Backends

This work has been translated using AI. We welcome your feedback and comments: translation-feedback@oreilly.com

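In Chapter 13, we explored a variety of ways to optimize and tune PyTorch-based training and inference workloads. We briefly described how the PyTorch compiler improves performance through automated kernel fusion and other kernel-level techniques, with almost no code changes.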

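This chapter takes a deep dive into the dynamic PyTorch compilation stack. It covers components such as TorchDynamo, ahead-of-time automatic differentiation (AOT Autograd), and the PrimTorch intermediate representation (IR), also known as Prims IR, as well as compiler backends such as TorchInductor, Accelerated Linear Algebra (XLA), and the OpenAI Triton ecosystem. The PyTorch compiler stack is shown in Figure 14-1.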

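We also cover tools for debugging the compilation pipeline, along with libraries that help PyTorch scale across multi-GPU and multinode clusters. We then dissect how torch.compile works under the hood and discuss how to handle dynamic shapes and variable sequence lengths efficiently.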
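To make the dynamic-shape topic concrete, here is a minimal sketch that compiles a function with `dynamic=True` and calls it with several sequence lengths. The `counting_backend` function and the `mean_pool` example are hypothetical names introduced only for illustration. The backend records each compilation and then runs the captured graph as is, so no code generation is involved.

```python
import torch

compile_calls = []  # one entry per graph that Dynamo hands to the backend


def counting_backend(gm, example_inputs):
    # Hypothetical illustration backend: record the compilation, then
    # run the captured FX graph unchanged.
    compile_calls.append(len(example_inputs))
    return gm.forward


def mean_pool(tokens):
    # tokens has shape (batch, seq_len, hidden); average over seq_len.
    return tokens.mean(dim=1)


# dynamic=True asks the compiler to treat tensor sizes as symbolic up front.
pooled = torch.compile(mean_pool, backend=counting_backend, dynamic=True)
outputs = [pooled(torch.randn(2, seq_len, 4)) for seq_len in (8, 16, 32)]

print(f"calls: {len(outputs)}, compilations: {len(compile_calls)}")
```

With symbolic sizes, the compiler can usually reuse one graph across sequence lengths instead of recompiling for each new length, although the exact number of compilations depends on the PyTorch version and the guards it installs.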

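We also examine how the PyTorch compiler integrates with the OpenAI Triton ecosystem. Our goal is to accelerate and scale models and applications while preserving PyTorch's flexible, eager-mode development experience.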

Diagram illustrating the components and flow of the PyTorch compiler stack, including stages like TorchDynamo, AOT Autograd, and integration with backends such as Inductor and Triton.
Figure 14-1. Overview of the PyTorch compiler architecture

PyTorch Compiler Deep Dive

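As described in Chapter 13, PyTorch's torch.compile can compile your code (and models) for significant speedups. In most cases this takes a single line of code, as shown here. We discuss the configuration options in detail later: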

import torch

# `model` is any torch.nn.Module defined elsewhere
compiled_model = torch.compile(
    model,
    mode="max-autotune",  # further options are discussed later
)

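This section walks through the PyTorch compilation pipeline in detail, covering TorchDynamo's graph capture, AOT Autograd's joint optimization of the forward and backward graphs, the PrimTorch IR, and TorchInductor's code generation. The pipeline generates optimized kernels for the target GPU hardware, as shown in Figure 14-2.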

Diagram of the PyTorch compiler pipeline showing stages of graph acquisition, graph lowering, and graph compilation leading to optimized kernel generation for GPUs.
Figure 14-2. PyTorch compiler pipeline (source: https://oreil.ly/55JDn)
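One way to observe the early pipeline stages in isolation is PyTorch's built-in `"aot_eager"` debugging backend, which runs TorchDynamo graph capture and AOT Autograd but skips TorchInductor code generation. The sketch below uses a small placeholder model, chosen only for illustration, and checks that the gradients from the compiled path match plain eager execution. It runs on CPU and makes no claim about speed.

```python
import torch

torch.manual_seed(0)

# Small placeholder model, used only to exercise forward and backward passes.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)
x = torch.randn(16, 4)

# Reference gradients from plain eager execution.
model(x).sum().backward()
eager_grads = [p.grad.clone() for p in model.parameters()]
model.zero_grad()

# "aot_eager" captures the graph with Dynamo and traces forward and backward
# with AOT Autograd, then executes the traced graphs without Inductor.
compiled = torch.compile(model, backend="aot_eager")
compiled(x).sum().backward()
compiled_grads = [p.grad.clone() for p in model.parameters()]

grads_match = all(
    torch.allclose(a, b, atol=1e-6) for a, b in zip(eager_grads, compiled_grads)
)
print(grads_match)
```

Swapping the backend between `"eager"`, `"aot_eager"`, and the default `"inductor"` is a common way to narrow down which stage introduces a numerical or runtime problem.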

TorchDynamo for Bytecode Capture and Graph Extraction

TorchDynamo (Dynamo for short) is the first stage of torch.compile. It hooks into Python's frame evaluation mechanism and intercepts model execution at the bytecode level.

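By hooking into CPython's frame evaluation mechanism, Dynamo identifies the bytecode regions that produce tensors and builds a corresponding graph for each region. It then runs the compiled graphs with the selected backend, while unsupported code stays in eager mode.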
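The fallback to eager mode can be observed with a custom backend. In this minimal sketch, `recording_backend` and `branchy` are hypothetical names introduced for illustration. The backend stores every graph Dynamo hands it and runs that graph unchanged. The function contains a data-dependent Python branch, which Dynamo generally cannot trace into a single graph.

```python
import torch

captured_graphs = []  # every torch.fx.GraphModule that Dynamo produces


def recording_backend(gm, example_inputs):
    # Hypothetical illustration backend: keep the graph for inspection and
    # return a callable that simply runs it.
    captured_graphs.append(gm)
    return gm.forward


def branchy(x):
    y = x * 2
    # The branch depends on tensor data, so Python itself must evaluate it.
    if y.sum() > 0:
        return y + 1
    return y - 1


compiled = torch.compile(branchy, backend=recording_backend)
x = torch.ones(3)
result = compiled(x)

print(f"graphs captured: {len(captured_graphs)}")
```

On the PyTorch versions this sketch assumes, the branch typically splits the function into more than one captured graph, with the `if` itself evaluated by the regular interpreter. The result still matches eager execution.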

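This interception and rewriting mechanism is what lets TorchDynamo capture sequences of PyTorch operations as a graph representation, which later stages (AOT Autograd and the PrimTorch IR, described next) can then optimize.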

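TorchDynamo uses CPython's frame evaluation API (PEP 523) to capture operations safely and with very low overhead. Normally, the Python interpreter executes each operation one at a time. With Dynamo enabled, the interpreter redirects execution to Dynamo, which first aggregates tensor operations into a graph and then executes them together. This makes whole-graph optimizations such as kernel fusion possible, which reduces per-operation Python and host-side overhead. ...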
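The aggregation of individual operations into one graph can be inspected directly. The following sketch uses a hypothetical `inspect_backend` that stores the captured `torch.fx.GraphModule`, then lists the function calls recorded in it. The function `sin_plus_cos` is likewise an illustrative name. No fusion happens here, since the backend only runs the graph as captured. The point is that several tensor operations arrive at the backend as a single unit.

```python
import torch

captured = []  # graphs handed to the backend by Dynamo


def inspect_backend(gm, example_inputs):
    # Hypothetical illustration backend: store the graph, run it unchanged.
    captured.append(gm)
    return gm.forward


def sin_plus_cos(x):
    # Three tensor operations that eager mode would dispatch one by one.
    return torch.sin(x) + torch.cos(x)


compiled = torch.compile(sin_plus_cos, backend=inspect_backend)
out = compiled(torch.randn(8))

# Collect the targets of all function-call nodes in the captured graph.
op_targets = [
    node.target
    for node in captured[0].graph.nodes
    if node.op == "call_function"
]
print(op_targets)
```

A real backend such as TorchInductor receives the same kind of graph and is free to fuse these elementwise operations into fewer kernels.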

ISBN: 0642572281557