人工智能系统性能工程 (Chinese Edition)

by Chris Fregly
November 2025
Intermediate to advanced
1060 pages
14h 20m
Chinese
O'Reilly Media, Inc.

Chapter 20. AI-Assisted Performance Optimization and Scaling Toward Multimillion-GPU Clusters

This work was translated using AI. We welcome your feedback and comments: translation-feedback@oreilly.com

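This chapter brings together several case studies and future trends that show how humans and AI can work together to optimize the performance of AI systems. In particular, AI can help tune low-level GPU code and generate kernels that run more efficiently than hand-written ones.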

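More broadly, these case studies show that even for core operations such as matrix multiplication, algorithmic innovation can deliver performance gains comparable to a hardware upgrade. One example is a reinforcement learning workflow driven by iterative reward feedback (shown in Figure 20-1), which can help a system find the best GPU kernel code for a given environment.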

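These AI-assisted methods improve performance, shorten training time, and reduce operating costs. They also make it possible to deploy large models efficiently on smaller systems, which helps move AI forward. In other words, this is AI helping to build better AI, and we very much welcome it!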

Diagram illustrating a reinforcement learning process that uses interleaved code execution to optimize GPU kernel code, involving a policy model, code sandbox, and a flow of advantage and reward feedback.
Figure 20-1. Using reinforcement learning to find the optimal GPU kernel code for a specific environment
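The reward-feedback loop in Figure 20-1 can be illustrated with a much simpler stand-in. The sketch below is not the policy-model method in the figure: it swaps in an epsilon-greedy bandit, and the search space (`CANDIDATE_TILES`), the sandbox function (`run_in_sandbox`), and its cost model are all hypothetical, invented here so the example runs without a GPU. It only shows the shape of the loop: propose a candidate, run it in a sandbox, turn the measured runtime into a reward, and use that reward to guide the next proposal.

```python
import random

# Hypothetical search space: candidate tile sizes for a GPU kernel.
CANDIDATE_TILES = [8, 16, 32, 64, 128]


def run_in_sandbox(tile: int) -> float:
    """Stand-in for compiling and timing a kernel; returns simulated milliseconds.

    The cost model is invented for illustration and is lowest at tile=32.
    """
    return 1.0 + abs(tile - 32) / 32.0


def search_best_tile(steps: int = 200, epsilon: float = 0.2, seed: int = 0) -> int:
    """Epsilon-greedy search that uses negative runtime as the reward."""
    rng = random.Random(seed)
    totals = {t: 0.0 for t in CANDIDATE_TILES}
    counts = {t: 0 for t in CANDIDATE_TILES}

    def mean_reward(tile: int) -> float:
        # Candidates that were never tried rank last.
        if counts[tile] == 0:
            return float("-inf")
        return totals[tile] / counts[tile]

    for step in range(steps):
        if step < len(CANDIDATE_TILES):
            tile = CANDIDATE_TILES[step]                   # try each candidate once
        elif rng.random() < epsilon:
            tile = rng.choice(CANDIDATE_TILES)             # explore
        else:
            tile = max(CANDIDATE_TILES, key=mean_reward)   # exploit
        reward = -run_in_sandbox(tile)                     # faster kernel, higher reward
        totals[tile] += reward
        counts[tile] += 1

    return max(CANDIDATE_TILES, key=mean_reward)


best_tile = search_best_tile()
print(f"best tile size under the simulated cost model: {best_tile}")
```

In a real setup, `run_in_sandbox` would compile and time generated kernel code and would also need to check that the output is numerically correct before any reward is given.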

AlphaTensor: AI-Discovered Algorithms That Boost GPU Performance (Google DeepMind)

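Not all AI-driven optimization happens at the code level. Sometimes it reaches down into algorithms and mathematics. DeepMind's AlphaTensor project, introduced in 2022, is a groundbreaking example: it used AI to discover new general matrix multiply (GEMM) algorithms.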

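GEMM is the core operation behind almost every model training and inference workload, so even a small efficiency gain has far-reaching effects across AI. AlphaTensor used reinforcement learning to turn the search for fast algorithms into a single-player game and then searched through an enormous space of possibilities.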
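As background that this text does not spell out, the AlphaTensor work frames the game as decomposing the matrix multiplication tensor into as few rank-1 terms as possible, where each term costs one scalar multiplication. The sketch below, with helper names chosen here for illustration, builds that tensor for 2×2 matrices and checks that the schoolbook algorithm is a valid decomposition with eight terms. A search like AlphaTensor's looks for valid decompositions with fewer terms.

```python
from itertools import product

N = 2  # matrix dimension


def flat(row: int, col: int) -> int:
    """Flatten a (row, col) position into an index 0..N*N-1."""
    return N * row + col


def matmul_tensor():
    """T[a][b][c] = 1 when entry a of A times entry b of B contributes to entry c of C."""
    size = N * N
    tensor = [[[0] * size for _ in range(size)] for _ in range(size)]
    for i, j, k in product(range(N), repeat=3):
        tensor[flat(i, k)][flat(k, j)][flat(i, j)] = 1
    return tensor


def reconstruct(terms):
    """Sum the rank-1 tensors u (x) v (x) w over all terms."""
    size = N * N
    result = [[[0] * size for _ in range(size)] for _ in range(size)]
    for u, v, w in terms:
        for a, b, c in product(range(size), repeat=3):
            result[a][b][c] += u[a] * v[b] * w[c]
    return result


def one_hot(index: int):
    return [1 if p == index else 0 for p in range(N * N)]


# Schoolbook algorithm: one rank-1 term per scalar product A[i][k] * B[k][j].
naive_terms = [
    (one_hot(flat(i, k)), one_hot(flat(k, j)), one_hot(flat(i, j)))
    for i, j, k in product(range(N), repeat=3)
]

is_valid = reconstruct(naive_terms) == matmul_tensor()
num_multiplications = len(naive_terms)
print(is_valid, num_multiplications)
```

Dropping any term from `naive_terms` makes the check fail, which is why finding a shorter valid list requires genuinely different terms rather than pruning.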

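The striking result was that, for some matrix sizes, it discovered matrix multiplication formulas that beat the best human-designed algorithms known at the time. For example, it not only rediscovered Strassen's famous subcubic algorithm for 2×2 matrices (shown in Figure 20-2) but also found improved algorithms for larger matrix sizes.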
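For reference, Strassen's 2×2 scheme can be written out directly. The following is a minimal sketch using plain Python lists (the function names are chosen here for illustration). It computes the product with seven scalar multiplications instead of the schoolbook eight and compares the result against the naive method.

```python
def naive_2x2(a, b):
    """Schoolbook 2x2 product: 8 scalar multiplications."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def strassen_2x2(a, b):
    """Strassen's 2x2 product: 7 scalar multiplications, more additions."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b

    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))  # [[19, 22], [43, 50]]
```

Applied recursively to matrix blocks rather than scalars, saving one multiplication in eight is what lowers the asymptotic cost below cubic.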

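The real validation came from tests on actual hardware. The methods that AlphaTensor discovered specifically for the NVIDIA Volta V100 GPU ran large matrix multiplications 10% to 20% faster than the standard cuBLAS library of the time. A 10% to 20% gain in GEMM performance is significant: it is like getting an extra 10% to 20% of compute for free in every model's forward and backward pass.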

Diagram illustrating Strassen's subcubic algorithm for multiplying 2 × 2 matrices, showing matrix components involved in the computations.
Figure 20-2. Strassen's subcubic algorithm for 2×2 matrix multiplication (source: https://oreil.ly/5jzLn)

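Gains like these usually require a new hardware generation or months of low-level CUDA tuning. In this case, AI found a better solution through mathematics in a relatively short time.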

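The lesson is that there may still be untapped efficiency in fundamental algorithms and mathematical operations, even ones that human engineers regard as state of the art. AI can sift through tens of thousands or even millions of algorithm variants, far more than humans could try in a reasonable amount of time. For performance engineers, AlphaTensor's success shows that algorithmic innovation is far from over. In the future, AI may provide a toolkit of more efficient algorithms for fundamental operations such as convolution, sorting, or attention.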

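The return on investment in this case is indirect but far-reaching. Once AlphaTensor's matrix multiplication algorithms are integrated into GPU libraries, any large training job or inference workload gets an immediate speedup. The effect extends from graphics rendering to LLM performance to scientific computing. AlphaTensor shows that when thousands of training iterations run on hundreds of GPUs, a 15% speedup translates into large savings in time and energy. Every run of the code pays off. More importantly, this speedup requires no additional hardware, only smarter software. ...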
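A rough way to size those savings is Amdahl's law, since only the GEMM portion of a job gets faster. The sketch below is a back-of-the-envelope estimate under stated assumptions: the job size (256 GPUs for 1,000 hours) and the 80% GEMM share of runtime are hypothetical numbers chosen for illustration, not figures from the text. Only the 15% GEMM speedup comes from the paragraph above.

```python
def overall_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Amdahl's law: only the GEMM share of the runtime is accelerated."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


def gpu_hours_saved(
    baseline_hours: float,
    num_gpus: int,
    gemm_fraction: float,
    gemm_speedup: float,
) -> float:
    """GPU-hours saved across the cluster for one job."""
    new_hours = baseline_hours / overall_speedup(gemm_fraction, gemm_speedup)
    return (baseline_hours - new_hours) * num_gpus


# Hypothetical job: 256 GPUs for 1,000 hours, 80% of time in GEMM,
# with GEMM running 15% faster (speedup factor 1.15).
saved = gpu_hours_saved(
    baseline_hours=1000.0, num_gpus=256, gemm_fraction=0.8, gemm_speedup=1.15
)
print(f"overall speedup: {overall_speedup(0.8, 1.15):.3f}x")
print(f"GPU-hours saved: {saved:,.0f}")
```

Under these assumptions the job finishes about 10% sooner and saves on the order of 26,700 GPU-hours. The saving grows with the share of time spent in GEMM, so profiling that share is the first step before projecting any benefit.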

Publisher Resources

ISBN: 0642572281557