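Distributed Machine Learning with Ray: Using Ray for Data Processing, Training, Inference, and Deployment of Large Models (Ray 分布式机器学习:利用Ray 进行大模型的数据处理、训练、推理和部署)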

by Max Pumperla, Edward Oakes, Richard Liaw
May 2024
Intermediate
252 pages
5h 31m
Chinese
China Machine Press
... applies to head node failures. In addition, a crash of the Global Control Service (GCS), which stores cluster metadata, terminates every job in the cluster.
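Jobs that involve stateful computation rely mainly on checkpoint-based fault tolerance. Following its failure configuration, Tune restarts a distributed trial from that trial's last checkpoint. With a configured checkpoint interval, Tune can therefore run trials effectively on clusters made up of "checkpoint instances" (most likely a reference to preemptible spot instances). If the whole cluster fails, the entire Tune experiment can also be restored from the experiment-level checkpoint.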
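
The restart-from-last-checkpoint behavior described above can be illustrated with a minimal, pure-Python sketch. It does not use Ray or Tune; `run_trial`, `run_with_retries`, `checkpoint_interval`, and `max_failures` are hypothetical names chosen for this illustration, and the crash is simulated.

```python
import json
import tempfile
from pathlib import Path


class SimulatedCrash(RuntimeError):
    """Stands in for a worker or node failure during a trial."""


def run_trial(checkpoint_path, total_steps, checkpoint_interval, crash_at=None):
    """Run a toy trial, resuming from the last checkpoint if one exists."""
    state = {"step": 0, "loss": 100.0}
    if checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())

    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise SimulatedCrash(f"failure after step {state['step']}")
        state["step"] += 1
        state["loss"] *= 0.9  # pretend that training makes progress
        if state["step"] % checkpoint_interval == 0:
            checkpoint_path.write_text(json.dumps(state))
    return state


def run_with_retries(checkpoint_path, total_steps, checkpoint_interval,
                     crash_at, max_failures):
    """Restart the trial after a failure, up to max_failures times."""
    failures = 0
    while True:
        try:
            # The simulated crash only happens on the first attempt.
            state = run_trial(
                checkpoint_path,
                total_steps,
                checkpoint_interval,
                crash_at=crash_at if failures == 0 else None,
            )
            return state, failures
        except SimulatedCrash:
            failures += 1
            if failures > max_failures:
                raise


with tempfile.TemporaryDirectory() as tmp:
    final_state, num_failures = run_with_retries(
        Path(tmp) / "trial.json",
        total_steps=10,
        checkpoint_interval=5,
        crash_at=7,
        max_failures=2,
    )
    print(final_state["step"], num_failures)
```

In this sketch the trial crashes after step 7, resumes from the checkpoint written at step 5, and repeats steps 6 and 7. A shorter checkpoint interval loses less work per failure at the cost of more frequent writes.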
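Composite workloads inherit the fault-tolerance strategies of both stateless and stateful workloads and keep the advantages of each. Lineage reconstruction applies to the stateless parts of the workload, while application-level checkpointing still applies to the computation as a whole.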
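
To make lineage reconstruction concrete, here is a minimal sketch of the idea for the stateless part of a workload: each derived object remembers the function and inputs that produced it, so a lost object can be recomputed on demand. `LineageStore` and its methods are hypothetical names for this illustration and are not Ray's object store API.

```python
class LineageStore:
    """Toy object store that remembers how each object was produced."""

    def __init__(self):
        self._values = {}
        self._lineage = {}  # object id -> (function, input ids), or None for sources
        self.recomputations = 0

    def put(self, obj_id, value):
        """Store a source value that has no lineage."""
        self._values[obj_id] = value
        self._lineage[obj_id] = None

    def submit(self, obj_id, fn, *input_ids):
        """Run a stateless task and record its lineage."""
        self._lineage[obj_id] = (fn, input_ids)
        self._values[obj_id] = fn(*(self.get(i) for i in input_ids))

    def lose(self, obj_id):
        """Simulate losing an object, for example because its node died."""
        self._values.pop(obj_id, None)

    def get(self, obj_id):
        """Return the object, recomputing it from its lineage if it was lost."""
        if obj_id not in self._values:
            entry = self._lineage[obj_id]
            if entry is None:
                raise KeyError(f"source object {obj_id!r} is lost and has no lineage")
            fn, input_ids = entry
            self.recomputations += 1
            self._values[obj_id] = fn(*(self.get(i) for i in input_ids))
        return self._values[obj_id]


store = LineageStore()
store.put("raw", [1, 2, 3, 4])
store.submit("doubled", lambda xs: [2 * x for x in xs], "raw")
store.submit("total", sum, "doubled")

# Lose both derived objects, then ask for the final result again.
store.lose("doubled")
store.lose("total")
print(store.get("total"), store.recomputations)  # prints: 20 2
```

Recomputation like this only works when tasks are deterministic and free of side effects, which is why the stateful parts of a composite workload still depend on checkpoints.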
10.3.4 Autoscaling AIR Workloads
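AIR libraries can run on the autoscaling Ray clusters introduced in Chapter 9. For stateless workloads, Ray scales up automatically when there are queued tasks (or queued Dataset compute actors). For stateful workloads, Ray scales up automatically when the cluster has pending placement groups (that is, Tune trials) that have not yet been scheduled. Ray scales down automatically when nodes are idle. A node is considered idle when none of its resources are in use and it holds no Ray objects, either in memory or spilled to disk. Because most AIR libraries make use of objects, a node may be kept alive if objects stored on it are referenced by workers on other nodes (for example, a dataset block used by another trial).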
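
The scale-up and scale-down rules above can be summarized in a small decision function. This is only a sketch of the rules as stated in the text, not Ray's autoscaler implementation; `Node`, `is_idle`, and `autoscaling_decision` are hypothetical names, and real autoscalers also apply timeouts and size limits that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    resources_in_use: int = 0         # e.g. CPUs currently allocated
    objects_in_memory: int = 0        # objects held in the node's object store
    objects_spilled_to_disk: int = 0  # objects spilled from memory to disk


def is_idle(node: Node) -> bool:
    """A node is idle only if it uses no resources and holds no objects."""
    return (
        node.resources_in_use == 0
        and node.objects_in_memory == 0
        and node.objects_spilled_to_disk == 0
    )


def autoscaling_decision(nodes, queued_tasks, pending_placement_groups):
    """Return ("scale_up" | "scale_down" | "hold", names of removable nodes)."""
    # Unscheduled work, stateless or stateful, triggers a scale-up.
    if queued_tasks > 0 or pending_placement_groups > 0:
        return "scale_up", []
    idle_nodes = [node.name for node in nodes if is_idle(node)]
    if idle_nodes:
        return "scale_down", idle_nodes
    return "hold", []


cluster = [
    Node("node-a", resources_in_use=4),
    # No running work, but it still holds a dataset block used by another trial.
    Node("node-b", objects_in_memory=1),
    Node("node-c"),
]
print(autoscaling_decision(cluster, queued_tasks=0, pending_placement_groups=0))
print(autoscaling_decision(cluster, queued_tasks=0, pending_placement_groups=2))
```

In the example, `node-b` runs nothing but is not removable because it still holds an object, which mirrors the case where a block is referenced from another node.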
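Note that autoscaling can leave data less evenly balanced across the cluster, because nodes that start earlier naturally run more tasks over their lifetime. To optimize the efficiency of data-intensive workloads, consider limiting autoscaling (for example, by starting from a certain minimum cluster size) or disabling it. ...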
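
The imbalance effect can be seen in a toy simulation. It rests on a strong simplifying assumption that is not from the text: in every round, each live node runs exactly one task, and a new node joins at fixed intervals until a maximum size is reached. `tasks_per_node` and its parameters are hypothetical names for this illustration.

```python
def tasks_per_node(num_rounds, initial_nodes, max_nodes, add_node_every):
    """Count the tasks each node runs under a simplified scheduling model.

    Every round hands one task to each live node. A new node joins every
    `add_node_every` rounds until the cluster has `max_nodes` nodes.
    """
    counts = [0] * initial_nodes
    for round_index in range(num_rounds):
        scale_up_due = round_index > 0 and round_index % add_node_every == 0
        if scale_up_due and len(counts) < max_nodes:
            counts.append(0)  # the new node starts with no work done
        for i in range(len(counts)):
            counts[i] += 1
    return counts


# Growing from 1 to 4 nodes: the earliest node ends up with the most tasks.
print(tasks_per_node(num_rounds=12, initial_nodes=1, max_nodes=4, add_node_every=3))
# prints: [12, 9, 6, 3]

# Starting at the full size: every node runs the same number of tasks.
print(tasks_per_node(num_rounds=12, initial_nodes=4, max_nodes=4, add_node_every=3))
# prints: [12, 12, 12, 12]
```

If each task leaves its output on the node that ran it, the first configuration leaves four times as much data on the oldest node as on the newest one, which is the kind of skew a minimum cluster size avoids.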

Publisher Resources

ISBN: 9787111753384