AI Data Center Network Design and Technologies
by Mahesh Subramaniam, Michal Styszynski, Himanshu Tambakuwala
Chapter 13. Scale-Up Systems
Given all the technologies and design options for backend training networks and storage networking covered so far in this book, you might think that AI/ML data center networking is already well defined by scale-out systems, and that most problems can be solved with the current RoCEv2 or the emerging Ultra Ethernet Specification from the Ultra Ethernet Consortium (UEC). Today, however, the computing power of xPU accelerators (such as GPUs or TPUs) sometimes increases faster than the capacity of Ethernet and InfiniBand, so newer, higher-capacity xPU scale-up systems are being developed. In these systems, the xPU is closely connected to a purpose-built interconnect point, which helps memory semantics communicate ...