February 2026
Intermediate to advanced
384 pages
12h 48m
English
RDMA over Converged Ethernet version 2 (RoCEv2) is the most popular transport protocol for synchronizing data chunks between application buffers on the distributed GPU servers of an AI/ML cluster. It runs over UDP, which does not require the GPU NIC to maintain state. Because RoCEv2 does not engage the server CPU, it scales better than TCP-based data transports such as Non-Volatile Memory Express over TCP (NVMe/TCP), a protocol often used in traditional storage data networks. However, despite its better parallel session scalability, RoCEv2 has entropy characteristics that present networking challenges. The traffic originating from GPUs may cause either momentary ...