February 2026
Intermediate to advanced
384 pages
12h 48m
English
Start your journey by examining the complete lifecycle of AI/ML workloads. This chapter explains how raw data is collected, labeled, and preprocessed before being fed into distributed training pipelines. You will learn the fundamentals of forward and backward propagation, gradient descent, and iterative optimization, and see how these processes scale across thousands of GPUs through data, pipeline, and tensor parallelism. The chapter introduces job completion time (JCT) and tail latency as key metrics and discusses how RDMA, as carried over Ethernet by RoCEv2, provides the low-latency, high-throughput transfers that modern AI workloads need. This technical groundwork clarifies why lossless, high-radix, dynamically ...