Chapter 8. Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism
You have read about several concepts and techniques related to distributed training in the previous chapters of this book. Chapter 6 laid out the fundamentals of distributed model training and discussed the possible dimensions of scaling, while Chapter 7 provided practical knowledge to scale based on the data dimension.
As you learned in Chapter 3, a task can typically be parallelized in two ways: by applying the same set of instructions to different data (SIMD), or by decomposing the set of instructions so that different parts of the algorithm run at the same time on different data (MIMD). Data parallel model training is akin to SIMD, whereas the other forms of parallelism that you will read about in this chapter are akin to MIMD.
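To make the distinction concrete, here is a minimal toy sketch in plain Python. It is illustrative only: lists stand in for tensors, the function names (`scale`, `shift`, `split`, `data_parallel`, `model_parallel`) are hypothetical, and the "workers" run one after another rather than concurrently. In the data parallel version every worker applies the whole model to its own shard of the batch; in the model parallel version each worker holds only one part of the model and passes intermediate results to the next.

```python
# Toy illustration only: plain lists stand in for tensors, and the
# "workers" run sequentially here instead of on separate devices.

def scale(xs):
    """First part of a two-part toy model."""
    return [2 * x for x in xs]

def shift(xs):
    """Second part of the toy model."""
    return [x + 1 for x in xs]

def model(xs):
    """The full model: scale, then shift."""
    return shift(scale(xs))

def split(batch, num_workers):
    """Cut a batch into contiguous shards, one per worker."""
    size = (len(batch) + num_workers - 1) // num_workers
    return [batch[i:i + size] for i in range(0, len(batch), size)]

def data_parallel(batch, num_workers):
    """SIMD-like: every worker runs the full model on its own shard."""
    outputs = [model(shard) for shard in split(batch, num_workers)]
    return [y for out in outputs for y in out]

def model_parallel(batch):
    """MIMD-like: worker A holds only `scale`, worker B holds only `shift`."""
    activations = scale(batch)   # computed on worker A
    return shift(activations)    # handed over to worker B

batch = [1, 2, 3, 4]
# Both decompositions compute the same result as the unsplit model.
assert data_parallel(batch, num_workers=2) == model(batch)
assert model_parallel(batch) == model(batch)
```

Both decompositions produce the same output; what differs is what each worker must hold. In the data parallel case every worker needs the entire model, whereas in the model parallel case each worker needs only its own part.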
Scaling model training using data parallel techniques is often considered “weak” because you are scaling only horizontally, using just one of many possible dimensions of scale (i.e., data). Your overall scalability is limited by three things: the number of parallel workers you can have, the ability of each worker to fit your model in its available memory, and the largest effective batch size you can use before the scaling law stops holding for your case and returns diminish. For most scenarios, weak scaling might be sufficient. However, if these limitations are causing you problems, you will need to look beyond data parallelism and explore more advanced vertical ...