Distributed Machine Learning with Python

by Guanhua Wang

April 2022

Intermediate to advanced

284 pages

5h 53m

English

Packt Publishing


Content preview from Distributed Machine Learning with Python

Chapter 1: Splitting Input Data

In recent years, datasets have grown drastically in size. In computer vision, for example, older datasets such as MNIST and CIFAR-10/100 contain only 50k to 60k training images each, whereas more recent datasets such as ImageNet-1K contain over 1 million training images. A larger input dataset leads to a much longer model training time on a single GPU/node. Continuing the example, training a state-of-the-art model to usable accuracy on CIFAR-10/100 with a single GPU takes only a couple of hours. On ImageNet-1K, however, single-GPU training takes days or even weeks.
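To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes that training time grows linearly with the number of training images when the epoch count and per-GPU throughput are held fixed. The function name, the epoch count, and the throughput figure are hypothetical values chosen for illustration, not measurements from the book.

```python
def estimated_training_hours(num_images, epochs, images_per_second):
    """Rough single-GPU training time in hours.

    Assumes a constant throughput (images processed per second),
    which is a simplification: real throughput depends on the model,
    the image resolution, and the data loading pipeline.
    """
    total_images_processed = num_images * epochs
    return total_images_processed / images_per_second / 3600.0


# Hypothetical settings, for illustration only.
EPOCHS = 90
THROUGHPUT = 400.0  # images per second on one GPU (assumed)

# Dataset sizes taken from the text: 50k images for CIFAR-10/100,
# and 1 million as a lower bound for ImageNet-1K.
cifar_hours = estimated_training_hours(50_000, EPOCHS, THROUGHPUT)
imagenet_hours = estimated_training_hours(1_000_000, EPOCHS, THROUGHPUT)

print(f"CIFAR-sized dataset:    {cifar_hours:.1f} hours")
print(f"ImageNet-sized dataset: {imagenet_hours:.1f} hours")
print(f"Slowdown factor:        {imagenet_hours / cifar_hours:.0f}x")
```

Under these assumed numbers, a 20x larger dataset turns a job of a few hours into one lasting several days, which matches the order of magnitude described above.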

The standard practice for speeding ...