
Deep Learning at Scale

Name: Deep Learning at Scale
Author: Suneeta Mall
ISBN: 9781098145286

by Suneeta Mall

June 2024

Intermediate to advanced

448 pages

11h 55m

English

O'Reilly Media, Inc.


Preface
Why Scaling Matters | Who This Book Is For | How This Book Is Organized | Introduction | Part I: Foundational Concepts of Deep Learning | Part II: Distributed Training | Part III: Extreme Scaling | What You Need to Use This Book | Setting Up Your Environment for Hands-on Exercises | Using Code Examples | Conventions Used in This Book | O’Reilly Online Learning | How to Contact Us | Acknowledgments
1. What Nature and History Have Taught Us About Scale
The Philosophy of Scaling | The General Law of Scaling | History of Scaling Law | Scalable Systems | Nature as a Scalable System | Our Visual System: A Biological Inspiration | Artificial Intelligence: The Evolution of Learnable Systems | It Takes Four to Tango | Evolving Deep Learning Trends | Scale in the Context of Deep Learning | Six Development Considerations | Scaling Considerations | Summary
I. Foundational Concepts of Deep Learning
2. Deep Learning
The Role of Data in Deep Learning | Data Flow in Deep Learning | Hands-On Exercise #1: Implementing Minimalistic Deep Learning | Developing the Model | The Embedded/Latent Space | A Word of Caution | The Learning Rate and Loss Landscape | Scaling Consideration | Profiling | Hands-On Exercise #2: Getting Complex with PyTorch | Model Input Data and Pipeline | Model | Auxiliary Utilities | Putting It All Together | Computation Graphs | Inference | Summary
3. The Computational Side of Deep Learning
The Higgs Boson of the Digital World | Floating-Point Numbers: The Faux Continuous Numbers | Units of Data Measurement | Data Storage Formats: The Trade-off of Latency and Throughput | Computer Architecture | The Birth of the Electromechanical Engine | Memory and Persistence | Computation and Memory Combined | The Scaling Laws of Electronics | Scaling Out Computation with Parallelization | Threads Versus Processes: The Unit of Parallelization | Hardware-Optimized Libraries for Acceleration | Parallel Computer Architectures: Flynn’s and Duncan’s Taxonomies | Accelerated Computing | Popular Accelerated Devices for Deep Learning | CUDA | Accelerator Benchmarking | Summary
4. Putting It All Together: Efficient Deep Learning
Hands-On Exercise #1: GPT-2 | Exercise Objectives | Model Architecture | Implementation | Running the Example | Experiment Tracking | Measuring to Understand the Limitations and Scale Out | Transitioning from Language to Vision | Hands-On Exercise #2: Vision Model with Convolution | Model Architecture | Running the Example | Observations | Graph Compilation Using PyTorch 2.0 | New Components of PyTorch 2.0 | Graph Execution in PyTorch 2.0 | Modeling Techniques to Scale Training on a Single Device | Graph Compilation | Reduced- and Mixed-Precision Training | Memory Tricks for Efficiency | Optimizer Efficiencies | Model Input Pipeline Tricks | Writing Custom Kernels in PyTorch 2.0 with Triton | Summary
II. Distributed Training
5. Distributed Systems and Communications
Distributed Systems | The Eight Fallacies of Distributed Computing | The Consistency, Availability, and Partition Tolerance (CAP) Theorem | The Scaling Law of Distributed Systems | Types of Distributed Systems | Communication in Distributed Systems | Communication Paradigm | Communication Patterns | Communication Technologies | MPI | Communication Initialization: Rendezvous | Hands-On Exercise | Scaling Compute Capacity | Infrastructure Setup Options | Provisioning of Accelerated Devices | Workload Management | Deep Learning Infrastructure Review | Overview of Leading Deep Learning Clusters | Similarities Between Today’s Most Powerful Systems | Summary
6. Theoretical Foundations of Distributed Deep Learning
Distributed Deep Learning | Centralized DDL | Decentralized DDL | Dimensions of Scaling Distributed Deep Learning | Partitioning Dimensions of Distributed Deep Learning | Types of Distributed Deep Learning Techniques | Choosing a Scaling Technique | Measuring Scale | End-to-End Metrics and Benchmarks | Measuring Incrementally in a Reproducible Environment | Summary
7. Data Parallelism
Data Partitioning | Implications of Data Sampling Strategies | Working with Remote Datasets | Introduction to Data Parallel Techniques | Hands-On Exercise #1: Centralized Parameter Server Using RCP | Hands-On Exercise #2: Centralized Gradient-Partitioned Joint Worker/Server Distributed Training | Hands-On Exercise #3: Decentralized Asynchronous Distributed Training | Centralized Synchronous Data Parallel Strategies | Data Parallel (DP) | Distributed Data Parallel (DDP) | Zero Redundancy Optimizer–Powered Data Parallelism (ZeRO-DP) | Fault-Tolerant Training | Hands-On Exercise #4: Scene Parsing with DDP | Hands-On Exercise #5: Distributed Sharded DDP (ZeRO) | Building Efficient Pipelines | Dataset Format | Local Versus Remote | Staging | Threads Versus Processes: Scaling Your Pipelines | Memory Tricks | Data Augmentations: CPU Versus GPU | JIT Acceleration | Hands-On Exercise #6: Pipeline Efficiency with FFCV | Summary

8. Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism
Questions to Ask Before Scaling Vertically | Theoretical Foundations of Vertical Scaling | Revisiting the Dimensions of Scaling | Operators’ Perspective of Parallelism Dimensions | Data Flow and Communications in Vertical Scaling | Basic Building Blocks for Scaling Beyond DP | PyTorch Primitives for Vertical Scaling | Working with Larger Models | Distributed Checkpointing: Saving the Partitioned Model | Summary
9. Gaining Practical Expertise with Scaling Across All Dimensions
Hands-On Exercises: Model, Tensor, Pipeline, and Hybrid Parallelism | The Dataset | Hands-On Exercise #1: Baseline DeepFM | Hands-On Exercise #2: Model Parallel DeepFM | Hands-On Exercise #3: Pipeline Parallel DeepFM | Hands-On Exercise #4: Pipeline Parallel DeepFM with RPC | Hands-On Exercise #5: Tensor Parallel DeepFM | Hands-On Exercise #6: Hybrid Parallel DeepFM | Tools and Libraries for Vertical Scaling | OneFlow | FairScale | DeepSpeed | FSDP | Overview and Comparison | Hands-On Exercise #7: Automatic Vertical Scaling with DeepSpeed | Observations | Summary
III. Extreme Scaling
10. Data-Centric Scaling
The Seven Vs of Data Through a Deep Learning Lens | The Scaling Law of Data | Data Quality | Validity | Variety | Veracity | Value and Volume | The Data Engine and Continual Learning | Volatility | Velocity | Summary
11. Scaling Experiments: Effective Planning and Management
Model Development Is Iterative | Planning for Experiments and Execution | Simplify the Complex | Fast Iteration for Fast Feedback | Decoupled Iterations | Feasibility Testing | Developing and Scaling a Minimal Viable Solution | Setting Up for Iterative Execution | Techniques to Scale Your Experiments | Accelerating Model Convergence | Accelerating Learning Via Optimization and Automation | Accelerating Learning by Increasing Expertise | Learning with Scarce Supervision | Hands-On Exercises | Hands-On Exercise #1: Transfer Learning | Hands-On Exercise #2: Hyperparameter Optimization | Hands-On Exercise #3: Knowledge Distillation | Hands-On Exercise #4: Mixture of Experts | Hands-On Exercise #5: Contrastive Learning | Hands-On Exercise #6: Meta-Learning | Summary
12. Efficient Fine-Tuning of Large Models
Review of Fine-Tuning Techniques | Standard Fine Tuning | Meta-Learning (Zero-/Few-Shot Learning) | Adapter-Based Fine Tuning | Low-Rank Tuning | LoRA: Parameter-Efficient Fine Tuning | Quantized LoRA (QLoRA) | Hands-on Exercise: QLoRA-Based Fine Tuning | Implementation Details | Inference | Exercise Summary | Summary
13. Foundation Models
What Are Foundation Models? | The Evolution of Foundation Models | Challenges Involved in Developing Foundation Models | Measurement Complexity | Deployment Challenges | Propagation of Defects to All Downstream Models | Legal and Ethical Considerations | Ensuring Consistency and Coherency | Multimodal Large Language Models | Projection | Gated Cross-Attention | Query-Based Encoding | Further Exploration | Summary
Index
About the Author

Content preview from Deep Learning at Scale

Chapter 8. Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism

You have read about several concepts and techniques related to distributed training in the previous chapters of this book. Chapter 6 laid out the fundamentals of distributed model training and discussed the possible dimensions of scaling, while Chapter 7 provided practical knowledge to scale based on the data dimension.

As you learned in Chapter 3, a task can typically be parallelized in two ways: by applying the same set of instructions on different data (SIMD) or by decomposing the set of instructions such that different parts of the algorithm can be performed at the same time on different data (MIMD). Data parallel model training is akin to SIMD, whereas the other forms of parallelism that you will read about in this chapter are akin to MIMD.
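The distinction can be illustrated with a minimal sketch in plain Python. This is not code from the book: the function names (`square_sum`, `data_parallel`, `pipeline`) and the toy arithmetic are assumptions chosen for illustration, and threads stand in for real workers or devices. In `data_parallel`, every worker runs the same instructions on a different shard of the data; in `pipeline`, two workers run different instructions at the same time, each on a different item.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def square_sum(shard):
    """The single set of instructions that every data parallel worker runs."""
    return sum(x * x for x in shard)


def data_parallel(data, num_workers):
    """SIMD-like: same instructions, different data shard per worker."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(square_sum, shards))
    # Combine the partial results (loosely analogous to reducing gradients).
    return sum(partials)


def pipeline(data):
    """MIMD-like: two workers run different instructions concurrently."""
    done = object()          # sentinel marking the end of the stream
    handoff = queue.Queue()  # carries stage 1 outputs to stage 2
    results = []

    def stage_one():
        for x in data:
            handoff.put(x * x)       # first part of the algorithm
        handoff.put(done)

    def stage_two():
        while True:
            item = handoff.get()
            if item is done:
                break
            results.append(item + 1)  # second part of the algorithm

    workers = [threading.Thread(target=stage_one),
               threading.Thread(target=stage_two)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    print(data_parallel(list(range(8)), num_workers=4))
    print(pipeline([1, 2, 3]))
```

While `stage_two` is still processing one item, `stage_one` can already be working on the next, which is the sense in which the decomposed form runs different parts of the algorithm at the same time on different data.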

Scaling model training using data parallel techniques is often considered “weak” because you are scaling only horizontally, using just one of many possible dimensions of scale (i.e., data). Your overall scalability is limited by the number of parallel workers you can have, the ability of each worker to fit your model in its available memory, and the maximum effective batch size you can use before the scaling law breaks down for your case and returns diminish. For most scenarios, weak scaling might be sufficient. However, if these limitations are causing you problems, you will need to look beyond data parallelism and explore more advanced vertical ...
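The three limits named in this paragraph can be written down as a small back-of-the-envelope check. This is a sketch under stated assumptions, not a method from the book: the function and parameter names are hypothetical, `model_bytes` is assumed to be the full per-replica training footprint, and `max_useful_batch` is a value you would have to determine empirically for your own model.

```python
def data_parallel_limits(per_worker_batch, num_workers,
                         model_bytes, device_bytes, max_useful_batch):
    """Check a data parallel plan against the limits discussed above.

    All parameters are hypothetical inputs supplied by the caller:
    - model_bytes: assumed per-replica memory footprint during training
    - device_bytes: memory available on each worker
    - max_useful_batch: empirically determined batch size beyond which
      returns diminish for this particular model
    """
    # Each worker processes its own batch, so the effective batch grows
    # linearly with the number of workers.
    effective_batch = per_worker_batch * num_workers
    return {
        "effective_batch": effective_batch,
        # Data parallelism replicates the model, so every worker must hold it.
        "model_fits": model_bytes <= device_bytes,
        "within_batch_limit": effective_batch <= max_useful_batch,
    }


if __name__ == "__main__":
    plan = data_parallel_limits(
        per_worker_batch=32,
        num_workers=16,
        model_bytes=30 * 10**9,
        device_bytes=80 * 10**9,
        max_useful_batch=256,
    )
    print(plan)
```

If `model_fits` is false, adding data parallel workers cannot help, because each one needs a full copy of the model; if `within_batch_limit` is false, adding workers keeps raising the effective batch size past the point where it pays off. Either outcome is a signal to consider the other dimensions of scale.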


Publisher Resources

ISBN: 9781098145279
