Learning Ray

by Max Pumperla, Edward Oakes, Richard Liaw

February 2023

Beginner

271 pages

English

O'Reilly Media, Inc.


Who Should Read This Book · Goals of This Book · Navigating This Book · How to Use the Code Examples · Conventions Used in This Book · Using Code Examples · O’Reilly Online Learning · How to Contact Us · Acknowledgments
What Is Ray? · What Led to Ray? · Ray’s Design Principles · Three Layers: Core, Libraries, and Ecosystem · A Distributed Computing Framework · A Suite of Data Science Libraries · Ray AIR and the Data Science Workflow · Data Processing with Ray Datasets · Model Training · Hyperparameter Tuning · Model Serving · A Growing Ecosystem · Summary
An Introduction to Ray Core · A First Example Using the Ray API · An Overview of the Ray Core API · Understanding Ray System Components · Scheduling and Executing Work on a Node · The Head Node · Distributed Scheduling and Execution · A Simple MapReduce Example with Ray · Mapping and Shuffling Document Data · Reducing Word Counts · Summary
Introducing Reinforcement Learning · Setting Up a Simple Maze Problem · Building a Simulation · Training a Reinforcement Learning Model · Building a Distributed Ray App · Recapping RL Terminology · Summary
An Overview of RLlib · Getting Started with RLlib · Building a Gym Environment · Running the RLlib CLI · Using the RLlib Python API · Configuring RLlib Experiments · Resource Configuration · Rollout Worker Configuration · Environment Configuration · Working with RLlib Environments · An Overview of RLlib Environments · Working with Multiple Agents · Working with Policy Servers and Clients · Advanced Concepts · Building an Advanced Environment · Applying Curriculum Learning · Working with Offline Data · Other Advanced Topics · Summary
Tuning Hyperparameters · Building a Random Search Example with Ray · Why Is HPO Hard? · An Introduction to Tune · How Does Tune Work? · Configuring and Running Tune · Machine Learning with Tune · Using RLlib with Tune · Tuning Keras Models · Summary
Ray Datasets · Ray Datasets Basics · Computing Over Ray Datasets · Dataset Pipelines · Example: Training Copies of a Classifier in Parallel · External Library Integrations · Building an ML Pipeline · Summary
The Basics of Distributed Model Training · Introduction to Ray Train by Example · Predicting Big Tips in NYC Taxi Rides · Loading, Preprocessing, and Featurization · Defining a Deep Learning Model · Distributed Training with Ray Train · Distributed Batch Inference · More on Trainers in Ray Train · Migrating to Ray Train with Minimal Code Changes · Scaling Out Trainers · Preprocessing with Ray Train · Integrating Trainers with Ray Tune · Using Callbacks to Monitor Training · Summary
Key Characteristics of Online Inference · ML Models Are Compute Intensive · ML Models Aren’t Useful in Isolation · An Introduction to Ray Serve · Architectural Overview · Defining a Basic HTTP Endpoint · Scaling and Resource Allocation · Request Batching · Multimodel Inference Graphs · End-to-End Example: Building an NLP-Powered API · Fetching Content and Preprocessing · NLP Models · HTTP Handling and Driver Logic · Putting It All Together · Summary
Manually Creating a Ray Cluster · Deployment on Kubernetes · Setting Up Your First KubeRay Cluster · Interacting with the KubeRay Cluster · Exposing KubeRay · Configuring KubeRay · Configuring Logging for KubeRay · Using the Ray Cluster Launcher · Configuring Your Ray Cluster · Using the Cluster Launcher CLI · Interacting with a Ray Cluster · Working with Cloud Clusters · AWS · Using Other Cloud Providers · Autoscaling · Summary
Why Use AIR? · Key AIR Concepts by Example · Ray Datasets and Preprocessors · Trainers · Tuners and Checkpoints · Batch Predictors · Deployments · Workloads That Are Suited for AIR · AIR Workload Execution · AIR Memory Management · AIR Failure Model · Autoscaling AIR Workloads · Summary
A Growing Ecosystem · Data Loading and Processing · Model Training · Model Serving · Building Custom Integrations · An Overview of Ray’s Integrations · Ray and Other Systems · Distributed Python Frameworks · Ray AIR and the Broader ML Ecosystem · How to Integrate AIR into Your ML Platform · Where to Go from Here? · Summary

Preface

Distributed computing is a fascinating topic. Looking back at the early days of computing, one can’t help but be impressed by the fact that so many companies today distribute their workloads across clusters of computers. It’s impressive that we have figured out efficient ways to do so, but scaling out is also becoming more and more of a necessity. Individual computers keep getting faster, and yet our need for large-scale computing keeps exceeding what single machines can do.

Recognizing that scaling is both a necessity and a challenge, Ray aims to make distributed computing simple for developers. It opens distributed computing to nonexperts and lets you scale your Python scripts across multiple nodes with relatively little effort. Ray is good at scaling both data-heavy and compute-heavy workloads, such as data preprocessing and model training, and it explicitly targets machine learning (ML) workloads that need to scale. While it is possible today to scale these two types of workloads without Ray, you would likely have to use different APIs and distributed systems for each, and managing several distributed systems can be messy and inefficient in many ways.

The Ray AI Runtime (AIR), introduced with the release of Ray 2.0 in August 2022, expanded Ray's support for complex ML workloads even further. AIR is a collection of libraries and tools that make it easy to build and deploy end-to-end ML applications in a single distributed system. With AIR, even the most ...