Deep Reinforcement Learning Hands-On - Third Edition

Name: Deep Reinforcement Learning Hands-On - Third Edition
Author: Maxim Lapan
ISBN: 9781835882702

November 2024

Intermediate to advanced

716 pages

19h 34m

English

Packt Publishing

Preface
Why I wrote this book; The approach; Who this book is for; What this book covers; To get the most out of this book; Changes in the third edition
Part 1 Introduction to RL
What Is Reinforcement Learning?
Supervised learning; Unsupervised learning; Reinforcement learning; Complications in RL; RL formalisms; Reward; The agent; The environment; Actions; Observations; The theoretical foundations of RL; Markov decision processes; The Markov process; Markov reward processes; Adding actions to MDP; Policy; Summary
OpenAI Gym API and Gymnasium
The anatomy of the agent; Hardware and software requirements; The OpenAI Gym API and Gymnasium; The action space; The observation space; The environment; Creating an environment; The CartPole session; The random CartPole agent; Extra Gym API functionality; Wrappers; Rendering the environment; More wrappers; Summary
Deep Learning with PyTorch
Tensors; The creation of tensors; Scalar tensors; Tensor operations; GPU tensors; Gradients; Tensors and gradients; NN building blocks; Custom layers; Loss functions and optimizers; Loss functions; Optimizers; Monitoring with TensorBoard; TensorBoard 101; Plotting metrics; GAN on Atari images; PyTorch Ignite; Ignite concepts; GAN training on Atari using Ignite; Summary
The Cross-Entropy Method
The taxonomy of RL methods; The cross-entropy method in practice; The cross-entropy method on CartPole; The cross-entropy method on FrozenLake; The theoretical background of the cross-entropy method; Summary
Part 2 Value-based methods
Tabular Learning and the Bellman Equation
Value, state, and optimality; The Bellman equation of optimality; The value of the action; The value iteration method; Value iteration in practice; Q-iteration for FrozenLake; Summary
Deep Q-Networks
Real-life value iteration; Tabular Q-learning; Deep Q-learning; Interaction with the environment; SGD optimization; Correlation between steps; The Markov property; The final form of DQN training; DQN on Pong; Wrappers; The DQN model; Training; Running and performance; Your model in action; Things to try; Summary
Higher-Level RL Libraries
Why RL libraries?; The PTAN library; Action selectors; The agent; DQNAgent; PolicyAgent; Experience source; Toy environment; The ExperienceSource class; The ExperienceSourceFirstLast Class; Experience replay buffers; The TargetNet class; Ignite helpers; The PTAN CartPole solver; Other RL libraries; Summary

DQN Extensions
Basic DQN; Common library; Implementation; Hyperparameter tuning; Results with common parameters; Tuned baseline DQN; N-step DQN; Implementation; Results; Hyperparameter tuning; Double DQN; Implementation; Results; Hyperparameter tuning; Noisy networks; Implementation; Results; Hyperparameter tuning; Prioritized replay buffer; Implementation; Results; Hyperparameter tuning; Dueling DQN; Implementation; Results; Hyperparameter tuning; Categorical DQN; Implementation; Results; Hyperparameter tuning; Combining everything; Results; Hyperparameter tuning; Summary
Ways to Speed Up RL
Why speed matters; Baseline; The computation graph in PyTorch; Several environments; Playing and training in separate processes; Tweaking wrappers; Benchmark results; Summary
Stocks Trading Using RL
Why trading?; Problem statement and key decisions; Data; The trading environment; Models; Training code; Results; The feed-forward model; The convolution model; Things to try; Summary
Part 3 Policy-based methods
Policy Gradients
Values and policy; Why the policy?; Policy representation; Policy gradients; The REINFORCE method; The CartPole example; Results; Policy-based versus value-based methods; REINFORCE issues; Full episodes are required; High gradient variance; Exploration problems; High correlation of samples; Policy gradient methods on CartPole; Implementation; Results; Policy gradient methods on Pong; Implementation; Results; Summary
Actor-Critic Method: A2C and A3C
Variance reduction; CartPole variance; Advantage actor-critic (A2C); A2C on Pong; Results; Asynchronous Advantage Actor-Critic (A3C); Correlation and sample efficiency; Adding an extra “A” to A2C; A3C with data parallelism; Results; A3C with gradient parallelism; Implementation; Results; Summary
The TextWorld Environment
Interactive fiction; The environment; Installation; Game generation; Observation and action spaces; Extra game information; The deep NLP basics; Recurrent Neural Networks (RNNs); Word embedding; The Encoder-Decoder architecture; Transformers; Baseline DQN; Observation preprocessing; Embeddings and encoders; The DQN model and the agent; Training code; Training results; Tweaking observations; Tracking visited rooms; Relative actions; Objective in observation; Transformers; ChatGPT; Setup; Interactive mode; ChatGPT API; Summary
Web Navigation
The evolution of web navigation; Browser automation and RL; Challenges in browser automation; The MiniWoB benchmark; MiniWoB++; Installation; Actions and observations; Simple example; The simple clicking approach; Grid actions; The RL part of our implementation; The model and training code; Training results; Simple clicking limitations; Adding text description; Implementation; Results; Human demonstrations; Recording the demonstrations; Training with demonstrations; Results; Things to try; Summary
Part 4 Advanced RL
Continuous Action Space
Why a continuous space?; The action space; Environments; The A2C method; Implementation; Results; Using models and recording videos; Deep deterministic policy gradients; Exploration; Implementation; Results and video; Distributional policy gradients; Architecture; Implementation; Results; Things to try; Summary
Trust Region Methods
Environments; The A2C baseline; Implementation; Results; Video recording; PPO; Implementation; Results; TRPO; Implementation; Results; ACKTR; Implementation; Results; SAC; Implementation; Results; Overall results; Summary
Black-Box Optimizations in RL
Black-box methods; Evolution strategies; Implementing ES on CartPole; CartPole results; ES on HalfCheetah; Implementing ES on HalfCheetah; HalfCheetah results; Genetic algorithms; GA on CartPole; GA tweaks; Deep GA; Novelty search; GA on HalfCheetah; Implementation; Results; Summary
Advanced Exploration
Why exploration is important; What’s wrong with 𝜖-greedy?; Alternative ways of exploration; Noisy networks; Count-based methods; Prediction-based methods; MountainCar experiments; DQN + 𝜖-greedy; DQN + noisy networks; DQN + state counts; PPO method; PPO + Noisy Networks; PPO + state counts; PPO + network distillation; Comparison of methods; Atari experiments; DQN + 𝜖-greedy; DQN + noisy networks; PPO; Summary
Reinforcement Learning with Human Feedback
Reward functions in complex environments; Theoretical background; Method overview; RLHF and LLMs; RLHF experiments; Initial training using A2C; Labeling process; Reward model training; Combining A2C with the reward model; Fine-tuning with 100 labels; The second round of the experiment; The third round of the experiment; Overall results; Summary
AlphaGo Zero and MuZero
Comparing model-based and model-free methods; Model-based methods for board games; The AlphaGo Zero method; Overview; MCTS; Self-play; Training and evaluation; Connect 4 with AlphaGo Zero; The game model; Implementing MCTS; The model; Training; Testing and comparison; Results; MuZero; High-level model; Training process; Connect 4 with MuZero; Hyperparameters and MCTS tree nodes; Models; MCTS search; Training data and gameplay; MuZero results; MuZero and Atari; Summary
RL in Discrete Optimization
The Rubik’s cube and discrete optimization; Optimality and God’s number; Approaches to cube solving; Actions; States; The training process; The NN architecture; The training; The model application; Results; The code outline; Cube environments; Training; The search process; The experiment results; The 2 × 2 cube; The 3 × 3 cube; Further improvements and experiments; Summary
Multi-Agent RL
What is multi-agent RL?; Getting started with the environment; An overview of MAgent; Installing MAgent; Setting up a random environment; Deep Q-network for tigers; Understanding the code; Training and results; Collaboration by the tigers; Training both tigers and deer; The battle environment; Summary
Bibliography
Index

Content preview from Deep Reinforcement Learning Hands-On - Third Edition

12 Actor-Critic Method: A2C and A3C

In Chapter 11, we started to investigate a policy-based alternative to the familiar family of value-based methods. In particular, we focused on the REINFORCE method and its modification, which uses the discounted reward to obtain the gradient of the policy (the direction in which to improve the policy). Both methods worked well on the small CartPole problem, but on the more complicated Pong environment, they did not converge.
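
To make the role of the discounted reward concrete, here is a minimal sketch in plain Python. It is not the book's code: the function names `discounted_returns` and `reinforce_loss`, the toy rewards, and the value of `gamma` are assumptions chosen for illustration. It computes the discounted return from each step of an episode and the scalar loss whose gradient REINFORCE-style methods follow.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Discounted return from every step to the end of the episode."""
    returns: List[float] = []
    running = 0.0
    # Walk the episode backwards so each step reuses the return of the next one:
    # Q_t = r_t + gamma * Q_{t+1}
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs: List[float], returns: List[float]) -> float:
    """Negative sum of log pi(a_t | s_t) * Q_t over one episode.

    Minimizing this value raises the probability of actions that were
    followed by large discounted returns.
    """
    return -sum(lp * q for lp, q in zip(log_probs, returns))


if __name__ == "__main__":
    # Hypothetical three-step episode with a reward of 1 at every step.
    episode_rewards = [1.0, 1.0, 1.0]
    print(discounted_returns(episode_rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
```

In a real training loop, the log-probabilities would come from the policy network, and an automatic differentiation library would compute the gradient of this loss with respect to the network weights.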

Here, we will discuss another extension to the vanilla policy gradient method, which magically improves the stability and convergence speed of that method. Despite the modification being only minor, the new method has its own name, actor-critic, and it’s one of the most ...
