
Deep Reinforcement Learning with Python - Second Edition

Name: Deep Reinforcement Learning with Python - Second Edition
Author: Sudharsan Ravichandiran
ISBN: 9781839210686

by Sudharsan Ravichandiran

September 2020

Intermediate to advanced

760 pages

18h 26m

English

Packt Publishing

Preface
Who this book is for; What this book covers; To get the most out of this book; Get in touch
Fundamentals of Reinforcement Learning
Key elements of RL; Agent; Environment; State and action; Reward; The basic idea of RL; The RL algorithm; RL agent in the grid world; How RL differs from other ML paradigms; Markov Decision Processes; The Markov property and Markov chain; The Markov Reward Process; The Markov Decision Process; Fundamental concepts of RL; Math essentials; Expectation; Action space; Policy; Deterministic policy; Stochastic policy; Episode; Episodic and continuous tasks; Horizon; Return and discount factor; Small discount factor; Large discount factor; What happens when we set the discount factor to 0?; What happens when we set the discount factor to 1?; The value function; Q function; Model-based and model-free learning; Different types of environments; Deterministic and stochastic environments; Discrete and continuous environments; Episodic and non-episodic environments; Single and multi-agent environments; Applications of RL; RL glossary; Summary; Questions; Further reading
A Guide to the Gym Toolkit
Setting up our machine; Installing Anaconda; Installing the Gym toolkit; Common error fixes; Creating our first Gym environment; Exploring the environment; States; Actions; Transition probability and reward function; Generating an episode in the Gym environment; Action selection; Generating an episode; More Gym environments; Classic control environments; State space; Action space; Cart-Pole balancing with random policy; Atari game environments; General environment; Deterministic environment; No frame skipping; State and action space; An agent playing the Tennis game; Recording the game; Other environments; Box2D; MuJoCo; Robotics; Toy text; Algorithms; Environment synopsis; Summary; Questions; Further reading
The Bellman Equation and Dynamic Programming
The Bellman equation; The Bellman equation of the value function; The Bellman equation of the Q function; The Bellman optimality equation; The relationship between the value and Q functions; Dynamic programming; Value iteration; The value iteration algorithm; Solving the Frozen Lake problem with value iteration; Policy iteration; Algorithm – policy iteration; Solving the Frozen Lake problem with policy iteration; Is DP applicable to all environments?; Summary; Questions
Monte Carlo Methods
Understanding the Monte Carlo method; Prediction and control tasks; Prediction task; Control task; Monte Carlo prediction; MC prediction algorithm; Types of MC prediction; First-visit Monte Carlo; Every-visit Monte Carlo; Implementing the Monte Carlo prediction method; Understanding the blackjack game; The blackjack environment in the Gym library; Every-visit MC prediction with the blackjack game; First-visit MC prediction with the blackjack game; Incremental mean updates; MC prediction (Q function); Monte Carlo control; MC control algorithm; On-policy Monte Carlo control; Monte Carlo exploring starts; Monte Carlo with the epsilon-greedy policy; Implementing on-policy MC control; Off-policy Monte Carlo control; Is the MC method applicable to all tasks?; Summary; Questions
Understanding Temporal Difference Learning
TD learning; TD prediction; TD prediction algorithm; Predicting the value of states in the Frozen Lake environment; TD control; On-policy TD control – SARSA; Computing the optimal policy using SARSA; Off-policy TD control – Q learning; Computing the optimal policy using Q learning; The difference between Q learning and SARSA; Comparing the DP, MC, and TD methods; Summary; Questions; Further reading
Case Study – The MAB Problem
The MAB problem; Creating a bandit in the Gym; Exploration strategies; Epsilon-greedy; Softmax exploration; Upper confidence bound; Thompson sampling; Applications of MAB; Finding the best advertisement banner using bandits; Creating a dataset; Initialize the variables; Define the epsilon-greedy method; Run the bandit test; Contextual bandits; Summary; Questions; Further reading
Deep Learning Foundations
Biological and artificial neurons; ANN and its layers; Input layer; Hidden layer; Output layer; Exploring activation functions; The sigmoid function; The tanh function; The Rectified Linear Unit function; The softmax function; Forward propagation in ANNs; How does an ANN learn?; Putting it all together; Building a neural network from scratch; Recurrent Neural Networks; The difference between feedforward networks and RNNs; Forward propagation in RNNs; Backpropagating through time; LSTM to the rescue; Understanding the LSTM cell; What are CNNs?; Convolutional layers; Strides; Padding; Pooling layers; Fully connected layers; The architecture of CNNs; Generative adversarial networks; Breaking down the generator; Breaking down the discriminator; How do they learn, though?; Architecture of a GAN; Demystifying the loss function; Discriminator loss; Generator loss; Total loss; Summary; Questions; Further reading
A Primer on TensorFlow
What is TensorFlow?; Understanding computational graphs and sessions; Sessions; Variables, constants, and placeholders; Variables; Constants; Placeholders and feed dictionaries; Introducing TensorBoard; Creating a name scope; Handwritten digit classification using TensorFlow; Importing the required libraries; Loading the dataset; Defining the number of neurons in each layer; Defining placeholders; Forward propagation; Computing loss and backpropagation; Computing accuracy; Creating a summary; Training the model; Visualizing graphs in TensorBoard; Introducing eager execution; Math operations in TensorFlow; TensorFlow 2.0 and Keras; Bonjour Keras; Defining the model; Compiling the model; Training the model; Evaluating the model; MNIST digit classification using TensorFlow 2.0; Summary; Questions; Further reading
Deep Q Network and Its Variants
What is DQN?; Understanding DQN; Replay buffer; Loss function; Target network; Putting it all together; The DQN algorithm; Playing Atari games using DQN; Architecture of the DQN; Getting hands-on with the DQN; Preprocess the game screen; Defining the DQN class; Training the DQN; The double DQN; The double DQN algorithm; DQN with prioritized experience replay; Types of prioritization; Proportional prioritization; Rank-based prioritization; Correcting the bias; The dueling DQN; Understanding the dueling DQN; The architecture of a dueling DQN; The deep recurrent Q network; The architecture of a DRQN; Summary; Questions; Further reading

Policy Gradient Method
Why policy-based methods?; Policy gradient intuition; Understanding the policy gradient; Deriving the policy gradient; Algorithm – policy gradient; Variance reduction methods; Policy gradient with reward-to-go; Algorithm – Reward-to-go policy gradient; Cart pole balancing with policy gradient; Computing discounted and normalized reward; Building the policy network; Training the network; Policy gradient with baseline; Algorithm – REINFORCE with baseline; Summary; Questions; Further reading
Actor-Critic Methods – A2C and A3C
Overview of the actor-critic method; Understanding the actor-critic method; The actor-critic algorithm; Advantage actor-critic (A2C); Asynchronous advantage actor-critic (A3C); The three As; The architecture of A3C; Mountain car climbing using A3C; Creating the mountain car environment; Defining the variables; Defining the actor-critic class; Defining the worker class; Training the network; Visualizing the computational graph; A2C revisited; Summary; Questions; Further reading
Learning DDPG, TD3, and SAC
Deep deterministic policy gradient; An overview of DDPG; Actor; Critic; DDPG components; Critic network; Actor network; Putting it all together; Algorithm – DDPG; Swinging up a pendulum using DDPG; Creating the Gym environment; Defining the variables; Defining the DDPG class; Training the network; Twin delayed DDPG; Key features of TD3; Clipped double Q learning; Delayed policy updates; Target policy smoothing; Putting it all together; Algorithm – TD3; Soft actor-critic; Understanding soft actor-critic; V and Q functions with the entropy term; Components of SAC; Critic network; Actor network; Putting it all together; Algorithm – SAC; Summary; Questions; Further reading
TRPO, PPO, and ACKTR Methods
Trust region policy optimization; Math essentials; The Taylor series; The trust region method; The conjugate gradient method; Lagrange multipliers; Importance sampling; Designing the TRPO objective function; Parameterizing the policies; Sample-based estimation; Solving the TRPO objective function; Computing the search direction; Performing a line search in the search direction; Algorithm – TRPO; Proximal policy optimization; PPO with a clipped objective; Algorithm – PPO-clipped; Implementing the PPO-clipped method; Creating the Gym environment; Defining the PPO class; Training the network; PPO with a penalized objective; Algorithm – PPO-penalty; Actor-critic using Kronecker-factored trust region; Math essentials; Block matrix; Block diagonal matrix; The Kronecker product; The vec operator; Properties of the Kronecker product; Kronecker-Factored Approximate Curvature (K-FAC); K-FAC in actor-critic; Incorporating the trust region; Summary; Questions; Further reading
Distributional Reinforcement Learning
Why distributional reinforcement learning?; Categorical DQN; Predicting the value distribution; Selecting an action based on the value distribution; Training the categorical DQN; Projection step; Putting it all together; Algorithm – categorical DQN; Playing Atari games using a categorical DQN; Defining the variables; Defining the replay buffer; Defining the categorical DQN class; Quantile Regression DQN; Math essentials; Quantile; Inverse CDF (quantile function); Understanding QR-DQN; Action selection; Loss function; Distributed Distributional DDPG; Critic network; Actor network; Algorithm – D4PG; Summary; Questions; Further reading
Imitation Learning and Inverse RL
Supervised imitation learning; DAgger; Understanding DAgger; Algorithm – DAgger; Deep Q learning from demonstrations; Phases of DQfD; Pre-training phase; Training phase; Loss function of DQfD; Algorithm – DQfD; Inverse reinforcement learning; Maximum entropy IRL; Key terms; Back to maximum entropy IRL; Computing the gradient; Algorithm – maximum entropy IRL; Generative adversarial imitation learning; Formulation of GAIL; Summary; Questions; Further reading
Deep Reinforcement Learning with Stable Baselines
Installing Stable Baselines; Creating our first agent with Stable Baselines; Evaluating the trained agent; Storing and loading the trained agent; Viewing the trained agent; Putting it all together; Vectorized environments; SubprocVecEnv; DummyVecEnv; Integrating custom environments; Playing Atari games with a DQN and its variants; Implementing DQN variants; Lunar lander using A2C; Creating a custom network; Swinging up a pendulum using DDPG; Viewing the computational graph in TensorBoard; Training an agent to walk using TRPO; Installing the MuJoCo environment; Implementing TRPO; Recording the video; Training a cheetah bot to run using PPO; Making a GIF of a trained agent; Implementing GAIL; Summary; Questions; Further reading
Reinforcement Learning Frontiers
Meta reinforcement learning; Model-agnostic meta learning; Understanding MAML; MAML in a supervised learning setting; MAML in a reinforcement learning setting; Hierarchical reinforcement learning; MAXQ value function Decomposition; Imagination augmented agents; Summary; Questions; Further reading
Appendix 1 – Reinforcement Learning Algorithms
Reinforcement learning algorithm; Value Iteration; Policy Iteration; First-Visit MC Prediction; Every-Visit MC Prediction; MC Prediction – the Q Function; MC Control Method; On-Policy MC Control – Exploring starts; On-Policy MC Control – Epsilon-Greedy; Off-Policy MC Control; TD Prediction; On-Policy TD Control – SARSA; Off-Policy TD Control – Q Learning; Deep Q Learning; Double DQN; REINFORCE Policy Gradient; Policy Gradient with Reward-To-Go; REINFORCE with Baseline; Advantage Actor Critic; Asynchronous Advantage Actor-Critic; Deep Deterministic Policy Gradient; Twin Delayed DDPG; Soft Actor-Critic; Trust Region Policy Optimization; PPO-Clipped; PPO-Penalty; Categorical DQN; Distributed Distributional DDPG; DAgger; Deep Q learning from demonstrations; MaxEnt Inverse Reinforcement Learning; MAML in Reinforcement Learning
Appendix 2 – Assessments
Chapter 1 – Fundamentals of Reinforcement Learning; Chapter 2 – A Guide to the Gym Toolkit; Chapter 3 – The Bellman Equation and Dynamic Programming; Chapter 4 – Monte Carlo Methods; Chapter 5 – Understanding Temporal Difference Learning; Chapter 6 – Case Study – The MAB Problem; Chapter 7 – Deep Learning Foundations; Chapter 8 – A Primer on TensorFlow; Chapter 9 – Deep Q Network and Its Variants; Chapter 10 – Policy Gradient Method; Chapter 11 – Actor-Critic Methods – A2C and A3C; Chapter 12 – Learning DDPG, TD3, and SAC; Chapter 13 – TRPO, PPO, and ACKTR Methods; Chapter 14 – Distributional Reinforcement Learning; Chapter 15 – Imitation Learning and Inverse RL; Chapter 16 – Deep Reinforcement Learning with Stable Baselines; Chapter 17 – Reinforcement Learning Frontiers
Other Books You May Enjoy
Index

Content preview from Deep Reinforcement Learning with Python - Second Edition

3 The Bellman Equation and Dynamic Programming

In the previous chapter, we learned that in reinforcement learning our goal is to find the optimal policy: the policy that selects the correct action in each state so that the agent obtains the maximum return and achieves its goal. In this chapter, we'll learn about two classic reinforcement learning algorithms, value iteration and policy iteration, which we can use to find the optimal policy.

Before diving into the value and policy iteration methods, we will first learn about the Bellman equation. The Bellman equation is ubiquitous in reinforcement learning and is used for finding the optimal value and Q functions. We will understand what the ...
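To make the idea of value iteration concrete, here is a minimal sketch on a toy problem. It is not taken from the book: the two-state MDP, the arrays `P` and `R`, the discount factor of 0.9, and the helper `value_iteration` are all hypothetical and chosen only for illustration. The loop repeatedly applies the standard Bellman optimality backup, V(s) ← max over a of [R(s, a) + γ Σ P(s' | s, a) V(s')], until the values stop changing, and then reads off a greedy policy.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (an illustration, not from the book).
# P[s, a, s2] is the probability of moving from state s to s2 under action a.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[1.0, 0.0], [0.0, 1.0]],  # state 1: action 0 moves to state 0, action 1 stays
])
# R[s, a] is the expected immediate reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [0.0, 2.0],
])
gamma = 0.9  # discount factor (assumed value)


def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Apply the Bellman optimality backup until the values converge."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * (P @ V)      # Q[s, a] = R[s, a] + gamma * sum_s2 P * V
        V_new = Q.max(axis=1)        # best achievable value in each state
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=1)       # values and the greedy policy


V, policy = value_iteration(P, R, gamma)
print("state values:", V)
print("greedy policy:", policy)
```

In this toy problem the best strategy is to reach state 1 and stay there, so the values approach 19 for state 0 and 20 for state 1, and the greedy policy picks action 1 in both states.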
