One challenge when training agents with policy gradient algorithms is that they are susceptible to performance collapse, in which an agent suddenly starts to perform badly. This scenario can be hard to recover from, because the agent then generates poor trajectories which are in turn used to further train the policy. We have also seen that on-policy algorithms are sample-inefficient because they cannot reuse data.
Proximal Policy Optimization (PPO) by Schulman et al. [124] is a class of optimization algorithms that addresses these two issues. The main idea behind PPO is to introduce a surrogate objective that avoids performance collapse by guaranteeing monotonic policy improvement. This objective also ...
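To make the surrogate objective concrete, here is a minimal sketch of the clipped variant introduced in the PPO paper [124], written in PyTorch. The function name `ppo_clip_objective`, the per-timestep tensor layout, and the default clipping parameter are illustrative assumptions, not code from this book.

```python
import torch

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al.).

    log_probs_new: log pi_theta(a_t | s_t) under the current policy
    log_probs_old: log pi_theta_old(a_t | s_t) from the policy that
                   collected the trajectories (detached from the graph)
    advantages:    advantage estimates A_t, one per timestep
    eps:           clipping parameter (0.2 is a common choice)
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old,
    # computed in log space for numerical stability.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # The elementwise minimum is a pessimistic lower bound on the
    # unclipped objective, maximized (e.g. by gradient ascent) over theta.
    return torch.min(surr1, surr2).mean()
```

Clipping the probability ratio to [1 - eps, 1 + eps] and taking the elementwise minimum removes the incentive to move the new policy far from the old one, which is what discourages the destructively large updates that lead to the performance collapse described above.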