Foundations of Deep Reinforcement Learning: Theory and Practice in Python

Book description

The Contemporary Introduction to Deep Reinforcement Learning that Combines Theory and Practice

Deep reinforcement learning (deep RL) combines deep learning and reinforcement learning, in which artificial agents learn to solve sequential decision-making problems. In the past decade, deep RL has achieved remarkable results on a range of problems, from single-player and multiplayer games (such as Go, Atari games, and DotA 2) to robotics.

Foundations of Deep Reinforcement Learning is an introduction to deep RL that uniquely combines both theory and implementation. It starts with intuition, then carefully explains the theory of deep RL algorithms, discusses implementations in its companion software library SLM Lab, and finishes with the practical details of getting deep RL to work.

This guide is ideal for both computer science students and software engineers who are familiar with basic machine learning concepts and have a working understanding of Python.

  • Understand each key aspect of a deep RL problem

  • Explore policy- and value-based algorithms, including REINFORCE, SARSA, DQN, Double DQN, and Prioritized Experience Replay (PER)

  • Delve into combined algorithms, including Actor-Critic and Proximal Policy Optimization (PPO)

  • Understand how algorithms can be parallelized synchronously and asynchronously

  • Run algorithms in SLM Lab and learn the practical implementation details for getting deep RL to work

  • Explore algorithm benchmark results with tuned hyperparameters

  • Understand how deep RL environments are designed

Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.

Table of contents

  1. Cover Page
  2. About This eBook
  3. Half Title Page
  4. Title Page
  5. Copyright Page
  6. Dedication Page
  7. Contents
  8. Foreword
  9. Preface
  10. Acknowledgments
  11. About the Authors
  12. 1. Introduction to Reinforcement Learning
    1. 1.1 Reinforcement Learning
    2. 1.2 Reinforcement Learning as MDP
    3. 1.3 Learnable Functions in Reinforcement Learning
    4. 1.4 Deep Reinforcement Learning Algorithms
      1. 1.4.1 Policy-Based Algorithms
      2. 1.4.2 Value-Based Algorithms
      3. 1.4.3 Model-Based Algorithms
      4. 1.4.4 Combined Methods
      5. 1.4.5 Algorithms Covered in This Book
      6. 1.4.6 On-Policy and Off-Policy Algorithms
      7. 1.4.7 Summary
    5. 1.5 Deep Learning for Reinforcement Learning
    6. 1.6 Reinforcement Learning and Supervised Learning
      1. 1.6.1 Lack of an Oracle
      2. 1.6.2 Sparsity of Feedback
      3. 1.6.3 Data Generation
    7. 1.7 Summary
  13. I: Policy-Based and Value-Based Algorithms
    1. 2. REINFORCE
      1. 2.1 Policy
      2. 2.2 The Objective Function
      3. 2.3 The Policy Gradient
        1. 2.3.1 Policy Gradient Derivation
      4. 2.4 Monte Carlo Sampling
      5. 2.5 REINFORCE Algorithm
        1. 2.5.1 Improving REINFORCE
      6. 2.6 Implementing REINFORCE
        1. 2.6.1 A Minimal REINFORCE Implementation
        2. 2.6.2 Constructing Policies with PyTorch
        3. 2.6.3 Sampling Actions
        4. 2.6.4 Calculating Policy Loss
        5. 2.6.5 REINFORCE Training Loop
        6. 2.6.6 On-Policy Replay Memory
      7. 2.7 Training a REINFORCE Agent
      8. 2.8 Experimental Results
        1. 2.8.1 Experiment: The Effect of Discount Factor
        2. 2.8.2 Experiment: The Effect of Baseline
      9. 2.9 Summary
      10. 2.10 Further Reading
      11. 2.11 History
    2. 3. SARSA
      1. 3.1 The Q- and V-Functions
      2. 3.2 Temporal Difference Learning
        1. 3.2.1 Intuition for Temporal Difference Learning
      3. 3.3 Action Selection in SARSA
        1. 3.3.1 Exploration and Exploitation
      4. 3.4 SARSA Algorithm
        1. 3.4.1 On-Policy Algorithms
      5. 3.5 Implementing SARSA
        1. 3.5.1 Action Function: ε-Greedy
        2. 3.5.2 Calculating the Q-Loss
        3. 3.5.3 SARSA Training Loop
        4. 3.5.4 On-Policy Batched Replay Memory
      6. 3.6 Training a SARSA Agent
      7. 3.7 Experimental Results
        1. 3.7.1 Experiment: The Effect of Learning Rate
      8. 3.8 Summary
      9. 3.9 Further Reading
      10. 3.10 History
    3. 4. Deep Q-Networks (DQN)
      1. 4.1 Learning the Q-Function in DQN
      2. 4.2 Action Selection in DQN
        1. 4.2.1 The Boltzmann Policy
      3. 4.3 Experience Replay
      4. 4.4 DQN Algorithm
      5. 4.5 Implementing DQN
        1. 4.5.1 Calculating the Q-Loss
        2. 4.5.2 DQN Training Loop
        3. 4.5.3 Replay Memory
      6. 4.6 Training a DQN Agent
      7. 4.7 Experimental Results
        1. 4.7.1 Experiment: The Effect of Network Architecture
      8. 4.8 Summary
      9. 4.9 Further Reading
      10. 4.10 History
    4. 5. Improving DQN
      1. 5.1 Target Networks
      2. 5.2 Double DQN
      3. 5.3 Prioritized Experience Replay (PER)
        1. 5.3.1 Importance Sampling
      4. 5.4 Modified DQN Implementation
        1. 5.4.1 Network Initialization
        2. 5.4.2 Calculating the Q-Loss
        3. 5.4.3 Updating the Target Network
        4. 5.4.4 DQN with Target Networks
        5. 5.4.5 Double DQN
        6. 5.4.6 Prioritized Experience Replay
      5. 5.5 Training a DQN Agent to Play Atari Games
      6. 5.6 Experimental Results
        1. 5.6.1 Experiment: The Effect of Double DQN and PER
      7. 5.7 Summary
      8. 5.8 Further Reading
  14. II: Combined Methods
    1. 6. Advantage Actor-Critic (A2C)
      1. 6.1 The Actor
      2. 6.2 The Critic
        1. 6.2.1 The Advantage Function
        2. 6.2.2 Learning the Advantage Function
      3. 6.3 A2C Algorithm
      4. 6.4 Implementing A2C
        1. 6.4.1 Advantage Estimation
        2. 6.4.2 Calculating Value Loss and Policy Loss
        3. 6.4.3 Actor-Critic Training Loop
      5. 6.5 Network Architecture
      6. 6.6 Training an A2C Agent
        1. 6.6.1 A2C with n-Step Returns on Pong
        2. 6.6.2 A2C with GAE on Pong
        3. 6.6.3 A2C with n-Step Returns on BipedalWalker
      7. 6.7 Experimental Results
        1. 6.7.1 Experiment: The Effect of n-Step Returns
        2. 6.7.2 Experiment: The Effect of λ of GAE
      8. 6.8 Summary
      9. 6.9 Further Reading
      10. 6.10 History
    2. 7. Proximal Policy Optimization (PPO)
      1. 7.1 Surrogate Objective
        1. 7.1.1 Performance Collapse
        2. 7.1.2 Modifying the Objective
      2. 7.2 Proximal Policy Optimization (PPO)
      3. 7.3 PPO Algorithm
      4. 7.4 Implementing PPO
        1. 7.4.1 Calculating the PPO Policy Loss
        2. 7.4.2 PPO Training Loop
      5. 7.5 Training a PPO Agent
        1. 7.5.1 PPO on Pong
        2. 7.5.2 PPO on BipedalWalker
      6. 7.6 Experimental Results
        1. 7.6.1 Experiment: The Effect of λ of GAE
        2. 7.6.2 Experiment: The Effect of Clipping Variable ε
      7. 7.7 Summary
      8. 7.8 Further Reading
    3. 8. Parallelization Methods
      1. 8.1 Synchronous Parallelization
      2. 8.2 Asynchronous Parallelization
        1. 8.2.1 Hogwild!
      3. 8.3 Training an A3C Agent
      4. 8.4 Summary
      5. 8.5 Further Reading
    4. 9. Algorithm Summary
  15. III: Practical Details
    1. 10. Getting Deep RL to Work
      1. 10.1 Software Engineering Practices
        1. 10.1.1 Unit Tests
        2. 10.1.2 Code Quality
        3. 10.1.3 Git Workflow
      2. 10.2 Debugging Tips
        1. 10.2.1 Signs of Life
        2. 10.2.2 Policy Gradient Diagnoses
        3. 10.2.3 Data Diagnoses
        4. 10.2.4 Preprocessor
        5. 10.2.5 Memory
        6. 10.2.6 Algorithmic Functions
        7. 10.2.7 Neural Networks
        8. 10.2.8 Algorithm Simplification
        9. 10.2.9 Problem Simplification
        10. 10.2.10 Hyperparameters
        11. 10.2.11 Lab Workflow
      3. 10.3 Atari Tricks
      4. 10.4 Deep RL Almanac
        1. 10.4.1 Hyperparameter Tables
        2. 10.4.2 Algorithm Performance Comparison
      5. 10.5 Summary
    2. 11. SLM Lab
      1. 11.1 Algorithms Implemented in SLM Lab
      2. 11.2 Spec File
        1. 11.2.1 Search Spec Syntax
      3. 11.3 Running SLM Lab
        1. 11.3.1 SLM Lab Commands
      4. 11.4 Analyzing Experiment Results
        1. 11.4.1 Overview of the Experiment Data
      5. 11.5 Summary
    3. 12. Network Architectures
      1. 12.1 Types of Neural Networks
        1. 12.1.1 Multilayer Perceptrons (MLPs)
        2. 12.1.2 Convolutional Neural Networks (CNNs)
        3. 12.1.3 Recurrent Neural Networks (RNNs)
      2. 12.2 Guidelines for Choosing a Network Family
        1. 12.2.1 MDPs vs. POMDPs
        2. 12.2.2 Choosing Networks for Environments
      3. 12.3 The Net API
        1. 12.3.1 Input and Output Layer Shape Inference
        2. 12.3.2 Automatic Network Construction
        3. 12.3.3 Training Step
        4. 12.3.4 Exposure of Underlying Methods
      4. 12.4 Summary
      5. 12.5 Further Reading
    4. 13. Hardware
      1. 13.1 Computer
      2. 13.2 Data Types
      3. 13.3 Optimizing Data Types in RL
      4. 13.4 Choosing Hardware
      5. 13.5 Summary
  16. IV: Environment Design
    1. 14. States
      1. 14.1 Examples of States
      2. 14.2 State Completeness
      3. 14.3 State Complexity
      4. 14.4 State Information Loss
        1. 14.4.1 Image Grayscaling
        2. 14.4.2 Discretization
        3. 14.4.3 Hash Conflict
        4. 14.4.4 Metainformation Loss
      5. 14.5 Preprocessing
        1. 14.5.1 Standardization
        2. 14.5.2 Image Preprocessing
        3. 14.5.3 Temporal Preprocessing
      6. 14.6 Summary
    2. 15. Actions
      1. 15.1 Examples of Actions
      2. 15.2 Action Completeness
      3. 15.3 Action Complexity
      4. 15.4 Summary
      5. 15.5 Further Reading: Action Design in Everyday Things
    3. 16. Rewards
      1. 16.1 The Role of Rewards
      2. 16.2 Reward Design Guidelines
      3. 16.3 Summary
    4. 17. Transition Function
      1. 17.1 Feasibility Checks
      2. 17.2 Reality Check
      3. 17.3 Summary
  17. Epilogue
  18. A. Deep Reinforcement Learning Timeline
  19. B. Example Environments
    1. B.1 Discrete Environments
      1. B.1.1 CartPole-v0
      2. B.1.2 MountainCar-v0
      3. B.1.3 LunarLander-v2
      4. B.1.4 PongNoFrameskip-v4
      5. B.1.5 BreakoutNoFrameskip-v4
    2. B.2 Continuous Environments
      1. B.2.1 Pendulum-v0
      2. B.2.2 BipedalWalker-v2
  20. References
  21. Index
  22. Credits

Product information

  • Title: Foundations of Deep Reinforcement Learning: Theory and Practice in Python
  • Author(s): Laura Graesser, Wah Loon Keng
  • Release date: December 2019
  • Publisher(s): Addison-Wesley Professional
  • ISBN: 9780135172490