Grokking Deep Reinforcement Learning

Book description

Grokking Deep Reinforcement Learning uses engaging exercises to teach you how to build deep reinforcement learning systems. This book combines annotated Python code with intuitive explanations to explore DRL techniques. You'll see how algorithms function and learn to develop your own DRL agents using evaluative feedback.

Table of contents

  1. Grokking Deep Reinforcement Learning
  2. Copyright
  3. dedication
  4. contents
  5. front matter
    1. foreword
    2. preface
    3. acknowledgments
    4. about this book
      1. Who should read this book
      2. How this book is organized: a roadmap
      3. About the code
      4. liveBook discussion forum
    5. about the author
  6. 1 Introduction to deep reinforcement learning
    1. What is deep reinforcement learning?
      1. Deep reinforcement learning is a machine learning approach to artificial intelligence
      2. Deep reinforcement learning is concerned with creating computer programs
      3. Deep reinforcement learning agents can solve problems that require intelligence
      4. Deep reinforcement learning agents improve their behavior through trial-and-error learning
      5. Deep reinforcement learning agents learn from sequential feedback
      6. Deep reinforcement learning agents learn from evaluative feedback
      7. Deep reinforcement learning agents learn from sampled feedback
      8. Deep reinforcement learning agents use powerful non-linear function approximation
    2. The past, present, and future of deep reinforcement learning
      1. Recent history of artificial intelligence and deep reinforcement learning
      2. Artificial intelligence winters
      3. The current state of artificial intelligence
      4. Progress in deep reinforcement learning
      5. Opportunities ahead
    3. The suitability of deep reinforcement learning
      1. What are the pros and cons?
      2. Deep reinforcement learning’s strengths
      3. Deep reinforcement learning’s weaknesses
    4. Setting clear two-way expectations
      1. What to expect from the book?
      2. How to get the most out of this book
      3. Deep reinforcement learning development environment
    5. Summary
  7. 2 Mathematical foundations of reinforcement learning
    1. Components of reinforcement learning
      1. Examples of problems, agents, and environments
      2. The agent: The decision maker
      3. The environment: Everything else
      4. Agent-environment interaction cycle
    2. MDPs: The engine of the environment
      1. States: Specific configurations of the environment
      2. Actions: A mechanism to influence the environment
      3. Transition function: Consequences of agent actions
      4. Reward signal: Carrots and sticks
      5. Horizon: Time changes what’s optimal
      6. Discount: The future is uncertain, value it less
      7. Extensions to MDPs
      8. Putting it all together
    3. Summary
  8. 3 Balancing immediate and long-term goals
    1. The objective of a decision-making agent
      1. Policies: Per-state action prescriptions
      2. State-value function: What to expect from here?
      3. Action-value function: What should I expect from here if I do this?
      4. Action-advantage function: How much better if I do that?
      5. Optimality
    2. Planning optimal sequences of actions
      1. Policy evaluation: Rating policies
      2. Policy improvement: Using ratings to get better
      3. Policy iteration: Improving upon improved behaviors
      4. Value iteration: Improving behaviors early
    3. Summary
  9. 4 Balancing the gathering and use of information
    1. The challenge of interpreting evaluative feedback
      1. Bandits: Single-state decision problems
      2. Regret: The cost of exploration
      3. Approaches to solving MAB environments
      4. Greedy: Always exploit
      5. Random: Always explore
      6. Epsilon-greedy: Almost always greedy and sometimes random
      7. Decaying epsilon-greedy: First maximize exploration, then exploitation
      8. Optimistic initialization: Start off believing it’s a wonderful world
    2. Strategic exploration
      1. Softmax: Select actions randomly in proportion to their estimates
      2. UCB: It’s not about optimism, it’s about realistic optimism
      3. Thompson sampling: Balancing reward and risk
    3. Summary
  10. 5 Evaluating agents’ behaviors
    1. Learning to estimate the value of policies
      1. First-visit Monte Carlo: Improving estimates after each episode
      2. Every-visit Monte Carlo: A different way of handling state visits
      3. Temporal-difference learning: Improving estimates after each step
    2. Learning to estimate from multiple steps
      1. N-step TD learning: Improving estimates after a couple of steps
      2. Forward-view TD(λ): Improving estimates of all visited states
      3. TD(λ): Improving estimates of all visited states after each step
    3. Summary
  11. 6 Improving agents’ behaviors
    1. The anatomy of reinforcement learning agents
      1. Most agents gather experience samples
      2. Most agents estimate something
      3. Most agents improve a policy
      4. Generalized policy iteration
    2. Learning to improve policies of behavior
      1. Monte Carlo control: Improving policies after each episode
      2. SARSA: Improving policies after each step
    3. Decoupling behavior from learning
      1. Q-learning: Learning to act optimally, even if we choose not to
      2. Double Q-learning: A max of estimates for an estimate of a max
    4. Summary
  12. 7 Achieving goals more effectively and efficiently
    1. Learning to improve policies using robust targets
      1. SARSA(λ): Improving policies after each step based on multi-step estimates
      2. Watkins’s Q(λ): Decoupling behavior from learning, again
    2. Agents that interact, learn, and plan
      1. Dyna-Q: Learning sample models
      2. Trajectory sampling: Making plans for the immediate future
    3. Summary
  13. 8 Introduction to value-based deep reinforcement learning
    1. The kind of feedback deep reinforcement learning agents use
      1. Deep reinforcement learning agents deal with sequential feedback
      2. But, if it isn’t sequential, what is it?
      3. Deep reinforcement learning agents deal with evaluative feedback
      4. But, if it isn’t evaluative, what is it?
      5. Deep reinforcement learning agents deal with sampled feedback
      6. But, if it isn’t sampled, what is it?
    2. Introduction to function approximation for reinforcement learning
      1. Reinforcement learning problems can have high-dimensional state and action spaces
      2. Reinforcement learning problems can have continuous state and action spaces
      3. There are advantages when using function approximation
    3. NFQ: The first attempt at value-based deep reinforcement learning
      1. First decision point: Selecting a value function to approximate
      2. Second decision point: Selecting a neural network architecture
      3. Third decision point: Selecting what to optimize
      4. Fourth decision point: Selecting the targets for policy evaluation
      5. Fifth decision point: Selecting an exploration strategy
      6. Sixth decision point: Selecting a loss function
      7. Seventh decision point: Selecting an optimization method
      8. Things that could (and do) go wrong
    4. Summary
  14. 9 More stable value-based methods
    1. DQN: Making reinforcement learning more like supervised learning
      1. Common problems in value-based deep reinforcement learning
      2. Using target networks
      3. Using larger networks
      4. Using experience replay
      5. Using other exploration strategies
    2. Double DQN: Mitigating the overestimation of action-value functions
      1. The problem of overestimation, take two
      2. Separating action selection from action evaluation
      3. A solution
      4. A more practical solution
      5. A more forgiving loss function
      6. Things we can still improve on
    3. Summary
  15. 10 Sample-efficient value-based methods
    1. Dueling DDQN: A reinforcement-learning-aware neural network architecture
      1. Reinforcement learning isn’t a supervised learning problem
      2. Nuances of value-based deep reinforcement learning methods
      3. Advantage of using advantages
      4. A reinforcement-learning-aware architecture
      5. Building a dueling network
      6. Reconstructing the action-value function
      7. Continuously updating the target network
      8. What does the dueling network bring to the table?
    2. PER: Prioritizing the replay of meaningful experiences
      1. A smarter way to replay experiences
      2. Then, what’s a good measure of “important” experiences?
      3. Greedy prioritization by TD error
      4. Sampling prioritized experiences stochastically
      5. Proportional prioritization
      6. Rank-based prioritization
      7. Prioritization bias
    3. Summary
  16. 11 Policy-gradient and actor-critic methods
    1. REINFORCE: Outcome-based policy learning
      1. Introduction to policy-gradient methods
      2. Advantages of policy-gradient methods
      3. Learning policies directly
      4. Reducing the variance of the policy gradient
    2. VPG: Learning a value function
      1. Further reducing the variance of the policy gradient
      2. Learning a value function
      3. Encouraging exploration
    3. A3C: Parallel policy updates
      1. Using actor-workers
      2. Using n-step estimates
      3. Non-blocking model updates
    4. GAE: Robust advantage estimation
      1. Generalized advantage estimation
    5. A2C: Synchronous policy updates
      1. Weight-sharing model
      2. Restoring order in policy updates
    6. Summary
  17. 12 Advanced actor-critic methods
    1. DDPG: Approximating a deterministic policy
      1. DDPG uses many tricks from DQN
      2. Learning a deterministic policy
      3. Exploration with deterministic policies
    2. TD3: State-of-the-art improvements over DDPG
      1. Double learning in DDPG
      2. Smoothing the targets used for policy updates
      3. Delaying updates
    3. SAC: Maximizing the expected return and entropy
      1. Adding the entropy to the Bellman equations
      2. Learning the action-value function
      3. Learning the policy
      4. Automatically tuning the entropy coefficient
    4. PPO: Restricting optimization steps
      1. Using the same actor-critic architecture as A2C
      2. Batching experiences
      3. Clipping the policy updates
      4. Clipping the value function updates
    5. Summary
  18. 13 Toward artificial general intelligence
    1. What was covered and what notably wasn’t?
      1. Markov decision processes
      2. Planning methods
      3. Bandit methods
      4. Tabular reinforcement learning
      5. Value-based deep reinforcement learning
      6. Policy-based and actor-critic deep reinforcement learning
      7. Advanced actor-critic techniques
      8. Model-based deep reinforcement learning
      9. Derivative-free optimization methods
    2. More advanced concepts toward AGI
      1. What is AGI, again?
      2. Advanced exploration strategies
      3. Inverse reinforcement learning
      4. Transfer learning
      5. Multi-task learning
      6. Curriculum learning
      7. Meta learning
      8. Hierarchical reinforcement learning
      9. Multi-agent reinforcement learning
      10. Explainable AI, safety, fairness, and ethical standards
    3. What happens next?
      1. How to use DRL to solve custom problems
      2. Going forward
      3. Get yourself out there! Now!
    4. Summary
  19. index

Product information

  • Title: Grokking Deep Reinforcement Learning
  • Author(s): Miguel Morales
  • Release date: December 2020
  • Publisher(s): Manning Publications
  • ISBN: 9781617295454