Chapter 15. Trust Regions – TRPO, PPO, and ACKTR

In this chapter, we'll take a look at the approaches used to improve the stability of the stochastic policy gradient method. Several attempts have been made to make policy improvement more stable, and we'll focus on three methods: Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), and Actor-Critic (A2C) using Kronecker-Factored Trust Region (ACKTR).

To compare them with the A2C baseline, we'll use several environments from the Roboschool library created by OpenAI.
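As a quick orientation, the sketch below shows how a Roboschool environment can be created through the Gym interface; importing the roboschool package registers its environments with Gym so they can be constructed by name. This is a minimal example, assuming the gym and roboschool packages are installed and using RoboschoolHalfCheetah-v1 as just one example of the registered environment names.

import gym
import roboschool  # noqa: F401 -- importing registers the Roboschool environments in Gym

env = gym.make("RoboschoolHalfCheetah-v1")
obs = env.reset()
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

# Continuous control: the action is a real-valued vector, sampled randomly here for illustration
action = env.action_space.sample()
obs, reward, done, info = env.step(action)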

Introduction

The overall motivation of the methods that we'll take a look at is to improve the stability of the policy update during training. Intuitively, there is a dilemma: on the one hand, we'd like to train ...
