Reinforcement Learning

Name: Reinforcement Learning
Author: Phil Winder
ISBN: 9781098114831

by Phil Winder

November 2020

Intermediate to advanced

408 pages

11h 49m

English

O'Reilly Media, Inc.

Preface
Objective; Who Should Read This Book?; Guiding Principles and Style; Prerequisites; Scope and Outline; Supplementary Materials; Conventions Used in This Book; Acronyms; Mathematical Notation; Fair Use Policy; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Why Reinforcement Learning?
Why Now?; Machine Learning; Reinforcement Learning; When Should You Use RL?; RL Applications; Taxonomy of RL Approaches; Model-Free or Model-Based; How Agents Use and Update Their Strategy; Discrete or Continuous Actions; Optimization Methods; Policy Evaluation and Improvement; Fundamental Concepts in Reinforcement Learning; The First RL Algorithm; Is RL the Same as ML?; Reward and Feedback; Reinforcement Learning as a Discipline; Summary; Further Reading
2. Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods
Multi-Arm Bandit Testing; Reward Engineering; Policy Evaluation: The Value Function; Policy Improvement: Choosing the Best Action; Simulating the Environment; Running the Experiment; Improving the ϵ-greedy Algorithm; Markov Decision Processes; Inventory Control; Inventory Control Simulation; Policies and Value Functions; Discounted Rewards; Predicting Rewards with the State-Value Function; Predicting Rewards with the Action-Value Function; Optimal Policies; Monte Carlo Policy Generation; Value Iteration with Dynamic Programming; Implementing Value Iteration; Results of Value Iteration; Summary; Further Reading
3. Temporal-Difference Learning, Q-Learning, and n-Step Algorithms
Formulation of Temporal-Difference Learning; Q-Learning; SARSA; Q-Learning Versus SARSA; Case Study: Automatically Scaling Application Containers to Reduce Cost; Industrial Example: Real-Time Bidding in Advertising; Defining the MDP; Results of the Real-Time Bidding Environments; Further Improvements; Extensions to Q-Learning; Double Q-Learning; Delayed Q-Learning; Comparing Standard, Double, and Delayed Q-learning; Opposition Learning; n-Step Algorithms; n-Step Algorithms on Grid Environments; Eligibility Traces; Extensions to Eligibility Traces; Watkins’s Q(λ); Fuzzy Wipes in Watkins’s Q(λ); Speedy Q-Learning; Accumulating Versus Replacing Eligibility Traces; Summary; Further Reading
4. Deep Q-Networks
Deep Learning Architectures; Fundamentals; Common Neural Network Architectures; Deep Learning Frameworks; Deep Reinforcement Learning; Deep Q-Learning; Experience Replay; Q-Network Clones; Neural Network Architecture; Implementing DQN; Example: DQN on the CartPole Environment; Case Study: Reducing Energy Usage in Buildings; Rainbow DQN; Distributional RL; Prioritized Experience Replay; Noisy Nets; Dueling Networks; Example: Rainbow DQN on Atari Games; Results; Discussion; Other DQN Improvements; Improving Exploration; Improving Rewards; Learning from Offline Data; Summary; Further Reading
5. Policy Gradient Methods
Benefits of Learning a Policy Directly; How to Calculate the Gradient of a Policy; Policy Gradient Theorem; Policy Functions; Linear Policies; Arbitrary Policies; Basic Implementations; Monte Carlo (REINFORCE); REINFORCE with Baseline; Gradient Variance Reduction; n-Step Actor-Critic and Advantage Actor-Critic (A2C); Eligibility Traces Actor-Critic; A Comparison of Basic Policy Gradient Algorithms; Industrial Example: Automatically Purchasing Products for Customers; The Environment: Gym-Shopping-Cart; Expectations; Results from the Shopping Cart Environment; Summary; Further Reading
6. Beyond Policy Gradients
Off-Policy Algorithms; Importance Sampling; Behavior and Target Policies; Off-Policy Q-Learning; Gradient Temporal-Difference Learning; Greedy-GQ; Off-Policy Actor-Critics; Deterministic Policy Gradients; Deterministic Policy Gradients; Deep Deterministic Policy Gradients; Twin Delayed DDPG; Case Study: Recommendations Using Reviews; Improvements to DPG; Trust Region Methods; Kullback–Leibler Divergence; Natural Policy Gradients and Trust Region Policy Optimization; Proximal Policy Optimization; Example: Using Servos for a Real-Life Reacher; Experiment Setup; RL Algorithm Implementation; Increasing the Complexity of the Algorithm; Hyperparameter Tuning in a Simulation; Resulting Policies; Other Policy Gradient Algorithms; Retrace(λ); Actor-Critic with Experience Replay (ACER); Actor-Critic Using Kronecker-Factored Trust Regions (ACKTR); Emphatic Methods; Extensions to Policy Gradient Algorithms; Quantile Regression in Policy Gradient Algorithms; Summary; Which Algorithm Should I Use?; A Note on Asynchronous Methods; Further Reading
7. Learning All Possible Policies with Entropy Methods
What Is Entropy?; Maximum Entropy Reinforcement Learning; Soft Actor-Critic; SAC Implementation Details and Discrete Action Spaces; Automatically Adjusting Temperature; Case Study: Automated Traffic Management to Reduce Queuing; Extensions to Maximum Entropy Methods; Other Measures of Entropy (and Ensembles); Optimistic Exploration Using the Upper Bound of Double Q-Learning; Tinkering with Experience Replay; Soft Policy Gradient; Soft Q-Learning (and Derivatives); Path Consistency Learning; Performance Comparison: SAC Versus PPO; How Does Entropy Encourage Exploration?; How Does the Temperature Parameter Alter Exploration?; Industrial Example: Learning to Drive with a Remote Control Car; Description of the Problem; Minimizing Training Time; Dramatic Actions; Hyperparameter Search; Final Policy; Further Improvements; Summary; Equivalence Between Policy Gradients and Soft Q-Learning; What Does This Mean For the Future?; What Does This Mean Now?
8. Improving How an Agent Learns
Rethinking the MDP; Partially Observable Markov Decision Process; Case Study: Using POMDPs in Autonomous Vehicles; Contextual Markov Decision Processes; MDPs with Changing Actions; Regularized MDPs; Hierarchical Reinforcement Learning; Naive HRL; High-Low Hierarchies with Intrinsic Rewards (HIRO); Learning Skills and Unsupervised RL; Using Skills in HRL; HRL Conclusions; Multi-Agent Reinforcement Learning; MARL Frameworks; Centralized or Decentralized; Single-Agent Algorithms; Case Study: Using Single-Agent Decentralized Learning in UAVs; Centralized Learning, Decentralized Execution; Decentralized Learning; Other Combinations; Challenges of MARL; MARL Conclusions; Expert Guidance; Behavior Cloning; Imitation RL; Inverse RL; Curriculum Learning; Other Paradigms; Meta-Learning; Transfer Learning; Summary; Further Reading
9. Practical Reinforcement Learning
The RL Project Life Cycle; Life Cycle Definition; Problem Definition: What Is an RL Project?; RL Problems Are Sequential; RL Problems Are Strategic; Low-Level RL Indicators; Types of Learning; RL Engineering and Refinement; Process; Environment Engineering; State Engineering or State Representation Learning; Policy Engineering; Mapping Policies to Action Spaces; Exploration; Reward Engineering; Summary; Further Reading

10. Operational Reinforcement Learning
Implementation; Frameworks; Scaling RL; Evaluation; Deployment; Goals; Architecture; Ancillary Tooling; Safety, Security, and Ethics; Summary; Further Reading
11. Conclusions and the Future
Tips and Tricks; Framing the Problem; Your Data; Training; Evaluation; Deployment; Debugging; ${ALGORITHM_NAME} Can’t Solve ${ENVIRONMENT}!; Monitoring for Debugging; The Future of Reinforcement Learning; RL Market Opportunities; Future RL and Research Directions; Concluding Remarks; Next Steps; Now It’s Your Turn; Further Reading
A. The Gradient of a Logistic Policy for Two Actions
B. The Gradient of a Softmax Policy
Glossary
Acronyms and Common Terms; Symbols and Notation
Index
About the Author
Contact Details

Content preview from Reinforcement Learning

Appendix B. The Gradient of a Softmax Policy

Equation B-1 shows the derivation of the gradient of a softmax policy. Note that the result has a similar form to the logistic policy gradients in Appendix A.

Equation B-1. Gradient of a softmax policy

\begin{aligned}
\nabla_{\theta} \ln \pi(\theta^{\top} s) &= \nabla_{\theta} \ln \frac{e^{\theta^{\top} s}}{\sum_{a} e^{\theta_{a}^{\top} s}} \\
&= \nabla_{\theta} \ln e^{\theta^{\top} s} - \nabla_{\theta} \ln \sum_{a} e^{\theta_{a}^{\top} s} \\
&= \nabla_{\theta} \theta^{\top} s - \nabla_{\theta} \ln \sum_{a} e^{\theta_{a}^{\top} s} \\
&= s - \nabla_{\theta} \ln \sum_{a} e^{\theta_{a}^{\top} s} \\
&= s - \frac{\nabla_{\theta} \sum_{a} e^{\theta_{a}^{\top} s}}{\sum_{a} e^{\theta_{a}^{\top} s}} \\
&= s - \frac{\sum_{a} s e^{\theta_{a}^{\top} s}}{\sum_{a} e^{\theta_{a}^{\top} s}} \\
&= s - \sum_{a} s \pi(\theta_{a}^{\top} s)
\end{aligned}
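
One way to sanity-check a result like Equation B-1 is to compare an analytic gradient with a finite-difference estimate. The NumPy sketch below is an illustration rather than code from the book. It assumes one particular parameterization: a weight matrix `theta` with one row per action, so that the gradient of the log-probability of the chosen action with respect to row `b` is `s * (1[b == action] - pi(b | s))`. This is the per-action counterpart of "s minus the probability-weighted sum over actions". The function names, array shapes, and random test values are all hypothetical.

```python
import numpy as np


def softmax_policy(theta, s):
    """Action probabilities for a linear softmax policy.

    theta: (n_actions, n_features) array, one weight vector per action (assumed layout).
    s: (n_features,) state feature vector.
    """
    logits = theta @ s
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()


def grad_log_softmax_policy(theta, s, action):
    """Analytic gradient of ln pi(action | s) with respect to theta.

    Row b of the result is s * (1[b == action] - pi(b | s)).
    """
    probs = softmax_policy(theta, s)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, s)


def numerical_grad(theta, s, action, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        bump = np.zeros_like(theta)
        bump[idx] = eps
        upper = np.log(softmax_policy(theta + bump, s)[action])
        lower = np.log(softmax_policy(theta - bump, s)[action])
        grad[idx] = (upper - lower) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))  # hypothetical: 3 actions, 4 state features
s = rng.normal(size=4)
action = 1

analytic = grad_log_softmax_policy(theta, s, action)
numeric = numerical_grad(theta, s, action)
max_abs_error = np.abs(analytic - numeric).max()
print(max_abs_error)  # should be close to zero if the analytic form is right
```

Because the action probabilities sum to one, the rows of the analytic gradient sum to the zero vector, which gives a second quick check alongside the finite-difference comparison.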

Publisher Resources

ISBN: 9781492072386
