Part V – Lookahead Policies

Lookahead policies are based on estimates of the impact of a decision on the future. There are two broad strategies for doing this:

  • Value function approximations If we are in a state St and take an action xt, we then observe new information Wt+1 (which is random at time t) that takes us to a new state St+1. If we can do a good job of approximating the value of being in state St+1, we can use this approximation to make a better decision xt now.
  • Direct lookahead approximations Here we explicitly plan decisions now, xt, and into the future, xt+1,...,xt+H, to help us make the best decision xt to implement now. The problem in stochastic models is that the decisions xt' for t' > t depend on future information, so they are random. A code sketch contrasting the two strategies is given after this list.
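The following is a minimal sketch (not from the book) contrasting the two strategies on a toy inventory problem: the state St is inventory on hand, the decision xt is how much to order, and Wt+1 is a random demand. All names (price, order_cost, V_bar, horizon) and the specific value function and inner rule are illustrative assumptions, not the book's notation or code.

```python
import numpy as np

rng = np.random.default_rng(0)
price, order_cost = 10.0, 6.0      # revenue per unit sold, cost per unit ordered
max_inv, max_order = 20, 10
mean_demand = 5.0

def contribution(S, x, W):
    """One-period contribution C(S_t, x_t) realized against demand W."""
    sales = min(S + x, W)
    return price * sales - order_cost * x

def transition(S, x, W):
    """S_{t+1} = max(0, S_t + x_t - W_{t+1}), capped at max_inv."""
    return min(max(S + x - W, 0), max_inv)

# --- Strategy 1: value function approximation -------------------------------
# Assume we already have an approximate value of being in the next state,
# here a crude concave function of inventory (purely illustrative).
def V_bar(S_next):
    return 2.0 * np.sqrt(S_next)

def vfa_policy(S, n_samples=200):
    """Choose x_t maximizing C(S_t, x_t) + E[V_bar(S_{t+1})] via Monte Carlo."""
    best_x, best_val = 0, -np.inf
    demands = rng.poisson(mean_demand, n_samples)
    for x in range(max_order + 1):
        val = np.mean([contribution(S, x, W) + V_bar(transition(S, x, W))
                       for W in demands])
        if val > best_val:
            best_x, best_val = x, val
    return best_x

# --- Strategy 2: direct lookahead (deterministic approximation) -------------
def dla_policy(S, horizon=3):
    """Plan x_t, ..., x_{t+H} against the expected demand; implement only x_t."""
    best_x, best_val = 0, -np.inf
    for x0 in range(max_order + 1):
        # Roll forward over the horizon using the point forecast of demand.
        val = contribution(S, x0, mean_demand)
        S_next = transition(S, x0, mean_demand)
        for _ in range(horizon):
            x = max_order if S_next < mean_demand else 0   # simple inner rule
            val += contribution(S_next, x, mean_demand)
            S_next = transition(S_next, x, mean_demand)
        if val > best_val:
            best_x, best_val = x0, val
    return best_x

print("VFA decision from S_t = 3:", vfa_policy(3))
print("DLA decision from S_t = 3:", dla_policy(3))
```

In both cases only the decision for the current period is implemented; the difference is whether the future is summarized by an approximate value of the next state or by an explicit plan over a horizon.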

The choice between using value functions versus direct lookaheads boils down to a single equation that gives the optimal policy at time t when we are in state St:

$$
X_t^{\pi^*}(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \left[ C(S_t, x_t) + \left( \max_{\pi \in \Pi} \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C\bigl(S_{t'}, X_{t'}^{\pi}(S_{t'})\bigr) \,\Big|\, S_t, x_t \right\} \right) \right]. \qquad (13.37)
$$

The challenge is balancing the contributions now, given by C(St,xt), against future contributions. If we could compute the future contributions, this would be an optimal policy. However, computing future contributions in the presence of a (random) sequential information process is almost always computationally intractable.
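As a sketch of how the first strategy simplifies this problem (written in generic notation rather than anything specific from this chapter), a value function approximation replaces the embedded maximization over policies with an approximation of the future contributions, leaving a single-period optimization:

$$
X_t^{VFA}(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \left( C(S_t, x_t) + \mathbb{E}\bigl\{ \overline{V}_{t+1}(S_{t+1}) \,\big|\, S_t, x_t \bigr\} \right).
$$

A direct lookahead instead approximates the future by solving an explicit (often simplified or deterministic) model of the decisions xt+1,...,xt+H, and then implements only xt.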

There are problems where we can create reasonable approximations of the future contributions. ...
