Part V – Lookahead Policies

Lookahead policies are based on estimates of the impact of a decision on the future. There are two broad strategies for doing this:

  • Value function approximations If we are in a state St and take an action xt, we then observe new information Wt+1 (which is random at time t) that takes us to a new state St+1. If we can do a good job of approximating the value of being in state St+1, we can use this approximation to make a better decision xt now.
  • Direct lookahead approximations Here we explicitly plan decisions now, xt, and into the future, xt+1,...,xt+H, to help us make the best decision xt to implement now. The problem in stochastic models is that the decisions xt' for t' > t depend on future information, so they are random. A code sketch contrasting the two strategies is given after this list.
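The following is a minimal sketch (not from the book) contrasting the two strategies on a toy inventory problem: the state St is inventory on hand, the decision xt is how much to order, and Wt+1 is a random demand. All names (price, order_cost, V_bar, horizon) and the specific value function and inner rule are illustrative assumptions, not the book's notation or code.

```python
import numpy as np

rng = np.random.default_rng(0)
price, order_cost = 10.0, 6.0      # revenue per unit sold, cost per unit ordered
max_inv, max_order = 20, 10
mean_demand = 5.0

def contribution(S, x, W):
    """One-period contribution C(S_t, x_t) realized against demand W."""
    sales = min(S + x, W)
    return price * sales - order_cost * x

def transition(S, x, W):
    """S_{t+1} = max(0, S_t + x_t - W_{t+1}), capped at max_inv."""
    return min(max(S + x - W, 0), max_inv)

# --- Strategy 1: value function approximation -------------------------------
# Assume we already have an approximate value of being in the next state,
# here a crude concave function of inventory (purely illustrative).
def V_bar(S_next):
    return 2.0 * np.sqrt(S_next)

def vfa_policy(S, n_samples=200):
    """Choose x_t maximizing C(S_t, x_t) + E[V_bar(S_{t+1})] via Monte Carlo."""
    best_x, best_val = 0, -np.inf
    demands = rng.poisson(mean_demand, n_samples)
    for x in range(max_order + 1):
        val = np.mean([contribution(S, x, W) + V_bar(transition(S, x, W))
                       for W in demands])
        if val > best_val:
            best_x, best_val = x, val
    return best_x

# --- Strategy 2: direct lookahead (deterministic approximation) -------------
def dla_policy(S, horizon=3):
    """Plan x_t, ..., x_{t+H} against the expected demand; implement only x_t."""
    best_x, best_val = 0, -np.inf
    for x0 in range(max_order + 1):
        # Roll forward over the horizon using the point forecast of demand.
        val = contribution(S, x0, mean_demand)
        S_next = transition(S, x0, mean_demand)
        for _ in range(horizon):
            x = max_order if S_next < mean_demand else 0   # simple inner rule
            val += contribution(S_next, x, mean_demand)
            S_next = transition(S_next, x, mean_demand)
        if val > best_val:
            best_x, best_val = x0, val
    return best_x

print("VFA decision from S_t = 3:", vfa_policy(3))
print("DLA decision from S_t = 3:", dla_policy(3))
```

In both cases only the decision for the current period is implemented; the difference is whether the future is summarized by an approximate value of the next state or by an explicit plan over a horizon.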

The choice between using value functions versus direct lookaheads boils down to a single equation that gives the optimal policy at time t when we are in state St:

$$
X_t^{\pi^*}(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \left[ C(S_t, x_t) + \left( \max_{\pi \in \Pi} \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C\bigl(S_{t'}, X_{t'}^{\pi}(S_{t'})\bigr) \,\Big|\, S_t, x_t \right\} \right) \right]. \qquad (13.37)
$$

The challenge is balancing the contributions now, given by C(St,xt), against future contributions. If we could compute the future contributions, this would be an optimal policy. However, computing future contributions in the presence of a (random) sequential information process is almost always computationally intractable.
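As a sketch of how the first strategy simplifies this problem (written in generic notation rather than anything specific from this chapter), a value function approximation replaces the embedded maximization over policies with an approximation of the future contributions, leaving a single-period optimization:

$$
X_t^{VFA}(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \left( C(S_t, x_t) + \mathbb{E}\bigl\{ \overline{V}_{t+1}(S_{t+1}) \,\big|\, S_t, x_t \bigr\} \right).
$$

A direct lookahead instead approximates the future by solving an explicit (often simplified or deterministic) model of the decisions xt+1,...,xt+H, and then implements only xt.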

There are problems where we can create reasonable approximations of the future contributions. ...
