6 Stepsize Policies

There is a wide range of adaptive learning problems that depend on an iteration of the form we first saw in chapter 5, which looks like

$$x^{n+1} = x^n + \alpha_n \nabla_x F(x^n, W^{n+1}). \qquad (6.1)$$

The stochastic gradient $\nabla_x F(x^n, W^{n+1})$ tells us what direction to go in, but we need the stepsize $\alpha_n$ to tell us how far we should move.
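To make the roles of the gradient and the stepsize concrete, the following is a minimal sketch of the update (6.1) in Python with a pluggable stepsize rule. The quadratic objective $F(x,W) = -\tfrac{1}{2}(x-W)^2$ and the harmonic stepsize $\alpha_n = \theta/(\theta + n - 1)$ are illustrative assumptions, not prescriptions from the text.

```python
# A minimal sketch of the update in (6.1): stochastic gradient ascent with a
# pluggable stepsize policy. The objective and the stepsize rule are
# assumptions made for illustration only.
import random


def harmonic_stepsize(n, theta=10.0):
    """Harmonic stepsize alpha_n = theta / (theta + n - 1), with n >= 1."""
    return theta / (theta + n - 1)


def stochastic_gradient(x, w):
    """Gradient of F(x, W) = -0.5 * (x - W)**2 with respect to x."""
    return w - x


x = 0.0                                   # starting point x^0
for n in range(1, 101):
    w = random.gauss(5.0, 1.0)            # sample W^{n+1}
    g = stochastic_gradient(x, w)         # nabla_x F(x^n, W^{n+1})
    x = x + harmonic_stepsize(n) * g      # the update in equation (6.1)

print(f"x after 100 iterations: {x:.3f}  (maximizer of E F(x, W) is 5.0)")
```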

There are two important settings where this formula is used. The first is where we are maximizing some metric such as contributions, utility, or performance. In these settings, the units of $\nabla_x F(x^n, W^{n+1})$ and the decision variable $x$ are different, so the stepsize has to perform the scaling so that the size of $\alpha_n \nabla_x F(x^n, W^{n+1})$ is neither too large nor too small relative to $x^n$.
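As an illustration of this scaling issue (the example is ours, not from the text), consider a newsvendor-style profit $F(x,W) = p\min(x,W) - cx$, where the decision $x$ is an order quantity but the stochastic gradient is measured in dollars per unit. The sketch below scales a harmonic stepsize by a constant $\alpha_0$ so that $\alpha_n \nabla_x F(x^n, W^{n+1})$ produces steps of a reasonable size in the units of $x$.

```python
# An illustrative sketch (not from the text) of the scaling issue. Here x is
# an order quantity while the stochastic gradient of the newsvendor profit
# F(x, W) = p*min(x, W) - c*x is in dollars per unit, so the stepsize must
# carry the scaling; alpha_0 below is a tunable scaling constant.
import random

p, c = 10.0, 4.0                       # price and cost (dollars per unit)
alpha_0, theta = 5.0, 20.0             # scaling constant and harmonic parameter


def grad_newsvendor(x, w):
    """A stochastic (sub)gradient of F(x, W) = p*min(x, W) - c*x at x."""
    return (p if x < w else 0.0) - c


x = 50.0                               # initial order quantity
for n in range(1, 1001):
    w = random.uniform(0.0, 200.0)     # observed demand W^{n+1}
    alpha_n = alpha_0 * theta / (theta + n - 1)
    x = max(0.0, x + alpha_n * grad_newsvendor(x, w))

print(f"Order quantity after 1000 iterations: {x:.1f}")
# The critical-ratio optimum for uniform(0, 200) demand is 200*(p - c)/p = 120.
```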

A second and very important setting arises in what is known as supervised learning. Here, we are trying to estimate some function $f(x|\theta)$ using observations $y = f(x|\theta) + \varepsilon$, which means $f(x|\theta)$ and $y$ have the same scale. We encounter these problems in three settings:

  • Approximating the function $\mathbb{E} F(x,W)$ to create an estimate $\bar{F}(x)$ that can be optimized (a minimal sketch of this setting follows the list below).
  • Approximating the value $V_t(S_t)$ of being in a state $S_t$ and then following some policy (we encounter this problem starting in chapters 16 and 17 when we introduce approximate dynamic programming).
  • Creating a parameterized policy $X^\pi(S|\theta)$ to fit observed decisions. Here, we assume we have access to some method of creating a decision $x$ and then we use this to create a parameterized ...
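For the first setting in this list, the following minimal sketch (with assumed Gaussian observation noise) rewrites the update (6.1) in its smoothing form $\bar{F}^n = (1-\alpha_n)\bar{F}^{n-1} + \alpha_n y^n$, which is what (6.1) reduces to for the objective $-\tfrac{1}{2}(\bar{F}-y)^2$. Because the observation $y$ and the estimate share the same scale, $\alpha_n$ acts as a unitless weight in $[0,1]$, and choosing $\alpha_n = 1/n$ reproduces the sample average.

```python
# A minimal sketch, with assumed Gaussian noise, of estimating a value via the
# smoothing form of (6.1): Fbar^n = (1 - alpha_n) * Fbar^{n-1} + alpha_n * y^n.
# With alpha_n = 1/n this reproduces the sample average of the observations.
import random

true_value = 3.0
fbar = 0.0                                    # initial estimate Fbar^0
observations = []

for n in range(1, 201):
    y = true_value + random.gauss(0.0, 1.0)   # noisy observation y^n
    observations.append(y)
    alpha_n = 1.0 / n
    fbar = (1.0 - alpha_n) * fbar + alpha_n * y

sample_mean = sum(observations) / len(observations)
print(f"Smoothed estimate: {fbar:.4f}")
print(f"Sample average:    {sample_mean:.4f}")   # matches the smoothed estimate
```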
