Chapter 8

Direct Policy Search by Expectation-Maximization

Gradient-based direct policy search methods introduced in Chapter 7 are particularly useful for controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method, based on the expectation-maximization (EM) algorithm, that does not contain a step size parameter. In Section 8.1, the main idea of the EM-based method is described; it is expected to converge faster because policies are updated more aggressively than in the gradient-based approach. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. Its experimental performance is then evaluated in Section 8.3, and the chapter is concluded in Section 8.4.

8.1 Expectation-Maximization Approach

The gradient-based optimization algorithms introduced in Section 7.2 gradually update policy parameters over iterations. Although this is advantageous when controlling a physical system, it requires many iterations until convergence. In this section, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is used to cope with this problem.

The basic idea of EM-based policy search is to iteratively update the policy parameter \theta by maximizing a lower bound of the expected return J(\theta):

    J(\theta) = \int p(h \mid \theta) R(h) \, dh.

To derive a lower bound of J(\theta), Jensen's inequality (Bishop, 2006) is utilized:

    \int q(h) f(g(h)) \, dh \ge f\left( \int q(h) g(h) \, dh \right),

where q is a probability density, f is a convex function, and g is a non-negative function. For f(t) = -\log t, Jensen's inequality yields

    \int q(h) \log g(h) \, dh \le \log \int q(h) g(h) \, dh.    (8.1)
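As a quick sanity check of inequality (8.1), the following snippet (a toy discretization for illustration, not from the book) evaluates both sides for an arbitrary probability mass function q and positive function g:

```python
import numpy as np

# Toy discrete check of Eq. (8.1): for a probability vector q and a
# positive function g, sum_h q(h) log g(h) <= log sum_h q(h) g(h).
rng = np.random.default_rng(0)
q = rng.random(5)
q /= q.sum()             # q is a probability density (here, a mass function)
g = rng.random(5) + 0.1  # g is non-negative (strictly positive for log)

lhs = np.sum(q * np.log(g))
rhs = np.log(np.sum(q * g))
print(lhs <= rhs)  # True; equality holds only when g is constant on the support of q
```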

Assume that the return R(h) is non-negative. Let \tilde{\theta} be the current policy parameter during the optimization procedure, and set q and g in Eq. (8.1) as

    q(h) = \frac{p(h \mid \tilde{\theta}) R(h)}{J(\tilde{\theta})}
    \quad \text{and} \quad
    g(h) = \frac{p(h \mid \theta)}{p(h \mid \tilde{\theta})}.

Then the following lower bound holds for all \theta:

    \log \frac{J(\theta)}{J(\tilde{\theta})}
    = \log \int \frac{p(h \mid \theta) R(h)}{J(\tilde{\theta})} \, dh
    = \log \int \frac{p(h \mid \tilde{\theta}) R(h)}{J(\tilde{\theta})}
      \frac{p(h \mid \theta)}{p(h \mid \tilde{\theta})} \, dh
    \ge \int \frac{p(h \mid \tilde{\theta}) R(h)}{J(\tilde{\theta})}
      \log \frac{p(h \mid \theta)}{p(h \mid \tilde{\theta})} \, dh.

This yields

    \log J(\theta) \ge \log \tilde{J}(\theta),

where

    \log \tilde{J}(\theta)
    = \int \frac{R(h) p(h \mid \tilde{\theta})}{J(\tilde{\theta})}
      \log \frac{p(h \mid \theta)}{p(h \mid \tilde{\theta})} \, dh
      + \log J(\tilde{\theta}).

In the EM approach, the parameter \theta is iteratively updated by maximizing the lower bound \tilde{J}(\theta):

    \hat{\theta} = \mathop{\mathrm{argmax}}_{\theta} \tilde{J}(\theta).

Since \log J(\tilde{\theta}) and p(h \mid \tilde{\theta}) do not depend on \theta, this maximization is equivalent to maximizing the reward-weighted log-likelihood \int p(h \mid \tilde{\theta}) R(h) \log p(h \mid \theta) \, dh. Furthermore, since \log \tilde{J}(\tilde{\theta}) = \log J(\tilde{\theta}), the lower bound \tilde{J} touches the target function J at the current solution \tilde{\theta}:

    \tilde{J}(\tilde{\theta}) = J(\tilde{\theta}).

Thus, monotone non-decrease of the expected return is guaranteed:

    J(\hat{\theta}) \ge J(\tilde{\theta}).

This update is iterated until convergence (see Figure 8.1).
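The touching property and the monotone non-decrease can be verified numerically on a toy problem. The sketch below (an illustrative two-trajectory Bernoulli model, not from the book) takes p(h | \theta) to be Bernoulli(\theta) over h \in \{0, 1\} with non-negative returns R(0), R(1); for this model the maximizer of the lower bound has the closed form \hat{\theta} = \tilde{\theta} R(1) / J(\tilde{\theta}):

```python
import numpy as np

# Toy EM-based policy search: h in {0, 1}, p(h|theta) = Bernoulli(theta),
# returns R = (R(0), R(1)) assumed non-negative.
R = np.array([1.0, 2.0])

def J(theta):
    # Expected return J(theta) = sum_h p(h|theta) R(h)
    return (1 - theta) * R[0] + theta * R[1]

def em_update(theta_tilde):
    # argmax of the lower bound = reward-weighted maximum likelihood,
    # which for the Bernoulli model is theta~ R(1) / J(theta~)
    return theta_tilde * R[1] / J(theta_tilde)

def log_J_tilde(theta, theta_tilde):
    # log of the lower bound J~(theta) from the derivation above
    p = np.array([1 - theta, theta])
    p_tilde = np.array([1 - theta_tilde, theta_tilde])
    w = p_tilde * R / J(theta_tilde)
    return np.sum(w * np.log(p / p_tilde)) + np.log(J(theta_tilde))

theta_tilde = 0.5
theta_hat = em_update(theta_tilde)

# Touching property: J~(theta~) = J(theta~)
print(np.isclose(np.exp(log_J_tilde(theta_tilde, theta_tilde)), J(theta_tilde)))  # True
# Monotone non-decrease: J(theta_hat) >= J(theta_tilde)
print(J(theta_hat) >= J(theta_tilde))  # True
```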

FIGURE 8.1: Policy parameter update in the EM-based policy search. The policy parameter \theta is updated iteratively by maximizing the lower bound \tilde{J}(\theta), which touches the true expected return J(\theta) at the current solution \tilde{\theta}.

Let us employ the Gaussian policy model defined as

    \pi(a \mid s, \theta) = \pi(a \mid s, \mu, \sigma)
    = \frac{1}{\sigma \sqrt{2\pi}}
      \exp\left( -\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2} \right),

where \theta = (\mu^\top, \sigma)^\top and \phi(s) denotes the basis function.

The maximizer \hat{\theta} = (\hat{\mu}^\top, \hat{\sigma})^\top of the lower bound \tilde{J}(\theta) can be analytically obtained as

    \hat{\mu}
    = \left( \int p(h \mid \tilde{\theta}) R(h)
        \sum_{t=1}^{T} \phi(s_t) \phi(s_t)^\top \, dh \right)^{-1}
      \left( \int p(h \mid \tilde{\theta}) R(h)
        \sum_{t=1}^{T} a_t \phi(s_t) \, dh \right)
    \approx \left( \sum_{n=1}^{N} R(h_n)
        \sum_{t=1}^{T} \phi(s_{t,n}) \phi(s_{t,n})^\top \right)^{-1}
      \left( \sum_{n=1}^{N} R(h_n)
        \sum_{t=1}^{T} a_{t,n} \phi(s_{t,n}) \right),

    \hat{\sigma}^2
    = \left( \int p(h \mid \tilde{\theta}) R(h) \, dh \right)^{-1}
      \int p(h \mid \tilde{\theta}) R(h) \frac{1}{T}
        \sum_{t=1}^{T} \left( a_t - \hat{\mu}^\top \phi(s_t) \right)^2 dh
    \approx \left( \sum_{n=1}^{N} R(h_n) \right)^{-1}
      \sum_{n=1}^{N} R(h_n) \frac{1}{T}
        \sum_{t=1}^{T} \left( a_{t,n} - \hat{\mu}^\top \phi(s_{t,n}) \right)^2,

where the expectation over h is approximated by the average over roll-out samples \mathcal{H} = \{h_n\}_{n=1}^{N} from the current policy \tilde{\theta}:

    h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}].

Note that EM-based policy search for Gaussian models is called reward-weighted regression (RWR) (Peters & Schaal, 2007).
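A minimal sketch of one RWR update for the Gaussian policy above, assuming roll-out data are stored as dense arrays (the array layout and the function name `rwr_update` are illustrative choices, not from the book):

```python
import numpy as np

def rwr_update(Phi, A, Ret):
    """One reward-weighted regression (RWR) update of theta = (mu, sigma).

    Phi: (N, T, B) basis features phi(s_{t,n}) for N episodes of length T
    A:   (N, T)    actions a_{t,n}
    Ret: (N,)      non-negative episodic returns R(h_n)
    """
    # Gram term: sum_n R(h_n) sum_t phi(s_{t,n}) phi(s_{t,n})^T  -> (B, B)
    G = np.einsum('n,ntb,ntc->bc', Ret, Phi, Phi)
    # Cross term: sum_n R(h_n) sum_t a_{t,n} phi(s_{t,n})        -> (B,)
    b = np.einsum('n,nt,ntb->b', Ret, A, Phi)
    mu = np.linalg.solve(G, b)
    # Reward-weighted mean squared residual gives sigma^2
    resid = A - Phi @ mu                                  # (N, T)
    sigma2 = (Ret @ (resid ** 2).mean(axis=1)) / Ret.sum()
    return mu, np.sqrt(sigma2)
```

With noiseless actions a_{t,n} = \mu^\top \phi(s_{t,n}), the update recovers \mu exactly and drives \hat{\sigma} to (numerically) zero, matching the analytic maximizer above.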


8.2 Sample Reuse

In practice, a large number of samples is needed to obtain a stable policy update estimator in EM-based policy search. In this section, the sample-reuse technique is applied to the EM method to cope with this instability problem.

8.2.1 Episodic Importance Weighting

The original RWR method is an on-policy algorithm that uses data drawn from the current policy. On the other hand, the situation called off-policy reinforcement learning is considered here, where the sampling policy for collecting data samples is different from the target policy. More specifically, N trajectory samples are gathered following the policy \pi_\ell in the \ell-th policy update iteration:

    \mathcal{H}^{\pi_\ell} = \{ h_1^{\pi_\ell}, \ldots, h_N^{\pi_\ell} \},

where each trajectory sample h_n^{\pi_\ell} is given as

    h_n^{\pi_\ell} = [ s_{1,n}^{\pi_\ell}, a_{1,n}^{\pi_\ell}, \ldots,
                       s_{T,n}^{\pi_\ell}, a_{T,n}^{\pi_\ell},
                       s_{T+1,n}^{\pi_\ell} ].

We want to utilize all these samples to improve the current policy.

Suppose that we are currently at the L-th policy update iteration. If the policies \{\pi_\ell\}_{\ell=1}^{L} remain unchanged over the RWR updates, just using the plain update rules provided in Section 8.1 gives a consistent estimator \hat{\theta}_{L+1}^{\mathrm{NIW}} = ((\hat{\mu}_{L+1}^{\mathrm{NIW}})^\top, \hat{\sigma}_{L+1}^{\mathrm{NIW}})^\top, where

    \hat{\mu}_{L+1}^{\mathrm{NIW}}
    = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h_n^{\pi_\ell})
        \sum_{t=1}^{T} \phi(s_{t,n}^{\pi_\ell}) \phi(s_{t,n}^{\pi_\ell})^\top
      \right)^{-1}
      \times \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h_n^{\pi_\ell})
        \sum_{t=1}^{T} a_{t,n}^{\pi_\ell} \phi(s_{t,n}^{\pi_\ell}) \right),

    \left( \hat{\sigma}_{L+1}^{\mathrm{NIW}} \right)^2
    = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h_n^{\pi_\ell}) \right)^{-1}
      \times \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h_n^{\pi_\ell})
        \frac{1}{T} \sum_{t=1}^{T}
        \left( a_{t,n}^{\pi_\ell}
          - (\hat{\mu}_{L+1}^{\mathrm{NIW}})^\top \phi(s_{t,n}^{\pi_\ell})
        \right)^2 \right).

The superscript "NIW" stands for "no importance weight." However, since policies are updated in each RWR iteration, data samples \{\mathcal{H}^{\pi_\ell}\}_{\ell=1}^{L} collected over iterations generally follow different probability distributions induced by different policies. Therefore, naive use of the above update rules will result in an inconsistent estimator.
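For concreteness, the NIW update simply pools the episodes from all L iterations and applies the plain reward-weighted regression of Section 8.1 to the pooled data. A sketch, assuming each iteration's roll-outs are stored as dense arrays (the layout and the name `rwr_update_niw` are illustrative choices, not from the book):

```python
import numpy as np

def rwr_update_niw(datasets):
    """Pooled 'no importance weight' (NIW) RWR update.

    datasets: list over iterations ell = 1..L of tuples (Phi, A, Ret) with
      Phi: (N, T, B) basis features phi(s),  A: (N, T) actions,
      Ret: (N,)      non-negative episodic returns R(h_n^{pi_ell}).

    Pooling is consistent only if all sampling policies were identical;
    with policies changing across iterations, this estimator is
    inconsistent, which motivates importance weighting.
    """
    Phi = np.concatenate([d[0] for d in datasets])   # (L*N, T, B)
    A = np.concatenate([d[1] for d in datasets])     # (L*N, T)
    Ret = np.concatenate([d[2] for d in datasets])   # (L*N,)
    # Same normal equations as the single-policy RWR update
    G = np.einsum('n,ntb,ntc->bc', Ret, Phi, Phi)
    b = np.einsum('n,nt,ntb->b', Ret, A, Phi)
    mu = np.linalg.solve(G, b)
    resid = A - Phi @ mu
    sigma2 = (Ret @ (resid ** 2).mean(axis=1)) / Ret.sum()
    return mu, np.sqrt(sigma2)
```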
