 Chapter 8
Direct Policy Search by
Expectation-Maximization
Gradient-based direct policy search methods, introduced in Chapter 7, are particularly useful for controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method, based on the expectation-maximization (EM) algorithm, that does not contain the step size parameter. In Section 8.1, the main idea of the EM-based method is described, which is expected to converge faster because policies are more aggressively updated than in the gradient-based approach. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve the stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. Its experimental performance is then evaluated in Section 8.3, and the chapter is concluded in Section 8.4.
8.1 Expectation-Maximization Approach
Although the gradient-based optimization algorithms introduced in Section 7.2 are advantageous when controlling a physical system, they require many iterations until convergence. In this section, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is used to cope with this problem.
The basic idea of EM-based policy search is to iteratively update the policy parameter θ by maximizing a lower bound of the expected return J(θ):

$$J(\theta) = \int p(h|\theta) R(h) \,\mathrm{d}h.$$
To derive a lower bound of J(θ), Jensen's inequality (Bishop, 2006) is utilized:

$$\int q(h) f(g(h)) \,\mathrm{d}h \ge f\!\left( \int q(h) g(h) \,\mathrm{d}h \right),$$

where q is a probability density, f is a convex function, and g is a non-negative function. For f(t) = −log t, Jensen's inequality yields

$$\int q(h) \log g(h) \,\mathrm{d}h \le \log \int q(h) g(h) \,\mathrm{d}h. \tag{8.1}$$
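As a quick sanity check of Eq. (8.1), its discrete analogue can be verified numerically — a hypothetical sketch in which both the probability weights q and the positive values g are arbitrary choices of ours:

```python
import math

# Discrete analogue of Eq. (8.1): for a probability mass function q and a
# positive function g, sum_i q_i * log(g_i) <= log(sum_i q_i * g_i).
q = [0.2, 0.5, 0.3]   # arbitrary probability weights (hypothetical values)
g = [1.0, 4.0, 0.5]   # arbitrary positive function values (hypothetical)

lhs = sum(qi * math.log(gi) for qi, gi in zip(q, g))
rhs = math.log(sum(qi * gi for qi, gi in zip(q, g)))
print(lhs <= rhs)  # True
```

Any other choice of q and positive g yields the same conclusion, since log is concave.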
Assume that the return R(h) is non-negative. Let $\widetilde{\theta}$ be the current policy parameter during the optimization procedure, and q and g in Eq. (8.1) are set as

$$q(h) = \frac{p(h|\widetilde{\theta}) R(h)}{J(\widetilde{\theta})} \quad \text{and} \quad g(h) = \frac{p(h|\theta)}{p(h|\widetilde{\theta})}.$$
Then the following lower bound holds for all θ:

$$\log \frac{J(\theta)}{J(\widetilde{\theta})}
= \log \int \frac{p(h|\theta) R(h)}{J(\widetilde{\theta})} \,\mathrm{d}h
= \log \int \frac{p(h|\widetilde{\theta}) R(h)}{J(\widetilde{\theta})} \, \frac{p(h|\theta)}{p(h|\widetilde{\theta})} \,\mathrm{d}h
\ge \int \frac{p(h|\widetilde{\theta}) R(h)}{J(\widetilde{\theta})} \log \frac{p(h|\theta)}{p(h|\widetilde{\theta})} \,\mathrm{d}h.$$

This yields

$$\log J(\theta) \ge \log \widetilde{J}(\theta),$$

where

$$\log \widetilde{J}(\theta) = \int \frac{R(h)\, p(h|\widetilde{\theta})}{J(\widetilde{\theta})} \log \frac{p(h|\theta)}{p(h|\widetilde{\theta})} \,\mathrm{d}h + \log J(\widetilde{\theta}).$$
In the EM approach, the parameter θ is iteratively updated by maximizing the lower bound $\widetilde{J}(\theta)$:

$$\widehat{\theta} = \mathop{\mathrm{argmax}}_{\theta} \widetilde{J}(\theta).$$

Since $\log \widetilde{J}(\widetilde{\theta}) = \log J(\widetilde{\theta})$, the lower bound $\widetilde{J}$ touches the target function J at the current solution $\widetilde{\theta}$:

$$\widetilde{J}(\widetilde{\theta}) = J(\widetilde{\theta}).$$

Thus, monotone non-decrease of the expected return is guaranteed:

$$J(\widehat{\theta}) \ge J(\widetilde{\theta}).$$

This update is iterated until convergence (see Figure 8.1).
FIGURE 8.1: Policy parameter update in the EM-based policy search. The policy parameter θ is updated iteratively by maximizing the lower bound $\widetilde{J}(\theta)$, which touches the true expected return J(θ) at the current solution $\widetilde{\theta}$.

Let us employ the Gaussian policy model defined as

$$\pi(a|s,\theta) = \pi(a|s,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{\left( a - \mu^\top \phi(s) \right)^2}{2\sigma^2} \right),$$

where $\theta = (\mu^\top, \sigma)^\top$ and φ(s) denotes the basis function.
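The Gaussian policy can be sketched in code as follows — a minimal illustration in which the basis function φ(s) = (s, 1)ᵀ and all parameter values are hypothetical choices of ours, not taken from the text:

```python
import random

random.seed(0)  # for reproducibility

def phi(s):
    """Hypothetical basis function phi(s) = (s, 1)^T, giving a mean linear in s."""
    return [s, 1.0]

def sample_action(s, mu, sigma):
    """Draw a ~ N(mu^T phi(s), sigma^2), i.e., the Gaussian policy pi(a|s, mu, sigma)."""
    mean = sum(m * f for m, f in zip(mu, phi(s)))
    return random.gauss(mean, sigma)

# Hypothetical parameters: mean action 0.5*s + 0.1, exploration noise sigma = 0.2.
mu, sigma = [0.5, 0.1], 0.2
actions = [sample_action(1.0, mu, sigma) for _ in range(10000)]
print(sum(actions) / len(actions))  # close to the policy mean 0.5*1.0 + 0.1 = 0.6
```

Averaging many sampled actions recovers the deterministic part µᵀφ(s), while σ controls the amount of exploration noise.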
The maximizer $\widehat{\theta} = (\widehat{\mu}^\top, \widehat{\sigma})^\top$ of the lower bound $\widetilde{J}(\theta)$ can be analytically obtained as

$$\widehat{\mu} = \left( \int p(h|\widetilde{\theta}) R(h) \sum_{t=1}^T \phi(s_t) \phi(s_t)^\top \mathrm{d}h \right)^{-1} \int p(h|\widetilde{\theta}) R(h) \sum_{t=1}^T a_t \phi(s_t) \,\mathrm{d}h$$

$$\approx \left( \sum_{n=1}^N R(h_n) \sum_{t=1}^T \phi(s_{t,n}) \phi(s_{t,n})^\top \right)^{-1} \sum_{n=1}^N R(h_n) \sum_{t=1}^T a_{t,n} \phi(s_{t,n}),$$
$$\widehat{\sigma}^2 = \left( \int p(h|\widetilde{\theta}) R(h) \,\mathrm{d}h \right)^{-1} \int p(h|\widetilde{\theta}) R(h) \frac{1}{T} \sum_{t=1}^T \left( a_t - \widehat{\mu}^\top \phi(s_t) \right)^2 \mathrm{d}h$$

$$\approx \left( \sum_{n=1}^N R(h_n) \right)^{-1} \sum_{n=1}^N R(h_n) \frac{1}{T} \sum_{t=1}^T \left( a_{t,n} - \widehat{\mu}^\top \phi(s_{t,n}) \right)^2,$$
where the expectation over h is approximated by the average over roll-out samples $\mathcal{H} = \{h_n\}_{n=1}^N$ from the current policy $\widetilde{\theta}$:

$$h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}].$$
Note that EM-based policy search for Gaussian models is called reward-weighted regression (RWR) (Peters & Schaal, 2007).
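The analytic RWR update above can be sketched in Python — a minimal NumPy version under the assumption that each roll-out is stored as a list of (φ(sₜ), aₜ) pairs; the data layout and function name are our own choices, not part of the original method:

```python
import numpy as np

def rwr_update(rollouts, returns):
    """One reward-weighted regression update of the Gaussian policy.

    rollouts: list of N trajectories; each trajectory is a list of
              (phi_s, a) pairs, where phi_s is the vector phi(s_t).
    returns:  list of N non-negative returns R(h_n).
    Returns the lower-bound maximizers (mu_hat, sigma_hat).
    """
    d = len(rollouts[0][0][0])
    A = np.zeros((d, d))  # sum_n R(h_n) sum_t phi(s_{t,n}) phi(s_{t,n})^T
    b = np.zeros(d)       # sum_n R(h_n) sum_t a_{t,n} phi(s_{t,n})
    for traj, R in zip(rollouts, returns):
        for phi_s, a in traj:
            phi_s = np.asarray(phi_s)
            A += R * np.outer(phi_s, phi_s)
            b += R * a * phi_s
    mu_hat = np.linalg.solve(A, b)

    # sigma^2: return-weighted average of each trajectory's mean squared residual.
    num, den = 0.0, 0.0
    for traj, R in zip(rollouts, returns):
        mse = sum((a - mu_hat @ np.asarray(phi_s)) ** 2
                  for phi_s, a in traj) / len(traj)
        num += R * mse
        den += R
    sigma_hat = np.sqrt(num / den)
    return mu_hat, sigma_hat
```

Because the returns enter only as non-negative weights, trajectories with larger R(hₙ) pull the regression solution more strongly toward their actions, which is exactly the "reward-weighted" aspect of RWR.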
8.2 Sample Reuse
In practice, a large number of samples is needed to obtain a stable policy update estimator in the EM-based policy search. In this section, the sample-reuse technique is applied to the EM method to cope with the instability problem.
8.2.1 Episodic Importance Weighting
The original RWR method is an on-policy algorithm that uses data drawn from the current policy. On the other hand, the situation called off-policy reinforcement learning is considered here, where the sampling policy for collecting data samples is different from the target policy. More specifically, N trajectory samples are gathered following the policy $\pi_\ell$ in the ℓ-th policy update iteration:

$$\mathcal{H}^{\pi_\ell} = \{ h_1^{\pi_\ell}, \ldots, h_N^{\pi_\ell} \},$$

where each trajectory sample $h_n^{\pi_\ell}$ is given as

$$h_n^{\pi_\ell} = [s_{1,n}^{\pi_\ell}, a_{1,n}^{\pi_\ell}, \ldots, s_{T,n}^{\pi_\ell}, a_{T,n}^{\pi_\ell}, s_{T+1,n}^{\pi_\ell}].$$
We want to utilize all these samples to improve the current policy.
Suppose that we are currently at the L-th policy update iteration. If the policies $\{\pi_\ell\}_{\ell=1}^L$ remain unchanged over the RWR updates, just using the plain update rules provided in Section 8.1 gives a consistent estimator $\widehat{\theta}^{\mathrm{NIW}}_{L+1} = (\widehat{\mu}^{\mathrm{NIW}\top}_{L+1}, \widehat{\sigma}^{\mathrm{NIW}}_{L+1})^\top$, where

$$\widehat{\mu}^{\mathrm{NIW}}_{L+1} = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h_n^{\pi_\ell}) \sum_{t=1}^T \phi(s_{t,n}^{\pi_\ell}) \phi(s_{t,n}^{\pi_\ell})^\top \right)^{-1} \times \sum_{\ell=1}^L \sum_{n=1}^N R(h_n^{\pi_\ell}) \sum_{t=1}^T a_{t,n}^{\pi_\ell} \phi(s_{t,n}^{\pi_\ell}),$$

$$\left( \widehat{\sigma}^{\mathrm{NIW}}_{L+1} \right)^2 = \left( \sum_{\ell=1}^L \sum_{n=1}^N R(h_n^{\pi_\ell}) \right)^{-1} \times \sum_{\ell=1}^L \sum_{n=1}^N R(h_n^{\pi_\ell}) \frac{1}{T} \sum_{t=1}^T \left( a_{t,n}^{\pi_\ell} - \widehat{\mu}^{\mathrm{NIW}\top}_{L+1} \phi(s_{t,n}^{\pi_\ell}) \right)^2.$$
The superscript "NIW" stands for "no importance weight." However, since policies are updated in each RWR iteration, data samples $\{\mathcal{H}^{\pi_\ell}\}_{\ell=1}^L$ collected over iterations generally follow different probability distributions induced by different policies. Therefore, naive use of the above update rules will result in an inconsistent estimator.
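For concreteness, the naive pooled ("NIW") mean update can be sketched as follows — a hypothetical, self-contained version that simply concatenates the samples from all L iterations; as the text notes, this is consistent only when the policies never change, so the example should be read as illustrating the estimator's form, not endorsing its use off-policy:

```python
import numpy as np

def niw_mu(all_rollouts, all_returns):
    """'No importance weight' (NIW) estimate of mu_{L+1}.

    all_rollouts: for each iteration ell = 1..L, a list of N trajectories,
                  each a list of (phi_s, a) pairs; all_returns matches.
    Pools every sample as if drawn from one policy, ignoring that each
    iteration's data came from a different pi_ell.
    """
    d = len(all_rollouts[0][0][0][0])
    A = np.zeros((d, d))
    b = np.zeros(d)
    for rollouts, returns in zip(all_rollouts, all_returns):
        for traj, R in zip(rollouts, returns):
            for phi_s, a in traj:
                phi_s = np.asarray(phi_s)
                A += R * np.outer(phi_s, phi_s)
                b += R * a * phi_s
    return np.linalg.solve(A, b)
```

The importance-weighted correction discussed next replaces these unweighted sums with ones weighted by the density ratio between the target and sampling policies.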