
and therefore, replacing index $k$ by $\ell$,
\[
d_\ell^{\top} H w_\ell = d_\ell^{\top} H w_1 .
\]
The step lengths are thus
\[
\alpha_\ell = -\frac{d_\ell^{\top} (b + H w_\ell)}{d_\ell^{\top} H d_\ell},
\qquad \ell = 1, \ldots, n_w .
\]
Finally, using the notation $g_\ell = g(w_\ell) = b + H w_\ell$ and substituting $\ell \to k$,
\[
\alpha_k = -\frac{d_k^{\top} g_k}{d_k^{\top} H d_k},
\qquad k = 1, \ldots, n_w . \tag{B.28}
\]
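To make Equation (B.28) concrete, the following is a minimal NumPy sketch (not from the text) of the step-length computation for a quadratic error whose gradient is $g(w) = b + Hw$; the particular $H$, $b$, $w_k$ and $d_k$ below are made-up illustrative values.

```python
import numpy as np

def step_length(d, g, H):
    """Step length of Equation (B.28): alpha = -(d^T g) / (d^T H d).

    For a quadratic error this is the exact minimizer along direction d.
    """
    return -(d @ g) / (d @ H @ d)

# Made-up quadratic model: gradient g(w) = b + H w
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # symmetric, positive definite
b = np.array([1.0, -1.0])

w_k = np.array([0.5, 0.5])      # current point w_k
g_k = b + H @ w_k               # local gradient g_k
d_k = np.array([1.0, -2.0])     # some search direction d_k

alpha_k = step_length(d_k, g_k, H)
w_next = w_k + alpha_k * d_k    # step to the next point
```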
For want of a better alternative, we can choose the first search direction along the negative local gradient
\[
d_1 = -g_1 = -\frac{\partial}{\partial w} E(w_1).
\]
(Note that $d_1$ is not a unit vector.) We move, according to Equation (B.28), a distance
\[
\alpha_1 = \frac{d_1^{\top} d_1}{d_1^{\top} H d_1}
\]
along this direction to the point $w_2$, at which the local gradient $g_2$ is orthogonal to $d_1$.
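A short, self-contained NumPy check of this first step and of the orthogonality of $g_2$ and $d_1$ (again with made-up values for $H$, $b$ and $w_1$) might look as follows.

```python
import numpy as np

# Made-up quadratic model: gradient g(w) = b + H w
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # symmetric, positive definite
b = np.array([1.0, -1.0])

w_1 = np.array([0.5, 0.5])              # starting point w_1
g_1 = b + H @ w_1                       # local gradient g_1
d_1 = -g_1                              # first search direction d_1 = -g_1

alpha_1 = (d_1 @ d_1) / (d_1 @ H @ d_1) # distance given above
w_2 = w_1 + alpha_1 * d_1               # new point w_2
g_2 = b + H @ w_2                       # gradient at w_2

print(d_1 @ g_2)                        # ~0: g_2 is orthogonal to d_1
```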
We then choose the new conjugate search direction $d_2$ as a linear combination ...