The preceding policy optimization using the Monte Carlo policy gradient approach suffers from high variance. To tackle this issue, we use a critic to estimate the state-action value function, that is:

$$Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$$
This gives rise to the well-known family of actor-critic algorithms. As the name suggests, an actor-critic algorithm maintains two networks, which serve the following purposes:
- One network acts as the critic, which updates the weight parameter vector w of the state-action value function approximator
- The other network acts as the actor, which updates the policy parameter vector θ in the direction suggested by the critic (a minimal update sketch follows this list)
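To make the two updates concrete, here is a minimal sketch of a single actor-critic step. It uses PyTorch and a synthetic transition purely for illustration; the framework choice, network sizes, learning rates, and dummy data are assumptions and not taken from this text. The critic moves Q_w(s, a) toward a TD target, and the actor ascends the gradient of log π_θ(a|s) weighted by the critic's estimate.

```python
# Minimal one-step actor-critic update sketch (PyTorch; illustrative only).
import torch
import torch.nn as nn
from torch.distributions import Categorical

state_dim, n_actions, gamma = 4, 2, 0.99  # illustrative sizes

# Actor: maps a state to action probabilities (the policy pi_theta).
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, n_actions), nn.Softmax(dim=-1))
# Critic: maps a state to a Q-value per action (the approximator Q_w).
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                       nn.Linear(32, n_actions))

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One dummy transition (s, a, r, s', a'); in practice this comes from
# acting with the current policy in the environment.
s = torch.randn(state_dim)
dist = Categorical(actor(s))
a = dist.sample()
r = torch.tensor(1.0)
s_next = torch.randn(state_dim)
a_next = Categorical(actor(s_next)).sample()

# Critic update: move Q_w(s, a) toward the TD target r + gamma * Q_w(s', a').
q_sa = critic(s)[a]
with torch.no_grad():
    td_target = r + gamma * critic(s_next)[a_next]
critic_loss = (td_target - q_sa) ** 2
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor update: ascend grad log pi_theta(a|s) scaled by the critic's
# estimate, i.e. follow the direction suggested by the critic.
actor_loss = -dist.log_prob(a) * critic(s)[a].detach()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

In practice, the transition is collected by interacting with the environment under the current policy, and these two updates are repeated at every step (or over small batches), so the critic's value estimates and the actor's policy improve together.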
The following ...