Value network

The last step of the pipeline trains a value network to evaluate the board state, that is, to estimate how favorable a given board state is for winning the game. Formally, given a policy, $\pi$, and a state, $s$, we would like to predict the expected reward, $v^{\pi}(s)$. The network is trained by minimizing the mean-squared error (MSE) between the predicted value, $v_\theta(s)$, and the final outcome, $z$:

$$L(\theta) = \mathbb{E}\left[\left(z - v_\theta(s)\right)^2\right]$$

Where $\theta$ are the ...
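To make the MSE objective concrete, here is a minimal sketch in plain Python. It is not the network described in this chapter: the value function is assumed to be a tanh of a linear score over three made-up features, the toy "board states" and outcomes are synthetic, and the names `predict`, `mse`, and `sgd_step` are hypothetical. It only illustrates regressing a predicted value toward the final game outcome (+1 for a win, -1 for a loss).

```python
import math
import random


def predict(theta, state):
    """Hypothetical value function: tanh of a linear score, bounded to [-1, 1]."""
    return math.tanh(sum(w * x for w, x in zip(theta, state)))


def mse(theta, states, outcomes):
    """Mean-squared error between predicted values and final outcomes."""
    total = sum((z - predict(theta, s)) ** 2 for s, z in zip(states, outcomes))
    return total / len(states)


def sgd_step(theta, state, z, lr=0.1):
    """One gradient-descent step on (z - v)^2 for a single (state, outcome) pair."""
    v = predict(theta, state)
    # d/dtheta_i (z - v)^2 = -2 * (z - v) * (1 - v^2) * x_i,
    # because the derivative of tanh(u) is 1 - tanh(u)^2.
    scale = -2.0 * (z - v) * (1.0 - v * v)
    return [w - lr * scale * x for w, x in zip(theta, state)]


random.seed(0)
# Synthetic data (an assumption for illustration): 3-feature states,
# labeled +1 (win) or -1 (loss) by a made-up rule.
states = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
outcomes = [1.0 if s[0] + 0.5 * s[1] > 0 else -1.0 for s in states]

theta = [0.0, 0.0, 0.0]
loss_before = mse(theta, states, outcomes)
for _ in range(20):  # passes over the data
    for s, z in zip(states, outcomes):
        theta = sgd_step(theta, s, z)
loss_after = mse(theta, states, outcomes)
print(f"MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

A real value network would replace `predict` with a deep model and compute gradients automatically, but the training signal would be the same squared difference between the predicted value and the observed outcome.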
