AlphaGo value network
The value network reduced errors in the system's play by guiding MCTS toward promising nodes, which also reduced the depth to which the search had to run. It was trained on further games of self-play, refining the policy learned by the policy networks by estimating the value function, specifically the action value function. Recall from Chapter 8, Reinforcement Learning, that an action value function describes the value of taking a certain action while in a specific state. It measures the cumulative reward that follows from a state-action pair: for a given state and the action we take in it, how much will this increase our reward? It lets us postulate what would happen by taking a different ...
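The idea of an action value as the cumulative reward that follows a state-action pair can be illustrated with a small tabular sketch. This is not AlphaGo's training procedure, which used a deep network and self-play positions. It is a minimal every-visit Monte Carlo estimate on made-up data, and the names `discounted_return`, `estimate_action_values`, and the toy states and actions are all hypothetical.

```python
from collections import defaultdict


def discounted_return(rewards, gamma=1.0):
    """Cumulative discounted reward: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    # Work backwards so each step folds in the discounted future reward.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_action_values(episodes, gamma=1.0):
    """Every-visit Monte Carlo estimate of Q(state, action).

    Each episode is a list of (state, action, reward) steps. The estimate for
    a pair is the average return observed after taking that action in that
    state, across all episodes.
    """
    returns = defaultdict(list)
    for episode in episodes:
        rewards = [reward for _, _, reward in episode]
        for t, (state, action, _) in enumerate(episode):
            returns[(state, action)].append(discounted_return(rewards[t:], gamma))
    return {pair: sum(vals) / len(vals) for pair, vals in returns.items()}


# Toy episodes (assumed data): the final reward is +1 for a win, -1 for a loss.
episodes = [
    [("s0", "left", 0.0), ("s1", "left", 1.0)],
    [("s0", "right", 0.0), ("s2", "left", -1.0)],
    [("s0", "left", 0.0), ("s1", "right", -1.0)],
]

q_values = estimate_action_values(episodes)
for (state, action), value in sorted(q_values.items()):
    print(f"Q({state}, {action}) = {value:+.2f}")
```

Comparing `q_values[("s0", "left")]` with `q_values[("s0", "right")]` shows how an action value function lets us ask what a different action in the same state would have been worth.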