In AlphaGo, the policy and value networks are combined with MCTS to perform a look-ahead search when selecting actions in a game. Previously, we discussed how MCTS keeps track of the mean reward and the number of visits made to each node. In AlphaGo, there are a few more values to keep track of:
- Q(s, a): the mean action value of choosing action a in state s
- P(s, a): the prior probability of taking action a in state s, given by the larger supervised-learning policy network
- V(s): the value evaluation of a state, given by the value network
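To make these statistics concrete, here is a minimal sketch of how they might be stored on each tree edge and combined for AlphaGo-style action selection, where the exploration bonus is proportional to the prior P and decays with the visit count N. The names (`Edge`, `select_action`, `c_puct`) are illustrative, not taken from any actual AlphaGo implementation:

```python
import math

class Edge:
    """Statistics tracked for one (state, action) edge in the search tree."""
    def __init__(self, prior):
        self.P = prior   # prior probability from the policy network
        self.N = 0       # visit count
        self.W = 0.0     # total action value accumulated over simulations

    @property
    def Q(self):
        """Mean action value (0 if the edge has never been visited)."""
        return self.W / self.N if self.N else 0.0

def select_action(edges, c_puct=1.0):
    """Pick the action maximizing Q + u, where the bonus
    u = c_puct * P * sqrt(sum of visits) / (1 + N)
    favors high-prior, rarely-visited actions."""
    total_n = sum(e.N for e in edges.values())
    def score(e):
        return e.Q + c_puct * e.P * math.sqrt(total_n) / (1 + e.N)
    return max(edges, key=lambda a: score(edges[a]))
```

For example, if action `a` has a high prior but has already been visited many times with mediocre results, the bonus term shrinks and a lower-prior, unexplored action `b` can win the selection — this is how the search balances exploiting Q against exploring P.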