AlphaGo policy network
The policy network's goal is to model the moves that players typically make from a given board position. During Monte Carlo Tree Search (MCTS), it steers the search toward promising actions, which reduces the breadth of the search. The policy network comes in two parts: a supervised learning policy network and a reinforcement learning policy network.
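To make the idea of guiding the search concrete, here is a minimal sketch of a prior-weighted selection rule, in the spirit of the one AlphaGo uses inside MCTS. The function name `select_action`, the constant `c_puct`, the `+ 1` inside the square root, and the toy move labels are illustrative assumptions, not the exact formulation used in AlphaGo.

```python
import math

def select_action(priors, visit_counts, q_values, c_puct=1.0):
    """Pick the action that maximises Q + U.

    U is an exploration bonus proportional to the policy network's prior
    probability and shrinking as the action is visited more often.
    """
    total_visits = sum(visit_counts.values())
    best_action, best_score = None, -math.inf
    for action, prior in priors.items():
        n = visit_counts.get(action, 0)   # times this action was tried
        q = q_values.get(action, 0.0)     # mean value observed so far
        # Assumed variant: "+ 1" keeps the bonus non-zero before any visit.
        u = c_puct * prior * math.sqrt(total_visits + 1) / (1 + n)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action

# Hypothetical priors from a policy network for three candidate moves.
priors = {"D4": 0.6, "Q16": 0.3, "A1": 0.1}

# With no visits yet, the move with the highest prior is explored first.
first_choice = select_action(priors, visit_counts={}, q_values={})

# After D4 has been tried often with poor results, the search moves on.
later_choice = select_action(
    priors, visit_counts={"D4": 10}, q_values={"D4": -0.5}
)
```

Because low-prior moves such as `A1` receive only a small bonus, the search spends few simulations on them. This is the sense in which the policy network narrows the breadth of the search.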
The first, the supervised learning network, is a 13-layer convolutional neural network (CNN). It was trained on 30 million moves made by human players and outputs a probability distribution over the possible actions for a given state. This type of supervised learning is called behavior cloning.
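As a rough illustration of behavior cloning, the sketch below turns raw network outputs (logits) into a probability distribution over moves and scores it against the move a human actually played, using the cross-entropy loss. The helper names `softmax` and `cloning_loss` and the three-move toy example are assumptions for illustration. The real network is a deep CNN over the full board, and this is not its training code.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over moves."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cloning_loss(logits, expert_move):
    """Cross-entropy loss: negative log-probability of the human's move."""
    probs = softmax(logits)
    return -math.log(probs[expert_move])

# Hypothetical logits for three candidate moves in some position.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# The loss is small when the network already favours the human's move...
loss_match = cloning_loss(logits, expert_move=0)
# ...and large when the human played a move the network thought unlikely.
loss_miss = cloning_loss(logits, expert_move=2)
```

Minimising this loss over many recorded positions pushes the network to assign high probability to the moves humans chose, which is what makes it a clone of human behavior.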
The ...