Since we don't use human-generated data for training or supervision, how does AlphaGo Zero learn at all? The novel reinforcement learning algorithm developed by DeepMind uses MCTS as a teacher for the neural network, which represents both the policy and the value function.
In particular, the outputs of MCTS are 1) the probabilities, $\pi$, of selecting each move during the simulation, and 2) the final outcome of the game, $z$. The neural network, $f_\theta$, takes in a board state, $s$, and also outputs a tuple $(p, v)$, where $p$ is a vector of move ...
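To make the teacher/student pairing concrete, here is a minimal Python sketch. It pairs the two MCTS outputs (the move probabilities, called `pi` here, and the game outcome, `z`) with the two network outputs (the move-probability vector `p` and a scalar value `v`). The loss used, squared error on the value plus cross-entropy on the policy, follows the form published for AlphaGo Zero with the regularization term left out; treat it as an assumption here, since the text above does not spell out the objective. `TrainingExample`, `toy_network`, and `teacher_loss` are hypothetical names, and the board encoding is a placeholder.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingExample:
    """One self-play position with its MCTS-derived targets (hypothetical container)."""
    state: List[float]  # encoded board state s (the encoding is an assumption)
    pi: List[float]     # MCTS move probabilities, one entry per move
    z: float            # final game outcome for the player to move, e.g. +1 or -1


def toy_network(state: List[float], num_moves: int) -> Tuple[List[float], float]:
    """Stand-in for the neural network: maps a board state to (p, v).

    A real network is learned; this placeholder returns a uniform policy
    and a neutral value so that the example runs on its own.
    """
    p = [1.0 / num_moves] * num_moves
    v = 0.0
    return p, v


def teacher_loss(p: List[float], v: float, pi: List[float], z: float,
                 eps: float = 1e-12) -> float:
    """Squared value error plus cross-entropy between the MCTS policy pi and p."""
    value_loss = (z - v) ** 2
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_loss + policy_loss


# MCTS (the teacher) supplies pi and z; the network (the student) supplies p and v.
example = TrainingExample(state=[0.0] * 9, pi=[0.5, 0.5, 0.0, 0.0], z=1.0)
p, v = toy_network(example.state, num_moves=len(example.pi))
loss = teacher_loss(p, v, example.pi, example.z)
print(f"loss = {loss:.4f}")
```

The loss shrinks as `p` moves toward `pi` and `v` moves toward `z`, which is the sense in which MCTS acts as the teacher: the search results are the targets, and the network is adjusted to reproduce them.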