This file contains our implementation of PolicyValueNetwork. In short, we construct a tf.estimator.Estimator and train it on the board states, policies, and game outcomes produced by MCTS self-play. The network has two heads: one acts as a value function, and the other acts as a policy network.
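To make the two-head idea concrete, here is a minimal pure-Python sketch of a forward pass in which a shared trunk feeds both a policy head and a value head. It is not the book's PolicyValueNetwork: the layer sizes, the parameter names (`trunk_w`, `policy_w`, `value_w`), the single dense trunk layer, and the 3x3 board with a pass move are all assumptions chosen for illustration.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def two_head_forward(board_features, params):
    """Return (policy, value) computed from one shared representation."""
    # Shared trunk: a single dense layer followed by ReLU.
    hidden = [max(0.0, h)
              for h in dense(board_features, params["trunk_w"], params["trunk_b"])]
    # Policy head: a probability for every candidate move.
    policy = softmax(dense(hidden, params["policy_w"], params["policy_b"]))
    # Value head: a single scalar squashed into [-1, 1].
    value = math.tanh(dense(hidden, params["value_w"], params["value_b"])[0])
    return policy, value


# Hypothetical sizes: a 3x3 board (9 features) and 9 moves plus a pass.
NUM_FEATURES, HIDDEN, NUM_MOVES = 9, 8, 10
rng = random.Random(0)
params = {
    "trunk_w": random_matrix(rng, HIDDEN, NUM_FEATURES),
    "trunk_b": [0.0] * HIDDEN,
    "policy_w": random_matrix(rng, NUM_MOVES, HIDDEN),
    "policy_b": [0.0] * NUM_MOVES,
    "value_w": random_matrix(rng, 1, HIDDEN),
    "value_b": [0.0],
}

# 1.0 = own stone, -1.0 = opponent stone, 0.0 = empty (an assumed encoding).
board = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
policy, value = two_head_forward(board, params)
```

Because both heads read the same hidden representation, training on policy targets and game outcomes together shapes a single set of trunk features, which is the main reason for sharing one network between the two outputs.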

First, we define some layers that will be used by PolicyValueNetwork:

import functools
import logging
import os.path

import tensorflow as tf

import features
import preprocessing
import utils
from config import GLOBAL_PARAMETER_STORE, GOPARAMETERS
from constants import *

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def create_partial_bn_layer(params):
    return functools.partial(tf.layers.batch_normalization,
        momentum
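The create_partial_bn_layer helper relies on functools.partial to bind layer arguments once so that every later call shares the same configuration. The sketch below shows that pattern in isolation. It uses a hypothetical stand-in function instead of tf.layers.batch_normalization so that it runs without TensorFlow, and the argument names and values (`momentum=0.9`, `epsilon=1e-5`) are assumptions, not the settings used in the book.

```python
import functools


def batch_normalization(inputs, momentum=0.99, epsilon=1e-3, training=False):
    """Hypothetical stand-in for a layer function; it only reports its settings."""
    return {"inputs": inputs, "momentum": momentum,
            "epsilon": epsilon, "training": training}


def create_partial_layer(layer_fn, **fixed_kwargs):
    """Pre-bind keyword arguments so call sites only pass what varies."""
    return functools.partial(layer_fn, **fixed_kwargs)


# Bind the hyperparameters once (illustrative values).
bn = create_partial_layer(batch_normalization, momentum=0.9, epsilon=1e-5)

# Later calls supply only the inputs and any per-call arguments.
result = bn("x", training=True)
```

Binding the hyperparameters in one place keeps the model-building code short and ensures that every batch normalization layer in the network is configured identically.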
