December 2018
684 pages
The agent's performance is quite sensitive to several hyperparameters. We will start with the discount and learning rates:
gamma = 0.99  # discount factor
learning_rate = 5e-5  # learning rate
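To make the role of these two values concrete, here is a minimal sketch of where the discount factor enters the one-step TD target that the agent regresses toward (the learning rate is passed to the optimizer). The function name and signature are illustrative, not the book's code:

```python
import numpy as np

gamma = 0.99          # discount factor
learning_rate = 5e-5  # learning rate, passed to the optimizer

def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step TD target r + gamma * max_a' Q(s', a');
    the bootstrap term is zeroed at the end of an episode."""
    return reward + gamma * np.max(next_q_values) * (1.0 - done)
```

A higher gamma weights distant rewards more heavily; a smaller learning rate slows, but stabilizes, the updates toward this target.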
We will update the target network every 100 time steps, store up to 1 million past episodes in the replay memory, and sample mini-batches of 1,024 experiences from memory to train the agent:
tau = 100  # target network update frequency
replay_capacity = int(1e6)
minibatch_size = 1024
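A fixed-capacity replay memory that evicts its oldest entries and supports uniform random mini-batch sampling can be sketched as follows. This is a hypothetical helper for illustration, not the book's implementation:

```python
from collections import deque
import random

class ReplayMemory:
    """Fixed-capacity store of past experience (illustrative sketch)."""

    def __init__(self, capacity=int(1e6)):
        # deque with maxlen silently discards the oldest entry when full
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, minibatch_size=1024):
        # uniform sampling without replacement from stored experience
        return random.sample(self.buffer, minibatch_size)
```

Sampling uniformly from a large buffer breaks the correlation between consecutive experiences, which is what makes the mini-batch updates behave more like supervised learning.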
The ε-greedy policy starts with pure exploration at ε=1, decays linearly to ε=0.05 over 20,000 time steps, and decays exponentially thereafter:
epsilon_start = 1.0
epsilon_end = 0.05
epsilon_linear_steps = 2e4
epsilon_exp_decay = 0.99
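The two-phase schedule described above can be sketched as a function of the time step. The function name is hypothetical, and applying the exponential factor once per step after the linear phase is an assumption about how the decay is implemented:

```python
def epsilon_schedule(step,
                     epsilon_start=1.0,
                     epsilon_end=0.05,
                     epsilon_linear_steps=int(2e4),
                     epsilon_exp_decay=0.99):
    """Linear decay from epsilon_start to epsilon_end over
    epsilon_linear_steps, then exponential decay (assumed per step)."""
    if step < epsilon_linear_steps:
        frac = step / epsilon_linear_steps
        return epsilon_start + frac * (epsilon_end - epsilon_start)
    return epsilon_end * epsilon_exp_decay ** (step - epsilon_linear_steps)
```

Front-loading exploration lets the agent gather diverse experience early, while the small residual ε keeps a trickle of exploration once the policy has largely converged.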