In this experiment, we use a learning rate of 0.01 and a discount factor of 0.8. The following code initializes the environment and the hyperparameters for training:
async function qlearning() {
  // Episode indices (1,000 episodes)
  const episodes = [];
  for (let i = 0; i < 1000; i++) {
    episodes.push(i);
  }

  // Initialize the environment
  const env = new Environment();

  // Initialize the action-value function as a 2-dim tensor
  // with the shape [numStates, numActions]
  let actionValue = tf.fill([env.getNumStates(), env.getNumActions()], 10);

  // Learning rate
  const alpha = 0.01;

  // Discount factor
  const discount = 0.8;

  // Optimization with Q-learning
  // ...
}
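To make the update step concrete, here is a minimal, self-contained sketch of tabular Q-learning using plain arrays instead of the Environment class and TensorFlow.js tensors above. The toy chain environment, its step function, and the epsilon value are hypothetical stand-ins introduced only for illustration; alpha and the discount match the text (0.01 and 0.8).

```javascript
// Minimal tabular Q-learning sketch on a hypothetical 1-D chain environment.
// The agent starts at state 0 and earns reward 1 for reaching the last state.
const NUM_STATES = 5;
const NUM_ACTIONS = 2; // 0 = move left, 1 = move right
const alpha = 0.01;    // learning rate, as in the text
const discount = 0.8;  // discount factor, as in the text
const epsilon = 0.1;   // exploration rate (an assumption, not from the text)

// Q-table initialized to 0 here; the text initializes to 10 via tf.fill.
const Q = Array.from({ length: NUM_STATES },
                     () => new Array(NUM_ACTIONS).fill(0));

// Hypothetical environment dynamics: deterministic left/right moves.
function step(state, action) {
  const next = Math.min(NUM_STATES - 1,
                        Math.max(0, state + (action === 1 ? 1 : -1)));
  const done = next === NUM_STATES - 1;
  return { next, reward: done ? 1 : 0, done };
}

// One Q-learning update:
// Q(s,a) += alpha * (r + discount * max_a' Q(s',a') - Q(s,a))
function update(state, action, reward, next) {
  const target = reward + discount * Math.max(...Q[next]);
  Q[state][action] += alpha * (target - Q[state][action]);
}

// Run 1,000 episodes with an epsilon-greedy policy.
for (let episode = 0; episode < 1000; episode++) {
  let state = 0;
  for (let t = 0; t < 50; t++) {
    const action = Math.random() < epsilon
      ? Math.floor(Math.random() * NUM_ACTIONS)
      : (Q[state][1] >= Q[state][0] ? 1 : 0);
    const { next, reward, done } = step(state, action);
    update(state, action, reward, next);
    state = next;
    if (done) break;
  }
}
```

After training, the action values near the goal should favor moving right, since that action leads directly to the reward. The same update rule is what the book's loop applies to the actionValue tensor.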
We can update the action-value function by observing the results returned by the environment across episodes (1,000 episodes ...