Appendix B. RL4J and Reinforcement Learning
We begin this appendix with an introduction to reinforcement learning, followed by a detailed explanation of Deep Q-Networks (DQNs) for pixel inputs, and then we conclude by showing you an RL4J example. Let’s begin with a look at the core concepts of reinforcement learning.
Reinforcement learning is an exciting area of machine learning. At its core, it is the learning of an efficient strategy in a given environment. Informally, this is very similar to Pavlovian conditioning: you assign a reward for a given behavior, and, over time, the agent learns to reproduce that behavior in order to receive more rewards.
Markov Decision Process
Formally, an environment is defined as a Markov Decision Process (MDP). Behind this scary name is nothing more than a 5-tuple that combines the following:
- A set of states S (e.g., in chess, a state is the board configuration).
- A set of possible actions A (e.g., in chess, every possible move in every possible configuration, such as e4–e5).
- The conditional distribution P(s′|s,a) of the next state, given the current state and an action. (In a deterministic environment like chess, exactly one state s′ has probability 1, and all the others have probability 0. In a stochastic environment, one that involves randomness such as a coin toss, the distribution is not as simple.)
- The reward function of transitioning from state s to s′: R(s,s′) (e.g., in chess, +1 for a final move that leads ...