Chapter 4. Simulated Data
It is often said that data is the new oil, but this analogy is not quite right. Oil is a finite resource that must be extracted and refined, whereas data is an infinite resource that is constantly being generated and refined.
Halevy et al. (2009)
A major drawback of the financial environment as introduced in the previous chapter is that it relies by default on a single, historical financial time series. Such a data set is too limited for training a deep Q-learning (DQL) agent. It is like training an AI on a single game of chess and expecting it to play chess well in general.
This chapter introduces simulation-based approaches to augmenting the available data for the training of a DQL agent. The first approach, as introduced in “Noisy Time Series Data”, is to add random noise to a static financial time series. Although it is commonly agreed upon that financial time series data generally already contains noise—as compared to price movements or returns that are information induced—the idea is to train the agent on a large number of similar time series in the hope that it learns to distinguish information from noise.
The second approach, discussed in “Simulated Time Series Data”, is to generate financial time series data through simulation under certain constraints and assumptions. In general, a stochastic differential equation is assumed for the dynamics of the time series. The time series is then simulated given a discretization scheme and appropriate boundary conditions. This is one of the core numerical approaches used in computational finance to price financial derivatives or to manage financial risks, for example (see Glasserman [2004]).
Both data augmentation methods discussed in this chapter make it possible to generate an unlimited amount of training, validation, and test data for reinforcement learning.
Noisy Time Series Data
This section adjusts the first Finance environment from “Finance Environment” to add white noise, which is normally distributed data, to the original financial time series. First, add the helper class for the action space:
In [1]: import random

        class ActionSpace:
            def sample(self):
                return random.randint(0, 1)
The new NoisyData environment class only requires a few adjustments compared with the original Finance class. In the following Python code, two parameters are added to the initialization method:
In [2]: import numpy as np
        import pandas as pd
        from numpy.random import default_rng

In [3]: rng = default_rng(seed=100)

In [4]: class NoisyData:
            url = 'https://certificate.tpq.io/findata.csv'
            def __init__(self, symbol, feature, n_features=4,
                         min_accuracy=0.485, noise=True,
                         noise_std=0.001):
                self.symbol = symbol
                self.feature = feature
                self.n_features = n_features
                self.noise = noise
                self.noise_std = noise_std
                self.action_space = ActionSpace()
                self.min_accuracy = min_accuracy
                self._get_data()
                self._prepare_data()
            def _get_data(self):
                self.raw = pd.read_csv(self.url,
                                       index_col=0, parse_dates=True)
The random number generator is imported and initialized.
The flag that specifies whether noise is added or not.
The noise level to be used when adjusting the data; it is given as a fraction of the mean price level (for example, 0.001 corresponds to 0.1%).
The following part of the Python class code is the most important one. It is where the noise is added to the original time series data:
In [5]: class NoisyData(NoisyData):
            def _prepare_data(self):
                self.data = pd.DataFrame(self.raw[self.symbol]).dropna()
                if self.noise:
                    std = self.data.mean() * self.noise_std
                    self.data[self.symbol] = (self.data[self.symbol] +
                        rng.normal(0, std, len(self.data)))
                self.data['r'] = np.log(self.data / self.data.shift(1))
                self.data['d'] = np.where(self.data['r'] > 0, 1, 0)
                self.data.dropna(inplace=True)
                ma, mi = self.data.max(), self.data.min()
                self.data_ = (self.data - mi) / (ma - mi)
            def reset(self):
                if self.noise:
                    self._prepare_data()
                self.bar = self.n_features
                self.treward = 0
                state = self.data_[self.feature].iloc[
                    self.bar - self.n_features:self.bar].values
                return state, {}
The standard deviation for the noise is calculated in absolute terms.
The white noise is added to the time series data.
The features data is normalized through min-max scaling.
A new noisy time series data set is generated.
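The two core steps of the data preparation, adding white noise and min-max scaling, can be sketched in isolation. This is a minimal illustration, not the class code itself: the sine-shaped price series and the helper names `add_noise` and `min_max_scale` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from numpy.random import default_rng

rng = default_rng(seed=100)

# Hypothetical price series standing in for the historical data.
prices = pd.Series(1.10 + 0.01 * np.sin(np.linspace(0, 6, 250)), name='price')


def add_noise(series, noise_std, rng):
    """Return a copy of `series` with zero-mean Gaussian noise added.

    The noise standard deviation is `noise_std` times the mean level,
    mirroring the absolute-terms calculation in `_prepare_data`.
    """
    std = series.mean() * noise_std
    return series + rng.normal(0, std, len(series))


def min_max_scale(series):
    """Scale a series linearly to the interval [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


# Two calls give two different noisy versions of the same series.
noisy_a = add_noise(prices, 0.005, rng)
noisy_b = add_noise(prices, 0.005, rng)
scaled = min_max_scale(noisy_a)
```

Because the generator state advances with every call, each reset of an environment built this way would see a different noisy series, while the scaled values always stay within [0, 1].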
Information Versus Noise
Generally, it is assumed that financial time series data includes a certain amount of noise already. Investopedia defines noise as follows: “Noise refers to information or activity that confuses or misrepresents genuine underlying trends.” In this section, we take the historical price series as given and actively add noise to it. The idea is that a DQL agent learns about the fundamental price and/or return trends embodied by the historical data set.
The final part of the Python class, the .step() method, can remain unchanged:
In [6]: class NoisyData(NoisyData):
            def step(self, action):
                if action == self.data['d'].iloc[self.bar]:
                    correct = True
                else:
                    correct = False
                reward = 1 if correct else 0
                self.treward += reward
                self.bar += 1
                self.accuracy = self.treward / (self.bar - self.n_features)
                if self.bar >= len(self.data):
                    done = True
                elif reward == 1:
                    done = False
                elif (self.accuracy < self.min_accuracy and
                      self.bar > self.n_features + 15):
                    done = True
                else:
                    done = False
                next_state = self.data_[self.feature].iloc[
                    self.bar - self.n_features:self.bar].values
                return next_state, reward, done, False, {}
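The termination rule inside the .step() method can be easier to follow as a standalone function. The following is a minimal sketch that mirrors the branching shown above; the function name `episode_done` and the `grace` argument are hypothetical and not part of the environment class.

```python
def episode_done(bar, n_bars, reward, accuracy, min_accuracy,
                 n_features, grace=15):
    """Mirror the termination logic of the step() method.

    All argument names are illustrative; `grace` is the number of bars
    after the initial window during which low accuracy is tolerated.
    """
    if bar >= n_bars:          # end of the data set reached
        return True
    if reward == 1:            # a correct prediction never ends the episode
        return False
    # A wrong prediction ends the episode only once the accuracy is too
    # low and the grace period is over.
    return accuracy < min_accuracy and bar > n_features + grace


# A wrong prediction early on is tolerated ...
early = episode_done(bar=10, n_bars=100, reward=0, accuracy=0.0,
                     min_accuracy=0.485, n_features=4)
# ... but not after the grace period when the accuracy is too low.
late = episode_done(bar=20, n_bars=100, reward=0, accuracy=0.3,
                    min_accuracy=0.485, n_features=4)
print(early, late)
```

The design means that an agent is only stopped early when it is both wrong on the current bar and below the minimum accuracy overall.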
Every time the financial environment is reset, a new time series is created by adding noise to the original time series. The following Python code illustrates this numerically:
In [7]: fin = NoisyData(symbol='EUR=', feature='EUR=',
                        noise=True, noise_std=0.005)

In [8]: fin.reset()
Out[8]: (array([0.79295659, 0.81097879, 0.78840972, 0.80597193]), {})

In [9]: fin.reset()
Out[9]: (array([0.80642276, 0.77840938, 0.80096369, 0.76938581]), {})

In [10]: fin = NoisyData('EUR=', 'r', n_features=4,
                         noise=True, noise_std=0.005)

In [11]: fin.reset()
Out[11]: (array([0.54198375, 0.30674865, 0.45688528, 0.52884033]), {})

In [12]: fin.reset()
Out[12]: (array([0.37967631, 0.40190291, 0.49196183, 0.47536065]), {})
Different initial states for the normalized price data
Different initial states for the normalized returns data
Finally, the following code visualizes several noisy time series data sets (see Figure 4-1):
In [13]: from pylab import plt, mpl
         plt.style.use('seaborn-v0_8')
         mpl.rcParams['figure.dpi'] = 300
         mpl.rcParams['savefig.dpi'] = 300
         mpl.rcParams['font.family'] = 'serif'

In [14]: import warnings
         warnings.simplefilter('ignore')

In [15]: for _ in range(5):
             fin.reset()
             fin.data[fin.symbol].loc['2022-7-1':].plot(lw=0.75, c='b')
Using the new type of environment, the DQL agent—see the Python class in “DQLAgent Python Class”—can now be trained with a new, noisy data set for each episode. As the following Python code shows, the agent learns to distinguish between information (original movements) and the noisy components quite well:
In [16]: %run dqlagent.py

In [17]: os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [18]: agent = DQLAgent(fin.symbol, fin.feature, fin.n_features, fin)

In [19]: %time agent.learn(250)
         episode=250 | treward=8.00 | max=1441.00
         CPU times: user 27.3 s, sys: 3.92 s, total: 31.2 s
         Wall time: 26.9 s

In [20]: agent.test(5)
         total reward=2604 | accuracy=0.601
         total reward=2604 | accuracy=0.590
         total reward=2604 | accuracy=0.597
         total reward=2604 | accuracy=0.593
         total reward=2604 | accuracy=0.617
Simulated Time Series Data
In “Noisy Time Series Data”, a historical financial time series is adjusted by adding white noise to it. Here the financial time series itself is simulated under suitable assumptions. Both approaches have in common that they allow the generation of an infinite number of different paths. However, using the Monte Carlo simulation (MCS) approach in this section leads to quite different paths in general that only, on average, show desired properties—such as a certain drift or a certain volatility.
In the following, a stochastic process according to Vasicek (1977) is simulated. Originally used to model the stochastic evolution of interest rates, it allows the simulation of trending or mean-reverting financial time series. The Vasicek model with proportional volatility is described through the following stochastic differential equation (see footnote 1):

$$dx_t = \kappa(\theta - x_t)\,dt + \sigma x_t\,dZ_t$$

The variables and parameters have the following meanings: $x_t$ is the process level at date $t$, $\kappa$ is the mean-reversion factor, $\theta$ is the long-term mean of the process, and $\sigma$ is the constant volatility parameter for $Z_t$, which is a standard Brownian motion.

For the simulations, an Euler-Maruyama discretization scheme is used (with $s = t - \Delta t$ and $z_t$ being standard normal):

$$x_t = x_s + \kappa(\theta - x_s)\,\Delta t + \sigma x_s \sqrt{\Delta t}\, z_t$$
The Simulation class implements a financial environment that relies on the simulation of the stochastic process previously mentioned. The following Python code shows the initialization part of the class:
In [21]: class Simulation:
             def __init__(self, symbol, feature, n_features,
                          start, end, periods,
                          min_accuracy=0.525, x0=100,
                          kappa=1, theta=100, sigma=0.2,
                          normalize=True, new=False):
                 self.symbol = symbol
                 self.feature = feature
                 self.n_features = n_features
                 self.start = start
                 self.end = end
                 self.periods = periods
                 self.x0 = x0
                 self.kappa = kappa
                 self.theta = theta
                 self.sigma = sigma
                 self.min_accuracy = min_accuracy
                 self.normalize = normalize
                 self.new = new
                 self.action_space = ActionSpace()
                 self._simulate_data()
                 self._prepare_data()
The start date for the simulation
The end date for the simulation
The number of periods to be simulated
The model parameters for the simulation
The minimum accuracy required to continue
The parameter indicating whether normalization is applied to the data or not
The parameter indicating whether a new simulation is initiated for every episode or not
The following Python code shows the core method of the class. It implements the MCS for the stochastic process:
In [22]: import math
         class Simulation(Simulation):
             def _simulate_data(self):
                 index = pd.date_range(start=self.start,
                             end=self.end, periods=self.periods)
                 x = [self.x0]
                 dt = (index[-1] - index[0]).days / 365 / self.periods
                 for t in range(1, len(index)):
                     x_ = (x[t - 1] + self.kappa * (self.theta - x[t - 1]) * dt +
                           x[t - 1] * self.sigma * math.sqrt(dt) *
                           random.gauss(0, 1))
                     x.append(x_)
                 self.data = pd.DataFrame(x, columns=[self.symbol], index=index)
The initial value of the process (the boundary condition).
The length of the time interval, given the one-year horizon and the number of steps.
The Euler-Maruyama discretization scheme for the simulation itself.
The simulated value is appended to the list object.
The simulated process is transformed into a DataFrame object.
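To see the "on average" behavior of the simulated paths, the same Euler-Maruyama step can be applied to many paths at once with NumPy. The following is a minimal sketch under stated assumptions: the function `simulate_paths` is hypothetical, it uses `dt = T / steps` rather than the date-based calculation of the class, and it draws from a NumPy generator instead of the `random` module.

```python
import numpy as np


def simulate_paths(x0, kappa, theta, sigma, T, steps, n_paths, seed=None):
    """Euler-Maruyama paths for dx = kappa*(theta - x)*dt + sigma*x*dZ.

    Returns an array of shape (steps + 1, n_paths); row 0 holds x0.
    """
    rng = np.random.default_rng(seed)
    dt = T / steps
    x = np.empty((steps + 1, n_paths))
    x[0] = x0
    for t in range(1, steps + 1):
        z = rng.standard_normal(n_paths)  # one shock per path
        x[t] = (x[t - 1] + kappa * (theta - x[t - 1]) * dt
                + sigma * x[t - 1] * np.sqrt(dt) * z)
    return x


paths = simulate_paths(x0=1.0, kappa=1.0, theta=2.0, sigma=0.1,
                       T=1.0, steps=252, n_paths=10_000, seed=100)

# The drift is linear in x, so the mean follows
# theta + (x0 - theta) * exp(-kappa * T) up to discretization error.
expected_mean = 2.0 + (1.0 - 2.0) * np.exp(-1.0)
print(paths[-1].mean(), expected_mean)
```

Individual paths differ widely, but the cross-sectional mean at the horizon should be close to the analytical value, which is one simple way to sanity-check a discretization scheme.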
Data preparation is taken care of by the following code:
In [23]: class Simulation(Simulation):
             def _prepare_data(self):
                 self.data['r'] = np.log(self.data / self.data.shift(1))
                 self.data.dropna(inplace=True)
                 if self.normalize:
                     self.mu = self.data.mean()
                     self.std = self.data.std()
                     self.data_ = (self.data - self.mu) / self.std
                 else:
                     self.data_ = self.data.copy()
                 self.data['d'] = np.where(self.data['r'] > 0, 1, 0)
                 self.data['d'] = self.data['d'].astype(int)
Derives the log returns for the simulated process
Applies Gaussian normalization to the data
Derives the directional values from the log returns
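The three preparation steps can be tried out on any positive series. The sketch below assumes a hypothetical random-walk series in a column named `x`; it is meant only to show what the log returns, the Gaussian normalization, and the directional labels look like, not to replace the class method.

```python
import numpy as np
import pandas as pd

# Hypothetical positive price series (geometric random walk).
rng = np.random.default_rng(100)
data = pd.DataFrame({'x': np.exp(0.01 * rng.standard_normal(500).cumsum())})

# Log returns; the first row has no predecessor and is dropped.
data['r'] = np.log(data['x'] / data['x'].shift(1))
data.dropna(inplace=True)

# Gaussian (z-score) normalization, column by column.
mu, std = data.mean(), data.std()
data_ = (data - mu) / std

# Directional label: 1 for an upward move, 0 otherwise.
data['d'] = np.where(data['r'] > 0, 1, 0)

print(data_.describe().loc[['mean', 'std']])
```

As in the class method, the label column is added after the normalization, so the normalized features in `data_` contain only the level and the log return.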
The following methods are helper methods and allow you, for example, to reset the environment:
In [24]: class Simulation(Simulation):
             def _get_state(self):
                 return self.data_[self.feature].iloc[self.bar -
                                         self.n_features:self.bar]
             def seed(self, seed):
                 random.seed(seed)
                 tf.random.set_seed(seed)
             def reset(self):
                 self.treward = 0
                 self.accuracy = 0
                 self.bar = self.n_features
                 if self.new:
                     self._simulate_data()
                     self._prepare_data()
                 state = self._get_state()
                 return state.values, {}
The final method, .step(), is essentially the same as for the NoisyData class; only the bar threshold for the minimum-accuracy check differs:
In [25]: class Simulation(Simulation):
             def step(self, action):
                 if action == self.data['d'].iloc[self.bar]:
                     correct = True
                 else:
                     correct = False
                 reward = 1 if correct else 0
                 self.treward += reward
                 self.bar += 1
                 self.accuracy = self.treward / (self.bar - self.n_features)
                 if self.bar >= len(self.data):
                     done = True
                 elif reward == 1:
                     done = False
                 elif (self.accuracy < self.min_accuracy and
                       self.bar > 25):
                     done = True
                 else:
                     done = False
                 next_state = self.data_[self.feature].iloc[
                     self.bar - self.n_features:self.bar].values
                 return next_state, reward, done, False, {}
With the complete Simulation class, different processes can be simulated. The next code snippet uses three different sets of parameters:
- Baseline: No volatility and trending (long-term mean > initial value)
- Trend: Volatility and trending (long-term mean > initial value)
- Mean-reversion: Volatility and mean-reverting (long-term mean = initial value)
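For the baseline case, the path can be worked out by hand, which gives a useful reference point for the simulated values. With zero volatility the stochastic differential equation reduces to an ordinary differential equation with a closed-form solution. The numbers below assume the baseline parameters used in this section (initial value 1, `kappa` of 1, `theta` of 1.1) and a horizon of exactly one year:

```latex
\begin{aligned}
\frac{dx_t}{dt} &= \kappa(\theta - x_t), \qquad x_0 \text{ given} \\
x_t &= \theta + (x_0 - \theta)\, e^{-\kappa t} \\
x_1 &= 1.1 + (1 - 1.1)\, e^{-1} \approx 1.0632
\end{aligned}
```

The process therefore closes only about 63% of the gap to the long-term mean within one year; a larger mean-reversion factor would close it faster.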
Figure 4-2 shows the simulated processes graphically:
In [26]: sym = 'EUR='

In [27]: env_base = Simulation(sym, sym, 5, start='2024-1-1', end='2025-1-1',
                          periods=252, x0=1, kappa=1, theta=1.1, sigma=0.0,
                          normalize=True)
         env_base.seed(100)

In [28]: env_trend = Simulation(sym, sym, 5, start='2024-1-1', end='2025-1-1',
                          periods=252, x0=1, kappa=1, theta=2, sigma=0.1,
                          normalize=True)
         env_trend.seed(100)

In [29]: env_mrev = Simulation(sym, sym, 5, start='2024-1-1', end='2025-1-1',
                          periods=252, x0=1, kappa=1, theta=1, sigma=0.1,
                          normalize=True)
         env_mrev.seed(100)

In [30]: env_mrev.data[sym].iloc[:3]
Out[30]: 2024-01-02 10:59:45.657370517    1.004236
         2024-01-03 21:59:31.314741035    1.009752
         2024-01-05 08:59:16.972111553    1.011010
         Name: EUR=, dtype: float64

In [31]: env_base.data[sym].plot(figsize=(10, 6), label='baseline', style='r')
         env_trend.data[sym].plot(label='trend', style='b:')
         env_mrev.data[sym].plot(label='mean-reversion', style='g--')
         plt.legend();
Model Parameter Choice
The Vasicek (1977) model provides a certain degree of flexibility to simulate stochastic processes with different characteristics. However, in practical applications, the parameters would not be chosen arbitrarily but rather derived—through optimization methods—from market-observed data. This procedure is generally called model calibration and has a long tradition in computational finance. See, for example, Hilpisch (2015) for more details.
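To give a flavor of what deriving parameters from data can look like, the following sketch recovers the parameters from a single simulated path by ordinary least squares. This is a deliberately simplified stand-in for a real calibration, not the option-based procedure referred to above: the path is simulated rather than market-observed, the parameter values are assumptions chosen for illustration, and the low volatility makes the estimation easier than it would be in practice.

```python
import numpy as np

# Simulate one path with known parameters (assumed values for illustration).
rng = np.random.default_rng(100)
kappa, theta, sigma = 2.0, 2.0, 0.02
T, steps = 4.0, 5000
dt = T / steps
x = np.empty(steps + 1)
x[0] = 1.0
for t in range(1, steps + 1):
    x[t] = (x[t - 1] + kappa * (theta - x[t - 1]) * dt
            + sigma * x[t - 1] * np.sqrt(dt) * rng.standard_normal())

# Dividing the Euler step by x[t-1] gives a linear regression:
#   (x[t] - x[t-1]) / x[t-1] = a * (1 / x[t-1]) + b + noise
# with a = kappa * theta * dt and b = -kappa * dt.
y = np.diff(x) / x[:-1]
A = np.column_stack([1 / x[:-1], np.ones(steps)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Map the regression coefficients back to the model parameters.
kappa_hat = -b / dt
theta_hat = -a / b
resid = y - A @ np.array([a, b])
sigma_hat = resid.std(ddof=2) / np.sqrt(dt)
print(kappa_hat, theta_hat, sigma_hat)
```

With higher volatility or a path that starts close to its long-term mean, the estimate of the mean-reversion factor in particular becomes much less precise, which is one reason why practical calibration tends to rely on richer data and dedicated optimization routines.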
With the parameter new=True, resetting the Simulation environment generates a new simulated process, as Figure 4-3 illustrates (the default, new=False, reuses the same path):
In [32]: sim = Simulation(sym, 'r', 4, start='2024-1-1', end='2028-1-1',
                          periods=2 * 252, min_accuracy=0.485, x0=1,
                          kappa=2, theta=2, sigma=0.15,
                          normalize=True, new=True)
         sim.seed(100)

In [33]: for _ in range(10):
             sim.reset()
             sim.data[sym].plot(figsize=(10, 6), lw=1.0, c='b');
The DQLAgent from “DQLAgent Python Class” works with this environment in the same way it worked with the NoisyData environment in the previous section. The following example uses the parametrization from before for the Simulation environment, which is a trending case. The agent learns to predict the future directional movement with an accuracy above 50% in each test run:
In [34]: agent = DQLAgent(sim.symbol, sim.feature,
                          sim.n_features, sim, lr=0.0001)

In [35]: %time agent.learn(500)
         episode=500 | treward=265.00 | max=286.00
         CPU times: user 42.1 s, sys: 5.87 s, total: 47.9 s
         Wall time: 40.1 s

In [36]: agent.test(5)
         total reward=499 | accuracy=0.547
         total reward=499 | accuracy=0.515
         total reward=499 | accuracy=0.561
         total reward=499 | accuracy=0.533
         total reward=499 | accuracy=0.549
The next example assumes a mean-reverting case, in which the DQLAgent is not able to predict the future directional movements as well as before. It seems that learning a trend might be easier than learning from simulated mean-reverting processes:
In [37]: sim = Simulation(sym, 'r', 4, start='2024-1-1', end='2028-1-1',
                          periods=2 * 252, min_accuracy=0.6, x0=1,
                          kappa=1.25, theta=1, sigma=0.15,
                          normalize=True, new=True)
         sim.seed(100)

In [38]: agent = DQLAgent(sim.symbol, sim.feature,
                          sim.n_features, sim, lr=0.0001)

In [39]: %time agent.learn(500)
         episode=500 | treward=12.00 | max=70.00
         CPU times: user 17.8 s, sys: 2.66 s, total: 20.4 s
         Wall time: 16.3 s

In [40]: agent.test(5)
         total reward=499 | accuracy=0.487
         total reward=499 | accuracy=0.495
         total reward=499 | accuracy=0.511
         total reward=499 | accuracy=0.487
         total reward=499 | accuracy=0.449
Conclusions
The addition of white noise to a historical financial time series allows, in principle, the generation of an unlimited number of data sets to train a DQL agent. Varying the degree of noise (i.e., the standard deviation) may cause the adjusted time series data to be close to or very different from the original time series. In turn, this can make it easier or more difficult for the DQL agent to learn to distinguish information from the added noise.
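How strongly a given noise level distorts the data can be made concrete by counting how many directional labels change once the noise is added. The sketch below assumes a hypothetical geometric random walk as the "original" series; the function `flipped_share` is introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(100)
# Hypothetical "original" price path: a geometric random walk.
prices = 1.10 * np.exp(0.005 * rng.standard_normal(2000).cumsum())
true_dir = np.diff(prices) > 0


def flipped_share(noise_std):
    """Share of directional labels changed by adding white noise."""
    std = prices.mean() * noise_std
    noisy = prices + rng.normal(0, std, len(prices))
    return np.mean((np.diff(noisy) > 0) != true_dir)


for level in (0.0001, 0.001, 0.01):
    print(level, flipped_share(level))
```

When the noise is small relative to the typical price change, almost all directions are preserved; when it is of the same order or larger, a substantial share of the labels flips and the task for the agent changes accordingly.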
Simulation approaches were introduced to finance long before the widespread adoption of computers in the industry. Boyle (1977) is considered the seminal article in this regard. Glasserman (2004) provides a comprehensive overview of MCS techniques for finance.
Using MCS for stochastic processes allows the simulation of trending and mean-reverting processes. Typical trending financial time series are stock index levels or individual stock prices. Typical mean-reverting financial time series are foreign exchange (FX) rates or commodity prices.
In this chapter, the parameters for the simulation are assumed “out of the blue.” In a more realistic setting, appropriate parameter values could be found, for example, through the calibration of the Vasicek (1977) model to the prices of liquidly traded options, an approach with a long tradition in computational finance (see footnote 3).
The examples in this chapter show that the DQLAgent can more easily learn about trending time series than about mean-reverting ones. The next chapter turns our attention to generative approaches to the creation of synthetic time series data based on neural networks.
References
- Boyle, Phelim P. “Options: A Monte Carlo Approach.” Journal of Financial Economics 4, no. 3 (1977): 323–338.
- Brennan, M. J., and E. S. Schwartz. “An Equilibrium Model of Bond Pricing and a Test of Market Efficiency.” Journal of Financial and Quantitative Analysis 15, no. 3 (1980): 361–372.
- Glasserman, Paul. Monte Carlo Methods in Financial Engineering. New York: Springer, 2004.
- Halevy, Alon, Peter Norvig, and Fernando Pereira. “The Unreasonable Effectiveness of Data.” IEEE Intelligent Systems 24, no. 2 (May 2009): 8–12.
- Hilpisch, Yves. Derivatives Analytics with Python: Data Analysis, Models, Simulation, Calibration, and Hedging. Chichester, UK: Wiley Finance, 2015.
- Hilpisch, Yves. Python for Finance: Mastering Data-Driven Finance. 2nd ed. Sebastopol, CA: O’Reilly, 2018.
- Vasicek, Oldrich. “An Equilibrium Characterization of the Term Structure.” Journal of Financial Economics 5, no. 2 (November 1977): 177–188.
DQLAgent Python Class
The following Python code is from the dqlagent.py module and contains the DQLAgent class used in this chapter:
#
# Deep Q-Learning Agent
#
# (c) Dr. Yves J. Hilpisch
# Reinforcement Learning for Finance
#

import os
import random
import warnings
import numpy as np
import tensorflow as tf
from tensorflow import keras
from collections import deque
from keras.layers import Dense, Flatten
from keras.models import Sequential

warnings.simplefilter('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

opt = keras.optimizers.legacy.Adam


class DQLAgent:
    def __init__(self, symbol, feature, n_features, env, hu=24, lr=0.001):
        self.epsilon = 1.0
        self.epsilon_decay = 0.9975
        self.epsilon_min = 0.1
        self.memory = deque(maxlen=2000)
        self.batch_size = 32
        self.gamma = 0.5
        self.trewards = list()
        self.max_treward = -np.inf
        self.n_features = n_features
        self.env = env
        self.episodes = 0
        self._create_model(hu, lr)

    def _create_model(self, hu, lr):
        self.model = Sequential()
        self.model.add(Dense(hu, activation='relu',
                             input_dim=self.n_features))
        self.model.add(Dense(hu, activation='relu'))
        self.model.add(Dense(2, activation='linear'))
        self.model.compile(loss='mse', optimizer=opt(learning_rate=lr))

    def _reshape(self, state):
        state = state.flatten()
        return np.reshape(state, [1, len(state)])

    def act(self, state):
        if random.random() < self.epsilon:
            return self.env.action_space.sample()
        return np.argmax(self.model.predict(state)[0])

    def replay(self):
        batch = random.sample(self.memory, self.batch_size)
        for state, action, next_state, reward, done in batch:
            if not done:
                reward += self.gamma * np.amax(
                    self.model.predict(next_state)[0])
            target = self.model.predict(state)
            target[0, action] = reward
            self.model.fit(state, target, epochs=1, verbose=False)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def learn(self, episodes):
        for e in range(1, episodes + 1):
            self.episodes += 1
            state, _ = self.env.reset()
            state = self._reshape(state)
            treward = 0
            for f in range(1, 5000):
                self.f = f
                action = self.act(state)
                next_state, reward, done, trunc, _ = self.env.step(action)
                treward += reward
                next_state = self._reshape(next_state)
                self.memory.append(
                    [state, action, next_state, reward, done])
                state = next_state
                if done:
                    self.trewards.append(treward)
                    self.max_treward = max(self.max_treward, treward)
                    templ = f'episode={self.episodes:4d} | '
                    templ += f'treward={treward:7.3f}'
                    templ += f' | max={self.max_treward:7.3f}'
                    print(templ, end='\r')
                    break
            if len(self.memory) > self.batch_size:
                self.replay()
        print()

    def test(self, episodes, min_accuracy=0.0,
             min_performance=0.0, verbose=True,
             full=True):
        ma = self.env.min_accuracy
        self.env.min_accuracy = min_accuracy
        if hasattr(self.env, 'min_performance'):
            mp = self.env.min_performance
            self.env.min_performance = min_performance
            self.performances = list()
        for e in range(1, episodes + 1):
            state, _ = self.env.reset()
            state = self._reshape(state)
            for f in range(1, 5001):
                action = np.argmax(self.model.predict(state)[0])
                state, reward, done, trunc, _ = self.env.step(action)
                state = self._reshape(state)
                if done:
                    templ = f'total reward={f:4d} | '
                    templ += f'accuracy={self.env.accuracy:.3f}'
                    if hasattr(self.env, 'min_performance'):
                        self.performances.append(self.env.performance)
                        templ += (f' | performance='
                                  f'{self.env.performance:.3f}')
                    if verbose:
                        if full:
                            print(templ)
                        else:
                            print(templ, end='\r')
                    break
        self.env.min_accuracy = ma
        if hasattr(self.env, 'min_performance'):
            self.env.min_performance = mp
        print()
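The exploration schedule of the agent is easy to check in isolation. The following sketch uses the default values from the class above and counts how many decay steps, one per call to the replay method, it takes until the exploration rate falls below its minimum; the variable `n_replays` is introduced only for this illustration.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.1, 0.9975

n_replays = 0
# Same rule as in replay(): decay only while epsilon is above the minimum.
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    n_replays += 1

print(n_replays, round(epsilon, 4))
```

With these defaults the loop runs 920 times, so an agent that is trained for only a few hundred replay calls still takes a considerable share of its actions at random.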
1 For more details on MCS with Python, see Chapter 12 of Hilpisch (2018). The Vasicek model with proportional volatility is also called the Brennan-Schwartz model. It dates back to the Brennan and Schwartz (1980) paper.
2 The careful observer will notice that the three processes do not start at exactly the same point on the graph. This is because the initial value gets “lost” after the calculation of the log returns and the cleanup of the DataFrame object.
3 For details, numerical techniques, and Python code examples in the context of financial model calibration, see Hilpisch (2015).