Chapter 4. Data-Driven Finance
If artificial intelligence is the new electricity, big data is the oil that powers the generators.
Kai-Fu Lee (2018)
Nowadays, analysts sift through non-traditional information such as satellite imagery and credit card data, or use artificial intelligence techniques such as machine learning and natural language processing to glean fresh insights from traditional sources such as economic data and earnings-call transcripts.
Robin Wigglesworth (2019)
This chapter discusses central aspects of data-driven finance. For the purposes of this book, data-driven finance is understood to be a financial context (theory, model, application, and so on) that is primarily driven by and based on insights gained from data.
“Scientific Method” discusses the scientific method, which is about generally accepted principles that should guide scientific effort. “Financial Econometrics and Regression” is about financial econometrics and related topics. “Data Availability” sheds light on which types of (financial) data are available today and in what quality and quantity via programmatic APIs. “Normative Theories Revisited” revisits the normative theories of Chapter 3 and analyzes them based on real financial time series data. Also based on real financial data, “Debunking Central Assumptions” debunks two of the most commonly found assumptions in financial models and theories: normality of returns and linear relationships.
Scientific Method
The scientific method refers to a set of generally accepted principles that should guide any scientific project. Wikipedia defines the scientific method as follows:
The scientific method is an empirical method of acquiring knowledge that has characterized the development of science since at least the 17th century. It involves careful observation, applying rigorous skepticism about what is observed, given that cognitive assumptions can distort how one interprets the observation. It involves formulating hypotheses, via induction, based on such observations; experimental and measurement-based testing of deductions drawn from the hypotheses; and refinement (or elimination) of the hypotheses based on the experimental findings. These are principles of the scientific method, as distinguished from a definitive series of steps applicable to all scientific enterprises.
Given this definition, normative finance, as discussed in Chapter 3, is in stark contrast to the scientific method. Normative financial theories mostly rely on assumptions and axioms in combination with deduction as the major analytical method to arrive at their central results.
- Expected utility theory (EUT) assumes that agents have the same utility function no matter what state of the world unfolds and that they maximize expected utility under conditions of uncertainty.
- Mean-variance portfolio (MVP) theory describes how investors should invest under conditions of uncertainty assuming that only the expected return and the expected volatility of a portfolio over one period count.
- The capital asset pricing model (CAPM) assumes that only the nondiversifiable market risk explains the expected return and the expected volatility of a stock over one period.
- Arbitrage pricing theory (APT) assumes that a number of identifiable risk factors explain the expected return and the expected volatility of a stock over time; admittedly, compared to the other theories, the formulation of APT is rather broad and allows for wide-ranging interpretations.
What characterizes the aforementioned normative financial theories is that they were originally derived under certain assumptions and axioms using “pen and paper” only, without any recourse to real-world data or observations. From a historical point of view, many of these theories were rigorously tested against real-world data only long after their publication dates. This can be explained primarily with better data availability and increased computational capabilities over time. After all, data and computation are the main ingredients for the application of statistical methods in practice. The discipline at the intersection of mathematics, statistics, and finance that applies such methods to financial market data is typically called financial econometrics, the topic of the next section.
Financial Econometrics and Regression
Adapting the definition provided by Investopedia for econometrics, one can define financial econometrics as follows:
[Financial] econometrics is the quantitative application of statistical and mathematical models using [financial] data to develop financial theories or test existing hypotheses in finance and to forecast future trends from historical data. It subjects real-world [financial] data to statistical trials and then compares and contrasts the results against the [financial] theory or theories being tested.
Alexander (2008b) provides a thorough and broad introduction to the field of financial econometrics. The second chapter of the book covers single- and multifactor models, such as the CAPM and APT. Alexander (2008b) is part of a series of four books called Market Risk Analysis. The first in the series, Alexander (2008a), covers theoretical background concepts, topics, and methods, such as MVP theory and the CAPM themselves. The book by Campbell (2018) is another comprehensive resource for financial theory and related econometric research.
One of the major tools in financial econometrics is regression, in both its univariate and multivariate forms. Regression is also a central tool in statistical learning in general. What is the difference between traditional mathematics and statistical learning? Although there is no general answer to this question (after all, statistics is a sub-field of mathematics), a simple example should emphasize a major difference relevant to the context of this book.
First is the standard mathematical way. Assume a mathematical function is given as follows:

$$f: \mathbb{R} \rightarrow \mathbb{R}, \quad x \mapsto 2 + \frac{1}{2} x$$

Given multiple values $x_i$, one can derive function values $y_i$ by applying the above definition:

$$y_i = f(x_i), \quad i = 1, 2, \ldots, n$$
The following Python code illustrates this based on a simple numerical example:
In [1]: import numpy as np

In [2]: def f(x):
            return 2 + 1 / 2 * x

In [3]: x = np.arange(-4, 5)
        x
Out[3]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [4]: y = f(x)
        y
Out[4]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
Second is the approach taken in statistical learning. Whereas in the preceding example, the function comes first and then the data is derived, this sequence is reversed in statistical learning. Here, the data is generally given and a functional relationship is to be found. In this context, $x$ is often called the independent variable and $y$ the dependent variable. Consequently, consider the following data:

$$(x_i, y_i), \quad i = 1, 2, \ldots, n$$

The problem is to find, for example, parameters $\alpha, \beta$ such that:

$$\hat{f}(x_i) \equiv \alpha + \beta x_i \approx y_i, \quad i = 1, 2, \ldots, n$$

Another way of writing this is by including residual values $\epsilon_i$:

$$\alpha + \beta x_i + \epsilon_i = y_i, \quad i = 1, 2, \ldots, n$$

In the context of ordinary least-squares (OLS) regression, $\alpha$ and $\beta$ are chosen to minimize the mean-squared error between the approximated values $\hat{y}_i$ and the real values $y_i$. The minimization problem, then, is as follows:

$$\min_{\alpha, \beta} \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2$$

In the case of simple OLS regression, as described previously, the optimal solutions are known in closed form and are as follows:

$$\beta = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad \alpha = \bar{y} - \beta \bar{x}$$

Here, $\mathrm{Cov}(x, y)$ stands for the covariance of $x$ and $y$, $\mathrm{Var}(x)$ for the variance of $x$, and $\bar{x}, \bar{y}$ for the mean values of $x$ and $y$.

Returning to the preceding numerical example, these insights can be used to derive the optimal parameters $\alpha$ and $\beta$ and, in this particular case, to recover the original definition of $f(x)$:
In [5]: x
Out[5]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [6]: y
Out[6]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

In [7]: beta = np.cov(x, y, ddof=0)[0, 1] / x.var()
        beta
Out[7]: 0.49999999999999994

In [8]: alpha = y.mean() - beta * x.mean()
        alpha
Out[8]: 2.0

In [9]: y_ = alpha + beta * x

In [10]: np.allclose(y_, y)
Out[10]: True
beta as derived from the covariance matrix and the variance of x
alpha as derived from beta and the mean values of x and y
Estimated values y_, given alpha and beta
The preceding example and those in Chapter 1 illustrate that the application of OLS regression to a given data set is in general straightforward. There are several reasons why OLS regression has become one of the central tools in econometrics and financial econometrics. Among them are the following:
- Centuries old: The least-squares approach, particularly in combination with regression, has been used for more than 200 years.1
- Simplicity: The mathematics behind OLS regression is easy to understand and easy to implement in programming.
- Scalability: There is basically no limit regarding the data size to which OLS regression can be applied.
- Flexibility: OLS regression can be applied to a wide range of problems and data sets.
- Speed: OLS regression is fast to evaluate, even on larger data sets.
- Availability: Efficient implementations in Python and many other programming languages are readily available.
However, as easy and straightforward as the application of OLS regression might be in general, the method rests on a number of assumptions—most of them related to the residuals—that are not always satisfied in practice.
- Linearity: The model is linear in its parameters, with regard to both the coefficients and the residuals.
- Independence: Independent variables are not perfectly (or to a high degree) correlated with each other (no multicollinearity).
- Zero mean: The mean value of the residuals is (close to) zero.
- No correlation: Residuals are not (strongly) correlated with the independent variables.
- Homoscedasticity: The standard deviation of the residuals is (almost) constant.
- No autocorrelation: The residuals are not (strongly) correlated with each other.
In practice, it is in general quite simple to test for the validity of the assumptions given a specific data set.
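One common route is to fit the regression with a package such as statsmodels, whose summary report includes diagnostics (for example, Durbin-Watson and Jarque-Bera statistics) that speak directly to several of the assumptions listed above. The following sketch uses a small, hypothetical noisy data set; statsmodels is an assumption here and is not otherwise used in this chapter:

import numpy as np
import statsmodels.api as sm

np.random.seed(100)
x = np.arange(-4, 5)  # hypothetical sample in the spirit of the earlier example
y = 2 + 0.5 * x + np.random.standard_normal(len(x)) * 0.5  # noisy observations

X = sm.add_constant(x)  # adds the intercept term (alpha)
model = sm.OLS(y, X).fit()  # OLS fit

print(model.summary())  # report includes residual-based diagnostics
resid = model.resid
print(resid.mean())  # zero mean of residuals (holds by construction for OLS)
print(np.corrcoef(resid, x)[0, 1])  # (no) correlation with the independent variable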
Data Availability
Financial econometrics is driven by statistical methods, such as regression, and the availability of financial data. From the 1950s to the 1990s, and even into the early 2000s, theoretical and empirical financial research was mainly driven by data sets that were small by today's standards and that mostly comprised end-of-day (EOD) data. Data availability is something that has changed dramatically over the last decade or so, with more and more types of financial and other data available in ever increasing granularity, quantity, and velocity.
Programmatic APIs
With regard to data-driven finance, what is important is not only what data is available but also how it can be accessed and processed. For quite a while now, finance professionals have relied on data terminals from companies such as Refinitiv (see Eikon Terminal) or Bloomberg (see Bloomberg Terminal), to mention just two of the leading providers. Newspapers, magazines, financial reports, and the like have long been replaced by such terminals as the primary source for financial information. However, the sheer volume and variety of data provided by such terminals cannot be consumed systematically by a single user or even large groups of finance professionals. Therefore, the major breakthrough in data-driven finance is to be seen in the programmatic availability of data via application programming interfaces (APIs) that allow the usage of computer code to select, retrieve, and process arbitrary data sets.
The remainder of this section is devoted to the illustration of such APIs by which even academics and retail investors can retrieve a wealth of different data sets. Before such examples are provided, Table 4-1 offers an overview of categories of data that are in general relevant in a financial context, as well as typical examples. In the table, structured data refers to numerical data types that often come in tabular structures, while unstructured data refers to data in the form of standard text that often has no structure beyond headers or paragraphs, for example. Alternative data refers to data types that are typically not considered financial data.
Time | Structured data | Unstructured data | Alternative data
---|---|---|---
Historical | Prices, fundamentals | News, texts | Web, social media, satellites
Streaming | Prices, volumes | News, filings | Web, social media, satellites, Internet of Things
Structured Historical Data
First, structured historical data types will be retrieved programmatically. To this end, the following Python code uses the Eikon Data API.2
To access data via the Eikon Data API, a local application, such as Refinitiv Workspace, must be running and the API access must be configured on the Python level:
In [11]: import eikon as ek
         import configparser

In [12]: c = configparser.ConfigParser()
         c.read('../aiif.cfg')
         ek.set_app_key(c['eikon']['app_id'])
2020-08-04 10:30:18,059 P[14938] [MainThread 4521459136] Error on handshake
 port 9000: ReadTimeout(ReadTimeout())
If these requirements are met, historical structured data can be retrieved via a single function call. For example, the following Python code retrieves EOD data for a set of symbols and a specified time interval:
In [14]: symbols = ['AAPL.O', 'MSFT.O', 'NFLX.O', 'AMZN.O']

In [15]: data = ek.get_timeseries(symbols,
                                  fields='CLOSE',
                                  start_date='2019-07-01',
                                  end_date='2020-07-01')

In [16]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 254 entries, 2019-07-01 to 2020-07-01
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   AAPL.O  254 non-null    float64
 1   MSFT.O  254 non-null    float64
 2   NFLX.O  254 non-null    float64
 3   AMZN.O  254 non-null    float64
dtypes: float64(4)
memory usage: 9.9 KB

In [17]: data.tail()
Out[17]: CLOSE       AAPL.O  MSFT.O  NFLX.O   AMZN.O
         Date
         2020-06-25  364.84  200.34  465.91  2754.58
         2020-06-26  353.63  196.33  443.40  2692.87
         2020-06-29  361.78  198.44  447.24  2680.38
         2020-06-30  364.80  203.51  455.04  2758.82
         2020-07-01  364.11  204.70  485.64  2878.70
Defines a list of RICs (symbols) to retrieve data for
Retrieves EOD Close prices for the list of RICs
Shows the meta information for the returned DataFrame object
Shows the final rows of the DataFrame object
Similarly, one-minute bars with OHLC fields can be retrieved with appropriate adjustments of the parameters:
In [18]: data = ek.get_timeseries('AMZN.O',
                                  fields='*',
                                  start_date='2020-08-03',
                                  end_date='2020-08-04',
                                  interval='minute')

In [19]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 911 entries, 2020-08-03 08:01:00 to 2020-08-04 00:00:00
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   HIGH    911 non-null    float64
 1   LOW     911 non-null    float64
 2   OPEN    911 non-null    float64
 3   CLOSE   911 non-null    float64
 4   COUNT   911 non-null    float64
 5   VOLUME  911 non-null    float64
dtypes: float64(6)
memory usage: 49.8 KB

In [20]: data.head()
Out[20]: AMZN.O                  HIGH      LOW     OPEN    CLOSE  COUNT  VOLUME
         Date
         2020-08-03 08:01:00  3190.00  3176.03  3176.03  3178.17   18.0   383.0
         2020-08-03 08:02:00  3183.02  3176.03  3180.00  3177.01   15.0   513.0
         2020-08-03 08:03:00  3179.91  3177.05  3179.91  3177.05    5.0    14.0
         2020-08-03 08:04:00  3184.00  3179.91  3179.91  3184.00    8.0   102.0
         2020-08-03 08:05:00  3184.91  3182.91  3183.30  3184.00   12.0   403.0
One can retrieve more than structured financial time series data from the Eikon Data API. Fundamental data can also be retrieved for a number of RICs and a number of different data fields at the same time, as the following Python code illustrates:
In [21]: data_grid, err = ek.get_data(['AAPL.O', 'IBM', 'GOOG.O', 'AMZN.O'],
                                      ['TR.TotalReturnYTD', 'TR.WACCBeta',
                                       'YRHIGH', 'YRLOW',
                                       'TR.Ebitda', 'TR.GrossProfit'])

In [22]: data_grid
Out[22]:   Instrument  YTD Total Return      Beta   YRHIGH      YRLOW        EBITDA  \
         0     AAPL.O         49.141271  1.221249   425.66   192.5800  7.647700e+10
         1        IBM         -5.019570  1.208156   158.75    90.5600  1.898600e+10
         2     GOOG.O         10.278829  1.067084  1586.99  1013.5361  4.757900e+10
         3     AMZN.O         68.406897  1.338106  3344.29  1626.0318  3.025600e+10

            Gross Profit
         0   98392000000
         1   36488000000
         2   89961000000
         3  114986000000
Programmatic Data Availability
Basically all structured financial data is available nowadays in programmatic fashion. Financial time series data, in this context, is the paramount example. However, other structured data types such as fundamental data are available in the same way, simplifying the work of quantitative analysts, traders, portfolio managers, and the like significantly.
Structured Streaming Data
Many applications in finance require real-time structured data, such as in algorithmic trading or market risk management. The following Python code makes use of the API of the Oanda Trading Platform and streams in real time a number of time stamps, bid quotes, and ask quotes for the Bitcoin price in USD:
In [23]: import tpqoa

In [24]: oa = tpqoa.tpqoa('../aiif.cfg')

In [25]: oa.stream_data('BTC_USD', stop=5)
2020-08-04T08:30:38.621075583Z 11298.8 11334.8
2020-08-04T08:30:50.485678488Z 11298.3 11334.3
2020-08-04T08:30:50.801666847Z 11297.3 11333.3
2020-08-04T08:30:51.326269990Z 11296.0 11332.0
2020-08-04T08:30:54.423973431Z 11296.6 11332.6
Printing out the streamed data fields is, of course, only for illustration. Certain financial applications might require sophisticated processing of the retrieved data and the generation of signals or statistics, for instance. Particularly during weekdays and trading hours, the number of price ticks streamed for financial instruments increases steadily, demanding powerful data processing capabilities on the part of financial institutions that need to process such data in real time or at least in near-real time ("near time").
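As a minimal illustration of such near-time processing, the following sketch collects a handful of bid/ask quotes, like the ones streamed above, in a pandas DataFrame and derives mid prices and a rolling mean from them. In a production setting, the DataFrame would be filled by the callback of the streaming API instead of a hard-coded list:

import pandas as pd

# quotes as (timestamp, bid, ask); values taken from the stream shown above
ticks = [('2020-08-04T08:30:38', 11298.8, 11334.8),
         ('2020-08-04T08:30:50', 11298.3, 11334.3),
         ('2020-08-04T08:30:51', 11296.0, 11332.0),
         ('2020-08-04T08:30:54', 11296.6, 11332.6)]

df = pd.DataFrame(ticks, columns=['time', 'bid', 'ask'])
df.index = pd.to_datetime(df.pop('time'))  # time stamps as DatetimeIndex

df['mid'] = (df['bid'] + df['ask']) / 2  # mid price per tick
df['sma'] = df['mid'].rolling(3, min_periods=1).mean()  # simple rolling statistic
print(df)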
The significance of this observation becomes clear when looking at Apple Inc. stock prices. One can calculate that there are roughly 10,000 EOD closing quotes for the Apple stock over a period of 40 years. (Apple Inc. went public on December 12, 1980.) The following code retrieves tick data for the Apple stock price for one hour only. The retrieved data set, which might not even be complete for the given time interval, has 50,000 data rows, or five times as many tick quotes as the EOD quotes accumulated over 40 years of trading:
In [26]: data = ek.get_timeseries('AAPL.O',
                                  fields='*',
                                  start_date='2020-08-03 15:00:00',
                                  end_date='2020-08-03 16:00:00',
                                  interval='tick')

In [27]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 50000 entries, 2020-08-03 15:26:24.889000 to 2020-08-03 15:59:59.762000
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   VALUE   49953 non-null  float64
 1   VOLUME  50000 non-null  float64
dtypes: float64(2)
memory usage: 1.1 MB

In [28]: data.head()
Out[28]: AAPL.O                    VALUE  VOLUME
         Date
         2020-08-03 15:26:24.889  439.06   175.0
         2020-08-03 15:26:24.889  439.08     3.0
         2020-08-03 15:26:24.890  439.08   100.0
         2020-08-03 15:26:24.890  439.08     5.0
         2020-08-03 15:26:24.899  439.10    35.0
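A typical first step in taming such tick data is to resample it to homogeneous time bars, which shrinks the data set to a manageable size. The following sketch assumes the tick DataFrame data (with the VALUE and VOLUME columns) from the retrieval above; the one-minute bar size is an arbitrary choice for illustration:

# resamples the AAPL.O tick data to one-minute bars
bars = data.resample('1min', label='right').agg({'VALUE': 'last',
                                                 'VOLUME': 'sum'})  # last price, total volume
bars.rename(columns={'VALUE': 'CLOSE'}, inplace=True)
bars.info()
print(bars.head())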
EOD Versus Tick Data
Most of the financial theories still applied today have their origin in a time when EOD data was basically the only type of financial data available. Today, financial institutions, and even retail traders and investors, are confronted with never-ending streams of real-time data. The example of Apple stock illustrates that, for a single stock, one trading hour can generate five times as many ticks as the amount of EOD data accumulated over a period of 40 years. This not only challenges actors in financial markets, but also puts into question whether existing financial theories can be applied to such an environment at all.
Unstructured Historical Data
Many important data sources in finance provide unstructured data only, such as financial news or company filings. Undoubtedly, machines are much better and faster than humans at crunching large amounts of structured, numerical data. However, recent advances in natural language processing (NLP) make machines better and faster at processing financial news too, for example. As of 2020, data service providers ingest roughly 1.5 million news articles on a daily basis. It is clear that this vast amount of text-based data cannot be processed properly by human beings.
Fortunately, unstructured data is also to a large extent available these days via programmatic APIs. The following Python code retrieves a number of news articles from the Eikon Data API related to the company Tesla, Inc. and its production. One article is selected and shown in full:
In [29]: news = ek.get_news_headlines('R:TSLA.O PRODUCTION',
                                      date_from='2020-06-01',
                                      date_to='2020-08-01',
                                      count=7)

In [30]: news
Out[30]:                                            versionCreated  \
         2020-07-29 11:02:31.276  2020-07-29 11:02:31.276000+00:00
         2020-07-28 00:59:48.000         2020-07-28 00:59:48+00:00
         2020-07-23 21:20:36.090  2020-07-23 21:20:36.090000+00:00
         2020-07-23 08:22:17.000         2020-07-23 08:22:17+00:00
         2020-07-23 07:08:48.000         2020-07-23 07:46:56+00:00
         2020-07-23 00:55:54.000         2020-07-23 00:55:54+00:00
         2020-07-22 21:35:42.640  2020-07-22 22:13:26.597000+00:00

                                                                                text  \
         2020-07-29 11:02:31.276  Tesla Launches Hiring Spree in China as It Pre...
         2020-07-28 00:59:48.000    Tesla hiring in Shanghai as production ramps up
         2020-07-23 21:20:36.090     Tesla speeds up Model 3 production in Shanghai
         2020-07-23 08:22:17.000  UPDATE 1-'Please mine more nickel,' Musk urges...
         2020-07-23 07:08:48.000  'Please mine more nickel,' Musk urges as Tesla...
         2020-07-23 00:55:54.000  USA-Tesla choisit le Texas pour la production ...
         2020-07-22 21:35:42.640  TESLA INC - THE REAL LIMITATION ON TESLA GROWT...

                                                                        storyId  \
         2020-07-29 11:02:31.276  urn:newsml:reuters.com:20200729:nCXG3W8s9X:1
         2020-07-28 00:59:48.000  urn:newsml:reuters.com:20200728:nL3N2EY3PG:8
         2020-07-23 21:20:36.090  urn:newsml:reuters.com:20200723:nNRAcf1v8f:1
         2020-07-23 08:22:17.000  urn:newsml:reuters.com:20200723:nL3N2EU1P9:1
         2020-07-23 07:08:48.000  urn:newsml:reuters.com:20200723:nL3N2EU0HH:1
         2020-07-23 00:55:54.000  urn:newsml:reuters.com:20200723:nL5N2EU03M:1
         2020-07-22 21:35:42.640  urn:newsml:reuters.com:20200722:nFWN2ET120:2

                                  sourceCode
         2020-07-29 11:02:31.276  NS:CAIXIN
         2020-07-28 00:59:48.000  NS:RTRS
         2020-07-23 21:20:36.090  NS:SOUTHC
         2020-07-23 08:22:17.000  NS:RTRS
         2020-07-23 07:08:48.000  NS:RTRS
         2020-07-23 00:55:54.000  NS:RTRS
         2020-07-22 21:35:42.640  NS:RTRS

In [31]: storyId = news['storyId'][1]

In [32]: from IPython.display import HTML

In [33]: HTML(ek.get_news_story(storyId)[:1148])
Out[33]: <IPython.core.display.HTML object>
Jan 06, 2020 Tesla, Inc.TSLA registered record production and deliveries of 104,891 and 112,000 vehicles, respectively, in the fourth quarter of 2019. Notably, the company's Model S/X and Model 3 reported record production and deliveries in the fourth quarter. The Model S/X division recorded production and delivery volume of 17,933 and 19,450 vehicles, respectively. The Model 3 division registered production of 86,958 vehicles, while 92,550 vehicles were delivered. In 2019, Tesla delivered 367,500 vehicles, reflecting an increase of 50%, year over year, and nearly in line with the company's full-year guidance of 360,000 vehicles.
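To give a flavor of how such text might be processed further, the following sketch derives a crude keyword summary from the first sentences of the article by counting word frequencies. The article variable and the minimal stop-word list are illustrative assumptions and not part of the Eikon workflow shown above:

import re
from collections import Counter

# first sentences of the news story shown above (assumed to be available as plain text)
article = '''Tesla, Inc. registered record production and deliveries of 104,891
and 112,000 vehicles, respectively, in the fourth quarter of 2019. Notably, the
company's Model S/X and Model 3 reported record production and deliveries in
the fourth quarter.'''

stop_words = {'and', 'of', 'in', 'the', 'respectively', 'notably'}  # minimal stop-word list
words = re.findall(r'[a-z]+', article.lower())  # tokenizes into lowercase words
keywords = Counter(w for w in words if w not in stop_words and len(w) > 2)
print(keywords.most_common(5))  # crude keyword summary of the article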
Unstructured Streaming Data
In the same way that historical unstructured data is retrieved, programmatic APIs can be used to stream unstructured news data, for example, in real time or at least near time. One such API is available for DNA: the Data, News, Analytics platform from Dow Jones. Figure 4-1 shows the screenshot of a web application that streams “Commodity and Financial News” articles and processes these with NLP techniques in real time.
The news-streaming application has the following main features:
- Full text: The full text of each article is available by clicking on the article header.
- Keyword summary: A keyword summary is created and printed on the screen.
- Sentiment analysis: Sentiment scores are calculated and visualized as colored arrows. Details become visible through a click on the arrows.
- Word cloud: A word cloud summary bitmap is created, shown as a thumbnail and visible after a click on the thumbnail (see Figure 4-2).
Alternative Data
Nowadays, financial institutions, and in particular hedge funds, systematically mine a number of alternative data sources to gain an edge in trading and investing. A recent article by Bloomberg lists, among others, the following alternative data sources:
- Web-scraped data
- Crowd-sourced data
- Credit cards and point-of-sale (POS) systems
- Social media sentiment
- Search trends
- Web traffic
- Supply chain data
- Energy production data
- Consumer profiles
- Satellite imagery/geospatial data
- App installs
- Ocean vessel tracking
- Wearables, drones, Internet of Things (IoT) sensors
In the following, the usage of alternative data is illustrated by two examples. The first retrieves and processes Apple Inc. press releases in the form of HTML pages. The following Python code makes use of a set of helper functions as shown in “Python Code”. In the code, a list of URLs is defined, each representing an HTML page with a press release from Apple Inc. The raw HTML code is then retrieved for each press release. Then the raw code is cleaned up, and an excerpt for one press release is printed:
In [34]: import nlp
         import requests

In [35]: sources = [
             'https://nr.apple.com/dE0b1T5G3u',  # iPad Pro
             'https://nr.apple.com/dE4c7T6g1K',  # MacBook Air
             'https://nr.apple.com/dE4q4r8A2A',  # Mac Mini
         ]

In [36]: html = [requests.get(url).text for url in sources]

In [37]: data = [nlp.clean_up_text(t) for t in html]

In [38]: data[0][536:1001]
Out[38]: 'display, powerful a12x bionic chip and face id introducing the new ipad pro
with all-screen design and next-generation performance. new york apple today
introduced the new ipad pro with all-screen design and next-generation
performance, marking the biggest change to ipad ever. the all-new design pushes
11-inch and 12.9-inch liquid retina displays to the edges of ipad pro and
integrates face id to securely unlock ipad with just a glance. 1 the a12x bionic
chip w'
Imports the NLP helper functions
Defines the URLs for the three press releases
Retrieves the raw HTML codes for the three press releases
Cleans up the raw HTML codes (for example, HTML tags are removed)
Prints an excerpt from one press release
Of course, defining alternative data as broadly as is done in this section implies that there is a limitless amount of data that one can retrieve and process for financial purposes. At its core, this is the business of search engines such as the one from Google LLC. In a financial context, it would be of paramount importance to specify exactly what unstructured alternative data sources to tap into.
The second example is about the retrieval of data from the social network Twitter, Inc. To this end, Twitter provides API access to tweets on its platform, provided one has set up a Twitter account appropriately. The following Python code connects to the Twitter API and retrieves and prints the five most recent tweets from my home timeline and user timeline, respectively:
In [39]: from twitter import Twitter, OAuth

In [40]: t = Twitter(auth=OAuth(c['twitter']['access_token'],
                                c['twitter']['access_secret_token'],
                                c['twitter']['api_key'],
                                c['twitter']['api_secret_key']),
                     retry=True)

In [41]: l = t.statuses.home_timeline(count=5)

In [42]: for e in l:
             print(e['text'])
The Bank of England is effectively subsidizing polluting industries in its
pandemic rescue program, a think tank sa… https://t.co/Fq5jl2CIcp
Cool shared task: mining scientific contributions (by @SeeTedTalk @SoerenAuer
and Jennifer D'Souza) https://t.co/dm56DMUrWm
Twelve people were hospitalized in Wyoming on Monday after a hot air balloon
crash, officials said. Three hot air… https://t.co/EaNBBRXVar
President Trump directed controversial Pentagon pick into new role with
similar duties after nomination failed https://t.co/ZyXpPcJkcQ
Company announcement: Revolut launches Open Banking for its 400,000 Italian...
https://t.co/OfvbgwbeJW #fintech

In [43]: l = t.statuses.user_timeline(screen_name='dyjh', count=5)

In [44]: for e in l:
             print(e['text'])
#Python for #AlgoTrading (focus on the process) & #AI in #Finance (focus on
prediction methods) will complement eac… https://t.co/P1s8fXCp42
Currently putting finishing touches on #AI in #Finance (@OReillyMedia). Book
going into production shortly. https://t.co/JsOSA3sfBL
Chinatown Is Coming Back, One Noodle at a Time https://t.co/In5kXNeVc5
Alt data industry balloons as hedge funds strive for Covid edge via @FT |
"We remain of the view that alternative d… https://t.co/9HtUOjoEdz
@Wolf_Of_BTC Just follow me on (or ). Then you will notice for sure when
it is out.
Connects to the Twitter API
Retrieves and prints five (most recent) tweets from home timeline
Retrieves and prints five (most recent) tweets from user timeline
The Twitter API allows also for searches, based on which most recent tweets can be retrieved and processed:
In [45]: d = t.search.tweets(q='#Python', count=7)

In [46]: for e in d['statuses']:
             print(e['text'])
RT @KirkDBorne: #AI is Reshaping Programming — Tips on How to Stay on Top:
https://t.co/CFNu1i352C
—— Courses:
1: #MachineLearning — Jupyte…
RT @reuvenmlerner: Today, a #Python student's code didn't print:

x = 5
if x == 5:
    print('yes!')

There was a typo, namely : after pr…
RT @GavLaaaaaaaa: Javascript Does Not Need a StringBuilder
https://t.co/aS7NzHLO65 #programming #softwareengineering #bigdata
#datascience…
RT @CodeFlawCo: It is necessary to publish regular updates on #programmer
#coder #developer #technology RT @pak_aims: Learning to C…
RT @GavLaaaaaaaa: Javascript Does Not Need a StringBuilder
https://t.co/aS7NzHLO65 #programming #softwareengineering #bigdata
#datascience…
One can also collect a larger number of tweets from a Twitter user and create a summary in the form of a word cloud (see Figure 4-3). The following Python code again makes use of the NLP helper functions as shown in “Python Code”:
In [47]: l = t.statuses.user_timeline(screen_name='elonmusk', count=50)

In [48]: tl = [e['text'] for e in l]

In [49]: tl[:5]
Out[49]: ['@flcnhvy @Lindw0rm @cleantechnica True',
          '@Lindw0rm @cleantechnica Highly likely down the road',
          '@cleantechnica True fact',
          "@NASASpaceflight Scrubbed for the day. A Raptor turbopump spin start valve
           didn’t open, triggering an automatic abo… https://t.co/QDdlNXFgJg",
          '@Erdayastronaut I’m in the Boca control room. Hop attempt in ~33 minutes.']

In [50]: wc = nlp.generate_word_cloud(' '.join(tl), 35,
                 name='../../images/ch04/musk_twitter_wc.png')
Retrieves the 50 most recent tweets for the user elonmusk
Collects the texts in a list object
Shows the texts of the first five tweets
Generates a word cloud summary and shows it
Once a financial practitioner defines the “relevant financial data” to go beyond structured financial time series data, the data sources seem limitless in terms of volume, variety, and velocity. The way the tweets are retrieved from the Twitter API is almost in near time since the most recent tweets are accessed in the examples. These and similar API-based data sources therefore provide a never-ending stream of alternative data for which, as previously pointed out, it is important to specify exactly what one is looking for. Otherwise, any financial data science effort might easily drown in too much data and/or too noisy data.
Normative Theories Revisited
Chapter 3 introduces normative financial theories such as MVP theory or the CAPM. For quite a long time, students and academics learning and studying such theories were more or less constrained to the theory itself. With all the available financial data, as discussed and illustrated in the previous section, in combination with powerful open source software for data analysis—such as Python, NumPy, pandas, and so on—it has become pretty easy and straightforward to put financial theories to real-world tests. It no longer requires large teams or extensive studies to do so. A typical notebook, internet access, and a standard Python environment suffice. This is what this section is about. However, before diving into data-driven finance, the following sub-section discusses briefly some famous paradoxa in the context of EUT and how corporations model and predict the behavior of individuals in practice.
Expected Utility and Reality
In economics, risk describes a situation in which possible future states and probabilities for those states to unfold are known in advance to the decision maker. This is the standard assumption in finance and the context of EUT. On the other hand, ambiguity describes situations in economics in which probabilities, or even possible future states, are not known in advance to a decision maker. Uncertainty subsumes the two different decision-making situations.
There is a long tradition of analyzing the concrete decision-making behavior of individuals (“agents”) under uncertainty. Innumerable studies and experiments have been conducted to observe and analyze how agents behave when faced with uncertainty as compared to what theories such as EUT predict. For centuries, paradoxa have played an important role in decision-making theory and research.
One such paradox, the St. Petersburg paradox, gave rise to the invention of utility functions and EUT in the first place. Daniel Bernoulli presented the paradox—and a solution to it—in 1738. The paradox is based on the following coin tossing game. An agent is faced with a game during which a (perfect) coin is tossed potentially infinitely many times. If after the first toss heads prevails, the agent receives a payoff of 1 (currency unit). As long as heads is observed, the coin is tossed again. Otherwise the game ends. If heads prevails a second time, the agent receives an additional payoff of 2. If it does a third time, the additional payoff is 4. For the fourth time it is 8, and so on. This is a situation of risk since all possible future states, as well as their associated probabilities, are known in advance.

The expected payoff of this game is infinite. This can be seen from the following infinite sum, of which every element is strictly positive:

$$\mathbf{E}(P) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 4 + \ldots = \sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^k 2^{k-1} = \sum_{k=1}^{\infty} \frac{1}{2} = \infty$$

However, faced with such a game, a decision maker in general would be willing to pay only a finite sum to play the game. A major reason for this is the fact that relatively large payoffs only happen with a relatively small probability. Consider the potential payoff that results when heads prevails nine times in a row:

$$W = \sum_{k=1}^{9} 2^{k-1} = 511$$

The probability of winning such a payoff is pretty low. To be exact, it is only $\left(\frac{1}{2}\right)^9 = 0.001953125$. The probability for such a payoff or a smaller one, on the other hand, is pretty high:

$$\sum_{k=1}^{9} \left(\frac{1}{2}\right)^k = 0.998046875$$

In other words, in 998 out of 1,000 games the payoff is 511 or smaller. Therefore, an agent would probably not wager much more than 511 to play this game. The way out of this paradox is the introduction of a utility function with positive but decreasing marginal utility. In the context of the St. Petersburg paradox, this means that there is a function $u(x)$ that assigns to every positive payoff $x$ a real value $u(x)$. Positive but decreasing marginal utility then formally translates into the following:

$$u'(x) > 0, \qquad u''(x) < 0$$

As seen in Chapter 3, one such candidate function is $u(x) = \ln(x)$, with:

$$u'(x) = \frac{1}{x} > 0, \qquad u''(x) = -\frac{1}{x^2} < 0$$

The expected utility then is finite, as the calculation of the following infinite sum illustrates:

$$\mathbf{E}\big(u(x)\big) = \sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^k \ln\left(2^{k-1}\right) = \ln(2) \sum_{k=1}^{\infty} \frac{k-1}{2^k} = \ln(2) \approx 0.693147$$

The expected utility of $\ln(2) \approx 0.693147$ is obviously a pretty small number in comparison to the expected payoff of infinity. Bernoulli utility functions and EUT resolve the St. Petersburg paradox.
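The finite value can also be checked numerically. The following sketch, an illustration rather than part of the original argument, truncates the infinite sum after 100 terms and compares the result to ln(2):

import math

# expected utility of the game under u(x) = ln(x), truncated after 100 terms
exp_utility = sum((0.5 ** k) * math.log(2 ** (k - 1)) for k in range(1, 101))
print(round(exp_utility, 6))  # 0.693147
print(round(math.log(2), 6))  # 0.693147, that is, ln(2)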
Other paradoxa, such as the Allais paradox published in Allais (1953), address EUT itself. This paradox is based on an experiment with four different games that test subjects should rank. Table 4-2 shows the four games $A$, $B$, $A'$, and $B'$. The ranking is to be done for the two pairs $(A, B)$ and $(A', B')$. The independence axiom postulates that the first row in the table should not have any influence on the ordering within either pair, since the payoff in that row is the same for both games of a pair.
Probability | Game A | Game B | Game A' | Game B'
---|---|---|---|---
0.66 | 2,400 | 2,400 | 0 | 0
0.33 | 2,500 | 2,400 | 2,500 | 2,400
0.01 | 0 | 2,400 | 0 | 2,400
In experiments, the majority of decision makers rank the games as follows: $B \succ A$ and $A' \succ B'$. The ranking $B \succ A$ leads to the following inequalities, where $u_1 \equiv u(2500)$, $u_2 \equiv u(2400)$, and $u_3 \equiv u(0)$:

$$0.33 \cdot u_1 + 0.66 \cdot u_2 + 0.01 \cdot u_3 < u_2 \;\Leftrightarrow\; 0.33 \cdot u_1 + 0.01 \cdot u_3 < 0.34 \cdot u_2$$

The ranking $A' \succ B'$ in turn leads to the following inequalities:

$$0.33 \cdot u_1 + 0.67 \cdot u_3 > 0.34 \cdot u_2 + 0.66 \cdot u_3 \;\Leftrightarrow\; 0.33 \cdot u_1 + 0.01 \cdot u_3 > 0.34 \cdot u_2$$
These inequalities obviously contradict each other and lead to the Allais paradox. One possible explanation is that decision makers in general value certainty higher than the typical models, such as EUT, predict. Most people would probably rather choose to receive $1 million with certainty than play a game in which they can win $100 million with a probability of 5%, although there are a number of suitable utility functions available that under EUT would have the decision maker choose the game instead of the certain amount.
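That no utility function can produce this pair of rankings under EUT is easy to verify numerically. The following sketch, an illustration rather than part of the original experiment, computes the expected utilities of the four games for a sample utility function and shows that the difference within each pair is identical, so EUT always ranks (A, B) and (A', B') the same way:

import math

def expected_utility(game, u):
    # expected utility of a game given as (payoff, probability) pairs
    return sum(p * u(x) for x, p in game)

u = lambda x: math.log(1 + x)  # sample utility function; any u leads to the same conclusion

A = [(2400, 0.66), (2500, 0.33), (0, 0.01)]
B = [(2400, 1.00)]
A_ = [(0, 0.66), (2500, 0.33), (0, 0.01)]
B_ = [(0, 0.66), (2400, 0.33), (2400, 0.01)]

diff_AB = expected_utility(A, u) - expected_utility(B, u)
diff_A_B_ = expected_utility(A_, u) - expected_utility(B_, u)
print(round(diff_AB, 10) == round(diff_A_B_, 10))  # True: the common first row cancels out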
Another explanation lies in framing decisions and the psychology of decision makers. It is well known that more people would accept a surgery if it has a “95% chance of success” than a “5% chance of death.” Simply changing the wording might lead to behavior that is inconsistent with decision-making theories such as EUT.
Another famous paradox addressing shortcomings of EUT in its subjective form, according to Savage (1954, 1972), is the Ellsberg paradox, which dates back to the seminal paper by Ellsberg (1961). It addresses the importance of ambiguity in many real-world decision situations. A standard setting for this paradox comprises two different urns, both of which contain exactly 100 balls. For urn 1, it is known that it contains exactly 50 black and 50 red balls. For urn 2, it is only known that it contains black and red balls but not in which proportion.
Test subjects can choose among the following game options:
-
Game 1: red 1, black 1, or indifferent
-
Game 2: red 2, black 2, or indifferent
-
Game 3: red 1, red 2, or indifferent
-
Game 4: black 1, black 2, or indifferent
Here, “red 1,” for example, means that a red ball is drawn from urn 1. Typically, a test subject would answer as follows:
-
Game 1: indifferent
-
Game 2: indifferent
-
Game 3: red 1
-
Game 4: black 1
This set of decisions—which is not the only one to be observed but is a common one—exemplifies what is called ambiguity aversion. Since the probabilities for black and red balls, respectively, are not known for urn 2, decision makers prefer a situation of risk instead of ambiguity.
The two paradoxa of Allais and Ellsberg show that real test subjects quite often behave contrary to what well-established decision theories in economics predict. In other words, human beings as decision makers can in general not be compared to machines that carefully collect data and then crunch the numbers to make a decision under uncertainty, be it in the form of risk or ambiguity. Human behavior is more complex than most, if not all, theories currently suggest. How difficult and complex it can be to explain human behavior is clear after reading, for example, the 800-page book Behave by Sapolsky (2018). It covers multiple facets of this topic, ranging from biochemical processes to genetics, human evolution, tribes, language, religion, and more, in an integrative manner.
If standard economic decision paradigms such as EUT do not explain real-world decision making too well, what alternatives are available? Economic experiments that build the basis for the Allais and Ellsberg paradoxa are a good starting point in learning how decision makers behave in specific, controlled situations. Such experiments and their sometimes surprising and paradoxical results have indeed motivated a great number of researchers to come up with alternative theories and models that resolve the paradoxa. The book The Experiment in the History of Economics by Fontaine and Leonard (2005) is about the historical role of experiments in economics. There is, for example, a whole string of literature that addresses issues arising from the Ellsberg paradox. This literature deals with, among other topics, nonadditive probabilities, Choquet integrals, and decision heuristics such as maximizing the minimum payoff (“max-min”) or minimizing the maximum loss (“min-max”). These alternative approaches have proven superior to EUT, at least in certain decision-making scenarios. But they are far from being mainstream in finance.
What, after all, has proven to be useful in practice? Not too surprisingly, the answer lies in data and machine learning algorithms. The internet, with its billions of users, generates a treasure trove of data describing real-world human behavior, or what is sometimes called revealed preferences. The big data generated on the web has a scale that is multiple orders of magnitude larger than what single experiments can generate. Companies such as Amazon, Facebook, Google, and Twitter are able to make billions of dollars by recording user behavior (that is, their revealed preferences) and capitalizing on the insights generated by ML algorithms trained on this data.
The default ML approach taken in this context is supervised learning. The algorithms themselves are in general theory- and model-free; variants of neural networks are often applied. Therefore, when companies today predict the behavior of their users or customers, more often than not a model-free ML algorithm is deployed. Traditional decision theories like EUT or one of its successors generally do not play a role at all. This makes it somewhat surprising that such theories still, at the beginning of the 2020s, are a cornerstone of most economic and financial theories applied in practice. And this is not even to mention the large number of financial textbooks that cover traditional decision theories in detail. If one of the most fundamental building blocks of financial theory seems to lack meaningful empirical support or practical benefits, what about the financial models that build on top of it? More on this appears in subsequent sections and chapters.
Data-Driven Predictions of Behavior
Standard economic decision theories are intellectually appealing to many, even to those who, faced with a concrete decision under uncertainty, would behave in contrast to the theories’ predictions. On the other hand, big data and model-free, supervised learning approaches prove useful and successful in practice for predicting user and customer behavior. In a financial context, this might imply that one should not really worry about why and how financial agents decide the way they decide. One should rather focus on their indirectly revealed preferences based on features data (new information) that describes the state of a financial market and labels data (outcomes) that reflects the impact of the decisions made by financial agents. This leads to a data-driven instead of a theory- or model-driven view of decision making in financial markets. Financial agents become data-processing organisms that can be much better modeled, for example, by complex neural networks than, say, a simple utility function in combination with an assumed probability distribution.
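A toy version of this data-driven view might look as follows. The sketch is purely illustrative, uses synthetic data, and makes no claim about predictive quality; it only shows the mechanics of mapping features (lagged returns) to labels (the direction of the next return) with a model-free classifier:

import numpy as np
from sklearn.neural_network import MLPClassifier

np.random.seed(100)
r = np.random.standard_normal(1000) * 0.01  # synthetic return series
lags = 5
X = np.array([r[i - lags:i] for i in range(lags, len(r) - 1)])  # lagged returns as features
y = np.where(r[lags:len(r) - 1] > 0, 1, 0)  # direction of the next return as label

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
model.fit(X, y)  # model-free, supervised learning
print(model.score(X, y))  # in-sample accuracy only, for illustration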
Mean-Variance Portfolio Theory
Assume a data-driven investor wants to apply MVP theory to invest in a portfolio of technology stocks and wants to add a gold-related exchange-traded fund (ETF) for diversification. Probably, the investor would access relevant historical price data via an API to a trading platform or a data provider. To make the following analysis reproducible, it relies on a CSV data file stored in a remote location. The following Python code retrieves the data file, selects a number of symbols given the investor’s goal, and calculates log returns from the price time series data. Figure 4-4 compares the normalized price time series for the selected symbols:
In [51]: import numpy as np
         import pandas as pd
         from pylab import plt, mpl
         from scipy.optimize import minimize
         plt.style.use('seaborn')
         mpl.rcParams['savefig.dpi'] = 300
         mpl.rcParams['font.family'] = 'serif'
         np.set_printoptions(precision=5, suppress=True,
                             formatter={'float': lambda x: f'{x:6.3f}'})

In [52]: url = 'http://hilpisch.com/aiif_eikon_eod_data.csv'

In [53]: raw = pd.read_csv(url, index_col=0, parse_dates=True).dropna()

In [54]: raw.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2516 entries, 2010-01-04 to 2019-12-31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   AAPL.O  2516 non-null   float64
 1   MSFT.O  2516 non-null   float64
 2   INTC.O  2516 non-null   float64
 3   AMZN.O  2516 non-null   float64
 4   GS.N    2516 non-null   float64
 5   SPY     2516 non-null   float64
 6   .SPX    2516 non-null   float64
 7   .VIX    2516 non-null   float64
 8   EUR=    2516 non-null   float64
 9   XAU=    2516 non-null   float64
 10  GDX     2516 non-null   float64
 11  GLD     2516 non-null   float64
dtypes: float64(12)
memory usage: 255.5 KB

In [55]: symbols = ['AAPL.O', 'MSFT.O', 'INTC.O', 'AMZN.O', 'GLD']

In [56]: rets = np.log(raw[symbols] / raw[symbols].shift(1)).dropna()

In [57]: (raw[symbols] / raw[symbols].iloc[0]).plot(figsize=(10, 6));
Retrieves historical EOD data from a remote location
Specifies the symbols (RICs) to be invested in
Calculates the log returns for all time series
Plots the normalized financial time series for the selected symbols
The data-driven investor wants to first set a baseline for performance as given by an equally weighted portfolio over the whole period of the available data. To this end, the following Python code defines functions to calculate the portfolio return, the portfolio volatility, and the portfolio Sharpe ratio given a set of weights for the selected symbols:
In [58]: weights = len(rets.columns) * [1 / len(rets.columns)]

In [59]: def port_return(rets, weights):
             return np.dot(rets.mean(), weights) * 252

In [60]: port_return(rets, weights)
Out[60]: 0.15694764653018106

In [61]: def port_volatility(rets, weights):
             return np.dot(weights, np.dot(rets.cov() * 252, weights)) ** 0.5

In [62]: port_volatility(rets, weights)
Out[62]: 0.16106507848480675

In [63]: def port_sharpe(rets, weights):
             return port_return(rets, weights) / port_volatility(rets, weights)

In [64]: port_sharpe(rets, weights)
Out[64]: 0.97443622172255
Equally weighted portfolio
Portfolio return
Portfolio volatility
Portfolio Sharpe ratio (with zero short rate)
The investor also wants to analyze which combinations of portfolio risk and return—and consequently Sharpe ratio—are roughly possible by applying Monte Carlo simulation to randomize the portfolio weights. Short sales are excluded, and the portfolio weights are assumed to add up to 100%. The following Python code implements the simulation and visualizes the results (see Figure 4-5):
In [65]: w = np.random.random((1000, len(symbols)))
         w = (w.T / w.sum(axis=1)).T

In [66]: w[:5]
Out[66]: array([[ 0.184,  0.157,  0.227,  0.353,  0.079],
                [ 0.207,  0.282,  0.258,  0.023,  0.230],
                [ 0.313,  0.284,  0.051,  0.340,  0.012],
                [ 0.238,  0.181,  0.145,  0.191,  0.245],
                [ 0.246,  0.256,  0.315,  0.181,  0.002]])

In [67]: pvr = [(port_volatility(rets[symbols], weights),
                 port_return(rets[symbols], weights))
                for weights in w]
         pvr = np.array(pvr)

In [68]: psr = pvr[:, 1] / pvr[:, 0]

In [69]: plt.figure(figsize=(10, 6))
         fig = plt.scatter(pvr[:, 0], pvr[:, 1],
                           c=psr, cmap='coolwarm')
         cb = plt.colorbar(fig)
         cb.set_label('Sharpe ratio')
         plt.xlabel('expected volatility')
         plt.ylabel('expected return')
         plt.title(' | '.join(symbols));
Simulates portfolio weights adding up to 100%
Derives the resulting portfolio volatilities and returns
Calculates the resulting Sharpe ratios
The data-driven investor now wants to backtest the performance of a portfolio that was set up at the beginning of 2011. The optimal portfolio composition was derived from the financial time series data available from 2010. At the beginning of 2012, the portfolio composition was adjusted given the available data from 2011, and so on. To this end, the following Python code derives the portfolio weights for every relevant year that maximizes the Sharpe ratio:
In [70]: bnds = len(symbols) * [(0, 1),]
         bnds
Out[70]: [(0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

In [71]: cons = {'type': 'eq', 'fun': lambda weights: weights.sum() - 1}

In [72]: opt_weights = {}
         for year in range(2010, 2019):
             rets_ = rets[symbols].loc[f'{year}-01-01':f'{year}-12-31']
             ow = minimize(lambda weights: -port_sharpe(rets_, weights),
                           len(symbols) * [1 / len(symbols)],
                           bounds=bnds,
                           constraints=cons)['x']
             opt_weights[year] = ow

In [73]: opt_weights
Out[73]: {2010: array([ 0.366,  0.000,  0.000,  0.056,  0.578]),
          2011: array([ 0.543,  0.000,  0.077,  0.000,  0.380]),
          2012: array([ 0.324,  0.000,  0.000,  0.471,  0.205]),
          2013: array([ 0.012,  0.305,  0.219,  0.464,  0.000]),
          2014: array([ 0.452,  0.115,  0.419,  0.000,  0.015]),
          2015: array([ 0.000,  0.000,  0.000,  1.000,  0.000]),
          2016: array([ 0.150,  0.260,  0.000,  0.058,  0.533]),
          2017: array([ 0.231,  0.203,  0.031,  0.109,  0.426]),
          2018: array([ 0.000,  0.295,  0.000,  0.705,  0.000])}
Specifies the bounds for the single asset weights
Specifies that all weights need to add up to 100%
Selects the relevant data set for the given year
Derives the portfolio weights that maximize the Sharpe ratio
The optimal portfolio compositions as derived for the relevant years illustrate that MVP theory in its original form quite often leads to (relatively) extreme allocations in the sense that one or more assets are not included at all or that even a single asset makes up 100% of the portfolio. Of course, this can be actively avoided by setting, for example, a minimum weight for every asset considered, as the sketch below illustrates. The results also indicate that this approach leads to significant rebalancings in the portfolio, driven by the previous year’s realized statistics and correlations.
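Imposing such a minimum weight only requires changing the bounds handed to the optimizer. The following sketch reuses the objects defined above and requires at least 10% in every asset; the 10% figure is an arbitrary choice for illustration:

# minimum weight of 10% per asset (illustrative); reuses rets, symbols, cons, port_sharpe
bnds_min = len(symbols) * [(0.1, 1),]
opt_weights_min = {}
for year in range(2010, 2019):
    rets_ = rets[symbols].loc[f'{year}-01-01':f'{year}-12-31']
    ow = minimize(lambda weights: -port_sharpe(rets_, weights),
                  len(symbols) * [1 / len(symbols)],
                  bounds=bnds_min, constraints=cons)['x']
    opt_weights_min[year] = ow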
To complete the backtest, the following code compares the expected portfolio statistics (from the optimal composition of the previous year applied to the previous year’s data) with the realized portfolio statistics for the current year (from the optimal composition from the previous year applied to the current year’s data):
In [74]: res = pd.DataFrame()
         for year in range(2010, 2019):
             rets_ = rets[symbols].loc[f'{year}-01-01':f'{year}-12-31']
             epv = port_volatility(rets_, opt_weights[year])
             epr = port_return(rets_, opt_weights[year])
             esr = epr / epv
             rets_ = rets[symbols].loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
             rpv = port_volatility(rets_, opt_weights[year])
             rpr = port_return(rets_, opt_weights[year])
             rsr = rpr / rpv
             res = res.append(pd.DataFrame({'epv': epv, 'epr': epr, 'esr': esr,
                                            'rpv': rpv, 'rpr': rpr, 'rsr': rsr},
                                           index=[year + 1]))

In [75]: res
Out[75]:            epv       epr       esr       rpv       rpr       rsr
         2011  0.157440  0.303003  1.924564  0.160622  0.133836  0.833235
         2012  0.173279  0.169321  0.977156  0.182292  0.161375  0.885256
         2013  0.202460  0.278459  1.375378  0.168714  0.166897  0.989228
         2014  0.181544  0.368961  2.032353  0.197798  0.026830  0.135645
         2015  0.160340  0.309486  1.930190  0.211368 -0.024560 -0.116194
         2016  0.326730  0.778330  2.382179  0.296565  0.103870  0.350242
         2017  0.106148  0.090933  0.856663  0.079521  0.230630  2.900235
         2018  0.086548  0.260702  3.012226  0.157337  0.038234  0.243004
         2019  0.323796  0.228008  0.704174  0.207672  0.275819  1.328147

In [76]: res.mean()
Out[76]: epv    0.190920
         epr    0.309689
         esr    1.688320
         rpv    0.184654
         rpr    0.123659
         rsr    0.838755
         dtype: float64
Figure 4-6 compares the expected and realized portfolio volatilities for the single years. MVP theory does quite a good job in predicting the portfolio volatility. This is also supported by a relatively high correlation between the two time series:
In [77]: res[['epv', 'rpv']].corr()
Out[77]:           epv       rpv
         epv  1.000000  0.765733
         rpv  0.765733  1.000000

In [78]: res[['epv', 'rpv']].plot(kind='bar', figsize=(10, 6),
             title='Expected vs. Realized Portfolio Volatility');
However, the conclusions are the opposite when comparing the expected with the realized portfolio returns (see Figure 4-7). MVP theory obviously fails in predicting the portfolio returns, as is confirmed by the negative correlation between the two time series:
In [79]: res[['epr', 'rpr']].corr()
Out[79]:           epr       rpr
         epr  1.000000 -0.350437
         rpr -0.350437  1.000000

In [80]: res[['epr', 'rpr']].plot(kind='bar', figsize=(10, 6),
             title='Expected vs. Realized Portfolio Return');
Similar, or even worse, conclusions need to be drawn with regard to the Sharpe ratio (see Figure 4-8). For the data-driven investor who aims at maximizing the Sharpe ratio of the portfolio, the theory’s predictions are generally significantly off from the realized values. The correlation between the two time series is even lower than for the returns:
In [81]: res[['esr', 'rsr']].corr()
Out[81]:           esr       rsr
         esr  1.000000 -0.698607
         rsr -0.698607  1.000000

In [82]: res[['esr', 'rsr']].plot(kind='bar', figsize=(10, 6),
             title='Expected vs. Realized Sharpe Ratio');
Predictive Power of MVP Theory
MVP theory applied to real-world data reveals its practical shortcomings. Without additional constraints, optimal portfolio compositions and rebalancings can be extreme. The predictive power with regard to portfolio return and Sharpe ratio is pretty bad in the numerical example, whereas the predictive power with regard to portfolio risk seems acceptable. However, investors generally are interested in risk-adjusted performance measures, such as the Sharpe ratio, and this is the statistic for which MVP theory fails worst in the example.
Capital Asset Pricing Model
A similar approach can be applied to put the CAPM to a real-world test. Assume that the data-driven technology investor from before wants to apply the CAPM to derive expected returns for the four technology stocks from before. The following Python code first derives the beta for every stock for a given year, and then calculates the expected return for the stock in the next year, given its beta and the performance of the market portfolio. The market portfolio is approximated by the S&P 500 stock index:
In [83]: r = 0.005

In [84]: market = '.SPX'

In [85]: rets = np.log(raw / raw.shift(1)).dropna()

In [86]: res = pd.DataFrame()

In [87]: for sym in rets.columns[:4]:
             print('\n' + sym)
             print(54 * '=')
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 muM = rets_[market].mean() * 252
                 cov = rets_.cov().loc[sym, market]
                 var = rets_[market].var()
                 beta = cov / var
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 muM = rets_[market].mean() * 252
                 mu_capm = r + beta * (muM - r)
                 mu_real = rets_[sym].mean() * 252
                 res = res.append(pd.DataFrame({'symbol': sym,
                                                'mu_capm': mu_capm,
                                                'mu_real': mu_real},
                                               index=[year + 1]),
                                  sort=True)
                 print('{} | beta: {:.3f} | mu_capm: {:6.3f} | mu_real: {:6.3f}'
                       .format(year + 1, beta, mu_capm, mu_real))
Specifies the risk-less short rate
Defines the market portfolio
Derives the beta of the stock
Calculates the expected return given previous year’s beta and current year market portfolio performance
Calculates the realized performance of the stock for the current year
Collects and prints all results
The preceding code provides the following output:
AAPL.O
======================================================
2011 | beta: 1.052 | mu_capm: -0.000 | mu_real:  0.228
2012 | beta: 0.764 | mu_capm:  0.098 | mu_real:  0.275
2013 | beta: 1.266 | mu_capm:  0.327 | mu_real:  0.053
2014 | beta: 0.630 | mu_capm:  0.070 | mu_real:  0.320
2015 | beta: 0.833 | mu_capm: -0.005 | mu_real: -0.047
2016 | beta: 1.144 | mu_capm:  0.103 | mu_real:  0.096
2017 | beta: 1.009 | mu_capm:  0.180 | mu_real:  0.381
2018 | beta: 1.379 | mu_capm: -0.091 | mu_real: -0.071
2019 | beta: 1.252 | mu_capm:  0.316 | mu_real:  0.621

MSFT.O
======================================================
2011 | beta: 0.890 | mu_capm:  0.001 | mu_real: -0.072
2012 | beta: 0.816 | mu_capm:  0.104 | mu_real:  0.029
2013 | beta: 1.109 | mu_capm:  0.287 | mu_real:  0.337
2014 | beta: 0.876 | mu_capm:  0.095 | mu_real:  0.216
2015 | beta: 0.955 | mu_capm: -0.007 | mu_real:  0.178
2016 | beta: 1.249 | mu_capm:  0.113 | mu_real:  0.113
2017 | beta: 1.224 | mu_capm:  0.217 | mu_real:  0.321
2018 | beta: 1.303 | mu_capm: -0.086 | mu_real:  0.172
2019 | beta: 1.442 | mu_capm:  0.364 | mu_real:  0.440

INTC.O
======================================================
2011 | beta: 1.081 | mu_capm: -0.000 | mu_real:  0.142
2012 | beta: 0.842 | mu_capm:  0.108 | mu_real: -0.163
2013 | beta: 1.081 | mu_capm:  0.280 | mu_real:  0.230
2014 | beta: 0.883 | mu_capm:  0.096 | mu_real:  0.335
2015 | beta: 1.055 | mu_capm: -0.008 | mu_real: -0.052
2016 | beta: 1.009 | mu_capm:  0.092 | mu_real:  0.051
2017 | beta: 1.261 | mu_capm:  0.223 | mu_real:  0.242
2018 | beta: 1.163 | mu_capm: -0.076 | mu_real:  0.017
2019 | beta: 1.376 | mu_capm:  0.347 | mu_real:  0.243

AMZN.O
======================================================
2011 | beta: 1.102 | mu_capm: -0.001 | mu_real: -0.039
2012 | beta: 0.958 | mu_capm:  0.122 | mu_real:  0.374
2013 | beta: 1.116 | mu_capm:  0.289 | mu_real:  0.464
2014 | beta: 1.262 | mu_capm:  0.135 | mu_real: -0.251
2015 | beta: 1.473 | mu_capm: -0.013 | mu_real:  0.778
2016 | beta: 1.122 | mu_capm:  0.102 | mu_real:  0.104
2017 | beta: 1.118 | mu_capm:  0.199 | mu_real:  0.446
2018 | beta: 1.300 | mu_capm: -0.086 | mu_real:  0.251
2019 | beta: 1.619 | mu_capm:  0.408 | mu_real:  0.207
Figure 4-9 compares the predicted (expected) return for a single stock, given the beta from the previous year and the market portfolio performance of the current year, with the realized return of the stock for the current year. Obviously, the CAPM in its original form is not particularly useful for predicting a stock's performance based on beta alone:
In [88]: sym = 'AMZN.O'

In [89]: res[res['symbol'] == sym].corr()
Out[89]:           mu_capm   mu_real
         mu_capm  1.000000 -0.004826
         mu_real -0.004826  1.000000

In [90]: res[res['symbol'] == sym].plot(kind='bar', figsize=(10, 6), title=sym);
Figure 4-10 compares the averages of the CAPM-predicted stock returns with the averages of the realized returns. Here, too, the CAPM does not do a good job.
What is easy to see is that the CAPM predictions do not vary that much on average for the stocks analyzed; they lie between roughly 11% and 13%. The realized average returns of the stocks, by contrast, show a high variability; they lie between roughly 12% and 26%. Market portfolio performance and beta alone obviously cannot account for the observed returns of the (technology) stocks:
In [91]: grouped = res.groupby('symbol').mean()
         grouped
Out[91]:          mu_capm   mu_real
         symbol
         AAPL.O  0.110855  0.206158
         AMZN.O  0.128223  0.259395
         INTC.O  0.117929  0.116180
         MSFT.O  0.120844  0.192655

In [92]: grouped.plot(kind='bar', figsize=(10, 6), title='Average Values');
Predictive Power of the CAPM
The predictive power of the CAPM with regard to the future performance of stocks, relative to the market portfolio, is pretty low or even nonexistent for certain stocks. One of the reasons is probably the fact that the CAPM rests on the same central assumptions as MVP theory, namely that investors care about only the (expected) return and (expected) volatility of a portfolio and/or stock. From a modeling point of view, one can ask whether the single risk factor is enough to explain variability in stock returns or whether there might be a nonlinear relationship between a stock’s return and the market portfolio performance.
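As a quick, hypothetical check of the nonlinearity question, one could fit polynomials of increasing degree to the daily stock returns against the market returns and compare the explained variance. The following sketch assumes the rets DataFrame from above; the symbol choice is illustrative:

import numpy as np
from sklearn.metrics import r2_score

sym, market = 'AMZN.O', '.SPX'  # illustrative choices
x, y = rets[market].values, rets[sym].values
for deg in [1, 2, 3]:
    reg = np.polyfit(x, y, deg=deg)  # polynomial regression of degree deg
    r2 = r2_score(y, np.polyval(reg, x))  # in-sample explained variance
    print(f'degree {deg} | R^2: {r2:.4f}')

If the higher-degree fits do not improve the score materially, nonlinearity in the market factor alone is unlikely to be the missing ingredient.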
Arbitrage Pricing Theory
The predictive power of the CAPM seems quite limited given the results from the previous numerical example. A valid question is whether the market portfolio performance alone is enough to explain variability in stock returns. The answer of the APT is no—there can be more (even many more) factors that together explain variability in stock returns. “Arbitrage Pricing Theory” formally describes the framework of APT that also relies on a linear relationship between the factors and a stock’s return.
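Formally, APT posits a linear relationship of the following form, where the f_j are the factor returns, the β_j the factor loadings of the stock, and ε the unexplained residual (the numerical example that follows estimates the loadings by least squares and, for simplicity, without an intercept):

$$r_{\mathrm{stock}} = \alpha + \sum_{j=1}^{N} \beta_j\, f_j + \varepsilon$$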
The data-driven investor recognizes that the CAPM is not sufficient to reliably predict a stock’s performance relative to the market portfolio performance. Therefore, the investor decides to add to the market portfolio three additional factors that might drive a stock’s performance:
- Market volatility (as represented by the VIX index, .VIX)
- Exchange rates (as represented by the EUR/USD rate, EUR=)
- Commodity prices (as represented by the gold price, XAU=)
The following Python code implements a simple APT approach by using the four factors in combination with multivariate regression to explain a stock’s future performance in relation to the factors:
In [93]: factors = ['.SPX', '.VIX', 'EUR=', 'XAU=']

In [94]: res = pd.DataFrame()

In [95]: np.set_printoptions(formatter={'float': lambda x: f'{x:5.2f}'})

In [96]: for sym in rets.columns[:4]:
             print('\n' + sym)
             print(71 * '=')
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 reg = np.linalg.lstsq(rets_[factors], rets_[sym],
                                       rcond=-1)[0]
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 mu_apt = np.dot(rets_[factors].mean() * 252, reg)
                 mu_real = rets_[sym].mean() * 252
                 res = res.append(pd.DataFrame({'symbol': sym,
                                                'mu_apt': mu_apt,
                                                'mu_real': mu_real},
                                               index=[year + 1]))
                 print('{} | fl: {} | mu_apt: {:6.3f} | mu_real: {:6.3f}'.format(
                     year + 1, reg.round(2), mu_apt, mu_real))
The four factors
The multivariate regression
The APT-predicted return of the stock
The realized return of the stock
The preceding code provides the following output:
AAPL.O
=======================================================================
2011 | fl: [ 0.91 -0.04 -0.35  0.12] | mu_apt:  0.011 | mu_real:  0.228
2012 | fl: [ 0.76 -0.02 -0.24  0.05] | mu_apt:  0.099 | mu_real:  0.275
2013 | fl: [ 1.67  0.04 -0.56  0.10] | mu_apt:  0.366 | mu_real:  0.053
2014 | fl: [ 0.53 -0.00  0.02  0.16] | mu_apt:  0.050 | mu_real:  0.320
2015 | fl: [ 1.07  0.02  0.25  0.01] | mu_apt: -0.038 | mu_real: -0.047
2016 | fl: [ 1.21  0.01 -0.14 -0.02] | mu_apt:  0.110 | mu_real:  0.096
2017 | fl: [ 1.10  0.01 -0.15 -0.02] | mu_apt:  0.170 | mu_real:  0.381
2018 | fl: [ 1.06 -0.03 -0.15  0.12] | mu_apt: -0.088 | mu_real: -0.071
2019 | fl: [ 1.37  0.01 -0.20  0.13] | mu_apt:  0.364 | mu_real:  0.621

MSFT.O
=======================================================================
2011 | fl: [ 0.98  0.01  0.02 -0.11] | mu_apt: -0.008 | mu_real: -0.072
2012 | fl: [ 0.82  0.00 -0.03 -0.01] | mu_apt:  0.103 | mu_real:  0.029
2013 | fl: [ 1.14  0.00 -0.07 -0.01] | mu_apt:  0.294 | mu_real:  0.337
2014 | fl: [ 1.28  0.05  0.04  0.07] | mu_apt:  0.149 | mu_real:  0.216
2015 | fl: [ 1.20  0.03  0.05  0.01] | mu_apt: -0.016 | mu_real:  0.178
2016 | fl: [ 1.44  0.03 -0.17 -0.02] | mu_apt:  0.127 | mu_real:  0.113
2017 | fl: [ 1.33  0.01 -0.14  0.00] | mu_apt:  0.216 | mu_real:  0.321
2018 | fl: [ 1.10 -0.02 -0.14  0.22] | mu_apt: -0.087 | mu_real:  0.172
2019 | fl: [ 1.51  0.01 -0.16 -0.02] | mu_apt:  0.378 | mu_real:  0.440

INTC.O
=======================================================================
2011 | fl: [ 1.17  0.01  0.05 -0.13] | mu_apt: -0.010 | mu_real:  0.142
2012 | fl: [ 1.03  0.04  0.01  0.03] | mu_apt:  0.122 | mu_real: -0.163
2013 | fl: [ 1.06 -0.01 -0.10  0.01] | mu_apt:  0.267 | mu_real:  0.230
2014 | fl: [ 0.96  0.02  0.36 -0.02] | mu_apt:  0.063 | mu_real:  0.335
2015 | fl: [ 0.93 -0.01 -0.09  0.02] | mu_apt:  0.001 | mu_real: -0.052
2016 | fl: [ 1.02  0.00 -0.05  0.06] | mu_apt:  0.099 | mu_real:  0.051
2017 | fl: [ 1.41  0.02 -0.18  0.03] | mu_apt:  0.226 | mu_real:  0.242
2018 | fl: [ 1.12 -0.01 -0.11  0.17] | mu_apt: -0.076 | mu_real:  0.017
2019 | fl: [ 1.50  0.01 -0.34  0.30] | mu_apt:  0.431 | mu_real:  0.243

AMZN.O
=======================================================================
2011 | fl: [ 1.02 -0.03 -0.18 -0.14] | mu_apt: -0.016 | mu_real: -0.039
2012 | fl: [ 0.98 -0.01 -0.17 -0.09] | mu_apt:  0.117 | mu_real:  0.374
2013 | fl: [ 1.07 -0.00  0.09  0.00] | mu_apt:  0.282 | mu_real:  0.464
2014 | fl: [ 1.54  0.03  0.01 -0.08] | mu_apt:  0.176 | mu_real: -0.251
2015 | fl: [ 1.26 -0.02  0.45 -0.11] | mu_apt: -0.044 | mu_real:  0.778
2016 | fl: [ 1.06 -0.00 -0.15 -0.04] | mu_apt:  0.099 | mu_real:  0.104
2017 | fl: [ 0.94 -0.02  0.12 -0.03] | mu_apt:  0.185 | mu_real:  0.446
2018 | fl: [ 0.90 -0.04 -0.25  0.28] | mu_apt: -0.085 | mu_real:  0.251
2019 | fl: [ 1.99  0.05 -0.37  0.12] | mu_apt:  0.506 | mu_real:  0.207
Figure 4-11 compares the APT-predicted returns for a stock and its realized stock returns over time. Compared to the single-factor CAPM, there seems to be hardly any improvement:
In [97]: sym = 'AMZN.O'

In [98]: res[res['symbol'] == sym].corr()
Out[98]:           mu_apt   mu_real
         mu_apt   1.000000 -0.098281
         mu_real -0.098281  1.000000

In [99]: res[res['symbol'] == sym].plot(kind='bar', figsize=(10, 6), title=sym);
The same picture arises in Figure 4-12, produced by the following snippet, which compares the averages for multiple stocks. Because there is hardly any variation in the average APT predictions, the averages differ substantially from the realized average returns:
In [100]: grouped = res.groupby('symbol').mean()
          grouped
Out[100]:           mu_apt   mu_real
          symbol
          AAPL.O  0.116116  0.206158
          AMZN.O  0.135528  0.259395
          INTC.O  0.124811  0.116180
          MSFT.O  0.128441  0.192655

In [101]: grouped.plot(kind='bar', figsize=(10, 6), title='Average Values');
Of course, the selection of the risk factors is of paramount importance in this context. The data-driven investor decides to find out which risk factors are typically considered relevant for stocks. After studying the paper by Bender et al. (2013), the investor replaces the original risk factors with a new set. In particular, the investor chooses the set presented in Table 4-3.
Factor | Description | RIC |
---|---|---|
Market | MSCI World Gross Return Daily USD (PUS = Price Return) | |
Size | MSCI World Equal Weight Price Net Index EOD | |
Volatility | MSCI World Minimum Volatility Net Return | |
Value | MSCI World Value Weighted Gross (NUS for Net) | |
Risk | MSCI World Risk Weighted Gross USD EOD | |
Growth | MSCI World Quality Net Return USD | |
Momentum | MSCI World Momentum Gross Index USD EOD | |
The following Python code retrieves a respective data set from a remote location and visualizes the normalized time series data (see Figure 4-13). Already a brief look reveals that the time series seem to be highly positively correlated:
In [102]: factors = pd.read_csv('http://hilpisch.com/aiif_eikon_eod_factors.csv',
                                index_col=0, parse_dates=True)

In [103]: (factors / factors.iloc[0]).plot(figsize=(10, 6));
This impression is confirmed by the following calculation and the resulting correlation matrix for the factor returns. All pairwise correlations are about 0.75 or higher:
In [104]: start = '2017-01-01'
          end = '2020-01-01'

In [105]: retsd = rets.loc[start:end].copy()
          retsd.dropna(inplace=True)

In [106]: retsf = np.log(factors / factors.shift(1))
          retsf = retsf.loc[start:end]
          retsf.dropna(inplace=True)
          retsf = retsf.loc[retsd.index].dropna()

In [107]: retsf.corr()
Out[107]:               market      size  volatility     value      risk    growth  \
          market      1.000000  0.935867    0.845010  0.964124  0.947150  0.959038
          size        0.935867  1.000000    0.791767  0.965739  0.983238  0.835477
          volatility  0.845010  0.791767    1.000000  0.778294  0.865467  0.818280
          value       0.964124  0.965739    0.778294  1.000000  0.958359  0.864222
          risk        0.947150  0.983238    0.865467  0.958359  1.000000  0.858546
          growth      0.959038  0.835477    0.818280  0.864222  0.858546  1.000000
          momentum    0.928705  0.796420    0.819585  0.818796  0.825563  0.952956

                      momentum
          market      0.928705
          size        0.796420
          volatility  0.819585
          value       0.818796
          risk        0.825563
          growth      0.952956
          momentum    1.000000
Defines start and end dates for data selection
Selects the relevant returns data sub-set
Calculates and processes the log returns for the factors
Shows the correlation matrix for the factors
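Such high pairwise correlations point to pronounced multicollinearity, which makes the least-squares factor loadings estimated next unstable and hard to interpret individually. As a quick diagnostic, one might compute variance inflation factors (VIFs); this is a hypothetical addition that assumes the retsf DataFrame from above:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# a VIF well above 10 is commonly read as a sign of problematic multicollinearity
for i, col in enumerate(retsf.columns):
    vif = variance_inflation_factor(retsf.values, i)
    print(f'{col:12s} VIF: {vif:8.2f}')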
The following Python code derives factor loadings for the original stocks, but with the new factors. The loadings are estimated on the first half of the data set and are applied to predict the stock return for the second half, given the performance of the individual factors. The realized return is also calculated. Both time series are compared in Figure 4-14. As is to be expected given the high correlation of the factors, the explanatory power of the APT approach is not much higher than that of the CAPM:
In [108]: res = pd.DataFrame()

In [109]: np.set_printoptions(formatter={'float': lambda x: f'{x:5.2f}'})

In [110]: split = int(len(retsf) * 0.5)
          for sym in rets.columns[:4]:
              print('\n' + sym)
              print(74 * '=')
              retsf_, retsd_ = retsf.iloc[:split], retsd.iloc[:split]
              reg = np.linalg.lstsq(retsf_, retsd_[sym], rcond=-1)[0]
              retsf_, retsd_ = retsf.iloc[split:], retsd.iloc[split:]
              mu_apt = np.dot(retsf_.mean() * 252, reg)
              mu_real = retsd_[sym].mean() * 252
              res = res.append(pd.DataFrame({'mu_apt': mu_apt,
                                             'mu_real': mu_real},
                                            index=[sym,]),
                               sort=True)
              print('fl: {} | apt: {:.3f} | real: {:.3f}'.format(
                  reg.round(1), mu_apt, mu_real))
AAPL.O
==========================================================================
fl: [ 2.30  2.80 -0.70 -1.40 -4.20  2.00 -0.20] | apt: 0.115 | real: 0.301

MSFT.O
==========================================================================
fl: [ 1.50  0.00  0.10 -1.30 -1.40  0.80  1.00] | apt: 0.181 | real: 0.304

INTC.O
==========================================================================
fl: [-3.10  1.60  0.40  1.30 -2.60  2.50  1.10] | apt: 0.186 | real: 0.118

AMZN.O
==========================================================================
fl: [ 9.10  3.30 -1.00 -7.10 -3.10 -1.80  1.20] | apt: 0.019 | real: 0.050

In [111]: res.plot(kind='bar', figsize=(10, 6));
The data-driven investor is not willing to dismiss the APT completely. Therefore, an additional test might shed some more light on the explanatory power of APT. To this end, the factor loadings are used to test whether APT can correctly explain movements of the stock price over time. And indeed, although APT does not get the absolute performance right (the predicted annualized return is only about 2% versus a realized 5%), it predicts the direction of the stock price movement correctly in the majority of cases (see Figure 4-15). The correlation between the predicted and realized returns is also pretty high, at more than 80%. However, the analysis uses realized factor returns to generate the APT predictions, something that is, of course, not available in practice a day before the relevant trading day (a point revisited after the following analysis):
In [112]: sym
Out[112]: 'AMZN.O'

In [113]: rets_sym = np.dot(retsf_, reg)

In [114]: rets_sym = pd.DataFrame(rets_sym,
                                  columns=[sym + '_apt'],
                                  index=retsf_.index)

In [115]: rets_sym[sym + '_real'] = retsd_[sym]

In [116]: rets_sym.mean() * 252
Out[116]: AMZN.O_apt     0.019401
          AMZN.O_real    0.050344
          dtype: float64

In [117]: rets_sym.std() * 252 ** 0.5
Out[117]: AMZN.O_apt     0.270995
          AMZN.O_real    0.307653
          dtype: float64

In [118]: rets_sym.corr()
Out[118]:              AMZN.O_apt  AMZN.O_real
          AMZN.O_apt     1.000000     0.832218
          AMZN.O_real    0.832218     1.000000

In [119]: rets_sym.cumsum().apply(np.exp).plot(figsize=(10, 6));
Predicts the daily stock price returns given the realized factor returns
Stores the results in a DataFrame object and adds column and index data
Adds the realized stock price returns to the DataFrame object
Calculates the annualized returns
Calculates the annualized volatility
Calculates the correlation factor
How accurately does APT predict the direction of the stock price movement given the realized factor returns? The following Python code shows that the accuracy score is a bit better than 75%:
In [120]: rets_sym['same'] = (np.sign(rets_sym[sym + '_apt']) ==
                              np.sign(rets_sym[sym + '_real']))

In [121]: rets_sym['same'].value_counts()
Out[121]: True     288
          False     89
          Name: same, dtype: int64

In [122]: rets_sym['same'].value_counts()[True] / len(rets_sym)
Out[122]: 0.7639257294429708
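As noted above, these figures rely on the realized factor returns of the same trading day. To gauge how much of the apparent explanatory power depends on this look-ahead information, one could repeat the comparison with the factor returns lagged by one day. The following hypothetical variation reuses retsf_, retsd_, reg, and sym from above:

import numpy as np
import pandas as pd

# use the previous day's factor returns to "predict" the current day's stock return
apt_lagged = np.dot(retsf_.shift(1).dropna(), reg)
apt_lagged = pd.Series(apt_lagged, index=retsf_.index[1:],
                       name=sym + '_apt_lagged')

# align with the realized returns and compare
comp = pd.concat([apt_lagged, retsd_[sym].rename(sym + '_real')], axis=1).dropna()
print(comp.corr())  # correlation without look-ahead information
hits = (np.sign(comp[sym + '_apt_lagged']) == np.sign(comp[sym + '_real'])).mean()
print(f'directional hit ratio: {hits:.2%}')

A noticeably lower correlation and hit ratio in this setting would indicate that the strong in-sample numbers above owe much to the use of same-day factor information.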
Debunking Central Assumptions
The previous section provides a number of numerical, real-world examples showing how popular normative financial theories might fail in practice. This section argues that one of the major reasons is that central assumptions of these popular financial theories are invalid; that is, they simply do not describe the reality of financial markets. The two assumptions analyzed are normally distributed returns and linear relationships.
Normally Distributed Returns
A normal distribution is completely specified by its first moment (the expectation) and its second moment (the standard deviation); this is generally not true for other distributions.
Sample data sets
For illustration, consider a set of standard normally distributed random numbers as generated by the following Python code. Figure 4-16 shows the typical bell shape of the resulting histogram:
In [1]: import numpy as np
        import pandas as pd
        from pylab import plt, mpl
        np.random.seed(100)
        plt.style.use('seaborn')
        mpl.rcParams['savefig.dpi'] = 300
        mpl.rcParams['font.family'] = 'serif'

In [2]: N = 10000

In [3]: snrn = np.random.standard_normal(N)
        snrn -= snrn.mean()
        snrn /= snrn.std()

In [4]: round(snrn.mean(), 4)
Out[4]: -0.0

In [5]: round(snrn.std(), 4)
Out[5]: 1.0

In [6]: plt.figure(figsize=(10, 6))
        plt.hist(snrn, bins=35);
Draws standard normally distributed random numbers
Corrects the first moment (expectation) to 0.0
Corrects the second moment (standard deviation) to 1.0
Now consider a set of numbers that shares the same first and second moment values but has a completely different distribution, as Figure 4-17 illustrates. Although the moments are the same, this distribution consists of only three discrete values:
In [7]: numbers = np.ones(N) * 1.5
        split = int(0.25 * N)
        numbers[split:3 * split] = -1
        numbers[3 * split:4 * split] = 0

In [8]: numbers -= numbers.mean()
        numbers /= numbers.std()

In [9]: round(numbers.mean(), 4)
Out[9]: 0.0

In [10]: round(numbers.std(), 4)
Out[10]: 1.0

In [11]: plt.figure(figsize=(10, 6))
         plt.hist(numbers, bins=35);
A set of numbers with three discrete values only
Corrects the first moment (expectation) to 0.0
Corrects the second moment (standard deviation) to 1.0
First and Second Moment
The first and second moments of a probability distribution describe a normal distribution completely, but they do not do so for other distributions. There are infinitely many distributions that share the first two moments with a given normal distribution while being completely different.
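For instance, a rescaled Student's t distribution with few degrees of freedom has (approximately) zero mean and unit variance but much fatter tails than the standard normal distribution. The following minimal sketch, a hypothetical addition to the examples above, illustrates the point; the choice of four degrees of freedom is arbitrary:

import numpy as np
import scipy.stats as scs

np.random.seed(100)
df = 4  # degrees of freedom, an illustrative choice
t_sample = np.random.standard_t(df, 10000)
t_sample /= np.sqrt(df / (df - 2))  # rescales to (approximately) unit variance

print(round(t_sample.mean(), 4), round(t_sample.std(), 4))  # first two moments close to 0 and 1
print(scs.normaltest(t_sample))  # normality is nevertheless clearly rejected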
In preparation for a test of real financial returns, consider the following Python functions that allow one to visualize data as a histogram and to add a probability density function (PDF) of a normal distribution with the first two moments of the data:
In [12]: import math
         import scipy.stats as scs
         import statsmodels.api as sm

In [13]: def dN(x, mu, sigma):
             ''' Probability density function of a normal random variable x. '''
             z = (x - mu) / sigma
             pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi * sigma ** 2)
             return pdf

In [14]: def return_histogram(rets, title=''):
             ''' Plots a histogram of the returns. '''
             plt.figure(figsize=(10, 6))
             x = np.linspace(min(rets), max(rets), 100)
             plt.hist(np.array(rets), bins=50,
                      density=True, label='frequency')
             y = dN(x, np.mean(rets), np.std(rets))
             plt.plot(x, y, linewidth=2, label='PDF')
             plt.xlabel('log returns')
             plt.ylabel('frequency/probability')
             plt.title(title)
             plt.legend()
Figure 4-18 shows how well the histogram approximates the PDF for the standard normally distributed random numbers:
In [15]: return_histogram(snrn)
By contrast, Figure 4-19 illustrates that the PDF of the normal distribution has nothing to do with the data shown as a histogram:
In [16]: return_histogram(numbers)
Another way of comparing a normal distribution to data is the Quantile-Quantile (Q-Q) plot. As Figure 4-20 shows, for normally distributed numbers, the numbers themselves lie (mostly) on a straight line in the Q-Q plane:
In [17]: def return_qqplot(rets, title=''):
             ''' Generates a Q-Q plot of the returns.
             '''
             fig = sm.qqplot(rets, line='s', alpha=0.5)
             fig.set_size_inches(10, 6)
             plt.title(title)
             plt.xlabel('theoretical quantiles')
             plt.ylabel('sample quantiles')

In [18]: return_qqplot(snrn)
Again, the Q-Q plot as shown in Figure 4-21 for the discrete numbers looks completely different to the one in Figure 4-20:
In [19]: return_qqplot(numbers)
Finally, one can also use statistical tests to check whether a set of numbers is normally distributed or not.
The following Python function implements three tests:
- Test for normal skew
- Test for normal kurtosis
- Test for normal skew and kurtosis combined
A p-value below 0.05 is generally considered to be a counter-indicator for normality; that is, the hypothesis that the numbers are normally distributed is rejected. In that sense, as in the preceding figures, the p-values for the two data sets speak for themselves:
In [20]: def print_statistics(rets):
             print('RETURN SAMPLE STATISTICS')
             print('---------------------------------------------')
             print('Skew of Sample Log Returns {:9.6f}'.format(
                         scs.skew(rets)))
             print('Skew Normal Test p-value   {:9.6f}'.format(
                         scs.skewtest(rets)[1]))
             print('---------------------------------------------')
             print('Kurt of Sample Log Returns {:9.6f}'.format(
                         scs.kurtosis(rets)))
             print('Kurt Normal Test p-value   {:9.6f}'.format(
                         scs.kurtosistest(rets)[1]))
             print('---------------------------------------------')
             print('Normal Test p-value        {:9.6f}'.format(
                         scs.normaltest(rets)[1]))
             print('---------------------------------------------')
In [21]: print_statistics(snrn)
RETURN SAMPLE STATISTICS
---------------------------------------------
Skew of Sample Log Returns  0.016793
Skew Normal Test p-value    0.492685
---------------------------------------------
Kurt of Sample Log Returns -0.024540
Kurt Normal Test p-value    0.637637
---------------------------------------------
Normal Test p-value         0.707334
---------------------------------------------

In [22]: print_statistics(numbers)
RETURN SAMPLE STATISTICS
---------------------------------------------
Skew of Sample Log Returns  0.689254
Skew Normal Test p-value    0.000000
---------------------------------------------
Kurt of Sample Log Returns -1.141902
Kurt Normal Test p-value    0.000000
---------------------------------------------
Normal Test p-value         0.000000
---------------------------------------------
Real financial returns
The following Python code retrieves EOD data from a remote source, as done earlier in the chapter, and calculates the log returns for all financial time series contained in the data set. Figure 4-22 shows that the log returns of the S&P 500 stock index represented as a histogram show a much higher peak and fatter tails when compared to the normal PDF with the sample expectation and standard deviation. These two insights are stylized facts because they can be consistently observed for different financial instruments:
In [23]: raw = pd.read_csv('http://hilpisch.com/aiif_eikon_eod_data.csv',
                           index_col=0, parse_dates=True).dropna()

In [24]: rets = np.log(raw / raw.shift(1)).dropna()

In [25]: symbol = '.SPX'

In [26]: return_histogram(rets[symbol].values, symbol)
Similar insights can be gained when considering the Q-Q plot for the S&P 500 log returns in Figure 4-23. In particular, the Q-Q plot visualizes the fat tails pretty well (points below the straight line to the left and above the straight line to the right):
In [27]: return_qqplot(rets[symbol].values, symbol)
The Python code that follows conducts the statistical tests regarding the normality of the real financial returns for a selection of the financial time series from the data set. Real financial returns regularly fail such tests. Therefore, it is safe to conclude that the normality assumption about financial returns hardly, if at all, describes financial reality:
In [28]: symbols = ['.SPX', 'AMZN.O', 'EUR=', 'GLD']

In [29]: for sym in symbols:
             print('\n{}'.format(sym))
             print(45 * '=')
             print_statistics(rets[sym].values)

.SPX
=============================================
RETURN SAMPLE STATISTICS
---------------------------------------------
Skew of Sample Log Returns -0.497160
Skew Normal Test p-value    0.000000
---------------------------------------------
Kurt of Sample Log Returns  4.598167
Kurt Normal Test p-value    0.000000
---------------------------------------------
Normal Test p-value         0.000000
---------------------------------------------

AMZN.O
=============================================
RETURN SAMPLE STATISTICS
---------------------------------------------
Skew of Sample Log Returns  0.135268
Skew Normal Test p-value    0.005689
---------------------------------------------
Kurt of Sample Log Returns  7.344837
Kurt Normal Test p-value    0.000000
---------------------------------------------
Normal Test p-value         0.000000
---------------------------------------------

EUR=
=============================================
RETURN SAMPLE STATISTICS
---------------------------------------------
Skew of Sample Log Returns -0.053959
Skew Normal Test p-value    0.268203
---------------------------------------------
Kurt of Sample Log Returns  1.780899
Kurt Normal Test p-value    0.000000
---------------------------------------------
Normal Test p-value         0.000000
---------------------------------------------

GLD
=============================================
RETURN SAMPLE STATISTICS
---------------------------------------------
Skew of Sample Log Returns -0.581025
Skew Normal Test p-value    0.000000
---------------------------------------------
Kurt of Sample Log Returns  5.899701
Kurt Normal Test p-value    0.000000
---------------------------------------------
Normal Test p-value         0.000000
---------------------------------------------
Normality Assumption
Although the normality assumption is a good approximation for many real-world phenomena, such as in physics, it is not appropriate and can even be dangerous when it comes to financial returns. Almost no financial return sample data set passes statistical normality tests. Beyond the fact that it has proven useful in other domains, a major reason why this assumption is found in so many financial models is that it leads to elegant and relatively simple mathematical models, calculations, and proofs.
Linear Relationships
Similar to the “omnipresence” of the normality assumption in financial models and theories, linear relationships between variables seem to be another widespread benchmark. This sub-section considers an important one, namely the assumed linear relationship in the CAPM between the beta of a stock and its expected (realized) return. Generally speaking, the higher the beta is, the higher the expected return given a positive market performance will be—in a fixed proportional way as given by the beta value itself.
Recall the calculation of the betas, the CAPM expected returns, and the realized returns for a selection of technology stocks from the previous section, which is repeated in the following Python code for convenience. This time, the beta values are added to the results' DataFrame object as well.
In [30]: r = 0.005

In [31]: market = '.SPX'

In [32]: res = pd.DataFrame()

In [33]: for sym in rets.columns[:4]:
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 muM = rets_[market].mean() * 252
                 cov = rets_.cov().loc[sym, market]
                 var = rets_[market].var()
                 beta = cov / var
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 muM = rets_[market].mean() * 252
                 mu_capm = r + beta * (muM - r)
                 mu_real = rets_[sym].mean() * 252
                 res = res.append(pd.DataFrame({'symbol': sym,
                                                'beta': beta,
                                                'mu_capm': mu_capm,
                                                'mu_real': mu_real},
                                               index=[year + 1]),
                                  sort=True)
The following analysis calculates the R² score of a linear regression for which the beta is the independent variable and the expected CAPM return, given the market portfolio performance, is the dependent variable. R², the coefficient of determination, measures how well a model performs compared to a baseline predictor in the form of a simple mean value. The linear regression can only explain around 10% of the variability in the expected CAPM return, a pretty low value, which is also confirmed by Figure 4-24:
In [34]: from sklearn.metrics import r2_score

In [35]: reg = np.polyfit(res['beta'], res['mu_capm'], deg=1)
         res['mu_capm_ols'] = np.polyval(reg, res['beta'])

In [36]: r2_score(res['mu_capm'], res['mu_capm_ols'])
Out[36]: 0.09272355783573516

In [37]: res.plot(kind='scatter', x='beta', y='mu_capm', figsize=(10, 6))
         x = np.linspace(res['beta'].min(), res['beta'].max())
         plt.plot(x, np.polyval(reg, x), 'g--', label='regression')
         plt.legend();
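As a reminder, R² compares the residual sum of squares of a model with the total sum of squares around the sample mean. The following short sketch, a hypothetical addition with made-up numbers, verifies that the manual calculation agrees with sklearn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

def r_squared(y, y_hat):
    ''' Coefficient of determination: 1 - SS_res / SS_tot. '''
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares around the mean
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])      # illustrative "observed" values
y_hat = np.array([1.1, 1.9, 3.2, 3.8])  # illustrative "predicted" values
print(r_squared(y, y_hat), r2_score(y, y_hat))  # both yield the same value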
For the realized return, the explanatory power of the linear regression is even lower, with about 4.5% (see Figure 4-25). The linear regressions recover the positive relationship between beta and stock returns—“the higher the beta, the higher the return given the (positive) market portfolio performance”—as indicated by the positive slope of the regression lines. However, they only explain a small part of the observed overall variability in the stock returns:
In [38]: reg = np.polyfit(res['beta'], res['mu_real'], deg=1)
         res['mu_real_ols'] = np.polyval(reg, res['beta'])

In [39]: r2_score(res['mu_real'], res['mu_real_ols'])
Out[39]: 0.04466919444752959

In [40]: res.plot(kind='scatter', x='beta', y='mu_real', figsize=(10, 6))
         x = np.linspace(res['beta'].min(), res['beta'].max())
         plt.plot(x, np.polyval(reg, x), 'g--', label='regression')
         plt.legend();
Linear Relationships
As with the normality assumption, linear relationships can often be observed in the physical world. However, in finance there are hardly any cases in which variables depend on each other in a clearly linear way. From a modeling point of view, linear relationships lead, as does the normality assumption, to elegant and relatively simple mathematical models, calculations, and proofs. In addition, the standard tool in financial econometrics, OLS regression, is well suited to dealing with linear relationships in data. These are major reasons why normality and linearity are often deliberately chosen as convenient building blocks of financial models and theories.
Conclusions
Science has been driven for centuries by the rigorous generation and analysis of data. However, finance used to be characterized by normative theories based on simplified mathematical models of the financial markets, relying on assumptions such as normality of returns and linear relationships. The almost universal and comprehensive availability of (financial) data has led to a shift in focus from a theory-first approach to data-driven finance. Several examples based on real financial data illustrate that many popular financial models and theories cannot survive a confrontation with financial market realities. Although elegant, they might be too simplistic to capture the complexities, changing nature, and nonlinearities of financial markets.
References
Books and papers cited in this chapter:
Bender, Jennifer, et al. 2013. "Foundations of Factor Investing." MSCI Research Insight.
Python Code
The following Python file contains a number of helper functions to simplify certain tasks in NLP:
#
# NLP Helper Functions
#
# Artificial Intelligence in Finance
# (c) Dr Yves J Hilpisch
# The Python Quants GmbH
#
import re
import nltk
import string
import pandas as pd
from pylab import plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from lxml.html.clean import Cleaner
from sklearn.feature_extraction.text import TfidfVectorizer

plt.style.use('seaborn')

cleaner = Cleaner(style=True, links=True, allow_tags=[''],
                  remove_unknown_tags=False)

stop_words = stopwords.words('english')
stop_words.extend(['new', 'old', 'pro', 'open', 'menu', 'close'])


def remove_non_ascii(s):
    ''' Removes all non-ascii characters.
    '''
    return ''.join(i for i in s if ord(i) < 128)


def clean_up_html(t):
    t = cleaner.clean_html(t)
    t = re.sub('[\n\t\r]', ' ', t)
    t = re.sub(' +', ' ', t)
    t = re.sub('<.*?>', '', t)
    t = remove_non_ascii(t)
    return t


def clean_up_text(t, numbers=False, punctuation=False):
    ''' Cleans up a text, e.g. HTML document,
    from HTML tags and also cleans up the
    text body.
    '''
    try:
        t = clean_up_html(t)
    except:
        pass
    t = t.lower()
    t = re.sub(r"what's", "what is ", t)
    t = t.replace('(ap)', '')
    t = re.sub(r"\'ve", " have ", t)
    t = re.sub(r"can't", "cannot ", t)
    t = re.sub(r"n't", " not ", t)
    t = re.sub(r"i'm", "i am ", t)
    t = re.sub(r"\'s", "", t)
    t = re.sub(r"\'re", " are ", t)
    t = re.sub(r"\'d", " would ", t)
    t = re.sub(r"\'ll", " will ", t)
    t = re.sub(r'\s+', ' ', t)
    t = re.sub(r"\\", "", t)
    t = re.sub(r"\'", "", t)
    t = re.sub(r"\"", "", t)
    if numbers:
        t = re.sub('[^a-zA-Z ?!]+', '', t)
    if punctuation:
        t = re.sub(r'\W+', ' ', t)
    t = remove_non_ascii(t)
    t = t.strip()
    return t


def nltk_lemma(word):
    ''' If one exists, returns the lemma of a word.
    I.e. the base or dictionary version of it.
    '''
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma


def tokenize(text, min_char=3, lemma=True, stop=True,
             numbers=False):
    ''' Tokenizes a text and implements some
    transformations.
    '''
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if len(t) >= min_char]
    if numbers:
        tokens = [t for t in tokens if t[0].lower()
                  in string.ascii_lowercase]
    if stop:
        tokens = [t for t in tokens if t not in stop_words]
    if lemma:
        tokens = [nltk_lemma(t) for t in tokens]
    return tokens


def generate_word_cloud(text, no, name=None, show=True):
    ''' Generates a word cloud bitmap given a
    text document (string).
    It uses the Term Frequency (TF) and
    Inverse Document Frequency (IDF)
    vectorization approach to derive the
    importance of a word -- represented
    by the size of the word in the word cloud.

    Parameters
    ==========
    text: str
        text as the basis
    no: int
        number of words to be included
    name: str
        path to save the image
    show: bool
        whether to show the generated image or not
    '''
    tokens = tokenize(text)
    vec = TfidfVectorizer(min_df=2,
                          analyzer='word',
                          ngram_range=(1, 2),
                          stop_words='english')
    vec.fit_transform(tokens)
    wc = pd.DataFrame({'words': vec.get_feature_names(),
                       'tfidf': vec.idf_})
    words = ' '.join(wc.sort_values('tfidf', ascending=True)['words'].head(no))
    wordcloud = WordCloud(max_font_size=110,
                          background_color='white',
                          width=1024, height=768,
                          margin=10, max_words=150).generate(words)
    if show:
        plt.figure(figsize=(10, 10))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()
    if name is not None:
        wordcloud.to_file(name)


def generate_key_words(text, no):
    try:
        tokens = tokenize(text)
        vec = TfidfVectorizer(min_df=2,
                              analyzer='word',
                              ngram_range=(1, 2),
                              stop_words='english')
        vec.fit_transform(tokens)
        wc = pd.DataFrame({'words': vec.get_feature_names(),
                           'tfidf': vec.idf_})
        words = wc.sort_values('tfidf', ascending=False)['words'].values
        words = [a for a in words if not a.isnumeric()][:no]
    except:
        words = list()
    return words