Chapter 4. Data-Driven Finance
If artificial intelligence is the new electricity, big data is the oil that powers the generators.
Kai-Fu Lee (2018)
Nowadays, analysts sift through nontraditional information such as satellite imagery and credit card data, or use artificial intelligence techniques such as machine learning and natural language processing to glean fresh insights from traditional sources such as economic data and earnings-call transcripts.
Robin Wigglesworth (2019)
This chapter discusses central aspects of data-driven finance. For the purposes of this book, data-driven finance is understood to be a financial context (theory, model, application, and so on) that is primarily driven by and based on insights gained from data.
“Scientific Method” discusses the scientific method, which is about generally accepted principles that should guide scientific effort. “Financial Econometrics and Regression” is about financial econometrics and related topics. “Data Availability” sheds light on which types of (financial) data are available today and in what quality and quantity via programmatic APIs. “Normative Theories Revisited” revisits the normative theories of Chapter 3 and analyzes them based on real financial time series data. Also based on real financial data, “Debunking Central Assumptions” debunks two of the most commonly found assumptions in financial models and theories: normality of returns and linear relationships.
Scientific Method
The scientific method refers to a set of generally accepted principles that should guide any scientific project. Wikipedia defines the scientific method as follows:
The scientific method is an empirical method of acquiring knowledge that has characterized the development of science since at least the 17th century. It involves careful observation, applying rigorous skepticism about what is observed, given that cognitive assumptions can distort how one interprets the observation. It involves formulating hypotheses, via induction, based on such observations; experimental and measurement-based testing of deductions drawn from the hypotheses; and refinement (or elimination) of the hypotheses based on the experimental findings. These are principles of the scientific method, as distinguished from a definitive series of steps applicable to all scientific enterprises.
Given this definition, normative finance, as discussed in Chapter 3, is in stark contrast to the scientific method. Normative financial theories mostly rely on assumptions and axioms in combination with deduction as the major analytical method to arrive at their central results.

Expected utility theory (EUT) assumes that agents have the same utility function no matter what state of the world unfolds and that they maximize expected utility under conditions of uncertainty.

Mean-variance portfolio (MVP) theory describes how investors should invest under conditions of uncertainty assuming that only the expected return and the expected volatility of a portfolio over one period count.

The capital asset pricing model (CAPM) assumes that only the nondiversifiable market risk explains the expected return and the expected volatility of a stock over one period.

Arbitrage pricing theory (APT) assumes that a number of identifiable risk factors explain the expected return and the expected volatility of a stock over time; admittedly, compared to the other theories, the formulation of APT is rather broad and allows for wide-ranging interpretations.
What characterizes the aforementioned normative financial theories is that they were originally derived under certain assumptions and axioms using “pen and paper” only, without any recourse to real-world data or observations. From a historical point of view, many of these theories were rigorously tested against real-world data only long after their publication dates. This can be explained primarily by better data availability and increased computational capabilities over time. After all, data and computation are the main ingredients for the application of statistical methods in practice. The discipline at the intersection of mathematics, statistics, and finance that applies such methods to financial market data is typically called financial econometrics, the topic of the next section.
Financial Econometrics and Regression
Adapting the definition provided by Investopedia for econometrics, one can define financial econometrics as follows:
[Financial] econometrics is the quantitative application of statistical and mathematical models using [financial] data to develop financial theories or test existing hypotheses in finance and to forecast future trends from historical data. It subjects real-world [financial] data to statistical trials and then compares and contrasts the results against the [financial] theory or theories being tested.
Alexander (2008b) provides a thorough and broad introduction to the field of financial econometrics. The second chapter of the book covers single- and multifactor models, such as the CAPM and APT. Alexander (2008b) is part of a series of four books called Market Risk Analysis. The first in the series, Alexander (2008a), covers theoretical background concepts, topics, and methods, such as MVP theory and the CAPM themselves. The book by Campbell (2018) is another comprehensive resource for financial theory and related econometric research.
One of the major tools in financial econometrics is regression, in both its univariate and multivariate forms. Regression is also a central tool in statistical learning in general. What is the difference between traditional mathematics and statistical learning? Although there is no general answer to this question (after all, statistics is a subfield of mathematics), a simple example should emphasize a major difference relevant to the context of this book.
First is the standard mathematical way. Assume a mathematical function is given as follows:
$f:\mathbb{R}\to \mathbb{R},\; x\mapsto 2+\frac{1}{2}x$
Given multiple values of ${x}_{i},i=1,2,...,n$, one can derive function values for $f$ by applying the above definition:
${y}_{i}=f\left({x}_{i}\right),i=1,2,...,n$
The following Python code illustrates this based on a simple numerical example:
In [1]: import numpy as np

In [2]: def f(x):
            return 2 + 1 / 2 * x

In [3]: x = np.arange(-4, 5)
        x
Out[3]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [4]: y = f(x)
        y
Out[4]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
Second is the approach taken in statistical learning. Whereas in the preceding example, the function comes first and then the data is derived, this sequence is reversed in statistical learning. Here, the data is generally given and a functional relationship is to be found. In this context, $x$ is often called the independent variable and $y$ the dependent variable. Consequently, consider the following data:
$\left({x}_{i},{y}_{i}\right),i=1,2,...,n$
The problem is to find, for example, parameters $\alpha ,\beta $ such that:
$\widehat{f}\left({x}_{i}\right)\equiv \alpha +\beta {x}_{i}={\widehat{y}}_{i}\approx {y}_{i},i=1,2,...,n$
Another way of writing this is by including residual values ${\epsilon}_{i},i=1,2,...,n$:
$\alpha +\beta {x}_{i}+{\epsilon}_{i}={y}_{i},i=1,2,...,n$
In the context of ordinary least-squares (OLS) regression, $\alpha ,\beta $ are chosen to minimize the mean-squared error between the approximated values ${\widehat{y}}_{i}$ and the real values ${y}_{i}$. The minimization problem, then, is as follows:
$\underset{\alpha ,\beta }{\min}\frac{1}{n}\sum _{i=1}^{n}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}$
In the case of simple OLS regression, as described previously, the optimal solutions are known in closed form and are as follows:
$\beta =\frac{\text{Cov}\left(x,y\right)}{\text{Var}\left(x\right)}$
$\alpha =\overline{y}-\beta \overline{x}$
Here, $\text{Cov}\left(\right)$ stands for the covariance, $\text{Var}\left(\right)$ for the variance, and $\overline{x},\overline{y}$ for the mean values of $x,y$.
Returning to the preceding numerical example, these insights can be used to derive optimal parameters $\alpha ,\beta $ and, in this particular case, to recover the original definition of $f\left(x\right)$:
In [5]: x
Out[5]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [6]: y
Out[6]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

In [7]: beta = np.cov(x, y, ddof=0)[0, 1] / x.var()  # (1)
        beta
Out[7]: 0.49999999999999994

In [8]: alpha = y.mean() - beta * x.mean()  # (2)
        alpha
Out[8]: 2.0

In [9]: y_ = alpha + beta * x  # (3)

In [10]: np.allclose(y_, y)  # (4)
Out[10]: True

(1) $\beta $ as derived from the covariance matrix and the variance
(2) $\alpha $ as derived from $\beta $ and the mean values
(3) Estimated values ${\widehat{y}}_{i},i=1,2,...,n$, given $\alpha ,\beta $
(4) Checks whether ${\widehat{y}}_{i},{y}_{i}$ values are numerically equal
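As a cross-check on the closed-form solution, the same parameters can be obtained from a generic least-squares routine. The following sketch uses `np.polyfit` with degree 1; it also illustrates why `ddof=0` is passed to `np.cov` in the session above: `np.cov` defaults to the sample covariance (`ddof=1`), while `ndarray.var` defaults to the population variance (`ddof=0`), and mixing the two biases the slope. The variable names `beta_pf`, `alpha_pf`, and `beta_mixed` are introduced here for illustration only.

```python
import numpy as np

x = np.arange(-4, 5)
y = 2 + 1 / 2 * x

# Closed-form estimates with consistent normalization (ddof=0 in both terms).
beta = np.cov(x, y, ddof=0)[0, 1] / x.var()
alpha = y.mean() - beta * x.mean()

# Generic least-squares fit of a degree-1 polynomial.
# np.polyfit returns coefficients from the highest degree down: slope, intercept.
beta_pf, alpha_pf = np.polyfit(x, y, deg=1)

# Mixing normalizations: np.cov defaults to ddof=1, x.var() to ddof=0.
# The slope is then inflated by the factor n / (n - 1).
beta_mixed = np.cov(x, y)[0, 1] / x.var()

print(np.allclose([alpha, beta], [alpha_pf, beta_pf]))
print(beta_mixed / beta, len(x) / (len(x) - 1))
```

With nine data points, the inflation factor is 9/8, which is why the mixed version does not recover the original slope.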
The preceding example and those in Chapter 1 illustrate that the application of OLS regression to a given data set is in general straightforward. There are more reasons why OLS regression has become one of the central tools in econometrics and financial econometrics. Among them are the following:
 Centuries old

The least-squares approach, particularly in combination with regression, has been used for more than 200 years.^{1}
 Simplicity

The mathematics behind OLS regression is easy to understand and easy to implement in programming.
 Scalability

There is basically no limit regarding the data size to which OLS regression can be applied.
 Flexibility

OLS regression can be applied to a wide range of problems and data sets.
 Speed

OLS regression is fast to evaluate, even on larger data sets.
 Availability

Efficient implementations in Python and many other programming languages are readily available.
However, as easy and straightforward as the application of OLS regression might be in general, the method rests on a number of assumptions—most of them related to the residuals—that are not always satisfied in practice.
 Linearity

The model is linear in its parameters, with regard to both the coefficients and the residuals.
 Independence

Independent variables are not perfectly (to a high degree) correlated with each other (no multicollinearity).
 Zero mean

The mean value of the residuals is (close to) zero.
 No correlation

Residuals are not (strongly) correlated with the independent variables.
 Homoscedasticity

The standard deviation of the residuals is (almost) constant.
 No autocorrelation

The residuals are not (strongly) correlated with each other.
In practice, it is in general quite simple to test for the validity of the assumptions given a specific data set.
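A minimal sketch of such checks follows, using NumPy only. The data is simulated (an assumption made for illustration: a linear relationship with normally distributed noise), and the diagnostics are deliberately simple rather than formal statistical tests. Note that for an OLS fit with an intercept, the zero-mean and no-correlation properties hold in-sample by construction, so the more informative checks concern autocorrelation and homoscedasticity.

```python
import numpy as np

# Simulated data (assumption): linear relationship plus Gaussian noise.
rng = np.random.default_rng(100)
x = np.linspace(-4, 4, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.5, size=len(x))

# OLS fit and residuals.
beta, alpha = np.polyfit(x, y, deg=1)
res = y - (alpha + beta * x)

# Zero mean: the residuals should average out to (about) zero.
mean_res = res.mean()

# No correlation: residuals versus the independent variable.
corr_res_x = np.corrcoef(res, x)[0, 1]

# No autocorrelation: Durbin-Watson statistic; values near 2 indicate
# no first-order autocorrelation in the residuals.
dw = np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

# Homoscedasticity (crude check): compare the residual standard deviation
# in the first and second halves of the sample.
half = len(res) // 2
std_ratio = res[:half].std() / res[half:].std()

print(f'mean={mean_res:.2e} corr={corr_res_x:.2e} dw={dw:.2f} ratio={std_ratio:.2f}')
```

For real financial data, formal tests from a statistics package would typically replace these rough diagnostics.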
Data Availability
Financial econometrics is driven by statistical methods, such as regression, and the availability of financial data. From the 1950s to the 1990s, and even into the early 2000s, theoretical and empirical financial research was mainly driven by data sets that were small by today’s standards and consisted mostly of end-of-day (EOD) data. Data availability has changed dramatically over the last decade or so, with more and more types of financial and other data available in ever-increasing granularity, quantity, and velocity.
Programmatic APIs
With regard to data-driven finance, what is important is not only what data is available but also how it can be accessed and processed. For quite a while now, finance professionals have relied on data terminals from companies such as Refinitiv (see Eikon Terminal) or Bloomberg (see Bloomberg Terminal), to mention just two of the leading providers. Newspapers, magazines, financial reports, and the like have long been replaced by such terminals as the primary source for financial information. However, the sheer volume and variety of data provided by such terminals cannot be consumed systematically by a single user or even large groups of finance professionals. Therefore, the major breakthrough in data-driven finance is to be seen in the programmatic availability of data via application programming interfaces (APIs) that allow the usage of computer code to select, retrieve, and process arbitrary data sets.
The remainder of this section is devoted to the illustration of such APIs by which even academics and retail investors can retrieve a wealth of different data sets. Before such examples are provided, Table 4-1 offers an overview of categories of data that are in general relevant in a financial context, as well as typical examples. In the table, structured data refers to numerical data types that often come in tabular structures, while unstructured data refers to data in the form of standard text that often has no structure beyond headers or paragraphs, for example. Alternative data refers to data types that are typically not considered financial data.
| Time | Structured data | Unstructured data | Alternative data |
|---|---|---|---|
| Historical | Prices, fundamentals | News, texts | Web, social media, satellites |
| Streaming | Prices, volumes | News, filings | Web, social media, satellites, Internet of Things |
Structured Historical Data
First, structured historical data types will be retrieved programmatically. To this end, the following Python code uses the Eikon Data API.^{2}
To access data via the Eikon Data API, a local application, such as Refinitiv Workspace, must be running and the API access must be configured on the Python level:
In [11]: import eikon as ek
         import configparser

In [12]: c = configparser.ConfigParser()
         c.read('../aiif.cfg')
         ek.set_app_key(c['eikon']['app_id'])
         2020-08-04 10:30:18,059 P[14938] [MainThread 4521459136] Error on handshake port 9000 : ReadTimeout(ReadTimeout())
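The configuration file `../aiif.cfg` itself is not shown. The following sketch illustrates how such a file might be laid out so that `c['eikon']['app_id']` resolves; the section and key names are taken from the code above, whereas the value `YOUR_EIKON_APP_KEY` is a placeholder and the in-memory string stands in for the real file.

```python
import configparser

# Hypothetical file content; the real '../aiif.cfg' is not shown in the text.
cfg_text = """
[eikon]
app_id = YOUR_EIKON_APP_KEY
"""

c = configparser.ConfigParser()
c.read_string(cfg_text)  # stands in for c.read('../aiif.cfg')
app_id = c['eikon']['app_id']
print(app_id)

# ConfigParser.read() silently skips files it cannot find and returns
# the list of files that were actually parsed, so checking the return
# value is a cheap way to detect a wrong path early.
found = configparser.ConfigParser().read('does_not_exist.cfg')
print(found)
```

Keeping credentials in a separate configuration file avoids hard-coding them in notebooks or scripts.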
If these requirements are met, historical structured data can be retrieved via a single function call. For example, the following Python code retrieves EOD data for a set of symbols and a specified time interval:
In [14]: symbols = ['AAPL.O', 'MSFT.O', 'NFLX.O', 'AMZN.O']  # (1)

In [15]: data = ek.get_timeseries(symbols,
                                  fields='CLOSE',
                                  start_date='2019-07-01',
                                  end_date='2020-07-01')  # (2)

In [16]: data.info()  # (3)
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 254 entries, 2019-07-01 to 2020-07-01
         Data columns (total 4 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   AAPL.O  254 non-null    float64
          1   MSFT.O  254 non-null    float64
          2   NFLX.O  254 non-null    float64
          3   AMZN.O  254 non-null    float64
         dtypes: float64(4)
         memory usage: 9.9 KB

In [17]: data.tail()  # (4)
Out[17]: CLOSE       AAPL.O  MSFT.O  NFLX.O   AMZN.O
         Date
         2020-06-25  364.84  200.34  465.91  2754.58
         2020-06-26  353.63  196.33  443.40  2692.87
         2020-06-29  361.78  198.44  447.24  2680.38
         2020-06-30  364.80  203.51  455.04  2758.82
         2020-07-01  364.11  204.70  485.64  2878.70

(1) Defines a list of RICs (symbols) to retrieve data for^{3}
(2) Retrieves EOD Close prices for the list of RICs
(3) Shows the meta information for the returned DataFrame object
(4) Shows the final rows of the DataFrame object
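A typical first processing step for such EOD closing prices is the calculation of log returns. The following sketch does this with NumPy for the five rows shown in the output above; the prices are copied by hand into an array (an assumption made so that the example runs without Eikon access).

```python
import numpy as np

# Closing prices copied from the data.tail() output above
# (columns: AAPL.O, MSFT.O, NFLX.O, AMZN.O).
close = np.array([
    [364.84, 200.34, 465.91, 2754.58],
    [353.63, 196.33, 443.40, 2692.87],
    [361.78, 198.44, 447.24, 2680.38],
    [364.80, 203.51, 455.04, 2758.82],
    [364.11, 204.70, 485.64, 2878.70],
])

# Daily log returns r_t = log(P_t / P_{t-1}), computed column-wise.
rets = np.log(close[1:] / close[:-1])

# Log returns are additive over time: their sum equals the log return
# over the whole period.
total = rets.sum(axis=0)

print(rets.round(4))
print(total.round(4))
```

With the full `DataFrame` object, the same calculation is commonly written as `np.log(data / data.shift(1))`.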
Similarly, one-minute bars with OHLC fields can be retrieved with appropriate adjustments of the parameters:
In [18]: data = ek.get_timeseries('AMZN.O',
                                  fields='*',
                                  start_date='2020-08-03',
                                  end_date='2020-08-04',
                                  interval='minute')

In [19]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 911 entries, 2020-08-03 08:01:00 to 2020-08-04 00:00:00
         Data columns (total 6 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   HIGH    911 non-null    float64
          1   LOW     911 non-null    float64
          2   OPEN    911 non-null    float64
          3   CLOSE   911 non-null    float64
          4   COUNT   911 non-null    float64
          5   VOLUME  911 non-null    float64
         dtypes: float64(6)
         memory usage: 49.8 KB

In [20]: data.head()
Out[20]: AMZN.O                  HIGH      LOW     OPEN    CLOSE  COUNT  VOLUME
         Date
         2020-08-03 08:01:00  3190.00  3176.03  3176.03  3178.17   18.0   383.0
         2020-08-03 08:02:00  3183.02  3176.03  3180.00  3177.01   15.0   513.0
         2020-08-03 08:03:00  3179.91  3177.05  3179.91  3177.05    5.0    14.0
         2020-08-03 08:04:00  3184.00  3179.91  3179.91  3184.00    8.0   102.0
         2020-08-03 08:05:00  3184.91  3182.91  3183.30  3184.00   12.0   403.0
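Bars of a finer granularity can be aggregated into coarser ones. As a sketch under the assumption that the five one-minute bars shown above are to be combined into a single five-minute bar, the following plain Python code applies the usual aggregation rules: first open, highest high, lowest low, last close, and summed counts and volumes. The tuple layout and the names `bars` and `bar_5min` are illustrative choices.

```python
# One-minute bars copied from the data.head() output above:
# (HIGH, LOW, OPEN, CLOSE, COUNT, VOLUME)
bars = [
    (3190.00, 3176.03, 3176.03, 3178.17, 18.0, 383.0),
    (3183.02, 3176.03, 3180.00, 3177.01, 15.0, 513.0),
    (3179.91, 3177.05, 3179.91, 3177.05, 5.0, 14.0),
    (3184.00, 3179.91, 3179.91, 3184.00, 8.0, 102.0),
    (3184.91, 3182.91, 3183.30, 3184.00, 12.0, 403.0),
]

# Aggregation rules for OHLC data.
bar_5min = {
    'OPEN': bars[0][2],                  # open of the first bar
    'HIGH': max(b[0] for b in bars),     # highest high
    'LOW': min(b[1] for b in bars),      # lowest low
    'CLOSE': bars[-1][3],                # close of the last bar
    'COUNT': sum(b[4] for b in bars),    # total number of trades
    'VOLUME': sum(b[5] for b in bars),   # total traded volume
}

print(bar_5min)
```

With a `DataFrame` object, the same logic is usually expressed via `resample()` in combination with `agg()`.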
One can retrieve more than structured financial time series data from the Eikon Data API. Fundamental data can also be retrieved for a number of RICs and a number of different data fields at the same time, as the following Python code illustrates:
In [21]: data_grid, err = ek.get_data(['AAPL.O', 'IBM', 'GOOG.O', 'AMZN.O'],
                                      ['TR.TotalReturnYTD', 'TR.WACCBeta',
                                       'YRHIGH', 'YRLOW',
                                       'TR.Ebitda', 'TR.GrossProfit'])

In [22]: data_grid
Out[22]:   Instrument  YTD Total Return      Beta   YRHIGH      YRLOW        EBITDA  \
         0     AAPL.O         49.141271  1.221249   425.66   192.5800  7.647700e+10
         1        IBM         -5.019570  1.208156   158.75    90.5600  1.898600e+10
         2     GOOG.O         10.278829  1.067084  1586.99  1013.5361  4.757900e+10
         3     AMZN.O         68.406897  1.338106  3344.29  1626.0318  3.025600e+10

            Gross Profit
         0   98392000000
         1   36488000000
         2   89961000000
         3  114986000000
Programmatic Data Availability
Basically all structured financial data is available nowadays in programmatic fashion. Financial time series data, in this context, is the paramount example. However, other structured data types such as fundamental data are available in the same way, simplifying the work of quantitative analysts, traders, portfolio managers, and the like significantly.
Structured Streaming Data
Many applications in finance require real-time structured data, such as in algorithmic trading or market risk management. The following Python code makes use of the API of the Oanda Trading Platform and streams in real time a number of time stamps, bid quotes, and ask quotes for the Bitcoin price in USD:
In [23]: import tpqoa

In [24]: oa = tpqoa.tpqoa('../aiif.cfg')

In [25]: oa.stream_data('BTC_USD', stop=5)
         2020-08-04T08:30:38.621075583Z 11298.8 11334.8
         2020-08-04T08:30:50.485678488Z 11298.3 11334.3
         2020-08-04T08:30:50.801666847Z 11297.3 11333.3
         2020-08-04T08:30:51.326269990Z 11296.0 11332.0
         2020-08-04T08:30:54.423973431Z 11296.6 11332.6
Printing out the streamed data fields is, of course, only for illustration. Certain financial applications might require sophisticated processing of the retrieved data and the generation of signals or statistics, for instance. Particularly during weekdays and trading hours, the number of price ticks streamed for financial instruments increases steadily, demanding powerful data processing capabilities on the end of financial institutions that need to process such data in real time or at least in near-real time (“near time”).
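As a very simple instance of such processing, the following sketch derives the mid price and the bid-ask spread for each streamed tick. The five ticks are copied by hand from the streaming output above (an assumption made so that the example runs without an Oanda account); in a live setting, the same calculation would sit in the callback that handles each incoming tick.

```python
# Ticks copied from the streaming output above: (time stamp, bid, ask).
ticks = [
    ('2020-08-04T08:30:38.621075583Z', 11298.8, 11334.8),
    ('2020-08-04T08:30:50.485678488Z', 11298.3, 11334.3),
    ('2020-08-04T08:30:50.801666847Z', 11297.3, 11333.3),
    ('2020-08-04T08:30:51.326269990Z', 11296.0, 11332.0),
    ('2020-08-04T08:30:54.423973431Z', 11296.6, 11332.6),
]

mids = []
spreads = []
for time, bid, ask in ticks:
    mid = (bid + ask) / 2   # mid price between bid and ask
    spread = ask - bid      # bid-ask spread in USD
    mids.append(round(mid, 1))
    spreads.append(round(spread, 1))
    # time[11:19] extracts the HH:MM:SS part of the time stamp.
    print(f'{time[11:19]} | mid {mid:.1f} | spread {spread:.1f}')
```

The rounding to one decimal only suppresses floating-point noise in the quotes.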
The significance of this observation becomes clear when looking at Apple Inc. stock prices. One can calculate that there are roughly $252\cdot 40=10,080$ EOD closing quotes for the Apple stock over a period of 40 years. (Apple Inc. went public on December 12, 1980.) The following code retrieves tick data for the Apple stock price for one hour only. The retrieved data set, which might not even be complete for the given time interval, has 50,000 data rows, or five times as many tick quotes as the EOD quotes accumulated over 40 years of trading:
In [26]: data = ek.get_timeseries('AAPL.O',
                                  fields='*',
                                  start_date='2020-08-03 15:00:00',
                                  end_date='2020-08-03 16:00:00',
                                  interval='tick')

In [27]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 50000 entries, 2020-08-03 15:26:24.889000 to 2020-08-03 15:59:59.762000
         Data columns (total 2 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   VALUE   49953 non-null  float64
          1   VOLUME  50000 non-null  float64
         dtypes: float64(2)
         memory usage: 1.1 MB

In [28]: data.head()
Out[28]: AAPL.O                    VALUE  VOLUME
         Date
         2020-08-03 15:26:24.889  439.06   175.0
         2020-08-03 15:26:24.889  439.08     3.0
         2020-08-03 15:26:24.890  439.08   100.0
         2020-08-03 15:26:24.890  439.08     5.0
         2020-08-03 15:26:24.899  439.10    35.0
EOD Versus Tick Data
Most of the financial theories still applied today have their origin in when EOD data was basically the only type of financial data available. Today, financial institutions, and even retail traders and investors, are confronted with never-ending streams of real-time data. The example of Apple stock illustrates that for a single stock during one trading hour, there might be roughly five times as many ticks coming in as the amount of EOD data accumulated over a period of 40 years. This not only challenges actors in financial markets, but also puts into question whether existing financial theories can be applied to such an environment at all.
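The orders of magnitude involved can be verified with a few lines of Python. The sketch below uses only numbers that appear in the text and in the `data.info()` output for the tick data (252 trading days per year, 40 years, 50,000 ticks, first and last time stamps); the variable names are chosen for illustration.

```python
from datetime import datetime

# EOD closing quotes over 40 years, assuming 252 trading days per year.
eod_quotes = 252 * 40

# Number of ticks retrieved for the one-hour request.
ticks = 50_000

# Ratio of ticks to EOD quotes.
ratio = ticks / eod_quotes

# Time window actually covered by the retrieved ticks
# (first and last index values of the tick data set).
start = datetime(2020, 8, 3, 15, 26, 24, 889000)
end = datetime(2020, 8, 3, 15, 59, 59, 762000)
seconds = (end - start).total_seconds()
ticks_per_second = ticks / seconds

print(eod_quotes)
print(round(ratio, 2))
print(round(seconds / 60, 1), round(ticks_per_second, 1))
```

The retrieved ticks cover only about 34 minutes of the requested hour, which corresponds to roughly 25 ticks per second for a single stock.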
Unstructured Historical Data
Many important data sources in finance provide unstructured data only, such as financial news or company filings. Undoubtedly, machines are much better and faster than humans at crunching large amounts of structured, numerical data. However, recent advances in natural language processing (NLP) make machines better and faster at processing financial news too, for example. In 2020, data service providers ingest roughly 1.5 million news articles on a daily basis. It is clear that this vast amount of text-based data cannot be processed properly by human beings.
Fortunately, unstructured data is also to a large extent available these days via programmatic APIs. The following Python code retrieves a number of news articles from the Eikon Data API related to the company Tesla, Inc. and its production. One article is selected and shown in full:
In [29]: news = ek.get_news_headlines('R:TSLA.O PRODUCTION',
                                      date_from='2020-06-01',
                                      date_to='2020-08-01',
                                      count=7)

In [30]: news
Out[30]:                                           versionCreated  \
         2020-07-29 11:02:31.276 2020-07-29 11:02:31.276000+00:00
         2020-07-28 00:59:48.000        2020-07-28 00:59:48+00:00
         2020-07-23 21:20:36.090 2020-07-23 21:20:36.090000+00:00
         2020-07-23 08:22:17.000        2020-07-23 08:22:17+00:00
         2020-07-23 07:08:48.000        2020-07-23 07:46:56+00:00
         2020-07-23 00:55:54.000        2020-07-23 00:55:54+00:00
         2020-07-22 21:35:42.640 2020-07-22 22:13:26.597000+00:00

                                                                               text  \
         2020-07-29 11:02:31.276  Tesla Launches Hiring Spree in China as It Pre...
         2020-07-28 00:59:48.000    Tesla hiring in Shanghai as production ramps up
         2020-07-23 21:20:36.090     Tesla speeds up Model 3 production in Shanghai
         2020-07-23 08:22:17.000  UPDATE 1-'Please mine more nickel,' Musk urges...
         2020-07-23 07:08:48.000  'Please mine more nickel,' Musk urges as Tesla...
         2020-07-23 00:55:54.000  USA-Tesla choisit le Texas pour la production...
         2020-07-22 21:35:42.640  TESLA INC - THE REAL LIMITATION ON TESLA GROWT...

                                                                       storyId  \
         2020-07-29 11:02:31.276  urn:newsml:reuters.com:20200729:nCXG3W8s9X:1
         2020-07-28 00:59:48.000  urn:newsml:reuters.com:20200728:nL3N2EY3PG:8
         2020-07-23 21:20:36.090  urn:newsml:reuters.com:20200723:nNRAcf1v8f:1
         2020-07-23 08:22:17.000  urn:newsml:reuters.com:20200723:nL3N2EU1P9:1
         2020-07-23 07:08:48.000  urn:newsml:reuters.com:20200723:nL3N2EU0HH:1
         2020-07-23 00:55:54.000  urn:newsml:reuters.com:20200723:nL5N2EU03M:1
         2020-07-22 21:35:42.640  urn:newsml:reuters.com:20200722:nFWN2ET120:2

                                 sourceCode
         2020-07-29 11:02:31.276  NS:CAIXIN
         2020-07-28 00:59:48.000    NS:RTRS
         2020-07-23 21:20:36.090  NS:SOUTHC
         2020-07-23 08:22:17.000    NS:RTRS
         2020-07-23 07:08:48.000    NS:RTRS
         2020-07-23 00:55:54.000    NS:RTRS
         2020-07-22 21:35:42.640    NS:RTRS

In [31]: storyId = news['storyId'][1]

In [32]: from IPython.display import HTML

In [33]: HTML(ek.get_news_story(storyId)[:1148])
Out[33]: <IPython.core.display.HTML object>

Jan 06, 2020 Tesla, Inc.TSLA registered record production and deliveries of 104,891 and 112,000 vehicles, respectively, in the fourth quarter of 2019. Notably, the company's Model S/X and Model 3 reported record production and deliveries in the fourth quarter. The Model S/X division recorded production and delivery volume of 17,933 and 19,450 vehicles, respectively. The Model 3 division registered production of 86,958 vehicles, while 92,550 vehicles were delivered. In 2019, Tesla delivered 367,500 vehicles, reflecting an increase of 50%, year over year, and nearly in line with the company's full-year guidance of 360,000 vehicles.
Unstructured Streaming Data
In the same way that historical unstructured data is retrieved, programmatic APIs can be used to stream unstructured news data, for example, in real time or at least near time. One such API is available for DNA: the Data, News, Analytics platform from Dow Jones. Figure 4-1 shows the screenshot of a web application that streams “Commodity and Financial News” articles and processes these with NLP techniques in real time.
The news-streaming application has the following main features:
 Full text

The full text of each article is available by clicking on the article header.
 Keyword summary

A keyword summary is created and printed on the screen.
 Sentiment analysis

Sentiment scores are calculated and visualized as colored arrows. Details become visible through a click on the arrows.
 Word cloud

A word cloud summary bitmap is created, shown as a thumbnail and visible after a click on the thumbnail (see Figure 4-2).
Alternative Data
Nowadays, financial institutions, and in particular hedge funds, systematically mine a number of alternative data sources to gain an edge in trading and investing. A recent article by Bloomberg lists, among others, the following alternative data sources:

Webscraped data

Crowdsourced data

Credit cards and point-of-sale (POS) systems

Social media sentiment

Search trends

Web traffic

Supply chain data

Energy production data

Consumer profiles

Satellite imagery/geospatial data

App installs

Ocean vessel tracking

Wearables, drones, Internet of Things (IoT) sensors
In the following, the usage of alternative data is illustrated by two examples. The first retrieves and processes Apple Inc. press releases in the form of HTML pages. The following Python code makes use of a set of helper functions as shown in “Python Code”. In the code, a list of URLs is defined, each representing an HTML page with a press release from Apple Inc. The raw HTML code is then retrieved for each press release. Then the raw code is cleaned up, and an excerpt for one press release is printed:
In [34]: import nlp  # (1)
         import requests

In [35]: sources = [
             'https://nr.apple.com/dE0b1T5G3u',  # iPad Pro
             'https://nr.apple.com/dE4c7T6g1K',  # MacBook Air
             'https://nr.apple.com/dE4q4r8A2A',  # Mac Mini
         ]  # (2)

In [36]: html = [requests.get(url).text for url in sources]  # (3)

In [37]: data = [nlp.clean_up_text(t) for t in html]  # (4)

In [38]: data[0][536:1001]  # (5)
Out[38]: 'display, powerful a12x bionic chip and face id introducing the new ipad pro with all-screen design and next-generation performance. new york apple today introduced the new ipad pro with all-screen design and next-generation performance, marking the biggest change to ipad ever. the all-new design pushes 11-inch and 12.9-inch liquid retina displays to the edges of ipad pro and integrates face id to securely unlock ipad with just a glance.1 the a12x bionic chip w'

(1) Imports the NLP helper functions
(2) Defines the URLs for the three press releases
(3) Retrieves the raw HTML codes for the three press releases
(4) Cleans up the raw HTML codes (for example, HTML tags are removed)
(5) Prints an excerpt from one press release
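The helper function `nlp.clean_up_text()` is defined elsewhere in the book and is not reproduced here. To give an idea of what such a cleanup step involves, the following sketch implements a hypothetical stand-in, `clean_up_html()`, with the standard library only: it drops script and style blocks, strips the remaining tags, resolves HTML entities, collapses whitespace, and lowercases the text. It is an illustration of the general technique, not the book's implementation, and the sample HTML string is made up.

```python
import html
import re


def clean_up_html(raw):
    """Hypothetical stand-in for an HTML cleanup helper (not the book's code)."""
    # Remove script and style blocks including their content.
    text = re.sub(r'(?is)<(script|style).*?>.*?</\1>', ' ', raw)
    # Replace all remaining tags with a space.
    text = re.sub(r'(?s)<[^>]+>', ' ', text)
    # Resolve HTML entities such as &amp;.
    text = html.unescape(text)
    # Collapse runs of whitespace and normalize the case.
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()


sample = ('<html><head><style>p {color: red}</style></head>'
          '<body><h1>New iPad Pro</h1>'
          '<p>All-screen design &amp; Face ID.</p></body></html>')

cleaned = clean_up_html(sample)
print(cleaned)
```

Regular expressions are a pragmatic choice for a sketch like this; for robust processing of arbitrary web pages, a proper HTML parser is generally preferable.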
Of course, defining alternative data as broadly as is done in this section implies that there is a limitless amount of data that one can retrieve and process for financial purposes. At its core, this is the business of search engines such as the one from Google LLC. In a financial context, it would be of paramount importance to specify exactly what unstructured alternative data sources to tap into.
The second example is about the retrieval of data from the social network Twitter, Inc. To this end, Twitter provides API access to tweets on its platform, provided one has set up a Twitter account appropriately. The following Python code connects to the Twitter API and retrieves and prints the five most recent tweets from my home timeline and user timeline, respectively:
In [39]: from twitter import Twitter, OAuth

In [40]: t = Twitter(auth=OAuth(c['twitter']['access_token'],
                                c['twitter']['access_secret_token'],
                                c['twitter']['api_key'],
                                c['twitter']['api_secret_key']),
                     retry=True)  # (1)

In [41]: l = t.statuses.home_timeline(count=5)  # (2)

In [42]: for e in l:
             print(e['text'])  # (2)
         The Bank of England is effectively subsidizing polluting industries in its pandemic rescue program, a think tank sa… https://t.co/Fq5jl2CIcp
         Cool shared task: mining scientific contributions (by @SeeTedTalk @SoerenAuer and Jennifer D'Souza) https://t.co/dm56DMUrWm
         Twelve people were hospitalized in Wyoming on Monday after a hot air balloon crash, officials said. Three hot air… https://t.co/EaNBBRXVar
         President Trump directed controversial Pentagon pick into new role with similar duties after nomination failed https://t.co/ZyXpPcJkcQ
         Company announcement: Revolut launches Open Banking for its 400,000 Italian... https://t.co/OfvbgwbeJW #fintech

In [43]: l = t.statuses.user_timeline(screen_name='dyjh', count=5)  # (3)

In [44]: for e in l:
             print(e['text'])  # (3)
         #Python for #AlgoTrading (focus on the process) & #AI in #Finance (focus on prediction methods) will complement eac… https://t.co/P1s8fXCp42
         Currently putting finishing touches on #AI in #Finance (@OReillyMedia). Book going into production shortly. https://t.co/JsOSA3sfBL
         Chinatown Is Coming Back, One Noodle at a Time https://t.co/In5kXNeVc5
         Alt data industry balloons as hedge funds strive for Covid edge via @FT "We remain of the view that alternative d… https://t.co/9HtUOjoEdz
         @Wolf_Of_BTC Just follow me on ( or ). Then you will notice for sure when it is out.

(1) Connects to the Twitter API
(2) Retrieves and prints five (most recent) tweets from home timeline
(3) Retrieves and prints five (most recent) tweets from user timeline
The Twitter API also allows for searches, based on which the most recent tweets can be retrieved and processed:
In [45]: d = t.search.tweets(q='#Python', count=7)

In [46]: for e in d['statuses']:
             print(e['text'])
         RT @KirkDBorne: #AI is Reshaping Programming — Tips on How to Stay on Top: https://t.co/CFNu1i352C
         ——
         Courses:
         1: #MachineLearning — Jupyte…
         RT @reuvenmlerner: Today, a #Python student's code didn't print:

         x = 5
         if x == 5:
             print: ('yes!')

         There was a typo, namely : after pr…
         RT @GavLaaaaaaaa: Javascript Does Not Need a StringBuilder https://t.co/aS7NzHLO65 #programming #softwareengineering #bigdata
         #datascience…
         RT @CodeFlawCo: It is necessary to publish regular updates on #programmer #coder #developer #technology RT @pak_aims: Learning to C…
         RT @GavLaaaaaaaa: Javascript Does Not Need a StringBuilder https://t.co/aS7NzHLO65 #programming #softwareengineering #bigdata
         #datascience…
One can also collect a larger number of tweets from a Twitter user and create a summary in the form of a word cloud (see Figure 4-3). The following Python code again makes use of the NLP helper functions as shown in “Python Code”:
In [47]: l = t.statuses.user_timeline(screen_name='elonmusk', count=50)  # (1)

In [48]: tl = [e['text'] for e in l]  # (2)

In [49]: tl[:5]  # (3)
Out[49]: ['@flcnhvy @Lindw0rm @cleantechnica True',
          '@Lindw0rm @cleantechnica Highly likely down the road',
          '@cleantechnica True fact',
          '@NASASpaceflight Scrubbed for the day. A Raptor turbopump spin start valve didn’t open, triggering an automatic abo… https://t.co/QDdlNXFgJg',
          '@Erdayastronaut I’m in the Boca control room. Hop attempt in ~33 minutes.']

In [50]: wc = nlp.generate_word_cloud(' '.join(tl), 35,
                     name='../../images/ch04/musk_twitter_wc.png')  # (4)

(1) Retrieves the 50 most recent tweets for the user elonmusk
(2) Collects the texts in a list object
(3) Shows the first five tweets in the list (the most recent ones)
(4) Generates a word cloud summary and shows it
Once a financial practitioner defines the “relevant financial data” to go beyond structured financial time series data, the data sources seem limitless in terms of volume, variety, and velocity. The tweets are retrieved from the Twitter API in near real time, since the examples access the most recent tweets. These and similar API-based data sources therefore provide a never-ending stream of alternative data for which, as previously pointed out, it is important to specify exactly what one is looking for. Otherwise, any financial data science effort might easily drown in too much data, in data that is too noisy, or both.
Normative Theories Revisited
Chapter 3 introduces normative financial theories such as the MVP theory or the CAPM. For quite a long time, students and academics learning and studying such theories were more or less constrained to the theory itself. With all the available financial data, as discussed and illustrated in the previous section, in combination with powerful open source software for data analysis (such as Python, NumPy, and pandas), it has become easy and straightforward to put financial theories to real-world tests. Doing so no longer requires small teams and larger studies. A typical notebook, internet access, and a standard Python environment suffice. This is what this section is about. However, before diving into data-driven finance, the following subsection briefly discusses some famous paradoxes in the context of EUT and how corporations model and predict the behavior of individuals in practice.
Expected Utility and Reality
In economics, risk describes a situation in which possible future states and probabilities for those states to unfold are known in advance to the decision maker. This is the standard assumption in finance and the context of EUT. On the other hand, ambiguity describes situations in economics in which probabilities, or even possible future states, are not known in advance to a decision maker. Uncertainty subsumes the two different decision-making situations.
There is a long tradition of analyzing the concrete decision-making behavior of individuals (“agents”) under uncertainty. Innumerable studies and experiments have been conducted to observe and analyze how agents behave when faced with uncertainty as compared to what theories such as EUT predict. For centuries, paradoxa have played an important role in decision-making theory and research.
One such paradox, the St. Petersburg paradox, gave rise to the invention of utility functions and EUT in the first place. Daniel Bernoulli presented the paradox—and a solution to it—in 1738. The paradox is based on the following coin tossing game $G$. An agent is faced with a game during which a (perfect) coin is tossed potentially infinitely many times. If after the first toss heads prevails, the agent receives a payoff of 1 (currency unit). As long as heads is observed, the coin is tossed again. Otherwise the game ends. If heads prevails a second time, the agent receives an additional payoff of 2. If it does a third time, the additional payoff is 4. For the fourth time it is 8, and so on. This is a situation of risk since all possible future states, as well as their associated probabilities, are known in advance.
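The expected payoff of this game is infinite. This can be seen from the following infinite sum, in which the $k$-th consecutive heads occurs with probability $\frac{1}{2^k}$ and yields an additional payoff of $2^{k-1}$, so that every element is strictly positive:
$$\mathbf{E}(G)=\frac{1}{2}\cdot 1+\frac{1}{4}\cdot 2+\frac{1}{8}\cdot 4+\frac{1}{16}\cdot 8+\cdots =\sum_{k=1}^{\infty}\frac{1}{2^{k}}\,2^{k-1}=\sum_{k=1}^{\infty}\frac{1}{2}=\infty$$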
However, faced with such a game, a decision maker in general would be willing to pay a finite sum only to play the game. A major reason for this is the fact that relatively large payoffs only happen with a relatively small probability. Consider the potential payoff $W=511$:
$$W=\sum_{k=1}^{9}2^{k-1}=511$$
The probability of winning such a payoff is pretty low. To be exact, it is only $P(x=W)=\frac{1}{512}=0.001953125$. The probability for such a payoff or a smaller one, on the other hand, is pretty high:
$$P(x\le W)=\sum_{k=1}^{9}\frac{1}{2^{k}}=0.998046875$$
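In other words, in 998 out of 1,000 games the payoff is 511 or smaller. Therefore, an agent would probably not wager much more than 511 to play this game. The way out of this paradox is the introduction of a utility function with positive but decreasing marginal utility. In the context of the St. Petersburg paradox, this means that there is a function $u:{\mathbb{R}}_{+}\to \mathbb{R}$ that assigns to every positive payoff $x$ a real value $u\left(x\right)$. Positive but decreasing marginal utility then formally translates into the following:
$$\frac{\partial u}{\partial x}>0,\qquad \frac{\partial^{2}u}{\partial x^{2}}<0$$
As seen in Chapter 3, one such candidate function is $u\left(x\right)=\ln\left(x\right)$ with:
$$\frac{\partial u}{\partial x}=\frac{1}{x},\qquad \frac{\partial^{2}u}{\partial x^{2}}=-\frac{1}{x^{2}}$$
The expected utility then is finite, as the calculation of the following infinite sum illustrates:
$$\mathbf{E}\left(u(G)\right)=\sum_{k=1}^{\infty}\frac{1}{2^{k}}\,u\left(2^{k-1}\right)=\sum_{k=1}^{\infty}\frac{\ln\left(2^{k-1}\right)}{2^{k}}=\ln\left(2\right)\sum_{k=1}^{\infty}\frac{k-1}{2^{k}}=\ln\left(2\right)$$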
The expected utility of $ln\left(2\right)$ = 0.693147 is obviously a pretty small number in comparison to the expected payoff of infinity. Bernoulli utility functions and EUT resolve the St. Petersburg paradox.
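The two sums behind the St. Petersburg paradox can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names `expected_payoff` and `expected_log_utility` are hypothetical, and the game is truncated after a finite number of tosses so that the sums can be evaluated.

```python
import math


def expected_payoff(n_tosses):
    """Expected payoff if the game is truncated after n_tosses tosses."""
    # k-th consecutive heads: probability 1/2**k, additional payoff 2**(k-1)
    return sum(0.5 ** k * 2 ** (k - 1) for k in range(1, n_tosses + 1))


def expected_log_utility(n_tosses):
    """Truncated version of the expected utility sum with u(x) = ln(x)."""
    return sum(0.5 ** k * math.log(2 ** (k - 1))
               for k in range(1, n_tosses + 1))


for n in (10, 100, 1000):
    # the payoff grows without bound, the utility sum levels off
    print(n, expected_payoff(n), round(expected_log_utility(n), 6))
```

Every element of the payoff sum contributes exactly 0.5, so the truncated expected payoff grows linearly with the number of tosses allowed, while the truncated utility sum approaches $\ln(2)$.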
Other paradoxa, such as the Allais paradox published in Allais (1953), address the EUT itself. This paradox is based on an experiment with four different games that test subjects should rank. Table 4-2 shows the four games $(A,B,A',B')$. The ranking is to be done for the two pairs $(A,B)$ and $(A',B')$. The independence axiom postulates that the first row in the table should not have any influence on the ordering of $(A',B')$ since the payoff is the same for both games.
Probability | Game A | Game B | Game A’ | Game B’
0.66        | 2,400  | 2,400  | 0       | 0
0.33        | 2,500  | 2,400  | 2,500   | 2,400
0.01        | 0      | 2,400  | 0       | 2,400
In experiments, the majority of decision makers rank the games as follows: $B\succ A$ and $A'\succ B'$. The ranking $B\succ A$ leads to the following inequalities, where ${u}_{1}\equiv u\left(2400\right),{u}_{2}\equiv u\left(2500\right),{u}_{3}\equiv u\left(0\right)$:
$$u_{1}>0.66\,u_{1}+0.33\,u_{2}+0.01\,u_{3}\quad \Rightarrow \quad 0.34\,u_{1}>0.33\,u_{2}+0.01\,u_{3}$$
The ranking $A'\succ B'$ in turn leads to the following inequalities:
$$0.33\,u_{2}+0.67\,u_{3}>0.34\,u_{1}+0.66\,u_{3}\quad \Rightarrow \quad 0.34\,u_{1}<0.33\,u_{2}+0.01\,u_{3}$$
These inequalities obviously contradict each other and lead to the Allais paradox. One possible explanation is that decision makers in general value certainty higher than the typical models, such as EUT, predict. Most people would probably rather choose to receive $1 million with certainty than play a game in which they can win $100 million with a probability of 5%, although there are a number of suitable utility functions available that under EUT would have the decision maker choose the game instead of the certain amount.
Another explanation lies in framing decisions and the psychology of decision makers. It is well known that more people would accept a surgery if it has a “95% chance of success” than a “5% chance of death.” Simply changing the wording might lead to behavior that is inconsistent with decisionmaking theories such as EUT.
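The contradiction behind the Allais paradox can also be checked numerically. The sketch below is an illustration only: the helper names and the three candidate utility functions are assumptions chosen for the example, while the game definitions are taken from Table 4-2. For any utility function, the expected utility gap between $B$ and $A$ equals the gap between $B'$ and $A'$, so EUT cannot produce $B\succ A$ together with $A'\succ B'$.

```python
import math

# (probability, payoff) pairs for the four games from Table 4-2
GAMES = {
    "A": [(0.66, 2400), (0.33, 2500), (0.01, 0)],
    "B": [(0.66, 2400), (0.33, 2400), (0.01, 2400)],
    "A'": [(0.66, 0), (0.33, 2500), (0.01, 0)],
    "B'": [(0.66, 0), (0.33, 2400), (0.01, 2400)],
}


def expected_utility(game, u):
    """Expected utility of a game under utility function u."""
    return sum(p * u(x) for p, x in GAMES[game])


def preference_gaps(u):
    """Returns EU(B) - EU(A) and EU(B') - EU(A')."""
    gap = expected_utility("B", u) - expected_utility("A", u)
    gap_prime = expected_utility("B'", u) - expected_utility("A'", u)
    return gap, gap_prime


# hypothetical candidate utility functions (assumptions for illustration)
candidates = {
    "linear": lambda x: x,
    "sqrt": math.sqrt,
    "log(1 + x)": lambda x: math.log(1 + x),
}

for name, u in candidates.items():
    gap, gap_prime = preference_gaps(u)
    # both gaps coincide, so B > A always implies B' > A' under EUT
    print(f"{name:>10} | gap: {gap:9.4f} | gap': {gap_prime:9.4f}")
```

Because the two gaps are identical for every utility function, the rankings observed in experiments cannot both be rationalized by maximizing expected utility.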
Another famous paradox addressing shortcomings of EUT in its subjective form, according to Savage (1954, 1972), is the Ellsberg paradox, which dates back to the seminal paper by Ellsberg (1961). It addresses the importance of ambiguity in many real-world decision situations. A standard setting for this paradox comprises two different urns, both of which contain exactly 100 balls. For urn 1, it is known that it contains exactly 50 black and 50 red balls. For urn 2, it is only known that it contains black and red balls but not in which proportion.
Test subjects can choose among the following game options:

Game 1: red 1, black 1, or indifferent

Game 2: red 2, black 2, or indifferent

Game 3: red 1, red 2, or indifferent

Game 4: black 1, black 2, or indifferent
Here, “red 1,” for example, means that a red ball is drawn from urn 1. Typically, a test subject would answer as follows:

Game 1: indifferent

Game 2: indifferent

Game 3: red 1

Game 4: black 1
This set of decisions—which is not the only one to be observed but is a common one—exemplifies what is called ambiguity aversion. Since the probabilities for black and red balls, respectively, are not known for urn 2, decision makers prefer a situation of risk instead of ambiguity.
The two paradoxa of Allais and Ellsberg show that real test subjects quite often behave contrary to what well-established decision theories in economics predict. In other words, human beings as decision makers can in general not be compared to machines that carefully collect data and then crunch the numbers to make a decision under uncertainty, be it in the form of risk or ambiguity. Human behavior is more complex than most, if not all, theories currently suggest. How difficult and complex it can be to explain human behavior is clear after reading, for example, the 800-page book Behave by Sapolsky (2018). It covers multiple facets of this topic, ranging from biochemical processes to genetics, human evolution, tribes, language, religion, and more, in an integrative manner.
If standard economic decision paradigms such as EUT do not explain real-world decision making too well, what alternatives are available? Economic experiments that build the basis for the Allais and Ellsberg paradoxa are a good starting point in learning how decision makers behave in specific, controlled situations. Such experiments and their sometimes surprising and paradoxical results have indeed motivated a great number of researchers to come up with alternative theories and models that resolve the paradoxa. The book The Experiment in the History of Economics by Fontaine and Leonard (2005) is about the historical role of experiments in economics. There is, for example, a whole string of literature that addresses issues arising from the Ellsberg paradox. This literature deals with, among other topics, nonadditive probabilities, Choquet integrals, and decision heuristics such as maximizing the minimum payoff (“maxmin”) or minimizing the maximum loss (“minmax”). These alternative approaches have proven superior to EUT, at least in certain decision-making scenarios. But they are far from being mainstream in finance.
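To illustrate how a heuristic such as “maxmin” can reproduce the ambiguity aversion observed in the Ellsberg setting, consider the following minimal sketch. It is an assumption-laden illustration rather than a model from the literature cited above: the function names are hypothetical, the prize is normalized to 1, and urn 2 is assumed to contain between 1 and 99 balls of each color.

```python
def win_probabilities(urn, color):
    """Candidate probabilities of drawing the given color from the urn."""
    if urn == 1:
        return [0.5]  # known composition: 50 red and 50 black balls
    # urn 2: composition unknown; assumed 1 to 99 balls of the given color
    return [k / 100 for k in range(1, 100)]


def maxmin_value(urn, color, prize=1.0):
    """Worst-case expected payoff of betting on color in the given urn."""
    return min(p * prize for p in win_probabilities(urn, color))


for color in ("red", "black"):
    v1 = maxmin_value(1, color)
    v2 = maxmin_value(2, color)
    choice = f"{color} 1" if v1 > v2 else f"{color} 2"
    print(f"{color:>5} | urn 1: {v1:.2f} | urn 2: {v2:.2f} | choice: {choice}")
```

Under this rule, the worst case for urn 2 is worse than the known 50% chance for urn 1 for both colors, which is in line with the typical answers “red 1” and “black 1” in games 3 and 4.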
What, after all, has proven to be useful in practice? Not too surprisingly, the answer lies in data and machine learning algorithms. The internet, with its billions of users, generates a treasure trove of data describing real-world human behavior, or what is sometimes called revealed preferences. The big data generated on the web has a scale that is multiple orders of magnitude larger than what single experiments can generate. Companies such as Amazon, Facebook, Google, and Twitter are able to make billions of dollars by recording user behavior (that is, their revealed preferences) and capitalizing on the insights generated by ML algorithms trained on this data.
The default ML approach taken in this context is supervised learning. The algorithms themselves are in general theory- and model-free; variants of neural networks are often applied. Therefore, when companies today predict the behavior of their users or customers, more often than not a model-free ML algorithm is deployed. Traditional decision theories like EUT or one of its successors generally do not play a role at all. This makes it somewhat surprising that such theories still, at the beginning of the 2020s, are a cornerstone of most economic and financial theories applied in practice. And this is not even to mention the large number of financial textbooks that cover traditional decision theories in detail. If one of the most fundamental building blocks of financial theory seems to lack meaningful empirical support or practical benefits, what about the financial models that build on top of it? More on this appears in subsequent sections and chapters.
Data-Driven Predictions of Behavior
Standard economic decision theories are intellectually appealing to many, even to those who, faced with a concrete decision under uncertainty, would behave in contrast to the theories’ predictions. On the other hand, big data and model-free, supervised learning approaches prove useful and successful in practice for predicting user and customer behavior. In a financial context, this might imply that one should not really worry about why and how financial agents decide the way they decide. One should rather focus on their indirectly revealed preferences based on features data (new information) that describes the state of a financial market and labels data (outcomes) that reflects the impact of the decisions made by financial agents. This leads to a data-driven instead of a theory- or model-driven view of decision making in financial markets. Financial agents become data-processing organisms that can be much better modeled, for example, by complex neural networks than, say, a simple utility function in combination with an assumed probability distribution.
Mean-Variance Portfolio Theory
Assume a data-driven investor wants to apply MVP theory to invest in a portfolio of technology stocks and wants to add a gold-related exchange-traded fund (ETF) for diversification. Probably, the investor would access relevant historical price data via an API to a trading platform or a data provider. To make the following analysis reproducible, it relies on a CSV data file stored in a remote location. The following Python code retrieves the data file, selects a number of symbols given the investor’s goal, and calculates log returns from the price time series data. Figure 4-4 compares the normalized price time series for the selected symbols:
In [51]: import numpy as np
         import pandas as pd
         from pylab import plt, mpl
         from scipy.optimize import minimize
         # on Matplotlib 3.6 and later, this style is named 'seaborn-v0_8'
         plt.style.use('seaborn')
         mpl.rcParams['savefig.dpi'] = 300
         mpl.rcParams['font.family'] = 'serif'
         np.set_printoptions(precision=5, suppress=True,
                            formatter={'float': lambda x: f'{x:6.3f}'})

In [52]: url = 'http://hilpisch.com/aiif_eikon_eod_data.csv'

In [53]: raw = pd.read_csv(url, index_col=0, parse_dates=True).dropna()

In [54]: raw.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2516 entries, 2010-01-04 to 2019-12-31
         Data columns (total 12 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   AAPL.O  2516 non-null   float64
          1   MSFT.O  2516 non-null   float64
          2   INTC.O  2516 non-null   float64
          3   AMZN.O  2516 non-null   float64
          4   GS.N    2516 non-null   float64
          5   SPY     2516 non-null   float64
          6   .SPX    2516 non-null   float64
          7   .VIX    2516 non-null   float64
          8   EUR=    2516 non-null   float64
          9   XAU=    2516 non-null   float64
          10  GDX     2516 non-null   float64
          11  GLD     2516 non-null   float64
         dtypes: float64(12)
         memory usage: 255.5 KB

In [55]: symbols = ['AAPL.O', 'MSFT.O', 'INTC.O', 'AMZN.O', 'GLD']

In [56]: rets = np.log(raw[symbols] / raw[symbols].shift(1)).dropna()

In [57]: (raw[symbols] / raw[symbols].iloc[0]).plot(figsize=(10, 6));

Retrieves historical EOD data from a remote location
Specifies the symbols (RICs) to be invested in
Calculates the log returns for all time series
Plots the normalized financial time series for the selected symbols
The data-driven investor wants to first set a baseline for performance as given by an equally weighted portfolio over the whole period of the available data. To this end, the following Python code defines functions to calculate the portfolio return, the portfolio volatility, and the portfolio Sharpe ratio given a set of weights for the selected symbols:
In [58]: weights = len(rets.columns) * [1 / len(rets.columns)]

In [59]: def port_return(rets, weights):
             return np.dot(rets.mean(), weights) * 252

In [60]: port_return(rets, weights)
Out[60]: 0.15694764653018106

In [61]: def port_volatility(rets, weights):
             return np.dot(weights, np.dot(rets.cov() * 252, weights)) ** 0.5

In [62]: port_volatility(rets, weights)
Out[62]: 0.16106507848480675

In [63]: def port_sharpe(rets, weights):
             return port_return(rets, weights) / port_volatility(rets, weights)

In [64]: port_sharpe(rets, weights)
Out[64]: 0.97443622172255
Equally weighted portfolio
Portfolio return
Portfolio volatility
Portfolio Sharpe ratio (with zero short rate)
The investor also wants to analyze which combinations of portfolio risk and return (and consequently Sharpe ratio) are roughly possible by applying Monte Carlo simulation to randomize the portfolio weights. Short sales are excluded, and the portfolio weights are assumed to add up to 100%. The following Python code implements the simulation and visualizes the results (see Figure 4-5):
In [65]: w = np.random.random((1000, len(symbols)))
         w = (w.T / w.sum(axis=1)).T

In [66]: w[:5]
Out[66]: array([[ 0.184,  0.157,  0.227,  0.353,  0.079],
                [ 0.207,  0.282,  0.258,  0.023,  0.230],
                [ 0.313,  0.284,  0.051,  0.340,  0.012],
                [ 0.238,  0.181,  0.145,  0.191,  0.245],
                [ 0.246,  0.256,  0.315,  0.181,  0.002]])

In [67]: pvr = [(port_volatility(rets[symbols], weights),
                 port_return(rets[symbols], weights))
                for weights in w]
         pvr = np.array(pvr)

In [68]: psr = pvr[:, 1] / pvr[:, 0]

In [69]: plt.figure(figsize=(10, 6))
         fig = plt.scatter(pvr[:, 0], pvr[:, 1],
                           c=psr, cmap='coolwarm')
         cb = plt.colorbar(fig)
         cb.set_label('Sharpe ratio')
         plt.xlabel('expected volatility')
         plt.ylabel('expected return')
         plt.title(' | '.join(symbols));
Simulates portfolio weights adding up to 100%
Derives the resulting portfolio volatilities and returns
Calculates the resulting Sharpe ratios
The data-driven investor now wants to backtest the performance of a portfolio that was set up at the beginning of 2011. The optimal portfolio composition was derived from the financial time series data available from 2010. At the beginning of 2012, the portfolio composition was adjusted given the available data from 2011, and so on. To this end, the following Python code derives, for every relevant year, the portfolio weights that maximize the Sharpe ratio:
In [70]: bnds = len(symbols) * [(0, 1),]
         bnds
Out[70]: [(0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

In [71]: cons = {'type': 'eq', 'fun': lambda weights: weights.sum() - 1}

In [72]: opt_weights = {}
         for year in range(2010, 2019):
             rets_ = rets[symbols].loc[f'{year}-01-01':f'{year}-12-31']
             ow = minimize(lambda weights: -port_sharpe(rets_, weights),
                           len(symbols) * [1 / len(symbols)],
                           bounds=bnds,
                           constraints=cons)['x']
             opt_weights[year] = ow

In [73]: opt_weights
Out[73]: {2010: array([ 0.366,  0.000,  0.000,  0.056,  0.578]),
          2011: array([ 0.543,  0.000,  0.077,  0.000,  0.380]),
          2012: array([ 0.324,  0.000,  0.000,  0.471,  0.205]),
          2013: array([ 0.012,  0.305,  0.219,  0.464,  0.000]),
          2014: array([ 0.452,  0.115,  0.419,  0.000,  0.015]),
          2015: array([ 0.000,  0.000,  0.000,  1.000,  0.000]),
          2016: array([ 0.150,  0.260,  0.000,  0.058,  0.533]),
          2017: array([ 0.231,  0.203,  0.031,  0.109,  0.426]),
          2018: array([ 0.000,  0.295,  0.000,  0.705,  0.000])}
Specifies the bounds for the single asset weights
Specifies that all weights need to add up to 100%
Selects the relevant data set for the given year
Derives the portfolio weights that maximize the Sharpe ratio
The optimal portfolio compositions as derived for the relevant years illustrate that MVP theory in its original form quite often leads to (relatively) extreme situations in the sense that one or more assets are not included at all or that even a single asset makes up 100% of the portfolio. Of course, this can be actively avoided by setting, for example, a minimum weight for every asset considered. The results also indicate that this approach leads to significant rebalancings in the portfolio, driven by the previous year’s realized statistics and correlations.
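A minimum weight per asset can be imposed through the bounds passed to the optimizer. The following is a minimal sketch under stated assumptions: it uses synthetic, randomly generated returns instead of the data set from this chapter, NumPy arrays instead of `DataFrame` objects, and a hypothetical minimum weight of 10% (`min_weight`). The portfolio functions mirror the ones defined before.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(100)
# synthetic daily log returns for five assets (assumption, not real data)
rets = rng.normal(loc=[0.0008, 0.0006, 0.0004, 0.0010, 0.0002],
                  scale=[0.015, 0.012, 0.014, 0.020, 0.008],
                  size=(252, 5))


def port_return(rets, weights):
    return np.dot(rets.mean(axis=0), weights) * 252


def port_volatility(rets, weights):
    cov = np.cov(rets, rowvar=False) * 252  # annualized covariance matrix
    return np.dot(weights, np.dot(cov, weights)) ** 0.5


def port_sharpe(rets, weights):
    return port_return(rets, weights) / port_volatility(rets, weights)


n = rets.shape[1]
min_weight = 0.1  # hypothetical minimum weight per asset
bnds = n * [(min_weight, 1)]  # lower bound replaces the 0 used before
cons = {'type': 'eq', 'fun': lambda weights: weights.sum() - 1}

opt = minimize(lambda weights: -port_sharpe(rets, weights),
               n * [1 / n], bounds=bnds, constraints=cons)
weights_min = opt['x']
print(weights_min.round(3))
```

With such bounds, no asset can drop out of the portfolio, at the price of a Sharpe ratio that can be no higher than the one from the unconstrained (long-only) optimization.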
To complete the backtest, the following code compares the expected portfolio statistics (from the optimal composition of the previous year applied to the previous year’s data) with the realized portfolio statistics for the current year (from the optimal composition from the previous year applied to the current year’s data):
In [74]: res = pd.DataFrame()
         # DataFrame.append requires pandas < 2.0; pd.concat replaces it later
         for year in range(2010, 2019):
             rets_ = rets[symbols].loc[f'{year}-01-01':f'{year}-12-31']
             epv = port_volatility(rets_, opt_weights[year])
             epr = port_return(rets_, opt_weights[year])
             esr = epr / epv
             rets_ = rets[symbols].loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
             rpv = port_volatility(rets_, opt_weights[year])
             rpr = port_return(rets_, opt_weights[year])
             rsr = rpr / rpv
             res = res.append(pd.DataFrame({'epv': epv, 'epr': epr, 'esr': esr,
                                            'rpv': rpv, 'rpr': rpr, 'rsr': rsr},
                                           index=[year + 1]))

In [75]: res
Out[75]:            epv       epr       esr       rpv       rpr       rsr
         2011  0.157440  0.303003  1.924564  0.160622  0.133836  0.833235
         2012  0.173279  0.169321  0.977156  0.182292  0.161375  0.885256
         2013  0.202460  0.278459  1.375378  0.168714  0.166897  0.989228
         2014  0.181544  0.368961  2.032353  0.197798  0.026830  0.135645
         2015  0.160340  0.309486  1.930190  0.211368 -0.024560 -0.116194
         2016  0.326730  0.778330  2.382179  0.296565  0.103870  0.350242
         2017  0.106148  0.090933  0.856663  0.079521  0.230630  2.900235
         2018  0.086548  0.260702  3.012226  0.157337  0.038234  0.243004
         2019  0.323796  0.228008  0.704174  0.207672  0.275819  1.328147

In [76]: res.mean()
Out[76]: epv    0.190920
         epr    0.309689
         esr    1.688320
         rpv    0.184654
         rpr    0.123659
         rsr    0.838755
         dtype: float64
Figure 4-6 compares the expected and realized portfolio volatilities for the single years. MVP theory does quite a good job in predicting the portfolio volatility. This is also supported by a relatively high correlation between the two time series:
In [77]: res[['epv', 'rpv']].corr()
Out[77]:           epv       rpv
         epv  1.000000  0.765733
         rpv  0.765733  1.000000

In [78]: res[['epv', 'rpv']].plot(kind='bar', figsize=(10, 6),
                 title='Expected vs. Realized Portfolio Volatility');
However, the conclusions are the opposite when comparing the expected with the realized portfolio returns (see Figure 4-7). MVP theory obviously fails in predicting the portfolio returns, as is confirmed by the negative correlation between the two time series:
In [79]: res[['epr', 'rpr']].corr()
Out[79]:           epr       rpr
         epr  1.000000 -0.350437
         rpr -0.350437  1.000000

In [80]: res[['epr', 'rpr']].plot(kind='bar', figsize=(10, 6),
                 title='Expected vs. Realized Portfolio Return');
Similar, or even worse, conclusions need to be drawn with regard to the Sharpe ratio (see Figure 4-8). For the data-driven investor who aims at maximizing the Sharpe ratio of the portfolio, the theory’s predictions are generally significantly off from the realized values. The correlation between the two time series is even lower than for the returns:
In [81]: res[['esr', 'rsr']].corr()
Out[81]:           esr       rsr
         esr  1.000000 -0.698607
         rsr -0.698607  1.000000

In [82]: res[['esr', 'rsr']].plot(kind='bar', figsize=(10, 6),
                 title='Expected vs. Realized Sharpe Ratio');
Predictive Power of MVP Theory
MVP theory applied to real-world data reveals its practical shortcomings. Without additional constraints, optimal portfolio compositions and rebalancings can be extreme. The predictive power with regard to portfolio return and Sharpe ratio is pretty bad in the numerical example, whereas the predictive power with regard to portfolio risk seems acceptable. However, investors generally are interested in risk-adjusted performance measures, such as the Sharpe ratio, and this is the statistic for which MVP theory fails worst in the example.
Capital Asset Pricing Model
A similar approach can be applied to put the CAPM to a real-world test. Assume that the data-driven technology investor from before wants to apply the CAPM to derive expected returns for the four technology stocks. The following Python code first derives the beta for every stock for a given year, and then calculates the expected return for the stock in the next year, given its beta and the performance of the market portfolio. The market portfolio is approximated by the S&P 500 stock index:
In [83]: r = 0.005

In [84]: market = '.SPX'

In [85]: rets = np.log(raw / raw.shift(1)).dropna()

In [86]: res = pd.DataFrame()

In [87]: for sym in rets.columns[:4]:
             print('\n' + sym)
             print(54 * '=')
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 muM = rets_[market].mean() * 252
                 cov = rets_.cov().loc[sym, market]
                 var = rets_[market].var()
                 beta = cov / var
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 muM = rets_[market].mean() * 252
                 mu_capm = r + beta * (muM - r)
                 mu_real = rets_[sym].mean() * 252
                 # DataFrame.append requires pandas < 2.0
                 res = res.append(pd.DataFrame({'symbol': sym,
                                                'mu_capm': mu_capm,
                                                'mu_real': mu_real},
                                               index=[year + 1]),
                                  sort=True)
                 print('{} | beta: {:.3f} | mu_capm: {:6.3f} | mu_real: {:6.3f}'
                       .format(year + 1, beta, mu_capm, mu_real))
Specifies the riskless short rate
Defines the market portfolio
Derives the beta of the stock
Calculates the expected return given previous year’s beta and current year market portfolio performance
Calculates the realized performance of the stock for the current year
Collects and prints all results
The preceding code provides the following output:
AAPL.O
======================================================
2011 | beta: 1.052 | mu_capm: -0.000 | mu_real:  0.228
2012 | beta: 0.764 | mu_capm:  0.098 | mu_real:  0.275
2013 | beta: 1.266 | mu_capm:  0.327 | mu_real:  0.053
2014 | beta: 0.630 | mu_capm:  0.070 | mu_real:  0.320
2015 | beta: 0.833 | mu_capm: -0.005 | mu_real: -0.047
2016 | beta: 1.144 | mu_capm:  0.103 | mu_real:  0.096
2017 | beta: 1.009 | mu_capm:  0.180 | mu_real:  0.381
2018 | beta: 1.379 | mu_capm: -0.091 | mu_real: -0.071
2019 | beta: 1.252 | mu_capm:  0.316 | mu_real:  0.621

MSFT.O
======================================================
2011 | beta: 0.890 | mu_capm:  0.001 | mu_real: -0.072
2012 | beta: 0.816 | mu_capm:  0.104 | mu_real:  0.029
2013 | beta: 1.109 | mu_capm:  0.287 | mu_real:  0.337
2014 | beta: 0.876 | mu_capm:  0.095 | mu_real:  0.216
2015 | beta: 0.955 | mu_capm: -0.007 | mu_real:  0.178
2016 | beta: 1.249 | mu_capm:  0.113 | mu_real:  0.113
2017 | beta: 1.224 | mu_capm:  0.217 | mu_real:  0.321
2018 | beta: 1.303 | mu_capm: -0.086 | mu_real:  0.172
2019 | beta: 1.442 | mu_capm:  0.364 | mu_real:  0.440

INTC.O
======================================================
2011 | beta: 1.081 | mu_capm: -0.000 | mu_real:  0.142
2012 | beta: 0.842 | mu_capm:  0.108 | mu_real: -0.163
2013 | beta: 1.081 | mu_capm:  0.280 | mu_real:  0.230
2014 | beta: 0.883 | mu_capm:  0.096 | mu_real:  0.335
2015 | beta: 1.055 | mu_capm: -0.008 | mu_real: -0.052
2016 | beta: 1.009 | mu_capm:  0.092 | mu_real:  0.051
2017 | beta: 1.261 | mu_capm:  0.223 | mu_real:  0.242
2018 | beta: 1.163 | mu_capm: -0.076 | mu_real:  0.017
2019 | beta: 1.376 | mu_capm:  0.347 | mu_real:  0.243

AMZN.O
======================================================
2011 | beta: 1.102 | mu_capm: -0.001 | mu_real: -0.039
2012 | beta: 0.958 | mu_capm:  0.122 | mu_real:  0.374
2013 | beta: 1.116 | mu_capm:  0.289 | mu_real:  0.464
2014 | beta: 1.262 | mu_capm:  0.135 | mu_real: -0.251
2015 | beta: 1.473 | mu_capm: -0.013 | mu_real:  0.778
2016 | beta: 1.122 | mu_capm:  0.102 | mu_real:  0.104
2017 | beta: 1.118 | mu_capm:  0.199 | mu_real:  0.446
2018 | beta: 1.300 | mu_capm: -0.086 | mu_real:  0.251
2019 | beta: 1.619 | mu_capm:  0.408 | mu_real:  0.207
Figure 4-9 compares the predicted (expected) return for a single stock, given the beta from the previous year and market portfolio performance of the current year, with the realized return of the stock for the current year. Obviously, the CAPM in its original form does not prove really useful in predicting a stock’s performance based on beta only:
In [88]: sym = 'AMZN.O'

In [89]: res[res['symbol'] == sym].corr()
Out[89]:           mu_capm   mu_real
         mu_capm  1.000000 -0.004826
         mu_real -0.004826  1.000000

In [90]: res[res['symbol'] == sym].plot(kind='bar',
                         figsize=(10, 6), title=sym);
Figure 4-10 compares the averages of the CAPM-predicted stock returns with the averages of the realized returns. Also here, the CAPM does not do a good job.
What is easy to see is that the CAPM predictions do not vary that much on average for the stocks analyzed; they are between 11.1% and 12.8%. However, the realized average returns of the stocks show a high variability; these are between 11.6% and 25.9%. Market portfolio performance and beta alone obviously cannot account for the observed returns of the (technology) stocks:
In [91]: grouped = res.groupby('symbol').mean()
         grouped
Out[91]:          mu_capm   mu_real
         symbol
         AAPL.O  0.110855  0.206158
         AMZN.O  0.128223  0.259395
         INTC.O  0.117929  0.116180
         MSFT.O  0.120844  0.192655

In [92]: grouped.plot(kind='bar', figsize=(10, 6), title='Average Values');
Predictive Power of the CAPM
The predictive power of the CAPM with regard to the future performance of stocks, relative to the market portfolio, is pretty low or even nonexistent for certain stocks. One of the reasons is probably the fact that the CAPM rests on the same central assumptions as MVP theory, namely that investors care about only the (expected) return and (expected) volatility of a portfolio and/or stock. From a modeling point of view, one can ask whether the single risk factor is enough to explain variability in stock returns or whether there might be a nonlinear relationship between a stock’s return and the market portfolio performance.
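The remark about the single risk factor and the linear relationship can be made concrete: the CAPM beta, computed as covariance over variance, is the slope of a linear regression of the stock’s returns on the market’s returns. The following minimal sketch shows this on synthetic data; the return series, the “true” beta of 1.2, and the random seed are assumptions for illustration and are not derived from the data set used before.

```python
import numpy as np

rng = np.random.default_rng(7)
# synthetic daily log returns (assumptions, not real market data)
market = rng.normal(0.0004, 0.01, size=252)
stock = 1.2 * market + rng.normal(0.0, 0.008, size=252)

# beta as used in the CAPM: covariance with the market over market variance
beta = np.cov(stock, market)[0, 1] / np.var(market, ddof=1)

# slope of the OLS regression line stock = intercept + slope * market
slope, intercept = np.polyfit(market, stock, deg=1)

r = 0.005  # riskless short rate as in the example before
mu_M = market.mean() * 252  # annualized market return
mu_capm = r + beta * (mu_M - r)  # CAPM-predicted return

print(f'beta: {beta:.4f} | slope: {slope:.4f} | mu_capm: {mu_capm:.4f}')
```

Both estimates coincide up to floating-point precision. Anything that the straight line through the market returns cannot capture, such as additional factors or a nonlinear relationship, ends up in the residuals.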
Arbitrage Pricing Theory
The predictive power of the CAPM seems quite limited given the results from the previous numerical example. A valid question is whether the market portfolio performance alone is enough to explain variability in stock returns. The answer of the APT is no: there can be more (even many more) factors that together explain variability in stock returns. “Arbitrage Pricing Theory” formally describes the framework of APT that also relies on a linear relationship between the factors and a stock’s return.
The data-driven investor recognizes that the CAPM is not sufficient to reliably predict a stock’s performance relative to the market portfolio performance. Therefore, the investor decides to add to the market portfolio three additional factors that might drive a stock’s performance:

Market volatility (as represented by the VIX index, .VIX)

Exchange rates (as represented by the EUR/USD rate, EUR=)

Commodity prices (as represented by the gold price, XAU=)
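Before turning to the real data, the mechanics of such a multi-factor regression can be illustrated in isolation. The sketch below is a hedged example on synthetic data: the factor returns, the “true” factor loadings, and the noise level are assumptions chosen for illustration. It estimates the loadings by ordinary least squares without an intercept and derives an annualized factor-implied return from them.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_factors = 252, 4

# synthetic daily factor returns, stand-ins for market, volatility,
# exchange rate, and commodity price factors (assumptions)
factors = rng.normal(0.0, 0.01, size=(n_days, n_factors))
true_loadings = np.array([1.1, -0.05, -0.2, 0.1])  # hypothetical values
noise = rng.normal(0.0, 0.005, size=n_days)
stock = factors @ true_loadings + noise  # linear factor structure

# OLS regression without intercept: stock ~ factors @ loadings
loadings, *_ = np.linalg.lstsq(factors, stock, rcond=None)

# annualized return implied by the factor means and estimated loadings
mu_apt = np.dot(factors.mean(axis=0) * 252, loadings)
mu_real = stock.mean() * 252

print('loadings:', loadings.round(2))
print(f'mu_apt: {mu_apt:6.3f} | mu_real: {mu_real:6.3f}')
```

In this sketch the loadings and the implied return are derived from the same sample. A backtest would instead estimate the loadings on one period and apply them to the factor performance of the next one.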
The following Python code implements a simple APT approach by using the four factors in combination with multivariate regression to explain a stock’s future performance in relation to the factors:
In [93]: factors = ['.SPX', '.VIX', 'EUR=', 'XAU=']

In [94]: res = pd.DataFrame()

In [95]: np.set_printoptions(formatter={'float': lambda x: f'{x:5.2f}'})

In [96]: for sym in rets.columns[:4]:
             print('\n' + sym)
             print(71 * '=')
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 reg = np.linalg.lstsq(rets_[factors],
                                       rets_[sym], rcond=-1)[0]
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 mu_apt = np.dot(rets_[factors].mean() * 252, reg)
                 mu_real = rets_[sym].mean() * 252
                 # DataFrame.append requires pandas < 2.0
                 res = res.append(pd.DataFrame({'symbol': sym,
                                                'mu_apt': mu_apt,
                                                'mu_real': mu_real},
                                               index=[year + 1]))
                 print('{} | fl: {} | mu_apt: {:6.3f} | mu_real: {:6.3f}'
                       .format(year + 1, reg.round(2), mu_apt, mu_real))
The four factors
The multivariate regression
The APTpredicted return of the stock
The realized return of the stock
The preceding code provides the following output:
AAPL.O
=======================================================================
2011 | fl: [ 0.91 -0.04 -0.35  0.12] | mu_apt:  0.011 | mu_real:  0.228
2012 | fl: [ 0.76 -0.02 -0.24  0.05] | mu_apt:  0.099 | mu_real:  0.275
2013 | fl: [ 1.67  0.04 -0.56  0.10] | mu_apt:  0.366 | mu_real:  0.053
2014 | fl: [ 0.53 -0.00  0.02  0.16] | mu_apt:  0.050 | mu_real:  0.320
2015 | fl: [ 1.07  0.02  0.25  0.01] | mu_apt: -0.038 | mu_real: -0.047
2016 | fl: [ 1.21  0.01 -0.14 -0.02] | mu_apt:  0.110 | mu_real:  0.096
2017 | fl: [ 1.10  0.01 -0.15 -0.02] | mu_apt:  0.170 | mu_real:  0.381
2018 | fl: [ 1.06 -0.03 -0.15  0.12] | mu_apt: -0.088 | mu_real: -0.071
2019 | fl: [ 1.37  0.01 -0.20  0.13] | mu_apt:  0.364 | mu_real:  0.621

MSFT.O
=======================================================================
2011 | fl: [ 0.98  0.01  0.02 -0.11] | mu_apt: -0.008 | mu_real: -0.072
2012 | fl: [ 0.82  0.00 -0.03 -0.01] | mu_apt:  0.103 | mu_real:  0.029
2013 | fl: [ 1.14  0.00 -0.07 -0.01] | mu_apt:  0.294 | mu_real:  0.337
2014 | fl: [ 1.28  0.05  0.04  0.07] | mu_apt:  0.149 | mu_real:  0.216
2015 | fl: [ 1.20  0.03  0.05  0.01] | mu_apt: -0.016 | mu_real:  0.178
2016 | fl: [ 1.44  0.03 -0.17 -0.02] | mu_apt:  0.127 | mu_real:  0.113
2017 | fl: [ 1.33  0.01 -0.14  0.00] | mu_apt:  0.216 | mu_real:  0.321
2018 | fl: [ 1.10 -0.02 -0.14  0.22] | mu_apt: -0.087 | mu_real:  0.172
2019 | fl: [ 1.51  0.01 -0.16 -0.02] | mu_apt:  0.378 | mu_real:  0.440

INTC.O
=======================================================================
2011 | fl: [ 1.17  0.01  0.05 -0.13] | mu_apt: -0.010 | mu_real:  0.142
2012 | fl: [ 1.03  0.04  0.01  0.03] | mu_apt:  0.122 | mu_real: -0.163
2013 | fl: [ 1.06 -0.01 -0.10  0.01] | mu_apt:  0.267 | mu_real:  0.230
2014 | fl: [ 0.96  0.02  0.36 -0.02] | mu_apt:  0.063 | mu_real:  0.335
2015 | fl: [ 0.93 -0.01 -0.09  0.02] | mu_apt:  0.001 | mu_real: -0.052
2016 | fl: [ 1.02  0.00 -0.05  0.06] | mu_apt:  0.099 | mu_real:  0.051
2017 | fl: [ 1.41  0.02 -0.18  0.03] | mu_apt:  0.226 | mu_real:  0.242
2018 | fl: [ 1.12 -0.01 -0.11  0.17] | mu_apt: -0.076 | mu_real:  0.017
2019 | fl: [ 1.50  0.01 -0.34  0.30] | mu_apt:  0.431 | mu_real:  0.243

AMZN.O
=======================================================================
2011 | fl: [ 1.02 -0.03 -0.18 -0.14] | mu_apt: -0.016 | mu_real: -0.039
2012 | fl: [ 0.98 -0.01 -0.17 -0.09] | mu_apt:  0.117 | mu_real:  0.374
2013 | fl: [ 1.07 -0.00  0.09  0.00] | mu_apt:  0.282 | mu_real:  0.464
2014 | fl: [ 1.54  0.03  0.01 -0.08] | mu_apt:  0.176 | mu_real: -0.251
2015 | fl: [ 1.26 -0.02  0.45 -0.11] | mu_apt: -0.044 | mu_real:  0.778
2016 | fl: [ 1.06 -0.00 -0.15 -0.04] | mu_apt:  0.099 | mu_real:  0.104
2017 | fl: [ 0.94 -0.02  0.12 -0.03] | mu_apt:  0.185 | mu_real:  0.446
2018 | fl: [ 0.90 -0.04 -0.25  0.28] | mu_apt: -0.085 | mu_real:  0.251
2019 | fl: [ 1.99  0.05 -0.37  0.12] | mu_apt:  0.506 | mu_real:  0.207
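To see the mechanics of the regression step in isolation, the following minimal sketch applies the same approach to synthetic data. The factor returns, the "true" loadings, and the noise level are all assumptions made up for illustration; they are not taken from the data set used in this chapter:

```python
import numpy as np

rng = np.random.default_rng(100)

# Hypothetical daily factor returns: 252 trading days, 4 factors (synthetic).
factor_rets = rng.normal(0.0, 0.01, size=(252, 4))

# Assumed "true" factor loadings used to generate the stock returns.
true_loadings = np.array([1.0, -0.05, -0.2, 0.1])
noise = rng.normal(0.0, 0.002, size=252)
stock_rets = factor_rets @ true_loadings + noise

# Least-squares estimate of the loadings (no intercept, as in the text).
loadings, *_ = np.linalg.lstsq(factor_rets, stock_rets, rcond=None)

# Annualized APT-style prediction: mean factor returns times 252,
# weighted by the estimated loadings.
mu_apt = np.dot(factor_rets.mean(axis=0) * 252, loadings)
mu_real = stock_rets.mean() * 252

print(loadings.round(2), round(mu_apt, 3), round(mu_real, 3))
```

In this artificial setting the loadings are recovered well because the data are generated by an exactly linear model with stable loadings. The real-data results above suggest that neither condition can be taken for granted: the estimated loadings change noticeably from year to year.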
Figure 4-11 compares the APT-predicted returns for a stock with its realized returns over time. Compared to the single-factor CAPM, there seems to be hardly any improvement:

In [97]: sym = 'AMZN.O'

In [98]: res[res['symbol'] == sym].corr()
Out[98]:
            mu_apt   mu_real
mu_apt    1.000000 -0.098281
mu_real  -0.098281  1.000000

In [99]: res[res['symbol'] == sym].plot(kind='bar',
                 figsize=(10, 6), title=sym);
The same picture arises in Figure 4-12, produced by the following snippet, which compares the averages for multiple stocks. Because there is hardly any variation in the average APT predictions, they differ considerably from the average realized returns:

In [100]: grouped = res.groupby('symbol').mean()
          grouped
Out[100]:
          mu_apt   mu_real
symbol
AAPL.O  0.116116  0.206158
AMZN.O  0.135528  0.259395
INTC.O  0.124811  0.116180
MSFT.O  0.128441  0.192655

In [101]: grouped.plot(kind='bar', figsize=(10, 6), title='Average Values');
Of course, the selection of the risk factors is of paramount importance in this context. The data-driven investor decides to find out what risk factors are typically considered relevant ones for stocks. After studying the paper by Bender et al. (2013), the investor replaces the original risk factors with a new set. In particular, the investor chooses the set as presented in Table 4-3.

Factor      Description                                              RIC
Market      MSCI World Gross Return Daily USD (PUS = Price Return)
Size        MSCI World Equal Weight Price Net Index EOD
Volatility  MSCI World Minimum Volatility Net Return
Value       MSCI World Value Weighted Gross (NUS for Net)
Risk        MSCI World Risk Weighted Gross USD EOD
Growth      MSCI World Quality Net Return USD
Momentum    MSCI World Momentum Gross Index USD EOD

The following Python code retrieves the corresponding data set from a remote location and visualizes the normalized time series data (see Figure 4-13). Even a brief look reveals that the time series seem to be highly positively correlated:

In [102]: factors = pd.read_csv(
              'http://hilpisch.com/aiif_eikon_eod_factors.csv',
              index_col=0, parse_dates=True)

In [103]: (factors / factors.iloc[0]).plot(figsize=(10, 6));
This impression is confirmed by the following calculation and the resulting correlation matrix for the factor returns. All correlation coefficients are about 0.75 or higher:

In [104]: start = '2017-01-01'  # start and end dates for data selection
          end = '2020-01-01'

In [105]: retsd = rets.loc[start:end].copy()  # the relevant returns subset
          retsd.dropna(inplace=True)

In [106]: # calculates and processes the log returns for the factors
          retsf = np.log(factors / factors.shift(1))
          retsf = retsf.loc[start:end]
          retsf.dropna(inplace=True)
          retsf = retsf.loc[retsd.index].dropna()

In [107]: retsf.corr()  # the correlation matrix for the factors
Out[107]:
              market      size  volatility     value      risk    growth  momentum
market      1.000000  0.935867    0.845010  0.964124  0.947150  0.959038  0.928705
size        0.935867  1.000000    0.791767  0.965739  0.983238  0.835477  0.796420
volatility  0.845010  0.791767    1.000000  0.778294  0.865467  0.818280  0.819585
value       0.964124  0.965739    0.778294  1.000000  0.958359  0.864222  0.818796
risk        0.947150  0.983238    0.865467  0.958359  1.000000  0.858546  0.825563
growth      0.959038  0.835477    0.818280  0.864222  0.858546  1.000000  0.952956
momentum    0.928705  0.796420    0.819585  0.818796  0.825563  0.952956  1.000000
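Why highly correlated factors are a problem for a multivariate regression can be illustrated with a small synthetic experiment. The sketch below is an assumption-laden toy example, not an analysis of the factor data set: it builds four artificial factors that share a common component and compares the condition number of their return matrix with that of four independent factors. A larger condition number indicates that the individual loadings are harder to pin down:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 750  # hypothetical number of trading days

# Four synthetic factors that share one common (market-like) component.
common = rng.normal(0.0, 0.01, size=n)
correlated = np.column_stack(
    [common + rng.normal(0.0, 0.003, size=n) for _ in range(4)])

# Four synthetic factors without any common component.
independent = rng.normal(0.0, 0.01, size=(n, 4))

corr = np.corrcoef(correlated, rowvar=False)
min_corr = corr[~np.eye(4, dtype=bool)].min()  # smallest pairwise correlation

cond_corr = np.linalg.cond(correlated)   # condition number, correlated case
cond_ind = np.linalg.cond(independent)   # condition number, independent case

print(round(min_corr, 3), round(cond_corr, 2), round(cond_ind, 2))
```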
The following Python code derives factor loadings for the original stocks but with the new factors. They are derived from the first half of the data set and applied to predict the stock return for the second half given the performance of the single factors. The realized return is also calculated. Both are compared in Figure 4-14. As is to be expected given the high correlation of the factors, the explanatory power of the APT approach is not much higher than that of the CAPM:

In [108]: res = pd.DataFrame()

In [109]: np.set_printoptions(formatter={'float': lambda x: f'{x:5.2f}'})

In [110]: split = int(len(retsf) * 0.5)
          for sym in rets.columns[:4]:
              print('\n' + sym)
              print(74 * '=')
              retsf_, retsd_ = retsf.iloc[:split], retsd.iloc[:split]
              reg = np.linalg.lstsq(retsf_, retsd_[sym], rcond=-1)[0]
              retsf_, retsd_ = retsf.iloc[split:], retsd.iloc[split:]
              mu_apt = np.dot(retsf_.mean() * 252, reg)
              mu_real = retsd_[sym].mean() * 252
              res = res.append(pd.DataFrame({'mu_apt': mu_apt,
                              'mu_real': mu_real}, index=[sym,]),
                              sort=True)
              print('fl: {} | apt: {:.3f} | real: {:.3f}'.format(
                    reg.round(1), mu_apt, mu_real))

AAPL.O
==========================================================================
fl: [ 2.30  2.80 -0.70 -1.40 -4.20  2.00 -0.20] | apt: 0.115 | real: 0.301

MSFT.O
==========================================================================
fl: [ 1.50  0.00  0.10 -1.30 -1.40  0.80  1.00] | apt: 0.181 | real: 0.304

INTC.O
==========================================================================
fl: [-3.10  1.60  0.40  1.30 -2.60  2.50  1.10] | apt: 0.186 | real: 0.118

AMZN.O
==========================================================================
fl: [ 9.10  3.30 -1.00 -7.10 -3.10 -1.80  1.20] | apt: 0.019 | real: 0.050

In [111]: res.plot(kind='bar', figsize=(10, 6));
The data-driven investor is not willing to dismiss the APT completely. Therefore, an additional test might shed some more light on the explanatory power of APT. To this end, the factor loadings are used to test whether APT can correctly explain movements of the stock price over time. And indeed, although APT does not predict the absolute performance correctly (it is off by 10+ percentage points), it predicts the direction of the stock price movement correctly in the majority of cases (see Figure 4-15). The correlation between the predicted and realized returns is also pretty high at around 83%. However, the analysis uses realized factor returns to generate the APT predictions, which are, of course, not available in practice a day before the relevant trading day:

In [112]: sym
Out[112]: 'AMZN.O'

In [113]: # predicts the daily stock price returns given the realized
          # factor returns
          rets_sym = np.dot(retsf_, reg)

In [114]: # stores the results in a DataFrame object and adds column
          # and index data
          rets_sym = pd.DataFrame(rets_sym,
                                  columns=[sym + '_apt'],
                                  index=retsf_.index)

In [115]: # adds the realized stock price returns to the DataFrame object
          rets_sym[sym + '_real'] = retsd_[sym]

In [116]: rets_sym.mean() * 252  # the annualized returns
Out[116]: AMZN.O_apt     0.019401
          AMZN.O_real    0.050344
          dtype: float64

In [117]: rets_sym.std() * 252 ** 0.5  # the annualized volatility
Out[117]: AMZN.O_apt     0.270995
          AMZN.O_real    0.307653
          dtype: float64

In [118]: rets_sym.corr()  # the correlation coefficient
Out[118]:
                 AMZN.O_apt  AMZN.O_real
    AMZN.O_apt     1.000000     0.832218
    AMZN.O_real    0.832218     1.000000

In [119]: rets_sym.cumsum().apply(np.exp).plot(figsize=(10, 6));
How accurately does APT predict the direction of the stock price movement given the realized factor returns? The following Python code shows that the accuracy score is a bit better than 75%:

In [120]: rets_sym['same'] = (np.sign(rets_sym[sym + '_apt']) ==
                              np.sign(rets_sym[sym + '_real']))

In [121]: rets_sym['same'].value_counts()
Out[121]: True     288
          False     89
          Name: same, dtype: int64

In [122]: rets_sym['same'].value_counts()[True] / len(rets_sym)
Out[122]: 0.7639257294429708
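The accuracy score is simply the share of days on which the predicted and the realized return have the same sign. The following sketch wraps this into a small helper; the function name `hit_ratio` and the example numbers are hypothetical and only serve as an illustration:

```python
import numpy as np


def hit_ratio(predicted, realized):
    ''' Share of observations for which predicted and realized
    returns have the same sign (hypothetical helper function). '''
    predicted = np.asarray(predicted)
    realized = np.asarray(realized)
    return float(np.mean(np.sign(predicted) == np.sign(realized)))


# Made-up daily returns: three of the four signs agree.
example = hit_ratio([0.01, -0.02, 0.005, -0.01],
                    [0.02, -0.01, -0.003, -0.004])

# The counts reported above (288 correct, 89 incorrect) give the same
# accuracy score as the value_counts()-based calculation.
from_counts = 288 / (288 + 89)

print(example, from_counts)
```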
Debunking Central Assumptions
The previous section provides a number of numerical, real-world examples showing how popular normative financial theories might fail in practice. This section argues that one of the major reasons is that central assumptions of these popular financial theories are invalid; that is, they simply do not describe the reality of financial markets. The two assumptions analyzed are normally distributed returns and linear relationships.
Normally Distributed Returns
As a matter of fact, only a normal distribution is completely specified through its first (expectation) and second moment (standard deviation).
Sample data sets
For illustration, consider a randomly generated set of standard normally distributed numbers as generated by the following Python code.^{4} Figure 4-16 shows the typical bell shape of the resulting histogram:

In [1]: import numpy as np
        import pandas as pd
        from pylab import plt, mpl
        np.random.seed(100)
        plt.style.use('seaborn')
        mpl.rcParams['savefig.dpi'] = 300
        mpl.rcParams['font.family'] = 'serif'

In [2]: N = 10000

In [3]: # draws standard normally distributed random numbers
        snrn = np.random.standard_normal(N)
        # corrects the first moment (expectation) to 0.0
        snrn -= snrn.mean()
        # corrects the second moment (standard deviation) to 1.0
        snrn /= snrn.std()

In [4]: round(snrn.mean(), 4)
Out[4]: -0.0

In [5]: round(snrn.std(), 4)
Out[5]: 1.0

In [6]: plt.figure(figsize=(10, 6))
        plt.hist(snrn, bins=35);
Now consider a set of random numbers that share the same first and second moment values but have a completely different distribution, as Figure 4-17 illustrates. Although the moments are the same, this distribution consists of only three discrete values:

In [7]: # a set of numbers with three discrete values only
        numbers = np.ones(N) * 1.5
        split = int(0.25 * N)
        numbers[split:3 * split] = -1
        numbers[3 * split:4 * split] = 0

In [8]: # corrects the first moment (expectation) to 0.0
        numbers -= numbers.mean()
        # corrects the second moment (standard deviation) to 1.0
        numbers /= numbers.std()

In [9]: round(numbers.mean(), 4)
Out[9]: 0.0

In [10]: round(numbers.std(), 4)
Out[10]: 1.0

In [11]: plt.figure(figsize=(10, 6))
         plt.hist(numbers, bins=35);
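That the two data sets differ despite identical first and second moments shows up as soon as higher moments are considered. The following sketch computes the skewness and excess kurtosis of the three-valued distribution directly from its probabilities (25% at 1.5, 50% at -1, 25% at 0, as set up in the code above); the variable names are chosen for illustration only. For a normal distribution, both quantities are zero:

```python
import numpy as np

# The three discrete values and their probabilities before standardization.
values = np.array([1.5, -1.0, 0.0])
probs = np.array([0.25, 0.50, 0.25])

mean = np.sum(probs * values)
var = np.sum(probs * (values - mean) ** 2)

# Standardized values have mean 0 and standard deviation 1 by construction.
z = (values - mean) / np.sqrt(var)

skew = np.sum(probs * z ** 3)             # third standardized moment
excess_kurt = np.sum(probs * z ** 4) - 3  # fourth standardized moment minus 3

print(round(skew, 4), round(excess_kurt, 4))
```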
First and Second Moment
The first and second moment of a probability distribution only describe a normal distribution completely. There are infinitely many other distributions that might share the first two moments with a normal distribution while being completely different.
In preparation for a test of real financial returns, consider the following Python functions that allow one to visualize data as a histogram and to add a probability density function (PDF) of a normal distribution with the first two moments of the data:
In [12]: import math
         import scipy.stats as scs
         import statsmodels.api as sm

In [13]: def dN(x, mu, sigma):
             ''' Probability density function of a normal random variable x. '''
             z = (x - mu) / sigma
             pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi * sigma ** 2)
             return pdf

In [14]: def return_histogram(rets, title=''):
             ''' Plots a histogram of the returns. '''
             plt.figure(figsize=(10, 6))
             x = np.linspace(min(rets), max(rets), 100)
             plt.hist(np.array(rets), bins=50,
                      density=True, label='frequency')
             y = dN(x, np.mean(rets), np.std(rets))
             plt.plot(x, y, linewidth=2, label='PDF')
             plt.xlabel('log returns')
             plt.ylabel('frequency/probability')
             plt.title(title)
             plt.legend()
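As a quick plausibility check, a hand-written normal PDF of this kind can be compared against the reference implementation in SciPy. The sketch below restates the density function so that it is self-contained; the parameter values are arbitrary assumptions for illustration:

```python
import math
import numpy as np
import scipy.stats as scs


def dN(x, mu, sigma):
    ''' Probability density function of a normal random variable x. '''
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi * sigma ** 2)


mu, sigma = 0.5, 2.0  # arbitrary example parameters
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 2001)

own = dN(x, mu, sigma)
ref = scs.norm.pdf(x, loc=mu, scale=sigma)

# Simple Riemann sum: the density should integrate to (almost) one.
area = float(np.sum(own) * (x[1] - x[0]))

print(np.allclose(own, ref), round(area, 6))
```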
Figure 4-18 shows how well the histogram approximates the PDF for the standard normally distributed random numbers:

In [15]: return_histogram(snrn)

By contrast, Figure 4-19 illustrates that the PDF of the normal distribution has nothing to do with the data shown as a histogram:

In [16]: return_histogram(numbers)
Another way of comparing a normal distribution to data is the Quantile-Quantile (Q-Q) plot. As Figure 4-20 shows, for normally distributed numbers, the numbers themselves lie (mostly) on a straight line in the Q-Q plane:

In [17]: def return_qqplot(rets, title=''):
             ''' Generates a Q-Q plot of the returns.
             '''
             fig = sm.qqplot(rets, line='s', alpha=0.5)
             fig.set_size_inches(10, 6)
             plt.title(title)
             plt.xlabel('theoretical quantiles')
             plt.ylabel('sample quantiles')

In [18]: return_qqplot(snrn)
Again, the Q-Q plot as shown in Figure 4-21 for the discrete numbers looks completely different from the one in Figure 4-20:

In [19]: return_qqplot(numbers)
Finally, one can also use statistical tests to check whether a set of numbers is normally distributed or not.
The following Python function implements three tests:

• Test for normal skew.
• Test for normal kurtosis.
• Test for normal skew and kurtosis combined.

A p-value below 0.05 is generally considered to be a counterindicator for normality; that is, the hypothesis that the numbers are normally distributed is rejected. In that sense, as in the preceding figures, the p-values for the two data sets speak for themselves:
In [20]: def print_statistics(rets):
             print('RETURN SAMPLE STATISTICS')
             print(45 * '-')
             print('Skew of Sample Log Returns {:9.6f}'.format(
                         scs.skew(rets)))
             print('Skew Normal Test p-value   {:9.6f}'.format(
                         scs.skewtest(rets)[1]))
             print(45 * '-')
             print('Kurt of Sample Log Returns {:9.6f}'.format(
                         scs.kurtosis(rets)))
             print('Kurt Normal Test p-value   {:9.6f}'.format(
                         scs.kurtosistest(rets)[1]))
             print(45 * '-')
             print('Normal Test p-value        {:9.6f}'.format(
                         scs.normaltest(rets)[1]))
             print(45 * '-')

In [21]: print_statistics(snrn)
         RETURN SAMPLE STATISTICS
         ---------------------------------------------
         Skew of Sample Log Returns  0.016793
         Skew Normal Test p-value    0.492685
         ---------------------------------------------
         Kurt of Sample Log Returns -0.024540
         Kurt Normal Test p-value    0.637637
         ---------------------------------------------
         Normal Test p-value         0.707334
         ---------------------------------------------

In [22]: print_statistics(numbers)
         RETURN SAMPLE STATISTICS
         ---------------------------------------------
         Skew of Sample Log Returns  0.689254
         Skew Normal Test p-value    0.000000
         ---------------------------------------------
         Kurt of Sample Log Returns -1.141902
         Kurt Normal Test p-value    0.000000
         ---------------------------------------------
         Normal Test p-value         0.000000
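The third test combines the first two. In SciPy, `normaltest` is based on the D'Agostino-Pearson approach: its statistic is the sum of the squared z-scores of the skew test and the kurtosis test, and the p-value is taken from a chi-squared distribution with two degrees of freedom. The following sketch checks this relationship on a freshly drawn sample; the seed and sample size are arbitrary assumptions:

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(100)
sample = rng.standard_normal(10000)  # synthetic, normally distributed data

z_skew, p_skew = scs.skewtest(sample)
z_kurt, p_kurt = scs.kurtosistest(sample)
k2, p_k2 = scs.normaltest(sample)

# Combined statistic and p-value reconstructed by hand.
k2_manual = z_skew ** 2 + z_kurt ** 2
p_manual = scs.chi2.sf(k2_manual, df=2)

print(np.isclose(k2, k2_manual), np.isclose(p_k2, p_manual))
```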

Real financial returns
The following Python code retrieves EOD data from a remote source, as done earlier in the chapter, and calculates the log returns for all financial time series contained in the data set. Figure 4-22 shows that the log returns of the S&P 500 stock index, represented as a histogram, show a much higher peak and fatter tails than the normal PDF with the sample expectation and standard deviation. These two insights are stylized facts because they can be consistently observed for different financial instruments:

In [23]: raw = pd.read_csv('http://hilpisch.com/aiif_eikon_eod_data.csv',
                           index_col=0, parse_dates=True).dropna()

In [24]: rets = np.log(raw / raw.shift(1)).dropna()

In [25]: symbol = '.SPX'

In [26]: return_histogram(rets[symbol].values, symbol)
Similar insights can be gained when considering the Q-Q plot for the S&P 500 log returns in Figure 4-23. In particular, the Q-Q plot visualizes the fat tails pretty well (points below the straight line to the left and above the straight line to the right):

In [27]: return_qqplot(rets[symbol].values, symbol)
The Python code that follows conducts the statistical tests regarding the normality of the real financial returns for a selection of the financial time series from the data set. Real financial returns regularly fail such tests. Therefore, it is safe to conclude that the normality assumption about financial returns hardly, if at all, describes financial reality:

In [28]: symbols = ['.SPX', 'AMZN.O', 'EUR=', 'GLD']

In [29]: for sym in symbols:
             print('\n{}'.format(sym))
             print(45 * '=')
             print_statistics(rets[sym].values)

         .SPX
         =============================================
         RETURN SAMPLE STATISTICS
         ---------------------------------------------
         Skew of Sample Log Returns -0.497160
         Skew Normal Test p-value    0.000000
         ---------------------------------------------
         Kurt of Sample Log Returns  4.598167
         Kurt Normal Test p-value    0.000000
         ---------------------------------------------
         Normal Test p-value         0.000000
         ---------------------------------------------

         AMZN.O
         =============================================
         RETURN SAMPLE STATISTICS
         ---------------------------------------------
         Skew of Sample Log Returns  0.135268
         Skew Normal Test p-value    0.005689
         ---------------------------------------------
         Kurt of Sample Log Returns  7.344837
         Kurt Normal Test p-value    0.000000
         ---------------------------------------------
         Normal Test p-value         0.000000
         ---------------------------------------------

         EUR=
         =============================================
         RETURN SAMPLE STATISTICS
         ---------------------------------------------
         Skew of Sample Log Returns -0.053959
         Skew Normal Test p-value    0.268203
         ---------------------------------------------
         Kurt of Sample Log Returns  1.780899
         Kurt Normal Test p-value    0.000000
         ---------------------------------------------
         Normal Test p-value         0.000000
         ---------------------------------------------

         GLD
         =============================================
         RETURN SAMPLE STATISTICS
         ---------------------------------------------
         Skew of Sample Log Returns -0.581025
         Skew Normal Test p-value    0.000000
         ---------------------------------------------
         Kurt of Sample Log Returns  5.899701
         Kurt Normal Test p-value    0.000000
         ---------------------------------------------
         Normal Test p-value         0.000000
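What fat tails mean in terms of frequencies can be made concrete with a small simulation. The sketch below is purely illustrative and uses synthetic numbers only: it compares a normally distributed sample with a sample from a Student's t distribution with three degrees of freedom (an assumed stand-in for a fat-tailed return distribution) and counts how often a standardized observation lies more than three standard deviations away from the mean. Under normality, this should happen in only about 0.27% of all cases:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000

normal = rng.standard_normal(N)   # thin-tailed benchmark
fat = rng.standard_t(df=3, size=N)  # assumed fat-tailed stand-in


def tail_share(x, k=3.0):
    ''' Share of observations more than k sample standard deviations
    away from the sample mean (hypothetical helper function). '''
    z = (x - x.mean()) / x.std()
    return float(np.mean(np.abs(z) > k))


share_normal = tail_share(normal)
share_fat = tail_share(fat)

print(share_normal, share_fat)
```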

Normality Assumption
Although the normality assumption is a good approximation for many real-world phenomena, such as in physics, it is not appropriate and can even be dangerous when it comes to financial returns. Almost no financial return sample data set passes statistical normality tests. Beyond the fact that it has proven useful in other domains, a major reason why this assumption is found in so many financial models is that it leads to elegant and relatively simple mathematical models, calculations, and proofs.
Linear Relationships
Similar to the “omnipresence” of the normality assumption in financial models and theories, linear relationships between variables seem to be another widespread benchmark. This subsection considers an important one, namely the assumed linear relationship in the CAPM between the beta of a stock and its expected (realized) return. Generally speaking, the higher the beta is, the higher the expected return given a positive market performance will be—in a fixed proportional way as given by the beta value itself.
Recall the calculation of the betas, the CAPM expected returns, and the realized returns for a selection of technology stocks from the previous section, which is repeated in the following Python code for convenience. This time, the beta values are added to the results’ DataFrame object as well.

In [30]: r = 0.005

In [31]: market = '.SPX'

In [32]: res = pd.DataFrame()

In [33]: for sym in rets.columns[:4]:
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 muM = rets_[market].mean() * 252
                 cov = rets_.cov().loc[sym, market]
                 var = rets_[market].var()
                 beta = cov / var
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 muM = rets_[market].mean() * 252
                 mu_capm = r + beta * (muM - r)
                 mu_real = rets_[sym].mean() * 252
                 res = res.append(pd.DataFrame({'symbol': sym,
                                                'beta': beta,
                                                'mu_capm': mu_capm,
                                                'mu_real': mu_real},
                                               index=[year + 1]),
                                  sort=True)
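For reference, the two quantities computed in the loop can be written compactly as follows. The notation is an assumption chosen to mirror the variable names in the code (`beta`, `r`, `muM`, `mu_capm`) and may differ from the notation used in Chapter 3:

```latex
\beta_i = \frac{\operatorname{Cov}(r_i, r_M)}{\operatorname{Var}(r_M)},
\qquad
\mu_i^{\mathrm{CAPM}} = r + \beta_i \left( \mu_M - r \right)
```

Here, $r_i$ and $r_M$ denote the log returns of stock $i$ and of the market portfolio in the estimation year, $r$ is the risk-less rate, and $\mu_M$ is the annualized market return of the following year. For a fixed $\mu_M$, the expected return is linear in $\beta_i$; across different years, $\mu_M$ changes, which is one reason why the pooled relationship between beta and return need not look linear.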
The following analysis calculates the ${R}^{2}$ score for a linear regression for which the beta is the independent variable and the expected CAPM return, given the market portfolio performance, is the dependent variable. ${R}^{2}$ refers to the coefficient of determination and measures how well a model performs compared to a baseline predictor in the form of a simple mean value. The linear regression can only explain around 10% of the variability in the expected CAPM return, a pretty low value, which is also confirmed through Figure 4-24:

In [34]: from sklearn.metrics import r2_score

In [35]: reg = np.polyfit(res['beta'], res['mu_capm'], deg=1)
         res['mu_capm_ols'] = np.polyval(reg, res['beta'])

In [36]: r2_score(res['mu_capm'], res['mu_capm_ols'])
Out[36]: 0.09272355783573516

In [37]: res.plot(kind='scatter', x='beta', y='mu_capm', figsize=(10, 6))
         x = np.linspace(res['beta'].min(), res['beta'].max())
         plt.plot(x, np.polyval(reg, x), 'g', label='regression')
         plt.legend();
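The coefficient of determination can also be computed by hand, which makes the comparison with the mean-value baseline explicit. The following sketch uses synthetic data (the "betas" and "returns" are made-up numbers, not the results from above) and relies on the fact that, for a simple linear regression with intercept, ${R}^{2}$ equals the squared correlation coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for betas and returns (36 stock-year observations).
x = rng.normal(1.0, 0.3, size=36)
y = 0.05 + 0.1 * x + rng.normal(0.0, 0.15, size=36)

reg = np.polyfit(x, y, deg=1)
y_hat = np.polyval(reg, x)

ss_res = np.sum((y - y_hat) ** 2)     # squared errors of the regression
ss_tot = np.sum((y - y.mean()) ** 2)  # squared errors of the mean baseline
r2 = 1 - ss_res / ss_tot

corr = np.corrcoef(x, y)[0, 1]

print(round(r2, 4), round(corr ** 2, 4))
```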
For the realized return, the explanatory power of the linear regression is even lower, at about 4.5% (see Figure 4-25). The linear regressions recover the positive relationship between beta and stock returns (“the higher the beta, the higher the return given the (positive) market portfolio performance”), as indicated by the positive slope of the regression lines. However, they only explain a small part of the observed overall variability in the stock returns:

In [38]: reg = np.polyfit(res['beta'], res['mu_real'], deg=1)
         res['mu_real_ols'] = np.polyval(reg, res['beta'])

In [39]: r2_score(res['mu_real'], res['mu_real_ols'])
Out[39]: 0.04466919444752959

In [40]: res.plot(kind='scatter', x='beta', y='mu_real', figsize=(10, 6))
         x = np.linspace(res['beta'].min(), res['beta'].max())
         plt.plot(x, np.polyval(reg, x), 'g', label='regression')
         plt.legend();
Linear Relationships
As with the normality assumptions, linear relationships can often be observed in the physical world. However, in finance there are hardly any cases in which variables depend on each other in a clearly linear way. From a modeling point of view, linear relationships lead, as does the normality assumption, to elegant and relatively simple mathematical models, calculations, and proofs. In addition, the standard tool in financial econometrics, OLS regression, is well suited to dealing with linear relationships in data. These are major reasons why normality and linearity are often deliberately chosen as convenient building blocks of financial models and theories.
Conclusions
Science has been driven for centuries by the rigorous generation and analysis of data. However, finance used to be characterized by normative theories based on simplified mathematical models of the financial markets, relying on assumptions such as normality of returns and linear relationships. The almost universal and comprehensive availability of (financial) data has led to a shift in focus from a theory-first approach to data-driven finance. Several examples based on real financial data illustrate that many popular financial models and theories cannot survive a confrontation with financial market realities. Although elegant, they might be too simplistic to capture the complexities, changing nature, and nonlinearities of financial markets.
References
Books and papers cited in this chapter:
Python Code
The following Python file contains a number of helper functions to simplify certain tasks in NLP:
#
# NLP Helper Functions
#
# Artificial Intelligence in Finance
# (c) Dr Yves J Hilpisch
# The Python Quants GmbH
#
import re
import nltk
import string
import pandas as pd
from pylab import plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from lxml.html.clean import Cleaner
from sklearn.feature_extraction.text import TfidfVectorizer
plt.style.use('seaborn')

cleaner = Cleaner(style=True, links=True, allow_tags=[''],
                  remove_unknown_tags=False)

stop_words = stopwords.words('english')
stop_words.extend(['new', 'old', 'pro', 'open', 'menu', 'close'])


def remove_non_ascii(s):
    ''' Removes all non-ascii characters.
    '''
    return ''.join(i for i in s if ord(i) < 128)


def clean_up_html(t):
    t = cleaner.clean_html(t)
    t = re.sub('[\n\t\r]', ' ', t)
    t = re.sub(' +', ' ', t)
    t = re.sub('<.*?>', '', t)
    t = remove_non_ascii(t)
    return t


def clean_up_text(t, numbers=False, punctuation=False):
    ''' Cleans up a text, e.g. HTML document,
        from HTML tags and also cleans up the
        text body.
    '''
    try:
        t = clean_up_html(t)
    except:
        pass
    t = t.lower()
    t = re.sub(r"what's", "what is ", t)
    t = t.replace('(ap)', '')
    t = re.sub(r"\'ve", " have ", t)
    t = re.sub(r"can't", "cannot ", t)
    t = re.sub(r"n't", " not ", t)
    t = re.sub(r"i'm", "i am ", t)
    t = re.sub(r"\'s", "", t)
    t = re.sub(r"\'re", " are ", t)
    t = re.sub(r"\'d", " would ", t)
    t = re.sub(r"\'ll", " will ", t)
    t = re.sub(r'\s+', ' ', t)
    t = re.sub(r"\\", "", t)
    t = re.sub(r"\'", "", t)
    t = re.sub(r"\"", "", t)
    if numbers:
        t = re.sub('[^a-zA-Z ?!]+', '', t)
    if punctuation:
        t = re.sub(r'\W+', ' ', t)
    t = remove_non_ascii(t)
    t = t.strip()
    return t


def nltk_lemma(word):
    ''' If one exists, returns the lemma of a word.
        I.e. the base or dictionary version of it.
    '''
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma


def tokenize(text, min_char=3, lemma=True, stop=True,
             numbers=False):
    ''' Tokenizes a text and implements some
        transformations.
    '''
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if len(t) >= min_char]
    if numbers:
        tokens = [t for t in tokens if t[0].lower()
                  in string.ascii_lowercase]
    if stop:
        tokens = [t for t in tokens if t not in stop_words]
    if lemma:
        tokens = [nltk_lemma(t) for t in tokens]
    return tokens


def generate_word_cloud(text, no, name=None, show=True):
    ''' Generates a word cloud bitmap given a
        text document (string).
        It uses the Term Frequency (TF) and
        Inverse Document Frequency (IDF)
        vectorization approach to derive the
        importance of a word -- represented
        by the size of the word in the word cloud.

    Parameters
    ==========
    text: str
        text as the basis
    no: int
        number of words to be included
    name: str
        path to save the image
    show: bool
        whether to show the generated image or not
    '''
    tokens = tokenize(text)
    vec = TfidfVectorizer(min_df=2,
                          analyzer='word',
                          ngram_range=(1, 2),
                          stop_words='english')
    vec.fit_transform(tokens)
    wc = pd.DataFrame({'words': vec.get_feature_names(),
                       'tfidf': vec.idf_})
    words = ' '.join(wc.sort_values('tfidf', ascending=True)['words'].head(no))
    wordcloud = WordCloud(max_font_size=110,
                          background_color='white',
                          width=1024, height=768,
                          margin=10, max_words=150).generate(words)
    if show:
        plt.figure(figsize=(10, 10))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()
    if name is not None:
        wordcloud.to_file(name)


def generate_key_words(text, no):
    try:
        tokens = tokenize(text)
        vec = TfidfVectorizer(min_df=2,
                              analyzer='word',
                              ngram_range=(1, 2),
                              stop_words='english')
        vec.fit_transform(tokens)
        wc = pd.DataFrame({'words': vec.get_feature_names(),
                           'tfidf': vec.idf_})
        words = wc.sort_values('tfidf', ascending=False)['words'].values
        words = [a for a in words if not a.isnumeric()][:no]
    except:
        words = list()
    return words
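The effect of the regular-expression-based clean-up steps is easiest to see on a single sentence. The following sketch reproduces only a handful of the substitutions from `clean_up_text()` with the standard library, without the HTML, NLTK, or scikit-learn dependencies. The function name `expand_and_strip` and the example sentence are hypothetical:

```python
import re


def expand_and_strip(t):
    ''' Applies a small subset of the substitutions used in
    clean_up_text() (hypothetical, simplified helper function). '''
    t = t.lower()
    t = re.sub(r"can't", "cannot ", t)   # expands a specific contraction
    t = re.sub(r"n't", " not ", t)       # expands generic negations
    t = re.sub(r"\'s", "", t)            # drops possessive endings
    t = re.sub(r"\s+", " ", t)           # collapses whitespace
    t = re.sub(r"[^a-zA-Z ?!]+", "", t)  # removes numbers and other symbols
    return t.strip()


cleaned = expand_and_strip(
    "Apple's shares didn't move; analysts can't explain it!")
print(cleaned)
```

Note that the order of the substitutions matters: the specific pattern for "can't" has to be applied before the generic "n't" pattern, which would otherwise turn it into "ca not".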
^{1} See, for example, Kopf (2015).
^{2} This data service is only available via a paid subscription.
^{3} RIC stands for Reuters Instrument Code.
^{4} Numbers generated by the random number generator of NumPy are pseudorandom numbers, although they are referenced throughout the book as random numbers.