Chapter 4. Data-Driven Finance

If artificial intelligence is the new electricity, big data is the oil that powers the generators.

Kai-Fu Lee (2018)

Nowadays, analysts sift through non-traditional information such as satellite imagery and credit card data, or use artificial intelligence techniques such as machine learning and natural language processing to glean fresh insights from traditional sources such as economic data and earnings-call transcripts.

Robin Wigglesworth (2019)

This chapter discusses central aspects of data-driven finance. For the purposes of this book, data-driven finance is understood to be a financial context (theory, model, application, and so on) that is primarily driven by and based on insights gained from data.

“Scientific Method” discusses the scientific method, which is about generally accepted principles that should guide scientific effort. “Financial Econometrics and Regression” is about financial econometrics and related topics. “Data Availability” sheds light on which types of (financial) data are available today and in what quality and quantity via programmatic APIs. “Normative Theories Revisited” revisits the normative theories of Chapter 3 and analyzes them based on real financial time series data. Also based on real financial data, “Debunking Central Assumptions” debunks two of the most commonly found assumptions in financial models and theories: normality of returns and linear relationships.

Scientific Method

The scientific method refers to a set of generally accepted principles that should guide any scientific project. Wikipedia defines the scientific method as follows:

The scientific method is an empirical method of acquiring knowledge that has characterized the development of science since at least the 17th century. It involves careful observation, applying rigorous skepticism about what is observed, given that cognitive assumptions can distort how one interprets the observation. It involves formulating hypotheses, via induction, based on such observations; experimental and measurement-based testing of deductions drawn from the hypotheses; and refinement (or elimination) of the hypotheses based on the experimental findings. These are principles of the scientific method, as distinguished from a definitive series of steps applicable to all scientific enterprises.

Given this definition, normative finance, as discussed in Chapter 3, is in stark contrast to the scientific method. Normative financial theories mostly rely on assumptions and axioms in combination with deduction as the major analytical method to arrive at their central results.

  • Expected utility theory (EUT) assumes that agents have the same utility function no matter what state of the world unfolds and that they maximize expected utility under conditions of uncertainty.

  • Mean-variance portfolio (MVP) theory describes how investors should invest under conditions of uncertainty assuming that only the expected return and the expected volatility of a portfolio over one period count.

  • The capital asset pricing model (CAPM) assumes that only the nondiversifiable market risk explains the expected return and the expected volatility of a stock over one period.

  • Arbitrage pricing theory (APT) assumes that a number of identifiable risk factors explain the expected return and the expected volatility of a stock over time; admittedly, compared to the other theories, the formulation of APT is rather broad and allows for wide-ranging interpretations.

What characterizes the aforementioned normative financial theories is that they were originally derived under certain assumptions and axioms using “pen and paper” only, without any recourse to real-world data or observations. From a historical point of view, many of these theories were rigorously tested against real-world data only long after their publication dates. This can be explained primarily by better data availability and increased computational capabilities over time. After all, data and computation are the main ingredients for the application of statistical methods in practice. The discipline at the intersection of mathematics, statistics, and finance that applies such methods to financial market data is typically called financial econometrics, the topic of the next section.

Financial Econometrics and Regression

Adapting the definition provided by Investopedia for econometrics, one can define financial econometrics as follows:

[Financial] econometrics is the quantitative application of statistical and mathematical models using [financial] data to develop financial theories or test existing hypotheses in finance and to forecast future trends from historical data. It subjects real-world [financial] data to statistical trials and then compares and contrasts the results against the [financial] theory or theories being tested.

Alexander (2008b) provides a thorough and broad introduction to the field of financial econometrics. The second chapter of the book covers single- and multifactor models, such as the CAPM and APT. Alexander (2008b) is part of a series of four books called Market Risk Analysis. The first in the series, Alexander (2008a), covers theoretical background concepts, topics, and methods, such as MVP theory and the CAPM themselves. The book by Campbell (2018) is another comprehensive resource for financial theory and related econometric research.

One of the major tools in financial econometrics is regression, in both its univariate and multivariate forms. Regression is also a central tool in statistical learning in general. What is the difference between traditional mathematics and statistical learning? Although there is no general answer to this question (after all, statistics is a sub-field of mathematics), a simple example should emphasize a major difference relevant to the context of this book.

First is the standard mathematical way. Assume a mathematical function is given as follows:

$$f: \mathbb{R} \to \mathbb{R}, \quad x \mapsto 2 + \frac{1}{2} x$$

Given multiple values of $x_i, i = 1, 2, \dots, n$, one can derive function values for $f$ by applying the above definition:

$$y_i = f(x_i), \quad i = 1, 2, \dots, n$$

The following Python code illustrates this based on a simple numerical example:

In [1]: import numpy as np

In [2]: def f(x):
            return 2 + 1 / 2 * x

In [3]: x = np.arange(-4, 5)
        x
Out[3]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [4]: y = f(x)
        y
Out[4]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

Second is the approach taken in statistical learning. Whereas in the preceding example, the function comes first and then the data is derived, this sequence is reversed in statistical learning. Here, the data is generally given and a functional relationship is to be found. In this context, x is often called the independent variable and y the dependent variable. Consequently, consider the following data:

$$(x_i, y_i), \quad i = 1, 2, \dots, n$$

The problem is to find, for example, parameters $\alpha, \beta$ such that:

$$\hat{f}(x_i) \equiv \alpha + \beta x_i = \hat{y}_i \approx y_i, \quad i = 1, 2, \dots, n$$

Another way of writing this is by including residual values $\epsilon_i, i = 1, 2, \dots, n$:

$$\alpha + \beta x_i + \epsilon_i = y_i, \quad i = 1, 2, \dots, n$$

In the context of ordinary least-squares (OLS) regression, $\alpha, \beta$ are chosen to minimize the mean-squared error between the approximated values $\hat{y}_i$ and the real values $y_i$. The minimization problem, then, is as follows:

$$\min_{\alpha, \beta} \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

In the case of simple OLS regression, as described previously, the optimal solutions are known in closed form and are as follows:

$$\beta = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad \alpha = \bar{y} - \beta \bar{x}$$

Here, $\mathrm{Cov}(\cdot)$ stands for the covariance, $\mathrm{Var}(\cdot)$ for the variance, and $\bar{x}, \bar{y}$ for the mean values of $x, y$.
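For readers who want to see where the closed-form solutions come from, the following is a short sketch of the standard derivation via the first-order conditions of the minimization problem. It assumes the population definitions of covariance and variance (division by $n$), which is an assumption of this sketch rather than something stated in the text.

```latex
\begin{aligned}
\frac{\partial}{\partial \alpha} \frac{1}{n} \sum_{i=1}^{n} (\alpha + \beta x_i - y_i)^2
  &= \frac{2}{n} \sum_{i=1}^{n} (\alpha + \beta x_i - y_i) = 0
  &&\Rightarrow\quad \alpha = \bar{y} - \beta \bar{x} \\
\frac{\partial}{\partial \beta} \frac{1}{n} \sum_{i=1}^{n} (\alpha + \beta x_i - y_i)^2
  &= \frac{2}{n} \sum_{i=1}^{n} x_i (\alpha + \beta x_i - y_i) = 0
  &&\Rightarrow\quad \beta
   = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x} \bar{y}}
          {\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2}
   = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}
\end{aligned}
```

The second line follows after substituting $\alpha = \bar{y} - \beta \bar{x}$ from the first condition.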

Returning to the preceding numerical example, these insights can be used to derive optimal parameters $\alpha, \beta$ and, in this particular case, to recover the original definition of $f(x)$:

In [5]: x
Out[5]: array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [6]: y
Out[6]: array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

In [7]: beta = np.cov(x, y, ddof=0)[0, 1] / x.var()  # (1)
        beta
Out[7]: 0.49999999999999994

In [8]: alpha = y.mean() - beta * x.mean()  # (2)
        alpha
Out[8]: 2.0

In [9]: y_ = alpha + beta * x  # (3)

In [10]: np.allclose(y_, y)  # (4)
Out[10]: True

(1) β as derived from the covariance matrix and the variance

(2) α as derived from β and the mean values

(3) Estimated values $\hat{y}_i, i = 1, 2, \dots, n$, given α, β

(4) Checks whether the $\hat{y}_i$ and $y_i$ values are numerically equal
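As a cross-check, the closed-form parameters can be compared with a library fit. The following is a minimal sketch that uses NumPy's `np.polyfit` with degree 1; the choice of `np.polyfit` is an assumption of this sketch and not the method used in the text.

```python
import numpy as np

x = np.arange(-4, 5)
y = 2 + 1 / 2 * x

# closed-form OLS solution as in the text
beta = np.cov(x, y, ddof=0)[0, 1] / x.var()
alpha = y.mean() - beta * x.mean()

# np.polyfit returns the coefficients with the highest degree first
beta_pf, alpha_pf = np.polyfit(x, y, deg=1)

print(np.allclose([alpha, beta], [alpha_pf, beta_pf]))
```

Both approaches should agree up to floating-point precision, since they solve the same least-squares problem.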

The preceding example and those in Chapter 1 illustrate that the application of OLS regression to a given data set is in general straightforward. There are more reasons why OLS regression has become one of the central tools in econometrics and financial econometrics. Among them are the following:

Centuries old

The least-squares approach, particularly in combination with regression, has been used for more than 200 years.1


Simplicity

The mathematics behind OLS regression is easy to understand and easy to implement in programming.

Scalability

There is basically no limit regarding the data size to which OLS regression can be applied.

Flexibility

OLS regression can be applied to a wide range of problems and data sets.

Speed

OLS regression is fast to evaluate, even on larger data sets.

Availability

Efficient implementations in Python and many other programming languages are readily available.

However, as easy and straightforward as the application of OLS regression might be in general, the method rests on a number of assumptions—most of them related to the residuals—that are not always satisfied in practice.


Linearity

The model is linear in its parameters, with regard to both the coefficients and the residuals.

Independence

Independent variables are not perfectly (or to a high degree) correlated with each other (no multicollinearity).

Zero mean

The mean value of the residuals is (close to) zero.

No correlation

Residuals are not (strongly) correlated with the independent variables.

Homoscedasticity

The standard deviation of the residuals is (almost) constant.

No autocorrelation

The residuals are not (strongly) correlated with each other.

In practice, it is in general quite simple to test for the validity of the assumptions given a specific data set.
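Some of these checks can be sketched with a few lines of NumPy. The example below is only an illustration on synthetic data: the seed, the sample size, the noise level, and the split-sample comparison of standard deviations are assumptions of this sketch and not formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(100)  # arbitrary fixed seed
x = np.linspace(-4, 4, 250)
y = 2 + 0.5 * x + rng.normal(0, 0.5, len(x))  # synthetic noisy data

# simple OLS regression in closed form
beta = np.cov(x, y, ddof=0)[0, 1] / x.var()
alpha = y.mean() - beta * x.mean()
res = y - (alpha + beta * x)  # residuals

mean_res = res.mean()  # zero mean
corr_x = np.corrcoef(x, res)[0, 1]  # correlation with the independent variable
half = len(res) // 2
std_ratio = res[:half].std() / res[half:].std()  # rough homoscedasticity check
autocorr = np.corrcoef(res[:-1], res[1:])[0, 1]  # lag-1 autocorrelation

print(mean_res, corr_x, std_ratio, autocorr)
```

For an OLS fit with intercept, the first two quantities are zero in-sample by construction (up to floating-point error), so they become informative mainly when checked on data not used for the fit. The last two quantities depend on the data itself.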

Data Availability

Financial econometrics is driven by statistical methods, such as regression, and the availability of financial data. From the 1950s to the 1990s, and even into the early 2000s, theoretical and empirical financial research was mainly driven by data sets that were small by today’s standards and consisted mostly of end-of-day (EOD) data. Data availability has changed dramatically over the last decade or so, with more and more types of financial and other data available in ever-increasing granularity, quantity, and velocity.

Programmatic APIs

With regard to data-driven finance, what is important is not only what data is available but also how it can be accessed and processed. For quite a while now, finance professionals have relied on data terminals from companies such as Refinitiv (see Eikon Terminal) or Bloomberg (see Bloomberg Terminal), to mention just two of the leading providers. Newspapers, magazines, financial reports, and the like have long been replaced by such terminals as the primary source for financial information. However, the sheer volume and variety of data provided by such terminals cannot be consumed systematically by a single user or even large groups of finance professionals. Therefore, the major breakthrough in data-driven finance is to be seen in the programmatic availability of data via application programming interfaces (APIs) that allow the usage of computer code to select, retrieve, and process arbitrary data sets.

The remainder of this section is devoted to the illustration of such APIs by which even academics and retail investors can retrieve a wealth of different data sets. Before such examples are provided, Table 4-1 offers an overview of categories of data that are in general relevant in a financial context, as well as typical examples. In the table, structured data refers to numerical data types that often come in tabular structures, while unstructured data refers to data in the form of standard text that often has no structure beyond headers or paragraphs, for example. Alternative data refers to data types that are typically not considered financial data.

Table 4-1. Relevant types of financial data

Time       | Structured data      | Unstructured data | Alternative data
Historical | Prices, fundamentals | News, texts       | Web, social media, satellites
Streaming  | Prices, volumes      | News, filings     | Web, social media, satellites, Internet of Things

Structured Historical Data

First, structured historical data types will be retrieved programmatically. To this end, the following Python code uses the Eikon Data API.2

To access data via the Eikon Data API, a local application, such as Refinitiv Workspace, must be running and the API access must be configured on the Python level:

In [11]: import eikon as ek
         import configparser

In [12]: c = configparser.ConfigParser()
         c.read('../aiif.cfg')
         2020-08-04 10:30:18,059 P[14938] [MainThread 4521459136] Error on handshake
          port 9000 : ReadTimeout(ReadTimeout())

If these requirements are met, historical structured data can be retrieved via a single function call. For example, the following Python code retrieves EOD data for a set of symbols and a specified time interval:

In [14]: symbols = ['AAPL.O', 'MSFT.O', 'NFLX.O', 'AMZN.O']  # (1)

In [15]: data = ek.get_timeseries(symbols,
                                  fields='CLOSE',
                                  start_date='2019-07-01',
                                  end_date='2020-07-01')  # (2)

In [16]: data.info()  # (3)
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 254 entries, 2019-07-01 to 2020-07-01
         Data columns (total 4 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   AAPL.O  254 non-null    float64
          1   MSFT.O  254 non-null    float64
          2   NFLX.O  254 non-null    float64
          3   AMZN.O  254 non-null    float64
         dtypes: float64(4)
         memory usage: 9.9 KB

In [17]: data.tail()  # (4)
Out[17]: CLOSE       AAPL.O  MSFT.O  NFLX.O   AMZN.O
         2020-06-25  364.84  200.34  465.91  2754.58
         2020-06-26  353.63  196.33  443.40  2692.87
         2020-06-29  361.78  198.44  447.24  2680.38
         2020-06-30  364.80  203.51  455.04  2758.82
         2020-07-01  364.11  204.70  485.64  2878.70

(1) Defines a list of RICs (symbols) to retrieve data for3

(2) Retrieves EOD Close prices for the list of RICs

(3) Shows the meta information for the returned DataFrame object

(4) Shows the final rows of the DataFrame object
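Once such a DataFrame object is available, typical follow-up steps take only a few lines. The following sketch rebuilds the five rows shown by `data.tail()` by hand (so that it runs without Eikon access) and derives daily log returns; the choice of log returns is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# closing prices copied from the data.tail() output above
data = pd.DataFrame(
    {'AAPL.O': [364.84, 353.63, 361.78, 364.80, 364.11],
     'MSFT.O': [200.34, 196.33, 198.44, 203.51, 204.70],
     'NFLX.O': [465.91, 443.40, 447.24, 455.04, 485.64],
     'AMZN.O': [2754.58, 2692.87, 2680.38, 2758.82, 2878.70]},
    index=pd.to_datetime(['2020-06-25', '2020-06-26', '2020-06-29',
                          '2020-06-30', '2020-07-01']))

rets = np.log(data / data.shift(1)).dropna()  # daily log returns
total = rets.sum()  # log return over the four trading days

print(rets.round(4))
print(total.round(4))
```

Because log returns are additive over time, the column sums equal the log return between the first and the last closing price.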

Similarly, one-minute bars with OHLC fields can be retrieved with appropriate adjustments of the parameters:

In [18]: data = ek.get_timeseries('AMZN.O',
                                  fields='*',
                                  start_date='2020-08-03',
                                  end_date='2020-08-04',
                                  interval='minute')  # (1)

In [19]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 911 entries, 2020-08-03 08:01:00 to 2020-08-04 00:00:00
         Data columns (total 6 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   HIGH    911 non-null    float64
          1   LOW     911 non-null    float64
          2   OPEN    911 non-null    float64
          3   CLOSE   911 non-null    float64
          4   COUNT   911 non-null    float64
          5   VOLUME  911 non-null    float64
         dtypes: float64(6)
         memory usage: 49.8 KB

In [20]: data.head()
Out[20]: AMZN.O                  HIGH      LOW     OPEN    CLOSE  COUNT  VOLUME
         2020-08-03 08:01:00  3190.00  3176.03  3176.03  3178.17   18.0   383.0
         2020-08-03 08:02:00  3183.02  3176.03  3180.00  3177.01   15.0   513.0
         2020-08-03 08:03:00  3179.91  3177.05  3179.91  3177.05    5.0    14.0
         2020-08-03 08:04:00  3184.00  3179.91  3179.91  3184.00    8.0   102.0
         2020-08-03 08:05:00  3184.91  3182.91  3183.30  3184.00   12.0   403.0

(1) Retrieves one-minute bars with all available fields for a single RIC

One can retrieve more than structured financial time series data from the Eikon Data API. Fundamental data can also be retrieved for a number of RICs and a number of different data fields at the same time, as the following Python code illustrates:

In [21]: data_grid, err = ek.get_data(['AAPL.O', 'IBM', 'GOOG.O', 'AMZN.O'],
                                      ['TR.TotalReturnYTD', 'TR.WACCBeta',
                                       'YRHIGH', 'YRLOW',
                                       'TR.Ebitda', 'TR.GrossProfit'])  # (1)

In [22]: data_grid
Out[22]:   Instrument  YTD Total Return      Beta   YRHIGH      YRLOW        EBITDA  \
         0     AAPL.O         49.141271  1.221249   425.66   192.5800  7.647700e+10
         1        IBM         -5.019570  1.208156   158.75    90.5600  1.898600e+10
         2     GOOG.O         10.278829  1.067084  1586.99  1013.5361  4.757900e+10
         3     AMZN.O         68.406897  1.338106  3344.29  1626.0318  3.025600e+10

            Gross Profit
         0   98392000000
         1   36488000000
         2   89961000000
         3  114986000000

(1) Retrieves data for multiple RICs and multiple data fields

Programmatic Data Availability

Basically all structured financial data is available nowadays in programmatic fashion. Financial time series data, in this context, is the paramount example. However, other structured data types such as fundamental data are available in the same way, simplifying the work of quantitative analysts, traders, portfolio managers, and the like significantly.

Structured Streaming Data

Many applications in finance require real-time structured data, such as in algorithmic trading or market risk management. The following Python code makes use of the API of the Oanda Trading Platform and streams in real time a number of time stamps, bid quotes, and ask quotes for the Bitcoin price in USD:

In [23]: import tpqoa

In [24]: oa = tpqoa.tpqoa('../aiif.cfg')  # (1)

In [25]: oa.stream_data('BTC_USD', stop=5)  # (2)
         2020-08-04T08:30:38.621075583Z 11298.8 11334.8
         2020-08-04T08:30:50.485678488Z 11298.3 11334.3
         2020-08-04T08:30:50.801666847Z 11297.3 11333.3
         2020-08-04T08:30:51.326269990Z 11296.0 11332.0
         2020-08-04T08:30:54.423973431Z 11296.6 11332.6

(1) Connects to the Oanda API

(2) Streams a fixed number of ticks for a given symbol

Printing out the streamed data fields is, of course, only for illustration. Certain financial applications might require sophisticated processing of the retrieved data and the generation of signals or statistics, for instance. Particularly during weekdays and trading hours, the number of price ticks streamed for financial instruments increases steadily, demanding powerful data processing capabilities on the part of financial institutions that need to process such data in real time or at least in near-real time (“near time”).
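A very simple form of such processing can be sketched as follows. The example parses the five tick lines printed above (time stamp, bid, ask) and derives the mid price and the bid-ask spread for each tick. The column names and the hard-coded tick lines are assumptions of this sketch; a real application would process the ticks as they arrive.

```python
import pandas as pd

# tick lines copied from the streaming output above: time stamp, bid, ask
ticks = """2020-08-04T08:30:38.621075583Z 11298.8 11334.8
2020-08-04T08:30:50.485678488Z 11298.3 11334.3
2020-08-04T08:30:50.801666847Z 11297.3 11333.3
2020-08-04T08:30:51.326269990Z 11296.0 11332.0
2020-08-04T08:30:54.423973431Z 11296.6 11332.6"""

# split each line and convert the quotes to floats
rows = [(t, float(b), float(a))
        for t, b, a in (line.split() for line in ticks.splitlines())]
df = pd.DataFrame(rows, columns=['time', 'bid', 'ask'])
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')

df['mid'] = (df['bid'] + df['ask']) / 2  # mid price
df['spread'] = df['ask'] - df['bid']  # bid-ask spread

print(df[['mid', 'spread']])
```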

The significance of this observation becomes clear when looking at Apple Inc. stock prices. One can calculate that there are roughly $252 \cdot 40 = 10{,}080$ EOD closing quotes for the Apple stock over a period of 40 years. (Apple Inc. went public on December 12, 1980.) The following code retrieves tick data for the Apple stock price for one hour only. The retrieved data set, which might not even be complete for the given time interval, has 50,000 data rows, or five times as many tick quotes as the EOD quotes accumulated over 40 years of trading:

In [26]: data = ek.get_timeseries('AAPL.O',
                                  start_date='2020-08-03 15:00:00',
                                  end_date='2020-08-03 16:00:00',
                                  interval='tick')  # (1)

In [27]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 50000 entries, 2020-08-03 15:26:24.889000 to 2020-08-03
         Data columns (total 2 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   VALUE   49953 non-null  float64
          1   VOLUME  50000 non-null  float64
         dtypes: float64(2)
         memory usage: 1.1 MB

In [28]: data.head()
Out[28]: AAPL.O                    VALUE  VOLUME
         2020-08-03 15:26:24.889  439.06   175.0
         2020-08-03 15:26:24.889  439.08     3.0
         2020-08-03 15:26:24.890  439.08   100.0
         2020-08-03 15:26:24.890  439.08     5.0
         2020-08-03 15:26:24.899  439.10    35.0

(1) Retrieves tick data for the Apple stock price

EOD Versus Tick Data

Most of the financial theories still applied today have their origin in a time when EOD data was basically the only type of financial data available. Today, financial institutions, and even retail traders and investors, are confronted with never-ending streams of real-time data. The example of Apple stock illustrates that for a single stock during one trading hour, there might be about five times as many ticks coming in as the amount of EOD data accumulated over a period of 40 years. This not only challenges actors in financial markets, but also puts into question whether existing financial theories can be applied to such an environment at all.
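The arithmetic behind this comparison can be spelled out in a few lines. The figure of 252 trading days per year is the rough convention used in the text, and the row count is taken from the tick data example above.

```python
trading_days = 252  # rough number of trading days per year
years = 40
eod_quotes = trading_days * years  # EOD closing quotes over 40 years

tick_rows = 50_000  # rows in the one-hour tick data set above
ratio = tick_rows / eod_quotes

print(eod_quotes, round(ratio, 2))
```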

Unstructured Historical Data

Many important data sources in finance provide unstructured data only, such as financial news or company filings. Undoubtedly, machines are much better and faster than humans at crunching large amounts of structured, numerical data. However, recent advances in natural language processing (NLP) make machines better and faster at processing financial news too, for example. In 2020, data service providers ingest roughly 1.5 million news articles on a daily basis. It is clear that this vast amount of text-based data cannot be processed properly by human beings.

Fortunately, unstructured data is also to a large extent available these days via programmatic APIs. The following Python code retrieves a number of news articles from the Eikon Data API related to the company Tesla, Inc. and its production. One article is selected and shown in full:

In [29]: news = ek.get_news_headlines('R:TSLA.O PRODUCTION',
                                 )  # (1)

In [30]: news
Out[30]:                                           versionCreated  \
         2020-07-29 11:02:31.276 2020-07-29 11:02:31.276000+00:00
         2020-07-28 00:59:48.000        2020-07-28 00:59:48+00:00
         2020-07-23 21:20:36.090 2020-07-23 21:20:36.090000+00:00
         2020-07-23 08:22:17.000        2020-07-23 08:22:17+00:00
         2020-07-23 07:08:48.000        2020-07-23 07:46:56+00:00
         2020-07-23 00:55:54.000        2020-07-23 00:55:54+00:00
         2020-07-22 21:35:42.640 2020-07-22 22:13:26.597000+00:00

                                                                          text  \
         2020-07-29 11:02:31.276  Tesla Launches Hiring Spree in China as It Pre...
         2020-07-28 00:59:48.000    Tesla hiring in Shanghai as production ramps up
         2020-07-23 21:20:36.090     Tesla speeds up Model 3 production in Shanghai
         2020-07-23 08:22:17.000  UPDATE 1-'Please mine more nickel,' Musk urges...
         2020-07-23 07:08:48.000  'Please mine more nickel,' Musk urges as Tesla...
         2020-07-23 00:55:54.000  USA-Tesla choisit le Texas pour la production ...
         2020-07-22 21:35:42.640  TESLA INC - THE REAL LIMITATION ON TESLA GROWT...

                                                                       storyId  \
         2020-07-29 11:02:31.276
         2020-07-28 00:59:48.000
         2020-07-23 21:20:36.090
         2020-07-23 08:22:17.000
         2020-07-23 07:08:48.000
         2020-07-23 00:55:54.000
         2020-07-22 21:35:42.640

                                 sourceCode
         2020-07-29 11:02:31.276  NS:CAIXIN
         2020-07-28 00:59:48.000    NS:RTRS
         2020-07-23 21:20:36.090  NS:SOUTHC
         2020-07-23 08:22:17.000    NS:RTRS
         2020-07-23 07:08:48.000    NS:RTRS
         2020-07-23 00:55:54.000    NS:RTRS
         2020-07-22 21:35:42.640    NS:RTRS

In [31]: storyId = news['storyId'][1]  # (2)

In [32]: from IPython.display import HTML

In [33]: HTML(ek.get_news_story(storyId)[:1148])  # (3)
Out[33]: <IPython.core.display.HTML object>
Jan 06, 2020

Tesla, Inc.TSLA registered record production and deliveries of 104,891 and
112,000 vehicles, respectively, in the fourth quarter of 2019.

Notably, the company's Model S/X and Model 3 reported record production and
deliveries in the fourth quarter. The Model S/X division recorded production
and delivery volume of 17,933 and 19,450 vehicles, respectively. The Model 3
division registered production of 86,958 vehicles, while 92,550 vehicles were

In 2019, Tesla delivered 367,500 vehicles, reflecting an increase of 50%, year
over year, and nearly in line with the company's full-year guidance of 360,000

(1) Retrieves metadata for a number of news articles that fall in the parameter range

(2) Selects one storyId for which to retrieve the full text

(3) Retrieves the full text for the selected article and shows it

Unstructured Streaming Data

In the same way that historical unstructured data is retrieved, programmatic APIs can be used to stream unstructured news data, for example, in real time or at least near time. One such API is available for DNA: the Data, News, Analytics platform from Dow Jones. Figure 4-1 shows a screenshot of a web application that streams “Commodity and Financial News” articles and processes these with NLP techniques in real time.

Figure 4-1. News-streaming application based on DNA (Dow Jones)

The news-streaming application has the following main features:

Full text

The full text of each article is available by clicking on the article header.

Keyword summary

A keyword summary is created and printed on the screen.

Sentiment analysis

Sentiment scores are calculated and visualized as colored arrows. Details become visible through a click on the arrows.

Word cloud

A word cloud summary bitmap is created, shown as a thumbnail and visible after a click on the thumbnail (see Figure 4-2).

Figure 4-2. Word cloud bitmap shown in news-streaming application
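How the application derives its keyword summaries is not shown here. As a rough illustration of the idea only, the following sketch counts word frequencies over two of the Tesla headlines retrieved earlier. The tokenization rule and the small stop-word list are hypothetical choices made for this sketch.

```python
import re
from collections import Counter

# two untruncated headlines copied from the news output above
headlines = [
    'Tesla hiring in Shanghai as production ramps up',
    'Tesla speeds up Model 3 production in Shanghai',
]

# small, hypothetical stop-word list for illustration
stop_words = {'in', 'as', 'up', 'the', 'a', 'of', 'to'}

# lowercase, split into alphanumeric tokens, and drop stop words
tokens = [w for h in headlines
          for w in re.findall(r'[a-z0-9]+', h.lower())
          if w not in stop_words]

keywords = Counter(tokens).most_common(3)  # three most frequent tokens
print(keywords)
```

The same word counts are also the typical input for a word cloud, in which more frequent words are drawn larger.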

Alternative Data

Nowadays, financial institutions, and in particular hedge funds, systematically mine a number of alternative data sources to gain an edge in trading and investing. A recent article by Bloomberg lists, among others, the following alternative data sources:

  • Web-scraped data

  • Crowd-sourced data

  • Credit cards and point-of-sale (POS) systems

  • Social media sentiment

  • Search trends

  • Web traffic

  • Supply chain data

  • Energy production data

  • Consumer profiles

  • Satellite imagery/geospatial data

  • App installs

  • Ocean vessel tracking

  • Wearables, drones, Internet of Things (IoT) sensors

In the following, the usage of alternative data is illustrated by two examples. The first retrieves and processes Apple Inc. press releases in the form of HTML pages. The following Python code makes use of a set of helper functions as shown in “Python Code”. In the code, a list of URLs is defined, each representing an HTML page with a press release from Apple Inc. The raw HTML code is then retrieved for each press release. Then the raw code is cleaned up, and an excerpt for one press release is printed:

In [34]: import nlp  # (1)
         import requests

In [35]: sources = [
             '',  # iPad Pro
             '',  # MacBook Air
             '',  # Mac Mini
         ]  # (2)

In [36]: html = [requests.get(url).text for url in sources]  # (3)

In [37]: data = [nlp.clean_up_text(t) for t in html]  # (4)

In [38]: data[0][536:1001]  # (5)
Out[38]: ' display, powerful a12x bionic chip and face id introducing the new ipad pro
          with all-screen design and next-generation performance. new york apple today
          introduced the new ipad pro with all-screen design and next-generation
          performance, marking the biggest change to ipad ever. the all-new design
          pushes 11-inch and 12.9-inch liquid retina displays to the edges of ipad pro
          and integrates face id to securely unlock ipad with just a glance.1 the a12x
          bionic chip w'

(1) Imports the NLP helper functions

(2) Defines the URLs for the three press releases

(3) Retrieves the raw HTML codes for the three press releases

(4) Cleans up the raw HTML codes (for example, HTML tags are removed)

(5) Prints an excerpt from one press release
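The helper function `nlp.clean_up_text()` itself is shown in “Python Code”. To give an idea of what such a clean-up step may involve, the following is a minimal sketch of a comparable function based on regular expressions. The function name `clean_up_html` and the sample HTML string are hypothetical, and the sketch is not the implementation used in this chapter.

```python
import re
from html import unescape


def clean_up_html(raw):
    ''' Hypothetical helper: removes scripts, styles, and tags;
    returns lowercased plain text. '''
    # drop script and style elements including their content
    text = re.sub(r'(?is)<(script|style).*?>.*?</\1>', ' ', raw)
    text = re.sub(r'(?s)<[^>]+>', ' ', text)  # remove remaining tags
    text = unescape(text)  # convert entities such as &amp;
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace
    return text.strip().lower()


# made-up HTML snippet for illustration
raw = ('<html><head><style>p {color: red;}</style></head>'
       '<body><h1>Apple introduces the new iPad Pro</h1>'
       '<p>All-screen design &amp; next-generation performance.</p>'
       '</body></html>')

print(clean_up_html(raw))
```

For real-world pages, a dedicated HTML parser is generally more robust than regular expressions; the sketch only illustrates the principle.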

Of course, defining alternative data as broadly as is done in this section implies that there is a limitless amount of data that one can retrieve and process for financial purposes. At its core, this is the business of search engines such as the one from Google LLC. In a financial context, it would be of paramount importance to specify exactly what unstructured alternative data sources to tap into.

The second example is about the retrieval of data from the social network Twitter, Inc. To this end, Twitter provides API access to tweets on its platform, provided one has set up a Twitter account appropriately. The following Python code connects to the Twitter API and retrieves and prints the five most recent tweets from my home timeline and user timeline, respectively:

In [39]: from twitter import Twitter, OAuth

In [40]: t = Twitter(auth=OAuth(c['twitter']['access_token'],
                                ...),  # remaining OAuth credentials omitted
                     retry=True)  # (1)

In [41]: l = t.statuses.home_timeline(count=5)  # (2)

In [42]: for e in l:
             print(e['text'])
         The Bank of England is effectively subsidizing polluting industries in its
          pandemic rescue program, a think tank sa
         Cool shared task: mining scientific contributions (by @SeeTedTalk @SoerenAuer
          and Jennifer D'Souza)
         Twelve people were hospitalized in Wyoming on Monday after a hot air balloon
          crash, officials said.

         Three hot air
         President Trump directed controversial Pentagon pick into new role with
          similar duties after nomination failed
         Company announcement: Revolut launches Open Banking for its 400,000 Italian...

In [43]: l = t.statuses.user_timeline(screen_name='dyjh', count=5)  # (3)

In [44]: for e in l:
             print(e['text'])
         #Python for #AlgoTrading (focus on the process) &amp; #AI in #Finance (focus
          on prediction methods) will complement eac
         Currently putting finishing touches on #AI in #Finance (@OReillyMedia). Book
          going into production shortly.
         Chinatown Is Coming Back, One Noodle at a Time
         Alt data industry balloons as hedge funds strive for Covid edge via @FT |
         "We remain of the view that alternative d…
         @Wolf_Of_BTC Just follow me on Twitter (or LinkedIn). Then you will notice for
          sure when it is out.

(1) Connects to the Twitter API

(2) Retrieves and prints five (most recent) tweets from home timeline

(3) Retrieves and prints five (most recent) tweets from user timeline

The Twitter API also allows for searches, based on which the most recent tweets can be retrieved and processed:

In [45]: d = t.search.tweets(q='#Python', count=7)  # (1)

In [46]: for e in d['statuses']:
             print(e['text'])
         RT @KirkDBorne: #AI is Reshaping Programming — Tips on How to Stay on Top:

         1: #MachineLearning — Jupyte…
         RT @reuvenmlerner: Today, a #Python student's code didn't print:

         x = 5
         if x == 5:
             print: ('yes!')

         There was a typo, namely : after pr
         RT @GavLaaaaaaaa: Javascript Does Not Need a StringBuilder
 #programming #softwareengineering #bigdata
         RT @CodeFlawCo: It is necessary to publish regular updates on Twitter
          #programmer #coder #developer #technology RT @pak_aims: Learning to C…
         RT @GavLaaaaaaaa: Javascript Does Not Need a StringBuilder
 #programming #softwareengineering #bigdata

(1) Searches for tweets with the hashtag “Python” and prints the most recent ones

One can also collect a larger number of tweets from a Twitter user and create a summary in the form of a word cloud (see Figure 4-3). The following Python code again makes use of the NLP helper functions as shown in “Python Code”:

In [47]: l = t.statuses.user_timeline(screen_name='elonmusk', count=50)  1

In [48]: tl = [e['text'] for e in l]  2

In [49]: tl[:5]  3
Out[49]: ['@flcnhvy @Lindw0rm @cleantechnica True',
          '@Lindw0rm @cleantechnica Highly likely down the road',
          '@cleantechnica True fact',
         '@NASASpaceflight Scrubbed for the day. A Raptor turbopump spin start valve
          didnt open, triggering an automatic abo',
          '@Erdayastronaut I’m in the Boca control room. Hop attempt in ~33 minutes.']

In [50]: wc = nlp.generate_word_cloud(' '.join(tl), 35,
                     )  4

Retrieves the 50 most recent tweets for the user elonmusk


Collects the texts in a list object


Shows the texts of the first five tweets in the list


Generates a word cloud summary and shows it

Figure 4-3. Word cloud as summary for larger number of tweets

Once a financial practitioner defines the “relevant financial data” to go beyond structured financial time series data, the data sources seem limitless in terms of volume, variety, and velocity. The tweets are retrieved from the Twitter API in near real time, since the examples access the most recent tweets. These and similar API-based data sources therefore provide a never-ending stream of alternative data for which, as previously pointed out, it is important to specify exactly what one is looking for. Otherwise, any financial data science effort might easily drown in too much or too noisy data.

Normative Theories Revisited

Chapter 3 introduces normative financial theories such as the MVP theory or the CAPM. For quite a long time, students and academics learning and studying such theories were more or less constrained to the theory itself. With all the available financial data, as discussed and illustrated in the previous section, in combination with powerful open source software for data analysis—such as Python, NumPy, pandas, and so on—it has become pretty easy and straightforward to put financial theories to real-world tests. It does not require small teams and larger studies anymore to do so. A typical notebook, internet access, and a standard Python environment suffice. This is what this section is about. However, before diving into data-driven finance, the following sub-section discusses briefly some famous paradoxes in the context of EUT and how corporations model and predict the behavior of individuals in practice.

Expected Utility and Reality

In economics, risk describes a situation in which possible future states and probabilities for those states to unfold are known in advance to the decision maker. This is the standard assumption in finance and the context of EUT. On the other hand, ambiguity describes situations in economics in which probabilities, or even possible future states, are not known in advance to a decision maker. Uncertainty subsumes the two different decision-making situations.

There is a long tradition of analyzing the concrete decision-making behavior of individuals (“agents”) under uncertainty. Innumerable studies and experiments have been conducted to observe and analyze how agents behave when faced with uncertainty as compared to what theories such as EUT predict. For centuries, paradoxa have played an important role in decision-making theory and research.

One such paradox, the St. Petersburg paradox, gave rise to the invention of utility functions and EUT in the first place. Daniel Bernoulli presented the paradox—and a solution to it—in 1738. The paradox is based on the following coin tossing game G . An agent is faced with a game during which a (perfect) coin is tossed potentially infinitely many times. If after the first toss heads prevails, the agent receives a payoff of 1 (currency unit). As long as heads is observed, the coin is tossed again. Otherwise the game ends. If heads prevails a second time, the agent receives an additional payoff of 2. If it does a third time, the additional payoff is 4. For the fourth time it is 8, and so on. This is a situation of risk since all possible future states, as well as their associated probabilities, are known in advance.

The expected payoff of this game is infinite. This can be seen from the following infinite sum of which every element is strictly positive:

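$$\mathbf{E}(G) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 4 + \frac{1}{16} \cdot 8 + \dots = \sum_{k=1}^{\infty} \frac{1}{2^k} 2^{k-1} = \sum_{k=1}^{\infty} \frac{1}{2} = \infty$$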

However, faced with such a game, a decision maker in general would be willing to pay a finite sum only to play the game. A major reason for this is the fact that relatively large payoffs only happen with a relatively small probability. Consider the potential payoff W = 511 :

W = 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 = 511

The probability of winning such a payoff is pretty low. To be exact, a payoff of at least $W$ requires nine heads in a row, so that $P(x \geq W) = \frac{1}{512} = 0.001953125$. The probability for a smaller payoff, on the other hand, is pretty high:

$$P(x < W) = \sum_{k=1}^{9} \frac{1}{2^k} = 0.998046875$$

In other words, in 998 out of 1,000 games the payoff is smaller than 511. Therefore, an agent would probably not wager much more than 511 to play this game. The way out of this paradox is the introduction of a utility function with positive but decreasing marginal utility. In the context of the St. Petersburg paradox, this means that there is a function $u: \mathbb{R}_+ \to \mathbb{R}$ that assigns to every positive payoff $x$ a real value $u(x)$. Positive but decreasing marginal utility then formally translates into the following:

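$$\frac{\partial u}{\partial x} > 0, \qquad \frac{\partial^2 u}{\partial x^2} < 0$$

As seen in Chapter 3, one such candidate function is $u(x) = \ln(x)$ with:

$$\frac{\partial u}{\partial x} = \frac{1}{x}, \qquad \frac{\partial^2 u}{\partial x^2} = -\frac{1}{x^2}$$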

The expected utility then is finite, as the calculation of the following infinite sum illustrates:

$$\mathbf{E}\left(u(G)\right) = \sum_{k=1}^{\infty} \frac{1}{2^k} u\left(2^{k-1}\right) = \sum_{k=1}^{\infty} \frac{\ln 2^{k-1}}{2^k} = \sum_{k=1}^{\infty} \frac{(k-1)}{2^k} \cdot \ln(2) = \ln(2) < \infty$$

The expected utility of $\ln(2) = 0.693147$ is obviously a pretty small number in comparison to the expected payoff of infinity. Bernoulli utility functions and EUT resolve the St. Petersburg paradox.
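As a quick numerical check of the two infinite sums above, the following minimal sketch evaluates their partial sums. The function names `expected_payoff` and `expected_log_utility` are illustrative only and the truncation points are arbitrary choices.

```python
import math


def expected_payoff(n):
    """Partial sum of the expected payoff over the first n terms."""
    return sum(0.5 ** k * 2 ** (k - 1) for k in range(1, n + 1))


def expected_log_utility(n):
    """Partial sum of the expected utility for u(x) = ln(x)."""
    return sum(0.5 ** k * math.log(2 ** (k - 1)) for k in range(1, n + 1))


for n in (10, 50, 200):
    # the payoff sum keeps growing, the utility sum levels off
    print(f'{n:4d} | payoff: {expected_payoff(n):7.1f} | '
          f'utility: {expected_log_utility(n):.6f}')
```

Every term of the payoff sum contributes exactly 1/2, so the partial sums grow linearly in the number of terms, whereas the utility sum approaches ln(2).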

Other paradoxa, such as the Allais paradox published in Allais (1953), address the EUT itself. This paradox is based on an experiment with four different games that test subjects should rank. Table 4-2 shows the four games $(A, B, A', B')$. The ranking is to be done for the two pairs $(A, B)$ and $(A', B')$. The independence axiom postulates that the first row in the table should not have any influence on the ordering of $(A', B')$ since the payoff is the same for both games.

Table 4-2. Games in Allais paradox
Probability   Game A   Game B   Game A’   Game B’
0.66          2,400    2,400    0         0
0.33          2,500    2,400    2,500     2,400
0.01          0        2,400    0         2,400

In experiments, the majority of decision makers rank the games as follows: $B \succ A$ and $A' \succ B'$. The ranking $B \succ A$ leads to the following inequalities, where $u_1 \equiv u(2400)$, $u_2 \equiv u(2500)$, $u_3 \equiv u(0)$:

$$u_1 > 0.66 \cdot u_1 + 0.33 \cdot u_2 + 0.01 \cdot u_3 \quad \Rightarrow \quad 0.34 \cdot u_1 > 0.33 \cdot u_2 + 0.01 \cdot u_3$$

The ranking $A' \succ B'$ in turn leads to the following inequalities:

$$0.33 \cdot u_2 + 0.01 \cdot u_3 > 0.33 \cdot u_1 + 0.01 \cdot u_1 \quad \Rightarrow \quad 0.34 \cdot u_1 < 0.33 \cdot u_2 + 0.01 \cdot u_3$$

These inequalities obviously contradict each other and lead to the Allais paradox. One possible explanation is that decision makers in general value certainty higher than the typical models, such as EUT, predict. Most people would probably rather choose to receive $1 million with certainty than play a game in which they can win $100 million with a probability of 5%, although there are a number of suitable utility functions available that under EUT would have the decision maker choose the game instead of the certain amount.
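The contradiction between the two sets of inequalities above can also be checked numerically. The following sketch is an illustration under stated assumptions: the payoffs 2400, 2500, and 0 and the probabilities follow the inequalities in the text, the zero payoff in the common 0.66 state of A' and B' is the standard setup for this experiment, and the candidate utility functions are arbitrary choices. For every utility function, the expected utility difference is identical for both pairs, so EUT can never rank B above A and A' above B' at the same time.

```python
import math

# Lotteries as lists of (probability, payoff) pairs
games = {
    'A': [(0.66, 2400), (0.33, 2500), (0.01, 0)],
    'B': [(1.00, 2400)],
    "A'": [(0.33, 2500), (0.67, 0)],  # payoff 0 in the common 0.66 state
    "B'": [(0.34, 2400), (0.66, 0)],  # payoff 0 in the common 0.66 state
}

# A few increasing candidate utility functions (illustrative choices)
utilities = {
    'linear': lambda x: x,
    'sqrt': math.sqrt,
    'log1p': math.log1p,
    'exponential': lambda x: 1 - math.exp(-x / 1000),
}


def expected_utility(game, u):
    """Expected utility of a lottery under utility function u."""
    return sum(p * u(x) for p, x in game)


diffs = {}
for name, u in utilities.items():
    d1 = expected_utility(games['B'], u) - expected_utility(games['A'], u)
    d2 = expected_utility(games["B'"], u) - expected_utility(games["A'"], u)
    diffs[name] = (d1, d2)
    print(f'{name:12s} | EU(B) - EU(A): {d1:9.4f} | '
          f"EU(B') - EU(A'): {d2:9.4f}")
```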

Another explanation lies in framing decisions and the psychology of decision makers. It is well known that more people would accept a surgery if it has a “95% chance of success” than a “5% chance of death.” Simply changing the wording might lead to behavior that is inconsistent with decision-making theories such as EUT.

Another famous paradox addressing shortcomings of EUT in its subjective form, according to Savage (1954, 1972), is the Ellsberg paradox, which dates back to the seminal paper by Ellsberg (1961). It addresses the importance of ambiguity in many real-world decision situations. A standard setting for this paradox comprises two different urns, both of which contain exactly 100 balls. For urn 1, it is known that it contains exactly 50 black and 50 red balls. For urn 2, it is only known that it contains black and red balls but not in which proportion.

Test subjects can choose among the following game options:

  • Game 1: red 1, black 1, or indifferent

  • Game 2: red 2, black 2, or indifferent

  • Game 3: red 1, red 2, or indifferent

  • Game 4: black 1, black 2, or indifferent

Here, “red 1,” for example, means that a red ball is drawn from urn 1. Typically, a test subject would answer as follows:

  • Game 1: indifferent

  • Game 2: indifferent

  • Game 3: red 1

  • Game 4: black 1

This set of decisions—which is not the only one to be observed but is a common one—exemplifies what is called ambiguity aversion. Since the probabilities for black and red balls, respectively, are not known for urn 2, decision makers prefer a situation of risk instead of ambiguity.
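Why these answers conflict with subjective EUT can be sketched in a few lines. Assume, for illustration only, that the agent holds a single subjective probability for drawing a red ball from urn 2. Strictly preferring “red 1” in game 3 requires that this probability is below 0.5, while strictly preferring “black 1” in game 4 requires that it is above 0.5. The grid of candidate beliefs below is an arbitrary choice.

```python
import numpy as np

p_red_1 = 0.5  # known probability of red in urn 1
p_grid = np.linspace(0, 1, 101)  # candidate beliefs for red in urn 2

prefers_red_1 = p_red_1 > p_grid  # game 3: red 1 strictly preferred
prefers_black_1 = (1 - p_red_1) > (1 - p_grid)  # game 4: black 1 preferred

# Beliefs that are consistent with both observed choices
consistent = p_grid[prefers_red_1 & prefers_black_1]
print(len(consistent))
```

No candidate belief satisfies both conditions, which is why the observed choices cannot be explained by a single subjective probability distribution.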

The two paradoxa of Allais and Ellsberg show that real test subjects quite often behave contrary to what well-established decision theories in economics predict. In other words, human beings as decision makers can in general not be compared to machines that carefully collect data and then crunch the numbers to make a decision under uncertainty, be it in the form of risk or ambiguity. Human behavior is more complex than most, if not all, theories currently suggest. How difficult and complex it can be to explain human behavior is clear after reading, for example, the 800-page book Behave by Sapolsky (2018). It covers multiple facets of this topic, ranging from biochemical processes to genetics, human evolution, tribes, language, religion, and more, in an integrative manner.

If standard economic decision paradigms such as EUT do not explain real-world decision making too well, what alternatives are available? Economic experiments that build the basis for the Allais and Ellsberg paradoxa are a good starting point in learning how decision makers behave in specific, controlled situations. Such experiments and their sometimes surprising and paradoxical results have indeed motivated a great number of researchers to come up with alternative theories and models that resolve the paradoxa. The book The Experiment in the History of Economics by Fontaine and Leonard (2005) is about the historical role of experiments in economics. There is, for example, a whole string of literature that addresses issues arising from the Ellsberg paradox. This literature deals with, among other topics, nonadditive probabilities, Choquet integrals, and decision heuristics such as maximizing the minimum payoff (“max-min”) or minimizing the maximum loss (“min-max”). These alternative approaches have proven superior to EUT, at least in certain decision-making scenarios. But they are far from being mainstream in finance.

What, after all, has proven to be useful in practice? Not too surprisingly, the answer lies in data and machine learning algorithms. The internet, with its billions of users, generates a treasure trove of data describing real-world human behavior, or what is sometimes called revealed preferences. The big data generated on the web has a scale that is multiple orders of magnitude larger than what single experiments can generate. Companies such as Amazon, Facebook, Google, and Twitter are able to make billions of dollars by recording user behavior (that is, their revealed preferences) and capitalizing on the insights generated by ML algorithms trained on this data.

The default ML approach taken in this context is supervised learning. The algorithms themselves are in general theory- and model-free; variants of neural networks are often applied. Therefore, when companies today predict the behavior of their users or customers, more often than not a model-free ML algorithm is deployed. Traditional decision theories like EUT or one of its successors generally do not play a role at all. This makes it somewhat surprising that such theories still, at the beginning of the 2020s, are a cornerstone of most economic and financial theories applied in practice. And this is not even to mention the large number of financial textbooks that cover traditional decision theories in detail. If one of the most fundamental building blocks of financial theory seems to lack meaningful empirical support or practical benefits, what about the financial models that build on top of it? More on this appears in subsequent sections and chapters.

Data-Driven Predictions of Behavior

Standard economic decision theories are intellectually appealing to many, even to those who, faced with a concrete decision under uncertainty, would behave in contrast to the theories’ predictions. On the other hand, big data and model-free, supervised learning approaches prove useful and successful in practice for predicting user and customer behavior. In a financial context, this might imply that one should not really worry about why and how financial agents decide the way they decide. One should rather focus on their indirectly revealed preferences based on features data (new information) that describes the state of a financial market and labels data (outcomes) that reflects the impact of the decisions made by financial agents. This leads to a data-driven instead of a theory- or model-driven view of decision making in financial markets. Financial agents become data-processing organisms that can be much better modeled, for example, by complex neural networks than, say, a simple utility function in combination with an assumed probability distribution.

Mean-Variance Portfolio Theory

Assume a data-driven investor wants to apply MVP theory to invest in a portfolio of technology stocks and wants to add a gold-related exchange-traded fund (ETF) for diversification. Probably, the investor would access relevant historical price data via an API to a trading platform or a data provider. To make the following analysis reproducible, it relies on a CSV data file stored in a remote location. The following Python code retrieves the data file, selects a number of symbols given the investor’s goal, and calculates log returns from the price time series data. Figure 4-4 compares the normalized price time series for the selected symbols:

In [51]: import numpy as np
         import pandas as pd
         from pylab import plt, mpl
         from scipy.optimize import minimize
         plt.style.use('seaborn')  # named 'seaborn-v0_8' in Matplotlib 3.6+
         mpl.rcParams['savefig.dpi'] = 300
         mpl.rcParams['font.family'] = 'serif'
         np.set_printoptions(precision=5, suppress=True,
                            formatter={'float': lambda x: f'{x:6.3f}'})

In [52]: url = ''  1

In [53]: raw = pd.read_csv(url, index_col=0, parse_dates=True).dropna()  1

In [54]: raw.info()  1
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2516 entries, 2010-01-04 to 2019-12-31
         Data columns (total 12 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   AAPL.O  2516 non-null   float64
          1   MSFT.O  2516 non-null   float64
          2   INTC.O  2516 non-null   float64
          3   AMZN.O  2516 non-null   float64
          4   GS.N    2516 non-null   float64
          5   SPY     2516 non-null   float64
          6   .SPX    2516 non-null   float64
          7   .VIX    2516 non-null   float64
          8   EUR=    2516 non-null   float64
          9   XAU=    2516 non-null   float64
          10  GDX     2516 non-null   float64
          11  GLD     2516 non-null   float64
         dtypes: float64(12)
         memory usage: 255.5 KB

In [55]: symbols = ['AAPL.O', 'MSFT.O', 'INTC.O', 'AMZN.O', 'GLD']  2

In [56]: rets = np.log(raw[symbols] / raw[symbols].shift(1)).dropna()  3

In [57]: (raw[symbols] / raw[symbols].iloc[0]).plot(figsize=(10, 6));  4

Retrieves historical EOD data from a remote location


Specifies the symbols (RICs) to be invested in


Calculates the log returns for all time series


Plots the normalized financial time series for the selected symbols

Figure 4-4. Normalized financial time series data
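The two transformations used above, log returns and normalization to a common starting value, can be illustrated on synthetic data. The following is a minimal sketch; the price paths and the symbol names `SYM1` and `SYM2` are hypothetical and not market data.

```python
import numpy as np
import pandas as pd

# Synthetic price paths for two hypothetical symbols
rng = np.random.default_rng(42)
steps = rng.normal(0.0003, 0.01, size=(250, 2))
prices = pd.DataFrame(100 * np.exp(steps.cumsum(axis=0)),
                      columns=['SYM1', 'SYM2'])

rets = np.log(prices / prices.shift(1)).dropna()  # daily log returns
normalized = prices / prices.iloc[0]  # every series starts at 1.0

# Log returns are additive over time: their sum equals the log of the
# ratio of the final price to the initial price
total = rets.sum()
check = np.log(prices.iloc[-1] / prices.iloc[0])
print(np.allclose(total, check))
```

The additivity over time is one reason why log returns are convenient for the annualization by a factor of 252 used in the following calculations.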

The data-driven investor wants to first set a baseline for performance as given by an equally weighted portfolio over the whole period of the available data. To this end, the following Python code defines functions to calculate the portfolio return, the portfolio volatility, and the portfolio Sharpe ratio given a set of weights for the selected symbols:

In [58]: weights = len(rets.columns) * [1 / len(rets.columns)]  1

In [59]: def port_return(rets, weights):
             return np.dot(rets.mean(), weights) * 252  2

In [60]: port_return(rets, weights)  2
Out[60]: 0.15694764653018106

In [61]: def port_volatility(rets, weights):
             return np.dot(weights, np.dot(rets.cov() * 252, weights)) ** 0.5  3

In [62]: port_volatility(rets, weights)  3
Out[62]: 0.16106507848480675

In [63]: def port_sharpe(rets, weights):
             return port_return(rets, weights) / port_volatility(rets, weights)  4

In [64]: port_sharpe(rets, weights)  4
Out[64]: 0.97443622172255

Equally weighted portfolio


Portfolio return


Portfolio volatility


Portfolio Sharpe ratio (with zero short rate)
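The portfolio statistics can be cross-checked against the return series of the portfolio itself. The following self-contained sketch assumes the usual annualized definitions (mean return times 252, and the square root of the weighted covariance times 252) and uses synthetic returns for three hypothetical assets `X`, `Y`, and `Z`.

```python
import numpy as np
import pandas as pd

# Synthetic daily log returns for three hypothetical assets
rng = np.random.default_rng(100)
rets = pd.DataFrame(rng.normal(0.0005, 0.01, size=(252, 3)),
                    columns=['X', 'Y', 'Z'])
weights = np.array(3 * [1 / 3])  # equally weighted portfolio


def port_return(rets, weights):
    # annualized mean return of the portfolio
    return np.dot(rets.mean(), weights) * 252


def port_volatility(rets, weights):
    # annualized volatility from the covariance matrix
    return np.dot(weights, np.dot(rets.cov() * 252, weights)) ** 0.5


# Cross-check against the daily return series of the portfolio itself
port_rets = rets.dot(weights)
print(np.isclose(port_return(rets, weights), port_rets.mean() * 252))
print(np.isclose(port_volatility(rets, weights),
                 port_rets.std() * 252 ** 0.5))
```

Both checks hold because the mean and the variance of a weighted sum of returns follow directly from the means and the covariance matrix of the single assets.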

The investor also wants to analyze which combinations of portfolio risk and return—and consequently Sharpe ratio—are roughly possible by applying Monte Carlo simulation to randomize the portfolio weights. Short sales are excluded, and the portfolio weights are assumed to add up to 100%. The following Python code implements the simulation and visualizes the results (see Figure 4-5):

In [65]: w = np.random.random((1000, len(symbols)))  1
         w = (w.T / w.sum(axis=1)).T  1

In [66]: w[:5]  1
Out[66]: array([[ 0.184,  0.157,  0.227,  0.353,  0.079],
                [ 0.207,  0.282,  0.258,  0.023,  0.230],
                [ 0.313,  0.284,  0.051,  0.340,  0.012],
                [ 0.238,  0.181,  0.145,  0.191,  0.245],
                [ 0.246,  0.256,  0.315,  0.181,  0.002]])

In [67]: pvr = [(port_volatility(rets[symbols], weights),
                 port_return(rets[symbols], weights))
                for weights in w]  2
         pvr = np.array(pvr)  2

In [68]: psr = pvr[:, 1] / pvr[:, 0]  3

In [69]: plt.figure(figsize=(10, 6))
         fig = plt.scatter(pvr[:, 0], pvr[:, 1],
                           c=psr, cmap='coolwarm')
         cb = plt.colorbar(fig)
         cb.set_label('Sharpe ratio')
         plt.xlabel('expected volatility')
         plt.ylabel('expected return')
         plt.title(' | '.join(symbols));

Simulates portfolio weights adding up to 100%


Derives the resulting portfolio volatilities and returns


Calculates the resulting Sharpe ratios

Figure 4-5. Simulated portfolio volatilities, returns, and Sharpe ratios
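Normalizing uniform random numbers, as done above, yields valid long-only weights, but the resulting weight vectors are not spread evenly over all admissible combinations. If an even coverage matters, one alternative (not the approach used in this chapter) is to draw the weights from a flat Dirichlet distribution. A minimal sketch, with the seed and sample size chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1000)
n_assets = 5  # number of assets, as in the example above

# Each row is one long-only weight vector drawn from a flat Dirichlet
w = rng.dirichlet(np.ones(n_assets), size=1000)

print(w.shape)
print(np.allclose(w.sum(axis=1), 1.0))  # weights add up to 100%
print((w >= 0).all())  # no short sales
```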

The data-driven investor now wants to backtest the performance of a portfolio that was set up at the beginning of 2011. The optimal portfolio composition was derived from the financial time series data available from 2010. At the beginning of 2012, the portfolio composition was adjusted given the available data from 2011, and so on. To this end, the following Python code derives the portfolio weights for every relevant year that maximize the Sharpe ratio:

In [70]: bnds = len(symbols) * [(0, 1),]  1
         bnds  1
Out[70]: [(0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

In [71]: cons = {'type': 'eq', 'fun': lambda weights: weights.sum() - 1}  2

In [72]: opt_weights = {}
         for year in range(2010, 2019):
             rets_ = rets[symbols].loc[f'{year}-01-01':f'{year}-12-31']  3
             ow = minimize(lambda weights: -port_sharpe(rets_, weights),
                           len(symbols) * [1 / len(symbols)],
                           bounds=bnds,
                           constraints=cons)['x']  4
             opt_weights[year] = ow  5

In [73]: opt_weights  5
Out[73]: {2010: array([ 0.366,  0.000,  0.000,  0.056,  0.578]),
          2011: array([ 0.543,  0.000,  0.077,  0.000,  0.380]),
          2012: array([ 0.324,  0.000,  0.000,  0.471,  0.205]),
          2013: array([ 0.012,  0.305,  0.219,  0.464,  0.000]),
          2014: array([ 0.452,  0.115,  0.419,  0.000,  0.015]),
          2015: array([ 0.000,  0.000,  0.000,  1.000,  0.000]),
          2016: array([ 0.150,  0.260,  0.000,  0.058,  0.533]),
          2017: array([ 0.231,  0.203,  0.031,  0.109,  0.426]),
          2018: array([ 0.000,  0.295,  0.000,  0.705,  0.000])}

Specifies the bounds for the single asset weights


Specifies that all weights need to add up to 100%


Selects the relevant data set for the given year


Derives the portfolio weights that maximize the Sharpe ratio


Stores these weights in a dict object

The optimal portfolio compositions as derived for the relevant years illustrate that MVP theory in its original form quite often leads to (relatively) extreme situations in the sense that one or more assets are not included at all or that even a single asset makes up 100% of the portfolio. Of course, this can be actively avoided by setting, for example, a minimum weight for every asset considered. The results also indicate that this approach leads to significant rebalancings in the portfolio, driven by the previous year’s realized statistics and correlations.
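The role of the bounds and the constraint in such an optimization can be seen in a self-contained sketch. It uses synthetic returns for three hypothetical assets and explicitly selects SciPy's SLSQP method; the seed and the return parameters are arbitrary assumptions. Raising the lower bound from 0 to, say, 0.05 would impose a minimum weight for every asset.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic daily log returns for three hypothetical assets
rng = np.random.default_rng(7)
rets = rng.normal([0.0010, 0.0006, 0.0003], [0.015, 0.010, 0.005],
                  size=(252, 3))
mu = rets.mean(axis=0) * 252  # annualized mean returns
cov = np.cov(rets, rowvar=False) * 252  # annualized covariance matrix


def neg_sharpe(weights):
    # negative Sharpe ratio (zero short rate) for use with a minimizer
    return -np.dot(mu, weights) / np.sqrt(weights @ cov @ weights)


n = len(mu)
start = np.array(n * [1 / n])  # equally weighted starting point
bnds = n * [(0, 1)]  # no short sales, no leverage
cons = {'type': 'eq', 'fun': lambda weights: weights.sum() - 1}

opt = minimize(neg_sharpe, start, method='SLSQP',
               bounds=bnds, constraints=cons)
print(opt['x'].round(3), -opt['fun'])
```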

To complete the backtest, the following code compares the expected portfolio statistics (from the optimal composition of the previous year applied to the previous year’s data) with the realized portfolio statistics for the current year (from the optimal composition from the previous year applied to the current year’s data):

In [74]: res = pd.DataFrame()
         for year in range(2010, 2019):
             rets_ = rets[symbols].loc[f'{year}-01-01':f'{year}-12-31']
             epv = port_volatility(rets_, opt_weights[year])  1
             epr = port_return(rets_, opt_weights[year])  1
             esr = epr / epv  1
             rets_ = rets[symbols].loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
             rpv = port_volatility(rets_, opt_weights[year])  2
             rpr = port_return(rets_, opt_weights[year])  2
             rsr = rpr / rpv  2
             res = pd.concat((res, pd.DataFrame(
                 {'epv': epv, 'epr': epr, 'esr': esr,
                  'rpv': rpv, 'rpr': rpr, 'rsr': rsr},
                 index=[year + 1])))

In [75]: res
Out[75]:            epv       epr       esr       rpv       rpr       rsr
         2011  0.157440  0.303003  1.924564  0.160622  0.133836  0.833235
         2012  0.173279  0.169321  0.977156  0.182292  0.161375  0.885256
         2013  0.202460  0.278459  1.375378  0.168714  0.166897  0.989228
         2014  0.181544  0.368961  2.032353  0.197798  0.026830  0.135645
         2015  0.160340  0.309486  1.930190  0.211368 -0.024560 -0.116194
         2016  0.326730  0.778330  2.382179  0.296565  0.103870  0.350242
         2017  0.106148  0.090933  0.856663  0.079521  0.230630  2.900235
         2018  0.086548  0.260702  3.012226  0.157337  0.038234  0.243004
         2019  0.323796  0.228008  0.704174  0.207672  0.275819  1.328147

In [76]: res.mean()
Out[76]: epv    0.190920
         epr    0.309689
         esr    1.688320
         rpv    0.184654
         rpr    0.123659
         rsr    0.838755
         dtype: float64

Expected portfolio statistics


Realized portfolio statistics

Figure 4-6 compares the expected and realized portfolio volatilities for the single years. MVP theory does quite a good job in predicting the portfolio volatility. This is also supported by a relatively high correlation between the two time series:

In [77]: res[['epv', 'rpv']].corr()
Out[77]:           epv       rpv
         epv  1.000000  0.765733
         rpv  0.765733  1.000000

In [78]: res[['epv', 'rpv']].plot(kind='bar', figsize=(10, 6),
                 title='Expected vs. Realized Portfolio Volatility');
Figure 4-6. Expected versus realized portfolio volatilities

However, the conclusions are the opposite when comparing the expected with the realized portfolio returns (see Figure 4-7). MVP theory obviously fails in predicting the portfolio returns, as is confirmed by the negative correlation between the two time series:

In [79]: res[['epr', 'rpr']].corr()
Out[79]:           epr       rpr
         epr  1.000000 -0.350437
         rpr -0.350437  1.000000

In [80]: res[['epr', 'rpr']].plot(kind='bar', figsize=(10, 6),
                 title='Expected vs. Realized Portfolio Return');
Figure 4-7. Expected versus realized portfolio returns

Similar, or even worse, conclusions need to be drawn with regard to the Sharpe ratio (see Figure 4-8). For the data-driven investor who aims at maximizing the Sharpe ratio of the portfolio, the theory’s predictions are generally significantly off from the realized values. The correlation between the two time series is even lower than for the returns:

In [81]: res[['esr', 'rsr']].corr()
Out[81]:           esr       rsr
         esr  1.000000 -0.698607
         rsr -0.698607  1.000000

In [82]: res[['esr', 'rsr']].plot(kind='bar', figsize=(10, 6),
                 title='Expected vs. Realized Sharpe Ratio');
Figure 4-8. Expected versus realized portfolio Sharpe ratios

Predictive Power of MVP Theory

MVP theory applied to real-world data reveals its practical shortcomings. Without additional constraints, optimal portfolio compositions and rebalancings can be extreme. The predictive power with regard to portfolio return and Sharpe ratio is pretty bad in the numerical example, whereas the predictive power with regard to portfolio risk seems acceptable. However, investors generally are interested in risk-adjusted performance measures, such as the Sharpe ratio, and this is the statistic for which MVP theory fails worst in the example.
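The statements about predictive power can be summarized from the backtest results printed earlier. The following sketch copies the values from that results table and adds, next to the correlations, the mean absolute error (MAE) between expected and realized values. The MAE is an additional illustrative measure and not part of the original analysis; since the three statistics live on different scales, the MAE values are only loosely comparable.

```python
import pandas as pd

# Values copied from the backtest results table (years 2011 to 2019)
res = pd.DataFrame({
    'epv': [0.157440, 0.173279, 0.202460, 0.181544, 0.160340,
            0.326730, 0.106148, 0.086548, 0.323796],
    'epr': [0.303003, 0.169321, 0.278459, 0.368961, 0.309486,
            0.778330, 0.090933, 0.260702, 0.228008],
    'esr': [1.924564, 0.977156, 1.375378, 2.032353, 1.930190,
            2.382179, 0.856663, 3.012226, 0.704174],
    'rpv': [0.160622, 0.182292, 0.168714, 0.197798, 0.211368,
            0.296565, 0.079521, 0.157337, 0.207672],
    'rpr': [0.133836, 0.161375, 0.166897, 0.026830, -0.024560,
            0.103870, 0.230630, 0.038234, 0.275819],
    'rsr': [0.833235, 0.885256, 0.989228, 0.135645, -0.116194,
            0.350242, 2.900235, 0.243004, 1.328147],
}, index=range(2011, 2020))

pairs = (('epv', 'rpv'), ('epr', 'rpr'), ('esr', 'rsr'))
summary = pd.DataFrame({
    # correlation between expected and realized values
    'corr': [res[e].corr(res[r]) for e, r in pairs],
    # mean absolute error between expected and realized values
    'mae': [(res[e] - res[r]).abs().mean() for e, r in pairs],
}, index=['volatility', 'return', 'Sharpe ratio'])
print(summary.round(4))
```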

Capital Asset Pricing Model

A similar approach can be applied to put the CAPM to a real-world test. Assume that the data-driven technology investor from before wants to apply the CAPM to derive expected returns for the four technology stocks from before. The following Python code first derives the beta for every stock for a given year, and then calculates the expected return for the stock in the next year, given its beta and the performance of the market portfolio. The market portfolio is approximated by the S&P 500 stock index:

In [83]: r = 0.005  1

In [84]: market = '.SPX'  2

In [85]: rets = np.log(raw / raw.shift(1)).dropna()

In [86]: res = pd.DataFrame()

In [87]: for sym in rets.columns[:4]:
             print('\n' + sym)
             print(54 * '=')
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 muM = rets_[market].mean() * 252
                 cov = rets_.cov().loc[sym, market]  3
                 var = rets_[market].var()  3
                 beta = cov / var  3
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 muM = rets_[market].mean() * 252
                 mu_capm = r + beta * (muM - r)  4
                 mu_real = rets_[sym].mean() * 252  5
                 res = pd.concat((res, pd.DataFrame({'symbol': sym,
                                                     'mu_capm': mu_capm,
                                                     'mu_real': mu_real},
                                                    index=[year + 1])),
                                 sort=True)  6
                 print('{} | beta: {:.3f} | mu_capm: {:6.3f} | mu_real: {:6.3f}'
                       .format(year + 1, beta, mu_capm, mu_real))  6

Specifies the risk-less short rate


Defines the market portfolio


Derives the beta of the stock


Calculates the expected return given previous year’s beta and current year market portfolio performance


Calculates the realized performance of the stock for the current year


Collects and prints all results

The preceding code provides the following output:

         AAPL.O
         ======================================================
         2011 | beta: 1.052 | mu_capm: -0.000 | mu_real:  0.228
         2012 | beta: 0.764 | mu_capm:  0.098 | mu_real:  0.275
         2013 | beta: 1.266 | mu_capm:  0.327 | mu_real:  0.053
         2014 | beta: 0.630 | mu_capm:  0.070 | mu_real:  0.320
         2015 | beta: 0.833 | mu_capm: -0.005 | mu_real: -0.047
         2016 | beta: 1.144 | mu_capm:  0.103 | mu_real:  0.096
         2017 | beta: 1.009 | mu_capm:  0.180 | mu_real:  0.381
         2018 | beta: 1.379 | mu_capm: -0.091 | mu_real: -0.071
         2019 | beta: 1.252 | mu_capm:  0.316 | mu_real:  0.621

         MSFT.O
         ======================================================
         2011 | beta: 0.890 | mu_capm:  0.001 | mu_real: -0.072
         2012 | beta: 0.816 | mu_capm:  0.104 | mu_real:  0.029
         2013 | beta: 1.109 | mu_capm:  0.287 | mu_real:  0.337
         2014 | beta: 0.876 | mu_capm:  0.095 | mu_real:  0.216
         2015 | beta: 0.955 | mu_capm: -0.007 | mu_real:  0.178
         2016 | beta: 1.249 | mu_capm:  0.113 | mu_real:  0.113
         2017 | beta: 1.224 | mu_capm:  0.217 | mu_real:  0.321
         2018 | beta: 1.303 | mu_capm: -0.086 | mu_real:  0.172
         2019 | beta: 1.442 | mu_capm:  0.364 | mu_real:  0.440

         INTC.O
         ======================================================
         2011 | beta: 1.081 | mu_capm: -0.000 | mu_real:  0.142
         2012 | beta: 0.842 | mu_capm:  0.108 | mu_real: -0.163
         2013 | beta: 1.081 | mu_capm:  0.280 | mu_real:  0.230
         2014 | beta: 0.883 | mu_capm:  0.096 | mu_real:  0.335
         2015 | beta: 1.055 | mu_capm: -0.008 | mu_real: -0.052
         2016 | beta: 1.009 | mu_capm:  0.092 | mu_real:  0.051
         2017 | beta: 1.261 | mu_capm:  0.223 | mu_real:  0.242
         2018 | beta: 1.163 | mu_capm: -0.076 | mu_real:  0.017
         2019 | beta: 1.376 | mu_capm:  0.347 | mu_real:  0.243

         AMZN.O
         ======================================================
         2011 | beta: 1.102 | mu_capm: -0.001 | mu_real: -0.039
         2012 | beta: 0.958 | mu_capm:  0.122 | mu_real:  0.374
         2013 | beta: 1.116 | mu_capm:  0.289 | mu_real:  0.464
         2014 | beta: 1.262 | mu_capm:  0.135 | mu_real: -0.251
         2015 | beta: 1.473 | mu_capm: -0.013 | mu_real:  0.778
         2016 | beta: 1.122 | mu_capm:  0.102 | mu_real:  0.104
         2017 | beta: 1.118 | mu_capm:  0.199 | mu_real:  0.446
         2018 | beta: 1.300 | mu_capm: -0.086 | mu_real:  0.251
         2019 | beta: 1.619 | mu_capm:  0.408 | mu_real:  0.207
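The beta values in the output above are the ratio of the covariance between stock and market returns to the variance of the market returns. The following self-contained sketch illustrates this on synthetic data and shows that the same number results as the slope of an ordinary least-squares regression. The column names `MKT` and `STK`, the seed, and the value of `true_beta` are assumptions for the simulation only.

```python
import numpy as np
import pandas as pd

# Synthetic daily log returns: a hypothetical market index and one stock
rng = np.random.default_rng(2020)
true_beta = 1.2  # assumed for the simulation only
market = rng.normal(0.0004, 0.01, size=2520)
stock = true_beta * market + rng.normal(0.0, 0.01, size=2520)
rets = pd.DataFrame({'MKT': market, 'STK': stock})

# Beta as covariance with the market divided by the market variance
beta = rets.cov().loc['STK', 'MKT'] / rets['MKT'].var()

# The same number is the slope of an OLS regression of stock on market
slope = np.polyfit(rets['MKT'], rets['STK'], deg=1)[0]

r = 0.005  # risk-less short rate, as in the text
mu_market = rets['MKT'].mean() * 252  # annualized market return
mu_capm = r + beta * (mu_market - r)  # CAPM expected return
print(f'beta: {beta:.3f} | slope: {slope:.3f} | mu_capm: {mu_capm:.3f}')
```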

Figure 4-9 compares the predicted (expected) return for a single stock, given the beta from the previous year and market portfolio performance of the current year, with the realized return of the stock for the current year. Obviously, the CAPM in its original form does not prove really useful in predicting a stock’s performance based on beta only:

In [88]: sym = 'AMZN.O'

In [89]: res[res['symbol'] == sym][['mu_capm', 'mu_real']].corr()
Out[89]:           mu_capm   mu_real
         mu_capm  1.000000 -0.004826
         mu_real -0.004826  1.000000

In [90]: res[res['symbol'] == sym].plot(kind='bar',
                         figsize=(10, 6), title=sym);
Figure 4-9. CAPM-predicted versus realized stock returns for a single stock

Figure 4-10 compares the averages of the CAPM-predicted stock returns with the averages of the realized returns. Also here, the CAPM does not do a good job.

What is easy to see is that the CAPM predictions do not vary that much on average for the stocks analyzed; they are between 11.1% and 12.8%. However, the realized average returns of the stocks show a high variability; these are between 11.6% and 25.9%. Market portfolio performance and beta alone obviously cannot account for the observed returns of the (technology) stocks:

In [91]: grouped = res.groupby('symbol').mean()
         grouped
Out[91]:          mu_capm   mu_real
         AAPL.O  0.110855  0.206158
         AMZN.O  0.128223  0.259395
         INTC.O  0.117929  0.116180
         MSFT.O  0.120844  0.192655

In [92]: grouped.plot(kind='bar', figsize=(10, 6), title='Average Values');
Figure 4-10. Average CAPM-predicted versus average realized stock returns for multiple stocks

Predictive Power of the CAPM

The predictive power of the CAPM with regard to the future performance of stocks, relative to the market portfolio, is pretty low or even nonexistent for certain stocks. One of the reasons is probably the fact that the CAPM rests on the same central assumptions as MVP theory, namely that investors care about only the (expected) return and (expected) volatility of a portfolio and/or stock. From a modeling point of view, one can ask whether the single risk factor is enough to explain variability in stock returns or whether there might be a nonlinear relationship between a stock’s return and the market portfolio performance.
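
The mechanics behind these numbers can be made concrete with a minimal sketch on synthetic data. The function name capm_prediction, the seeded random returns, and the beta of 1.2 used to generate them are assumptions for illustration only; the risk-free rate of 0.005 and the factor of 252 trading days follow the conventions used in this chapter:

```python
import numpy as np


def capm_prediction(stock, market, market_next, r=0.005):
    ''' Estimates beta from one period and derives the CAPM-expected
        return for the next period (hypothetical helper function).
    '''
    beta = np.cov(stock, market, ddof=1)[0, 1] / np.var(market, ddof=1)
    mu_m = market_next.mean() * 252  # annualized market return
    return beta, r + beta * (mu_m - r)


rng = np.random.default_rng(100)
market = rng.normal(0.0004, 0.01, 252)  # synthetic daily market returns
stock = 1.2 * market + rng.normal(0, 0.005, 252)  # synthetic stock returns
market_next = rng.normal(0.0004, 0.01, 252)  # market returns, following year

beta, mu_capm = capm_prediction(stock, market, market_next)
print(f'beta: {beta:.3f} | mu_capm: {mu_capm:.3f}')
```

The estimated beta should be close to the value used to generate the data, whereas the predicted return depends entirely on the market performance of the following year.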

Arbitrage Pricing Theory

The predictive power of the CAPM seems quite limited given the results from the previous numerical example. A valid question is whether the market portfolio performance alone is enough to explain variability in stock returns. The answer of the APT is no—there can be more (even many more) factors that together explain variability in stock returns. “Arbitrage Pricing Theory” formally describes the framework of APT that also relies on a linear relationship between the factors and a stock’s return.

The data-driven investor recognizes that the CAPM is not sufficient to reliably predict a stock’s performance relative to the market portfolio performance. Therefore, the investor decides to add to the market portfolio three additional factors that might drive a stock’s performance:

  • Market volatility (as represented by the VIX index, .VIX)

  • Exchange rates (as represented by the EUR/USD rate, EUR=)

  • Commodity prices (as represented by the gold price, XAU=)

The following Python code implements a simple APT approach by using the four factors in combination with multivariate regression to explain a stock’s future performance in relation to the factors:

In [93]: factors = ['.SPX', '.VIX', 'EUR=', 'XAU=']  # (1)

In [94]: res = pd.DataFrame()

In [95]: np.set_printoptions(formatter={'float': lambda x: f'{x:5.2f}'})

In [96]: for sym in rets.columns[:4]:
             print('\n' + sym)
             print(71 * '=')
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 reg = np.linalg.lstsq(rets_[factors],
                                       rets_[sym], rcond=-1)[0]  # (2)
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 mu_apt = np.dot(rets_[factors].mean() * 252, reg)  # (3)
                 mu_real = rets_[sym].mean() * 252  # (4)
                 res = pd.concat((res, pd.DataFrame({'symbol': sym,
                                 'mu_apt': mu_apt, 'mu_real': mu_real},
                                  index=[year + 1])))
                 print('{} | fl: {} | mu_apt: {:6.3f} | mu_real: {:6.3f}'
                       .format(year + 1, reg.round(2), mu_apt, mu_real))

(1) The four factors

(2) The multivariate regression

(3) The APT-predicted return of the stock

(4) The realized return of the stock

The preceding code provides the following output:

         AAPL.O
         =======================================================================
         2011 | fl: [ 0.91 -0.04 -0.35  0.12] | mu_apt:  0.011 | mu_real:  0.228
         2012 | fl: [ 0.76 -0.02 -0.24  0.05] | mu_apt:  0.099 | mu_real:  0.275
         2013 | fl: [ 1.67  0.04 -0.56  0.10] | mu_apt:  0.366 | mu_real:  0.053
         2014 | fl: [ 0.53 -0.00  0.02  0.16] | mu_apt:  0.050 | mu_real:  0.320
         2015 | fl: [ 1.07  0.02  0.25  0.01] | mu_apt: -0.038 | mu_real: -0.047
         2016 | fl: [ 1.21  0.01 -0.14 -0.02] | mu_apt:  0.110 | mu_real:  0.096
         2017 | fl: [ 1.10  0.01 -0.15 -0.02] | mu_apt:  0.170 | mu_real:  0.381
         2018 | fl: [ 1.06 -0.03 -0.15  0.12] | mu_apt: -0.088 | mu_real: -0.071
         2019 | fl: [ 1.37  0.01 -0.20  0.13] | mu_apt:  0.364 | mu_real:  0.621

         MSFT.O
         =======================================================================
         2011 | fl: [ 0.98  0.01  0.02 -0.11] | mu_apt: -0.008 | mu_real: -0.072
         2012 | fl: [ 0.82  0.00 -0.03 -0.01] | mu_apt:  0.103 | mu_real:  0.029
         2013 | fl: [ 1.14  0.00 -0.07 -0.01] | mu_apt:  0.294 | mu_real:  0.337
         2014 | fl: [ 1.28  0.05  0.04  0.07] | mu_apt:  0.149 | mu_real:  0.216
         2015 | fl: [ 1.20  0.03  0.05  0.01] | mu_apt: -0.016 | mu_real:  0.178
         2016 | fl: [ 1.44  0.03 -0.17 -0.02] | mu_apt:  0.127 | mu_real:  0.113
         2017 | fl: [ 1.33  0.01 -0.14  0.00] | mu_apt:  0.216 | mu_real:  0.321
         2018 | fl: [ 1.10 -0.02 -0.14  0.22] | mu_apt: -0.087 | mu_real:  0.172
         2019 | fl: [ 1.51  0.01 -0.16 -0.02] | mu_apt:  0.378 | mu_real:  0.440

         INTC.O
         =======================================================================
         2011 | fl: [ 1.17  0.01  0.05 -0.13] | mu_apt: -0.010 | mu_real:  0.142
         2012 | fl: [ 1.03  0.04  0.01  0.03] | mu_apt:  0.122 | mu_real: -0.163
         2013 | fl: [ 1.06 -0.01 -0.10  0.01] | mu_apt:  0.267 | mu_real:  0.230
         2014 | fl: [ 0.96  0.02  0.36 -0.02] | mu_apt:  0.063 | mu_real:  0.335
         2015 | fl: [ 0.93 -0.01 -0.09  0.02] | mu_apt:  0.001 | mu_real: -0.052
         2016 | fl: [ 1.02  0.00 -0.05  0.06] | mu_apt:  0.099 | mu_real:  0.051
         2017 | fl: [ 1.41  0.02 -0.18  0.03] | mu_apt:  0.226 | mu_real:  0.242
         2018 | fl: [ 1.12 -0.01 -0.11  0.17] | mu_apt: -0.076 | mu_real:  0.017
         2019 | fl: [ 1.50  0.01 -0.34  0.30] | mu_apt:  0.431 | mu_real:  0.243

         AMZN.O
         =======================================================================
         2011 | fl: [ 1.02 -0.03 -0.18 -0.14] | mu_apt: -0.016 | mu_real: -0.039
         2012 | fl: [ 0.98 -0.01 -0.17 -0.09] | mu_apt:  0.117 | mu_real:  0.374
         2013 | fl: [ 1.07 -0.00  0.09  0.00] | mu_apt:  0.282 | mu_real:  0.464
         2014 | fl: [ 1.54  0.03  0.01 -0.08] | mu_apt:  0.176 | mu_real: -0.251
         2015 | fl: [ 1.26 -0.02  0.45 -0.11] | mu_apt: -0.044 | mu_real:  0.778
         2016 | fl: [ 1.06 -0.00 -0.15 -0.04] | mu_apt:  0.099 | mu_real:  0.104
         2017 | fl: [ 0.94 -0.02  0.12 -0.03] | mu_apt:  0.185 | mu_real:  0.446
         2018 | fl: [ 0.90 -0.04 -0.25  0.28] | mu_apt: -0.085 | mu_real:  0.251
         2019 | fl: [ 1.99  0.05 -0.37  0.12] | mu_apt:  0.506 | mu_real:  0.207
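
To see what the regression step does in isolation, the following sketch applies the same least-squares approach to synthetic data with known factor loadings. All numbers, the seed, and the names F, true_loadings, and F_next are assumptions for illustration; they are not taken from the data set used in this chapter:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 252
true_loadings = np.array([1.1, -0.03, -0.2, 0.1])  # assumed for illustration

# Synthetic daily factor returns (stand-ins for the four factors)
F = rng.normal(0.0, 0.01, size=(n_days, 4))
# Stock returns as a linear combination of the factors plus noise
y = F @ true_loadings + rng.normal(0.0, 0.001, n_days)

# Multivariate least-squares regression without an intercept
loadings = np.linalg.lstsq(F, y, rcond=None)[0]

# APT-style prediction: annualized mean factor returns times the loadings
F_next = rng.normal(0.0003, 0.01, size=(n_days, 4))
mu_apt = np.dot(F_next.mean(axis=0) * 252, loadings)
print(loadings.round(2), round(mu_apt, 3))
```

When the data really is generated by a stable linear factor model, the regression recovers the loadings well. The weak results above therefore suggest that the loadings are not stable from year to year, that important drivers are missing, or both.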

Figure 4-11 compares the APT-predicted returns for a stock and its realized stock returns over time. Compared to the single-factor CAPM, there seems to be hardly any improvement:

In [97]: sym = 'AMZN.O'

In [98]: res[res['symbol'] == sym].corr()
Out[98]:            mu_apt   mu_real
         mu_apt   1.000000 -0.098281
         mu_real -0.098281  1.000000

In [99]: res[res['symbol'] == sym].plot(kind='bar',
                         figsize=(10, 6), title=sym);
Figure 4-11. APT-predicted versus realized stock returns for a stock

The same picture arises in Figure 4-12, produced by the following snippet, which compares the averages for multiple stocks. Because there is hardly any variation in the average APT predictions, there are large average differences to the realized returns:

In [100]: grouped = res.groupby('symbol').mean()
          grouped
Out[100]:           mu_apt   mu_real
          AAPL.O  0.116116  0.206158
          AMZN.O  0.135528  0.259395
          INTC.O  0.124811  0.116180
          MSFT.O  0.128441  0.192655

In [101]: grouped.plot(kind='bar', figsize=(10, 6), title='Average Values');

Figure 4-12. Average APT-predicted versus average realized stock returns for multiple stocks

Of course, the selection of the risk factors is of paramount importance in this context. The data-driven investor decides to find out which risk factors are typically considered relevant for stocks. After studying the paper by Bender et al. (2013), the investor replaces the original risk factors with a new set, namely the one presented in Table 4-3.

Table 4-3. Risk factors for APT

Factor      Description                                             RIC
Market      MSCI World Gross Return Daily USD (PUS = Price Return)
Size        MSCI World Equal Weight Price Net Index EOD
Volatility  MSCI World Minimum Volatility Net Return
Value       MSCI World Value Weighted Gross (NUS for Net)
Risk        MSCI World Risk Weighted Gross USD EOD
Growth      MSCI World Quality Net Return USD
Momentum    MSCI World Momentum Gross Index USD EOD


The following Python code retrieves the corresponding data set from a remote location and visualizes the normalized time series data (see Figure 4-13). Even a brief look reveals that the time series seem to be highly positively correlated:

In [102]: factors = pd.read_csv('',
                                index_col=0, parse_dates=True)  # (1)

In [103]: (factors / factors.iloc[0]).plot(figsize=(10, 6));  # (2)

(1) Retrieves factors time series data

(2) Normalizes and plots the data

Figure 4-13. Normalized factors time series data

This impression is confirmed by the following calculation and the resulting correlation matrix for the factor returns. All pairwise correlation coefficients are roughly 0.78 or higher:

In [104]: start = '2017-01-01'  # (1)
          end = '2020-01-01'  # (1)

In [105]: retsd = rets.loc[start:end].copy()  # (2)
          retsd.dropna(inplace=True)  # (2)

In [106]: retsf = np.log(factors / factors.shift(1))  # (3)
          retsf = retsf.loc[start:end]  # (3)
          retsf.dropna(inplace=True)  # (3)
          retsf = retsf.loc[retsd.index].dropna()  # (3)

In [107]: retsf.corr()  # (4)
Out[107]:               market      size  volatility     value      risk    growth  \
          market      1.000000  0.935867    0.845010  0.964124  0.947150  0.959038
          size        0.935867  1.000000    0.791767  0.965739  0.983238  0.835477
          volatility  0.845010  0.791767    1.000000  0.778294  0.865467  0.818280
          value       0.964124  0.965739    0.778294  1.000000  0.958359  0.864222
          risk        0.947150  0.983238    0.865467  0.958359  1.000000  0.858546
          growth      0.959038  0.835477    0.818280  0.864222  0.858546  1.000000
          momentum    0.928705  0.796420    0.819585  0.818796  0.825563  0.952956

                      momentum
          market      0.928705
          size        0.796420
          volatility  0.819585
          value       0.818796
          risk        0.825563
          growth      0.952956
          momentum    1.000000

(1) Defines start and end dates for data selection

(2) Selects the relevant returns data sub-set

(3) Calculates and processes the log returns for the factors

(4) Shows the correlation matrix for the factors
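
Why highly correlated factors are problematic for a regression can be illustrated with a short sketch. It compares the condition number of a matrix of strongly correlated regressors with that of independent ones. The synthetic data, the seed, and the names X_corr and X_ind are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 750
common = rng.normal(0.0, 0.01, n)  # shared driver of all correlated factors

# Hypothetical factor returns: strongly correlated versus independent
X_corr = np.column_stack([common + rng.normal(0.0, 0.002, n)
                          for _ in range(3)])
X_ind = rng.normal(0.0, 0.01, size=(n, 3))

print(np.corrcoef(X_corr, rowvar=False).round(2))

# A large condition number signals that least-squares loadings react
# sensitively to small changes in the data
cond_corr, cond_ind = np.linalg.cond(X_corr), np.linalg.cond(X_ind)
print(f'correlated: {cond_corr:.1f} | independent: {cond_ind:.1f}')
```

The more the factors move together, the less information each one adds, and the harder it becomes to attribute a stock's return to any single factor.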

The following Python code derives factor loadings for the original stocks but with the new factors. They are derived from the first half of the data set and applied to predict the stock return for the second half given the performance of the single factors. The realized return is also calculated. Predicted and realized returns are compared in Figure 4-14. As is to be expected given the high correlation of the factors, the explanatory power of the APT approach is not much higher compared to the CAPM:

In [108]: res = pd.DataFrame()

In [109]: np.set_printoptions(formatter={'float': lambda x: f'{x:5.2f}'})

In [110]: split = int(len(retsf) * 0.5)
          for sym in rets.columns[:4]:
              print('\n' + sym)
              print(74 * '=')
              retsf_, retsd_ = retsf.iloc[:split], retsd.iloc[:split]
              reg = np.linalg.lstsq(retsf_, retsd_[sym], rcond=-1)[0]
              retsf_, retsd_ = retsf.iloc[split:], retsd.iloc[split:]
              mu_apt = np.dot(retsf_.mean() * 252, reg)
              mu_real = retsd_[sym].mean() * 252
              res = pd.concat((res, pd.DataFrame({'mu_apt': mu_apt,
                              'mu_real': mu_real}, index=[sym,])))
              print('fl: {} | apt: {:.3f} | real: {:.3f}'
                    .format(reg.round(1), mu_apt, mu_real))

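          AAPL.O
          ==========================================================================
          fl: [ 2.30  2.80 -0.70 -1.40 -4.20  2.00 -0.20] | apt: 0.115 | real: 0.301

          MSFT.O
          ==========================================================================
          fl: [ 1.50  0.00  0.10 -1.30 -1.40  0.80  1.00] | apt: 0.181 | real: 0.304

          INTC.O
          ==========================================================================
          fl: [-3.10  1.60  0.40  1.30 -2.60  2.50  1.10] | apt: 0.186 | real: 0.118

          AMZN.O
          ==========================================================================
          fl: [ 9.10  3.30 -1.00 -7.10 -3.10 -1.80  1.20] | apt: 0.019 | real: 0.050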

In [111]: res.plot(kind='bar', figsize=(10, 6));
Figure 4-14. APT-predicted returns based on typical factors compared to realized returns

The data-driven investor is not willing to dismiss the APT completely. Therefore, an additional test might shed some more light on the explanatory power of APT. To this end, the factor loadings are used to test whether APT can explain movements of the stock price over time. And indeed, although APT does not predict the absolute performance correctly (for some of the stocks it is off by 10+ percentage points), it predicts the direction of the stock price movement correctly in the majority of cases (see Figure 4-15). The correlation between the predicted and realized returns is also pretty high at around 83%. However, the analysis uses realized factor returns to generate the APT predictions, and these are, of course, not available in practice a day before the relevant trading day:

In [112]: sym
Out[112]: 'AMZN.O'

In [113]: rets_sym = np.dot(retsf_, reg)  # (1)

In [114]: rets_sym = pd.DataFrame(rets_sym,
                                  columns=[sym + '_apt'],
                                  index=retsf_.index)  # (2)

In [115]: rets_sym[sym + '_real'] = retsd_[sym]  # (3)

In [116]: rets_sym.mean() * 252  # (4)
Out[116]: AMZN.O_apt     0.019401
          AMZN.O_real    0.050344
          dtype: float64

In [117]: rets_sym.std() * 252 ** 0.5  # (5)
Out[117]: AMZN.O_apt     0.270995
          AMZN.O_real    0.307653
          dtype: float64

In [118]: rets_sym.corr()  # (6)
Out[118]:              AMZN.O_apt  AMZN.O_real
          AMZN.O_apt     1.000000     0.832218
          AMZN.O_real    0.832218     1.000000

In [119]: rets_sym.cumsum().apply(np.exp).plot(figsize=(10, 6));

(1) Predicts the daily stock price returns given the realized factor returns

(2) Stores the results in a DataFrame object and adds column and index data

(3) Adds the realized stock price returns to the DataFrame object

(4) Calculates the annualized returns

(5) Calculates the annualized volatility

(6) Calculates the correlation coefficient

Figure 4-15. APT-predicted performance and real performance over time (gross)
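
The annualization applied to daily log returns throughout this section follows a simple convention: means scale with the number of trading days, volatilities with its square root. The following minimal sketch spells this out; the function name annualize and the sample returns are assumptions for illustration:

```python
import pandas as pd


def annualize(log_rets, periods=252):
    ''' Annualized mean return and volatility from daily log returns,
        assuming 252 trading days per year (hypothetical helper).
    '''
    mu = log_rets.mean() * periods
    sigma = log_rets.std() * periods ** 0.5
    return mu, sigma


rets_demo = pd.Series([0.01, -0.01, 0.02, -0.02, 0.005])  # made-up returns
mu, sigma = annualize(rets_demo)
print(f'mu: {mu:.3f} | sigma: {sigma:.3f}')
```

The square-root scaling of volatility rests on the assumption that daily returns are uncorrelated over time.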

How accurately does APT predict the direction of the stock price movement given the realized factor returns? The following Python code shows that the accuracy score is a bit better than 75%:

In [120]: rets_sym['same'] = (np.sign(rets_sym[sym + '_apt']) ==
                              np.sign(rets_sym[sym + '_real']))

In [121]: rets_sym['same'].value_counts()
Out[121]: True     288
          False     89
          Name: same, dtype: int64

In [122]: rets_sym['same'].value_counts()[True] / len(rets_sym)
Out[122]: 0.7639257294429708
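
The accuracy score used here is simply the share of days on which the predicted and realized returns have the same sign. A minimal, self-contained version of this measure could look as follows; the function name direction_accuracy and the sample values are assumptions for illustration:

```python
import numpy as np


def direction_accuracy(predicted, realized):
    ''' Share of observations for which both series have the same sign. '''
    predicted, realized = np.asarray(predicted), np.asarray(realized)
    return np.mean(np.sign(predicted) == np.sign(realized))


pred = [0.01, -0.02, 0.005, -0.01]
real = [0.02, -0.01, -0.003, -0.02]
print(direction_accuracy(pred, real))  # 3 of 4 signs agree: 0.75
```

Such a score is best judged against a naive baseline, for example always predicting an upward movement.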

Debunking Central Assumptions

The previous section provides a number of numerical, real-world examples showing how popular normative financial theories might fail in practice. This section argues that one of the major reasons is that central assumptions of these popular financial theories are invalid; that is, they simply do not describe the reality of financial markets. The two assumptions analyzed are normally distributed returns and linear relationships.

Normally Distributed Returns

As a matter of fact, only a normal distribution is completely specified through its first (expectation) and second moment (standard deviation).

Sample data sets

For illustration, consider a randomly generated set of standard normally distributed numbers as generated by the following Python code.4 Figure 4-16 shows the typical bell shape of the resulting histogram:

In [1]: import numpy as np
        import pandas as pd
        from pylab import plt, mpl
        mpl.rcParams['savefig.dpi'] = 300
        mpl.rcParams['font.family'] = 'serif'

In [2]: N = 10000

In [3]: snrn = np.random.standard_normal(N)  # (1)
        snrn -= snrn.mean()  # (2)
        snrn /= snrn.std()  # (3)

In [4]: round(snrn.mean(), 4)  # (2)
Out[4]: -0.0

In [5]: round(snrn.std(), 4)  # (3)
Out[5]: 1.0

In [6]: plt.figure(figsize=(10, 6))
        plt.hist(snrn, bins=35);

(1) Draws standard normally distributed random numbers

(2) Corrects the first moment (expectation) to 0.0

(3) Corrects the second moment (standard deviation) to 1.0

Figure 4-16. Standard normally distributed random numbers

Now consider a set of random numbers that shares the same first and second moment values but has a completely different distribution, as Figure 4-17 illustrates. Although the moments are the same, this distribution consists of only three discrete values:

In [7]: numbers = np.ones(N) * 1.5  # (1)
        split = int(0.25 * N)  # (1)
        numbers[split:3 * split] = -1  # (1)
        numbers[3 * split:4 * split] = 0  # (1)

In [8]: numbers -= numbers.mean()  # (2)
        numbers /= numbers.std()  # (3)

In [9]: round(numbers.mean(), 4)  # (2)
Out[9]: 0.0

In [10]: round(numbers.std(), 4)  # (3)
Out[10]: 1.0

In [11]: plt.figure(figsize=(10, 6))
         plt.hist(numbers, bins=35);

(1) A set of numbers with three discrete values only

(2) Corrects the first moment (expectation) to 0.0

(3) Corrects the second moment (standard deviation) to 1.0

Figure 4-17. Distribution with first and second moment of 0.0 and 1.0, respectively

First and Second Moment

The first and second moment of a probability distribution only describe a normal distribution completely. There are infinitely many other distributions that might share the first two moments with a normal distribution while being completely different.
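
The point can be checked numerically with a small sketch that standardizes samples from three different distributions and compares a higher moment. The choice of distributions, the seed, and the sample size are assumptions for illustration only:

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(1)
N = 100_000

samples = {
    'normal': rng.standard_normal(N),
    'uniform': rng.uniform(-3 ** 0.5, 3 ** 0.5, N),  # variance (b - a)^2 / 12 = 1
    'two-point': rng.choice([-1.0, 1.0], N),  # -1 and +1, equally likely
}

for name, x in samples.items():
    x = (x - x.mean()) / x.std()  # enforce mean 0.0 and standard deviation 1.0
    print(f'{name:10s} | mean {x.mean():6.3f} | std {x.std():5.3f} '
          f'| excess kurtosis {scs.kurtosis(x):6.3f}')
```

All three samples agree in their first two moments, but the excess kurtosis clearly separates them: it is close to zero only for the normally distributed numbers.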

In preparation for a test of real financial returns, consider the following Python functions that allow one to visualize data as a histogram and to add a probability density function (PDF) of a normal distribution with the first two moments of the data:

In [12]: import math
         import scipy.stats as scs
         import statsmodels.api as sm

In [13]: def dN(x, mu, sigma):
             ''' Probability density function of a normal random variable x.
             '''
             z = (x - mu) / sigma
             pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi * sigma ** 2)
             return pdf

In [14]: def return_histogram(rets, title=''):
             ''' Plots a histogram of the returns.
             '''
             plt.figure(figsize=(10, 6))
             x = np.linspace(min(rets), max(rets), 100)
             plt.hist(np.array(rets), bins=50,
                      density=True, label='frequency')  # (1)
             y = dN(x, np.mean(rets), np.std(rets))  # (2)
             plt.plot(x, y, linewidth=2, label='PDF')  # (2)
             plt.xlabel('log returns')
             plt.title(title)
             plt.legend()

(1) Plots the histogram of the data

(2) Plots the PDF of the corresponding normal distribution

Figure 4-18 shows how well the histogram approximates the PDF for the standard normally distributed random numbers:

In [15]: return_histogram(snrn)
Figure 4-18. Histogram and PDF for standard normally distributed numbers

By contrast, Figure 4-19 illustrates that the PDF of the normal distribution has nothing to do with the data shown as a histogram:

In [16]: return_histogram(numbers)
Figure 4-19. Histogram and normal PDF for discrete numbers

Another way of comparing a normal distribution to data is the Quantile-Quantile (Q-Q) plot. As Figure 4-20 shows, for normally distributed numbers, the numbers themselves lie (mostly) on a straight line in the Q-Q plane:

In [17]: def return_qqplot(rets, title=''):
             ''' Generates a Q-Q plot of the returns.
             '''
             fig = sm.qqplot(rets, line='s', alpha=0.5)
             fig.set_size_inches(10, 6)
             plt.title(title)
             plt.xlabel('theoretical quantiles')
             plt.ylabel('sample quantiles')

In [18]: return_qqplot(snrn)
Figure 4-20. Q-Q plot for standard normally distributed numbers

Again, the Q-Q plot as shown in Figure 4-21 for the discrete numbers looks completely different to the one in Figure 4-20:

In [19]: return_qqplot(numbers)
Figure 4-21. Q-Q plot for discrete numbers
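
The visual impression of a Q-Q plot can also be condensed into a single number. The following sketch uses the probplot function of scipy.stats, which returns the quantile pairs together with the correlation coefficient of the fitted straight line. The Student's t sample serves only as a fat-tailed stand-in and, like the seed and sample size, is an assumption for illustration:

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(3)
normal_sample = rng.standard_normal(5000)
fat_tailed = rng.standard_t(df=3, size=5000)  # fat-tailed stand-in

for name, data in [('normal', normal_sample), ('t(3)', fat_tailed)]:
    # probplot returns the theoretical and ordered sample quantiles plus
    # the slope, intercept, and correlation r of the fitted straight line
    (theo_q, sample_q), (slope, intercept, r) = scs.probplot(data, dist='norm')
    print(f'{name:7s} | r = {r:.4f}')
```

A value of r very close to 1 indicates that the points lie almost exactly on a straight line; fat tails pull the value down.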

Finally, one can also use statistical tests to check whether a set of numbers is normally distributed or not.

The following Python function implements three tests:

  • Test for normal skew.

  • Test for normal kurtosis.

  • Test for normal skew and kurtosis combined.

A p-value below 0.05 is generally considered to be a counter-indicator for normality; that is, the hypothesis that the numbers are normally distributed is rejected. In that sense, as in the preceding figures, the p-values for the two data sets speak for themselves:

In [20]: def print_statistics(rets):
             print('RETURN SAMPLE STATISTICS')
             print('Skew of Sample Log Returns {:9.6f}'.format(
                 scs.skew(rets)))
             print('Skew Normal Test p-value   {:9.6f}'.format(
                 scs.skewtest(rets)[1]))
             print('Kurt of Sample Log Returns {:9.6f}'.format(
                 scs.kurtosis(rets)))
             print('Kurt Normal Test p-value   {:9.6f}'.format(
                 scs.kurtosistest(rets)[1]))
             print('Normal Test p-value        {:9.6f}'.format(
                 scs.normaltest(rets)[1]))

In [21]: print_statistics(snrn)
         Skew of Sample Log Returns  0.016793
         Skew Normal Test p-value    0.492685
         Kurt of Sample Log Returns -0.024540
         Kurt Normal Test p-value    0.637637
         Normal Test p-value         0.707334

In [22]: print_statistics(numbers)
         Skew of Sample Log Returns  0.689254
         Skew Normal Test p-value    0.000000
         Kurt of Sample Log Returns -1.141902
         Kurt Normal Test p-value    0.000000
         Normal Test p-value         0.000000
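
How the combined test relates to the two individual tests can be verified directly: in SciPy, the statistic of normaltest is the sum of the squared z-scores of skewtest and kurtosistest. The sketch below checks this on a hypothetical fat-tailed sample; the Student's t distribution, the seed, and the sample size are assumptions for illustration:

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(11)
data = rng.standard_t(df=4, size=2000)  # hypothetical fat-tailed sample

z_skew, p_skew = scs.skewtest(data)
z_kurt, p_kurt = scs.kurtosistest(data)
k2, p_normal = scs.normaltest(data)

# The combined statistic is the sum of the two squared z-scores
print(np.isclose(k2, z_skew ** 2 + z_kurt ** 2))
print(f'p-values | skew {p_skew:.4f} | kurt {p_kurt:.4f} '
      f'| combined {p_normal:.4f}')
```

A sample can therefore fail the combined test because of its skew, its kurtosis, or both.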

Real financial returns

The following Python code retrieves EOD data from a remote source, as done earlier in the chapter, and calculates the log returns for all financial time series contained in the data set. Figure 4-22 shows that the log returns of the S&P 500 stock index represented as a histogram show a much higher peak and fatter tails when compared to the normal PDF with the sample expectation and standard deviation. These two insights are stylized facts because they can be consistently observed for different financial instruments:

In [23]: raw = pd.read_csv('',
                           index_col=0, parse_dates=True).dropna()

In [24]: rets = np.log(raw / raw.shift(1)).dropna()

In [25]: symbol = '.SPX'

In [26]: return_histogram(rets[symbol].values, symbol)
Figure 4-22. Frequency distribution and normal PDF for S&P 500 log returns
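
Because log returns are used throughout this chapter, a brief sketch of their calculation and of their main convenience, additivity over time, may be helpful. The price series is made up for illustration:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 101.0, 105.0])  # hypothetical prices
log_rets = np.log(prices / prices.shift(1)).dropna()

# Log returns are additive over time: their sum equals the log of the
# gross return over the whole period
total = np.exp(log_rets.sum())
print(round(total, 4), prices.iloc[-1] / prices.iloc[0])
```

This additivity is also what justifies the use of cumsum() followed by np.exp() when plotting gross performance over time.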

Similar insights can be gained when considering the Q-Q plot for the S&P 500 log returns in Figure 4-23. In particular, the Q-Q plot visualizes the fat tails pretty well (points below the straight line to the left and above the straight line to the right):

In [27]: return_qqplot(rets[symbol].values, symbol)
Figure 4-23. Q-Q plot for S&P 500 log returns

The Python code that follows conducts the statistical tests regarding the normality of the real financial returns for a selection of the financial time series from the data set. Real financial returns regularly fail such tests. Therefore, it is safe to conclude that the normality assumption about financial returns hardly, if at all, describes financial reality:

In [28]: symbols = ['.SPX', 'AMZN.O', 'EUR=', 'GLD']

In [29]: for sym in symbols:
             print('\n{}'.format(sym))
             print(45 * '=')
             print_statistics(rets[sym].values)

         .SPX
         =============================================
         Skew of Sample Log Returns -0.497160
         Skew Normal Test p-value    0.000000
         Kurt of Sample Log Returns  4.598167
         Kurt Normal Test p-value    0.000000
         Normal Test p-value         0.000000

         AMZN.O
         =============================================
         Skew of Sample Log Returns  0.135268
         Skew Normal Test p-value    0.005689
         Kurt of Sample Log Returns  7.344837
         Kurt Normal Test p-value    0.000000
         Normal Test p-value         0.000000

         EUR=
         =============================================
         Skew of Sample Log Returns -0.053959
         Skew Normal Test p-value    0.268203
         Kurt of Sample Log Returns  1.780899
         Kurt Normal Test p-value    0.000000
         Normal Test p-value         0.000000

         GLD
         =============================================
         Skew of Sample Log Returns -0.581025
         Skew Normal Test p-value    0.000000
         Kurt of Sample Log Returns  5.899701
         Kurt Normal Test p-value    0.000000
         Normal Test p-value         0.000000

Normality Assumption

Although the normality assumption is a good approximation for many real-world phenomena, such as in physics, it is not appropriate and can even be dangerous when it comes to financial returns. Almost no financial return sample data set passes statistical normality tests. Beyond the fact that it has proven useful in other domains, a major reason why this assumption is found in so many financial models is that it leads to elegant and relatively simple mathematical models, calculations, and proofs.
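
Why the assumption can be dangerous becomes tangible when comparing tail probabilities. The sketch below contrasts the probability of a move of more than four standard deviations under a normal distribution with that under a Student's t distribution with three degrees of freedom. The t distribution is merely an assumed stand-in for fat-tailed returns, not a model proposed in this chapter:

```python
import scipy.stats as scs

k = 4.0  # threshold in standard deviations
p_normal = 2 * scs.norm.sf(k)  # two-sided tail probability

# Student's t with df degrees of freedom has variance df / (df - 2),
# so k standard deviations correspond to k * sqrt(df / (df - 2))
df = 3
p_t = 2 * scs.t.sf(k * (df / (df - 2)) ** 0.5, df)

print(f'normal: {p_normal:.6f} | t({df}): {p_t:.6f} '
      f'| ratio: {p_t / p_normal:.0f}')
```

A risk model built on normality would thus understate the likelihood of such extreme moves by a wide margin if returns were in fact fat-tailed.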

Linear Relationships

Similar to the “omnipresence” of the normality assumption in financial models and theories, linear relationships between variables seem to be another widespread benchmark. This sub-section considers an important one, namely the assumed linear relationship in the CAPM between the beta of a stock and its expected (realized) return. Generally speaking, the higher the beta is, the higher the expected return given a positive market performance will be—in a fixed proportional way as given by the beta value itself.

Recall the calculation of the betas, the CAPM expected returns, and the realized returns for a selection of technology stocks from the previous section, which is repeated in the following Python code for convenience. This time, the beta values are added to the results’ DataFrame object as well.

In [30]: r = 0.005

In [31]: market = '.SPX'

In [32]: res = pd.DataFrame()

In [33]: for sym in rets.columns[:4]:
             for year in range(2010, 2019):
                 rets_ = rets.loc[f'{year}-01-01':f'{year}-12-31']
                 muM = rets_[market].mean() * 252
                 cov = rets_.cov().loc[sym, market]
                 var = rets_[market].var()
                 beta = cov / var
                 rets_ = rets.loc[f'{year + 1}-01-01':f'{year + 1}-12-31']
                 muM = rets_[market].mean() * 252
                 mu_capm = r + beta * (muM - r)
                 mu_real = rets_[sym].mean() * 252
                 res = pd.concat((res, pd.DataFrame({'symbol': sym,
                                                'beta': beta,
                                                'mu_capm': mu_capm,
                                                'mu_real': mu_real},
                                               index=[year + 1])))

The following analysis calculates the R² score for a linear regression for which the beta is the independent variable and the expected CAPM return, given the market portfolio performance, is the dependent variable. R² refers to the coefficient of determination and measures how well a model performs compared to a baseline predictor in the form of a simple mean value. The linear regression can only explain around 9% of the variability in the expected CAPM return, a pretty low value, which is also confirmed through Figure 4-24:

In [34]: from sklearn.metrics import r2_score

In [35]: reg = np.polyfit(res['beta'], res['mu_capm'], deg=1)
         res['mu_capm_ols'] = np.polyval(reg, res['beta'])

In [36]: r2_score(res['mu_capm'], res['mu_capm_ols'])
Out[36]: 0.09272355783573516

In [37]: res.plot(kind='scatter', x='beta', y='mu_capm', figsize=(10, 6))
         x = np.linspace(res['beta'].min(), res['beta'].max())
         plt.plot(x, np.polyval(reg, x), 'g--', label='regression')
Figure 4-24. Expected CAPM return versus beta (including linear regression)
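
The coefficient of determination can also be computed by hand, which makes the comparison with the mean-value baseline explicit. The function name r2_manual and the sample values are assumptions for illustration:

```python
import numpy as np


def r2_manual(y_true, y_pred):
    ''' Coefficient of determination: 1 - SS_res / SS_tot. '''
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variation around the mean
    return 1 - ss_res / ss_tot


y = np.array([0.10, 0.25, 0.05, 0.30])
print(r2_manual(y, y))  # perfect prediction: 1.0
print(r2_manual(y, np.full_like(y, y.mean())))  # mean baseline: 0.0
```

A score of around 0.09 therefore means that the regression line is only marginally better than always predicting the average value.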

For the realized return, the explanatory power of the linear regression is even lower, with about 4.5% (see Figure 4-25). The linear regressions recover the positive relationship between beta and stock returns—“the higher the beta, the higher the return given the (positive) market portfolio performance”—as indicated by the positive slope of the regression lines. However, they only explain a small part of the observed overall variability in the stock returns:

In [38]: reg = np.polyfit(res['beta'], res['mu_real'], deg=1)
         res['mu_real_ols'] = np.polyval(reg, res['beta'])

In [39]: r2_score(res['mu_real'], res['mu_real_ols'])
Out[39]: 0.04466919444752959

In [40]: res.plot(kind='scatter', x='beta', y='mu_real', figsize=(10, 6))
         x = np.linspace(res['beta'].min(), res['beta'].max())
         plt.plot(x, np.polyval(reg, x), 'g--', label='regression')
Figure 4-25. Realized return versus beta (including linear regression)

Linear Relationships

As with the normality assumption, linear relationships can often be observed in the physical world. However, in finance there are hardly any cases in which variables depend on each other in a clearly linear way. From a modeling point of view, linear relationships lead, as does the normality assumption, to elegant and relatively simple mathematical models, calculations, and proofs. In addition, the standard tool in financial econometrics, OLS regression, is well suited to dealing with linear relationships in data. These are major reasons why normality and linearity are often deliberately chosen as convenient building blocks of financial models and theories.
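
How badly a linear model can miss a nonlinear relationship is easy to demonstrate with a stylized sketch. The quadratic relationship and the helper function r2 are assumptions chosen purely for illustration:

```python
import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2  # hypothetical, purely nonlinear relationship


def r2(y_true, y_pred):
    ''' Coefficient of determination. '''
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


lin = np.polyval(np.polyfit(x, y, deg=1), x)  # linear regression
quad = np.polyval(np.polyfit(x, y, deg=2), x)  # quadratic regression
print(f'R2 linear: {r2(y, lin):.3f} | R2 quadratic: {r2(y, quad):.3f}')
```

Although y is fully determined by x, the linear regression explains essentially none of its variability, while the quadratic one explains all of it.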


Conclusions

Science has been driven for centuries by the rigorous generation and analysis of data. However, finance used to be characterized by normative theories based on simplified mathematical models of the financial markets, relying on assumptions such as normality of returns and linear relationships. The almost universal and comprehensive availability of (financial) data has led to a shift in focus from a theory-first approach to data-driven finance. Several examples based on real financial data illustrate that many popular financial models and theories cannot survive a confrontation with financial market realities. Although elegant, they might be too simplistic to capture the complexities, changing nature, and nonlinearities of financial markets.


Books and papers cited in this chapter:

Python Code

The following Python file contains a number of helper functions to simplify certain tasks in NLP:

# NLP Helper Functions
# Artificial Intelligence in Finance
# (c) Dr Yves J Hilpisch
# The Python Quants GmbH
import re
import nltk
import string
import pandas as pd
from pylab import plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from lxml.html.clean import Cleaner
from sklearn.feature_extraction.text import TfidfVectorizer
plt.style.use('seaborn')

cleaner = Cleaner(style=True, links=True, allow_tags=[''],
                  remove_unknown_tags=False)

stop_words = stopwords.words('english')
stop_words.extend(['new', 'old', 'pro', 'open', 'menu', 'close'])

def remove_non_ascii(s):
    ''' Removes all non-ascii characters.
    '''
    return ''.join(i for i in s if ord(i) < 128)

def clean_up_html(t):
    t = cleaner.clean_html(t)
    t = re.sub('[\n\t\r]', ' ', t)
    t = re.sub(' +', ' ', t)
    t = re.sub('<.*?>', '', t)
    t = remove_non_ascii(t)
    return t

def clean_up_text(t, numbers=False, punctuation=False):
    ''' Cleans up a text, e.g. HTML document,
        from HTML tags and also cleans up the
        text body.
    '''
    t = clean_up_html(t)
    t = t.lower()
    t = re.sub(r"what's", "what is ", t)
    t = t.replace('(ap)', '')
    t = re.sub(r"\'ve", " have ", t)
    t = re.sub(r"can't", "cannot ", t)
    t = re.sub(r"n't", " not ", t)
    t = re.sub(r"i'm", "i am ", t)
    t = re.sub(r"\'s", "", t)
    t = re.sub(r"\'re", " are ", t)
    t = re.sub(r"\'d", " would ", t)
    t = re.sub(r"\'ll", " will ", t)
    t = re.sub(r'\s+', ' ', t)
    t = re.sub(r"\\", "", t)
    t = re.sub(r"\'", "", t)
    t = re.sub(r"\"", "", t)
    if numbers:
        t = re.sub('[^a-zA-Z ?!]+', '', t)
    if punctuation:
        t = re.sub(r'\W+', ' ', t)
    t = remove_non_ascii(t)
    t = t.strip()
    return t

def nltk_lemma(word):
    ''' If one exists, returns the lemma of a word.
        I.e. the base or dictionary version of it.
    '''
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

def tokenize(text, min_char=3, lemma=True, stop=True,
             numbers=False):
    ''' Tokenizes a text and implements some
        transformations (length filter, stop words, lemmas).
    '''
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if len(t) >= min_char]
    if numbers:
        tokens = [t for t in tokens if t[0].lower()
                  in string.ascii_lowercase]
    if stop:
        tokens = [t for t in tokens if t not in stop_words]
    if lemma:
        tokens = [nltk_lemma(t) for t in tokens]
    return tokens

def generate_word_cloud(text, no, name=None, show=True):
    ''' Generates a word cloud bitmap given a
        text document (string).
        It uses the Term Frequency (TF) and
        Inverse Document Frequency (IDF)
        vectorization approach to derive the
        importance of a word -- represented
        by the size of the word in the word cloud.

    Parameters
    ==========
    text: str
        text as the basis
    no: int
        number of words to be included
    name: str
        path to save the image
    show: bool
        whether to show the generated image or not
    '''
    tokens = tokenize(text)
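    vec = TfidfVectorizer(min_df=2,
                          ngram_range=(1, 2))
    # Fit on the tokens (each token is treated as one document)
    # so that the IDF weights in vec.idf_ become available.
    vec.fit_transform(tokens)
    # In scikit-learn 1.2+, use vec.get_feature_names_out() instead.
    wc = pd.DataFrame({'words': vec.get_feature_names(),
                       'tfidf': vec.idf_})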
    words = ' '.join(wc.sort_values('tfidf', ascending=True)['words'].head(no))
    wordcloud = WordCloud(max_font_size=110,
                      width=1024, height=768,
                      margin=10, max_words=150).generate(words)
    if show:
        plt.figure(figsize=(10, 10))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()
    if name is not None:
        # save the bitmap to the given path
        wordcloud.to_file(name)

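def generate_key_words(text, no):
    try:
        tokens = tokenize(text)
        vec = TfidfVectorizer(min_df=2,
                              ngram_range=(1, 2))
        vec.fit_transform(tokens)
        wc = pd.DataFrame({'words': vec.get_feature_names(),
                           'tfidf': vec.idf_})
        words = wc.sort_values('tfidf', ascending=False)['words'].values
        # keep the first `no` non-numeric entries
        words = [a for a in words if not a.isnumeric()][:no]
    except Exception:
        # fall back to an empty list if vectorization fails,
        # e.g. when no token meets the min_df threshold
        words = list()
    return words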
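
The helper functions above rank words by the IDF weights stored in `vec.idf_`. The following is a minimal, standard-library sketch of how such weights can be computed, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 that scikit-learn documents for its default `smooth_idf=True` setting. The function name `smoothed_idf` and the sample tokens are hypothetical, and the sketch ignores the `min_df` and `ngram_range` settings used in the helpers.

```python
import math
from collections import Counter


def smoothed_idf(docs):
    ''' Returns a dict mapping each term to its smoothed IDF weight,
        idf = ln((1 + n) / (1 + df)) + 1, where n is the number of
        documents and df the number of documents containing the term.
        (Illustrative helper, not part of the module above.)
    '''
    n = len(docs)
    # document frequency: count each term at most once per document
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1
            for term, count in df.items()}


# Each token is passed as its own one-word "document",
# mirroring how the helpers fit the vectorizer on a token list.
tokens = ['market', 'stock', 'market', 'bond', 'market', 'stock']
idf = smoothed_idf(tokens)

# Ascending IDF puts the most frequent tokens first.
ranked = sorted(idf, key=idf.get)
print(ranked)
```

Because every token is treated as a separate document, a low IDF value simply marks a frequent token. This is why sorting in ascending order, as `generate_word_cloud` does, surfaces the most common words, while the descending sort in `generate_key_words` surfaces the rarest ones that still pass the `min_df` threshold.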

1 See, for example, Kopf (2015).

2 This data service is only available via a paid subscription.

3 RIC stands for Reuters Instrument Code.

4 Numbers generated by the random number generator of NumPy are pseudorandom numbers, although they are referenced throughout the book as random numbers.
