Chapter 1. Introducing Data Science and Trading
The best way to begin learning about a complex topic is to break it down into smaller parts and understand those pieces first. Understanding deep learning for finance requires knowledge of data science and financial markets.
This chapter lays the building blocks needed to thoroughly understand data science and its uses, as well as to understand financial markets and how trading and forecasting can benefit from data science.
By the end of the chapter, you should know what data science is, what its applications are, and how you can use it in finance to extract value.
It is impossible to understand the field of data science without first understanding the types and structures of data. After all, the first word in the name of this immense field is data. So, what is data? And more importantly, what can you do with it?
The final aim of collecting data is decision-making. This is done through a complex process that ranges from the act of gathering and processing data to interpreting it and using the results to make a decision.
Let’s take an example of using data to make a decision. Suppose you have a portfolio composed of five different equally weighted dividend-paying stocks, as detailed in Table 1-1.
Analyzing this data can help you understand the average dividend yield you are receiving from your portfolio. The average is basically the sum divided by the quantity, and it gives a quick snapshot of the overall dividend yield of the portfolio:
Therefore, the average dividend yield of your portfolio is 5.16%. This information can help you compare your average dividend yield to other portfolios so that you know whether you have to make any adjustments.
Another metric you can calculate is the number of stocks held in the portfolio. This may provide the first informational brick in constructing a wall of diversification. Even though these two pieces of information (average dividend yield and the number of stocks in the portfolio) are very simple, complex data analysis begins with simple metrics and may sometimes not require sophisticated models to properly interpret the situation.
The two metrics you calculated in the previous example are called the average (or mean) and the count (or number of elements). They are part of a field called descriptive statistics discussed in Chapter 3, which is also itself part of data science.
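As a minimal sketch of these two metrics, the snippet below computes the count and the mean of a list of dividend yields. The five yields are illustrative assumptions (they are not taken from Table 1-1); they were simply chosen so that their average matches the 5.16% quoted above.

```python
# Hypothetical dividend yields (in percent) for five equally weighted stocks.
# These figures are assumptions for illustration, not the contents of Table 1-1.
dividend_yields = [5.20, 3.99, 4.12, 6.94, 5.55]

# Count: the number of stocks held in the portfolio
count = len(dividend_yields)

# Mean: the sum of the yields divided by their count
mean_yield = sum(dividend_yields) / count

print(f"Number of stocks: {count}")
print(f"Average dividend yield: {mean_yield:.2f}%")
```

Because the stocks are equally weighted, a plain average is enough here; a portfolio with unequal weights would call for a weighted average instead.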
Let’s take another example of data analysis for inferential purposes. Suppose you have calculated a yearly correlation measure between two commodities, and you want to predict whether the next yearly correlation will be positive or negative. Table 1-2 has the details of the calculations.
Correlation is a measure of the linear relationship between two time series. A positive correlation generally means that the two time series move on average in the same direction, while a negative correlation generally means that the two time series move on average in opposite directions. Correlation is discussed in Chapter 3.
From Table 1-2, the historical correlation between the two commodities was mostly (i.e., 88%) positive. Taking into account historical observations, you can say that there is an 88% probability that the next correlation measure will be positive. This also means that there is a 12% probability that the next correlation measure will be negative:
This is another basic example of how to use data to draw inferences from observations and make decisions. Of course, the assumption here is that historical results will exactly reflect future results, which is unlikely in real life, but occasionally, to predict the future all you have is the past.
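The frequency-based reasoning above can be sketched in a few lines of Python. The list of yearly correlations is hypothetical (it does not reproduce Table 1-2); it contains 25 made-up observations, 22 of which are positive, so that the resulting proportions match the 88% and 12% figures in the text.

```python
# Hypothetical yearly correlations between two commodities (assumed values,
# not the contents of Table 1-2): 25 observations, 22 of them positive.
yearly_correlations = [
    0.42, 0.15, 0.63, -0.08, 0.27, 0.51, 0.33, 0.09, 0.74, 0.20,
    0.38, -0.21, 0.56, 0.12, 0.45, 0.31, 0.67, 0.05, 0.29, -0.14,
    0.48, 0.36, 0.22, 0.59, 0.17,
]

# Share of past observations that were positive
positive_count = sum(1 for c in yearly_correlations if c > 0)
p_positive = positive_count / len(yearly_correlations)

# The two outcomes are complementary
p_negative = 1 - p_positive

print(f"Estimated probability of a positive correlation: {p_positive:.0%}")
print(f"Estimated probability of a negative correlation: {p_negative:.0%}")
```

This estimate is only as good as the assumption that the past is representative of the future.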
Now, before discussing data science, let’s review what types of data can be used and segment them into different groups:
- Numerical data
This type of data is composed of numbers that reflect a certain type of information that is collected at regular or irregular intervals. Examples can include market data (OHLC,1 volume, spreads, etc.) and financial statements data (assets, revenue, costs, etc.).
- Categorical data
Categorical data is data that can be organized into groups or categories using names or labels. It is qualitative rather than quantitative. For example, the blood type of patients is a type of categorical data. Another example is the eye color of different samples from a population.
- Text data
Text data has been on the rise in recent years with the development of natural language processing (NLP). Machine learning models use text data to translate, interpret, and analyze the sentiment of the text.
- Visual data
Images and videos are also considered data, and you can process and transform them into valuable information. For example, a convolutional neural network (CNN) is a type of algorithm (discussed in Chapter 8) that can recognize and categorize photos by labels (e.g., labeling cat photos as cats).
- Audio data
Audio data is very valuable and can help save time on transcriptions. For example, you can use algorithms on audio to create captions and automatically create subtitles. You can also create models that interpret the sentiment of the speaker using the tone and volume of the audio.
Data science is a transdisciplinary field that tries to extract intelligence and conclusions from data using different techniques and models, be they simple or complex. The data science process is composed of many steps besides just analyzing data. The following summarizes these steps:
- Data gathering: This process involves the acquisition of data from reliable and accurate sources. A widely known phrase in computer science generally credited to George Fuechsel goes “Garbage in, garbage out,” and it refers to the need to have quality data that you can rely on for proper analysis. Basically, if you have inaccurate or faulty data, then all your processes will be invalid.
- Data preprocessing: Occasionally, the data you acquire can be in a raw form, and it needs to be preprocessed and cleaned for the data science models to be able to use it. For example, dropping unnecessary data, adding missing values, or eliminating invalid and duplicate data can be part of the preprocessing step. Other, more complex examples can include normalization and denoising of data. The aim of this step is to get the data ready for analysis.
- Data exploration: During this step, basic statistical research is conducted to find trends and other characteristics in data. An example of data exploration is to calculate the mean of the data.
- Data visualization: This is an important step that is an add-on to the previous step. It includes creating visualizations such as histograms and heatmaps to help identify patterns and trends and facilitate interpretation.
- Data analysis: This is the main focus of the data science process. This is when you fit (train) different learning models on the data so that they can interpret it and predict future outcomes based on the given parameters.
- Data interpretation: This step deals with understanding the feedback and conclusions presented by the data science models. Optimization may also be a part of this step; in those cases, we loop back to step 5 and run the models again with the updated parameters before reinterpreting them and evaluating the performance.
Let’s take a simple example in Python that applies the steps of the data science process. Suppose you want to analyze and predict the VIX (volatility index), a volatility time series indicator that represents the implied volatility of the S&P 500 stock market index. The VIX was introduced in 1993 (with historical values available from 1990) and is issued by the Chicago Board Options Exchange (CBOE).
There is also a hidden step in the data science process that I refer to as step zero, and it occurs when you form an idea based on which process should be initiated. After all, you wouldn’t be applying the process if you didn’t have a motive first. For example, believing that inflation numbers may drive the returns of certain commodities is an idea and a motive to start exploring the data in search of real numbers that prove this hypothesis.
Because it is meant to measure the level of fear or uncertainty in the stock market, the VIX is frequently referred to as the fear index. It is a percentage that is computed using the pricing of options on the S&P 500. A higher VIX value correlates with greater market turbulence and uncertainty, whereas a lower value correlates with greater stability on average.
The first step is data gathering, which in this case can be automated using Python. The following code block connects to the website of the Federal Reserve Bank of St. Louis and downloads the historical data of the VIX between January 1, 1990, and January 23, 2023 (Chapter 6 is dedicated to introducing Python and writing code; for the moment, you do not have to understand the code, as that is not yet the goal):
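# Importing the required library
import pandas_datareader as pdr
# Setting the beginning and end of the historical data
start_date, end_date = '1990-01-01', '2023-01-23'
# Creating a dataframe and downloading the VIX data
vix = pdr.DataReader('VIXCLS', 'fred', start_date, end_date)
# Printing the latest five observations of the dataframe
print(vix.tail())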
The code uses the pandas-datareader library (a companion to pandas) to import the DataReader function, which fetches the historical data online from a variety of sources. The DataReader function takes the name of the data as the first argument, followed by the source and the dates. The output of print(vix.tail()) is shown in Table 1-3.
Let’s move on to the second step: data preprocessing. I divide this part into checking for invalid data and transforming the data so that it is ready for use. When dealing with time series, especially downloaded time series, you may sometimes encounter nan values. NaN stands for Not a Number, and nan values occur due to missing, invalid, or corrupt data.
You can deal with nan values in many ways. For the sake of this example, let’s use the simplest way of dealing with these invalid values, which is to eliminate them. But first, let’s write some simple code that outputs the number of nan values in the dataframe so that you have an idea of how many values you will delete:
# Calculating the number of nan values
count_nan = vix['VIXCLS'].isnull().sum()
# Printing the result
print('Number of nan values in the VIX dataframe: ' + str(count_nan))
The code uses the isnull() function and sums the number it gets, which gives the number of nan values. The output of the previous code snippet is as follows:
Now that you have an idea of how many rows you will delete, you can use the following code to drop the invalid rows:
# Dropping the nan values from the rows
vix = vix.dropna()
The second part of the second step is to transform the data. Data science models typically require stationary data, which is data with stable statistical properties such as the mean.
The concept of stationarity and the required statistics metrics are discussed in detail in Chapter 3. For now, all you need to know is that it is likely that you will have to transform your raw data into stationary data when using data science models.
To transform the VIX data into stationary data, you can simply take the differences from one value relative to the previous value. The following code snippet takes the VIX dataframe and transforms it into theoretically implied stationary data:2
# Taking the differences in an attempt to make the data stationary
vix = vix.diff(periods = 1, axis = 0)
# Dropping the first value of the dataframe
vix = vix.iloc[1: , :]
The third step is data exploration, which is all about understanding the data you have in front of you, statistically speaking. As you will see statistical metrics in detail in Chapter 3, I’ll limit the discussion to just calculating the mean of the dataset.
The mean is simply the value that can represent the other values in the dataset if they were to elect a leader. It is the sum of the values divided by their quantity. The mean is the simplest stat in the descriptive statistics world, and it is definitely the most used one. The following formula shows the mathematical representation of the mean of a set of values:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Here, n is the number of values in the set and x_i is the i-th value.
You can easily calculate the mean of the dataset as follows:
# Calculating the mean of the dataset
mean = vix['VIXCLS'].mean()
# Printing the result
print('The mean of the dataset = ' + str(mean))
The output of the previous code snippet is as follows:
The next step is data visualization, which is mostly considered to be the fun step. Let’s chart the VIX’s differenced values through time. The following code snippet plots the VIX data shown in Figure 1-1:
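# Importing the required library
import matplotlib.pyplot as plt
# Plotting the latest 250 observations in black with a label
plt.plot(vix[-250:], color = 'black', linewidth = 1.5, label = 'Change in VIX')
# Plotting a red dashed horizontal line that is equal to mean
plt.axhline(y = vix['VIXCLS'].mean(), color = 'red', linestyle = 'dashed')
# Calling a grid to facilitate the visual component
plt.grid()
# Calling the legend function so it appears with the chart
plt.legend()
# Calling the plot
plt.show()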
Steps 5 and 6, data analysis and data interpretation, are what you are going to study thoroughly in this book, so let’s skip them for now and concentrate on the introductory part of data science.
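The preprocessing and exploration steps of this example can also be tried offline. The sketch below is only an illustration under stated assumptions: it replaces the downloaded VIX series with a tiny synthetic series (the values, dates, and the column name value are made up), then counts and drops the nan values, differences the series, and computes the mean of the changes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a downloaded series (assumed values, not real VIX data)
dates = pd.date_range("2023-01-02", periods=8, freq="B")
raw = pd.DataFrame(
    {"value": [20.0, 21.5, np.nan, 19.0, 18.5, np.nan, 22.0, 21.0]},
    index=dates,
)

# Preprocessing, part 1: count the missing rows, then drop them
n_missing = int(raw["value"].isnull().sum())
clean = raw.dropna()

# Preprocessing, part 2: difference the series and drop the first row,
# which has no previous value to be compared with
diffed = clean.diff().iloc[1:]

# Exploration: mean of the differenced series
mean_change = float(diffed["value"].mean())

print(f"Missing values removed: {n_missing}")
print(f"Mean of the differenced series: {mean_change:.2f}")
```

Note that dropping rows before differencing means some differences span more than one period; whether that is acceptable depends on the analysis.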
Let’s go back to the invalid or missing data problem before moving on. Sometimes data is incomplete and has missing cells. Even though this has the potential to hinder the predictive ability of the algorithm, it should not stop you from continuing the analysis as there are quick fixes that help lessen the negative impact of the empty cells. For instance, consider Table 1-4.
The table contains the quarterly gross domestic product (GDP)3 of a hypothetical country. Notice how the table is missing the value for Q1 2021. There are three basic ways to solve this issue:
- Delete the cell that contains the missing value.
This is the technique used in the VIX example. It simply considers that the timestamp does not exist. It is the easiest fix.
- Assume that the missing cell is equal to the previous cell.
This is also a simple fix that has the aim of smoothing the data instead of completely ignoring the issue.
- Calculate a mean or a median of the cells around the empty value.
This technique takes smoothing one step further and assumes that the missing value is equal to the mean between the previous and next values. Additionally, it can be the mean of a few past observations.
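The three fixes above map onto standard pandas operations. The following sketch uses a hypothetical quarterly GDP series (the figures are assumptions, not the values of Table 1-4) with one missing quarter.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly GDP figures (assumed, not the values of Table 1-4)
gdp = pd.Series(
    [100.0, 102.0, np.nan, 106.0],
    index=["Q3 2020", "Q4 2020", "Q1 2021", "Q2 2021"],
)

# 1) Delete the observation that contains the missing value
dropped = gdp.dropna()

# 2) Assume the missing value equals the previous value (forward fill)
forward_filled = gdp.ffill()

# 3) Linear interpolation; for a single gap this equals the mean of the
#    previous and next values
interpolated = gdp.interpolate()

print(forward_filled["Q1 2021"])
print(interpolated["Q1 2021"])
```

Which fix is appropriate depends on the data: forward filling avoids using future information, whereas interpolation relies on the next observation, which would not be known in real time.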
Data science comprises a range of mathematical and statistical concepts and requires a deep understanding of machine learning algorithms. In this book, these concepts are discussed in detail but also in an easy-to-grasp manner to benefit both technical and nontechnical readers. Many models are assumed to be mystery boxes, and there is a hint of truth in this, but the job of a data scientist is to understand the models before interpreting their results. This helps in understanding the limitations of the models.
This book uses Python as the go-to programming language to create the algorithms. As mentioned, Chapter 6 introduces Python and the knowledge required to manipulate and analyze the data, but it also provides the foundations for creating the different models, which, as you will see, are simpler than you might expect.
Before moving on to the next section, let’s have a look at the concept of data storage. After all, data is valuable, but you need to store it in a place where it can be easily fetched and analyzed.
Data storage refers to the techniques and areas used to store and organize data for future analysis. Data is stored in many formats, such as CSV and XLSX. Other types of formats may include XML, JSON, and even JPEG for images. The format is chosen according to the structure and organization of the data.
Data can also be stored in the cloud or on premises, depending on your storage capacity and costs. For example, you may want to keep your historical one-minute Apple stock data in the cloud, instead of in a CSV file, so that you save space on your local computer.
When dealing with time series in Python, you are mostly going to deal with two types of data storage: arrays and dataframes. Let’s take a look at what they are:
An array is used to store elements of the same kind. Typically, a homogeneous dataset (such as numbers) is best kept in an array.
A dataframe is a two-dimensional structure that can hold data of various types. It can be compared to a table with columns and rows.
In general, arrays should be used whenever a homogeneous data collection needs to be efficiently stored. When dealing with heterogeneous data or when you need to edit and analyze data in a tabular manner, you should use dataframes.
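The difference between the two structures can be illustrated with NumPy and pandas, which are common choices for arrays and dataframes in Python. The tickers and prices below are made up for the example.

```python
import numpy as np
import pandas as pd

# An array holds elements of a single type (here, floats)
prices = np.array([101.2, 102.5, 100.8])

# A dataframe holds columns of different types in a tabular layout
trades = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],   # text (hypothetical tickers)
    "price": prices,                   # floats
    "is_long": [True, False, True],    # booleans
})

print(prices.dtype)    # one type for the whole array
print(trades.dtypes)   # one type per column
```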
Understanding Data Science
An algorithm is a set of ordered procedures that are designed to complete a certain activity or address a particular issue. An algorithm can be as simple as a coin flip or as sophisticated as the Risch algorithm.4
Let’s take a very simple algorithm that updates a charting platform with the necessary financial data. This algorithm would follow these steps:
- Connect the server to the online data provider.
- Copy the financial data with the most recent timestamp.
- Paste the data into the charting platform.
- Loop back to step 1 and redo the whole process.
That is the nature of algorithms: performing a certain set of instructions with a finite or an infinite goal.
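The charting steps listed above could be sketched as a simple loop. Everything in this snippet is hypothetical: fetch_latest_bar stands in for a real data-provider call and returns a made-up record, the chart is just a Python list, and the loop is bounded so that the example terminates.

```python
import time


def fetch_latest_bar():
    """Hypothetical stand-in for a data-provider call; returns a made-up record."""
    return {"timestamp": "2023-01-23 16:00", "close": 19.81}


def update_chart(chart, max_iterations=3, pause_seconds=0.0):
    """Repeatedly copy the latest data point into the chart."""
    for _ in range(max_iterations):   # step 4: loop back (bounded here)
        bar = fetch_latest_bar()      # steps 1-2: connect and copy the latest data
        chart.append(bar)             # step 3: paste it into the charting platform
        time.sleep(pause_seconds)     # wait before the next update
    return chart


chart = update_chart([])
print(len(chart))
```

A production version would run indefinitely and handle connection failures, which is what makes its goal "infinite" in the sense used above.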
The six data science steps discussed in the previous section can also be considered an algorithm.
Trading strategies are also algorithms, as they have clear rules for the initiation and liquidation of positions. An example of a trading strategy is market arbitrage.
Arbitrage is a type of trading strategy that aims to profit from price differences of the same asset quoted on different exchanges. These price differences are anomalies that are erased by arbitrageurs through their buying and selling activities. Consider a stock that is traded on exchange A and exchange B in different countries (for simplicity reasons, the two countries use the same currency). Naturally, the stock must trade at the same price on both exchanges. When this condition does not hold, arbitrageurs come out of their lairs to hunt.
They buy the stock on the cheaper exchange and immediately sell it on the more expensive exchange, thus ensuring a virtually risk-free profit. These operations are performed at lightning speed, as the price differences do not last long due to the sheer power and speed of arbitrageurs. To clarify, here’s an example:
- The stock’s price at exchange A = $10.00.
- The stock’s price at exchange B = $10.50.
The arbitrageur’s algorithm in this case will do the following:
- Buy the stock on exchange A for $10.00.
- Sell the stock immediately on exchange B for $10.50.
- Pocket the difference ($0.50) and repeat until the gap is closed.
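The arbitrageur's calculation can be written as a small function. This is a simplified sketch: the shares and cost_per_share parameters are assumptions added for illustration, and the function ignores latency, fees beyond a flat per-share cost, and the fact that prices move as orders are filled.

```python
def arbitrage_profit(price_a, price_b, shares=1, cost_per_share=0.0):
    """Profit from buying on the cheaper exchange and selling on the dearer one."""
    buy_price = min(price_a, price_b)
    sell_price = max(price_a, price_b)
    return (sell_price - buy_price - cost_per_share) * shares


# Prices from the example: $10.00 on exchange A and $10.50 on exchange B
profit = arbitrage_profit(10.00, 10.50)
print(f"Profit per share: ${profit:.2f}")
```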
Trading and execution algorithms can be highly complex and require specialized knowledge and a certain market edge.
At this point, you should be aware of the two main uses of data science, data interpretation and data prediction:
- Data interpretation
- Data prediction
The main aim of using learning algorithms in financial markets is to predict future asset prices so that you can make an informed trading decision that results in capital appreciation at a success rate higher than random. I discuss many simple and complex learning algorithms in this book. These learning algorithms or models can be categorized as follows:
- Supervised learning
Supervised learning algorithms are models that require labeled data to function. This means that you must provide data so that the model trains itself on these past values and understands the hidden patterns so that it can deliver future outputs when encountering new data. Examples of supervised learning include linear regression algorithms and random forest models.
- Unsupervised learning
Unsupervised learning algorithms are models that do not require labeled data to function. This means that they can do the job with unlabeled data since they are built to find hidden patterns on their own. Examples include clustering algorithms and principal component analysis (PCA).
- Reinforcement learning
Reinforcement learning algorithms are models that do not require a prepared dataset, as they explore their environment and learn from it on their own. In contrast to supervised and unsupervised learning models, reinforcement learning models gain knowledge through feedback obtained from the environment via a reward system. Since this is generally applied to situations in which an agent interacts with the environment and learns to adopt behaviors that maximize the reward over time, it may not be the go-to algorithm for time series regression. On the other hand, it can be used to develop a policy that can apply to time series data to create predictions.
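To make the idea of supervised learning more concrete, here is a minimal sketch of a linear regression fitted with NumPy. The data is synthetic (generated from y = 2x + 1 with a little noise), and np.polyfit is used only as a convenient least-squares fit; it is not the modeling approach used later in the book.

```python
import numpy as np

# Labeled data: inputs x with known outputs y (synthetic, y = 2x + 1 plus noise)
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)

# Training: fit a straight line to the labeled pairs by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Prediction on an input the model has not seen
prediction = slope * 25.0 + intercept

print(f"Estimated slope: {slope:.2f}, intercept: {intercept:.2f}")
print(f"Prediction for x = 25: {prediction:.2f}")
```

The labels (the y values) are what make this supervised: the model learns the mapping from x to y and then applies it to new inputs.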
As you may have noticed, the book’s title is Deep Learning for Finance. This means that in addition to covering other learning models, I will be spending a sizable portion of the book discussing deep learning models for time series prediction. Deep learning mostly revolves around the use of neural networks, an algorithm discussed in depth in Chapter 8.
Deep supervised learning models (such as deep neural networks) can learn hierarchical representations of the data because they include many layers, with each layer extracting features at a different level of abstraction. As a result, hidden and complex patterns are learned by deep models that may be difficult for shallow (not deep) models to learn.
On the other hand, shallow supervised learning models (like linear regression) have a limited ability to learn complex, nonlinear relationships. But they require less computational effort and are therefore faster.
Data science algorithms are deployed pretty much everywhere nowadays, not just in finance. Some applications include the following:
- Business analytics: Optimizing pricing, predicting customer turnover, or improving marketing initiatives using data analysis
- Healthcare: Improving patient outcomes, finding innovative therapies, or lowering healthcare costs through in-depth analysis of patient data
- Sports: Analyzing sports data to enhance team performance, player scouting, or bets
- Research: Analyzing data to support scientific investigation, prove theories, or gain new knowledge
When someone talks about data science applications, it helps to know what a data scientist does. A data scientist must evaluate and understand complex data in order to get insights and provide guidance for decision-making. Common tasks involved in this include developing statistical models, applying machine learning techniques, and visualizing data. They support the implementation of data-driven solutions and inform stakeholders of their results.
Data scientists are different from data engineers. Whereas a data scientist is concerned with the interpretation and analysis of data, a data engineer is concerned with the tools and infrastructure needed to gather, store, and analyze data.
Introduction to Financial Markets and Trading
The aim of this book is to present a hands-on approach to applying different learning models to forecast financial time series data. It is therefore imperative to gain solid knowledge of how trading and financial markets work.
Financial markets are places where people can trade financial instruments, such as stocks, bonds, and currencies. The act of buying and selling is referred to as trading. The main, but not only, aim of buying a financial instrument is capital appreciation. The buyer believes that the value of the instrument is greater than its price; therefore, the buyer buys the stock (goes long) and sells whenever they believe that the current price equals the current value. In contrast, traders can also make money if the price of the instrument goes down. This process is referred to as short selling and is common in certain markets such as futures and foreign exchange (FX).
The process of short selling entails borrowing the financial instrument from a third party, selling it on the market, and buying it back, before returning it to the third party. Ideally, as you expect the price of the instrument to drop, you would buy it back at a lower cost (after the price decrease) and give it back to the third party at the market price, thus pocketing the difference. The following examples explain these concepts further:
- Long (buy) position example
A trader expects the price of Microsoft shares to increase over the next couple of years due to improved technology regulations, which would increase earnings. They buy a number of shares at $250 and aim to sell them at $500. The trader therefore has a long position on Microsoft stock (also referred to as being bullish).
- Short (sell) position example
A trader expects the price of Lockheed Martin shares to decrease over the next couple of days due to signals from a technical strategy. They sell short a number of shares at $450 and aim to buy them back at $410. The trader therefore has a short position on Lockheed Martin stock (also referred to as being bearish).
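The profit of both positions can be computed with one small function. The position size of 10 shares is an assumption made for the example, and the sketch ignores transaction costs, dividends, and the borrowing fees that apply to short positions.

```python
def position_pnl(entry_price, exit_price, shares, side):
    """Profit or loss of a closed position, ignoring costs and fees."""
    if side == "long":
        return (exit_price - entry_price) * shares
    if side == "short":
        return (entry_price - exit_price) * shares
    raise ValueError("side must be 'long' or 'short'")


# Long example: buy at $250, sell at $500
long_pnl = position_pnl(250, 500, shares=10, side="long")

# Short example: sell short at $450, buy back at $410
short_pnl = position_pnl(450, 410, shares=10, side="short")

print(long_pnl, short_pnl)
```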
Financial instruments may come in their raw form (spot) or in a derivative form. Derivatives are products that traders use to trade markets in certain ways. For example, a forward or a futures contract is a derivative contract where a buyer locks in a price for an asset to buy it at a later time.
Another type of derivative is an option. A call option is the right, but not the obligation, to buy a certain asset at a specific price in the future by paying a premium now (the option’s price); a put option is the corresponding right to sell. When the buyer of a call wants to buy the underlying stock, they exercise their option to do so; otherwise, they may let the option expire.
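The choice between exercising and letting an option to buy expire can be expressed as a payoff at expiry. The strike, premium, and spot prices below are hypothetical, and the sketch ignores the time value of money.

```python
def call_profit_at_expiry(spot, strike, premium):
    """Profit of a bought option to buy: exercise only if it is worth something."""
    exercise_value = max(spot - strike, 0.0)
    return exercise_value - premium


# Hypothetical contract: strike of 100, premium of 5
profit_if_up = call_profit_at_expiry(spot=120.0, strike=100.0, premium=5.0)
profit_if_down = call_profit_at_expiry(spot=90.0, strike=100.0, premium=5.0)

print(profit_if_up)    # exercised: gain of 20 minus the premium
print(profit_if_down)  # expired unexercised: the loss is limited to the premium
```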
Trading activity may also occur for hedging purposes, as it is not limited to just speculation. An example of this is Air France (the main French airline company) hedging its business operations by buying oil futures. Buying oil futures protects Air France from rising oil prices that may hurt its main operations (aviation). The rising costs from using fuel to power the planes are offset by the gains from the futures. This allows the airline to focus on its main business. This whole process is called hedging.
As another example, let’s say an airline company expects to consume a certain amount of fuel in the next six months, but it is worried about the potential increase in oil prices over that period. To protect against this price risk, the airline can enter into a futures contract to purchase oil at a fixed price on a future date.
If the price of oil increases during that time, the airline would still be able to purchase the oil at the lower, fixed price agreed upon in the futures contract. If the price of oil decreases, the airline would be obligated to pay the higher, fixed price, but the lower market price for the oil would offset that cost.
In this way, the airline can mitigate the risk of price fluctuations in the oil market and stabilize its fuel costs. This can help the airline better manage its budget and forecast its future earnings. As you can see, the aim is not to make financial gains from the trading operations; it is simply to stabilize its costs by locking in a known price for oil.
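A short calculation shows why the hedge stabilizes costs. The sketch assumes a perfect hedge: the airline buys its fuel at the market price on delivery, and the gain or loss on the futures position exactly covers the difference from the locked-in price. The prices and the volume are hypothetical, and real hedges face complications such as basis risk and margin requirements.

```python
def hedged_fuel_cost(spot_at_delivery, futures_price, barrels):
    """Net cost of fuel bought at spot and hedged with long futures."""
    physical_cost = spot_at_delivery * barrels
    futures_pnl = (spot_at_delivery - futures_price) * barrels
    return physical_cost - futures_pnl


# Hypothetical hedge: 1,000 barrels locked in at 80 per barrel
cost_if_oil_rises = hedged_fuel_cost(100.0, 80.0, 1000)
cost_if_oil_falls = hedged_fuel_cost(60.0, 80.0, 1000)

print(cost_if_oil_rises, cost_if_oil_falls)
```

In both scenarios the net cost equals the locked-in price times the volume, which is the stability the airline is paying for.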
Typically, financial instruments are grouped into asset classes based on their type:
- Stock markets
A stock market is an exchange place (electronic or physical) where companies issue shares of stock to raise money for business. When people buy shares of a company’s stock, they become part owners of that company and may become entitled to dividends according to the company’s policy. Depending on the stock, they can also gain the right to vote in shareholder meetings.
- Fixed income
Governments and businesses can borrow money in the fixed income market. When a person purchases a bond, they are effectively lending money to the borrower, who has agreed to repay the loan along with interest. Depending on the borrower’s creditworthiness and the prevailing interest rates, the bond’s value may increase or decrease.
- Currencies
The FX market, also referred to as the currencies market, is a place where people may purchase and sell various currencies. The value of a nation’s currency can increase or decrease based on a variety of variables, including the economy, interest rates, and the nation’s political stability.
- Commodities
Agricultural products, gold, oil, and other physical assets with industrial or other uses are referred to as commodities. They typically offer a means to profit from global economic trends as well as being a form of hedge against inflation.
- Alternative investments
In the world of finance, nontraditional investments such as real estate, private equity funds, and hedge funds are referred to as alternative asset classes. These asset classes have the potential to offer better returns than traditional assets and offer the benefit of diversification, but they also tend to be less liquid and may be more difficult to evaluate.
It’s crucial to remember that each of these asset classes has unique qualities and various levels of risk, so investors should do their homework before investing in any of these assets.
Financial markets allow businesses and governments to raise the money they need to operate. They also allow investors to make money by speculating and investing in interesting opportunities. Trading activities provide liquidity to the markets. And the more liquid a market is, the easier and less costly it is to trade in it. But how do markets really work? What causes the prices to go up and down?
Market microstructure is the research that deals with the trading of securities in financial markets. It looks at how trading works as well as how traders, investors, and market makers behave. Understanding price formation and the variables that affect trading costs is the aim of market microstructure research.
Order flow, liquidity, market efficiency, and price discovery are just a few of the many subjects covered by market microstructure research. Additionally, this research looks at how various trading techniques, including limit orders, market orders, and algorithmic trading, affect market dynamics. Liquidity is possibly the most important market microstructure concept. It describes how easily an asset may be bought or sold without materially changing its price. Liquidity can vary among financial instruments and over time. It can be impacted by a number of variables, including trading volume and volatility.
Finally, I want to discuss another important area of market microstructure: price discovery. This refers to the method used to set prices in a market. Prices can be affected by elements like order flow, market maker activity, and the presence of various trading methods.
Imagine you want to buy a sizable number of shares in two stocks: stock A and stock B. Stock A is very liquid, while stock B is very illiquid. If you want to execute the buy order on stock A, you are likely to get filled at the desired market price with minimal to no impact. However, with stock B, you are likely to get a worse price, as there are not enough sellers willing to sell at your desired buy price. Therefore, as you create more demand from your orders, the price rises to match the sellers’ prices, and thus, you will buy at a higher (worse) price. This is the impact liquidity can have on your trading.
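The effect described above can be sketched by filling a buy order against the sell side of an order book. Both order books are hypothetical, and the model is deliberately simple: it takes the cheapest available shares first and assumes the book does not change while the order is being filled.

```python
def average_fill_price(asks, quantity):
    """Average price paid when buying `quantity` shares.

    asks: list of (price, size) pairs sorted by ascending price.
    """
    remaining = quantity
    cost = 0.0
    for price, size in asks:
        take = min(remaining, size)   # buy what this price level offers
        cost += take * price
        remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError("not enough liquidity to fill the order")
    return cost / quantity


# Hypothetical sell-side order books for a liquid and an illiquid stock
liquid_book = [(10.00, 5000), (10.01, 5000)]
illiquid_book = [(10.00, 100), (10.20, 200), (10.50, 700)]

fill_a = average_fill_price(liquid_book, 1000)
fill_b = average_fill_price(illiquid_book, 1000)

print(f"Stock A average fill: {fill_a:.2f}")
print(f"Stock B average fill: {fill_b:.2f}")
```

Both books start at the same best price, yet the same order costs noticeably more per share in the illiquid stock because it has to climb through thinner price levels.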
Applications of Data Science in Finance
Let’s begin peeking into the main areas of data science for finance. Every field has its challenges and problems that need simple and complex solutions. Finance is no different. Recent years have seen a gigantic increase in the use of data science to improve the world of finance, from the corporate world to the markets world. Let’s discuss some of these areas:
- Forecasting the market’s direction
- Financial fraud detection
Financial transactions can be examined for patterns and anomalies using data science models, which attempt to spot possible fraud. One way to use data science to stop financial fraud is to examine credit card transaction data for unusual or suspicious patterns of expenditures, such as numerous minor purchases made in quick succession or significant or frequent purchases made from the same store.
- Risk management
Data science can be used to examine financial data and spot potential risks to portfolios. This can involve analyzing vast amounts of historical data using methods like statistical modeling, machine learning, and artificial intelligence to spot patterns and trends that can be used to forecast risk factors.
- Credit scoring
Data science can be used to examine financial data and credit history, forecast a person’s or a company’s creditworthiness, and make loan decisions. Utilizing financial data, such as income and credit history, to forecast a person’s creditworthiness is one example of applying data science for credit score research. This can involve using techniques such as statistical modeling and machine learning to develop a prediction model that can use a number of indicators, such as prior credit performance, income, and job history, to evaluate a person’s likelihood of repaying a loan.
- Natural language processing
NLP analyzes unstructured financial text, such as news articles, reports, and social media posts, and extracts insights from it to support better decisions. One common application is sentiment analysis, which (with help from machine learning) uses the sentiment of the text to identify possible trading opportunities stemming from the intentions and feelings of market participants and experts.
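As a small illustration of the fraud detection item above, the sketch below flags a card when too many purchases fall within a short time window. It is a hand-written rule rather than a learned model, and the thresholds (more than three purchases within ten minutes) and the timestamps are assumptions chosen for the example.

```python
def has_rapid_burst(timestamps_minutes, max_purchases=3, window_minutes=10):
    """Return True if more than max_purchases fall inside any time window."""
    times = sorted(timestamps_minutes)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most window_minutes
        while times[end] - times[start] > window_minutes:
            start += 1
        if end - start + 1 > max_purchases:
            return True
    return False


# Hypothetical purchase times, in minutes since the first transaction
suspicious_card = [0, 2, 3, 5, 300]
ordinary_card = [0, 60, 120, 180]

print(has_rapid_burst(suspicious_card))
print(has_rapid_burst(ordinary_card))
```

A data science model would typically learn such patterns from labeled or unlabeled transaction data instead of relying on fixed thresholds.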
The data science field keeps growing every day with the ongoing introduction of new techniques and models for improving data interpretation. This chapter provided a simple introduction to what you need to know about data science and how you can use it in finance.
The next three chapters present the knowledge in statistics, probability, and math that you may need when trying to understand data science models. Even though the aim of the book is to present a hands-on approach to creating and applying the different models using Python, it helps for you to understand what you’re dealing with instead of blindly applying them to data.
If you need a Python refresher, see Chapter 6, which is a basic introduction. It sets the foundation for what’s to come next in the book. You do not need to become a Python master, but you must understand the code and what it refers to, and especially how to debug and detect errors in the code.
1 OHLC refers to the four essential pieces of market data: open price, high price, low price, and close price.
2 The reason I am saying “implied” is that stationarity must be verified through statistical checks that you will see in Chapter 3. At the moment, the assumption is that differencing the data gives stationary time series.