Chapter 1. Financial Data Engineering Clarified
Given all the payments, transfers, trades, and other financial activities that take place daily, can you imagine how much data the global financial sector generates? According to a 2011 report by the McKinsey Global Institute, the banking and investment sector in the US alone stores and manages more than one exabyte of data. To put that in perspective, an exabyte is the equivalent of one billion gigabytes and translates into trillions of digital records. The same report shows that, on average, financial services firms generate and store more data than firms in other sectors. Some statistics are even more astonishing: for instance, JPMorgan Chase, the largest bank in the United States by market capitalization, manages more than 450 petabytes of data, and Bank of New York Mellon, a global financial services company specializing in investment management and investment services, manages over 110 million gigabytes of global financial data.
Naturally, we might extrapolate these figures to tens or even hundreds of exabytes once we take into account the global context and the constantly expanding financial landscape. Data thus sits at the heart of the financial system, serving as both the input to different financial operations and the output generated from them. Importantly, to guarantee a healthy and well-functioning system, a reliable and secure data infrastructure is needed for generating, exchanging, storing, and consuming all kinds of financial data. In addition, this infrastructure must adhere to the financial sector’s specific requirements, constraints, practices, and regulations. This is where financial data engineering enters the picture. To get started, this chapter will introduce you to finance, financial data engineering, and the role and skills of the financial data engineer.
Defining Financial Data Engineering
Data engineering has always been a vibrant and innovative field from both industry and research standpoints. If you are a data engineer, you are likely aware of how many data-related technologies are released and popularized every year. Several factors drive these developments:
- The growing importance of data as a key input in the creation of digital products and services
- Large digital companies, such as LinkedIn, Netflix, Google, Meta, and Airbnb, releasing the data frameworks they developed internally to handle massive volumes of data and traffic as open source projects
- The impressive success of open source alternatives, which has fueled interest from individuals and businesses in developing and evaluating new tools and ideas
As an industry practice, data engineering has undergone several conceptual and technological evolution episodes. Without offering a detailed historical account, I would simply say that data engineering was born with the introduction of Structured Query Language (SQL) and data warehousing in the 1970s and 1980s. Companies like IBM and Oracle were early pioneers in the field, playing a key role in developing and popularizing many of the fundamental principles of data engineering.
Until the early 2000s, data engineering responsibilities were primarily handled by information technology (IT) teams. Roles such as database administrator, database developer, and system administrator were prevalent in the data job market.
With the global rise and adoption of the internet and social media, the so-called big data revolution marked a major step toward contemporary data engineering. Using the release date of Apache Hadoop as a reference, I would say that the big data era started around 2005. Pioneers like Google, Airbnb, Meta, Microsoft, Amazon, and Netflix have popularized a more specialized and advanced version of data engineering. This includes big data frameworks, open source tools, cloud computing, alternative data, and streaming technologies.
The financial sector has actively participated in this dynamic environment as both an observer and an adopter of data technologies. This active involvement stems from the financial industry’s continuous evolution in response to market demands and regulatory changes, which often necessitates the adoption of new technologies. Importantly, data engineering practices in finance are heavily domain driven, given the distinct requirements of the financial sector in terms of security, governance, and regulation, as well as the complex nature of the financial data landscape and financial data management challenges.
Considering these factors, this book will present financial data engineering as a domain-driven field within data engineering, specifically tailored to the financial sector, thereby setting it apart from traditional data engineering. To further justify the need for financial data engineering, the upcoming sections will provide a brief introduction to the finance domain, outline the data-related challenges encountered in financial markets, offer definitions of data engineering and financial data engineering, and provide an overview of the role and responsibilities of a financial data engineer.
First of All, What Is Finance?
Despite the extensive use of the term finance, there can be a lot of confusion about what it really means. This is because finance is a multifaceted concept that can be approached from different angles (see Figure 1-1). To equip you with basic domain knowledge, the next sections present a short conceptual illustration of finance from four main perspectives: economics, market, science, and technology.
Finance as an economic function
In economic theory, finance is an institution that mediates between agents who are in deficit (who need more money than they have) and those in surplus (who have more money than they spend). To secure funds, agents in deficit offer to borrow money from agents with a surplus in exchange for an interest payment.
This perspective highlights the vital role of finance in the economy: it offers individuals a means to invest their savings, allows families to purchase a house through a mortgage, provides businesses with capital to get started, empowers universities to invest their assets and expand their campus, and enables governments to finance public projects to fulfill societal needs.
For economists, finance is one of the primary drivers of economic growth. This is why strong economies tend to have large, efficient, and inclusive financial markets. To ensure the stability and fairness of financial markets, several regulatory agencies and regulations have been established.
A major subject that financial economists often investigate is market equilibrium, which describes a state where demand and supply intersect, resulting in a stable market price. In financial markets, this price is commonly represented by the interest rate, with supply and demand reflecting the quantity of money in circulation. When demand exceeds supply, interest rates typically rise, whereas if supply surpasses demand, interest rates tend to decrease. Entities such as central banks were established to implement monetary policies aimed at maintaining market interest rates as closely aligned with equilibrium as possible.
Finance as a market
To enable individuals and companies to engage efficiently in financial activities, financial markets have emerged, hosting a vast array of financial institutions, products, and services. Nowadays, if we take a well-developed financial sector, we can find a large variety of market players. These may include the following:
- Commercial banks (e.g., HSBC, Bank of America)
- Investment banks (e.g., Morgan Stanley, Goldman Sachs)
- Asset managers (e.g., BlackRock, The Vanguard Group)
- Security exchanges (e.g., New York Stock Exchange [NYSE], London Stock Exchange, Chicago Mercantile Exchange)
- Hedge funds (e.g., Citadel, Renaissance Technologies)
- Mutual funds (e.g., Vanguard Mid-Cap Value Index Fund)
- Insurance companies (e.g., Allianz, AIG)
- Central banks (e.g., Federal Reserve, European Central Bank)
- Government-sponsored enterprises (e.g., Fannie Mae, Freddie Mac)
- Regulators (e.g., Securities and Exchange Commission)
- Industry trade groups (e.g., Securities Industry and Financial Markets Association)
- Credit rating agencies (e.g., S&P Global Ratings, Moody’s)
- Data vendors (e.g., Bloomberg, London Stock Exchange Group [LSEG])
- FinTech companies (e.g., Revolut, Wise, Betterment)
- Big tech companies (e.g., Amazon Cash, Amazon Pay, Apple Pay, Google Pay)
Note
The terms “financial institution,” “financial firm,” “financial company,” and “financial organization” might often be used interchangeably. However, from an economic theory standpoint, “financial institution” may be the most appropriate term to use, as it represents an abstract concept encompassing any company, agency, firm, or organization that serves a specific purpose or function within financial markets. For this reason, I will be mostly using the term “financial institution” throughout this book.
The primary unit of exchange in financial markets is commonly referred to as a financial asset, instrument, or security. Many kinds of financial assets can be bought and sold in financial markets. Here are a few:1
- Shares of companies (e.g., common stocks)
- Fixed income instruments (e.g., corporate bonds, treasury bills)
- Derivatives (e.g., options, futures, swaps, forwards)
- Fund shares (e.g., mutual funds, exchange-traded funds)
Given the large and diverse number of financial instruments and transactions, financial markets are further classified into categories, such as the following:
- Money markets (for liquid, short-term exchanges)
- Capital markets (for long-term exchanges)
- Primary markets (for new issues of instruments)
- Secondary markets (for already issued instruments)
- Foreign exchange markets (for trading currencies)
- Commodity markets (for trading raw materials such as gold and oil)
- Equity markets (for trading stocks)
- Fixed-income markets (for trading bonds)
Finance as a research field
Finance is a well-known and extensive field of academic and empirical research. One major area of investigation is asset pricing theory, which aims to understand and calculate the price of claims to risky (uncertain) assets such as stocks, bonds, and derivatives. Within this theory, low prices often translate into a high rate of return, so we can think of financial asset pricing theory as a way to explain why certain financial assets pay (or should pay) higher average returns than others.
Another major field of financial research is risk management, which focuses on measuring and managing the uncertainty around the future value of a financial asset or a portfolio of assets. Other areas of investigation include portfolio management, corporate finance, financial accounting, credit scoring, financial engineering, stock prediction, and performance evaluation.
To publish financial research findings, a variety of peer-reviewed journals have been established. Some of these journals offer broad coverage, while others are more specialized. Here are some examples:
- The Journal of Finance: Covers theoretical and empirical research on all major areas of finance
- The Review of Financial Studies: Covers theoretical and empirical topics in financial economics
- The Journal of Banking and Finance: Covers theoretical and empirical topics in finance and banking, with a focus on financial institutions and money and capital markets
- Quantitative Finance: Covers theoretical and empirical interdisciplinary research on quantitative methods of finance
- The Journal of Portfolio Management: Covers topics related to finance and investing, such as risk management, portfolio optimization, and performance measurement
- The Journal of Financial Data Science: Covers data-driven research in finance using machine learning, artificial intelligence, and big data analytics
- The Journal of Securities Operations & Custody: Covers topics and issues related to securities trading, clearing, settlement, financial standards, and more
In addition to academic journals, a large number of conferences, events, and summits are regularly held to share and discuss the latest developments in financial research. Examples include the Western Finance Association meetings, the American Finance Association meetings, and the Society for Financial Studies Cavalcades. Furthermore, globally renowned certifications like the Chartered Financial Analyst (CFA) are available to aspiring financial specialists who wish to acquire strong ethical and technical foundations in investment research and portfolio management.
Finance as a technology
Finally, finance can refer to the set of technologies and tools enabling all kinds of financial transactions and activities. Examples include the following:
- Payment systems (mobile, contactless, real-time, digital wallets, gateways, etc.)
- Blockchain and distributed ledger technology (DLT)
- Financial market infrastructures (e.g., Euroclear, Clearstream, Fedwire, T2, CHAPS)
- Trading platforms
- Stock exchanges (e.g., NYSE, NASDAQ, Tokyo Stock Exchange)
- Stock market data systems
- Automated teller machines (ATMs)
- Order management systems (OMSs)
- Risk management systems
- Algorithmic trading and high-frequency trading (HFT) systems
- Smart order routing (SOR) systems
This diverse array of technologies in the financial sector is crucial for maintaining the efficiency and reliability of global financial markets.
Defining Data Engineering
Now that we have a foundational understanding of finance, let’s explore what financial data engineering is. To do this, I’ll first explain traditional data engineering, as it is a widely recognized term in the industry.
If we Google the words “what is data engineering,” we get more than two billion search results. That’s quite a lot, but to be more pragmatic, we can do a more advanced inquiry by searching Google Scholar for all papers and books where the term “data engineering” occurs in the title. Such a query returns a relatively large number of results (around 2,290 scientific publications), as shown in Figure 1-2.
I highly recommend you read some of the publications that Google Scholar returns for data engineering. Interestingly, you will quickly notice that there is quite a high variety of definitions for data engineering. This is expected, as the field of data engineering sits at the intersection between multiple fields, including software engineering, infrastructure engineering, data analysis, networking, software and data architecture, data governance, and other data management-related areas.2
For illustrative purposes, let’s consider the following selected definitions:
Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering.
Joe Reis and Matt Housley, Fundamentals of Data Engineering (O’Reilly, 2022)
Data engineering is all about the movement, manipulation, and management of data.
Lewis Gavin, What Is Data Engineering? (O’Reilly 2019)
Data engineering is the process of designing and building systems that let people collect and analyze raw data from multiple sources and formats. These systems empower people to find practical applications of the data, which businesses can use to thrive.
As you can see, all three definitions are quite different, but if we make an effort to extract the main defining elements, we can infer that data engineering revolves around the design and implementation of an infrastructure that enables an organization to retrieve data from one or more sources, transform it, store it in a target destination, and make it consumable by end users. Naturally, in practice, the complexity of such a process would depend on the technical and business requirements and constraints, which vary on a case-by-case basis. Given this context, I will use the following definition of data engineering throughout this book:
Data engineering is a field of practice and research that focuses on designing and implementing data infrastructure intended to reliably and securely perform tasks such as data ingestion, transformation, storage, and delivery. This infrastructure is tailored to meet varying business requirements, industry practices, and external factors such as regulatory compliance and privacy considerations.
Throughout this book, we’ll focus on the concept of financial data infrastructure as the cornerstone of financial data engineering. Along the way, we will examine the components of a financial data infrastructure, which include physical (hardware) and virtual (software) resources and systems for storing, processing, managing, and transmitting financial data. Furthermore, we will discuss the essential capabilities and features of a financial data infrastructure, such as security, traceability, scalability, observability, and reliability.
With this definition in mind, let’s now proceed to clarify the meaning of financial data engineering.
Defining Financial Data Engineering
Financial data engineering shares most of the traditional data engineering tools, patterns, practices, and technologies. However, when designing and building a financial data infrastructure, relying only on traditional data engineering is not sufficient. You are very likely going to deal with domain-specific issues such as the complex financial data landscape (e.g., a large number of data sources, types, vendors, structures, etc.), the regulatory requirements for reporting and governance, the challenges related to entity and identification systems, the special requirements in terms of speed and volume, and a variety of constraints on delivery, ingestion, storage, and processing.3
Given such domain-driven particularities, financial data engineering deserves to be treated as a specialized field that sits at the intersection between traditional data engineering, financial domain knowledge, and financial data (as illustrated in Figure 1-3). More formally, this book defines financial data engineering as follows:
Financial data engineering is the domain-driven practice of designing, implementing, and maintaining data infrastructure to enable the collection, transformation, storage, consumption, monitoring, and management of financial data coming from mixed sources, with different frequencies, structures, delivery mechanisms, formats, identifiers, and entities, while following secure, compliant, and reliable standards.
Note
Don’t confuse financial data engineering with financial engineering. Financial engineering is an interdisciplinary applied field that uses mathematics, statistics, econometrics, financial theory, and computer science to develop financial investment strategies, financial products, and financial processes.4
Now that you know what financial data engineering is, you may be wondering why it matters to financial institutions and markets and why we should write a book about it. The next section addresses these questions in detail.
Why Financial Data Engineering?
One of the main goals of this book is to illustrate how financial data engineering is unique in terms of the domain-driven elements that characterize it. To understand why the market demands financial data engineering, it is crucial to examine the main factors shaping and driving data-driven needs and trends in the financial sector. The next few sections will provide a detailed account of these factors.
Volume, Variety, and Velocity of Financial Data
One of the primary factors that have been transforming the financial sector is big data. In this book, big data is simply defined as a combination of three attributes: large size (volume), high dimensionality and complexity (variety), and speed of generation (velocity). Let’s explore each of these Vs in detail.
Volume
When referencing big data, it is hard to deny that it is primarily about size. Data can be large in either absolute or relative terms. Data is large in absolute terms if it gets generated in a remarkably enormous and nonlinear quantity. An absolute increase in data size is often the result of socio-technological changes that induce a structural alteration in the data generation process. For example, card payments were once reserved primarily for major purchases and were relatively limited, whereas today people use cards and phones to pay for almost everything, from groceries to electronics. This shift has led to a remarkable absolute increase in the amount of payment data being generated and collected.
In addition, the rapid development and adoption of digital automated technologies, in particular electronic exchange mechanisms, have resulted in an absolute increase in the sheer volume of financial data generated. The emergence of high-frequency trading is a good example. For instance, a single day’s worth of data from the New York Stock Exchange’s high-frequency dataset, Trade and Quotes (TAQ), comprises approximately 2.3 billion records. With the implementation of high-frequency trading technologies, financial data began to be recorded at incredibly fine intervals, including the millisecond (one-thousandth of a second), microsecond (one-millionth of a second), and even nanosecond (one-billionth of a second) levels.
On the other hand, data is considered relatively large if its size is big compared to other existing datasets. Improved data collection is perhaps the main driver behind the relative increase in financial data volumes. This has been facilitated by technological advancements enabling more efficient data collection, regulatory requirements imposing stricter data collection and reporting requirements, the increasing complexity of financial instruments necessitating the collection of data for risk management, and the growing demand for data-driven insights within the financial sector. As an example, the Options Price Reporting Authority (OPRA), which collects and consolidates all the trades and quotes from member option exchanges in the United States, reported an astonishing peak rate of 45.9 million messages per second in February 2024.5
With large volumes of financial data comes a new space of opportunities:
- Overcoming sample selection bias that might exist in small datasets
- Enabling investors and traders to access high-frequency market data
- Capturing patterns and financial activities not represented in small datasets
- Monitoring and detecting fraud, market anomalies, and irregularities
- Enabling the use of advanced machine learning and data mining techniques that can capture complex and nonlinear signals
- Alleviating the problem of high dimensionality in machine learning, where the number of features is significantly high compared to the number of observations
- Facilitating the development of financial data products that are derived from data, improve with data, and produce additional data
However, such opportunities come with technical challenges, mostly related to data engineering:
- Collecting and storing large volumes of financial data from various sources efficiently
- Designing querying systems that enable users to retrieve extensive datasets quickly
- Building a data infrastructure capable of handling any data size seamlessly
- Establishing rules and procedures to ensure data quality and integrity
- Aggregating large volumes of data from multiple sources
- Linking records across multiple high-frequency datasets
The frequency at which data is generated and collected greatly impacts financial data volumes. A process that produces one million records per second generates significantly larger data volumes compared to a process that produces one thousand records per second. This rate of data generation is known as data velocity and will be discussed in the following section.
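To make this comparison concrete, here is a back-of-envelope calculation of how generation rate translates into daily volume. The 100-bytes-per-record size is a hypothetical assumption chosen purely for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def daily_volume(records_per_second: int, bytes_per_record: int = 100):
    """Back-of-envelope daily record count and size in terabytes,
    assuming a hypothetical fixed record size."""
    records = records_per_second * SECONDS_PER_DAY
    terabytes = records * bytes_per_record / 1e12
    return records, terabytes

fast = daily_volume(1_000_000)  # one million records per second
slow = daily_volume(1_000)      # one thousand records per second
```

At this assumed record size, one million records per second works out to 86.4 billion records (roughly 8.6 TB) per day, while one thousand records per second yields about 86 million records (under 10 GB) per day, a thousandfold difference in what the infrastructure must absorb.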
Velocity
Data velocity refers to the speed at which data is generated and ingested. Recent years have seen an increase in the velocity of data generation in financial markets. High-frequency trading, financial transactions, financial news feeds, and finance-related social media posts all produce data at high speeds.
With increased financial data velocity, new opportunities emerge:
- Quicker reaction times as data arrives shortly after generation
- Deeper and more immediate insights into intraday dynamics, such as price fluctuations and patterns emerging within an hour, minute, or second
- Enhanced market monitoring
- Development of new trading strategies, including algorithmic trading and high-frequency trading
Crucially, high data velocity introduces critical challenges for data infrastructures:
- Volume: How to build event-driven systems that can handle the arrival of large amounts of data in real time
- Speed: How to build a data infrastructure that can reliably cope with the speed of information transmission in financial markets
- Reaction time: How to build pipelines that can react as quickly as possible to new data arrival yet guarantee quality checks and reliability
- Variety/multistream: How to handle the arrival of many types of data from multiple sources in real time
The exponential increase in financial data volumes and the velocity of data generation doesn’t occur uniformly. Alongside this growth, new data types, formats, and structures have emerged to fulfill various business and technical requirements. The following section will explore this diversity of data in depth.
Variety
The third feature that defines big data is variety, which refers to the presence of many data types, formats, or structures. To better describe this concept, let’s illustrate the three types of structures that data can have:
- Structured data: This data has a clear format and data model, is easy to organize and store, and is ready to analyze. The most common example is tabular data organized as rows and columns.
- Semi-structured data: This type of data lacks a straightforward tabular format but has some structural properties that make it manageable. Often, semi-structured data is parsed and stored in a tabular format for ease of use. Examples include XML and JSON, which store data in a hierarchical, tree-like format.
- Unstructured data: This data lacks any predefined structure or formatting and requires parsing and preprocessing using specialized techniques before analysis. The majority of data worldwide is unstructured, including formats like PDF, HTML, text, video, and audio.
The variety of financial data has significantly increased in recent years. For example, the US Securities and Exchange Commission’s Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) receives and handles about two million filings a year. These filings can be complex documents, many of which contain multiple attachments, scores of pages, and several thousand pieces of information. Another example is alternative data sources such as news, weather, satellite images, social media posts, and web search activities, which have been shown to be highly valuable for financial analysis and product development.6
Increased variety in financial data opens up new opportunities:
- Incorporating new variables into financial analysis for enhanced predictions
- Capturing new economic and financial activities that can’t be analyzed using structured data alone
- Facilitating the development and integration of innovative financial products like news analytics, fraud detection, and financial networks
- Enhancing regulatory capabilities to capture complex market structures for more effective oversight
However, data variety also presents several data engineering challenges:
- Building a data infrastructure capable of efficiently storing and managing diverse types of financial data, including structured, semi-structured, and unstructured formats
- Implementing data aggregation systems to consolidate different data types into a single access point
- Developing methodologies for cleaning and transforming new structures of financial data
- Establishing specialized pipelines to process varied types of financial data, such as natural language processing for text and deep learning for images
- Implementing identification and entity management systems to link entities across a wide range of data sources
Finance-Specific Data Requirements and Problems
The financial industry has always witnessed constant transformation: new players joining and disrupting the competitive landscape, new technologies emerging and revolutionizing the way financial markets function, new data sources expanding the space of opportunities, and new standards and regulations getting released, promoted, and enforced.
Given these dynamics, the financial industry sets itself apart in terms of the issues and challenges that its participants face. A few key ones are listed here:
- There is a lack of standardization in some key areas:
  - Identification systems for financial data
  - Classification systems for financial assets and sectors
  - Financial information exchange
- Lack of established data standards for financial transaction processing
- Dispersed and diverse sources of financial data
- Adoption of multiple data formats by companies, data vendors, providers, and regulators
- Complexity in matching and identifying entities within financial datasets
- Lack of reliable methods to define, store, and manage financial reference data (discussed in Chapter 2)
- Lack of relevant data for understanding and managing various financial problems due to poor data collection processes (e.g., granular data on financial market dependencies and exposures necessary for systemic risk analysis)
- The constant need to adapt data and tech infrastructure to meet new market and regulatory demands (e.g., the EU’s Instant Payments Regulation requires all payment service providers to offer 24/7 euro payments within seconds, necessitating upgrades to legacy systems)
- The constant need to record, store, and share financial data for various regulatory and market purposes (e.g., the EU’s Central Electronic System of Payment Information mandates that payment system providers track cross-border payment data and share it with the tax authorities of EU member states)
- Absence of standardized practices for cleaning and ensuring the quality of financial data
- Difficulty in aggregating data across various silos and divisions within financial institutions
- Technological and market challenges in creating consolidated tapes, which integrate market data from multiple sources, including trade and quote information across various venues
- Balancing innovation and competitiveness with regulatory compliance
- Persisting concerns regarding security, privacy, and performance in cloud migration strategies
- Continued reliance on legacy technological systems due to organizational inertia and risk aversion
Over the years, a number of industry and regulatory initiatives have been proposed to tackle these issues. For example, to facilitate the standardized delivery of financial services and products, the United States established the Accredited Standards Committee X9 (ASC X9) to create, maintain, and promote voluntary consensus standards for the financial industry. In addition to setting national standards in the United States, ASC X9 can submit standards to the International Organization for Standardization (ISO) in Geneva, Switzerland, for consideration as international ISO standards. ASC X9 develops standards for many different areas and technologies, including electronic legal orders for financial institutions, electronic benefits and mobile payments, financial identifiers, fast payment systems, cryptography, payment messages, and more.
Additionally, international agencies such as the Association of National Numbering Agencies (ANNA) were established to coordinate and foster the adoption of ISO-based financial identifiers (covered in Chapter 3). Frameworks such as eXtensible Business Reporting Language (XBRL) (discussed in Chapter 7) were developed to standardize the communication and reporting of business information. Following the financial crisis of 2007–2008, the financial industry realized the need for a standardized identifier for legal entities involved in market transactions, which led to the development of the celebrated Legal Entity Identifier (LEI), discussed in Chapter 3.
Furthermore, financial market players have also been actively contributing solutions to the above-mentioned problems. To give a few examples, Bloomberg is currently promoting its Financial Instrument Global Identifier (FIGI) as an open standard for identifying financial instruments; LSEG released its Permanent Identifier (PermID) to complement existing market identifiers; and financial institutions such as JPMorgan have been pioneers in promoting market practices such as Value at Risk (VaR) in the 1990s and, more recently, the use of financial APIs to support fast, real-time data transactions.
Financial Machine Learning
Machine learning (ML) stands out as one of the most promising investments for shaping the future of the financial industry. To understand what machine learning is, it helps to first understand what artificial intelligence (AI) is. Although there is no universally accepted definition of artificial intelligence, in its simplest form, AI aims to understand the nature of intelligence in order to build systems that can reliably perform tasks that usually require human intelligence, such as speech recognition, visual perception, decision-making, and language understanding. Figure 1-4 illustrates the various fields of inquiry in artificial intelligence.
Machine learning stands out as a highly popular and significant subfield within AI. It focuses on building systems that can discover patterns from data, learn from their mistakes, and make predictions. The key word in machine learning is learning, which the computer scientist Tom Mitchell eloquently illustrates as follows:7
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Machine learning scientists and practitioners often develop models based on three types of learning: supervised, unsupervised, and reinforcement learning. Let’s explore each in detail.
Supervised learning
Supervised learning describes a learning approach that relies on an annotated (labeled) dataset consisting of a set of explanatory variables (called features) and a response variable (called a label). In a supervised setting, the model is trained to identify patterns using the explanatory variables. The training process involves showing the model the actual value (label) it should have predicted, hence the term “supervised,” and allowing it to learn from its mistakes (as illustrated in Figure 1-5).
When building a supervised system, modelers start by fitting one or more models on training data, where features and labels are known, via a selected optimization process such as gradient descent. Next, the fitted model(s) are evaluated on a second chunk of the data, called the validation dataset. The validation dataset allows the machine learning expert to fine-tune the so-called model hyperparameters, such as the strength of regularization. Regularization is a technique used to achieve a balance between bias (error caused by overly simple assumptions, which makes the model underfit the training data) and variance (sensitivity to the particulars of the training data, which makes the model generalize poorly to instances unseen during training). Finally, a test dataset is used to evaluate the performance of the model that did best on the validation dataset. Performance metrics include accuracy, precision, root mean square error (RMSE), and mean square error (MSE), to name a few.8
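As a minimal sketch of this train/validate/test workflow, the following example fits a ridge-regularized linear model by gradient descent, tunes the regularization strength on a validation set, and reports the test RMSE. The dataset, model, and hyperparameter grid are synthetic assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset (hypothetical): three features and a noisy linear response.
n = 300
X = rng.normal(size=(n, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.3, size=n)

# Split into training, validation, and test sets.
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_test, y_test = X[250:], y[250:]

def fit_ridge_gd(X, y, lam, lr=0.1, epochs=500):
    """Fit a ridge-regularized linear model via gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
        w -= lr * grad
    return w

def rmse(w, X, y):
    """Root mean square error of the linear model w on (X, y)."""
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

# Hyperparameter tuning: pick the regularization strength (lambda)
# that performs best on the validation set.
best_lam, best_w, best_score = None, None, float("inf")
for lam in [0.0, 0.01, 0.1, 1.0]:
    w = fit_ridge_gd(X_train, y_train, lam)
    score = rmse(w, X_val, y_val)
    if score < best_score:
        best_lam, best_w, best_score = lam, w, score

# Final, unbiased performance estimate on the held-out test set.
test_rmse = rmse(best_w, X_test, y_test)
print(f"best lambda: {best_lam}, test RMSE: {test_rmse:.3f}")
```

In practice, a library such as scikit-learn would handle the splitting, fitting, and tuning, but the division of labor between the three datasets is exactly the one shown here.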
Supervised learning can be divided into two categories: classification, which predicts a class label for a categorical variable, and regression, which predicts a quantity for a numerical variable. Linear regression, autoregressive models, generalized additive models, neural networks, and tree-based models are well-known regression methods. For classification tasks, methods such as logistic regression, support vector machines, linear discriminant analysis, tree models, and artificial neural networks are commonly used.
In finance, supervised learning is extensively employed for both classification and regression tasks. Examples of financial regression problems include stock price forecasting, volatility estimation and prediction, asset pricing, and risk assessment. Classification problems are also plentiful in finance, for example, credit scoring, default prediction, corporate action prediction, fraud detection, and credit risk rating.
Unsupervised learning
Unsupervised learning is used to extract patterns and relationships within data without relying on known target response values (labels). Unlike supervised learning, it does not have a teacher (supervisor) correcting the model based on knowledge of the correct answer (as illustrated in Figure 1-6).
There are two main types of unsupervised learning: clustering, where a model is trained to learn and find groups (clusters) in the data, and density estimation, which tries to summarize the distribution of the data. Examples of clustering techniques include k-means and hierarchical clustering, often applied together with dimensionality reduction methods such as principal component analysis, while the kernel density estimator is perhaps the most common example of density estimation techniques.9
Unsupervised learning applications in finance are still in their early stages, but the future trend is promising. For example, clustering can be used to group similar financial time series, cluster stocks into groups based on sector or risk profile, analyze customer and market segmentation, and find similar firms or customers to assign similar scores or ratings.
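To make the stock-clustering idea concrete, here is a minimal, self-contained k-means sketch that groups hypothetical stocks by two synthetic features, mean daily return and daily volatility. The data and features are assumptions for illustration; a production system would typically use a library implementation such as scikit-learn:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature matrix: each row is a stock described by two
# synthetic features, (mean daily return, daily volatility). Two groups
# are simulated: low-risk/low-return and high-risk/high-return stocks.
low_risk = rng.normal(loc=[0.0002, 0.01], scale=0.002, size=(20, 2))
high_risk = rng.normal(loc=[0.001, 0.04], scale=0.002, size=(20, 2))
stocks = np.vstack([low_risk, high_risk])

def kmeans(points, k, iters=50):
    """Minimal k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(k - 1):
        # Next centroid: the point farthest from all chosen centroids.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(d))])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(stocks, k=2)
print("cluster labels:", labels)
```

On this well-separated synthetic data, the two recovered clusters coincide with the simulated risk profiles; real market data is far noisier and usually requires feature scaling first.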
Reinforcement learning
In reinforcement learning, an artificial agent is placed in an environment where it can perform a sequence of actions over a state space and learn to make better decisions via a feedback mechanism. The key difference between this technique and supervised learning is that the feedback from the teacher is not about providing the right answer (true label); instead, the agent is given a reward (positive or negative) in order to encourage certain behaviors (actions) and punish others (see Figure 1-7).
As many financial activities entail decision-making by agents, there has been a considerable interest among financial practitioners and researchers in reinforcement learning, which centers on optimal decision-making. Financial applications of reinforcement learning include portfolio selection and optimization, optimal trade execution, and market-making.10
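The reward-driven feedback loop described above can be illustrated with a minimal tabular Q-learning agent on a deliberately simple toy environment (not a financial one); all parameters below are illustrative assumptions:

```python
import numpy as np

# Toy Markov decision process: five states arranged in a line, with
# actions 0 = left and 1 = right. Reaching state 4 pays a reward of +1
# and ends the episode, so the optimal policy is to always move right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.5  # high epsilon keeps the toy agent exploring
rng = np.random.default_rng(0)

def step(state, action):
    """Environment dynamics: move one step left or right along the line."""
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for _ in range(300):  # episodes
    state = 0
    for _ in range(1000):  # step cap to bound episode length
        # Epsilon-greedy action selection: explore sometimes, exploit otherwise.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(Q[state].argmax())
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + discounted best future value.
        target = reward + gamma * (0.0 if done else Q[next_state].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if done:
            break

policy = Q.argmax(axis=1)
print("greedy policy (1 = move right):", policy[:-1])
```

Financial applications replace this toy state space with, say, portfolio holdings or order-book states, and the hand-coded reward with profit-and-loss or execution-cost signals, but the update rule is the same.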
Applied machine learning systems rely on data and computational resources; thus, having access to more data and computing power leads to better and faster predictions. In finance, where computational resources and datasets have grown, financial machine learning has emerged as a promising yet challenging area of research and practice.
According to Marcos López de Prado, a leading hedge fund manager and quantitative analyst, financial machine learning has proven to be very successful and is likely to be a major factor in shaping the future of financial markets, but it also presents major challenges that must be taken into account. Perhaps the most relevant challenge worth mentioning is the problem of false discoveries: finding what appears to be a valid pattern in the data but is, in reality, a spurious relationship.11 Other challenges include the interpretability/explainability of the models, performance, costs, and ethics.
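The false-discovery problem can be demonstrated with a short simulation: generate many purely random "strategies," select the one with the best in-sample Sharpe ratio, and re-evaluate it on fresh data. Any apparent skill is an artifact of selection, since the returns are noise by construction (the sizes and annualization factor below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate many "strategies" whose daily returns are pure noise, so any
# apparent pattern is spurious by construction (sizes are arbitrary).
n_strategies, n_days = 500, 250
in_sample = rng.normal(size=(n_strategies, n_days))
out_of_sample = rng.normal(size=(n_strategies, n_days))

# Annualized Sharpe ratio of each strategy on the in-sample period.
in_sharpe = in_sample.mean(axis=1) / in_sample.std(axis=1) * np.sqrt(252)

# Select the best in-sample performer, then re-evaluate it out of sample.
best = int(in_sharpe.argmax())
out_sharpe = out_of_sample[best].mean() / out_of_sample[best].std() * np.sqrt(252)

print(f"best in-sample Sharpe: {in_sharpe[best]:.2f}")
print(f"its out-of-sample Sharpe: {out_sharpe:.2f}")
```

The selected strategy looks impressive in sample only because the maximum over many noise draws is large; its out-of-sample performance reverts toward zero, which is exactly the danger of backtest overfitting.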
For financial institutions to effectively invest in and leverage financial machine learning, they must ensure they are machine learning ready. This involves having the right team with expertise in both finance and machine learning, a sufficient quality and quantity of financial data for ML algorithms, a robust data infrastructure, dedicated ML-oriented data pipelines, DevOps (or MLOps) practices for seamless deployment and integration, and monitoring tools. With this foundation, financial data engineering becomes crucial. Financial data engineers collaborate closely with financial ML scientists and ML engineers to define data requirements, automate data transformations, perform quality checks, and structure ML workflows for fast and high-performance computations.
The Disruptive FinTech Landscape
Following the 2007–2008 financial crisis, traditional financial institutions have faced a significant increase in regulatory requirements. Consequently, the focus of market participants has shifted substantially toward compliance. At the same time, as customers became more accustomed to using services online, demand for simple and user-friendly online financial products has increased. These factors paved the way for a new wave of technological innovation in the financial sector, commonly known as FinTech.
The term FinTech has emerged as a market portmanteau to describe both innovative technologies developed for the financial sector and the startup firms that develop these technologies. FinTech firms have attracted particular attention in the media and the market due to their innovative, flexible, and experimental approach. Unencumbered by legacy systems and accumulated regulatory debt, FinTechs have been employing modern and nonconventional approaches to solving and improving a wide range of financial problems, such as payments, lending, investment, fraud detection, and cryptocurrency. Traditional financial institutions often lack this flexibility due to factors such as organizational inertia, regulatory constraints, security concerns, and a lack of innovative culture.
The main distinguishing features of FinTech services are specialization and personalization. As small firms, FinTechs tend to focus on penetrating only specific and niche areas of the financial system. Figure 1-8 illustrates the different areas of specialization of FinTech firms. As the figure illustrates, the FinTech landscape spans all segments of the financial sector, from fundamental functions such as payments and investment to more specialized areas such as regulatory compliance (often called regtech) and analytics.
Moreover, the FinTech business model has demonstrated competitiveness through its customizable and personalized offerings. For example, digital wealth management platforms like Betterment and Wealthfront provide clients with detailed surveys to assess their financial goals and risk preferences, enabling them to offer investment plans tailored to each investor’s unique objectives and expectations.
Overall, the FinTech market has seen rapid growth since its inception. According to a report published by Boston Consulting Group, as of 2023, there were roughly 32,000 FinTech firms globally, securing more than $500 billion in funding. The same report predicts that by 2030, the annual revenue of the FinTech sector will reach $1.5 trillion, with banking FinTech representing 25% of overall banking valuations.
To thrive in this technology-intensive, high-performance, and data-driven landscape, aspiring FinTech companies must prioritize their software and data engineering strategies. To compete with and/or collaborate with incumbent financial institutions, FinTechs must ensure the highest standards of quality, reliability, and security. In this context, financial data engineers play a crucial role by designing efficient and reliable data ingestion, processing, and analysis pipelines that can scale and seamlessly integrate with other solutions.
Regulatory Requirements and Compliance
Financial institutions, and banks in particular, have a special status in the economic system. This is justified by the fact that the financial sector forms a complex network of asset cross-holdings, ownerships, investments, and transactions among financial institutions. As a consequence, a market shock that leads to the failure of one or more financial institutions can trigger a cascade of failures that might destabilize the entire financial system and cause an economic meltdown.12 The global financial crisis of 2007–2008 is the best example of such a scenario.
To avoid costly financial crises, the financial sector has been subjected to a large number of regulations, both national and international. Crucially, a significant part of financial regulatory requirements concerns the way banks should collect, store, aggregate, and report data. For example, following the financial crisis of 2007–2008, the Basel Committee on Banking Supervision noted that banks, and in particular Global Systemically Important Banks (G-SIBs), lacked a data infrastructure that could allow for quick aggregation of risk exposures to identify hidden risks and risk concentrations. To overcome this problem, the Basel Committee issued a set of principles on data governance and infrastructure, known as BCBS 239, that banks need to implement to strengthen their risk data aggregation and reporting capabilities.
Beyond banks, other financial institutions are also considered systemically important. These include Financial Market Infrastructures (FMIs), which facilitate the processing, clearing, settlement, and custody of payments, securities, and transactions. Examples of FMIs are stock exchanges, multilateral trading facilities, central counterparties, central securities depositories, trade repositories, payment systems, clearing houses, securities settlement systems, and custodians. FMIs are critical to the functioning of financial markets and the broader economy, making them subject to extensive regulation.13
Occasionally, regulators may require financial institutions to collect new types of data. For example, the European directive known as the Markets in Financial Instruments Directive, or MiFID, requires firms providing investment services to collect information regarding their clients’ financial knowledge to assess whether their level of financial literacy matches the complexity of the desired investments.
To comply with regulations, financial institutions need dedicated financial data engineering and management teams to design and implement a robust data infrastructure. This infrastructure must capture, process, and aggregate all relevant data and metadata from multiple sources while ensuring high standards of security and operational and financial resilience. It should enable risk and compliance officers to quickly and accurately access the data needed to demonstrate regulatory compliance. Financial data engineers will also be tasked with creating and enforcing a financial data governance framework that guarantees data quality and security, thereby increasing trust among management, stakeholders, and regulators. In Chapter 5, Financial Data Governance, we will explore these topics in detail.
The Financial Data Engineer Role
The financial data engineer is at the core of everything we’ve discussed so far. Working in the financial industry can be a very rewarding and exciting career. A decade ago, the most in-demand roles in finance were analytical, such as financial engineers, quantitative analysts (or quants), and analysts. But with the digital revolution that took place with big data, the cloud, and FinTech, titles such as data engineer, data architect, data manager, and cloud architect have established themselves as primary roles within the financial industry. In this section, I will provide an overview of a financial data engineer’s role, responsibilities, and skills.
Description of the Role
The role of a financial data engineer is in high demand, though the title, required skills, and responsibilities can vary significantly between positions. For example, the title of a financial data engineer could be any of the following:
-
Financial data engineer
-
Data engineer, finance
-
Data engineer, fintech
-
Data engineer, finance products
-
Data engineer, data analytics and financial services
-
Financial applications data engineer
-
Platform data engineer, financial services
-
Software engineer, financial data platform
-
Software engineer, financial ETL pipelines
-
Data management developer, FinTech
-
Data architect, finance platform
In many cases, other titles that don’t include the term “data engineering” involve, to a large extent, practices and skills related to financial data engineering. For example, the role of a machine learning engineer could involve many responsibilities concerning the creation, deployment, and maintenance of reliable analytical data pipelines for machine learning. The role of quantitative developer, common among financial institutions, often involves tasks relating to developing data pipelines, data extraction, and data transformations.
It is important to know that the role of a financial data engineer is neither a closed circle nor a professional lock-in. Even though financial domain knowledge is a major plus for financial data engineering roles, many financial institutions would accept people with data engineering experience who come from different backgrounds. Similarly, working as a financial data engineer would easily allow you to fit into other domains, given the rich variety of technical problems and challenges you might encounter in the financial industry.
Where Do Financial Data Engineers Work?
The demand for financial data engineers primarily arises from financial institutions that generate and store data and are willing or required to invest in data-related technologies. Let’s consider a few examples.
FinTech
FinTech firms are technology oriented and data driven; therefore, they are one of the best places to work as a financial data engineer. One of the main advantages of working for a FinTech is that you get to witness the entire lifecycle of product development. This provides engineers a solid overview of how data, business, and technology are combined to make a successful product. Another advantage is that you get to contribute original ideas and solutions to major infrastructural and software problems (e.g., choosing a database or finding a financial data vendor).
Commercial banks
Commercial banks are financial institutions that accept deposits from individuals and institutions while providing loans to consumers and investors, process a significant volume of daily transactions, and adhere to numerous regulatory requirements. To effectively manage their internal operations and ensure timely reporting, commercial banks typically employ teams of software and data engineers. These are responsible for developing and maintaining database systems, data aggregation and reporting mechanisms, customer analytics infrastructure, and transactional systems for various banking activities such as accounts, transfers, withdrawals, and deposits. Working as a data engineer at a commercial bank offers the opportunity to gain valuable insights into industry standards and best practices related to security, reliability, and compliance.
Interestingly, commercial banks frequently form collaboration agreements with FinTech firms to extend their services to the public. These partnerships necessitate a robust data infrastructure that facilitates secure and efficient server communication, often through financial APIs. Consequently, banks and FinTech firms need to hire financial data engineers to design and implement backends for data collection, transmission, aggregation, and integration.
Investment banks
An investment bank is a financial institution that provides corporate finance and investment services, such as mergers and acquisitions, leveraged buyouts, and initial public offerings (IPOs). Unlike commercial banks, investment banks do not accept deposits or give loans. Sometimes, they invest their own money via proprietary trading.
Investment banks engage in various activities that involve the generation, extraction, transformation, and analysis of financial data. These include building and backtesting investment strategies, asset pricing, company valuation, and market forecasting. This requires frequent and easy access to different types of financial data. Additionally, investment banks must regularly report compliance-related data to regulatory authorities. To facilitate quick and straightforward access to this data, investment banks need a team of financial data engineers to design and maintain systems for data collection, transformation, aggregation, and storage.
Asset management firms
Asset management firms are financial institutions that provide investment and asset management services to customers looking to invest their money. These can be independent entities or divisions within a large financial institution. Typically, asset managers operate on an institutional level, with clients such as mutual funds, pension funds, insurance companies, universities, and sovereign wealth funds.
To provide investment services, asset managers require access to a wide array of financial data to build investment strategies, construct portfolios, analyze financial markets, manage risks, and report on behalf of their clients. To manage such data, asset management firms employ in-house financial data engineers to design and maintain effective data strategies, governance, and infrastructure. Even when using third-party data management solutions, in-house engineers are crucial for overseeing and enhancing the data infrastructure.
Hedge funds
Hedge funds are financial institutions that actively invest a large pool of money in various market positions (buy and sell) and asset classes (equity, fixed income, derivatives, alternative investments) to generate above-market returns. To meet their financial return objectives, hedge funds build and test (backtest) a large number of complex investment strategies and portfolio combinations.
To achieve their goals, hedge funds rely on a large number of heterogeneous financial data sources from various providers. Financial engineers and quantitative developers working at hedge funds need high-quality and timely access to financial data. Moreover, hedge funds may invest in algorithmic and high-frequency strategies, which require robust and efficient data infrastructure for easy data read and write operations. This environment makes hedge funds an ideal workplace for financial data engineers.
Regulatory institutions
A variety of national and international regulatory bodies have been established to oversee financial markets. Examples include national entities like central banks and local market regulators, as well as international bodies such as the Bank for International Settlements, its Committee on Payments and Market Infrastructures, and the Financial Stability Board.
These institutions perform a wide variety of activities that require significant investments in financial data engineering and management. For example, if a regulatory agency establishes mandatory reporting and filing requirements, it requires a scalable data infrastructure capable of processing and storing all the reported data. Additionally, regulatory agencies might provide their members with principles and best practices on financial data infrastructure and governance system design. This requires internal teams of data engineers, data managers, and industry experts who can develop and formulate market recommendations.
Financial data vendors
Data vendors are key players in financial markets, providing subscription-based access to financial data collected from numerous sources. Notable examples include Bloomberg, LSEG, and FactSet. Due to their business model, these companies face various challenges related to data collection, curation, formatting, ingestion, storage, and delivery. Consequently, they offer some of the best opportunities for developing a career in financial data engineering.
Security exchanges
Security exchanges are centralized venues where buyers and sellers of financial securities conduct their transactions. Prominent examples include the New York Stock Exchange, NASDAQ, and the London Stock Exchange.
Exchanges need to record all activities and transactions that they facilitate on a daily basis. Some exchanges offer paid subscriptions to their transaction and quotation data. Additionally, they manage tasks like symbology, i.e., assigning identifiers and tickers to listed securities. All this makes exchanges an ideal place to develop a career as a financial data engineer, especially if you want to be at the heart of the financial center.
Big tech firms
Big tech companies such as Google, Amazon, Meta, and Apple have developed into major platforms for user interactions, transactions, and various online activities. Tech companies rely on two mechanisms to expand their activities: user data and network effects. The more activity happens on an online platform, the more data can be collected. The data is then used to study customer behavior in order to offer new products and services. This, in turn, encourages others to join the platform, which generates yet more data, and so on.
Relying on these self-reinforcing mechanisms, tech giants like Amazon, Apple, Google, and Alibaba have expanded into financial services, offering products like payments, insurance, loans, and money management. This move capitalizes on their extensive customer data, wide-reaching networks, and advanced technology, leading to the creation of user-friendly services such as mobile device payments. Consequently, dedicated teams of data engineers, finance specialists, and machine learning experts are required to support these operations.
Responsibilities and Activities of a Financial Data Engineer
The financial data engineer’s set of tasks and responsibilities will depend on the nature of the job and business problems, the hiring institution, and, most importantly, the firm’s data maturity.
Data maturity is an important concept that relates to data strategy. A data strategy is a long-term plan that describes the roadmap of objectives, people, processes, rules, tools, and technologies required to manage an organization’s data assets. To measure data strategy progress, data maturity approaches are often used. With a data maturity framework, an organization can illustrate the stages of development toward data usability, analytical capabilities, and integration. To further illustrate the concept, I borrow and build on the framework proposed by Joe Reis and Matt Housley in their book, Fundamentals of Data Engineering, which organizes data maturity into three steps: starting with data, scaling with data, and leading with data.
Starting with data
A financial institution that is starting with data is at the very early stage of its data maturity. Note that this doesn’t necessarily mean that the institution is new; old institutions (e.g., traditional banks) might decide to initiate digital transformation plans to automate and modernize their operations (e.g., cloud migration).
When starting with data, the financial data engineer’s responsibilities are likely to be broad and span multiple areas such as data engineering, software engineering, data analytics, infrastructure engineering, and web development. This early phase prioritizes speed and feature expansion over quality and best practices.
Scaling with data
During this stage, the financial institution needs to assess its processes, identify bottlenecks, and determine current and future scaling requirements. With these insights in hand, the institution can proceed to enhance the scalability, reliability, quality, and security of its financial data infrastructure. The primary objective here is to eliminate or work around any technological constraints that may stand in the way of the company’s growth.
During this stage, financial data engineers will be able to focus on adopting best practices for building reliable and secure systems, e.g., codebase quality, DevOps, governance, security, standards, microservices, system design, API and database scalability, deployability, and a well-established financial data engineering lifecycle.
Leading with data
Once a financial institution reaches the stage at which it is able to lead the market with data, it is considered data driven. In this stage, all processes are automated, requiring minimal manual intervention; the product can scale to any number of users; internal processes and governance rules are well established and formalized; and feature requests go through a well-defined development process.
During this stage, financial data engineers can specialize in and focus on specific aspects of the financial data infrastructure. There will always be space for further optimizations via roles and departments like site reliability engineering, platform engineering, data operations, MLOps, FinOps, data contracts, and new integrations.
Skills of a Financial Data Engineer
Financial data engineers bring together three types of skills: financial domain knowledge, technical data engineering skills, and soft and business skills. We’ll briefly illustrate these skillsets in the upcoming sections.
Financial domain knowledge
Having a good understanding of finance, financial markets, and financial data is an essential and competitive asset in any finance-related job, including financial data engineering. Examples of financial domain skills include the following:
-
Understanding the different types of financial instruments (stocks, bonds, derivatives, etc.)
-
Understanding the different players in financial markets (banks, funds, exchanges, regulators, etc.)
-
Understanding the data generation mechanisms in finance (trading, lending, payments, reporting, etc.)
-
Understanding company reports (balance sheet, income statement, prospectus, etc.)
-
Understanding the market for financial data (vendors, providers, distributors, subscriptions, delivery mechanisms, coverage, etc.)
-
Understanding financial variables and measures (price, quote, volume, yield, interest rate, inflation, revenue, assets, liability, capitalization, etc.)
-
Understanding financial theory terms (risk, uncertainty, return, arbitrage, volatility, etc.)
-
Understanding of compliance and privacy concepts (personally identifiable information (PII), anonymization, etc.)
-
Knowledge of financial regulation and data protection laws is a plus (Basel rules, MiFID, Solvency II, the EU’s General Data Protection Regulation (GDPR), etc.)
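As one concrete illustration of the privacy concepts above, the sketch below pseudonymizes a customer identifier (PII) with a keyed hash before it enters an analytics pipeline. HMAC-SHA256 is one common approach among several; the key value, function name, and identifier format are hypothetical:

```python
import hashlib
import hmac

# Hypothetical sketch: pseudonymize a customer identifier (PII) before
# it enters an analytics pipeline. A keyed hash (HMAC-SHA256) yields a
# stable token that is hard to reverse without the secret key, which
# should live in a secrets manager, separate from the data itself.
SECRET_KEY = b"replace-with-a-secret-from-a-vault"  # placeholder value

def pseudonymize(customer_id: str) -> str:
    """Return a stable, non-reversible pseudonym for a customer ID."""
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("customer-12345")
print(token)
```

Because the mapping is deterministic, the same customer always receives the same token, so analysts can still join records across datasets without ever seeing the raw identifier.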
Technical data engineering skills
Financial data engineering requires strong technical skills, which can vary across financial institutions, depending on their business needs, products, technological stack, and data maturity. Crucially, it’s important to keep in mind that the data engineering landscape is quite dynamic, with new technologies emerging and diffusing every year. For this reason, this book will focus more on immutable and technology-agnostic principles and concepts rather than on tools and technologies. But to give you an illustrative and nonexhaustive overview of the current landscape (as of 2024), expect as a financial data engineer to be asked about your knowledge of the following areas:
- Database query and design
-
Experience with relational database management systems (RDBMSs) and related concepts, in particular Oracle, MySQL, Microsoft SQL Server, and PostgreSQL
-
Solid knowledge of database internals and properties such as transactions, transaction control, ACID (atomicity, consistency, isolation, durability), BASE (basically available, soft state, eventually consistent), locks, concurrency management, WAL (write-ahead logging), and query planning
-
Experience with data modeling and database design
-
Experience with the SQL language, including advanced concepts such as user-defined functions, window functions, indexing, clustering, partitioning, and replication
-
Experience with data warehouses and related concepts and design patterns
-
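To make the window-function bullet concrete, here is a small, self-contained sketch using Python's built-in sqlite3 module (the price data is invented): it computes a three-day moving average of closing prices with an `AVG(...) OVER (...)` window.

```python
import sqlite3

# In-memory database with a toy daily price table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (trade_date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("2024-01-02", 100.0), ("2024-01-03", 102.0),
     ("2024-01-04", 101.0), ("2024-01-05", 103.0)],
)

# Window function: 3-day moving average of the close, ordered by date.
rows = conn.execute(
    """
    SELECT trade_date,
           AVG(close) OVER (
               ORDER BY trade_date
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS ma3
    FROM prices
    """
).fetchall()
```

The same query runs largely unchanged on PostgreSQL or a cloud warehouse; only the connection layer differs.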
- Cloud skills
  - Experience with cloud providers (Amazon Web Services, Azure, Google Cloud Platform, Databricks, etc.)
  - Experience with cloud data warehousing (Redshift, Snowflake, BigQuery, Cosmos DB, etc.)
  - Experience with serverless computing (lambda functions, AWS Glue, Google Workflows, etc.)
  - Experience with different cloud runtimes (Amazon EC2, AWS Fargate, cloud functions, etc.)
  - Experience with infrastructure as code (IaC) tools such as Terraform
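As a taste of the serverless model, the sketch below mimics the shape of an AWS-Lambda-style Python handler: a stateless function that receives an event and returns a response. The event format and field names here are hypothetical, and no cloud service is actually invoked:

```python
import json

def handler(event, context=None):
    """Minimal Lambda-style handler: validate an incoming payment event
    and return an API-Gateway-shaped response (sketch only)."""
    body = json.loads(event["body"])
    if body.get("amount", 0) <= 0:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid amount"})}
    return {"statusCode": 200, "body": json.dumps({"status": "accepted"})}

# Local invocation, as a unit test would do before deploying:
response = handler({"body": json.dumps({"amount": 99.5, "currency": "USD"})})
```

Because the handler is a plain function, it can be tested locally and then deployed behind an API gateway or event trigger without code changes.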
- Data workflow and frameworks
  - Experience with ETL (extract, transform, load) workflow solutions (AWS Glue, Informatica, Talend, Alooma, SAP Data Services, etc.)
  - Experience with general workflow tools such as Apache Airflow, Prefect, Luigi, AWS Glue, and Mage
  - Experience with messaging and queuing systems such as Apache Kafka and Google Pub/Sub
  - Experience in designing and building highly scalable and reliable data pipelines (dbt, Hadoop, Spark, Hive, Cassandra, etc.)
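The ETL pattern behind many of these tools can be sketched in a few lines of plain Python. The example below is a toy pipeline, with invented trade data and SQLite standing in for a warehouse, not a production design:

```python
import csv
import io
import sqlite3

# Extract: read raw trades from a CSV source (here, an in-memory string).
RAW = "symbol,price,qty\nAAPL,189.5,10\nMSFT,410.2,-5\nAAPL,190.1,3\n"

def extract(raw: str) -> list[dict]:
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[dict]:
    # Cast types and derive a notional (price * quantity) column.
    return [
        {"symbol": r["symbol"], "price": float(r["price"]),
         "qty": int(r["qty"]), "notional": float(r["price"]) * int(r["qty"])}
        for r in rows
    ]

def load(rows: list[dict], conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS trades "
                 "(symbol TEXT, price REAL, qty INTEGER, notional REAL)")
    conn.executemany(
        "INSERT INTO trades VALUES (:symbol, :price, :qty, :notional)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
total = conn.execute("SELECT SUM(notional) FROM trades").fetchone()[0]
```

Workflow tools such as Airflow add what this sketch lacks: scheduling, retries, dependency management, and observability for each step.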
- Infrastructure
  - Experience with containers and container orchestration, such as Docker, Kubernetes, AWS Fargate, and Amazon Elastic Kubernetes Service (EKS)
  - Experience with version control using Git, GitHub, GitLab, feature branches, and automated testing
  - Experience with system design and software architecture (distributed systems, batch, streaming, lambda architecture, etc.)
  - Understanding of the Domain Name System (DNS), TCP, firewalls, proxy servers, load balancing, virtual private networks (VPNs), and virtual private clouds (VPCs)
  - Experience building integrations with and reporting datasets for payments, finance, and business systems like Stripe, NetSuite, Adaptive, Anaplan, and Salesforce
  - Experience working in a Linux environment
  - Experience with software architecture diagramming and design tools such as draw.io, Lucidchart, CloudSkew, and Gliffy
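To illustrate the load-balancing concept from the networking bullet, here is a toy round-robin scheduler in Python. Real deployments would use a dedicated load balancer (HAProxy, a cloud application load balancer, etc.), and the backend addresses below are made up:

```python
import itertools

# Hypothetical backend servers behind a single service endpoint.
backends = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
rotation = itertools.cycle(backends)

def route() -> str:
    """Return the backend that should serve the next request,
    rotating evenly over all backends (round-robin)."""
    return next(rotation)

assigned = [route() for _ in range(6)]
```

Round-robin spreads load evenly but ignores backend health and capacity; production balancers add health checks and weighted or least-connections policies.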
- Programming languages and frameworks
  - Experience with object-oriented programming (OOP)
  - Experience optimizing data infrastructure, codebases, tests, and data quality
  - Experience generating data for reporting purposes
  - Experience working with pandas, PySpark, Polars, and NumPy
  - Experience working with financial vendor APIs and feeds like the Bloomberg Server API, LSEG’s Eikon, FactSet APIs, the OpenFIGI API, and LSEG’s PermID
  - Experience with web development frameworks such as Flask, FastAPI, and Django
  - Understanding of software engineering best practices, Agile methodologies, and DevOps
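As a small OOP illustration in Python, the sketch below models a minimal and entirely hypothetical instrument hierarchy with dataclasses; shared behavior lives in the base class, and subclasses add instrument-specific fields and methods:

```python
from dataclasses import dataclass

@dataclass
class Instrument:
    """Base class for a tradable instrument (illustrative only)."""
    symbol: str

    def market_value(self, price: float, quantity: float) -> float:
        # Behavior shared by all instrument types.
        return price * quantity

@dataclass
class Bond(Instrument):
    """A bond adds fixed-income-specific attributes and behavior."""
    face_value: float
    coupon_rate: float  # annual rate, as a decimal

    def annual_coupon(self) -> float:
        return self.face_value * self.coupon_rate

bond = Bond(symbol="US10Y", face_value=1000.0, coupon_rate=0.04)
```

Dataclasses keep the boilerplate (constructors, repr, equality) out of the way, which is handy when a codebase defines many instrument and record types.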
- Analytical skills
  - Knowledge of data matching and record linkage techniques
  - Knowledge of financial text analysis and its applications, such as entity extraction, fraud detection, and know your customer (KYC)
  - Knowledge of financial data cleaning techniques and quality metrics
  - Experience performing financial data analysis and visualization using tools such as Microsoft Power BI, Apache Superset, D3.js, Tableau, and Amazon QuickSight
  - Basic experience with machine learning algorithms and generative AI
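Record linkage often starts with a string-similarity measure. The sketch below uses Python's standard-library difflib to match entity names across two invented vendor lists; production systems add more robust techniques (blocking, probabilistic matching, identifier-based linkage, etc.):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1], after simple cleaning."""
    def norm(s: str) -> str:
        return s.lower().replace(".", "").replace(",", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# Two hypothetical sources that spell the same entities differently.
vendor_a = ["JPMorgan Chase & Co.", "Bank of New York Mellon"]
vendor_b = ["JP Morgan Chase and Co", "BNY Mellon", "Bank of New York Mellon Corp"]

# For each name in source A, pick the most similar candidate in source B.
matches = {
    name: max(vendor_b, key=lambda cand: similarity(name, cand))
    for name in vendor_a
}
```

Pairwise comparison like this is O(n × m); at scale, blocking keys (e.g., first letters or country codes) are used to prune the candidate pairs first. Entity identifiers such as LEI or FIGI, covered later in this book's context of entity systems, make linkage far more reliable than name matching alone.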
Business and soft skills
For most financial institutions, data represents a valuable asset. Therefore, financial data engineers need to ensure that their work aligns with the data strategy and vision of their institution. To do so, they can complement their technical skills with business and soft skills such as the following:
- Ability to comprehend the technical aspects of the product and technology, to communicate effectively with engineers, and to explain these concepts in simpler terms to finance and business stakeholders
- Understanding the value that financial data and its associated infrastructure generate for the institution
- Collaborating closely with finance and business teams to identify their data requirements
- Staying informed about the evolving financial and data technology landscape
- Establishing policies for company members to access and request new financial data
- Interest in data analysis and machine learning applications that leverage financial data
- Proactively gathering and analyzing high-value financial data needs from business and analyst teams, and clearly communicating deliverables, timelines, and tradeoffs
- Providing guidance and education on financial data engineering, on what to expect from a financial data engineer, and on how to search for, find, and access financial data
- Participating in the assessment of new financial data sources, technologies, products, or applications suitable for the company’s business
Certainly, not every job demands proficiency in all of these skills; instead, a tailored combination is sought based on specific business needs. Throughout this book, you’ll learn about many of the aforementioned skills, diving deeply into some and getting an overview of others, with demonstrations of their importance and practical application within the financial domain.
Summary
This chapter provided an overview of financial data engineering, summarized as follows:
- Defining financial data engineering and outlining its unique challenges
- Justifying the need for financial data engineering and illustrating its applications
- Describing the role and responsibilities of the financial data engineer
Now that you have an idea about financial data engineering, it’s time to learn about the most important asset in this field: financial data. In Chapter 2, you will gain a thorough understanding of financial data, including its sources, types, structures, and distinguishing features. You will also learn about key benchmark financial datasets that are widely used in the financial industry.
1 If you want to learn about financial instruments in more depth, I encourage you to read Investments by Zvi Bodie, Alex Kane, and Alan Marcus (McGraw Hill, 2023).
2 Data management is a broader term than data engineering. It refers to all plans and policies put in place to make sure that data is strategically managed and optimized for business value creation. To read about data management, I highly recommend Data Management at Scale by Piethein Strengholt (O’Reilly, 2023).
3 For a good reference on these challenges, see Antoni Munar, Esteban Chiner, and Ignacio Sales, “A Big Data Financial Information Management Architecture for Global Banking”, presented at the 2014 International Conference on Future Internet of Things and Cloud (IEEE, August 2014): 385–388.
4 To know more about this topic, check this excellent reference: Tanya S. Beder and Cara M. Marshall’s Financial Engineering: The Evolution of a Profession (Wiley, 2011).
5 See the Operating Metrics file in the official OPRA document library page.
6 Alternative datasets and their use cases in finance are discussed in detail in Chapter 2.
7 Tom Mitchell, Machine Learning (McGraw-Hill, 1997), p. 2.
8 For a good reference on performance metrics, I recommend Aurélien Géron’s book, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O’Reilly, 2022).
9 For an overview of clustering techniques, I recommend the official documentation of scikit-learn.
10 An excellent reference on reinforcement learning in finance is the book by Ashwin Rao and Tikhon Jelvis, Foundations of Reinforcement Learning with Applications in Finance (CRC Press, 2022).
11 The problem of false discoveries is well-known in finance. For an introduction to this topic, please refer to the article by Campbell R. Harvey, Yan Liu, and Heqing Zhu, “… and the Cross-Section of Expected Returns”, Review of Financial Studies 29, no. 1 (January 2016): 5–68.
12 To read more about the topic of systemic risk, I recommend Jaime Caruana’s article, “Systemic Risk: How to Deal With It?”, Bank for International Settlements (February 2010).
13 To read more about FMI regulation, see “Principles for Financial Market Infrastructures”, the Bank for International Settlements (April 2012), and “Core Principles for Systemically Important Payment Systems”, the Bank for International Settlements (January 2001).