Chapter 4. Feature Stores
As we have seen in the first three chapters, data management is one of the most challenging aspects of building and operating AI systems. In the last chapter, we used a feature store to build our air quality forecasting system. The feature store stored the output of the feature pipelines, provided training data for the training pipeline, and inference data for the batch inference pipeline. The feature store is a central data platform that stores, manages, and serves features for both training and inference. It also ensures consistency between features used in training and inference, and enables the construction of modular AI systems by providing a shared data layer and well-defined APIs to connect feature, training, and inference pipelines.
In this chapter, we will dive deeper into feature stores and answer the following questions:
- What problems does the feature store solve and when do I need one?
- What is a feature group, how does it store data, and how do I write to one?
- How do I design a data model for feature groups?
- How do I read feature data spread over many feature groups for training or inference?
We will look at how feature stores are built from a columnar store, a row-oriented store, and a vector index. We describe how feature stores solve challenges related to feature reuse, how to manage time-series data, and how to prevent skew between feature, training, and inference pipelines. We will also weave a motivating example of a real-time AI system that predicts credit card fraud throughout the chapter.
A Feature Store for Fraud Prediction
We start by presenting the problem of how to design a feature store for an AI system that makes real-time fraud predictions for credit card transactions. The ML System Card for the system is shown in Table 4-1.
| Dynamic Data Sources | Prediction Problem | UI or API | Monitoring |
|---|---|---|---|
| Credit Card Transactions arrive in an Event Bus. Credit card, issuer, and merchant details are in tables in a Data Warehouse. | Whether a credit card transaction is suspected of fraud or not. | Real-time API that rejects suspected fraud transactions. | Offline investigations of suspected vs actual reported fraud. |
The source data for our AI system comes from a Data Mart (consisting of a data warehouse) and an event bus (such as Apache Kafka or AWS Kinesis); see Figure 4-1.
Starting from our data sources, we will learn how to build a feature store with four main steps:
- identify entities and features for those entities,
- organize entities into tables of features (feature groups), and identify relationships between feature groups,
- select the features for a model, from potentially different feature groups, in a feature view,
- retrieve data for model training and batch/online inference with the feature view.
This chapter will provide more details on what feature groups and feature views are, but before that, we will look at the history of feature stores, what makes up a feature store (its anatomy), and when you may need a feature store.
Brief History of Feature Stores
As mentioned in Chapter 1, Uber introduced the first feature store for machine learning in 2017 as part of its Michelangelo platform. Michelangelo includes a feature store (called Palette), a model registry, and model serving capabilities. Michelangelo also introduced a domain-specific language (DSL) to define feature pipelines. In the DSL, you define what type of feature to compute on what data source (such as count the number of user clicks in the last 7 days using a clicks table), and Michelangelo transpiles your feature definition into a Spark program and runs it on a schedule (for example, hourly or daily).
In late 2018, Hopsworks was the first open-source feature store, introducing an API-based feature store, where external pipelines read and write feature data using a DataFrame API and there is no built-in pipeline orchestration. The API-based feature store enables you to write pipelines in different frameworks/languages (for example, Flink, PySpark or Pandas). In late 2019, the open-source Feast feature store adopted the same API-based architecture (DataSets) for reading/writing feature data. Now, feature stores from GCP, AWS, and Databricks follow the API-based architecture, while the most popular DSL-based feature store is Tecton. In the rest of this Chapter, we describe the common functionality offered by both API-based and DSL-based feature stores, while in the next chapter, we will look at the Hopsworks Feature Store, which is representative of API-based feature stores.
Note
The term feature platform has been used to describe feature stores that support managed feature pipelines (but not managed training or inference pipelines). The virtual feature store is a moniker for a feature store that has pluggable offline and online stores. Finally, the AI Lakehouse describes a feature store that uses Lakehouse tables as its offline store and has an integrated online store for building real-time AI systems.
The Anatomy of a Feature Store
The feature store is a factory that produces and stores feature data. It enables the faster production of higher quality features by managing the storage and transformation of data for training and inference, and allows you to reuse features in any model. In Figure 4-2, we can see the main inputs, outputs, and the data transformations managed by the feature store.
Feature pipelines feed the feature store with feature data. They take new data or historical data (backfilling) as input and transform it into reusable feature data, primarily using model-independent transformations. On-demand transformations can also be applied to historical data to create reusable feature data. The programs that execute the model-independent (and on-demand) transformations are called feature pipelines. Feature pipelines can be batch or streaming programs and they update the feature data over time - the feature store stores mutable feature data. For supervised machine learning, labels can also be stored in the feature store and are treated as feature data until they are used to create training or inference data, at which point the feature store is aware of which columns are features and which columns are labels.
Feature stores enable the creation of versioned training datasets by taking a point-in-time consistent snapshot of feature data and then applying model-dependent transformations to the features (and labels). Training datasets are used to train models, and the feature store should store the lineage of the training dataset for models. The feature store also creates point-in-time consistent snapshots of feature data for batch inference, that should have the same model-dependent transformations applied to them as were applied when creating the training data for the model used in batch inference.
The feature store also provides low latency feature data to online applications or services. These applications issue prediction requests, and parameters from the prediction request can be used to compute on-demand features and to retrieve precomputed rows of feature data from the feature store. Any on-demand and precomputed features are merged into a feature vector that can have further model-dependent transformations applied to it (the same as those applied in training) before the model makes a prediction with the transformed feature vector.
Feature stores support and organize the data transformations in the taxonomy from Chapter 2. Model-independent transformations (MITs) are applied only in feature pipelines on new or historical data to produce reusable feature data. On-demand transformations (ODTs) are applied in both feature pipelines and online inference pipelines, and feature stores should ensure that exactly the same ODT is executed in the feature and online inference pipelines, otherwise there is a risk of skew. Model-dependent transformations (MDTs) are applied in training pipelines, batch inference pipelines, and online inference pipelines. Again, the feature store should ensure that the same MDT is executed in the training and inference pipelines, preventing skew. In Figure 4-3, you can see examples of directed acyclic graphs (DAGs) of valid and invalid combinations of MITs, ODTs, and MDTs.
Feature stores can support the composition of model-independent transformations (MITs), model-dependent transformations (MDTs), and on-demand transformations (ODTs) in pipelines, subject to the following constraints:
- MDTs are always the last transformation in a DAG (just before the model is called),
- MDTs are not normally composed (for example, you don’t encode a categorical feature twice or normalize and then standardize a numerical feature),
- You can build a DAG of MITs and ODTs, but ODTs should not come before MITs in the DAG - in an online inference pipeline, there is no way to execute a MIT after the ODT. If you could run the MIT after the ODT, then, by definition, the MIT would be an ODT.
This chapter, however, is concerned primarily with the storage, modeling, and querying of the feature data. Chapters 6, 7, and 8 will address the MITs, MDTs, and ODTs.
When Do You Need a Feature Store?
When is it appropriate for you to use a feature store? Many organizations already have operational databases, an object store, and a data warehouse or lakehouse. Why would they need a new data platform? The following are scenarios where a feature store can help.
For Context and History in Real-Time AI Systems
We saw in chapter 1 how real-time AI systems need history and context to make personalized predictions. In general, when you have a real-time prediction problem but the prediction request has low information content, you can benefit from a feature store to provide context and history to enrich the prediction request. For example, a credit card payment has limited information in the prediction request - only the credit card number, the merchant ID (unique identifier), the timestamp and location for the payment, the category of goods purchased, and the amount of money. Building an accurate credit card fraud prediction service with AI using only that input data is almost impossible, as we are missing historical information about credit card payments. With a feature store, you can enrich the prediction request at runtime with history and context information about the credit card’s recent usage, the customer details, the issuing bank’s details, and the merchant’s details, enabling a powerful model for predicting fraud.
For Time-Series Data
Many retail, telecommunications, and financial AI systems are built on time-series data. The air quality and weather data from Chapter 3 is time-series data that we update once per day and store in tables along with the timestamps for each observation or forecast. Time-series data is a sequence of data points for successive points in time. A major challenge in using time-series data for machine learning is how to read (query) feature data that is spread over many tables - you want to read point-in-time correct training data from the different tables without introducing future data leakage or including any stale feature values, see Figure 4-4.
Feature stores provide support for reading point-in-time correct training data from different tables containing time-series feature data. The solution, described later in this chapter, is to query data with temporal joins. Writing correct temporal joins is hard, but feature stores make it easier by providing APIs for reading consistent snapshots of feature data using temporal joins.
Note
You have probably encountered data leakage in the context of training models - if you leak data from your test set or any external dataset into your training dataset, your model may perform better during testing than when it is used in production on unseen data. Future data leakage is when you build training datasets from time-series data and incorrectly introduce one or more feature data points from the future.
For Improved Collaboration with the FTI Pipeline Architecture
An important reason many models do not reach production is that organizations have silos around the teams that collaborate to develop and operate AI systems. In Figure 4-5, you can see a siloed organization where the data engineering team has a metaphorical wall between them and the data science team, and there is a similar wall between the data science team and the ML engineering team. In this siloed organization, collaboration involves data and models being thrown over the wall from one team to another.
The system for collaboration at this organization is an example of Conway’s Law, where the process for collaboration (throwing assets over walls) mirrors the siloed communication structure between teams. The feature store solves the organizational challenges of collaboration across teams by providing a shared platform for collaboration when building and operating AI systems. The feature, training, and inference (FTI) pipelines from Chapter 2 also help with collaboration. They decompose an AI system into modular pipelines that use the feature store acting as the shared data layer connecting the pipelines. The responsibilities for the FTI pipelines map cleanly onto the teams that develop and operate production AI systems:
- data engineers and data scientists collaborate to build and operate feature pipelines;
- data scientists train and evaluate the models;
- ML engineers write inference pipelines and integrate models with external systems.
For Governance of AI Systems
Feature stores help ensure that an organization’s governance processes keep feature data secure and accountable throughout its lifecycle. That means auditing actions taken in your feature store for accountability and tracking lineage from source data to features to models. Feature stores manage mutable data that needs to comply with regulatory requirements, such as the European Union’s AI Act that categorizes AI systems into four different risk levels: unacceptable, high, limited, and minimal risk.
Beyond data storage, the feature store also needs support for lineage for compliance with other legal and regulatory requirements involving tracking the origin, history, and use of data sources, features, training data, and models in AI systems. Lineage enables the reproducibility of features, training data, and models, improved debugging through quicker root cause analysis, and usage analysis for features. Lineage tells us where AI assets are used. Lineage does not, however, tell you whether a particular feature is allowed to be used in a particular model - for example, a high risk AI system. Access control, while necessary, does not help here either, as it only tells you whether you have the right to read/write the data, not whether your model will be compliant if you use a certain feature. For compliance, feature stores support custom metadata to describe the scope and context under which a feature can be used. For example, you might tag features that have personally identifiable information (PII). With lineage (from data sources to features to training data to models) and PII metadata tags for features, you can easily identify which models use features containing PII data.
For Discovery and Reuse of AI Assets
Feature reuse is a much advertised benefit of feature stores. Meta reported that “most features are used by many models” in their feature store, and the most popular 100 features are reused in over 100 different models each. The benefits of feature reuse include: higher quality features through increased usage and scrutiny, reduced storage cost, and reduced feature development and operational costs, as models that reuse features do not need new feature pipelines. Computed features are stored in the feature store and published to a feature registry, enabling users to easily discover and understand features. The feature registry is a component in a feature store that has an API and user interface (UI) to browse and search for available features, feature definitions, statistics on feature data, and metadata describing features.
For Elimination of Offline-Online Feature Skew
Feature skew is when significant differences exist between the data transformation code in either an ODT or MDT in an offline pipeline (a feature or training pipeline, respectively), and the data transformation code for the ODT or MDT in the corresponding inference pipeline. Feature skew can result in silently degraded model performance that is difficult to discover. It may show up as the model not generalizing well to new data during inference due to the discrepancies in the data transformations. Without a feature store, it is easy to write different implementations for an ODT or MDT - one implementation for the feature or training pipeline and a different one for the inference pipeline. In software engineering, we say that such data transformation code is not DRY (Don't Repeat Yourself). Feature stores support the definition and management of ODTs and MDTs, and ensure the same function is applied in the offline and inference pipelines, as illustrated in the sketch below.
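Even without a feature store, the first line of defense against skew is to define each transformation exactly once, in a shared module that both the offline and inference pipelines import. A minimal sketch (the module, function, and parameter names are assumptions for illustration, not an API from any particular feature store):

```python
# transformations.py - single source of truth for a model-dependent transformation,
# imported by both the training pipeline and the online inference pipeline.

def scale_amount(amount: float, mean: float, std: float) -> float:
    """Standardize the transaction amount using statistics computed on the training set."""
    return (amount - mean) / std
```

A feature store goes further by registering and versioning the transformation function itself, so the training and inference pipelines cannot drift apart over time.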
For Centralizing your Data for AI in a single Platform
Feature stores aspire to be a central platform that manages all data needed to train and operate AI systems. Existing feature stores have a dual-database architecture, including an offline store and an online store. However, feature stores are increasingly adding support for vector database capabilities - vector indexes to store vector embeddings and support similarity search.
The online store is used by online applications to retrieve feature vectors for entities. It is a row-oriented data store, where data is stored in relational tables or in a NoSQL data structure (like key-value pairs or JSON objects). The key properties of row-oriented data stores are:
- low latency and high throughput CRUD (create, read, update, delete) operations using either SQL or NoSQL,
- support for primary keys to retrieve features for specific entities,
- support for time-to-live (TTL) for tables and/or rows to expire stale feature data,
- high availability through replication and data integrity through ACID transactions.
The offline store is a columnar store. Column-oriented data stores are:
- central data platforms that store historical data for analytics,
- low cost storage for large volumes of data (including columnar compression of data) at the cost of high latency for row-based retrieval of data,
- faster at complex queries than row-oriented stores through more efficient data pruning and data movement, aided by data models designed to support complex queries.
The offline stores for existing feature stores are lakehouses. The lakehouse is a combination of a data lake for storage and a data warehouse for querying the data. In contrast to a data warehouse, the lakehouse is an open platform that separates the storage of columnar data from the query engines that use it. Lakehouse tables can be queried by many different query engines. The main open-source standards for the lakehouse are the table formats for data storage (Apache Iceberg, Delta Lake, Apache Hudi). A table format consists of data files (Parquet files) and metadata that enables ACID (atomicity, consistency, isolation, durability) updates to the Parquet files - a commit for every batch append/update/delete operation. The commit history is stored as metadata and enables time-travel support for lakehouse tables, where you can query historical versions of tables (using a commit ID or timestamp). Lakehouse tables also support schema evolution (you can add columns to your table without breaking clients), as well as partitioning, indexing, and data skipping for faster queries.
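For example, with Delta Lake (one of the table formats above), time travel is exposed as read options. A minimal PySpark sketch, where the table path is a hypothetical location for our transactions table:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

table_path = "s3://my-lakehouse/credit_card_transactions"  # hypothetical path

# Read the current state of the lakehouse table.
current_df = spark.read.format("delta").load(table_path)

# Time travel: read the table as of an earlier commit, or as of an earlier point in time.
v3_df = spark.read.format("delta").option("versionAsOf", 3).load(table_path)
ts_df = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load(table_path)
```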
The offline and/or online store may also support storing vector embeddings in a vector index that supports approximate nearest neighbor (ANN) search for feature data. Feature stores either include a separate standalone vector database (such as Weaviate, Pinecone), or an existing row-oriented database that supports a vector index and ANN search (such as Postgres PgVector, OpenSearch, and MongoDB). Now that we have covered why and when you may need a feature store, we will look into storing data in feature stores in feature groups.
Feature Groups
Feature stores use feature groups to hide the complexity of writing and reading data to/from the different offline and online data stores. We encountered feature groups in Chapters 2 and 3, but we haven’t formally defined them. Feature groups are tables, where the features are columns and the feature data is stored in offline and online stores. Not all feature stores use the term feature groups - some vendors call them feature sets or feature tables, but they refer to the same concept. We prefer the term feature group as the data is potentially stored in a group of tables - more than one store. We will cover the most salient and fundamental properties of feature groups employed by existing feature stores, but note that your feature store might have some differences, so consult its documentation before building your feature pipelines.
A feature group consists of a schema, metadata, a table in an offline store, an optional table in an online store, and an optional vector index. The metadata typically contains the feature group’s:
- name
- version (a number)
- entity_id (a primary key, defined over one or more columns)
- online_enabled - whether the feature group’s online table is used or not
- event_time column (optional)
- tags to help with discovery and governance.
The entity_id is needed to retrieve rows of online feature data and prevent duplicate data, while the version number enables support for A/B tests of features by different models and enables schema breaking changes to feature groups. The event_time column is used by the feature store to create point-in-time consistent training data from time-series feature data. Depending on your feature store, a feature group may support some or all of the following:
- foreign_key columns (references to a primary key in another feature group)
- a partition_key column (used for faster queries through partition pruning)
- vector embedding features that are indexed for similarity search
- feature definitions that define the data transformations used to create the features stored in the feature group.
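To make this metadata concrete, here is how a feature group for our credit card transactions might be declared, loosely modeled on the Hopsworks API (the feature store handle fs and the column names are assumptions; your feature store's API will differ):

```python
# A sketch of declaring feature group metadata (Hopsworks-style API assumed).
transactions_fg = fs.get_or_create_feature_group(
    name="credit_card_transactions",
    version=1,                      # version number of the feature group
    primary_key=["cc_num"],         # the entity_id
    event_time="trans_ts",          # marks the feature data as time-series
    partition_key=["day"],          # optional, enables partition pruning
    online_enabled=True,            # also materialize rows to the online store
    description="Credit card transactions and transaction features",
)
```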
In Figure 4-6, we can see a feature group containing different columns related to credit-card payments. You will notice that most columns are not feature columns.
The first four columns are collectively known as index columns - cc_num is the entity ID, trans_ts is the event_time, account_id is a foreign key to an account_details feature group (not shown), and day is a partition_key column that makes queries that filter by day faster by only reading the needed data (for example, reading yesterday's feature data will not read all rows, only the rows where the day value is yesterday). The next three columns (amount, category, and embedding_col) are features - the embedding_col is a vector embedding that is indexed for similarity search in the vector index. Finally, the is_fraud column is also a feature column but is identified as a ‘label’ in the figure. That is because features can also be labels - the is_fraud column could be a label in one model, but a feature in another model. For this reason, labels are not defined in feature groups, but only defined when you select the features and labels for your model.
You can perform inserts, updates, and deletes on feature groups, either via a batch (DataFrame) API or a streaming API (for real-time AI systems). As a feature group has a schema, your feature store defines the set of supported data types for features - strings, integers, arrays, and so on. In most feature stores, you can either explicitly define the schema for a feature group or the feature store will infer its schema from the first DataFrame written to it. If a feature group contains time-series data, the event_time column value should reflect the time the feature values in that row were created (not when the row of data was ingested). If the feature group contains non time-series data, you can omit the event_time column.
Note
The entity ID is a unique identifier for an entity that has features in the modeled world. The entity ID can be either a natural key or a surrogate key. An example of a natural key is an email address or social security number for a user, while an example of a surrogate key is a sequential number, such as an auto increment number, representing a user.
Feature Groups store untransformed feature data
Feature pipelines write untransformed feature data to feature groups. The MDTs, such as encoding a categorical feature, are performed in training and inference pipelines after reading feature data from the feature store. In general, feature groups should not store transformed feature values (that is, MDTs should not have been applied) as:
- The feature data is not reusable across models (model-specific transformations transform the data for use by a single model or set of related models).
- It can introduce write amplification. If the MDT is parameterized by training data, such as standardizing a numerical feature, the time taken to perform a write becomes proportional to the number of rows in the feature group, not the number of rows being written. For standardization, this is because updates first require reading all existing rows, recomputing the mean and standard deviation, then updating the values of all rows with the new mean and standard deviation.
- Exploratory data analysis works best with unencoded feature data - it is hard for a data scientist to understand descriptive statistics for a numerical feature that has been scaled.
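To make the first two points concrete, here is a minimal sketch (using scikit-learn; the DataFrame and column names are illustrative assumptions) of applying a model-dependent transformation in the training pipeline, after reading untransformed feature data, so the stored features remain reusable and writes stay cheap:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Untransformed feature data, as it would be read from a feature group
# (values are made up for illustration).
train_df = pd.DataFrame({"amount": [12.50, 250.00, 39.99, 7.00]})

# The MDT is fitted on the training split inside the training pipeline,
# not in the feature pipeline and not in the feature store.
scaler = StandardScaler()
train_df["amount_scaled"] = scaler.fit_transform(train_df[["amount"]])

# The fitted parameters (mean, standard deviation) are saved alongside the model
# so that exactly the same transformation can be applied at inference time.
print(scaler.mean_, scaler.scale_)
```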
Feature Definitions and Feature Groups
A feature definition is the source code that defines the data transformations used to create one or more features in a feature group. In API-based feature stores, this is the source code for your MITs (and ODTs) in your feature pipelines. For example, this could be a Pandas, Polars, or Spark program for a batch feature pipeline. In DSL-based feature stores, a feature definition is not just the declarative transformations that create the features, but also the specification for the feature pipeline (batch, streaming, or on-demand).
Writing to Feature Groups
Feature stores provide an API to ingest feature data. The feature store manages the complexity of then updating the feature data in the offline store, online store, and vector index on your behalf - the updates in the background are transparent to you as a developer. Figure 4-7 shows two different types of APIs for ingesting feature data. In Figure 4-7(a), you have a single batch API for clients to write feature data to the offline store. The offline store is normally a lakehouse table, and lakehouse tables provide change data capture (CDC) APIs where you can read the data changes for the latest commit. A background process either runs periodically or continually, reading any new commits since the last time it ran and copying them to the online store and/or vector index. For feature groups storing time-series data, the online store only stores the latest feature data for each entity (the row with the most recent event_time value for each primary key).
In Figure 4-7(b), there are two APIs - a batch API and a stream API. Clients can use the batch API to write to only the offline store. If a feature group is online_enabled, clients write to the stream API. Clients that write to the stream API can be either batch programs (Spark, Pandas, Polars) or stream processing programs (Flink, Spark Streaming). The difference with the stream API is that updates are written first to the online store and vector index (here via an event bus), and then synchronized periodically with the offline store. Feature data is available at lower latency in the online store via the stream API - that is, the stream API enables fresher features. For feature groups storing time-series data, the online store again stores only the latest feature data for each entity (the row with the most recent event_time value for each primary key). Some online feature stores with a stream API also support computing aggregations as ODTs (for example, max amount for a credit card transaction in the last 15 minutes), and, in this case, a TTL can be specified for each row or table so that feature data is removed when its TTL has expired.
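For example, with an API-based feature store, a batch feature pipeline can hand a DataFrame to the feature group in a single call and leave the offline/online synchronization to the feature store. A sketch, reusing the hypothetical transactions_fg handle from the earlier feature group example (the data values are made up):

```python
import pandas as pd

# New feature data computed by the feature pipeline (illustrative values).
df = pd.DataFrame({
    "cc_num": [1234567811112222],
    "trans_ts": [pd.Timestamp("2024-05-01 10:15:00")],
    "account_id": [42],
    "day": ["2024-05-01"],
    "amount": [37.50],
    "category": ["groceries"],
})

# One insert call: the feature store writes to the offline store and, because the
# feature group is online_enabled, propagates the latest rows to the online store.
transactions_fg.insert(df)
```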
Feature Freshness
The freshness of feature data in feature groups is defined as the total time taken from when an event is first read by a feature pipeline to when the computed feature becomes available for use in an inference pipeline, see Figure 4-8. It includes the time taken for feature data to land in the online feature store and the time taken to read from the online store.
Fresh features for real-time AI systems typically require streaming feature pipelines that update the feature store via a stream API. In Chapter 1, we described how TikTok is a real-time AI system - when you swipe or click, features are created from information about your viewing activity using streaming feature pipelines, and within a couple of seconds they are available as precomputed features in feature groups for predictions. If it took minutes, instead of seconds, TikTok’s recommender would not feel as if it tracks your intent in real-time - its AI would be too laggy to be useful as a recommender.
Data Validation
Some feature stores support data validation when writing feature data to feature groups. For each feature group, you specify constraints for valid feature data values. For example, if the feature is an adult user’s age, you might specify that the age should be greater than 17 and less than 125. Data validation helps avoid problems with data quality in feature groups. Note that there are some exceptions to the general “garbage-in, garbage-out” principle. For example, it is often ok to have missing feature values in a feature group, as you can impute those missing values later in your training and inference pipelines.
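As an illustration, several feature stores integrate with Great Expectations for validation on write; the age constraint above might look like the following sketch (using the pre-1.0 Great Expectations API; how the suite is attached to a feature group differs between feature stores):

```python
from great_expectations.core import ExpectationConfiguration, ExpectationSuite

# A validation suite for an account_details feature group (names are illustrative).
age_suite = ExpectationSuite(
    expectation_suite_name="account_details_validation",
    expectations=[
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_be_between",
            kwargs={"column": "age", "min_value": 18, "max_value": 124},
        )
    ],
)
# The suite would then be attached to the feature group so it runs on every insert.
```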
Now that we’ve covered what a feature group is, what it stores, and how you write to one, let’s now look at how to design a data model for feature groups.
Data Models for Feature Groups
If the feature store is to be the source of our data for AI, we need to understand how to model the data stored in its feature groups. Data modeling for feature stores is the process of deciding:
- what features to create for which entities and what features to include in feature groups,
- what the relationships between the feature groups look like,
- what the freshness requirements for feature data are,
- and what types of queries will be performed on the feature groups.
Data modeling includes the design of a data model. A data model is a term from database theory that refers to how we decompose our data into different feature groups (tables), with the goals of:
- ensuring the integrity of the data,
- improving the performance of writing the data,
- improving the performance of reading (querying) the data,
- improving the scalability of the system as data volumes and/or throughput increase.
You may have heard of Entity-Relationship diagrams (see Figure 4-8, for example) from relational databases. It is a way of identifying entities (such as credit card transactions, user accounts, bank details, and merchant details) and the relationships between those entities. For example, a credit card transaction could have a reference (foreign key) to the credit card owner’s account, the bank that issued the card, and the merchant that performed the transaction. In the relational data model, entities typically map to tables and relationships to foreign keys. Similarly, in feature stores an entity maps to a feature group and relationships map to foreign keys in a feature group.
What is the process to go from requirements and data sources to a data model for feature groups, such as an Entity Relationship Diagram? There are two basic techniques we can use:
- Normalization: reduce data redundancy and improve data integrity.
- Denormalization: improve query performance by increasing data redundancy.
These two techniques produce data models that can be categorized into one of two types: denormalized data models that include redundant (duplicated) data and normalized data models that eliminate redundant data. The benefits and drawbacks of both approaches are shown in Table 4-2.
| | Denormalized Data Model | Normalized Data Model |
|---|---|---|
| Data Storage Costs | Higher due to redundant data in the (row-oriented) online store | Lower due to no redundant data |
| Query Complexity | Lower, due to less need for JOINs when reading from the online store | Higher, due to more JOINs needed when querying data |
In general, denormalized data models are more prevalent in columnar data stores (lakehouses and data warehouses) as they can often efficiently compress redundant data in columns with columnar compression techniques like run-length encoding, while row-oriented data stores cannot compress redundant data, and, therefore, favor normalized data models.
Before we start identifying entities, the features for those entities, and the feature groups to store them in, we should consider the types of AI systems that will use the feature data:
1. batch AI systems
2. real-time AI systems
For (1), feature groups only need to store data in their offline store. As such, we could consider existing data models for columnar stores, such as the star schema and snowflake schema, that are widely used in analytical and business intelligence environments. For (2), we have feature groups with tables in both the offline and online store. For this, we should use a general purpose data model that works equally well for both batch and real-time queries. We will see in the next section that the snowflake schema (a normalized data model) is our preferred data model for feature stores, although some feature stores only support the star schema, so we will introduce both data models. The star schema and snowflake schema are data models that organize data into a fact table that connects to dimension tables. In the star schema, columns in the dimension tables can be redundant (duplicated), but the snowflake schema extends the star schema to enable dimension tables to be connected to other dimension tables, enabling a normalized data model with no redundant data. We will now look at how to design a star schema or snowflake schema data model with fact and dimension tables using dimension modeling.
Note
Other popular data models used in columnar stores include the data vault model (used to efficiently handle data ingestion, where data can arrive late and schema changes happen frequently), and the one big table (OBT) data model (which simplifies data modeling by storing as much data as possible in a single wide table). OBT is not suitable for AI systems, as it would store all the labels and features in a single denormalized table, which would explode storage requirements in the (row-oriented) online store, and it is not suited to storing feature values that change over time. You can learn more about data modeling in the book “Fundamentals of Data Engineering”.
Dimension modeling with a Credit Card Data Mart
The most popular data modeling technique in data warehousing is dimension modeling that categorizes data as facts and dimensions. Facts are usually measured quantities, but can also be qualitative. Dimensions are attributes of facts. Some dimensions change value over time and are called slowly changing dimensions (SCD). Let’s look at an example of facts and dimensions in a credit card transactions data mart. A data mart is a subset of a data warehouse (or lakehouse) that contains data focused on a specific business line, team, or product.
In our example, the credit card transactions are the facts and the dimensions are data about the credit card transactions, such as the card holder, their account details, the bank details, and the merchant details. We will use this data mart to power a real-time AI system for predicting credit card fraud. But first, let’s look at our data mart, illustrated in an Entity-Relationship Diagram in Figure 4-9 using a snowflake schema data model.
The fact table stores:
- credit_card_transactions: a unique ID for the transaction (t_id), the credit card number (cc_num), a timestamp for the transaction (event_time), the amount of money spent (amount), the location (longitude and latitude), and the category of the item purchased.
The dimension tables for the credit card transactions are:
- card_details: the card’s expiry date, issue date, the date when the card was invalidated (if any), and foreign keys to the account and bank details tables (the foreign keys make this a snowflake schema data model).
- account_details: name, address, debt at the end of the previous month, the dates when the account was created and closed (end_date), and the date when a row was last_modified.
- bank_details: credit rating, country, and the date when a row was last_modified.
- merchant_details: count of chargebacks for the merchant in the previous week (cnt_chrgeback_prev_week), its country, and the date when a row was last_modified.
The credit card transactions table is populated using the event sourcing pattern, whereby once per hour, an ETL Spark job reads all the credit card transactions that arrived in Kafka during the previous hour, and persists the events as rows in the credit_card_transactions table. The dimension tables are updated by ETL or ELT pipelines that read changes to dimensions from operational databases (not shown). We will now see how we can use the credit card transaction events in Kafka and the dimension tables to build our real-time fraud detection AI system.
Labels are Facts and Features are Dimensions
In a feature store, the facts are the labels (or targets/observations) for our models, while the features are dimensions for the labels. Like facts, the labels are immutable events that often have a timestamp associated with them. For example, in our credit card fraud model, we will have a label is_fraud for a given credit card transaction and a timestamp for when the credit card transaction took place. The features for that model will be the card usage statistics, details about the card itself (expiry date), the cardholder, the bank, and the merchant. These features are dimensions for the labels, and they are often mutable data. Sometimes they are SCDs, but in real-time machine learning systems, they might be fast changing dimensions. Irrespective of whether the feature values change slowly or quickly, if we want to use a feature as training data for a model, it is crucial to save all values for features at all points in time. If you don’t know when and how a feature changes its value over time, then training data created using that feature could have future data leakage or include stale feature values.
Dimension modeling in data warehousing introduced SCD types to store changing values of dimensions (features). There are at least 5 well-known ways to implement SCDs (SCD Types), each optimized for different ways a dimension could change. Implementing different SCD Types in a data mart is a skilled and challenging job. However, we can massively simplify managing SCDs for feature stores for two reasons. Firstly, as feature values are observations of measurable quantities, each new feature value replaces the old feature value (a feature cannot have multiple alternative values at the same time). Secondly, there are a limited number of query patterns for reading feature data - you read training data and batch inference data from the offline store and rows of feature vectors from the online store. That is, feature stores do not need to support all 5 SCD Types, instead they need a very specific set of SCD Types (0, 2, and 4), and support for those types can be unobtrusively added to feature groups by simply specifying the event_time column in your feature group. This way, feature stores simplify support for SCDs compared to general purpose data warehouses.
Table 4-3 shows how feature stores implement SCD Types 0, 2, and 4 with the relatively straightforward approach of specifying the feature group column that stores the event_time.
| SCD Type | Usage | Description | Feature Store |
|---|---|---|---|
| Type 0 | Immutable feature data | No history is kept for feature data; suitable for features that are immutable. | Feature group, no event_time |
| Type 2 | Mutable feature data used by batch AI systems | When a feature value is updated for an entity ID, a new row is created with a new event_time (but the same entity ID). Each new row is a new version of the feature data. | Offline feature group with event_time |
| Type 4 | Online features for real-time AI systems; offline data for training | Features are stored as records in two different tables - a table in the online store with the latest feature values and a table in the offline store with historical feature values. | Online/offline feature group with event_time |
Type 0 SCD is a feature group that stores immutable feature data. If you do not define the event_time column for your feature group, you have a feature group with Type 0 SCD. Type 2 SCD is an offline-only feature group (for batch AI systems), where we have the historical records for the time-series data. In classical Type 2 SCD, it is assumed that rows need both an end_date and an effective_date (as multiple dimension values may be valid at any point-in-time). However, in the feature store, we don’t need an end_date, only the effective_date, called the event_time, as only a single feature value is valid at any given point-in-time. Type 4 SCD is implemented as a feature group, backed by tables in both the online and offline stores. A table in the online store stores the latest feature data values, and a table with the same name and schema in the offline store stores all of the historical feature data values. In traditional Type 4 SCD, the historical table does not store the latest values, but in feature stores, the offline store stores both the latest feature values and the historical values.
Feature stores hide the complexity of designing a data model that implements these 3 different SCD Types by implementing the data models in their read/write APIs. For example, in the AWS Sagemaker feature store (an API-based feature store), you only need to specify the event time column when defining a feature group:
```python
feature_group.create(
    description = "Some info about the feature group",
    feature_group_name = feature_group_name,
    event_time_feature_name = event_time_feature_name,
    enable_online_store = True,
    ...
    tags = ["tag1","tag2"]
)
```
Writes to this feature group will create Type 4 SCD features, with the latest feature data in a key-value store, ElastiCache or DynamoDB, and historical feature data in a columnar store (Apache Iceberg).
Real-Time Credit Card Fraud Detection AI System
Let’s now start designing our real-time AI system to predict if a credit card transaction is fraudulent. This operational AI system (online inference pipeline) has a service level objective (SLO) of 50ms latency or lower to make the decision on suspicion of fraud or not. It receives a prediction request with the credit card transaction details, retrieves precomputed features from the feature store, computes any on-demand features, merges the precomputed and on-demand features in a single feature vector, applies any model-dependent transformations, makes the prediction, logs the prediction and the untransformed features, and returns the prediction (fraud or not-fraud) to the client.
To build this system and meet our SLO, we will need to write a streaming feature pipeline to create features directly from the events from Kafka, as shown in Figure 4-8. Stream processing enables us to compute aggregations on recent historical activity on credit cards, such as how often a card has been used in the last 5 minutes, 15 minutes, or hour. These features are called windowed aggregations, as they compute an aggregation over events that happen in a window of time. It would not be possible to compute these features within our SLO if we only use the credit_card_transaction table in our data mart, as it is only updated hourly. We can, however, compute other features from the data mart, such as the credit rating of the bank that issued the credit card, and the number of chargebacks for the merchant that processed the credit card transaction.
We will also create on-demand features from the input request data. A feature with good predictive power for geographic fraud attacks is the distance and time between consecutive credit card transactions. If the distance is large and the time is short, that is often indicative of fraud. For this, we compute haversine_distance and time_since_last_transaction features. These on-demand features are computed at run-time with an on-demand transformation function that takes one or more parameters passed as part of the prediction request. On-demand features can additionally take precomputed features from feature groups as parameters.
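As an illustration, the two on-demand features could be implemented as plain Python functions that are shared by the online inference pipeline and by a feature pipeline used for backfilling. The signatures below are assumptions (the previous timestamp and location would typically come from precomputed features such as prev_ts_transaction and prev_loc_transaction):

```python
import math

def haversine_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def time_since_last_transaction(current_ts_secs: float, prev_ts_secs: float) -> float:
    """Seconds elapsed since the previous transaction on the same card."""
    return current_ts_secs - prev_ts_secs
```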
We have described here an AI system that contains a mix of features computed using stream processing, batch processing, and on-demand transformations. However, when we want to train models with these features, the training data will be stored in feature groups in the feature store. So, we need to identify the features and then design a data model for the feature groups.
Data Model for our Real-Time Fraud Detection AI System
We are using a supervised ML model for predicting fraud, so we will need some labeled observations of fraud. For this, there is a new cc_fraud table, not in the data mart, with a t_id column (the unique identifier for credit card transactions) that contains the credit card transactions identified as fraudulent, along with columns for the person who reported the fraud and an explanation for why the transaction is marked as fraudulent. The fraud team updates the cc_fraud table weekly in a Postgres database they manage. Using the data mart and the event bus, we can create features that have predictive power for fraud, as well as the labels, as shown in Table 4-4.
| Data Sources | Simple Features | Engineered Features |
|---|---|---|
| credit_card_transactions | amount, category | {num}/{sum}_trans_last_10_mins, {num}/{sum}_trans_last_hour, {num}/{sum}_trans_last_day, {num}/{sum}_trans_last_week, prev_ts_transaction, prev_loc_transaction, haversine_distance, time_since_last_transaction |
| credit_card_transactions, cc_fraud | is_fraud | |
| credit_card_transactions, card_details | days_until_expired, max_debt_last_12_months | |
| account_details | debt_end_prev_month | max_debt_last_12_months |
| merchant_details | cnt_chrgeback_prev_week | cnt_chrgeback_prev_month |
| bank_details | credit_rating | days_since_bank_cr_changed |
There are many frameworks and programming languages that we could use to create these features, and we will look at source code for them in the next few chapters. For now, we are interested in the data model for our feature groups that we will design to store and query these features, as well as the fraud labels. The feature groups will need to be stored in both online and offline stores, as we will, respectively, use these features in our real-time AI system for inference and in our offline training pipeline. We will now design two different data models, first using the star schema and then using the snowflake schema.
Star Schema Data Model
The star schema data model is supported by all major feature stores. In Figure 4-10, we can see that the feature group containing the fraud labels is called a label feature group.
In practice, a label feature group is just a normal feature group containing the labels for our credit card transactions (fraud or not-fraud). As we will see later, it is only when we select the features and labels for our model that we identify each column in a feature group as either a feature or a label.
Note
Some feature stores do not support storing labels in feature groups. Instead, for these feature stores, clients provide the labels, label timestamps (event_time), and entity IDs for feature groups (containing the features they want to include) when creating training data and inference data. In the Feast feature store, clients provide the labels, label timestamps, and entity IDs in a DataFrame called the Spine DataFrame. The Spine DataFrame contains the same data as our label feature group, but it is not persisted to the feature store. The Spine DataFrame approach can be more flexible for prototyping, as it can also contain additional columns (features) for creating training data. Instead of having to first create your features and write them to a feature group, you can include them as columns in your Spine DataFrame. However, be warned: additional columns can result in skew - it is your responsibility to ensure that any additional columns provided when creating training data are also included (in the same order, with the same data types) when reading inference data.
In Figure 4-10, you can see that the label feature group contains foreign keys to the 4 feature groups that contain features computed from the data mart tables and the event bus. These feature groups are all updated independently in separate feature pipelines that run on their own schedule. For example, the aggregated_transactions feature group is computed by a streaming feature pipeline, while the account_details, bank_details, and merchant_details feature groups are computed by batch jobs that run daily.
Snowflake Schema Data Model
The snowflake schema is a data model that, like the star schema, consists of tables containing labels and features. In contrast to the star schema, however, the feature data is normalized, making it suitable as a data model for both online and offline tables. Each feature group is split until the data is normalized, see Figure 4-11. That is, there is no redundancy in the feature tables and no repetition of values (except for foreign keys that point to primary keys).
In the snowflake schema, you can see that the label feature group now only has 2 foreign keys, compared to 4 foreign keys in the star schema data model. As we will see in the next section, the advantage of the snowflake schema over the star schema is clearest when building a real-time AI system. In a real-time AI system, the foreign keys in the label feature group need to be provided as part of prediction requests by clients. With a snowflake schema, clients only need to provide the cc_num and merchant_id as request parameters in order to retrieve all of the features - features from the nested tables are retrieved with a subquery. With the star schema, however, clients of our real-time AI system need to additionally provide the bank_id and account_id as request parameters, making the system more complex.
Feature Store Data Model for Inference
Labels are obviously not available during inference - our model predicts them. Similarly, none of the values of the features in our label feature group (credit_card_transactions) are available as precomputed features at online inference time (either for the star schema or snowflake data model). They are all either passed as feature values in the prediction request (the foreign keys to the feature groups and the amount and category features) or computed as on-demand feature values using request parameters (time_since_last_trans, haversine_distance, days_to_card_expiry). For this reason, the label feature group is offline only. Its rows can be computed using historical data to create offline training data, but for online inference all of its columns are either passed as parameters, computed, or predicted (the label(s)).
Online Inference
For online inference, a prediction request includes as parameters the foreign keys, any passed features from the label feature group, and any parameters needed to compute on-demand features, see Figure 4-12. The online inference pipeline uses the foreign keys to retrieve all the precomputed features from online feature groups. Feature stores provide either language level APIs (such as Python) or a REST API to retrieve the precomputed features.
Batch Inference
Batch inference has similar data modeling challenges to online inference. In Chapter 11, we will re-imagine our credit card fraud prediction problem as a daily batch job that, for all of yesterday’s credit card transactions, predicts whether they were fraudulent or not. In this case, the labels are not available, of course, but all features in the label feature group can be populated by a feature pipeline ahead of time. This includes computing the on-demand features using historical data. In this case, the on-demand and passed features are updated at a different cadence from the labels, and, as such, it is often beneficial to move labels into their own feature group, separate from the on-demand features.
Feature stores often support batch inference data APIs, such as:
1. read all feature data that has arrived in the last 24 hours and return it as a DataFrame, or
2. read all the latest feature data for a batch of entities (such as all users or all users who live in Sweden).
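For case (1), the call might look like the following sketch, loosely based on the Hopsworks batch data API (the feature_view and model handles and the argument names are assumptions; other feature stores expose similar but differently named methods):

```python
from datetime import datetime, timedelta

# Case (1): read all feature data that arrived in the last 24 hours.
batch_df = feature_view.get_batch_data(
    start_time=datetime.now() - timedelta(hours=24),
    end_time=datetime.now(),
)

# The batch inference pipeline then applies the model to the returned DataFrame
# (the model would be loaded from a model registry).
predictions = model.predict(batch_df)
```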
An alternative API is to allow batch inference clients to provide a Spine DataFrame containing the foreign keys (and timestamps) for features. The feature store takes the Spine DataFrame and adds columns containing the feature values from the feature groups (using the foreign keys and timestamps to retrieve the correct feature values). The Spine DataFrame approach does not work well for case (1), but works well for case (2) above. You have to do the work of adding all the foreign keys to the Spine DataFrame, which is easy if we want to read the latest feature values for all users - we simply pass a Spine DataFrame containing all user IDs. However, reading all feature data that arrived since yesterday requires a more complex query over feature groups, and, here, dedicated batch inference APIs that support such queries are helpful.
Reading Feature Data with a Feature View
After you have designed a data model for your feature store, you need to be able to query it to read training and inference data. Feature stores do not provide full SQL query support for reading feature data. Instead, they provide language level APIs (Python, Java, etc) and/or a REST API for retrieving training data, batch inference data, and online inference data. But, reading precomputed feature data is not the only task for a feature store. The feature store should also apply any model-dependent transformations and on-demand transformations before returning feature data to clients.
Feature stores provide an abstraction that hides the complexity of retrieving/computing features for training and inference for a specific model (or group of related models) called a feature view.
The feature view is a selection of features and, optionally, labels to be used by one or more models for training and inference. The features in a feature view may come from one or more feature groups.
When you have defined a feature view, you can typically use it to:
- retrieve point-in-time correct training data,
- retrieve point-in-time correct batch inference data,
- build feature vectors for online inference by reading precomputed features and merging them with on-demand and passed features,
- apply model-dependent transformations to features when reading feature data for training and inference without introducing offline-online skew,
- apply on-demand transformations in online inference pipelines.
The feature view prevents skew between training and inference by ensuring that the same ordered sequence of features is returned when reading training and inference data, and that the same model-dependent transformations are applied to the training and inference data read from the feature store. Feature views also apply on-demand transformations in online inference pipelines and ensure they are consistent with the feature pipeline.
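For example, a feature view for our fraud model could be defined by selecting columns from the feature groups in our data model. The sketch below is loosely based on the Hopsworks API (the feature group handles, column names, and join syntax are assumptions; other feature stores express the same selection differently):

```python
# Select features (and the label) from several feature groups into one query.
query = (
    trans_fg.select(["loc_diff", "amount", "is_fraud"])
    .join(aggs_fg.select(["last_week"]))
    .join(bank_fg.select(["credit_rating", "country"]))
    .join(merchant_fg.select(["chrgbk"]))
)

# The feature view is metadata-only: it records the selection, the label column,
# and (where supported) the model-dependent transformations to apply on read.
fraud_fv = fs.get_or_create_feature_view(
    name="cc_fraud_model",
    version=1,
    query=query,
    labels=["is_fraud"],
)
```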
For training and batch inference data, feature stores support reading data as either DataFrames or files. For small data volumes, Pandas DataFrames are popular, but when data volumes exceed a few GBs, some feature stores support reading to Polars and/or Spark DataFrames. Spark DataFrames are, however, not that widely used for training pipelines (Python ML frameworks are used to train the vast majority of models). For large amounts of data (that don’t fit in a Polars or Pandas DataFrame), feature stores support creating training data as files in an external file system or object store, in file formats such as Parquet, CSV, and TFRecord (TensorFlow’s row-oriented file format that is also supported by PyTorch).
Different feature stores use different names for feature views, including FeatureLookup (Databricks) and FeatureService (Feast, Tecton). I prefer the term feature view due to its close relationship to views from relational databases - a feature view is a selection of columns from different feature groups and it is metadata-only (feature views do not store data). A feature view is also not a service when it is used in training or batch inference pipelines, and it is not just a selection of features (as implied by a FeatureLookup). For these reasons, we use the term feature view.
All major feature stores support on-demand transformations, but, at the time of writing, Hopsworks is the only feature store to support model-dependent transformations in feature views. Not all feature stores implement on-demand transformations according to the data transformation taxonomy presented in this book, either. For example, Databricks’ on-demand transformations are applied in training and online inference pipelines. That is, Databricks’ on-demand transformations produce model-specific features (as they cannot be used in feature pipelines to create reusable features).
Point-in-Time Correct Training Data with Feature Views
Now we look at how feature views create point-in-time correct training data using temporal joins (see Figure 4-4 earlier). A temporal join processes each row from the table containing the labels, and uses each row’s event_time value to join columns from other feature tables, where the joined rows from the feature tables are the rows that have their own event_time value that is closest to, but not greater than the label’s event_time value. If there are no matching rows in the feature tables, the temporal join should return null values.
This temporal join is implemented as an ASOF LEFT (OUTER) JOIN, where the query starts from the table containing the labels, pulling in columns (features) from the tables containing the features, with the ASOF condition ensuring there is no future data leakage for the joined feature values and the LEFT OUTER JOIN condition ensuring rows are returned even if feature values are missing in the resultant training data. The number of rows in the training data should be the same as the number of rows in the table containing the labels.
In Figure 4-13, we can see how the ASOF LEFT JOIN creates the training data from 4 different feature groups (we omitted the account_details feature group for brevity). Starting from the label feature group (credit_card_transactions), it joins in features from the other 3 feature groups (aggregated_transactions, bank_details, merchant_details), as of the event_time in credit_card_transactions.
For example, in our credit card fraud data model, if we want to create training data from the 1st January 2022, we could execute the following nested ASOF LEFT JOIN on our label table and feature tables:
```sql
SELECT label.loc_diff, label.amount, aggs.last_week, bank.country,
       bank.credit_rating as b_rating, merchant.chrgbk, label.fraud
FROM credit_card_transactions as label
ASOF LEFT JOIN aggregated_transactions as aggs
  ON label.cc_num = aggs.cc_num AND label.event_ts >= aggs.event_ts
ASOF LEFT JOIN bank_details as bank
  ON aggs.bank_id = bank.bank_id AND label.event_ts >= bank.event_ts
ASOF LEFT JOIN merchant_details as merchant
  ON label.merc_id = merchant.merc_id AND label.event_ts >= merchant.event_ts
WHERE label.event_ts > '2022-01-01 00:00';
```
The above query returns all the rows in the label feature group where the event_ts is greater than the 1st of January 2022, and joins each row with one column from aggregated_transactions (last_week), two columns from bank_details (credit_rating and country), and one column from the merchant_details table (chrgbk). For each row in the final output, the joined rows have the event_ts that is closest to, but less than, the value of event_ts in the label feature group. It is a LEFT JOIN, not an INNER JOIN, as an INNER JOIN would exclude rows from the training data where a foreign key in the label table does not match a row in a feature table. In most cases, it is ok to have missing feature values, as you can impute missing feature values in model-dependent transformations.
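In practice, you rarely write this SQL yourself. The feature view hides the temporal join behind an API call; the sketch below is loosely based on the Hopsworks training data API (the feature_view handle and the method and argument names vary between feature stores):

```python
# The feature view executes the ASOF LEFT JOIN on your behalf and returns
# point-in-time correct features (X) and labels (y).
X, y = feature_view.training_data(
    start_time="2022-01-01 00:00",
    description="credit card fraud training data from 2022 onwards",
)
```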
Feature Vectors for Online Inference with a Feature View
In online inference, the feature store provides APIs for retrieving precomputed features, computing on-demand features, and applying model-dependent transformations - together, these steps build the feature vector that is passed to the model. In our example from Figure 4-13, two queries are required to retrieve the precomputed features: (1) a primary key lookup for the merchant features using merchant_id, and (2) a left join to read the aggregation and bank features using cc_num. The feature view provides an API to retrieve the precomputed features, as shown here for Hopsworks:
```python
df = feature_view.get_feature_vectors(
    entry=[{"cc_num": 1234567811112222, "merchant_id": 212}]
)
```
On-demand transformations and model-dependent transformations also need to be applied to the returned feature data, and we will look more at how feature views support them in Chapter 7.
Conclusions
Feature stores are the data layer for AI systems. We dived deep into the anatomy of a feature store and we looked at when it is appropriate for you to use one. We looked at how to organize your feature data in Feature Groups, and how to organize your data in a data model for batch and real-time AI systems. We also looked at how feature views help prevent skew between training and inference, and how they are used to query feature data for training and inference. In the next chapter we will look at the Hopsworks Feature Store in detail.