Chapter 4. Feature Store Service

So far, we have discovered the available datasets and artifacts that can be used to generate the required insight. In the case of ML models, there is an additional step of discovering features. For instance, a revenue forecasting model that needs to be trained would require the previous revenue numbers by market, product line, and so on as input. A feature is a data attribute that can either be extracted directly or derived by computation over one or more data sources—e.g., the age of a person, a coordinate emitted from a sensor, a word from a piece of text, or an aggregate value like the average number of purchases within the last hour. Using a feature in an ML model requires the historic values of that data attribute.
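For instance, an aggregate feature such as the number of purchases within the last hour can be derived from raw event data with a rolling time window. The following is a minimal pandas sketch; the table, column names, and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw purchase events: one row per purchase per customer.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchased_at": pd.to_datetime([
        "2021-03-01 10:05", "2021-03-01 10:20", "2021-03-01 11:45",
        "2021-03-01 10:10", "2021-03-01 10:40",
    ]),
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0],
})

# Derived feature: number of purchases per customer within the last hour,
# computed as a rolling count over a one-hour time window.
events = events.sort_values("purchased_at").set_index("purchased_at")
purchases_last_hour = (
    events.groupby("customer_id")["amount"]
          .rolling("1h")
          .count()
          .rename("purchases_last_hour")
)
print(purchases_last_hour)
```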

Data scientists spend a significant amount of their time creating training datasets for ML models. Building data pipelines to generate the features for training, as well as for inference, is a significant pain point. First, data scientists have to write low-level code for accessing datastores, which requires data engineering skills. Second, the pipelines for generating these features have multiple implementations that are not always consistent—i.e., there are separate pipelines for training and inference. Third, the pipeline code is duplicated across ML projects and not reusable since it is embedded as part of the model implementation. Finally, there is no change management or governance of features. These aspects impact the overall time to insight. This is especially true as data users typically lack the engineering skills to develop robust pipelines and monitor them in production. Also, feature pipelines are repeatedly built from scratch instead of being shared across ML projects. The process of building ML models is iterative and requires exploration with different feature combinations.

Ideally, a feature store service should provide well-documented, governed, versioned, and curated features for training and inference of ML models (as shown in Figure 4-1). Data users should be able to search and use features to build models with minimal data engineering. The feature pipeline implementations for training as well as inference are consistent. In addition, features are cached and reused across ML projects, reducing training time and infrastructure costs. The success metric of this service is time to featurize. As the feature store service is built up with more features, it provides economies of scale by making it easier and faster to build new models.

Figure 4-1. The feature store as the repository of features that are used for training and inference of models across multiple data projects.

Journey Map

Developing and managing features is a critical part of developing ML models. Data projects often share a common set of features, allowing the same features to be reused; as the number of available features grows, the cost of implementing new data projects goes down (as shown in Figure 4-2). This section discusses key scenarios in the journey map for the feature store service.

Figure 4-2. The time and effort required for new data projects goes down as the number of available features in the feature store grows.

Finding Available Features

As a part of the exploration phase, data scientists search for available features that can be leveraged to build the ML model. The goal of this phase is to reuse features and reduce the cost to build the model. The process involves analyzing whether the available features are of good quality and how they are being used currently. Due to a lack of a centralized feature repository, data scientists often skip the search phase and develop ad hoc training pipelines that have a tendency to become complex over time. As the number of models increases, it quickly becomes a pipeline jungle that is hard to manage.

Training Set Generation

During model training, datasets consisting of one or more features are required to train the model. The training set, which contains the historic values of these features, is generated along with a prediction label. Preparing the training set involves writing queries that extract data from the source datasets and then transform, cleanse, and generate the historic values of the features. A significant amount of time is spent developing the training set. The feature set also needs to be updated continuously with new values (a process referred to as backfilling). With a feature store, curated training datasets for these features are readily available while building models.

Feature Pipeline for Online Inference

For model inference, the feature values are provided as input to the model, which then generates the predicted output. The pipeline logic for generating features during inference should match the logic used during training; otherwise, the model predictions will be incorrect. Besides the pipeline logic, an additional requirement is low latency when generating feature values for inference in online models. Today, the feature pipelines embedded within the ML pipeline are not easily reusable. Further, changes in training pipeline logic may not be coordinated correctly with the corresponding model inference pipelines.

Minimize Time to Featurize

Time to featurize is the time spent creating and managing features. Today, the time spent is broadly divided into two categories: feature computation and feature serving. Feature computation involves data pipelines for generating features both for training as well as inference. Feature serving focuses on serving bulk datasets during training, low-latency feature values for model inference, and making it easy for data users to search and collaborate across features.

Feature Computation

Feature computation is the process of converting raw data into features. This involves building data pipelines for generating historic training values of the feature as well as current feature values used for model inference. Training datasets need to be continuously backfilled with newer samples. There are two key challenges with feature computation.

First, there is the complexity of managing pipeline jungles. Pipelines extract the data from the source datastores and transform them into features. These pipelines have multiple transformations and need to handle corner cases that arise in production. Managing these at scale in production is a nightmare. Also, the number of feature data samples continues to grow, especially for deep learning models. Managing large datasets at scale requires distributed programming optimizations for scaling and performance. Overall, building and managing data pipelines is typically one of the most time-consuming parts of the overall time to insight of model creation.

Second, separate pipelines are written for training and inference for a given feature. This is because there are different freshness requirements, as model training is typically batch-oriented, while model inference is streaming with near real-time latency. Discrepancies between training and inference pipeline computation are a key reason for model correctness issues and a nightmare to debug at production scale.

Feature Serving

Feature serving involves serving feature values in bulk for training, as well as at low latency for inference. It requires features to be easy to discover, and to compare and analyze against other existing features. In a typical large-scale deployment, feature serving supports thousands of model inferences. Scaling performance is one of the key challenges, as is avoiding duplicate features given the fast-paced exploration of data users across hundreds of model permutations during prototyping.

Today, one of the common issues is that the model performs well on the training dataset but not in production. While there can be multiple reasons for this, the key problem is referred to as label leakage. This arises as a result of incorrect point-in-time values being served for the model features. Finding the right feature values is tricky. To illustrate, Zanoyan et al. cover an example illustrated in Figure 4-3. It shows the feature values selected in training for a prediction at time T1. There are three features shown: F1, F2, F3. For prediction P1, feature values 7, 3, 8 need to be selected for training features F1, F2, F3, respectively. If the feature values observed after the prediction are used instead (such as value 4 for F1), there will be feature leakage, since those values reflect the outcome of the prediction and introduce a spuriously high correlation during training (see the sketch after Figure 4-3).

Figure 4-3. The selection of correct point-in-time values for features F1, F2, F3 during training for prediction P1. The actual outcome Label L is provided for training the supervised ML model.
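The point-in-time selection shown in Figure 4-3 can be implemented as an "as of" join that only picks feature values observed at or before the prediction time. Below is a minimal pandas sketch that mirrors the F1 example; the timestamps and values are made up:

```python
import pandas as pd

# Hypothetical history of feature F1: each row is a value observed at a point in time.
feature_f1 = pd.DataFrame({
    "ts": pd.to_datetime(["2021-03-01 09:00", "2021-03-01 09:30", "2021-03-01 10:30"]),
    "f1": [5, 7, 4],  # 4 is only observed *after* the prediction time
})

# Prediction event at time T1 with its actual outcome (label L).
labels = pd.DataFrame({
    "ts": pd.to_datetime(["2021-03-01 10:00"]),
    "label": [1],
})

# Point-in-time ("as of") join: for each prediction, pick the latest feature value
# observed at or before the prediction time. This selects f1 = 7 rather than the
# post-prediction value 4, avoiding feature/label leakage.
training_row = pd.merge_asof(labels, feature_f1, on="ts", direction="backward")
print(training_row)
```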

Defining Requirements

The feature store service is a central repository of features, providing both historical values of features over long durations (such as weeks or months) and near real-time feature values over the last several minutes. The requirements of a feature store are divided into feature computation and feature serving.

Feature Computation

Feature computation requires deep integration with the data lake and other data sources. There are three dimensions to consider for feature computation pipelines.

First, consider the diverse types of features to be supported. Features can be associated with individual data attributes or can be composite aggregates. Further, features can be relatively static or can change continuously over time. Computing features typically requires multiple primitive functions to be supported by the feature store, similar to the functions currently used by data users, such as the following (illustrated in the pandas sketch after this list):

  • Converting categorical data into numeric data

  • Normalizing data when features originate from different distributions

  • One-hot encoding or feature binarization

  • Feature binning (e.g., converting continuous features into discrete features)

  • Feature hashing (e.g., to reduce the memory footprint of one-hot-encoded features)

  • Computing aggregate features (e.g., count, min, max, and stdev)
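As a rough illustration of these primitives, the following pandas sketch shows one-hot encoding, normalization, binning, and aggregate features over a hypothetical trip dataset; all column names and values are made up:

```python
import pandas as pd

# Hypothetical raw trip data.
df = pd.DataFrame({
    "city": ["SF", "NYC", "SF", "LA"],
    "distance_km": [1.2, 8.4, 3.3, 15.0],
    "fare": [6.0, 25.0, 11.0, 40.0],
})

# Converting categorical data into numeric data via one-hot encoding.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Normalizing a feature (z-score) when features come from different distributions.
fare_normalized = (df["fare"] - df["fare"].mean()) / df["fare"].std()

# Feature binning: converting a continuous feature into discrete buckets.
distance_bucket = pd.cut(
    df["distance_km"], bins=[0, 2, 5, 10, 100], labels=["xs", "s", "m", "l"]
)

# Aggregate features: count, min, max, and standard deviation per city.
aggregates = df.groupby("city")["fare"].agg(["count", "min", "max", "std"])

print(one_hot, fare_normalized, distance_bucket, aggregates, sep="\n\n")
```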

Second, consider the programming libraries that are required to be supported for feature engineering. Spark is a preferred choice for data wrangling among users working with large-scale datasets. Users working with small datasets prefer frameworks such as NumPy and pandas. Feature engineering jobs are built using notebooks, Python files, or .jar files and run on computation frameworks such as Samza, Spark, Flink, and Beam.

Third, consider the source system types where the feature data is persisted. The source systems can be a range of relational databases, NoSQL datastores, streaming platforms, and file and object stores.

Feature Serving

A feature store needs to support strong collaboration capabilities. Features should be defined and generated such that they are shareable across teams.

Feature groups

A feature store has two interfaces: writing features to the store and reading features for training and inference. Features are typically written to a file or a project-specific database. Features can be further grouped together based on the ones that are computed by the same processing job or from the same raw dataset. For instance, for a ride-sharing service like Uber, all the trip-related features for a geographical region can be managed as a feature group since they can all be computed by one job that scans through the trip history. Features can be joined with labels (in the case of supervised learning) and materialized into a training dataset. Feature groups typically share a common column, such as a timestamp or customer ID, that allows feature groups to be joined together into a training dataset. The feature store creates and manages the training dataset, persisted as TFRecords, Parquet, CSV, TSV, HDF5, or .npy files.
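As an illustration, the following pandas sketch joins two hypothetical feature groups and a label table on their shared columns (a customer ID and a timestamp) and materializes the result as a Parquet training dataset; all names and values are made up:

```python
import pandas as pd

# Two hypothetical feature groups that share common join columns
# (customer_id and a timestamp), e.g., computed by separate jobs.
trip_features = pd.DataFrame({
    "customer_id": [1, 2],
    "ts": pd.to_datetime(["2021-03-01", "2021-03-01"]),
    "trips_last_7d": [12, 3],
})
payment_features = pd.DataFrame({
    "customer_id": [1, 2],
    "ts": pd.to_datetime(["2021-03-01", "2021-03-01"]),
    "avg_fare_last_7d": [14.5, 22.0],
})
labels = pd.DataFrame({
    "customer_id": [1, 2],
    "ts": pd.to_datetime(["2021-03-01", "2021-03-01"]),
    "churned": [0, 1],
})

# Join the feature groups with the labels on the shared columns and
# materialize the result as a Parquet training dataset.
training_set = (
    trip_features
    .merge(payment_features, on=["customer_id", "ts"])
    .merge(labels, on=["customer_id", "ts"])
)
training_set.to_parquet("training_set.parquet", index=False)
```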

Scaling

There are some aspects to consider with respect to scaling:

  • The number of features to be supported in the feature store

  • The number of models calling the feature store for online inferences

  • The number of models for daily offline inference as well as training

  • The amount of historic data to be included in training datasets

  • The number of daily pipelines to backfill the feature datasets as new samples are generated

Additionally, there are specific performance scaling requirements associated with online model inference—e.g., the TP99 latency for computing a feature value. For online training, take into account the time to backfill training datasets and to handle DB schema mutations. Typically, historical features need to be less than 12 hours old, and near real-time feature values need to be less than 5 minutes old.

Feature analysis

Features should be searchable and easily understandable, to ensure they are reused across ML projects. Data users need to be able to identify the transformations as well as analyze the features, finding outliers, distribution drift, and feature correlations.
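The following pandas sketch illustrates the kind of analysis described here: feature correlations, descriptive statistics across feature versions to spot distribution drift, and a simple outlier check. The data and column names are hypothetical:

```python
import pandas as pd

# Hypothetical snapshots of a feature group from two versions/points in time.
v1 = pd.DataFrame({"trips_last_7d": [3, 5, 4, 6, 5], "avg_fare": [10, 12, 11, 13, 12]})
v2 = pd.DataFrame({"trips_last_7d": [8, 9, 10, 9, 11], "avg_fare": [10, 12, 11, 13, 12]})

# Feature correlations: highly correlated pairs may indicate redundant features.
print(v1.corr())

# Descriptive statistics per version: comparing them (or histograms) across
# versions helps spot distribution drift (covariate shift).
print(v1["trips_last_7d"].describe())
print(v2["trips_last_7d"].describe())

# A simple outlier check: values more than three standard deviations from the mean.
z = (v1["avg_fare"] - v1["avg_fare"].mean()) / v1["avg_fare"].std()
print(v1[z.abs() > 3])
```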

Nonfunctional Requirements

Similar to any software design, the following are some of the key NFRs that should be considered in the design of a feature store service:

Automated monitoring and alerting
The health of the service should be easy to monitor. Any issues during production should generate automated alerts.
Response times
It is important to have the service respond to feature search queries on the order of milliseconds.
Intuitive interface
For the feature store service to be effective, it needs to be adopted across all data users within the organization. As such, it is critical to have APIs, CLIs, and a web portal that are easy to use and understand.

Implementation Patterns

Corresponding to the existing task map, there are two levels of automation for the feature store service (as shown in Figure 4-4). Each level corresponds to automating a combination of tasks that are currently either manual or inefficient:

Hybrid feature computation pattern
Defines the pattern to combine batch and stream processing for computing features.
Feature registry pattern
Defines the pattern to serve the features for training and inference.
Figure 4-4. The different levels of automation for the feature store service.

Feature store services are becoming increasingly popular: Uber’s Michelangelo, Airbnb’s Zipline, Gojek’s Feast, Comcast’s Applied AI, Logical Clocks’ Hopsworks, Netflix’s Fact Store, and Pinterest’s Galaxy are some popular examples of feature store services, several of them open source. A good list of emerging feature stores is available at featurestore.org. From an architecture standpoint, each of these implementations has two key building blocks: feature computation and serving.

Hybrid Feature Computation Pattern

The feature computation module has to support two sets of ML scenarios:

  • Offline training and inference where bulk historic data is calculated at the frequency of hours

  • Online training and inference where feature values are calculated every few minutes

In the hybrid feature computation pattern, there are three building blocks (as shown in Figure 4-5):

Batch compute pipeline
Traditional batch processing runs as an ETL job every few hours, or daily, to calculate historic feature values. The pipeline is optimized to run on large time windows.
Streaming compute pipeline
Streaming analytics performed on data events in a real-time message bus to compute feature values at low latency. The feature values are backfilled into the bulk historic data from the batch pipeline.
Feature spec
To ensure consistency, instead of data users creating pipelines for new features, they define a feature spec using a domain-specific language (DSL). The spec defines the data sources, dependencies, and the transformation required to generate the feature, and is automatically converted into batch and streaming pipelines. This ensures consistency in pipeline code for training as well as inference without user involvement (a hypothetical spec sketch follows Figure 4-5).
Figure 4-5. Parallel pipelines in the hybrid feature computation pattern.
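As an illustration of what a feature spec might capture, the following sketch expresses a hypothetical spec as a plain Python dict; real systems use their own DSLs (for example, Michelangelo's Scala-based DSL), and the field names here are made up:

```python
# A hypothetical feature spec, expressed as a plain Python dict for illustration
# only. The same spec would drive both the batch and the streaming pipeline.
avg_prep_time_spec = {
    "name": "restaurant_avg_prep_time",   # canonical feature name
    "entity": "restaurant_id",            # key the feature is computed per
    "source": "eats.orders",              # raw data source / dependency
    "transformation": "avg(prep_time_seconds)",
    "windows": ["7d", "1h"],              # batch (7 days) and streaming (1 hour) variants
    "ttl": "30d",                         # how long values remain valid in the store
}
```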

An example of the hybrid feature computation pattern is Uber’s Michelangelo. It implements a combination of Apache Spark and Samza. Spark is used for computing batch features, and the results are persisted in Hive. Batch jobs compute feature groups and write to a single Hive table as a feature per column. For example, Uber Eats (Uber’s food delivery service) uses the batch pipeline for features like a restaurant’s average meal preparation time over the last seven days. For the streaming pipeline, Kafka topics are consumed by Samza streaming jobs to generate near real-time feature values that are persisted in key-value format in Cassandra. Bulk precomputing and loading of historical features from Hive into Cassandra happens on a regular basis. For example, Uber Eats uses the streaming pipeline for features like a restaurant’s average meal preparation time over the last hour. Features are defined using a DSL that selects, transforms, and combines the features that are sent to the model at training and prediction times. The DSL is implemented as a subset of Scala; it is a pure functional language with a complete set of commonly used functions, and data users also have the ability to add their own user-defined functions.
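A batch feature job in the spirit of this example might look like the following PySpark sketch, which computes a restaurant's average meal preparation time over the last seven days and persists it to a Hive-backed table. The table and column names are hypothetical, and this is not Michelangelo's actual implementation:

```python
from pyspark.sql import SparkSession, functions as F

# Minimal PySpark sketch of a batch feature pipeline: compute each restaurant's
# average meal preparation time over the last seven days and persist it to a
# Hive-backed feature table. Table and column names are hypothetical.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

orders = spark.table("eats.orders")  # hypothetical raw orders table in Hive

avg_prep_time_7d = (
    orders
    .where(F.col("order_date") >= F.date_sub(F.current_date(), 7))
    .groupBy("restaurant_id")
    .agg(F.avg("prep_time_seconds").alias("avg_prep_time_7d"))
)

# Persist the feature group; a streaming job (e.g., consuming Kafka) would keep
# a near real-time variant of the same feature up to date in the online store.
avg_prep_time_7d.write.mode("overwrite").saveAsTable("feature_store.restaurant_prep_time_7d")
```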

Strengths of the hybrid feature computation pattern:

  • It provides optimal performance of feature computation across batch and streaming time windows.

  • The DSL to define features avoids inconsistencies associated with discrepancies in pipeline implementation for training and inference.

Weakness of the hybrid feature computation pattern:

  • The pattern is nontrivial to implement and manage in production. It requires the data platform to be fairly mature.

The hybrid feature computation pattern is an advanced approach for implementing computation of features that is optimized for both batch and streaming. Programming models like Apache Beam are increasingly converging the batch and streaming divide.

Feature Registry Pattern

The feature registry pattern ensures that it is easy to discover and manage features. It also provides performant serving of feature values for online/offline training and inference. The requirements for these use cases are quite varied, as observed by Li et al. Efficient bulk access is required for batch training and inference, while low-latency, per-record access is required for real-time prediction. A single store is not optimal for both historical and near real-time features for the following reasons: a) datastores are efficient for either point queries or for bulk access, but not both, and b) frequent bulk access can adversely impact the latency of point queries, making it difficult for the two access patterns to coexist. Irrespective of the use case, features are identified via canonical names.
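The dual-store split can be pictured as a thin client that routes bulk reads to the offline store and point lookups to the online store, with features addressed by canonical names in both. The following Python sketch is purely illustrative; the class and method names are made up:

```python
from typing import Any, Dict, List

# Illustrative sketch (not a real library) of why two stores coexist behind one
# API: bulk reads go to an offline store (e.g., Hive/S3) optimized for large
# scans, while low-latency point lookups go to an online key-value store
# (e.g., Cassandra). Features are addressed by canonical names in both cases.
class FeatureStoreClient:
    def __init__(self, offline_store, online_store):
        self.offline = offline_store   # efficient bulk access
        self.online = online_store     # efficient per-key point queries

    def get_training_data(self, feature_names: List[str], start, end):
        """Bulk, historical access used for training and offline inference."""
        return self.offline.scan(feature_names, start, end)

    def get_online_features(self, feature_names: List[str], entity_key: str) -> Dict[str, Any]:
        """Low-latency point lookup used at prediction time."""
        return {name: self.online.get(f"{name}:{entity_key}") for name in feature_names}
```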

For feature discovery and management, the feature registry pattern is the user interface for publishing and discovering features and training datasets. The feature registry pattern also serves as a tool for analyzing feature evolution over time by comparing feature versions. When starting a new data science project, data scientists typically begin by scanning the feature registry for available features, only adding new features that do not already exist in the feature store for their model.

The feature registry pattern has the following building blocks:

Feature values store
Stores the feature values. Common solutions for bulk stores are Hive (used by Uber and Airbnb), S3 (used by Comcast), and Google BigQuery (used by Gojek). For online data, a NoSQL store like Cassandra is typically used.
Feature registry store
Stores code to compute features, feature version information, feature analysis data, and feature documentation. The feature registry provides automatic feature analysis, feature dependency tracking, feature job tracking, feature data preview, and keyword search on feature/feature group/training dataset metadata.

An example of the feature registry pattern is the Hopsworks feature store. Users query the feature store using SQL or programmatically, and the feature store returns the features as a dataframe (as shown in Figure 4-6; a generic query sketch follows the figure). Feature groups and training datasets in the Hopsworks feature store are linked to Spark/NumPy/pandas jobs, which enables the reproduction and recomputation of the features when necessary. When a feature group or training dataset is created, the feature store also performs a data analysis step, looking at cluster analysis of feature values, feature correlation, feature histograms, and descriptive statistics. For instance, feature correlation information can be used to identify redundant features, feature histograms can be used to monitor feature distributions between different versions of a feature to discover covariate shift, and cluster analysis can be used to spot outliers. Having such statistics accessible in the feature registry helps users decide which features to use.

Figure 4-6. User queries to the feature store generate dataframes (represented in popular formats, namely pandas, NumPy, or Spark) (from the Hopsworks documentation).
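The SQL access path can be sketched generically as a query that joins feature group tables and returns a dataframe. The following Spark SQL example uses hypothetical table and column names and is not the exact Hopsworks API:

```python
from pyspark.sql import SparkSession

# Generic sketch of the SQL access path: feature groups are exposed as tables,
# and a query joining them returns a dataframe.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

training_df = spark.sql("""
    SELECT t.customer_id,
           t.trips_last_7d,
           p.avg_fare_last_7d,
           l.churned
    FROM   feature_store.trip_features_1    t
    JOIN   feature_store.payment_features_1 p ON t.customer_id = p.customer_id
    JOIN   feature_store.labels_1           l ON t.customer_id = l.customer_id
""")

# The dataframe can then be handed to downstream libraries, e.g., as pandas.
training_pdf = training_df.toPandas()
```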

Strengths of the feature registry pattern:

  • It provides performant serving of training datasets and feature values.

  • It reduces feature analysis time for data users.

Weaknesses of the feature registry pattern:

  • The potential for a performance bottleneck while serving hundreds of models.

  • The difficulty of scaling continuous feature analysis as the number of features grows.

Summary

Today, there is no principled way to access features during model serving and training. Features cannot easily be reused between multiple ML pipelines, and ML projects work in isolation without collaboration and reuse. Given that features are deeply embedded in ML pipelines, when new data arrives, there is no way to pin down exactly which features need to be recomputed; rather, the entire ML pipeline needs to be run to update features. A feature store addresses these symptoms and enables economies of scale in developing ML models.
