Chapter 7. Model-Dependent and On-Demand Transformations
In this chapter, we will look at data transformations in training and inference pipelines and how to ensure that the transformations in both pipelines are equivalent. We introduced model-dependent transformations (MDTs) in Chapter 2 as data transformations that are applied to data after it has been read from the feature store and that create features specific to one model. There are two broad classes of MDTs: feature transformations for numerical and categorical features, and transformations that are tightly coupled to a single model. An example of the former is one-hot encoding of categorical variables, while an example of the latter is text encoding for an LLM.
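For example, here is a minimal sketch of one-hot encoding as an MDT using scikit-learn (the feature name payment_method and its values are hypothetical). The key point is that the set of output columns is determined by the categories observed in this model's training data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature read from the feature store
df = pd.DataFrame({"payment_method": ["card", "cash", "card", "wire"]})

# One-hot encoding is an MDT: the output columns depend on the
# categories observed in this model's training data
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["payment_method"]]).toarray()

print(encoder.get_feature_names_out())  # e.g. ['payment_method_card', ...]
print(encoded)
```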
We also look at how to prevent skew between MDTs that are applied separately in training and inference pipelines. Preventing skew is not always as trivial as applying the same versioned function in both pipelines, because many MDTs are stateful: they require the same state (statistics computed over the model’s training data) as a parameter at both training and inference time. We start by introducing common examples of feature transformations and the different classes of model-specific transformations. We then look at different mechanisms for preventing skew, including Scikit-Learn pipelines, PyTorch transforms, and transformation functions in feature views in Hopsworks. We also cover our final class of data transformation: on-demand transformations (ODTs) that are ...
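To make the statefulness concrete, here is a minimal sketch of one such mechanism, a Scikit-Learn pipeline containing a fitted StandardScaler that is persisted by the training pipeline and reloaded by the inference pipeline so that both apply the same state (the file name mdt.joblib and the feature values are hypothetical):

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# --- Training pipeline ---
X_train = np.array([[10.0], [20.0], [30.0]])    # hypothetical numerical feature
mdt = Pipeline([("scaler", StandardScaler())])  # stateful MDT: learns mean/std
mdt.fit(X_train)
joblib.dump(mdt, "mdt.joblib")                  # persist the fitted state alongside the model

# --- Inference pipeline ---
mdt = joblib.load("mdt.joblib")                 # reload the *same* fitted state
X_request = np.array([[25.0]])
print(mdt.transform(X_request))                 # scaled with the training-data statistics
```

If the inference pipeline were instead to fit a new scaler on request data, the two pipelines would apply different transformations and the model would see skewed inputs.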