Chapter 8. Batch Feature Pipelines
In the previous two chapters, we looked at how to implement data transformations to create reusable features and model-specific features. Now we’ll look at how to productionize the creation of reusable feature data using batch feature pipelines. A batch feature pipeline is a program that reads data from data sources, applies model-independent transformations (MITs) to the extracted data, and stores the computed feature data in the feature store. The batch feature pipeline can run on a schedule, for example, once per hour or day, incrementally processing new data as it becomes available. It can also be run on demand to transform a large volume of historical data into features, in a process known as backfilling.
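The read–transform–write loop described above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: the data source, the MIT, and the feature store are all stand-ins (an in-memory DataFrame, a per-customer aggregation, and a plain dictionary), and `run_pipeline` is a hypothetical entry point that a scheduler would invoke each hour or day with the timestamp of the last successful run.

```python
import pandas as pd

def read_new_data(last_run: pd.Timestamp) -> pd.DataFrame:
    """Stand-in for an incremental read: return only rows newer than last_run."""
    raw = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "amount": [10.0, 25.0, 40.0],
        "event_time": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    })
    return raw[raw["event_time"] > last_run]

def apply_mits(df: pd.DataFrame) -> pd.DataFrame:
    """A model-independent transformation: total spend per customer."""
    return (df.groupby("customer_id", as_index=False)
              .agg(total_spend=("amount", "sum")))

feature_store = {}  # stand-in for a feature store client

def run_pipeline(last_run: pd.Timestamp) -> None:
    """One scheduled run: read new data, compute features, write them."""
    features = apply_mits(read_new_data(last_run))
    for row in features.itertuples(index=False):
        feature_store[row.customer_id] = row.total_spend

# A scheduler (cron, an orchestrator, etc.) would call this on each run:
run_pipeline(pd.Timestamp("2024-01-01"))
```

Backfilling fits the same shape: instead of the last run's timestamp, the pipeline is invoked with a timestamp far enough in the past to cover the historical data to be transformed.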
The goal of a batch feature pipeline is to automate feature creation in what is known as batch processing, which is efficient in its use of resources compared with processing a single record at a time. For example, imagine the time it takes to empty a dishwasher one glass or plate at a time versus unloading it in batches of plates and glasses. Similarly, in data processing, handling batches of records is much more efficient than handling one record at a time. Also, if batch processing is performed daily, you can take advantage of lower-cost off-peak processing time at night. Another operational benefit, compared with stream processing, is that errors only need to be fixed before the next scheduled run of your batch feature pipeline—you might not need to ...
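The efficiency argument comes down to amortizing fixed per-operation overhead, such as a network round trip per write. The toy sketch below makes this concrete by counting calls rather than measuring time; the `write` function and its call counter are illustrative stand-ins, not a real client API.

```python
# Count how many "expensive" calls (e.g., network round trips) each
# strategy makes to persist the same 1,000 records.
calls = {"n": 0}

def write(records: list) -> None:
    """Stand-in for a write that pays a fixed overhead per call."""
    calls["n"] += 1

data = list(range(1000))

# Record-at-a-time: one call (one overhead) per record.
calls["n"] = 0
for record in data:
    write([record])
one_at_a_time = calls["n"]

# Batched: one call per 100-record batch.
calls["n"] = 0
batch_size = 100
for i in range(0, len(data), batch_size):
    write(data[i:i + batch_size])
batched = calls["n"]

print(one_at_a_time, batched)  # 1000 calls vs. 10 calls
```

The same fixed cost is paid 1,000 times in the first loop and only 10 times in the second, which is why batch pipelines process accumulated data in large chunks rather than record by record.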