Chapter 13. Continuous Integration Service
So far, we have covered building the transformation logic to implement insights and training ML models. In practice, ML model pipelines evolve continuously through changes to source schemas, feature logic, dependent datasets, data processing configurations, model algorithms, model features, and model configuration. These changes are made by teams of data users either to implement new product capabilities or to improve the accuracy of the models. In traditional software engineering, code is constantly updated, with multiple changes made daily across teams. To prepare for deploying ML models in production, this chapter covers the details of continuous integration of ML pipelines, similar to continuous integration in traditional software engineering.
There are multiple pain points associated with continuous integration of ML pipelines. The first is holistically tracking ML pipeline experiments involving data, code, and configuration. These experiments can be considered feature branches, with the distinction that the vast majority of these branches will never be integrated into the trunk. These experiments need to be tracked in order to pick the optimal configuration as well as for future debugging. Existing code-versioning tools such as GitHub track only code changes. There is neither a standard place to store the results of training experiments nor an easy way to compare one experiment to another. Second, to verify the changes, the ML pipeline needs to be packaged for deployment in a test environment. ...
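To make the first pain point concrete, the following sketch uses MLflow's tracking API as one possible way to record the data version, code version, hyperparameters, and resulting metrics of a training run in a single place. MLflow is an illustrative choice rather than a tool prescribed by this chapter, and the experiment name, dataset path, Git commit, and hyperparameters are hypothetical placeholders.

# Minimal sketch of holistic experiment tracking with MLflow (illustrative
# choice; any experiment tracker could play this role). The experiment name,
# dataset path, Git commit, and hyperparameters are hypothetical placeholders.
import mlflow
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("churn-model-experiments")   # hypothetical experiment name

X, y = load_breast_cancer(return_X_y=True)         # stand-in for the real dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # Record data and code versions alongside the model configuration so the
    # experiment can be reproduced and compared against other runs later.
    mlflow.log_param("dataset_version", "s3://feature-store/churn/v42")  # hypothetical
    mlflow.log_param("git_commit", "a1b2c3d")                            # hypothetical
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)

    model = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
    model.fit(X_train, y_train)

    # Metrics logged here land in the same central store as the parameters,
    # giving experiments a single place to be stored and compared.
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

Runs logged this way can then be listed and compared side by side (for example, through the MLflow UI or mlflow.search_runs()), which addresses the lack of a standard place to store and compare experiment results.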