Appendix A. Introduction to Infrastructure for Machine Learning
This appendix gives a brief introduction to some of the most useful infrastructure tools for machine learning: containers and the tools that build and orchestrate them, Docker and Kubernetes. While this may be the point at which you hand your pipeline over to a software engineering team, it’s useful for anyone building machine learning pipelines to have an awareness of these tools.
What Is a Container?
All Linux operating systems are based around a filesystem: the directory structure that includes all hard drives and partitions. From the root of this filesystem (denoted as /), you can access almost every aspect of a Linux system. Containers create a new, smaller root filesystem and use it as a “smaller Linux” within a bigger host machine. This lets each container have a complete set of libraries of its own, independent of the host. On top of that, containers let you limit resources such as CPU time or memory for each container.
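For instance, Docker exposes these resource limits directly as flags on `docker run`. A minimal sketch, in which the image name `my-ml-image` is a placeholder:

```
# Run a container capped at 1.5 CPU cores and 512 MB of memory.
# "my-ml-image" is a placeholder for an image you have built.
docker run --cpus="1.5" --memory="512m" my-ml-image
```

The kernel enforces these limits per container, so a runaway training job cannot starve other processes on the same host.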
Docker is a tool that wraps containers in a user-friendly API. With Docker, containers can be built, packaged, saved, and deployed multiple times. Docker also allows developers to build containers locally and publish them to a central registry, from which others can pull the container and run it immediately.
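That build-publish-pull cycle can be sketched with the standard Docker CLI commands below; the image name and registry address are placeholders:

```
# Build an image from the Dockerfile in the current directory.
docker build -t my-ml-image .

# Tag the image for a registry and publish it.
docker tag my-ml-image registry.example.com/team/my-ml-image:v1
docker push registry.example.com/team/my-ml-image:v1

# Anyone with access to the registry can now pull and run it.
docker pull registry.example.com/team/my-ml-image:v1
docker run registry.example.com/team/my-ml-image:v1
```

The tag (`:v1`) matters: pushing immutable, versioned tags rather than reusing `latest` is what makes a published container reproducibly runnable by others.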
Dependency management is a significant issue in machine learning and data science. Whether you are writing in R or Python, you are almost always dependent on third-party modules. These modules are updated frequently and may introduce breaking changes to your pipeline ...
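Containers address exactly this problem: by pinning dependencies inside an image, every run of the pipeline uses the same environment. A minimal Dockerfile sketch, in which the package versions and the `pipeline.py` script are illustrative:

```
# Use a fixed base image tag, not "latest", for reproducibility.
FROM python:3.10-slim

# Pin exact versions of third-party modules (illustrative versions).
RUN pip install --no-cache-dir scikit-learn==1.3.2 pandas==2.1.4

# Copy in the pipeline code and run it by default.
COPY pipeline.py /app/pipeline.py
WORKDIR /app
CMD ["python", "pipeline.py"]
```

Because the base image tag and package versions are fixed, rebuilding this image months later still produces the same environment, regardless of what has been released upstream in the meantime.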