Appendix A. Introduction to Infrastructure for Machine Learning

This appendix gives a brief introduction to some of the most useful infrastructure tools for machine learning: containers, as built and run with Docker or orchestrated with Kubernetes. Although this may be the point at which you hand your pipeline over to a software engineering team, anyone building machine learning pipelines benefits from being aware of these tools.

What Is a Container?

Every Linux operating system is organized around its filesystem: the directory structure that includes all hard drives and partitions. From the root of this filesystem (denoted as /), you can access almost all aspects of a Linux system. A container creates a new, smaller root and uses it as a “smaller Linux” within a bigger host. This gives each container its own separate set of libraries. On top of that, you can control how much of a resource such as CPU time or memory each container may use.
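To make the idea of a "smaller root" concrete, here is a minimal sketch in Python. It only models the path arithmetic: a path seen inside the container is mapped to a location underneath a directory on the host, and attempts to climb above the container's root with `..` stay inside it. The function name `to_host_path` and the directory `/var/lib/containers/demo` are hypothetical, symlinks are ignored, and real containers rely on the Linux kernel for this isolation rather than on code like this.

```python
from pathlib import PurePosixPath


def to_host_path(container_root: str, path_in_container: str) -> PurePosixPath:
    """Map a path as seen inside a container to its location on the host.

    Conceptual model only: it does pure path arithmetic and never
    touches the real filesystem.
    """
    root = PurePosixPath(container_root)
    path = PurePosixPath(path_in_container)

    # Drop the leading "/" so the path is interpreted relative to the
    # container's root instead of the host's root.
    parts = path.parts[1:] if path.anchor else path.parts

    resolved = []
    for part in parts:
        if part == "..":
            # Going "up" from the container's root stays at that root.
            if resolved:
                resolved.pop()
        else:
            resolved.append(part)
    return root.joinpath(*resolved)


# Hypothetical host directory that holds one container's files.
demo_root = "/var/lib/containers/demo"
print(to_host_path(demo_root, "/usr/lib/python3"))
print(to_host_path(demo_root, "/../../etc/passwd"))
```

Both printed paths start with `/var/lib/containers/demo`, which is the point of the model: from inside the container, `/` is the container's own root, so its libraries live apart from the host's.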

Docker is a user-friendly API that manages containers. With Docker, containers can be built, packaged, saved, and deployed multiple times. Developers can also build containers locally and publish them to a central registry, from which others can pull them and run them immediately.
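The build, publish, pull, and run cycle described above can be sketched as a sequence of Docker CLI commands. The Python helper below only assembles and prints those commands, so it runs without Docker installed. The helper name `docker_workflow`, the image name, and the registry host `registry.example.com` are hypothetical placeholders. The `--cpus` and `--memory` options on `docker run` illustrate the per-container resource limits mentioned earlier.

```python
import shlex

# Hypothetical image reference: registry host, repository, and tag.
IMAGE = "registry.example.com/ml-team/pipeline:0.1"


def docker_workflow(image, context_dir=".", cpus=1.0, memory="512m"):
    """Return the Docker CLI commands for a build/publish/pull/run cycle.

    The commands are returned as argument lists, not executed.
    """
    return [
        # Build an image from the Dockerfile in context_dir and tag it.
        ["docker", "build", "-t", image, context_dir],
        # Publish the tagged image to the registry named in the tag.
        ["docker", "push", image],
        # On another machine: download the published image.
        ["docker", "pull", image],
        # Run it with limits on CPU and memory for this container.
        ["docker", "run", "--rm", "--cpus", str(cpus), "--memory", memory, image],
    ]


for command in docker_workflow(IMAGE):
    print(shlex.join(command))
```

To execute the commands for real, each argument list could be passed to `subprocess.run(command, check=True)`, assuming Docker is installed and you are logged in to a registry you may push to.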

Dependency management is a big issue in machine learning and data science. Whether you are writing in R or Python, you’re almost always dependent on third-party modules. These modules are updated frequently and may cause breaking changes to your pipeline ...
