Appendix A. Introduction to Infrastructure for Machine Learning
This appendix gives a brief introduction to some of the most useful infrastructure tools for machine learning: containers, and the tools that build and orchestrate them, Docker and Kubernetes. While this may be the point at which you hand your pipeline over to a software engineering team, it’s useful for anyone building machine learning pipelines to have an awareness of these tools.
What Is a Container?
Every Linux operating system is built around a single filesystem: the directory tree that contains all hard drives and partitions. From the root of this filesystem (denoted as /), you can access almost every aspect of a Linux system. Containers create a new, smaller root and use it as a “smaller Linux” within a bigger host machine. This gives each container its own separate set of libraries. On top of that, containers let you limit resources such as CPU time and memory on a per-container basis.
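As a minimal sketch of those per-container resource limits, assuming Docker is installed and `my-ml-image` stands in for an image you have built (the image name is a placeholder, not something from this appendix):

```shell
# Run a container from a hypothetical image, capping its resources.
# --cpus limits how much CPU time the container may consume;
# --memory caps the RAM available inside the container.
docker run --cpus="1.5" --memory="512m" my-ml-image
```

If the process inside the container exceeds the memory cap, the kernel terminates it; the CPU limit, by contrast, only throttles.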
Docker is a user-friendly tool for managing containers. With Docker, containers can be built, packaged, saved, and deployed multiple times. It also allows developers to build containers locally and then publish them to a central registry, from which others can pull and immediately run them.
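That build-publish-pull workflow can be sketched with the standard Docker CLI; the registry hostname, repository name, and tag below are placeholders, not real endpoints:

```shell
# Build an image from the Dockerfile in the current directory and tag it.
docker build -t my-registry.example.com/ml-pipeline:0.1 .

# Publish the image to a central registry so others can use it.
docker push my-registry.example.com/ml-pipeline:0.1

# On another machine: pull the image and run it immediately.
docker pull my-registry.example.com/ml-pipeline:0.1
docker run my-registry.example.com/ml-pipeline:0.1
```

Because the image bundles the code together with its libraries, the `docker run` on the second machine needs no further setup.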
Dependency management is a big issue in machine learning and data science. Whether you are writing in R or Python, you’re almost always dependent on third-party modules. These modules are updated frequently and may cause breaking changes to your pipeline ...
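One common way containers help here is by freezing dependencies at build time. The following is a hedged sketch of a Dockerfile, assuming a Python pipeline; the file names `requirements.txt` and `pipeline.py` are illustrative placeholders:

```Dockerfile
# Pin the base Python version so every build starts from the same place.
FROM python:3.10-slim

WORKDIR /app

# requirements.txt lists exact versions, e.g. pandas==2.1.0,
# so a module update upstream cannot silently break the pipeline.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "pipeline.py"]
```

Anyone who runs this image gets the same interpreter and the same module versions, regardless of what is installed on their host.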