Chapter 2. Governing ML During the Development Stage

Data scientists are, first and foremost, scientists. Much like biology researchers in a lab, they conduct iterative, potentially long-running experiments to solve a problem. Researchers are almost always required to keep an auditable trail of their experiments, and the same applies to the data scientist.

ML development is the iterative cycle that data scientists go through to optimize model performance for a given problem. It involves data preparation, model training, and evaluation. This process differs from software development in one important respect: in software engineering, code is the primary concern, whereas in ML development, data is the primary determinant of success.

Code in an ML application is all about servicing the data rather than the application itself. Most ML microservices have a minimal amount of code but interface with a large number of data sources and services. At a high level, software engineering is code-driven while ML is artifact-driven (an “artifact” in this case being the ML model binary and/or data powering the model). This shift in priorities gives ML applications a different set of concerns than standard software applications.1
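To make the idea of an artifact-driven, auditable trail more concrete, the following is a minimal sketch rather than a prescribed implementation. It assumes that an experiment can be summarized by the exact bytes of its training data and model binary plus its parameters and metrics. The names `ExperimentRecord`, `fingerprint`, and `log_run` are hypothetical and chosen only for illustration, and the in-memory list stands in for whatever append-only store an organization actually uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest that identifies an artifact's exact contents."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ExperimentRecord:
    """One audit entry linking data and model artifacts to the results they produced."""
    run_id: str
    data_sha256: str
    model_sha256: str
    params: dict
    metrics: dict
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def log_run(run_id, data_bytes, model_bytes, params, metrics, trail):
    """Build a record for one experiment and append it to the trail as a JSON line."""
    record = ExperimentRecord(
        run_id=run_id,
        data_sha256=fingerprint(data_bytes),
        model_sha256=fingerprint(model_bytes),
        params=params,
        metrics=metrics,
    )
    # Hypothetical storage: a plain list stands in for an append-only audit log.
    trail.append(json.dumps(asdict(record), sort_keys=True))
    return record


# Example usage with placeholder bytes in place of a real dataset and model binary.
trail = []
record = log_run(
    run_id="run-001",
    data_bytes=b"feature,label\n1.0,0\n2.0,1\n",
    model_bytes=b"placeholder-model-binary",
    params={"learning_rate": 0.01},
    metrics={"accuracy": 0.9},
    trail=trail,
)
```

Hashing the artifacts rather than the code reflects the point above: two runs of identical code against different data are different experiments, and the record should show that.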

We mentioned earlier that governance at the development stage is not as difficult as it is at the delivery or operations stage. Since development is preproduction, organizations can generate significant value just by implementing a consistent framework for ML experiments, ...
