Putting the science back in data science
Best practices and scalable workflows for reproducible data science.
One of key tenets of science (physics, chemistry, etc.), or at least the theoretical ideal of science, is reproducibility. Truly “scientific” results should not be accepted by the community unless they can be clearly reproduced and have undergone a peer review process. Of course, things get messy in practice for both academic scientists and data scientists, and many workflows employed by data scientists are far from reproducible. These workflows may take the form of:
- A series of Jupyter notebooks with increasingly descriptive names, such as second_attempt_at_feature_selection_for_part2.ipynb
- Python or R scripts manually copied to a machine and run at periodic times via cron
- Fairly robust, but poorly understood, applications built by engineers based on specifications handed off to the engineers from data scientists
- Applications producing results that are nearly impossible to tie to specific states of one or more continuously changing input data sets
At the very best, the results generated by these sorts of workflows could be re-created by the person(s) directly involved with the project, but they are unlikely to be reproduced by anyone new to the project or by anyone charged with reviewing the project.
Reproducibility-related data science woes are being expressed throughout the community:
Data analysis is incredibly easy to get wrong, and it’s just as hard to know when you’re getting it right, which makes reproducible research all the more important!—Reproducibility is not just for researchers, Data School
Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it’s now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py or new_make_figures01.py to get things done.—Cookiecutter Data Science
Six months later, someone asks you a question you didn’t cover so you need to reproduce your analysis. But you can’t remember where the hell you saved the damn thing on your computer. If you’re a data scientist (especially the decision sciences/analysis focused kind), this has happened to you.—The Most Boring/Valuable Data Science Advice, by Justin Bozonier
The problem of reproducibility is one that data science teams within an organization will have to tackle at some point. However, there is good news! With a little bit of discipline and the right tooling, data science teams can achieve reproducibility. This post will discuss the value of reproducibility and will provide some practical steps toward achieving it.
Why should your data science team care about reproducibility?
One could argue that as long as your models and analyses produce “good” results, it doesn’t matter whether those results could be re-created. However, even small teams of data scientists will hit a wall if they neglect reproducibility. Reproducibility in data science shouldn’t be forsaken, regardless of the size your organization or the maturity of your company, because reproducibility is a precursor to:
Collaboration: Data science, and science in general for that matter, is a collaborative endeavor. No data scientist knows all relevant modeling techniques and analyses, and, even if they did, the size and complexity of the data-related problems in modern companies are almost always beyond the control of a single person. Thus, as a data scientist, you should always be concerned about how you share your results with your colleagues and how you collaborate on analyses/models. Specifically, you should share your work and deploy your products in a way that allows others to do exactly what you did, with the same data you used, to produce the same result. Otherwise, your team will not be able to capitalize on its collective knowledge, and advances within the team will only be advanced and understood by individuals.
Creativity: How do you know if a new model is performing better than an old model? How can you properly justify adding creative sophistication or complexity to analyses? Unfortunately, these questions are often addressed via one individual’s trial and error (e.g., in a notebook), which is lost forever after the decisions are made. If analyses are reproducible, however, data science teams can: (1) concretely determine how new analyses compare to old analyses because the old analyses can be exactly reproduced and the new analyses can be run against the known previous data; and (2) clearly see which analyses performed poorly in the past to avoid repeating mistakes.
Compliance: As more and more statistical, machine learning, and artificial intelligence applications make decisions that directly impact users, there will be more and more public pressure to explain and reproduce results. In fact, the EU is already demanding a “right to an explanation” for many algorithmically generated, user-impacting decisions. How could such an explanation be given or an audit trail be established without a clearly understood and reproducible workflow that let to the results?
How can a data science team achieve reproducibility?
Successfully enabling reproducibility will look slightly different for every organization because data science teams are tasked with such a wide variety of projects. However, implementing some combination of the following best practices, techniques, and tooling is likely to help move your workflows closer to reproducibility.
Strive for and celebrate simple, interpretable solutions
Deep learning is a prime example of powerful, yet often difficult to reproduce, analytical tools. Not all business problems require it, even though deep learning and other types of neural networks are clearly very powerful. Often a simple statistical aggregation (e.g., calculating a min or max) does wonders with respect to data-driven decision-making. In other cases, a linear regression or a decision tree might produce adequate, or even very good, predictions.
In these cases, the price paid in interpretability with more complicated modeling techniques might not be worth gains in precision or accuracy. The bottom line is that it is harder to ensure reproducibility for complicated data pipelines and modeling techniques, and reproducibility should be valued above using the latest and greatest models.
No reproducibility, no deployment
No matter what time crunch you are facing, it’s not worth putting a flaky implementation of an analysis into production. As data scientists, we are working to create a culture of data-driven decision-making. If your application breaks without an explanation (likely because you are unable to reproduce the results), people will lose confidence in your application and stop making decisions based on the results of your application. Even if you eventually fix it, that confidence is very, very hard to win back.
Data science teams should require reproducibility in the same way they require unit testing, linting, code versioning, and review. Without consistently producing results as good or better than known results for known data, analyses should never be passed on to deployment. This performance can be measured via techniques similar to integration testing. Further, if possible, models can be run in parallel on current data running through your systems for a side-by-side comparison with current production models.
You can orchestrate this sort of testing and measurement on your own, but you might consider taking advantage of something like LeVar. LeVar provides “a database for storing evaluation data sets and experiments run against them, so that over time you can keep track of how your methods are doing against static, known data inputs.”
Version your data
Even if you have code or Jupyter notebooks versioned, you simply can’t reproduce an analysis if you don’t run the code or notebook on the same data. This means that you need to have a plan and tooling in place to retrieve the state of both your analysis and your data at certain points in history. As time goes on, there are more and more options to enable data versioning, and they will be discussed below, but your team should settle on a plan for data versioning and stick to it. Data science prior to data versioning is a little bit like software engineering before Git.
Pachyderm is a tool I know well (disclosure: I work at Pachyderm) that allows you to commit data with versioning similar to committing code via Git. The Pachyderm file system is made up of “data repositories” into which you can commit data via files of any format.
For any manipulation of your data you can encapsulate that modification in a commit to Pachyderm. That means the operation is reversible and the new state is reproducible for you and your colleagues. Just as in Git, commits are immutable so you can always refer back to a previous state of your data.
Know your provenance
Actually, it’s not always enough to version your data. Your data comes with its own baggage. It was generated from a series of transformations and, thus, you likely need some understanding of the “provenance” of your data. Results without context are meaningless. At every step of your analysis, you need to understand where the data came from and how it reached its current state.
Tools like Pachyderm can help us out here as well, as a tool or a model for your own processes. Analyses that are run via Pachyderm, for example, automatically record provenance as they execute and it’s impossible for analysis to take input without those inputs becoming provenance for the output.
Write it down
Call it documentation if you want. In any event, your documentation should have a “lab notebook” spin on it that tells the story about how you came to the decisions that shaped your analysis. Every decision should have a documented motivation with an understanding of the costs associated with those decisions.
For example, you very well might need to normalize a variable in your analysis. Yet, when you normalize that variable, the numbers associated with that variable will lose their units and might not be as readable to others. Moreover, others building off of your analysis might assume certain units based on column names, etc.
Lab notebooks are so great. Without them, it’s genuinely really hard for a data scientist to pick up where he or she left off in an experimental project, even if it’s only been a day or two since she/he was working on it.
Ensuring reproducibility in your data science workflows can seem like a daunting task. However, following a few best practices and utilizing appropriate tooling can get you there. The effort will be well worth it in the end and will pay off with an environment of collaboration, creativity, and compliance.