Development Workflows for Data Scientists

Engineers learn in order to build, whereas scientists build in order to learn, according to Fred Brooks, author of the software development classic The Mythical Man-Month. It’s no mistake that the term “data science” includes the word “science.” In contrast with the work of engineers or software developers, the product of a data science project is not code; the product is useful insight.

“A data scientist has a very different relationship with code than a developer does,” says Drew Conway, CEO of Alluvium and a coauthor of Machine Learning for Hackers. Conway continues:

I look at code as a tool to go from the question I am interested in answering to having some insight. That code is more or less disposable. For developers, they are thinking about writing code to build into a larger system. They are thinking about how can I write something that can be reused?

However, data scientists often need to write code to arrive at useful insight, and that insight might be wrapped in code to make it easily consumable. As a result, data science teams have borrowed from software best practices to improve their own work. But which of those best practices are most relevant to data science? In what areas do data scientists need to develop new best practices? How have data science teams improved their workflows and what benefits have they seen? These are the questions this report addresses.

Many of the data scientists with whom I spoke said that software ...
