Chapter 4. Choosing the Technical Infrastructure

So your team is well placed in the organization and has the right people on it. You are working to create a strong culture of providing value, and you have methods for managing the tasks. But those tasks require your data scientists to do technical work, which means they need the tools to do so. That raises questions like what programming languages to use (R, Python, or something else); what kinds of databases to store information in; whether models should be deployed as APIs by the engineering team or as batch scripts run by the data science team; and more. A data science leader has to be heavily involved in choosing which technologies a team should use and deciding when it's time to switch between them.

How to Make Decisions on Technical Infrastructure

The process for making decisions about a team's technical infrastructure is as important as the decisions themselves. The leader of the data science team may make these decisions directly, or someone on the team, like the technical lead or a principal data scientist, may have the final call. With any of these sorts of decisions, there is a spectrum of options for how the decision is made (see Figure 4-1). On one side of the spectrum is authoritarianism, the idea that all decisions within the team are made solely by the person in charge. On the other side is anarchy, the idea that anyone on the team can make whatever decision they personally feel is best.

Figure 4-1. The decision-making spectrum on which each data science team falls.

Data scientists usually love making these decisions themselves, because they'll choose whatever they personally like best. In anarchic environments, each data scientist will choose precisely the tools they prefer, so you may end up with a data science team of five people and a code base written in six programming languages (two of them dead). Each data scientist will be individually happy with their tool set, but the data scientists won't be able to work together. If one code base uses Python on an AWS EC2 instance and another uses MATLAB on a Windows laptop, you'll be in trouble when data scientists switch projects or leave the team. But so long as each data scientist is only working on their own projects, they'll be happy enough.

On the other end, in authoritarian environments all decisions are made at the leadership level. The programming language, platform, architecture, and other technical decisions are all made by leaders who keep things uniform across the team. For example, everyone uses R, commits to a very specifically formatted GitHub repository, writes their code in the exact same style, and does projects in the same way. In these situations, changes in team membership cause few issues: everyone can work on everyone else's projects, and there is less infrastructure to maintain because everything is consistent.

The problem is that regardless of what team-wide tools the leader chooses, there will be situations where they don’t work well. Sometimes something is hard in R but easy in Python, or hard when you’re doing your data science on virtual machines but easy when you are using Docker. In those situations, if you don’t allow your data scientists to use the right tool for the situation, they could end up spending far more time and effort on the task at hand. Worse, they’ll quickly get demoralized and may leave the team. See Table 4-1 for some examples of technical decision making along this spectrum.

Table 4-1. Example scenarios of technical decision making

  • Extreme anarchy: Data scientists are given laptops and can use whatever tool set and programming language they want, so long as they finish the assigned work well.

  • Leans toward anarchy: Data scientists are allowed to use the programming languages and libraries they want on their laptops, but are encouraged to use tools that align with the existing cloud platform whenever possible.

  • Leans toward authoritarianism: Data scientists are required to use cloud infrastructure, Python, and a specific set of ML libraries. Other tooling can be used, but it first needs the approval of the technical lead.

  • Extreme authoritarianism: Data scientists may use only the exact tool set allowed by the technical leader, the manager, and DevOps.

Neither end of the spectrum, then, is a good place for a data science team to be. The healthy spot is somewhere in the middle, where the data science leader works in collaboration with the data scientists to ensure the right tools are being used for the job while minimizing the number of distinct tools and incompatible systems in play. Where exactly the best spot falls depends on the particular organization, its objectives, and its industry. For example, consulting firms doing many one-off projects should have fewer rigid structures, whereas heavily regulated industries like finance should have more guardrails.

It's easy to look at this spectrum as a data science leader and point to a place you want to be. And as the leader, it is easier for you to manage with more authority, so you'll likely be tempted to run your team as lightly authoritarian. But it's worth reflecting on the strengths and weaknesses of the particular data scientists on your team and giving them as much autonomy as you can.

Components of Data Science Team Infrastructure

Data science teams require lots of different infrastructure systems: some are explicitly built or purchased for them, like a database solution, while others are implicitly decided by the team, like setting up a Git repository with a particular programming language and set of libraries. This section covers several important areas of infrastructure to consider.

Storing the Data

Most data science teams don’t need to worry about where to store raw data because they get it from other parts of the organization. Other divisions create data with marketing information, sales and revenue, and data directly from the product, and engineers store it in databases for data scientists to use. A data science team may be able to influence some of the decisions on how the data is stored, like asking for certain columns to be included, but rarely are they responsible for it.

However, data science teams do need to store intermediate and output data: data that has been cleaned or produced by a model and is owned by the data science team itself. An example of intermediate data is a large data table after the string columns have been formatted, or newly engineered features that will be fed into a model. An example of output data is predictions from a model for each customer in a dataset. These sorts of datasets are tricky because they often don't have a fixed schema and can be very large. Most often the data science team doesn't own the data servers, so they may not be able to acquire a storage location themselves and will need another team's help.

In an ideal scenario, the data science team's intermediate and output data would be stored close to the input data. If all of the data lives in a single location, it's easy to join it together for further analysis and to keep track of changes. In practice, this is sometimes not possible; for instance, the input data may be production data that needs to live on secure servers that the data scientists cannot write to. In these situations, you'll have to set up a different location to store your data and create processes for managing it.

As the creators of the intermediate and output data, you'll be responsible for keeping track of it. This is where data governance practices matter. You'll want a structure for deciding what data to store and how to store it consistently, in a way that lets people in the future understand what was done. A full data engineering team might build a data warehouse or a data mart for their data, but since that isn't your primary focus, you likely won't need to go that far.

As a team leader, you need to put thought into the best way to store this data and how you'll keep track of it over time. If you don't think it through, the data might end up spread across many locations, such as multiple database servers, shared network drives, and file storage systems, and you'll be unable to keep track of it. Worse, over time other systems will start relying on this data (for example, processes that use customer predictions from a model to adjust email campaigns), and if the data isn't stored correctly, you'll incur tech debt trying to use it.
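As a concrete illustration, here is a minimal sketch of one such convention: writing a scoring run's output to a dated, model-versioned path so that future readers can trace which run produced which file. The bucket name, column names, and versioning scheme are all hypothetical, and writing directly to S3 with pandas assumes the s3fs and pyarrow packages are installed.

    from datetime import date

    import pandas as pd

    # Hypothetical output of a monthly customer-scoring run.
    predictions = pd.DataFrame(
        {"customer_id": [101, 102, 103], "predicted_value": [0.12, 0.87, 0.45]}
    )

    # One possible governance convention: bucket/dataset/model-version/date,
    # so anyone can later tell which model and which run produced the file.
    output_path = (
        "s3://example-ds-outputs/customer_value/"  # hypothetical bucket
        f"model_v2/{date.today():%Y-%m-%d}/predictions.parquet"
    )
    predictions.to_parquet(output_path, index=False)

The specific scheme matters less than having one: any convention that records the dataset, the model version, and the run date will let future systems rely on the data safely.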

A Workspace for Running Analyses and Training Models

The day-to-day job of a data scientist mostly involves cleaning data, running analyses, training models, and other forms of work that all happen in a single place. These data science workspaces take different forms, depending on the setup of the data science team:

Each data scientist works on a company-owned laptop

For many companies, the data scientists do their day-to-day work on a company-owned laptop using an IDE of their choice, like RStudio or JupyterLab. A data scientist will download data to this machine, do their work, then upload the results to a shared location. Laptops have the benefit of requiring almost no collaborative setup: each data scientist can independently install whatever they want on the machine and use that. The drawback is a total lack of standardization: anyone can do anything, so code that runs for one data scientist often won't run on a teammate's laptop. Because each machine is set up differently, more senior employees on the team often have to help junior ones when their unusual setups cause things to break. The data scientists are also limited by the hardware specifications of the machine, so if a particular analysis requires more than the laptop can provide, that analysis can't be done. There is also the security risk that laptops can be physically stolen.

Data scientists work on virtual machines in the cloud

Some data science teams improve on the first scenario by replacing laptops with virtual machines such as AWS EC2 instances or Google Cloud Platform VMs. This lets data scientists change the instance size when they have different hardware needs and removes the chance of a laptop being stolen. The downside is that there can still be a total lack of standardization across the machines, so just as with laptops, something that works on one virtual machine might not run on another. There is also the security risk of virtual machines being exposed outside the network: because each data scientist sets up their own machine, there is a decent chance one will be configured incorrectly.

Shared cloud workspace platforms

Recently, data science teams have been adopting cloud platforms tailored to data scientists. These platforms, like AWS SageMaker, Saturn Cloud, and Databricks, are meant to provide a single location where data scientists can do all of their work. By using a standard platform, data science code is more easily passed between teammates, less time is spent on setup and upkeep of the workspace, and code can often be deployed more easily. These platforms also carry fewer security risks because administrative tools for oversight are built in. Each platform has its own strengths and weaknesses, so if you are considering one, it's worth having your data scientists try them and see which they like.

Note that some data science teams have datasets so large they can't feasibly be analyzed on a single machine. In these situations, a separate technology has to be used to run the computations across a distributed cluster of machines, and the cluster must be connected to the data science workspace so the team can take results and analyze them further. Spark is a popular technology for these computations, and the Databricks platform has Spark built in. Dask is a more recent Python-based framework for distributed computing; it's built into the Saturn Cloud platform and can also be used through the service provided by Coiled. That said, most data science teams don't need distributed computing. Often datasets are small enough for a single machine, or you can run things on a single large virtual machine if needed. The overhead of maintaining a distributed system can be a large burden if your team doesn't need it.
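To give a feel for the pattern, here is a minimal Dask sketch; the dataset path and column names are hypothetical. Dask builds a lazy graph of pandas-like operations over partitions of the data and only runs the work, spread across the cluster, when .compute() is called.

    import dask.dataframe as dd

    # Read a dataset too large for one machine's memory; Dask splits it
    # into partitions that can be processed across the cluster.
    df = dd.read_parquet("s3://example-bucket/events/*.parquet")  # hypothetical path

    # Operations are lazy until .compute(), which executes the task graph.
    daily_counts = df.groupby("event_date")["user_id"].count().compute()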

Sharing Reports and Analyses

If your team is focused on using data to help drive business strategy, you'll be creating lots of reports and analyses. If your team is focused more on creating machine learning models, you'll still need analyses to inform which models to use and why. There is almost no situation where your team isn't creating information that needs to be saved and shared with others, so you'll want infrastructure to support that. You'll also want the ability to connect an analysis with the code that generated it in case you need to rerun it.

If you don’t explicitly choose a method for storing and sharing analyses, then your “infrastructure” will end up being whatever emails and Slack messages are used to share the information. This is very difficult to maintain in practice. While it’s easy to share results with others in this manner, there is almost no way to find an older analysis or trace the code that made it.

A more sophisticated approach is to create a shared location to save your analyses, such as an AWS S3 bucket, a Dropbox folder, or potentially a GitHub repository. In these approaches, the data science team has to be vigilant about enforcing a standard structure so that particular analyses can be found in the shared location and traced back to the code that made them. Ideally, the results should be visible to data scientists and non–data scientists alike. Tools like Dropbox folders are inherently easier for nontechnical people to navigate than an AWS S3 bucket or anything else that requires technical knowledge to view. Regardless of the approach, you'll still want data governance policies so that the analyses stay effectively organized.
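For instance, here is a minimal sketch of uploading an analysis to S3 under an enforced key structure, with the report traced back to its code via a Git commit hash stored as object metadata. The bucket name, key layout, and commit hash are all hypothetical.

    import boto3

    s3 = boto3.client("s3")

    # One possible convention: <project>/<date>/<file> keys, so an analysis
    # can be found later, plus the commit hash of the code that produced it.
    s3.upload_file(
        "report.html",
        "example-analysis-bucket",  # hypothetical bucket
        "churn-analysis/2023-06-01/report.html",
        ExtraArgs={"Metadata": {"git-commit": "abc1234"}},
    )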

Projects like Knowledge Repo, an open source tool from Airbnb, and RStudio Connect, a platform for sharing items like R Markdown reports and R Shiny dashboards, have been built to solve this problem. By providing a place where an analysis can be uploaded and viewed directly, with the code that made it stored alongside, these tools make data science teams far more capable of cataloging their work and keeping it maintained over time.

Deploying Code

If your data science team is creating code that is run regularly, either on a batch schedule or continuously as an API, then you'll need a platform the code can run on. There are generally two scenarios your team might fall into: in one, a supporting engineering team maintains the deployed code, and in the other, your data science team is all on its own:

You have the support of an engineering team

If your data science work is being built directly into a product, then you likely have an engineering team to help you out. The engineering team is in charge of connecting your models and work to the product; they are the ones who call your APIs or use the output of your batch scripts. Because of this, they almost always have their own platform set up for deploying all of the software engineering code, and the best thing to do is have the data science code merge into it. Your data science team doesn't have to worry about maintaining a platform but instead just needs to hand over Docker containers, Python libraries, or some other standard format that the code can be run from. Your team is, however, on the hook for making sure the code is up to standards, and as the leader of the team, you should be checking that the data scientists are adhering to those standards.
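As an illustration of the kind of handoff format this implies, here is a minimal sketch of a model wrapped as an HTTP API with Flask, which could then be packaged in a Docker container for the engineering team to deploy. The model artifact, feature format, and endpoint name are all hypothetical, and it assumes a scikit-learn-style model saved with joblib.

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical model artifact produced by the data science team.
    model = joblib.load("model.pkl")

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects JSON like {"features": [0.5, 1.2, 3.4]}.
        features = request.get_json()["features"]
        prediction = model.predict([features])[0]
        return jsonify({"prediction": float(prediction)})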

Your data science team is on its own

There are many data science teams that aren't directly connected to an engineering group, such as teams that generate insights for a business unit. There are still situations where teams like these may want to deploy code; for instance, they may want to score each customer once a month with a predicted future value. A number of products, including Algorithmia, RStudio Connect, and Saturn Cloud, provide platforms for data scientists to deploy models without being experts in engineering.

In either of these scenarios, you still want strong processes and infrastructure: systems to ensure your code is tested before being deployed, ways of monitoring whether the models are maintaining their accuracy, and so on. Setting these up requires a combination of data science and engineering expertise, and the effort required to make them work smoothly shouldn't be underestimated.
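As one small example of such a process, here is a sketch of a pre-deployment check that a CI system could run with pytest, failing the deploy if the model falls below an agreed accuracy floor. The artifact name, the load_validation_data helper, and the 0.80 threshold are all hypothetical.

    # test_model.py: run by CI (e.g., `pytest`) before each deploy.
    import joblib

    from validation import load_validation_data  # hypothetical helper

    def test_model_meets_accuracy_floor():
        model = joblib.load("model.pkl")  # hypothetical artifact
        X_val, y_val = load_validation_data()
        accuracy = (model.predict(X_val) == y_val).mean()
        assert accuracy >= 0.80, "Model is below the agreed accuracy floor"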

When Your Team Members Don’t Align with Your Infrastructure

Your infrastructure decisions will become more solidified as your team matures. More and more processes will be built around your particular databases and workspaces, and your team will become more comfortable with them. In general, this is a positive thing! It means your team is working through issues and becoming faster and more experienced. You may, however, find friction when it comes to bringing new people onto the team. When hiring, you'll need to decide how much of the infrastructure you use a candidate must already have experience with. This can be decided explicitly, by only considering resumes with the necessary skill set, or implicitly, by having interview questions that weed out people without that experience.

It has been the case for many years now (and will likely continue to be the case) that there are not many experienced data science candidates on the job market. While many people want to become data scientists, the number of those who have substantial experience and are actively looking for jobs is quite small. It may be that few people on the job market are experienced in some areas of your tech stack, and certainly no one will already be experienced with all of them.

The good news is that data scientists generally love to learn; it's a profession built around discovering new things. When hiring, if a candidate doesn't have experience with your particular tech stack, don't worry: they can learn on the job. As much as possible, relax your technical constraints on what candidates should know and allow as many substitutes as you can. For example, if your team uses Python but a candidate only knows R, mastering one language is a good indication they can learn another once they join. Further, by hiring from a more diverse set of technical backgrounds, you increase the chance that a new hire knows a better way of doing things than your team's current practices. It really is worth playing the long game here: hire people who will be great with a little ramp-up rather than only people who know everything on day one.

Where to Go Next

Having read through this report, you hopefully have thought about leading a data science team in new ways. While the report has covered many areas, we can summarize it with a few key concepts:

Thinking about how your team integrates and communicates is important

The success of a data science team often comes down to things like how well the stakeholders and team can work together with clear communication and how the goals of the data science team are integrated with the goals of the broader organization. A data science team leader’s job is to monitor this and tackle issues the moment they arise. A leader also needs to keep track of how communication happens within the data science team—between data scientists, between independent contributors and managers, and between data scientists and stakeholders.

A leader is responsible for ensuring that the data science work gets done regardless of how tricky it is

Data science teams have a constant flow of new tasks coming in, and each task can be risky because you don't know in advance whether you'll have the data or the signal to actually complete it. A leader needs to keep track of the work, prioritize it based on risk and importance, and ensure the data scientists are focused on finishing the work rather than getting distracted. That is a lot of distinct components to keep track of.

The technology powering your team matters, as does how you make the decisions around it

There are lots of technology decisions your team will have to make and many companies out there trying to sell you technology you don’t need. You’ll want your team to thoughtfully decide the right balance of distinct platforms to use in a way that leaves everyone happy. A leader will need to find the right balance of personally picking which technologies everyone is required to use versus letting each person on the team make their own decisions. Choosing the way you make decisions is as important as choosing the technology itself.

If you want more information and discussion around being a data science leader, here are a few resources:

  • How to Lead in Data Science by Jike Chong and Yue Cathy Chang (Manning) is a deeper dive into many of the topics discussed in this report.

  • For more general thoughts on engineering leadership, check out The Manager’s Path by Camille Fournier (O’Reilly).

  • Social media platforms like Twitter and LinkedIn often host great discussions by data science professionals and leaders about the challenges they face and the solutions they've found.

Best of luck on your continued journey as a data science leader!
