Beyond the Venn diagram

Identifying the essential skills for data scientists.

By Daniel Tunkelang
December 3, 2015
Detail from one of the shelves in Clowes chemist shop in Buxton. (source: Simon Harrod on Flickr)
Editor’s note: This is the first in a three-part series of posts by Daniel Tunkelang dedicated to data science as a profession. In this series, Tunkelang will cover the recruiting, organization, and essential functions of data science teams.

If you work at a company where the core asset is data, then you know how hard it is to hire data scientists. Indeed, we’ve heard for years about a shortage of data scientists. The last few years have seen a proliferation of data science bootcamp programs to address this shortage, some of them charging as much as $16,000 to transform students into data scientists in just a few months. Meanwhile, some of the top universities in the U.S. are offering master’s degrees in data science. But what skills are these programs teaching their students? More importantly, what skills should you be looking for when hiring data scientists?

The Data Science Venn Diagram

In 2010, Drew Conway published the Data Science Venn Diagram, which characterizes the skill set of data scientists as hacking skills, math and statistics knowledge, and domain knowledge (which he calls substantive expertise).

The Data Science Venn Diagram, courtesy of Drew Conway, used with permission.

This diagram has been a useful guide for countless data science managers, but it only goes so far. How much of each skill does a data scientist need to possess? Can data scientists be smart generalists who learn specific skills on the job? I’ve spent the last several years hiring and managing data scientists, and I’ll address these questions with some suggestions based on my experience.

Hacking skills matter, but don’t require skills that can be acquired quickly

Does your team work with Hadoop or Spark? Tableau or QlikView? If so, it’s tempting to add a line to your job description requiring experience with those particular tools. Don’t do it. Typically, new employees can achieve basic fluency in these tools within days or weeks, which is not a huge investment, and holding out for a candidate who already has that experience to show up in your hiring funnel may take twice that long.

Granted, not everyone acquires new skills at the same rate. It is important to exercise common sense in estimating how long it will take a candidate to pick up a new hacking skill, be it a tool, framework, or programming language. And you won’t always get it right, so be ready to have patience when your estimates prove too aggressive.

Also, don’t restrict your pool of candidates by requiring technical skills that can be acquired quickly. Instead, look for candidates whose technical skills are in the right ballpark, and whose track records demonstrate the ability and willingness to acquire new skills.

Math and statistics skills matter, but you’re not looking for Fields Medalists

Modeling and analyzing data requires a minimum level of mathematical literacy. And a job that includes determining the statistical significance of experimental results certainly requires a fundamental understanding of statistics. Data scientists need strong foundations in these areas. But, beyond those foundations, there are diminishing returns.
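To give a sense of the level of statistics in question, here is a minimal sketch of a significance check a data scientist might run on an experiment. It assumes a simple A/B test compared with a two-sided, two-proportion z-test; the function name and the conversion counts are hypothetical and chosen only for illustration.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 200 of 5,000 control users convert,
# versus 250 of 5,000 users who saw the new variant.
z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A candidate with solid foundations should be able to explain what the p-value here does and does not say, which matters more than whether they can derive the test from scratch.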

A data scientist doesn’t need to have an undergraduate degree in math or statistics, let alone a master’s or doctorate. Most data science is the application of basic theoretical knowledge to messy, real-world problems. That being said, make sure you hire people who know the basics. Someone who doesn’t understand Bayes’ Law or the central limit theorem is unlikely to succeed as a data scientist. But don’t try to hire the best mathematicians or statisticians as data scientists. Let them work on solving Millennium Prize Problems — or simply getting tenure.
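As an illustration of "the basics," the following sketch applies Bayes' Law (also known as Bayes' theorem) to a classifier that flags transactions. The rates are made-up numbers for the example, not figures from any real system, and the function name is hypothetical.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive signal) via Bayes' Law."""
    # Total probability of seeing a positive signal at all.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

# Hypothetical numbers: 1% of transactions are fraudulent, the model flags
# 95% of fraudulent ones, and it also flags 5% of legitimate ones.
p_fraud_given_flag = posterior(prior=0.01,
                               true_positive_rate=0.95,
                               false_positive_rate=0.05)
print(f"P(fraud | flagged) = {p_fraud_given_flag:.3f}")
```

Under these assumed rates, most flagged transactions are still legitimate because the base rate is so low. Someone who finds that result surprising probably needs more grounding before doing this work; someone who can explain it does not need a doctorate.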

The importance of domain knowledge is domain-dependent

In Conway’s Venn diagram, people who have hacking and math skills are deemed capable of doing machine learning. But data science requires an additional component of domain knowledge.

The importance of domain knowledge in data science is a matter of debate. Conway has argued that asking good questions is the most critical element in a data science project, and that the ability to ask good questions requires domain understanding. In contrast, KDD Cup and Kaggle competition winners often lack expertise in the domains associated with their data science projects.

So, how much does domain knowledge really matter? It depends on the domain. For fraud detection, data scientists can probably learn the requisite information on the job — many people do — and then develop a stronger understanding with experience. In contrast, it’s unlikely that a data scientist will make much progress on drug discovery without a substantial background in biology.

Domain knowledge is always nice to have, but it’s rarely a must-have, so be open to data scientists picking it up on the job.

Don’t look for unicorns

The core intersection of the Data Science Venn Diagram reflects a rare combination of competencies, and it’s hard to find people who have all three. Don’t make it even harder by requiring skills that aren’t necessary. You’ll just waste your time looking for unicorns. Instead, look beyond the Venn diagram. Focus on your candidates’ fundamental skills, their experience, and their ability to acquire new skills and knowledge on the job.

Hiring data scientists is hard, but not impossible. Good luck!
