# 2014 Data Science Salary Survey

Tools, trends, what pays (and what doesn't) for data professionals.

October 31, 2014

## Executive Summary

For the second year, O’Reilly Media conducted an anonymous survey to examine factors affecting the salaries of data analysts and engineers. We opened the survey to the public, and heard from over 800 respondents who work in and around the data space.

Among the major industries, the highest median salaries were in banking/finance ($117k) and software ($116k). Respondents from the entertainment industry actually reported the highest median salary ($135k), but this is likely an artifact of a small sample of only 20 people. Employees of larger companies reported higher salaries than those of smaller companies, and public companies and late startups had higher median salaries ($106k and $112k) than private companies ($90k) and early startups ($89k). The interquartile range for early startups was huge – $34k to $135k – so while many early-startup employees do make a fraction of what their counterparts at more established companies earn, others earn comparable salaries. Some of these patterns are revisited in the final section, where we present a regression model.

## Tool Analysis

Tool usage can indicate to what extent respondents embrace the latest developments in the data space. We find that use of newer, scalable tools often correlates with the highest salaries. Looking at Hadoop and RDBMS usage, we see a clear salary boost for the 30% of respondents who know Hadoop: a median salary of $118k for Hadoop users versus $88k for those who don't. RDBMS tools do matter – those who use both Hadoop and RDBMSs have higher salaries ($122k) – but not in isolation, as respondents who use RDBMSs but not Hadoop earn less ($93k).

On cloud computing, the sample was split fairly evenly: 52% did not use cloud computing or had only experimented with it, while the rest used it either for some of their needs (32%) or for most/all of their needs (16%). Notably, median salary rises with more intensive cloud use, from $85k among non-cloud users to $118k for the "most/all" cloud users. This discrepancy could arise because cloud users tend to use advanced Big Data tools, and Big Data tool users have higher salaries.
However, it is also possible that the power of these tools – and thus their correlation with high salary – derives in part from their compatibility with, or leveraging of, the cloud.

### Tool Use in Data Today

While this general information about data tools can be useful, practitioners might find it more valuable to look at a more detailed picture of the tools being used in data today. The survey presented respondents with eight lists of tools from different categories and asked them to select the ones they "use and are most important to their workflow." Tools were typically programming languages, databases, Hadoop distributions, visualization applications, business intelligence (BI) programs, operating systems, or statistical packages. Two exceptions were "Natural Language/Text Processing" and "Networks/Social Graph Processing," which are less tools than they are types of data analysis. One hundred and fourteen tools were present on the list, but over 200 more were manually entered in the "other" fields.

Just as in the previous year's salary survey, SQL was the most commonly used tool (aside from operating systems); even with the rapid influx of new data technology, there is no sign that SQL is going away. In comparing the Strata Salary Survey data from this year and last year, it is important to note two changes. First, the samples were very different: last year's data was collected from Strata conference attendees, while this year's was collected from the wider public. Second, the previous survey permitted only three tool selections per category; removing this restriction has dramatically boosted both tool usage rates and the number of tools a given respondent reports using. This year R and Python (just) trailed Excel; these four make up the top data tools, each used by over 50% of the sample.
Java and JavaScript followed with 32% and 29% shares, respectively, while MySQL was the most popular database, closely followed by Microsoft SQL Server. The most commonly used tool whose users' median salary surpassed $110k was Tableau (used by 25% of the sample), which also stands out among the top tools for its high cost. The common usage of Tableau may relate to the high median salaries of its users: companies that cannot afford to pay high salaries are likely less willing to pay for software with a high per-seat cost.

Further down the list we find tools corresponding to even higher median salaries, notably the open source Hadoop distributions and related frameworks/platforms such as Apache Hadoop, Hive, Pig, Cassandra, and Cloudera. Respondents using these newer, highly scalable tools are often the ones with the higher salaries.

Also in line with last year's data, the tools whose users tended to come from the lower end of the salary distribution were largely commercial tools such as SPSS and Oracle BI, and Microsoft products such as Excel, Windows, Microsoft SQL Server, Visual Basic, and C#. A change on the bottom-10 list has been the inclusion of two Google products: BigQuery/Fusion Tables and Chart Tools/Image API. The median salary of the 95 respondents who used one (or both) of these two tools was only $94k.

Note that "tool median salaries" – that is, the median salaries of users of a given tool – tend to be higher than the median salary figures quoted above for demographics. This is not a mistake: respondents who reported using many tools are overrepresented, since their salaries are counted once for every tool they use in the tool median salary chart. As it happens, the number of tools used by a respondent correlates sharply with salary, with a median salary of $82k for respondents using up to 10 tools, rising to $110k for those using 11 to 20 tools and $143k for those using more than 20.
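To make the overrepresentation point concrete, here is a small sketch with entirely made-up respondents (not survey records): computing a median per tool counts a multi-tool respondent's salary once for each tool, which pulls per-tool medians above the median over people.

```python
from statistics import median

# Hypothetical respondents: (salary, set of tools used). Toy data for illustration.
respondents = [
    (82_000, {"Excel"}),
    (95_000, {"Excel", "SQL"}),
    (120_000, {"SQL", "Python", "Hadoop"}),
    (143_000, {"Python", "Hadoop", "Spark"}),
]

# Median over people: each respondent counted exactly once.
overall_median = median(s for s, _ in respondents)

# Per-tool medians: a salary is counted once for EACH tool its owner uses,
# so high-earning, many-tool respondents appear in several tools' medians.
tools = {t for _, ts in respondents for t in ts}
tool_medians = {tool: median(s for s, ts in respondents if tool in ts)
                for tool in tools}
```

Here `overall_median` is $107.5k, while the median among the (higher-earning, many-tool) Hadoop users alone is $131.5k, mirroring the pattern described above.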

### Tool Correlations

In addition to looking at how tools relate to salary, we can also look at how they correlate with each other, which will help us develop predictor variables for the regression model. Tool correlations help us identify established ecosystems of tools: i.e., which tools are typically used in conjunction. There are many ways of defining clusters; we chose a strategy similar to last year's, but found more distinct clusters, largely due to the doubling of the sample size.

For cluster formation, only tools with over 35 users in the sample were considered. Tools in each cluster positively correlated (at the α = .01 level, using a chi-squared distribution) with at least one-third of the others, and no negative correlations were permitted between tools in a cluster. The one exception is SPSS, which clearly fits best into Cluster 1 (three of the five tools with which it correlates are in that group). SPSS was notable in that its users tended to use a very small number of tools.
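The significance criterion above can be sketched as a 2×2 chi-squared test on tool co-usage. The counts below are invented for illustration; only the test itself and the α = .01 critical value (≈6.635 for 1 degree of freedom) reflect the method described.

```python
def chi_squared_2x2(both, a_only, b_only, neither):
    """Chi-squared statistic for a 2x2 contingency table of two tools' usage."""
    table = [[both, a_only], [b_only, neither]]
    n = both + a_only + b_only + neither
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

CRITICAL_01 = 6.635  # chi-squared critical value, df = 1, alpha = .01

# Hypothetical counts for 800 respondents: 150 use both tools A and B,
# 100 use only A, 120 use only B, and 430 use neither.
stat = chi_squared_2x2(150, 100, 120, 430)
significant = stat > CRITICAL_01  # co-usage far exceeds chance here
```

Since 150 joint users is well above the ~84 expected under independence, the pair would count as positively correlated for cluster-membership purposes.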

The “Microsoft-Excel-SQL” cluster was more or less preserved (as “Cluster 1”), but the larger “Hadoop-Python-R” cluster was split into two parts. The larger of these, Cluster 2, is made up of Hadoop tools, Linux, and Java, while the other, Cluster 3, emphasizes coding analysis with tools such as R, Python, and Matlab. With a few tool omissions, it is possible to join Clusters 2 and 3 back into one, but the density of connections within each separately is significantly greater than the density if they are joined, and the division allows for more tools to be included in the clusters. Cluster 4, centered around Mac OS X, JavaScript, MySQL, and D3, is new this year. Finally, the smallest of the five is Cluster 5, composed of C, C++, Unix, and Perl. While these four tools correlated well with each other, none were exceedingly common in the sample, and of the five clusters this is probably the least informative.

The only tool with over 35 users that did not fit into a cluster was Tableau: it correlated well with Clusters 1 and 2, which made it even more of an outlier in that these two clusters had the highest density of negative correlations (i.e., when variable a increases, variable b decreases) between them. In fact, all of the 53 significant negative correlations between two tools were between one tool from Cluster 1 and another from Cluster 2 (35 negative correlations), 3 (6), or 4 (12).

Most respondents did not cleanly correspond to one of these tool categories: only 7% of respondents used tools exclusively from one of these groups, and over half used at least one tool from four or five of the clusters. The meaning behind the clusters is that if a respondent uses one tool from a cluster, the chance that she uses another from that cluster increases. Many respondents tended toward one or two of the clusters and used relatively few tools from the others.

#### Interpreting the clusters

To a certain extent it is easy to see why tools in each cluster would correlate with the others, but it is worth identifying features of the tools that appear more or less relevant in determining their assignment. Whether a tool is open source is perhaps the most important feature, dividing Cluster 1 from the others. Cluster 1 also contains Microsoft tools, although the producer of the tool does not necessarily determine cluster membership (MySQL and Oracle RDB are in different clusters).

The large number of tools in Cluster 2 is no anomaly: people working with Hadoop-like tools tend to use many of them. In fact, for tools such as EMR, Cassandra, Spark, and MapR, respondents who used each of these tools used an average of 18–19 tools in total. This is about double the average for users of some Cluster 1 tools (e.g., users of SPSS used an average of 9 tools, and users of Excel used an average of 10 tools). Some of the Cluster 2 tools complement each other to form a tool ecosystem: that is, these tools work best together, and might even require one another. From the perspective of individuals deciding which tools to learn next, the high salaries correlated with use of Cluster 2 tools is enticing, but it may be the case that not just one but several tools need to be learned to realize the benefits of such skills.

Other tools in Cluster 2 are not complements to each other, but alternatives: for example, MapR, Cassandra, Cloudera, and Amazon EMR. The fact that even these tools correlate could be an indication of the newness of Hadoop: individuals and companies have not necessarily settled on their choice of tools and are trying different combinations among the many available options. The community nature of the open source tools in Cluster 2 may provide another explanation for why alternative tools are often used by the same respondents. That community element, plus the single-purpose nature of many of the open source tools, contrasts Cluster 2 with the more mature, and vertically integrated, proprietary tools in Cluster 1.

Some similar patterns exist in Clusters 1 and 3 as well, though perhaps not to the same extreme. For example, R and Python, while often used together, are capable of doing many of the same things (stated differently, many – even most – data analysis tasks could be done entirely in either one). Nevertheless, these two correlate very strongly with one another. Similarly, business intelligence applications such as MicroStrategy, BusinessObjects, and Oracle BI correlate with each other, as do the statistical packages SAS and SPSS. In what is a relatively rare cross-cluster bond between Clusters 1 and 3, R and SAS also correlate positively. Whether SAS and R are complements or rivals depends on whom you ask: analysts often have a clear preference for one or the other, although there has been a recent push from SAS to allow integration between the two.

While such correlations of “rival” tools could partly be attributable to the division of labor in the data space (coding analysts versus big data engineers versus BI analysts), it is also a sign that data workers often try different tools with the same function. Some might feel that the small set of tools they work with is sufficient, but they should know that this makes them outliers – and given the aforementioned correlation between number of tools used and salary, this might have negative implications in terms of how much they earn.

## Regression Model of Total Salary

Continuing toward the goal of understanding how demographics, position, and tool use affect salary, we now turn to the regression model of total salary. Respondents earning more than $200k selected a "greater than $200k" choice, which is treated as $250k in the regression calculation; this might have been advisable even had we had exact salaries for the top earners (to mitigate the effects of extreme outliers), and it does not affect the median statistics reported earlier.

Earlier, we mentioned some one-variable comparisons, but there is an important difference between those observations and this model: before, there was no indication of whether a given discrepancy was attributable to the variable being compared or to another one that correlates with it, whereas here a variable's effect on salary can be understood with the phrase "holding other variables constant."

For each tool cluster, one variable was included among the potential predictors, with a value equal to the number of that cluster's tools used by a respondent. Demographic variables were given approximate ordinal values when appropriate; for several of these ordinal variables, the resulting coefficient should be understood as very approximate. For example, age was collected at 10-year intervals, so a linear coefficient for this variable might appear to predict the relation between age and salary at a much finer level than it actually can. Most variables that obviously overlapped with others were omitted: variables that repeat information, such as the total number of tools, overlap too much with the cluster tool-count variables, and the same goes for individual tool-usage variables.
One exception is position/role: the role percentages were kept in the pool of potential predictor variables, including one describing the percentage of a respondent's time spent as a manager (in fact, this was the only role variable kept in the final model). The respondent's overall position (non-manager, tech lead, manager, executive) clearly correlates with the manager role percentage, but both variables were kept, as they do seem to describe somewhat orthogonal features. While this may seem confusing, it is partly due to the difference in meaning between "manager" as a position or status and "manager" as a task or role component (e.g., executives also "manage").

From the 86 potential predictor variables, 27 were included in the final model. Variables were included in or excluded from the model on the basis of statistical significance: the final model was obtained through forward stepwise linear regression, with an acceptance error of .05 and a rejection error of .10. Alternative models found through various other methods were very similar (e.g., including one more industry variable) and not significantly superior in terms of predictive value. The adjusted R-squared was .58: that is, approximately 58% of the variation in salary is explained by the 27 coefficients.

| Variable (unit) | Coefficient in USD |
| --- | --- |
| (constant) | +$30,694 |
| Europe | –$24,104 |
| Asia | –$30,906 |
| California | +$25,785 |
| Mid-Atlantic | +$21,750 |
| Northeast | +$17,703 |
| Industry: education | –$30,036 |
| Industry: science and technology | –$17,294 |
| Industry: government | –$16,616 |
| Gender: female | –$13,167 |
| Age (per 1 year) | +$1,094 |
| Years working in data (per 1 year) | +$1,353 |
| Doctorate degree | +$11,130 |
| Position (per level) | +$10,299 |
| Portion of role as manager (per 1%) | +$326 |
| Company size (per 1 employee) | +$0.90 |
| Company age (per 1 year, up to ~30) | –$275 |
| Company type: early startup | –$17,318 |
| Cloud computing: no cloud use | –$12,994 |
| Cloud computing: experimenting | –$9,196 |
| Cluster 1 (per 1 tool) | –$1,112 |
| Cluster 2 (per 1 tool) | +$1,645 |
| Cluster 3 (per 1 tool) | +$1,900 |
| Bonus | +$17,457 |
| Stock options | +$21,290 |
| Stock ownership | +$14,709 |

The "level" units of position correspond to integers from 1 to 4: to find this variable's contribution to the estimated total salary, multiply $10,299 by 1 for non-managers, 2 for tech leads, 3 for managers, and 4 for executives.
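As a rough illustration of how a linear model like this is applied, the sketch below sums a subset of the published coefficients for a hypothetical respondent. The coefficient values come from the table above; the example respondent and the simplified encoding (raw age in years, position level 1–4) are invented, so the result should be read as a back-of-envelope estimate, not an official calculator.

```python
# Coefficients taken from the regression table above (USD).
COEFFICIENTS = {
    "constant": 30_694,
    "california": 25_785,
    "age_per_year": 1_094,
    "years_in_data_per_year": 1_353,
    "position_per_level": 10_299,   # 1 = non-manager ... 4 = executive
    "cluster2_per_tool": 1_645,     # Hadoop-family tools
    "bonus": 17_457,
}

def estimate_salary(features):
    """Linear model: constant plus coefficient * value for each predictor."""
    total = COEFFICIENTS["constant"]
    for name, value in features.items():
        total += COEFFICIENTS[name] * value
    return total

# Hypothetical respondent: a 30-year-old tech lead (level 2) in California,
# 5 years in data, using 4 Cluster 2 tools, and receiving a bonus.
estimate = estimate_salary({
    "california": 1,
    "age_per_year": 30,
    "years_in_data_per_year": 5,
    "position_per_level": 2,
    "cluster2_per_tool": 4,
    "bonus": 1,
})
```

Each additional Cluster 2 tool moves the estimate by +$1,645 while everything else is held constant, which is exactly the "holding other variables constant" interpretation discussed above.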

## Conclusion

This report highlights some trends in the data space that many who work in its core have been aware of for some time: Hadoop is on the rise; cloud-based data services are important; and those who know how to use the advanced, recently developed tools of Big Data typically earn high salaries. What might be new here is in the details: which tools specifically tend to be used together, and which correspond to the highest salaries (pay attention to Spark and Storm!); which other factors most clearly affect data science salaries, and by how much. Clearly the bulk of the variation is determined by factors not at all specific to data, such as geographical location or position in the company hierarchy, but there is significant room for movement based on specific data skills.

As always, some care should be taken in understanding what the survey sample is (in particular, that it was self-selected), although it seems unlikely that the bias in this sample would completely negate the value of patterns found in the data as industry indicators. If there is bias, it is likely in the direction of the O’Reilly audience: this means that use of new tools and of open source tools is probably higher in the sample than in the population of all data scientists or engineers.

For future research we would like to drill down into more detail about the actual roles, tasks, and goals of data scientists, data engineers, and other people operating in the data space. After all, an individual’s contribution – and thus his salary – is not just a function of demographics, level/position, and tool use, but also of what he actually does at his organization.

The most important ingredient in continuing to pass on valuable information is participation: we hope that whatever you get out of this report makes it worth your time to fill out the survey. The data space is one that changes quickly, and we hope that this annual report will help the reader stay on its cutting edge.