Factors that Influence Salary: The Regression Model

WE HAVE INCLUDED OUR FULL regression model in Appendix B. For this year’s report, we have made two important changes to the basic, parsimonious linear model we presented in the 2015 report. We have included: 1) external geographic data (GDP by US state and country), and 2) a square root transformation. The transformation adds one step to the linear model: we add up model coefficients, and then square the result. Both of these changes significantly improve the accuracy in salary estimates.

Our model explains about three-quarters of the variance in the sample salaries (with an R2 of 0.747). Roughly half of the salary variance is due to geography and experience. Given the important factors that can not be captured in the survey— for example, we don’t measure competence or evaluate the quality of respondents’ work output—it’s not surprising that a large amount of variance is left unexplained.

Impact of Geography

Geography has a huge impact on salary, but is not adequately captured due to sample size. For example, if a country is represented by only one or two respondents, this isn’t enough to justify giving the country its own coefficient. For this reason, we use broad regional coefficients (e.g., “Asia” or “Eastern Europe”), keeping in mind however that economic differences within a region are huge, and thus the accuracy of the model suffers.

To get around this problem, we’ve used publicly available records of per capita GDP of ...

Get 2016 Data Science Salary Survey now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.