EVERYONE HAS DIFFERENT TASKS, NEEDS, AND ROLES, but it is good to have a peek at what others are using, both to stay on top of new trends and to justify the tools you might already use.
When it comes to processing data, we see a mix of different operating systems in use. 67% of our respondents use Windows at some point in their work, 55% use Linux, and only 18% use Unix. macOS is used by around 46% of our respondents.
When it comes to mobile operating systems, only 2% are using iOS, and 2% are using Android for development.
When asked about programming languages, SQL was on top with 64% of our respondents saying they are using it. 63% are using Python, and 54% use R.
C++, C, and C# are used by 9%, 8%, and 7%, respectively.
Some programming languages certainly correlate with higher salaries than others. Visual Basic/VBA, for instance, is used by around 13% of our respondents, but its median salary is only $69,000, followed by C# at $78,000. At the other end, Perl is the language with the highest median salary at $109,000, but it was used by only 6% of our respondents.
When we look back at the responses from 2016, we can see which programming languages are gaining in adoption and which are declining. SQL dropped from 75% in 2016 to only 64% in 2017. Maybe more data scientists are using GUI tools or working on parts of the workflow other than data retrieval? The other big surprise is that Python jumped from 58% in 2016 to 63% this year. Bash saw a big jump as well, from only 26% of people using it in 2016 to 33% in 2017.
Databases are a big part of any data management work, so it is interesting to see which are popular and being used.
The top relational database—used by 37% of our respondents—is MySQL. It is followed by Microsoft's SQL Server, used by 30% of our respondents, with PostgreSQL close on their heels at 28%. Oracle takes the fourth spot, with 20%.
Then we get into the long tail of other databases, from SQLite at 12% to EMC/Greenplum with only 1% at the bottom of the table.
The top five spots all have a median salary between $83,000 and $96,000. It seems that knowing the most popular databases isn’t a great differentiator when it comes to salary.
Hadoop is the industry-standard way to run large MapReduce jobs on commodity hardware, and it has moved to cloud services, as well. 18% of respondents reported using Apache Hadoop (on their own infrastructure).
The most popular Hadoop distribution was Cloudera, with 12% of our respondents having used it. Following close behind was Amazon's Elastic MapReduce (EMR), with 10%. Hortonworks was the next most popular with 8%, and then MapR with 3%, IBM with 2%, and Oracle with 1%.
Even though some of these solutions drew small response rates, we need to take into consideration how many database instances each vendor has deployed in general. Also, some of these services are cloud-based, whereas others run in dedicated datacenters; that factor will also affect their popularity.
Search and retrieval are also important parts of data collection and storage, yet it seems that only a small share of our respondents are using these tools.
The most popular, Elasticsearch, is used by only 15% of our respondents, followed by Solr at 6%, and Lucene at 4%.
That leaves about 75% of our respondents not using any search tools in their daily workflow.
When the respondents were asked about various platforms used within their company, we saw a wide range of languages, architectures, and vendors.
Spark is the current favorite, with 27% of our respondents using it in some way. That is quickly followed by 18% using Hive, 13% using MongoDB, and 12% using Amazon Redshift.
After that, there is a long tail of other technologies, all below 10%.
When asked about spreadsheets, business intelligence (BI) tools, and reporting, there was no contest: two-thirds of the respondents said Microsoft Excel. Excel is a completely ubiquitous tool for reporting. The next closest competitor is Microsoft Power BI, with less than 10% of the respondents.
That’s a gap of more than 57 percentage points between the most popular and second most popular tool.
There is a long list of other BI tools, but they trail off in popularity pretty quickly. Some of these might be legacy tools; others might be exactly the right tool for the job. Just because only 3% of respondents use Oracle BI doesn’t mean it isn’t the perfect tool if you run Oracle Database.
Machine learning is a very hot topic. With more and more vendors entering into the arena and attempting to make it easier to use, we’ll see an explosion in what is considered machine learning as well as a very long tail of potential software packages.
Our respondents seem to have coalesced around a few popular software solutions, but the tail remains diverse. 37% of our respondents use Scikit-learn, and 16% use Spark MLlib. Given that 27% of our respondents use Spark in their big data platform, the showing for Spark MLlib makes sense.
The machine-learning platform H2O is used by 8%, the Java-based Weka by 7%, and then we drop to 4% and below for the rest of the options.
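Part of Scikit-learn's popularity likely comes from its uniform estimator API: every model exposes the same fit/predict interface. As a minimal illustrative sketch (the dataset and model choices here are our own, not from the survey), a full train-and-evaluate loop fits in a dozen lines:

```python
# Minimal scikit-learn sketch: split, fit, evaluate.
# Dataset (iris) and model (logistic regression) are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)  # any estimator has the same interface
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy: {acc:.2f}")
```

Swapping in a different model means changing one line, which keeps experimentation cheap.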
Our respondents were asked which data visualization tools they are using. There is a good mix of different tools, with no single one dominating the group. ggplot, which is available for R and Python and usable in Jupyter Notebooks, is used by 43% of our respondents. 34% have used Matplotlib, 32% Tableau, and 21% Shiny (another R tool).
These tools can serve different purposes. Using something like D3 means that you are focusing on interactive web output, whereas ggplot is geared more toward static graphics for screens and reports.
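A static report chart is the kind of output Matplotlib is typically used for. The sketch below (the data is simply the usage percentages quoted above; the filename is an arbitrary choice) shows the batch, file-oriented workflow:

```python
# Minimal Matplotlib sketch: render a static bar chart to a file for a report.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for batch rendering
import matplotlib.pyplot as plt

tools = ["ggplot", "Matplotlib", "Tableau", "Shiny"]
usage = [43, 34, 32, 21]  # percentages quoted in the text above

fig, ax = plt.subplots()
ax.bar(tools, usage)
ax.set_ylabel("Respondents using tool (%)")
ax.set_title("Data visualization tool usage")
fig.savefig("viz_tools.png")  # output filename is an arbitrary choice
```

A D3 version of the same chart would instead emit SVG into a web page, trading file output for interactivity.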
We asked our respondents about various tasks and whether they had major, minor, or no involvement in each. Looking just at where they report major involvement gives a good picture of what it means to be a data scientist.
67% of our respondents said they have major involvement in “basic exploratory data analysis.” 61% said they “conduct data analysis to answer research questions.” These are the most popular tasks, and they both deal directly with the datasets. No big surprise there.
The third most popular task was to “communicate findings to business decision-makers.” This is interesting, because beyond just crunching the data, this role is expected to be a communicator: finding the story in the data and exposing it to those in charge.
53% have major involvement in “data cleaning”: checking for outliers or missing data, reformatting values, and so on. This is also probably one of the longest and most tedious tasks, calling to mind the old quote often attributed to Abraham Lincoln: “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”
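The cleaning steps named above (missing data, outliers, reformatting values) can be sketched in a few lines of pandas. This is a hypothetical example; the column names and salary bounds are invented for illustration:

```python
# Pandas sketch of common cleaning steps: reformat values, drop missing
# data, filter outliers. Columns and thresholds are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "salary": ["69,000", "78,000", None, "9,999,999"],
    "language": ["VBA", "C#", "Perl", "Perl"],
})

# Reformat: strip thousands separators, then convert strings to numbers.
df["salary"] = pd.to_numeric(df["salary"].str.replace(",", "", regex=False))

# Missing data: drop rows where no salary was reported.
df = df.dropna(subset=["salary"])

# Outliers: keep only salaries in a plausible (assumed) range.
df = df[df["salary"].between(10_000, 1_000_000)]

print(df)
```

Each step is one line here, but on real survey data deciding *which* values are outliers or how to normalize formats is where the tedium lives.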
The rest of the tasks are all below 50% major involvement, but that’s not to say they aren’t important; rather, they are taken on by other members of the team or company. For instance, “create data visualizations” drew major involvement from only 47% of our respondents. This number could simply be a result of companies hiring a dedicated illustrator or design team, with raw data sent to them for processing.
Extract, transform, and load (ETL) is an important part of working with data, but according to this survey, only 30% of our respondents are working on ETL pipelines as one of their major tasks. Maybe this role is shifting to a dedicated person or a different team. It is worth watching this in the future.
The bottom three tasks were to “develop products that depend on real-time data analytics” at 18%; “use dashboards and spreadsheets (made by others) to make decisions” at 15%; and “develop hardware (or work on software projects that require expert knowledge of hardware)” at 4%. Although these tasks might be important for some data scientists, they do not seem to be central to the field.