Seedling (source: U.S. Department of Agriculture on Flickr)

Big data is often called one of the most important skill sets of the 21st century, and demand for it in the job market is enormous. Hiring data scientists and other big data professionals is difficult for large enterprises, leading many to train existing staff instead. Companies quickly realize, however, that identifying good candidates for that training is a challenge in itself. At The Data Incubator, a data science training and placement company, employers often ask us to help define the necessary skills and competencies for their workforce. These requests are complex, requiring consideration of tiered levels of expertise across different silos and the assessment of hundreds or thousands of individuals. While results vary by client and are highly multidimensional, we’ve found that outcomes often boil down to three simple competency axes that neatly summarize each individual’s performance.


Statistics

Big data is about studying large populations, but with such vast quantities of data, we cannot hope to understand how each individual behaves. We can only understand individuals in aggregate, by understanding how entire populations behave. This requires a strong background in descriptive analytics and statistics: knowing how to turn vague human questions into statistically testable hypotheses, and how to translate the results back into language that non-technical managers can understand.
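As a sketch of what turning a vague question into a testable hypothesis can look like, consider “Did the redesigned signup page convert better?” recast as a two-sided test of equal conversion rates. This minimal standard-library implementation uses a two-proportion z-test; the A/B numbers are hypothetical, chosen purely for illustration:

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for H0: the two underlying conversion rates are equal."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical A/B data: 120/1000 conversions on the old page, 150/1000 on the new
z, p = two_proportion_z_test(120, 1000, 150, 1000)
```

The translation back to a manager is the second half of the job: not “z = 1.96, p < 0.05,” but “the new page converts better, and a difference this large is unlikely to be chance.”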

While descriptive statistics may help explain what is happening, business needs often require knowing what will happen. Budding data scientists usually have a passing familiarity with predictive statistics and machine learning. Both subjects are littered with potential pitfalls: mastering them requires understanding concepts like hypothesis testing and significance thresholds (descriptive statistics) as well as overfitting and cross validation (predictive statistics). A proto data scientist has often heard of these fundamental concepts and is eager to master them. That preparation makes them ready for courses in more advanced machine learning topics such as gradient boosted trees, support vector machines, or deep neural networks.
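To make the overfitting/cross-validation pitfall concrete, here is a minimal standard-library sketch of k-fold cross validation: the data is split into k folds, and each fold takes a turn as held-out test data so the error estimate is never flattered by points the model was fit on. The toy data and the trivial “predict the training mean” model are assumptions for illustration only:

```python
import random
from statistics import mean

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Toy targets; the "model" just predicts the mean of its training data
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
fold_errors = []
for train, test in k_fold_splits(len(y), k=5):
    prediction = mean(y[i] for i in train)              # fit on train only
    mse = mean((y[i] - prediction) ** 2 for i in test)  # score on held-out data
    fold_errors.append(mse)
cv_error = mean(fold_errors)
```

Scoring only on held-out folds is the point: a model evaluated on its own training data can look arbitrarily good while generalizing arbitrarily badly.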

Software engineering

Of course, big data statistics are not computed by hand. A strong grasp of the fundamentals of computation and software engineering is a prerequisite for data science at scale. These challenges are often overlooked, particularly by data analysts practiced in working with small amounts of data who naively claim, “It’s the same formulas, just applied to more data.” Such critics fail to understand that big data breaks the implicitly assumed single-computer paradigm and moves us into the world of distributed computing, where new computational factors (paging from disk, network latency, intermittent errors, and concepts like semigroups and idempotency) become incredibly important.
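A small sketch shows why semigroups matter once work is spread across machines: if the per-shard results combine with an associative operation, the cluster can merge them in any order or grouping and still agree on the answer. Here is a hypothetical word count in the MapReduce style, using Python’s `Counter` (whose addition is associative) as the combine step; the shard contents are made up for illustration:

```python
from collections import Counter
from functools import reduce

def count_words(partition):
    """'Map' step: count words within one partition (one machine's shard)."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

# Each shard is processed independently, as it would be on separate nodes
shards = [["big data big"], ["data science"], ["big science wins"]]
partials = [count_words(shard) for shard in shards]

# Counter addition is associative, so the partial counts form a semigroup:
# any grouping of the merges yields the same total
total = reduce(lambda a, b: a + b, partials)
left = (partials[0] + partials[1]) + partials[2]
right = partials[0] + (partials[1] + partials[2])
```

Idempotency matters for the same reason: when a node fails mid-job and its work is retried, a reduce step that can safely absorb a duplicate result keeps the final answer correct.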

Aspiring data scientists typically run into these computational challenges the first time they handle big data. Even if they have not mastered all the computational nuances, they are keenly aware that the challenge goes well beyond “the same formulas applied to more data,” and they are eager to learn the distributed computing tools (like MapReduce, Hive, or Spark) needed to handle it. Once they understand the need for these core skills, they begin to see how good software engineering practices (version control, dependency management, modularization, and so on) affect their productivity and effectiveness as they start productionizing and productizing their work.


Communication

Knowing how to compute statistics with sophisticated software engineering is not enough: big data professionals have to communicate the results of their analyses to their non-technical counterparts. A large part of the excitement around big data is its ability to transform business, but for those results to be useful, they have to be shared broadly within the organization. Data scientists need to master two languages: an internal technical language that emphasizes precision and accuracy, and an external non-technical language that emphasizes explicability and relatability. The ability to translate between these two languages and mindsets lies at the heart of what they do.

One of the key communication skills for data professionals is data visualization. While budding big data talent may not have mastered the full range of advanced visualization techniques, they do understand the importance of conveying data visually and presenting their results clearly. They’re often eager to learn about tools like d3.js and Bokeh that can dynamically and visually convey complex statistical or data-driven results in a simple, easily digestible manner, and they understand that communication is as much about style as it is about substance.

Data science sits at the intersection of statistics, software engineering, and communication. Fully fledged data scientists have mastered all three skills, any one of which is hard to master on its own. Rather than search for “the unicorn” who already possesses all of them, managers should identify employees with a solid foundation in each of these categories (particularly the technical ones) and then train them on the skills they are missing. Employees with a strong base in statistics, software engineering, and communication are the ideal candidates to train up into data scientists.

