15Data Science, Random Numbers, and Statistics

Data science is a relatively new and evolving interdisciplinary field that sits at the intersection between computer science, software engineering, and statistics. As with many evolving fields, the term data science means different things to different people. One somewhat glib but relatively accurate definition of a data scientist is someone who knows more about programming than a statistician and more about statistics than a programmer.

The need for people with these skills has been driven in large part by big data. Big data is another generally ill-defined term, but here we refer to collections of data that are too large to be effectively analyzed or understood by traditional methods, recognizing that people may differ in what they consider large and what constitutes a traditional method. Big data is made possible by increasingly inexpensive and ubiquitous computers and digital devices for collecting data, networks for assembling and moving it, and storage for maintaining it.

Analysis of data has traditionally been performed by statisticians. Although nearly all statisticians now perform their calculations on computers, historically the focus has been on data sets of a size that can be reasonably collected by a small team of researchers. Typically this would be hundreds to at most tens of thousands of records with tens to hundreds of variables. Data on this scale can generally be organized and curated using semi-manual techniques ...

Get Programming Interviews Exposed, 4th Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.