38 R and Big Data
In the rest of this book we implicitly assume that the data can be loaded into memory. However, imagine a dataset of all the data that you have transmitted over your mobile phone: messages and images, complete with the timestamps and locations that result from being connected. Now imagine that we have this dataset for 50 million customers. Such a dataset would be so large that it becomes impractical, or even impossible, to store on one computer. In that case, it is fair to speak of “big data.”
Usually, the academic definition of big data implies that the data has to be big in terms of
- velocity: the speed at which the data comes in,
- variety: the number of columns and formats,
- veracity: the reliability or data quality, and
- volume: the amount of data.
Commercial institutions will add “value” as a fifth word that starts with the letter “v.” While this definition of “big data” has its merits, we opt instead for a very practical approach: we consider our data to be “big” if it is no longer practically possible to store all the data on one machine and/or to use all processing units (PUs) of that one machine to do all calculations (such as calculating a mean or fitting a neural network).
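To make this concrete, below is a minimal sketch in base R of how such a split-up computation could look: the global mean is assembled from per-chunk sums and counts, so no single process ever has to hold the full dataset. The file names and the column `value` are hypothetical placeholders for data that, in a real setting, would be spread over several disks or machines.

```r
library(parallel)

# Hypothetical chunks of the full dataset; in practice these would
# live on different disks or different machines.
chunk_files <- c("chunk1.csv", "chunk2.csv", "chunk3.csv")

# For each chunk, return only the partial sum and the row count,
# never the raw data itself.
# (mclapply forks processes and does not work on Windows;
#  there, parLapply with a cluster is the usual alternative.)
partials <- mclapply(chunk_files, function(f) {
  x <- read.csv(f)$value   # assumes a numeric column "value"
  c(sum = sum(x), n = length(x))
}, mc.cores = detectCores())

# Combine the partial results into the global mean.
totals      <- Reduce(`+`, partials)
global_mean <- unname(totals["sum"] / totals["n"])
```

The reason this works is that a mean decomposes into a sum and a count, both of which can be computed per chunk and added up afterwards; statistics such as the median do not decompose this way, which is why they are much harder to compute on distributed data.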