3 Data and Statistics
Data have no meaning in themselves; they are meaningful only in relation to a conceptual model of the phenomenon studied.
G. Box, W. Hunter, and J. Hunter1
Synopsis
This chapter covers the principles and techniques of data collection, handling, and manipulation needed for a variety of machine learning (ML) investigations. It also reviews the fundamentals of statistics, serving as a refresher on some of the essential principles behind statistical analyses. First, we define data and lay out strategies for visualizing and plotting the different types of data we are likely to encounter in our field. Then, we spend some time exploring and diagnosing data, presenting and showcasing a series of methods for doing so. I will also share techniques for carrying out fundamental2 data analysis and visualization using Excel, Python, R, and Exploratory.
3.1 Data and Data Science
We have previously defined data as pieces (or units) of information, facts, quantities, and statistics collected for the purpose of reference or analysis. Data is a central element of ML,3 especially ML that is data-driven (or driven by the data4). A unit of data has two main components: an explanatory portion (a numerical value or a category) and a label describing that unit. Numerical data are made of numbers, and categorical data are made of categories or classes. We also have footage data, audio data, video data, text data, time series ...
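To make the distinction between numerical and categorical data concrete, here is a minimal Python sketch. The records, the column names (`span_m`, `material`, `failed`), and the helper `split_by_type` are all hypothetical and invented for illustration; they do not come from a real dataset. The sketch treats one column as the label and sorts the remaining explanatory columns by whether their values are numbers or categories.

```python
from numbers import Number

# Hypothetical observations (values invented purely for illustration).
records = [
    {"span_m": 4.0, "material": "steel", "failed": "no"},
    {"span_m": 6.5, "material": "concrete", "failed": "yes"},
    {"span_m": 5.2, "material": "timber", "failed": "no"},
]

LABEL = "failed"  # assumed name of the label column


def split_by_type(rows, label):
    """Group explanatory columns into numerical and categorical ones."""
    numerical, categorical = [], []
    for column in rows[0]:
        if column == label:
            continue  # the label is not an explanatory column
        values = [row[column] for row in rows]
        # bool is a subclass of int, so exclude it from the numeric check.
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
            numerical.append(column)
        else:
            categorical.append(column)
    return numerical, categorical


numerical_cols, categorical_cols = split_by_type(records, LABEL)
print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

In practice, a library such as pandas would usually infer column types for us; the plain-Python version is used here only to keep the idea visible.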