CHAPTER 12 Basic Overview of Statistics

This chapter presents several tools for exploring the data at hand and forming an initial understanding of its content. The tools can be used individually or as an entire suite. It is good practice to analyse the data before constructing a machine learning model. The tools presented in this chapter are a combination of statistics, tests and visualisation hints.

12.1 HISTOGRAM

A histogram is a representation of the data distribution, a visualisation tool for the data points in our sample. Theoretically, it represents an estimate of the probability density function of a continuous variable. It is, in effect, an empirical counterpart of that density, built from the observed sample. In this chapter, we treat histograms as a visual tool which allows us to see and investigate the structure of our data.

A histogram function takes two inputs, the data and the bins (or buckets), and returns the number of data points per bin. The bins define a scheme which splits the range of the data into a list of non-overlapping sub-intervals. Although the bins need not be of equal width, in this section we will only consider equidistant intervals.
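The counting step described above can be sketched as follows. This is a minimal illustration in Python, not the book's implementation; the function name `histogram_counts` and the choice of half-open bins with the maximum assigned to the last bin are assumptions made for the example.

```python
def histogram_counts(data, n_bins):
    """Count data points in n_bins equal-width, non-overlapping bins.

    Hypothetical helper for illustration: bins are half-open
    [edge_i, edge_{i+1}), except that the maximum falls in the last bin.
    """
    if not data or n_bins < 1:
        raise ValueError("need at least one data point and one bin")
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        if width == 0:
            idx = 0  # all values identical: everything goes in the first bin
        else:
            # Clamp so that x == hi lands in the last bin, not past it.
            idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


sample = [1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0]
edges, counts = histogram_counts(sample, 4)
print(edges)   # bin boundaries
print(counts)  # number of points per bin
```

The counts always sum to the sample size, since every point is assigned to exactly one bin.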

It is convenient to normalise the histogram so that it represents a probability density function, where every bar captures the relative frequency of its bin. In such a case, the histogram can be regarded as a simple kernel density estimator. Using some form of polynomials or splines, the kernel can be further enhanced to obtain ...
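As a rough illustration of the normalisation step, the sketch below converts raw bin counts into relative frequencies and into density heights whose total area is one. It is written in Python for illustration only; the function name `normalise_histogram` and the sample counts and edges are assumptions, not taken from the text.

```python
def normalise_histogram(counts, edges):
    """Return (relative frequencies, density heights) for binned counts.

    Hypothetical helper: density_i = count_i / (n * width_i), so that
    the bar areas (height times width) sum to one.
    """
    n = sum(counts)
    if n == 0:
        raise ValueError("histogram is empty")
    rel_freq = [c / n for c in counts]
    density = [
        c / (n * (edges[i + 1] - edges[i]))
        for i, c in enumerate(counts)
    ]
    return rel_freq, density


# Illustrative values: four bins of width 0.5 holding seven points.
counts = [1, 2, 1, 3]
edges = [0.0, 0.5, 1.0, 1.5, 2.0]
rel_freq, density = normalise_histogram(counts, edges)
area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(density))
print(rel_freq)
print(density)
print(area)
```

With equidistant bins the two normalisations differ only by the constant bin width, so the shape of the plot is the same either way.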
