Chapter 49. In Depth: Kernel Density Estimation
In Chapter 48 we covered Gaussian mixture models, which are a kind of hybrid between a clustering estimator and a density estimator. Recall that a density estimator is an algorithm that takes a D-dimensional dataset and produces an estimate of the D-dimensional probability distribution that the data is drawn from. The GMM algorithm accomplishes this by representing the density as a weighted sum of Gaussian distributions. Kernel density estimation (KDE) is in some sense an algorithm that takes the mixture-of-Gaussians idea to its logical extreme: it uses a mixture consisting of one Gaussian component per point, resulting in an essentially nonparametric estimator of density. In this chapter, we will explore the motivation and uses of KDE.
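Concretely, given one-dimensional observations $x_1, \dots, x_N$, the standard Gaussian KDE estimate of the density is

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi h^2}} \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right),$$

where $h$ is a smoothing bandwidth: one normalized Gaussian is centered on each data point, and averaging them yields an estimate that integrates to 1.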
We begin with the standard imports:
In [1]: %matplotlib inline
        import matplotlib.pyplot as plt
        plt.style.use('seaborn-whitegrid')
        import numpy as np
Motivating Kernel Density Estimation: Histograms
As mentioned previously, a density estimator is an algorithm that seeks to model the probability distribution that generated a dataset. For one-dimensional data, you are probably already familiar with one simple density estimator: the histogram. A histogram divides the data into discrete bins, counts the number of points that fall in each bin, and then visualizes the results in an intuitive manner.
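As a minimal sketch of that bin-count-and-normalize step (the sample data and variable names here are illustrative, not part of the chapter's running example):

import numpy as np

# Illustrative data: 500 draws from a standard normal distribution
rng = np.random.default_rng(42)
data = rng.standard_normal(500)

# Bin the data and count how many points fall in each bin
counts, edges = np.histogram(data, bins=10)

# Dividing by (total count * bin width) rescales the counts so the
# resulting step function integrates to 1, i.e., a crude density estimate
width = edges[1] - edges[0]
density = counts / (counts.sum() * width)
print(density.sum() * width)  # -> 1.0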
For example, let’s create some data that is drawn from two normal distributions:
In [2]: def make_data(N, f=0.3, rseed=1):
            # Mixture of two normals: a fraction f stays near 0,
            # the remainder is shifted to be centered on 5
            rand = np.random.RandomState(rseed)
            x = rand.randn(N)
            x[int(f * N):] += 5
            return x

        x = make_data(1000)
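With this dataset in hand, a normalized histogram gives a first, crude estimate of the underlying density. A minimal sketch of the plotting step (density=True rescales the bin heights so the histogram's total area is 1):

hist = plt.hist(x, bins=30, density=True)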