Chapter 48. In Depth: Gaussian Mixture Models
The k-means clustering model explored in the previous chapter is simple and relatively easy to understand, but its simplicity leads to practical challenges in its application. In particular, the nonprobabilistic nature of k-means and its use of simple distance from cluster center to assign cluster membership lead to poor performance in many real-world situations. In this chapter we will take a look at Gaussian mixture models, which can be viewed as an extension of the ideas behind k-means, but can also be a powerful tool for estimation beyond simple clustering.
We begin with the standard imports:
In[1]: %matplotlib inline
       import matplotlib.pyplot as plt
       plt.style.use('seaborn-whitegrid')
       import numpy as np
Motivating Gaussian Mixtures: Weaknesses of k-Means
Let’s take a look at some of the weaknesses of k-means and think about how we might improve the cluster model. As we saw in the previous chapter, given simple, well-separated data, k-means finds suitable clustering results.
For example, if we have simple blobs of data, the k-means algorithm can quickly label those clusters in a way that closely matches what we might do by eye (see Figure 48-1).
In[2]: # Generate some data
       from sklearn.datasets import make_blobs
       X, y_true = make_blobs(n_samples=400, centers=4,
                              cluster_std=0.60, random_state=0)
       X = X[:, ::-1]  # flip axes for better plotting
In[3]: # Plot the data with k-means labels
       from sklearn.cluster import KMeans
       kmeans = KMeans(n_clusters=4, random_state=0)
       labels = kmeans.fit(X).predict(X)
       plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis');
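For contrast with these hard k-means labels, the probabilistic model this chapter builds toward can also report how confident each assignment is. The following is a minimal illustrative sketch (not a cell from the chapter) using scikit-learn's GaussianMixture on the same data: predict returns hard labels much like k-means, while predict_proba returns per-point membership probabilities.

       # Illustrative sketch: soft cluster assignments with a Gaussian mixture
       from sklearn.mixture import GaussianMixture

       gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
       gmm_labels = gmm.predict(X)   # hard labels, comparable to k-means
       probs = gmm.predict_proba(X)  # shape (400, 4); each row sums to 1
       print(probs[:5].round(3))     # membership probabilities for 5 points

Points near a cluster center get a probability close to 1 for that component, while points that fall between clusters get split probabilities, which is exactly the kind of uncertainty information k-means cannot provide.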