Chapter 12. Classification

Classification might be the best-known application of Bayesian methods; it became famous in the 1990s as the basis of the first generation of spam filters.

In this chapter, I’ll demonstrate Bayesian classification using data collected and made available by Dr. Kristen Gorman at the Palmer Long-Term Ecological Research Station in Antarctica (see Gorman, Williams, and Fraser, “Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis)”, March 2014). We’ll use this data to classify penguins by species.

Penguin Data

I’ll use pandas to load the data into a DataFrame:

import pandas as pd

df = pd.read_csv('penguins_raw.csv')
df.shape

(344, 17)

The dataset contains one row for each penguin and one column for each variable.

Three species of penguins are represented in the dataset: Adélie, Chinstrap and Gentoo.

The measurements we’ll use are:

Body Mass in grams (g).
Flipper Length in millimeters (mm).
Culmen Length in millimeters.
Culmen Depth in millimeters.

If you are not familiar with the word “culmen”, it refers to the top margin of the beak.
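As a minimal sketch, the four measurements can be pulled out of the DataFrame by column name. The names used here (`Species`, `Body Mass (g)`, and so on) are assumptions about how the raw file labels its columns, and `select_measurements` is a hypothetical helper, not a function from the book. The small `toy` frame is made up for illustration and does not contain real penguin data.

```python
import pandas as pd

# Assumed column names; check them against df.columns before relying on them.
SPECIES_COL = 'Species'
MEASUREMENT_COLS = [
    'Body Mass (g)',
    'Flipper Length (mm)',
    'Culmen Length (mm)',
    'Culmen Depth (mm)',
]

def select_measurements(df):
    """Keep the species label and the four measurements, dropping incomplete rows."""
    subset = df[[SPECIES_COL] + MEASUREMENT_COLS]
    return subset.dropna(subset=MEASUREMENT_COLS).reset_index(drop=True)

# Made-up values, for illustration only.
toy = pd.DataFrame({
    'Species': ['A', 'A', 'B'],
    'Body Mass (g)': [3700.0, None, 5000.0],
    'Flipper Length (mm)': [190.0, 185.0, 217.0],
    'Culmen Length (mm)': [39.0, 38.0, 47.0],
    'Culmen Depth (mm)': [18.0, 17.5, 15.0],
    'Comments': ['x', 'y', 'z'],
})

clean = select_measurements(toy)
print(clean.shape)
```

With the real data, the same call would be `select_measurements(df)`, provided the column names match.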

These measurements will be most useful for classification if there are substantial differences between species and small variation within species. To see whether that is true, and to what degree, I’ll plot cumulative distribution functions (CDFs) of each measurement for each species.
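To make the idea concrete, here is one way to compute an empirical CDF of a measurement for each species using NumPy and pandas. This is a sketch under stated assumptions, not the book's own function: `empirical_cdf` and `cdfs_by_species` are hypothetical names, the `Species` and `Flipper Length (mm)` column names are assumed, and the `toy` data is invented for illustration.

```python
import numpy as np
import pandas as pd

def empirical_cdf(values):
    """Return sorted values and, for each, the fraction of the data <= that value."""
    xs = np.sort(np.asarray(values, dtype=float))
    xs = xs[~np.isnan(xs)]                      # ignore missing measurements
    ps = np.arange(1, len(xs) + 1) / len(xs)    # cumulative probabilities
    return xs, ps

def cdfs_by_species(df, colname, by='Species'):
    """Map each species name to the empirical CDF of one column."""
    return {name: empirical_cdf(group[colname])
            for name, group in df.groupby(by)}

# Made-up values, for illustration only.
toy = pd.DataFrame({
    'Species': ['A', 'A', 'A', 'B', 'B'],
    'Flipper Length (mm)': [180.0, 190.0, 185.0, 215.0, 220.0],
})

cdfs = cdfs_by_species(toy, 'Flipper Length (mm)')
for name, (xs, ps) in cdfs.items():
    print(name, xs, ps)
```

Each `(xs, ps)` pair can be passed to a plotting call such as matplotlib's `plt.plot(xs, ps, label=name)`. When the curves for two species barely overlap along the x-axis, as in this toy example, the measurement separates those species well.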

The following function takes the DataFrame and a column name. It ...


Think Bayes, 2nd Edition by Allen B. Downey
