Exploring data with Pandas

Learn to use Pandas to explore a dataset looking for anything that could form the basis for an interesting visualization.

By Kyran Dale
July 26, 2016
Closeup of the London Science Museum's difference engine Closeup of the London Science Museum's difference engine (source: Carsten Ullrich, CC-BY-SA-2.5)

In the previous chapter, we cleaned the Nobel Prize dataset that we scraped from Wikipedia in not available. Now it’s time to start exploring our shiny new dataset, looking for interesting patterns, stories to tell, and anything else that could form the basis for an interesting visualization.

First off, let’s try to clear our minds and take a long, hard look at the data to hand to get a broad idea of the visualizations suggested. Example 1-1 shows the form of the Nobel dataset, with categorical, temporal, and geographical data.

Learn faster. Dig deeper. See farther.

Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more
Example 1-1. Our cleaned Nobel Prize dataset
[{
 'category': u'Physiology or Medicine',
 'date_of_birth': u'8 October 1927',
 'date_of_death': u'24 March 2002',
 'gender': 'male',
 'link': u'http://en.wikipedia.org/wiki/C%C3%A9sar_Milstein',
 'name': u'C\xe9sar Milstein',
 'country': u'Argentina',
 'place_of_birth': u'Bah\xeda Blanca ,  Argentina',
 'place_of_death': u'Cambridge , England',
 'year': 1984
 },
 ...
 }]

The data in Example 1-1 suggests a number of stories we might want to investigate, among them:

  • Gender disparities among the prize winners

  • National trends (e.g., which country has most prizes in Economics)

  • Details about individual winners, such as their average age on receiving the prize or life expectancy

  • Geographical journey from place of birth to adopted country using the born_in and country fields

These investigative lines form the basis for the coming sections, which will probe the dataset by asking questions of it, such as “How many women other than Marie Curie have won the Nobel Prize for Physics?”, “Which countries have the most prizes per capita
rather than absolute?”, and “Is there a historical trend to prizes by nation, a changing of the guard from old (science) world (big European
nations) to new (US and upcoming Asians)?”
Before beginning our explorations, let’s set up IPython and load our Nobel Prize dataset.

Starting to Explore

Before starting our exploration, we need to set up our IPython environment. First fire up an IPython Notebook or Qt console session from the Nobel work directory:

$ ipython [notebook | qt]

If using the latest IPython Jupyter Notebook, you’ll need to run this:

$ jupyter notebook

In your IPython Qt console or Notebook, use the magic matplotlib command to enable inline plotting:

%matplotlib inline

Then import the standard set of data exploration modules:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import seaborn as sb 1

plt.rcParams['figure.figsize'] = 8, 4 2
1

Importing Seaborn will apply its arguably superior looks to all the plots, not just the Seaborn-specific.

2

Sets the default plotting size to eight inches by four.

At the end of not available, we saved our clean dataset to MongoDB using a utility function (see not available). Let’s load the clean data into a Pandas DataFrame, ready to begin exploring.

df = mongo_to_dataframe('nobel_prize', 'winners_clean')

Let’s get some basic information about our dataset’s structure:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 858 entries, 0 to 857
Data columns (total 12 columns):
award_age         858 non-null int64
category          858 non-null object
country           858 non-null object
date_of_birth     858 non-null object
date_of_death     559 non-null object
gender            858 non-null object
link              858 non-null object
name              858 non-null object
place_of_birth    831 non-null object
place_of_death    524 non-null object
text              858 non-null object
year              858 non-null int64
dtypes: int64(2), object(10)
memory usage: 87.1+ KB

Note that our dates of birth and death columns have the standard Pandas datatype of object. In order to make date comparisons, we’ll need to convert those to the datetime type, datetime64. We can use Pandas’ to_datetime method to achieve this conversion:

df.date_of_birth = pd.to_datetime(df.date_of_birth)
df.date_of_death = pd.to_datetime(df.date_of_death)

Running df.info() should now show two datetime columns:

df.info()

...
date_of_birth     858 non-null datetime64[ns]
date_of_death     559 non-null datetime64[ns]
...

to_datetime usually works without needing extra arguments, but it’s worth checking the converted columns to make sure. In the case of our Nobel Prize dataset, everything checks out.

Plotting with Pandas

Both Pandas Series and DataFrames have integrated plotting, which wraps the most common Matplotlib charts, a few of which we explored in the last chapter. This makes it easy to get quick visual feedback as you interact with your DataFrame. And if you want to visualize something a little more complicated, the Pandas containers will play nicely with vanilla Matplotlib. You can also adapt the plots produced by Pandas using standard Matplotlib customizations.

Let’s look at an example of Pandas’ integrated plotting, starting with a basic plot of gender disparity in Nobel Prize wins.
Notoriously, the Nobel Prize has been distributed unequally among the sexes. Let’s get a quick feel for that disparity by using a bar plot on the gender category. Example 1-2 produces Figure 1-1, showing the huge difference, with men receiving 811 of the 858 prizes in our dataset.

Example 1-2. Using Pandas’ integrated plotting to see gender disparities
by_gender = df.groupby('gender')
by_gender.size().plot(kind='bar')
dvpj 1101
Figure 1-1. Prize counts by gender

In Example 1-2, the Series produced by the gender group’s size method has its own integrated plot method, which turns the raw numbers into a chart:

by_gender.size()
Out:
gender
female     47
male      811
dtype: int64

In addition to the default line plot, the Pandas plot method takes a kind argument to select among other possible plots. Among the more commonly used are:

  • bar or barh (h for horizontal) for bar plots

  • hist for a histogram

  • box for a box plot

  • scatter for scatter plots

You can find a full list of Panda’s integrated plots in the docs as well as some Pandas plotting functions that take DataFrames and Series as arguments.

Let’s extend our investigation into gender disparities and start extending our plotting know-how.

Gender Disparities

Let’s break down the gender numbers shown in Figure 1-1 by category of prize. Pandas’ groupby method can take a list of columns to group by, with each group being accessed by multiple keys.

by_cat_gen = df.groupby(['category','gender'])

by_cat_gen.get_group(('Physics', 'female'))[['name', 'year']] 1
Out:
                       name  year
268    Maria Goeppert-Mayer  1963
613  Marie Skłodowska-Curie  1903
1

Gets a group using a category and gender key.

Using the size method to get the size of these groups returns a Series with a MultiIndex that labels the values by both category and gender:

by_cat_gen.size()
Out:
category                gender
Chemistry               female      4
                        male      167
Economics               female      1
                        male       74
...
Physiology or Medicine  female     11
                        male      191
dtype: int64

We can plot this multi-indexed Series directly, using hbar as the kind argument to produce a horizontal bar chart. This code produces Figure 1-2:

by_cat_gen.size().plot(kind='barh')
dvpj 1102
Figure 1-2. Plotting multikey groups

Figure 1-2 is a little crude and makes comparing gender disparities harder than it should be. Let’s go about refining our charts to make those disparities clearer.

Unstacking Groups

Figure 1-2 isn’t the easiest chart to read, even were we to improve the sorting of the bars. Handily, Pandas Series have a cool unstack method that takes the multiple indices—in this case, gender and category—and uses them as columns and indices, respectively, to create a new DataFrame. Plotting this DataFrame gives a much more usable plot, as it compares prize wins by gender. The following code produces Figure 1-3:

by_cat_gen.size().unstack().plot(kind='barh')
dvpj 1103
Figure 1-3. Unstacked Series of group sizes

Let’s improve Figure 1-3 by ordering the bar groups by number of female winners (low to high) and adding a total winners bar group for comparison. Example 1-3 produces the chart in Figure 1-4.

Example 1-3. Sorting and summing our gender groups
cat_gen_sz = by_cat_gen.size().unstack()
cat_gen_sz['total'] = cat_gen_sz.sum(axis=1) 1
cat_gen_sz = cat_gen_sz.sort_values(by='female', ascending=True) 2
cat_gen_sz[['female', 'total', 'male']].plot(kind='barh')
1

Sums the male and female totals. The axis argument is 0 for index sum, 1 for columns.

2

Sorts the rows using the female field, from low to high.

dvpj 1104
Figure 1-4. Bars ordered by number of female winners

Ignoring Economics, a recent and contentious addition to the Nobel Prize categories, Figure 1-4 shows that the largest discrepancy in the number of male and female prize winners is in Physics, with only two female winners. Let’s remind ourselves who they are:

df[(df.category == 'Physics') & (df.gender == 'female')]\
    [['name', 'country','year']]

Out:
                       name    country  year
267    Maria Goeppert-Mayer  United States  1963
611  Marie Skłodowska-Curie         Poland  1903

While most people will have heard of Marie Curie, who is actually one of the four illustrious winners of two Nobel Prizes, few have heard of Maria Goeppert-Mayer.1 This ignorance is surprising, given the drive to encourage women into science. I would want my visualization to enable people to discover and learn a little about Maria Goeppert-Mayer.

Age and Life Expectancy of Winners

We have the date of birth for all our winners and the date of death for 559 of them. Combined with the year in which they won their prizes, we have a fair amount of individual data to mine. Let’s investigate the age distribution of winners and try to glean some idea of the winners’ longevity.

Age at Time of Award

In not available we added an 'award_age' column to our Nobel Prize dataset by subtracting the winners’ ages from their prize years. A quick and easy win is to use Pandas’ histogram plot to assess this distribution:

df['award_age'].hist(bins=20)

Here we require that the age data be divided into 20 bins. This produces Figure 1-17, showing that the early 60s is a sweet spot for the prize and if you haven’t achieved it by 100, it probably isn’t going to happen. Note the outlier around 20, which is the recently awarded 17-year-old recipient of the Peace Prize, Malala Yousafzai.

dvpj 1117
Figure 1-17. Distribution of ages at time of award

We can use Seaborn’s distplot to get a better feel for the distribution, adding a kernel density estimate (KDE)2 to the histogram. The following one-liner produces Figure 1-18, showing that our sweet spot is around 60 years of age:

sns.distplot(df['award_age'])
dvpj 1118
Figure 1-18. Distribution of ages at time of award with KDE superimposed

A box plot is a good way of visualizing continuous data, showing the quartiles, the first and third marking the edges of the box and the second quartile (or median average) marking the line in the box. Generally, as in Figure 1-19, the horizontal end lines (known as the whisker ends) indicate the max and min of the data. Let’s use a Seaborn box plot and divide the prizes by gender:

sns.boxplot(df.gender, df.award_age)

This produces Figure 1-19, which shows that the distributions by gender are similar, with women having a slightly lower average age. Note that with far fewer female prize winners, their statistics are subject to a good deal more uncertainty.

dvpj 1119
Figure 1-19. Ages of prize winners by gender

Seaborn’s rather nice violinplot combines the conventional box plot with a kernel density estimation to give a more refined view of the breakdown by age and gender. The following code produces Figure 1-20.

sns.violinplot(df.gender, df.award_age)
dvpj 1120
Figure 1-20. Violinplots of prize-age distribution by gender

Life Expectancy of Winners

Now let’s look at the longevity of Nobel Prize winners, by subtracting the available dates of death from their respective dates of birth. We’ll store this data in a new 'age_at_death' column:

df['age_at_death'] = (df.date_of_death - df.date_of_birth)\
                     .dt.days/365 1
1

datetime64 data can be added and subtracted in a sensible fashion, producing a Pandas timedelta column. We can use its dt method to get the interval in days, dividing this by 365 to get the age at death as a float.

We make a copy of the 'age_at_death' column,3 removing all empty NaN rows. This can then be used to make the histogram and KDE shown in Figure 1-21.

age_at_death = df[df.age_at_death.notnull()].age_at_death 1

sns.distplot(age_at_death, bins=40)
1

Removes all NaNs to clean the data and reduce plotting errors (e.g., distplot fails with NaNs).

dvpj 1121
Figure 1-21. Life expectancy of the Nobel Prize winners

Figure 1-21 shows the Nobel Prize winners to be a remarkably long-lived bunch, with an average age in the early 80s. This is all the more impressive given that the large majority of winners are men, who have considerably lower average life expectancies in the general population than women. One contributary factor to this longevity is the selection bias we saw earlier. Nobel Prize winners aren’t generally honored until they’re in their late 50s and 60s, which removes the subpopulation who died before having the chance to be acknowledged, pushing up the longevity figures.

Figure 1-21 shows some centenarians among the prize winners. Let’s find them:

df[df.age_at_death > 100][['name', 'category', 'year']]
Out:
                     name                category  year
68   Rita Levi-Montalcini  Physiology or Medicine  1986
103          Ronald Coase               Economics  1991

Now let’s superimpose a couple of KDEs to show differences in mortality for male and female recipients:

df2 = df[df.age_at_death.notnull()] 1
sns.kdeplot(df2[df2.gender == 'male']
    .age_at_death, shade=True, label='male')
sns.kdeplot(df2[df2.gender == 'female']
    .age_at_death, shade=True, label='female')

plt.legend()
1

Creates a DataFrame with only valid 'age_at_death' fields.

This produces Figure 1-22, which, allowing for the small number of female winners and flatter distribution, shows the male and female averages to be close. Female Nobel Prize winners seem to live relatively shorter lives than their counterparts in the general population.

dvpj 1122
Figure 1-22. Nobel Prize winner life expectancies by gender

A violinplot provides another perspective, shown in Figure 1-23.

sns.violinplot(df.gender, age_at_death)
dvpj 1123
Figure 1-23. Winner life expectancies by gender

Increasing Life Expectancies over Time

Let’s do a little historical demographic analysis by seeing if there’s a correlation between the date of birth of our Nobel Prize winners and their life expectancy. We’ll use one of Seaborn’s lmplots to provide a scatter plot and line-fitting with confidence intervals (see not available).

df_temp=df[df.age_at_death.notnull()] 1
data = pd.DataFrame( 2
    {'age at death':df_temp.age_at_death,
     'date of birth':df_temp.date_of_birth.dt.year})
sns.lmplot('date of birth', 'age at death', data,\
  size=6, aspect=1.5)
1

Creates a temporary DataFrame, removing all the rows with no 'age_at_death' field.

2

Creates a new DataFrame with only the two columns of interest from the refined df_temp. We grab only the year from the date_of_birth, using its dt accessor.

This produces Figure 1-24, showing an increase in life expectancy of a decade or so over the prize’s duration.

dvpj 1124
Figure 1-24. Correlating date of birth with age at death

The Nobel Diaspora

While cleaning our Nobel Prize dataset in not available, we found duplicate entries recording the winner’s place of birth and country at time of winning. We preserved these, giving us 104 winners whose country at time of winning was different from their country of birth. Is there a story to tell here?

A good way to visualize the movement patterns from the winners’ country of birth to their adopted country is by using a heatmap to show all born_in/country pairs. The following code produces the heatmap in Figure 1-25:

by_bornin_nat = df[df.born_in.notnull()].groupby(\ 1
    ['born_in', 'country']).size().unstack()
by_bornin_nat.index.name = 'Born in' 2
by_bornin_nat.columns.name = 'Moved to'
plt.figure(figsize=(8, 8))

ax = sns.heatmap(by_bornin_nat, vmin=0, vmax=8) 3
ax.set_title('The Nobel Diaspora')
1

Selects all rows with a 'born_in' field, and forms groups on this and the country column.

2

We rename the row index and column names to make them more descriptive.

3

Seaborn’s heatmap attempts to set the correct bounds for the data, but in this case, we must manually adjust the limits (vmin and vmax) to see all the cells.

Figure 1-25 shows some interesting patterns, which tell a tale of persecution and sanctuary. First, the United States is the overwhelming recipient of relocated Nobel winners, followed by the United Kingdom. Note that the biggest contingents for both (except cross-border traffic from Canada) are from Germany. Italy, Hungary, and Austria are the next largest groups. Examining the individuals in these groups shows that the majority were displaced as a result of the rise of antisemitic fascist regimes in the run-up to World War II and the increasing persecution of Jewish minorities.

dvpj 1125
Figure 1-25. The Nobel Prize diaspora

To take an example, all four of the Nobel winners who moved from Germany to the United Kingdom were German research scientists with Jewish ancestry who moved in response to the Nazis’ rise to power:

df[(df.born_in == 'Germany') & (df.country == 'United Kingdom')]
    [['name', 'date_of_birth', 'category']]

Out:
                  name date_of_birth                category
976   Ernst Boris Chain    1906-06-19  Physiology or Medicine
1342   Hans Adolf Krebs    1900-08-25  Physiology or Medicine
1344           Max Born    1882-12-11                 Physics
1360       Bernard Katz    1911-03-26  Physiology or Medicine

Ernst Chain pioneered the industrial production of penicillin. Hans Krebs discovered the Krebs cycle, one of the most important discoveries in biochemistry, which regulates the energy production of cells. Max Born was one of the pioneers of quantum mechanics, and Bernard Katz uncovered the fundamental properties of synaptic junctions in neurons.

There are many such illustrious names among the winning emigrants. One interesting discovery is the number of prize winners who were part of the famous Kindertransport, an organized rescue effort that took place nine months before the outbreak of WWII and saw 10,000 Jewish children from Germany, Austria, Czechoslovakia, and Poland transported to the United Kingdom. Of these children, four went on to win a Nobel Prize.

Summary

In this chapter, we explored our Nobel Prize dataset, probing the key fields of gender, category, country, and year (of prize) looking for interesting trends and stories we can tell or enable visually. We used a fair number of Matplotlib (by way of Pandas) and Seaborn’s plots, from basic bar charts to more complicated statistical summaries like violinplots and heatmaps. Mastery of these tools and the others in the Python chart armory will allow you to quickly get the feel of your datasets, which is a prerequisite to building a visualization around them. We found more than enough stories in the data to suggest a web visualization. In the next chapter we will imagine and design just such a Nobel Prize winner visualization, cherry-picking the nuggets gained in this chapter.

1Anecdotally, no one I have asked in person or in talk audiences has known the name of the other female Nobel Prize winner for Physics.

2See Wikipedia for details. Essentially the data is smoothed and a probability density function derived.

3We are ignoring leap years and other subtle, complicating factors in deriving years from days.

Post topics: Open Source
Share: