In this chapter, we introduce the dataset we use throughout the book: baseball performance statistics. We explain the various metrics used in baseball (and in this book) so that you can follow along even if you aren’t a baseball fan.
Nate Silver calls baseball the “perfect dataset.” There are not many human-centered systems for which this comprehensive degree of detail is available, and no richer set of tables for truly demonstrating the full range of analytic patterns.
For readers who are not avid baseball fans, we provide a simple—some might say “oversimplified”—description of the sport and its key statistics. For more details, refer to Joseph Adler’s Baseball Hacks (O’Reilly) or Max Marchi and Jim Albert’s Analyzing Baseball Data with R (Chapman & Hall).
Our baseball statistics come in tables at multiple levels of detail.
Putting people first as we like to do, the people table lists each player’s name and personal stats (height and weight, birth year, etc.). It has a primary key, the player_id, formed from the first five letters of the player’s last name, first two letters of their first name, and a two-digit disambiguation slug. There are also primary tables for ballparks (parks, which lists information on every stadium that has ever hosted a game) and for teams (teams, which lists every Major League team back to the birth of the game).
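To make the player_id rule concrete, here is a minimal Python sketch of how such a key could be assembled. The helper name make_player_id and its seq parameter are hypothetical, invented for illustration; the lowercasing, the stripping of non-letter characters, and the slug starting at 01 are assumptions, not rules stated by the dataset.

```python
def make_player_id(first_name: str, last_name: str, seq: int = 1) -> str:
    """Build a key shaped like the player_id described above.

    Hypothetical helper: the real dataset ships with its IDs already assigned.
    """
    # Assumption: IDs are lowercase and keep letters only
    # (so apostrophes, spaces, and periods are dropped).
    last = "".join(ch for ch in last_name.lower() if ch.isalpha())
    first = "".join(ch for ch in first_name.lower() if ch.isalpha())

    # The slug is two digits, so only 0..99 can be represented.
    if not 0 <= seq <= 99:
        raise ValueError("disambiguation slug must fit in two digits")

    # First five letters of the last name, first two of the first name,
    # then the zero-padded slug. Shorter names contribute what they have.
    return f"{last[:5]}{first[:2]}{seq:02d}"


print(make_player_id("Hank", "Aaron"))  # aaronha01
print(make_player_id("Babe", "Ruth"))   # ruthba01
```

A last name shorter than five letters simply yields a shorter ID, and a second player whose names produce the same seven-letter stem would take the next slug value (for example, seq=2 gives a key ending in 02).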
The core statistics table is bat_seasons, which gives each player’s batting ...