Big Data for Chimps by Russell Jurney, Philip Kromer

Chapter 3. A Quick Look into Baseball

In this chapter, we introduce the dataset we use throughout the book: baseball performance statistics. We explain the various metrics used in baseball (and in this book) so that you can follow along even if you aren’t a baseball fan.

Nate Silver calls baseball the “perfect dataset.” Few human-centered systems are recorded in such comprehensive detail, and no richer set of tables exists for demonstrating the full range of analytic patterns.

For readers who are not avid baseball fans, we provide a simple—some might say “oversimplified”—description of the sport and its key statistics. For more details, refer to Joseph Adler’s Baseball Hacks (O’Reilly) or Max Marchi and Jim Albert’s Analyzing Baseball Data with R (Chapman & Hall).

The Data

Our baseball statistics come in tables at multiple levels of detail.

Putting people first, as we like to do, we start with the people table, which lists each player’s name and personal stats (height and weight, birth year, etc.). Its primary key, player_id, is formed from the first five letters of the player’s last name, the first two letters of their first name, and a two-digit disambiguation slug. There are also primary tables for ballparks (parks, which lists information on every stadium that has ever hosted a game) and for teams (teams, which lists every Major League team back to the birth of the game).
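
To make the player_id rule concrete, here is a minimal Python sketch of how such a key could be assembled. The function name make_player_id is hypothetical, and the lowercasing, the stripping of non-letter characters, and the slug starting at 1 are assumptions made for illustration; the text above specifies only the five-letter, two-letter, and two-digit parts.

```python
import re


def make_player_id(last_name: str, first_name: str, slug: int = 1) -> str:
    """Build a player_id-style key: 5 letters + 2 letters + 2-digit slug.

    Assumptions (not stated in the text): the key is lowercase, characters
    other than a-z are dropped, and the slug is a number from 0 to 99.
    """
    if not 0 <= slug <= 99:
        raise ValueError("slug must fit in two digits")

    # Normalize each name to lowercase letters only (assumption).
    last = re.sub(r"[^a-z]", "", last_name.lower())
    first = re.sub(r"[^a-z]", "", first_name.lower())

    # Names shorter than the limit simply contribute fewer letters.
    return f"{last[:5]}{first[:2]}{slug:02d}"


if __name__ == "__main__":
    print(make_player_id("Aaron", "Hank"))     # aaronha01
    print(make_player_id("Ruth", "Babe"))      # ruthba01
    print(make_player_id("Griffey", "Ken", 2)) # griffke02
```

The slug is what keeps the key unique when two players share the same name fragments: the second such player gets 02, the third 03, and so on.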

The core statistics table is bat_seasons, which gives each player’s batting ...
