Analyzing Baseball Data with R

Book Description

With its flexible capabilities and open-source platform, R has become a major tool for analyzing detailed, high-quality baseball data. Analyzing Baseball Data with R provides an introduction to R for sabermetricians, baseball enthusiasts, and students interested in exploring the rich sources of baseball data. It equips readers with the necessary skills and software tools to perform all of the analysis steps, from gathering the datasets and entering them in a convenient format to visualizing the data via graphs to performing a statistical analysis.

The authors first present an overview of publicly available baseball datasets and a gentle introduction to R's data structures and its exploratory and data management capabilities. They also cover the traditional graphics functions in the base package and introduce more sophisticated graphical displays available through the lattice and ggplot2 packages. Much of the book illustrates the use of R through popular sabermetrics topics, including the Pythagorean formula, run expectancy, career trajectories, simulation of games and seasons, patterns of streaky behavior of players, and fielding measures. Each chapter contains exercises that encourage readers to perform their own analyses using R. All of the datasets and R code used in the text are available online.

This book helps readers answer questions about baseball teams, players, and strategy using large, publicly available datasets. It offers detailed instructions on downloading the datasets and putting them into formats that simplify data exploration and analysis. Through the book’s various examples, readers will learn about modern sabermetrics and be able to conduct their own baseball analyses.

Table of Contents

  1. Cover
  2. Half Title
  3. Title
  4. Copyright
  5. Contents
  6. Preface
  7. 1 The Baseball Datasets
    1. 1.1 Introduction
    2. 1.2 The Lahman Database: Season-by-Season Data
      1. 1.2.1 Bonds, Aaron, and Ruth home run trajectories
      2. 1.2.2 Obtaining the database
      3. 1.2.3 The Master table
      4. 1.2.4 The Batting table
      5. 1.2.5 The Pitching table
      6. 1.2.6 The Fielding table
      7. 1.2.7 The Teams table
      8. 1.2.8 Baseball questions
    3. 1.3 Retrosheet Game-by-Game Data
      1. 1.3.1 The 1998 McGwire and Sosa home run race
      2. 1.3.2 Retrosheet
      3. 1.3.3 Game logs
      4. 1.3.4 Obtaining the game logs from Retrosheet
      5. 1.3.5 Game log example
      6. 1.3.6 Baseball questions
    4. 1.4 Retrosheet Play-by-Play Data
      1. 1.4.1 Event files
      2. 1.4.2 Event example
      3. 1.4.3 Baseball questions
    5. 1.5 Pitch-by-Pitch Data
      1. 1.5.1 MLBAM Gameday and PITCHf/x
      2. 1.5.2 PITCHf/x Example
      3. 1.5.3 Baseball questions
    6. 1.6 Summary
    7. 1.7 Further Reading
    8. 1.8 Exercises
  8. 2 Introduction to R
    1. 2.1 Introduction
    2. 2.2 Installing R and RStudio
    3. 2.3 Vectors
      1. 2.3.1 Career of Warren Spahn
      2. 2.3.2 Vectors: defining and calculations
      3. 2.3.3 Vector functions
      4. 2.3.4 Vector index and logical variables
    4. 2.4 Objects and Containers in R
      1. 2.4.1 Character data and matrices
      2. 2.4.2 Factors
      3. 2.4.3 Lists
    5. 2.5 Collection of R Commands
      1. 2.5.1 R scripts
      2. 2.5.2 R functions
    6. 2.6 Reading and Writing Data in R
      1. 2.6.1 Importing data from a file
      2. 2.6.2 Saving datasets
    7. 2.7 Data Frames
      1. 2.7.1 Introduction
      2. 2.7.2 Manipulations with data frames
      3. 2.7.3 Merging and selecting from data frames
    8. 2.8 Packages
    9. 2.9 Splitting, Applying, and Combining Data
      1. 2.9.1 Using sapply
      2. 2.9.2 Using ddply in the plyr package
    10. 2.10 Getting Help
    11. 2.11 Further Reading
    12. 2.12 Exercises
  9. 3 Traditional Graphics
    1. 3.1 Introduction
    2. 3.2 Factor Variable
      1. 3.2.1 A bar graph
      2. 3.2.2 Add axes labels and a title
      3. 3.2.3 Other graphs of a factor
    3. 3.3 Saving Graphs
    4. 3.4 Dot plots
    5. 3.5 Numeric Variable: Stripchart and Histogram
    6. 3.6 Two Numeric Variables
      1. 3.6.1 Scatterplot
      2. 3.6.2 Building a graph, step-by-step
    7. 3.7 A Numeric Variable and a Factor Variable
      1. 3.7.1 Parallel stripcharts
      2. 3.7.2 Parallel boxplots
    8. 3.8 Comparing Ruth, Aaron, Bonds, and A-Rod
      1. 3.8.1 Getting the data
      2. 3.8.2 Creating the player data frames
      3. 3.8.3 Constructing the graph
    9. 3.9 The 1998 Home Run Race
      1. 3.9.1 Getting the data
      2. 3.9.2 Extracting the variables
      3. 3.9.3 Constructing the graph
    10. 3.10 Further Reading
    11. 3.11 Exercises
  10. 4 The Relation Between Runs and Wins
    1. 4.1 Introduction
    2. 4.2 The Teams Table in Lahman's Database
    3. 4.3 Linear Regression
    4. 4.4 The Pythagorean Formula for Winning Percentage
    5. 4.5 The Exponent in the Pythagorean Formula
    6. 4.6 Good and Bad Predictions by the Pythagorean Formula
    7. 4.7 How Many Runs for a Win?
    8. 4.8 Further Reading
    9. 4.9 Exercises
  11. 5 Value of Plays Using Run Expectancy
    1. 5.1 The Run Expectancy Matrix
    2. 5.2 Runs Scored in the Remainder of the Inning
    3. 5.3 Creating the Matrix
    4. 5.4 Measuring Success of a Batting Play
    5. 5.5 Albert Pujols
    6. 5.6 Opportunity and Success for All Hitters
    7. 5.7 Position in the Batting Lineup
    8. 5.8 Run Values of Different Base Hits
      1. 5.8.1 Value of a home run
      2. 5.8.2 Value of a single
    9. 5.9 Value of Base Stealing
    10. 5.10 Further Reading and Software
    11. 5.11 Exercises
  12. 6 Advanced Graphics
    1. 6.1 Introduction
    2. 6.2 The lattice Package
      1. 6.2.1 Introduction
      2. 6.2.2 The verlander dataset
      3. 6.2.3 Basic plotting with lattice
      4. 6.2.4 Multipanel conditioning
      5. 6.2.5 Superposing group elements
      6. 6.2.6 Scatterplots and dot plots
      7. 6.2.7 The panel function
      8. 6.2.8 Building a graph, step-by-step
    3. 6.3 The ggplot2 Package
      1. 6.3.1 Introduction
      2. 6.3.2 The cabrera dataset
      3. 6.3.3 The first layer
      4. 6.3.4 Grouping factors
      5. 6.3.5 Multipanel conditioning (faceting)
      6. 6.3.6 Adding elements
      7. 6.3.7 Combining information
      8. 6.3.8 Adding a smooth line with error bands
      9. 6.3.9 Dealing with cluttered charts
      10. 6.3.10 Adding a background image
    4. 6.4 Further Reading
    5. 6.5 Exercises
  13. 7 Balls and Strikes Effects
    1. 7.1 Introduction
    2. 7.2 Hitter's Counts and Pitcher's Counts
      1. 7.2.1 Introduction
      2. 7.2.2 An example for a single pitcher
      3. 7.2.3 Pitch sequences on Retrosheet
        1. 7.2.3.1 Functions for string manipulation
        2. 7.2.3.2 Finding plate appearances going through a given count
      4. 7.2.4 Expected run value by count
      5. 7.2.5 The importance of the previous count
    3. 7.3 Behaviors by Count
      1. 7.3.1 Swinging tendencies by count
        1. 7.3.1.1 Propensity to swing by location
        2. 7.3.1.2 Effect of the ball/strike count
      2. 7.3.2 Pitch selection by count
      3. 7.3.3 Umpires' behavior by count
    4. 7.4 Further Reading
    5. 7.5 Exercises
  14. 8 Career Trajectories
    1. 8.1 Introduction
    2. 8.2 Mickey Mantle's Batting Trajectory
    3. 8.3 Comparing Trajectories
      1. 8.3.1 Some preliminary work
      2. 8.3.2 Computing career statistics
      3. 8.3.3 Computing similarity scores
      4. 8.3.4 Defining age, OBP, SLG, and OPS variables
      5. 8.3.5 Fitting and plotting trajectories
    4. 8.4 General Patterns of Peak Ages
      1. 8.4.1 Computing all fitted trajectories
      2. 8.4.2 Patterns of peak age over time
      3. 8.4.3 Peak age and career at-bats
    5. 8.5 Trajectories and Fielding Position
    6. 8.6 Further Reading
    7. 8.7 Exercises
  15. 9 Simulation
    1. 9.1 Introduction
    2. 9.2 Simulating a Half Inning
      1. 9.2.1 Markov chains
      2. 9.2.2 Review of work in runs expectancy
      3. 9.2.3 Computing the transition probabilities
      4. 9.2.4 Simulating the Markov chain
      5. 9.2.5 Beyond runs expectancy
      6. 9.2.6 Transition probabilities for individual teams
    3. 9.3 Simulating a Baseball Season
      1. 9.3.1 The Bradley-Terry model
      2. 9.3.2 Making up a schedule
      3. 9.3.3 Simulating talents and computing win probabilities
      4. 9.3.4 Simulating the regular season
      5. 9.3.5 Simulating the post-season
      6. 9.3.6 Function to simulate one season
      7. 9.3.7 Simulating many seasons
    4. 9.4 Further Reading
    5. 9.5 Exercises
  16. 10 Exploring Streaky Performances
    1. 10.1 Introduction
    2. 10.2 The Great Streak
      1. 10.2.1 Finding game hitting streaks
      2. 10.2.2 Moving batting averages
    3. 10.3 Streaks in Individual At-Bats
      1. 10.3.1 Streaks of hits and outs
      2. 10.3.2 Moving batting averages
      3. 10.3.3 Finding hitting slumps for all players
      4. 10.3.4 Were Suzuki and Ibanez unusually streaky?
    4. 10.4 Local Patterns of Weighted On-Base Average
    5. 10.5 Further Reading
    6. 10.6 Exercises
  17. 11 Learning About Park Effects by Database Management Tools
    1. 11.1 Introduction
    2. 11.2 Installing MySQL and Creating a Database
    3. 11.3 Connecting R to MySQL
      1. 11.3.1 Connecting using package RMySQL
      2. 11.3.2 Connecting using package RODBC
    4. 11.4 Filling a MySQL Game Log Database from R
      1. 11.4.1 From Retrosheet to R
      2. 11.4.2 From R to MySQL
    5. 11.5 Querying Data from R
      1. 11.5.1 Introduction
      2. 11.5.2 Coors Field and run scoring
    6. 11.6 Baseball Data as MySQL Dumps
      1. 11.6.1 Lahman's database
      2. 11.6.2 Retrosheet database
      3. 11.6.3 PITCHf/x database
    7. 11.7 Calculating Basic Park Factors
      1. 11.7.1 Loading the data in R
      2. 11.7.2 Home run park factor
      3. 11.7.3 Assumptions of the proposed approach
      4. 11.7.4 Applying park factors
    8. 11.8 Further Reading
    9. 11.9 Exercises
  18. 12 Exploring Fielding Metrics with Contributed R Packages
    1. 12.1 Introduction
    2. 12.2 A Motivating Example: Comparing Fielding Metrics
      1. 12.2.1 Introduction
      2. 12.2.2 The fielding metrics
      3. 12.2.3 Reading an Excel spreadsheet (XLConnect)
      4. 12.2.4 Summarizing multiple columns (doBy)
      5. 12.2.5 Finding the most similar string (stringdist)
      6. 12.2.6 Applying a function on multiple columns (plyr)
      7. 12.2.7 Weighted correlations (weights)
      8. 12.2.8 Displaying correlation matrices (ellipse)
      9. 12.2.9 Evaluating the fielding metrics (psych)
    3. 12.3 Comparing Two Shortstops
      1. 12.3.1 Reshaping the data (reshape2)
      2. 12.3.2 Plotting the data (ggplot2 and directlabels)
    4. 12.4 Further Reading
    5. 12.5 Exercises
  19. A Retrosheet Files Reference
    1. A.1 Downloading Play-by-Play Files
      1. A.1.1 Introduction
      2. A.1.2 Setup
      3. A.1.3 Using a special function for a particular season
      4. A.1.4 Reading the files into R
      5. A.1.5 The function parse.retrosheet.pbp
    2. A.2 Retrosheet Event Files: a Short Reference
      1. A.2.1 Game and event identifiers
      2. A.2.2 The state of the game
    3. A.3 Parsing Retrosheet Pitch Sequences
      1. A.3.1 Introduction
      2. A.3.2 Setup
      3. A.3.3 Evaluating every count
  20. B Accessing and Using MLBAM Gameday and PITCHf/x Data
    1. B.1 Introduction
    2. B.2 Where are the Data Stored?
    3. B.3 Suitable Formats for PITCHf/x Data
      1. B.3.1 Obtaining data from on-line resources
      2. B.3.2 Parsing in R
        1. B.3.2.1 A wrapper function
    4. B.4 Details on the Data
      1. B.4.1 atbat attributes
      2. B.4.2 pitch attributes
      3. B.4.3 hip attributes (hit locations data)
    5. B.5 Special Notes About the Gameday and PITCHf/x Data
    6. B.6 Miscellanea
      1. B.6.1 Calculating the pitch trajectory
      2. B.6.2 An R package for getting and visualizing PITCHf/x data: pitchRx
      3. B.6.3 Cross-referencing with other data sources
      4. B.6.4 Online resources
  21. Bibliography
  22. Index