Chapter 1. Football Analytics
American football (also known as gridiron football or North American football, and henceforth simply called football) is undergoing a drastic shift toward the quantitative. Until the last half decade or so, football analytics was largely confined to a few seminal pieces of work. Arguably the earliest example of analytics being used in football comes from former Brigham Young University, Chicago Bears, Cincinnati Bengals, and San Diego Chargers quarterback Virgil Carter, who created the notion of an expected point as coauthor of the 1971 paper “Technical Note: Operations Research in Football,” before teaming with the legendary Bill Walsh as the first quarterback to execute what is now known as the West Coast offense.
The idea of an expected point is incredibly important in football because the game by its very nature is discrete: a finite collection of plays (also called downs) that require the offense to go a certain distance (in yards) before having to surrender the ball to the opposing team. If the line to gain is the opponent’s end zone, the offense scores a touchdown, which is worth, on average, about seven points after the post-touchdown conversion. Hence, the expected point provides an estimated, or expected, value for the number of points you would expect a team to score given the current game situation on that drive.
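To make this concrete, here is a minimal sketch, assuming the `nflfastR` tools introduced later in this chapter (the play-by-play data include a precomputed expected-points column, `ep`), that averages expected points by down:

```r
## R: a minimal sketch, assuming the nflfastR tools introduced later
## in this chapter; the pbp data include an expected points column (ep)
library(tidyverse)
library(nflfastR)

pbp_r <- load_pbp(2021)

pbp_r |>
  filter(!is.na(down), !is.na(ep)) |>
  group_by(down) |>
  summarize(mean_ep = mean(ep))  # average expected points by down
```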
Football statistics have largely been confined to offensive players and doled out in the currency of yards gained and touchdowns scored. The problem with this is obvious. If a player catches a pass to gain 7 yards but 8 are required for a first down or a touchdown, the player did not gain a first down. Conversely, if a player gains 5 yards when 5 are required, the player gained a first down. Hence, “enough” yards can be better than “more” yards, depending on the context of the play. As a second example, if it takes a team two plays to travel 70 yards to score a touchdown, with one player gaining the first 65 yards and the second gaining the final 5, why should the second player get all the credit for the score?
In 1988, Bob Carroll, Pete Palmer, and John Thorn wrote The Hidden Game of Football (Grand Central Publishing), which further explored the notions of expected points. In 2007, Brian Burke, who was a US Navy pilot before creating the Advanced Football Analytics website (http://www.advancedfootballanalytics.com), formulated the expected-points and expected-points-added approach, along with building a win probability model responsible for some key insights, including the 4th Down Bot at the New York Times website. Players may be evaluated by the number of expected points or win probability points added to their teams when those players did things like throw or catch passes, run the ball, or sack the quarterback.
The work of Burke inspired the open source work of Ron Yurko, Sam Ventura, and Max Horowitz of Carnegie Mellon University. The trio built `nflscrapR`, an R package that scraped NFL play-by-play data. The `nflscrapR` package was built to display their own versions of expected points added (EPA) and win probability (WP) models. Using this framework, they also replicated the famous wins above replacement (WAR) framework from baseball for quarterbacks, running backs, and wide receivers, which was published in 2018. This work was later extended using different data and methods by Eric and his collaborator George Chahrouri in 2020. Eric’s version of WAR, and its analogous model for college football, are used throughout the industry to this day.
The `nflscrapR` package served as a catalyst for the popularization of modern tools that use data to study football, most of which use a framework that will be replicated constantly throughout this book. The process of building an expectation for an outcome—in the form of points, completion percentage, rushing yards, draft-pick outcome, and many more—and measuring players or teams via the residual (that is, the difference between the value expected by the model and the observed value) is a process that transcends football. In soccer, for example, expected goals (xG) are the cornerstone metric upon which players and clubs are measured in the sport known as “the Beautiful Game”. And shot quality—the expected rate at which a shot is made in basketball—is a ubiquitous measure for players and teams on the hardwood. The features that go into these models, and the forms that they take, are the subject of constant research whose surface we will scratch in this book.
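In miniature, the framework looks like this (the numbers here are hypothetical, purely for illustration):

```r
## R: the observed-minus-expected pattern in miniature (hypothetical numbers)
observed <- 8    # yards actually gained on a carry
expected <- 4.2  # a model's expected yards for that down, distance, etc.
residual <- observed - expected
residual         # 3.8 yards over expected credited to the player
```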
The rise of tools like `nflscrapR` allowed more people to show their analytical skills and flourish in the public sphere. Analysts were hired based on their tweets on Twitter because of their ingenious approaches to measuring team performance. Decisions to punt or go for it on a fourth down were evaluated by how they affected the team’s win probability. Ben Baldwin and Sebastian Carl created a spin-off R package, `nflfastR`. This package updated the models of Yurko, Ventura, and Horowitz, along with adding many of their own—models that we’ll use in this book. More recently, the data contained in the `nflfastR` package has been cloned into Python via the `nfl_data_py` package by Cooper Adams.
We hope that this book will give you the basic tools to approach some of the initial problems in football analytics and will serve as a jumping-off point for future work.
Tip
People looking for the cutting edge of sports analytics, including football, may want to check out the MIT Sloan Sports Analytics Conference. Since its founding in 2006, Sloan has emerged as a leading venue for the presentation of new tools for football (and other) analytics. Other, more accessible conferences, like the Carnegie Mellon Sports Analytics Conference and New England Statistics Symposium, are fantastic places for students and practitioners to present their work. Most of these conferences have hackathons for people looking to make an impression on the industry.
Baseball Has the Three True Outcomes: Does Football?
Baseball pioneered the use of quantitative metrics, and the creation of the Society for American Baseball Research (SABR) led to the term sabermetrics to describe baseball analysis. Because of this long history, we start by looking at the metrics commonly used in baseball—specifically, the three true outcomes. One of the reasons the game of baseball has trended toward the three true outcomes (walks, strikeouts, and home runs) is that they are the easiest to predict from one season to the next. What batted balls do when they are in play is noisier and is the source of much of the variance in perceived play from one year to the next. The trend toward the three true outcomes has also spurred more elaborate data-collection methods and subsequent analysis in an attempt to tease additional signal from batted-ball data.
Stability analysis is a cornerstone of team and player evaluation. Stable production is sticky, or repeatable, and is the kind of production decision-makers should want to buy into year in and year out. Stability analysis therefore examines whether something is stable—in our case, football observations and model outputs, and you will use this analysis in “Player-Level Stability of Passing Yards per Attempt”, “Is RYOE a Better Metric?”, “Analyzing RYOE”, and “Is CPOE More Stable Than Completion Percentage?”. On the other hand, how well a team or player does in high-leverage situations (plays that have greater effects on the outcome of the game, such as converting third downs) can have an outsized impact on win-loss record or eventual playoff fate. But if such performance doesn’t help us predict what we want year in and year out, it might be better to ignore it, or to sell it to other decision-makers.
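As a preview of how such an analysis might look, here is a hedged sketch: assume a hypothetical dataframe `season_stats` with columns `player_id`, `season`, and `metric`; the year-over-year correlation of the metric is a simple measure of its stability:

```r
## R: a minimal stability-analysis sketch; season_stats is a hypothetical
## dataframe with columns player_id, season, and metric
library(tidyverse)

lagged <- season_stats |>
  arrange(player_id, season) |>
  group_by(player_id) |>
  mutate(metric_prior = lag(metric)) |>  # same player, previous season
  ungroup()

# correlation between seasons: closer to 1 means more stable (repeatable)
cor(lagged$metric, lagged$metric_prior, use = "complete.obs")
```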
Using play-by-play data from `nflfastR`, Chapter 2 shows you how to slice and dice football passing data into subsets that partition a player’s performance into stable and unstable production. Through exploratory data analysis techniques, you can see whether any players break the mold and what to do with them. We preview how this work can aid in the process of feature engineering for prediction in Chapter 2.
Do Running Backs Matter?
For most of the history of football, the best players played running back (in fact, early football didn’t include the forward pass until 1906, when President Teddy Roosevelt worked with college football to introduce passing and make the game safer). The importance of the running back used to be an accepted truism across all levels of football, until the forward pass became an integral part of the game. Following the forward pass, rule and technology changes—along with Carter (mentioned earlier in this chapter) and his quarterbacks coach, Walsh—made throwing the football more efficient relative to running the football.
Many of our childhood memories from the 1990s revolve around Emmitt Smith and Barry Sanders trading the privilege of being the NFL rushing champion every other year. College football fans from the 1980s may remember Herschel Walker giving way to Bo Jackson in the Southeastern Conference (SEC). Even many younger fans from the 2000s and 2010s can still remember Adrian Peterson earning the last nonquarterback most valuable player (MVP) award. During the 2012 season, he rushed for over 2,000 yards while carrying an otherwise-bad Minnesota Vikings team to the playoffs.
However, the current prevailing wisdom among football analytics folks is that the running back position does not matter as much as other positions. This is for a few reasons. First, running the football is not as efficient as passing it. This is plain to see in simple analyses using yards per play, but also through more advanced means like EPA. Even the least efficient passing attacks usually produce, on average, more yards or expected points per play than running.
Second, differences in the actual player running the ball do not elicit the kind of change in rushing production that similar differences do for quarterbacks, wide receivers, or offensive or defensive linemen. In other words, additional resources used to pay for the services of this running back over that running back are probably not worth it, especially if those resources can be used on other positions. The marketplace that is the NFL has provided additional evidence that this is true: running back salaries and the draft capital used on the position have declined to lows not previously seen.
This didn’t keep the New York Giants from using the second-overall pick in the 2018 NFL Draft on Pennsylvania State University’s Saquon Barkley, which was met with jeers from the analytics community, and a counter from Giants General Manager Dave Gettleman. In a post-draft press conference for the ages, Gettleman, sitting next to reams of bound paper, made fun of the analytics jabs toward his pick by mimicking a person typing furiously on a typewriter.
Chapters 3 and 4 look at techniques for controlling play-by-play rushing data for game situation, to see how much of the variability in rushing success has to do with the player running the ball.
How Data Can Help Us Contextualize Passing Statistics
As we’ve stated previously, the passing game dominates football, and in Appendix B, we show you how to examine the basics of passing game data. In recent years, analysts have taken a deeper look into what constitutes accuracy at the quarterback position because raw completion percentage numbers, even among quarterbacks who aren’t considered elite players, have skyrocketed. The work of Josh Hermsmeyer with the Baltimore Ravens and later FiveThirtyEight established the significance of air yards, which is the distance traveled by the pass from the line of scrimmage to the intended receiver.
While Hermsmeyer’s initial research was in the fantasy football space, it spawned a significant amount of basic research into the passing game, giving rise to metrics like completion percentage over expected (CPOE), one of the most predictive measures of quarterback quality available today.
In Chapter 5, we introduce generalized linear models in the form of logistic regression. You’ll use this to estimate the completion probability of a pass, given multiple situational factors that affect a throw’s expected success. You’ll then look at a player’s residuals (that is, how well a player actually performs compared to the model’s prediction for that performance) and see whether there is more or less stability in the residuals—the CPOE—than in actual completion percentage.
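As a rough preview of that workflow, the sketch below fits a toy completion-probability model with a single predictor (pass depth) and computes a CPOE-style residual. Chapter 5 builds the real model with more situational factors; using `air_yards` alone here is an oversimplification:

```r
## R: a toy sketch of the Chapter 5 workflow, assuming nflfastR data;
## air_yards as the lone predictor is an oversimplification
library(tidyverse)
library(nflfastR)

pbp_r <- load_pbp(2021)

pass_r <- pbp_r |>
  filter(play_type == "pass", !is.na(air_yards), !is.na(complete_pass))

fit <- glm(complete_pass ~ air_yards, data = pass_r, family = binomial)

pass_r <- pass_r |>
  mutate(exp_comp = predict(fit, type = "response"),
         cpoe = complete_pass - exp_comp)  # actual minus expected
```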
Can You Beat the Odds?
In 2018, the Professional and Amateur Sports Protection Act of 1992 (PASPA), which had banned sports betting in the United States (outside of Nevada), was overturned by the US Supreme Court. This court decision opened the floodgates for states to make legal what many people were already doing illegally: betting on football.
The difficult thing about sports betting is the house advantage—referred to as the vigorish, or vig—which makes it so that a bettor has to win more than 50% of their bets to break even. Thus, a cost exists for simply playing the game that needs to be overcome in order to beat the sportsbook (or simply the book for short).
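A quick worked example: at a standard price of -110, you risk $110 to win $100, so the break-even rate is 110 / (110 + 100):

```r
## R: break-even win rate at a standard -110 price
risk <- 110  # dollars risked
win  <- 100  # dollars won if the bet cashes
risk / (risk + win)  # about 0.524: win ~52.4% of bets just to break even
```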
American football is the largest gambling market in North America. Most successful sports bettors in this market use some form of analytics to overcome the house advantage. Chapter 6 examines the market for passing touchdowns per game props, showing how a bettor can arrive at an internal price for such a market and compare it to the market price.
Do Teams Beat the Draft?
Owners, fans, and the broader NFL community evaluate coaches and general managers based on the quality of talent that they bring to their team from one year to the next. One complaint against New England Patriots Coach Bill Belichick, maybe the best nonplayer coach in the history of the NFL, is that he has not drafted well in recent seasons. Has that been a sequence of fundamental missteps or just a run of bad luck?
One argument in support of coaches such as Belichick may be “Well, they are always drafting in the back of the draft, since they are usually a good team.” Luckily, you can use math to control for this and to see whether we can reject the hypothesis that all front offices are equally good at drafting after accounting for draft capital used. Draft capital comprises the resources used during the NFL Draft—notably, the number of picks, pick rounds, and pick numbers.
In Chapter 7, we scrape publicly available draft data and test the hypothesis that all front offices are equally good at drafting after accounting for draft capital used, with surprising results. In Chapter 8, we scrape publicly available NFL Scouting Combine data and use dimension-reduction tools and clustering to see how groups of players emerge.
Tools for Football Analytics
Football analytics and, more broadly, data science require a diverse set of tools, and successful practitioners need a solid understanding of them. Statistical programming languages, like Python and R, are a backbone of our data science toolbox. These languages allow us to clean our datasets, conduct our analyses, and readily reuse our methods.
Although many people commonly use spreadsheets (such as Microsoft Excel or Google Sheets) for data cleaning and analysis, we find spreadsheets do not scale well. For example, when working with large datasets containing tracking data, which can include thousands of rows of data per play, spreadsheets simply are not up to the task. Likewise, people commonly use business intelligence (BI) tools such as Microsoft Power BI and Tableau because of their power and ability to scale. But these tools tend to focus on point-and-click methods and require licenses, especially for commercial use.
Programming languages also allow for easy reuse because copying and pasting formulas in spreadsheets can be tedious and error prone. Lastly, spreadsheets (and, more broadly, point-and-click tools) allow undocumented errors. For example, spreadsheets do not have a way to catch a copying and pasting mistake. Furthermore, modern data science tools allow code, data, and results to be blended together in easy-to-use interfaces. Common languages include Python, R, Julia, MATLAB, and SAS. Additional languages continue to appear as computer science advances.
As practitioners of data science, we use R and Python daily for our work, which has collectively spanned the space of applied mathematics, applied statistics, theoretical ecology, and, of course, football analytics. Of the languages listed previously, Python and R offer the benefit of larger user bases (and hence likely contain the tools and models we need). Both R and Python (as well as Julia) are open source. As of this writing, Julia does not have the user base of R or Python; it may be the cutting edge of statistical computing, a dead end that fizzles out, or possibly both.
Open source means two types of freedom. First, anybody can access all the code in the language, like free speech. This allows volunteers to help improve the language, such as ensuring that users can debug the code and extend the language through add-on packages (like the `nflfastR` package in R or the `nfl_data_py` package in Python). Second, open source also offers the benefit of being free to use for users, like free drinks. Hence users do not need to pay thousands of dollars annually in licensing fees. We were initially trained in R but have learned Python over the course of our jobs. Either language is well suited for football analytics (and sports analytics in general).
Note
Appendix A includes instructions for obtaining R and Python for those of you who do not currently have access to these languages. This includes either downloading and installing the programs or using web-hosted resources. The appendix also describes programs to help you more easily work with these languages, such as editors and integrated development environments (IDEs).
We encourage you to pick one language for your work with this book and learn that language well. Learning a second programming language will be easier if you understand the programming concepts behind a first language. Then you can relate the concepts back to your understanding of your original computer language.
Tip
For readers who want to learn the basics of programming before proceeding with our book, we recommend Al Sweigart’s Invent Your Own Computer Games with Python, 4th edition (No Starch Press, 2016) or Garrett Grolemund’s Hands-On Programming with R (O’Reilly, 2014). Either resource will hold your hand to help you learn the basics of programming.
Although many people pick favorite languages and sometimes argue about which coding language is better (similar to Coke versus Pepsi or Ford versus General Motors), we have seen both R and Python used in production and also used with large data and complex models. For example, we have used R with 100 GB files on servers with sufficient memory. Both of us began our careers coding almost exclusively in R but have learned to use Python when the situation has called for it. Furthermore, the tools often have complementary roles, especially for advanced methods, and knowing both languages lets you have options for problems you may encounter.
Tip
When picking a language, we suggest you use what your friends use. If all your friends speak Spanish, learning Spanish will probably make communicating with them easier. You can then teach them your native language too. The same holds for programming: your friends can help you debug and troubleshoot. If you still need help deciding, open up both languages and play around for a little bit to see which one you like better. Personally, we like R when working with data, because of R’s data manipulation tools, and Python when building and deploying new models, because of Python’s cleaner syntax for writing functions.
First Steps in Python and R
Tip
If you are familiar with R and Python, you’ll still benefit from skimming this section to see how we teach a tool you are familiar with.
Opening a computer terminal may be intimidating for many people. For example, many of our friends and family will walk by our computers, see code up on the screens, and immediately turn their heads in disgust (Richard’s dad) or fear (most other people). However, terminals are quite powerful and allow more to be done with less, once you learn the language. This section will help you get started using Python or R.
The first step for using R or Python is either to install it on your computer or use a web-based version of the program. Various options exist for installing or otherwise accessing Python and R and then using them on your computer. Appendix A contains steps for this as well as installation options.
Note
People, like Richard, who follow the Green Bay Packers are commonly called Cheeseheads. Likewise, people who use Python are commonly called Pythonistas, and people who use R are commonly called useRs.
Once you have access to R or Python, you have an expensive graphing calculator (for example, your $1,000 laptop). In fact, both Eric and Richard, in lieu of using an actual calculator, will often calculate silly things like point spreads or totals in the console when in need of a quick calculation. Let’s see some things you can do. Type `2 + 2` in either the Python or R console:

```
2 + 2
```

Which results in:

```
4
```
Note
People use comments to leave notes to themselves and others in code. Both Python and R use the `#` symbol for comments (the pound symbol for the authors, or hashtag for younger readers). Comments are text (within code) that the computer does not read but that helps humans understand the code. In this book, we use two comment symbols to tell you whether a code block is Python (`## Python`) or R (`## R`).
You may also save numbers as variables. In Python, you could define `z` to be `2` and then reuse `z` and divide by 3:

```python
## Python
z = 2
z / 3
```

Resulting in:

```
0.6666666666666666
```
Tip
In R, either `<-` or `=` may be used to create variables. We use `<-` for two reasons. First, in this book, this helps you see the difference between R and Python code. Second, we use this style in our day-to-day programming as well. Chapter 9 discusses code styles more. Regardless of which operator you use, be consistent with your programming style in any language. Your future self (and others who read your code) will thank you.
In R, you can also define `z` to be `2` and then reuse `z` and divide by 3:

```r
## R
z <- 2
z / 3
```

Resulting in:

```
[1] 0.6666667
```
Note
Python and R format outputs differently. Python shows more digits and does not round the displayed value, whereas R shows fewer digits and rounds the display.
Example Data: Who Throws Deep?
Now that you have seen some basics in R, let’s dive into an example with football data. You will use the `nflfastR` data for many of the examples in this book. This data may be installed as an R package or as the Python package `nfl_data_py`. Specifically, we will explore the broad (and overly simple) question “Who were the most aggressive quarterbacks in 2021?” We will start off introducing the package using R because the data originated with R.
Note
Both Python and R have flourished because they readily allow add-on packages. Conda exists as one tool for managing these add-ons. Chapter 9 and Appendix A discuss these add-ons in greater detail. In general, you can install packages in Python by typing `pip install package name` or `conda install package name` in the terminal (such as the bash shell on Linux, the Zsh shell on macOS, or the command prompt on Microsoft Windows). Sometimes you will need to use `pip3`, depending on your operating system’s configuration, if you are using the `pip` package manager system. For a concrete example, to install the `seaborn` package, you could type `pip install seaborn` in your terminal. In general, packages in R can be installed by opening R and then typing `install.packages("package name")`. For example, to install the `tidyverse` collection of packages, open R and run `install.packages("tidyverse")`.
nflfastR in R
Starting with R, install the `nflfastR` package:

```r
## R
install.packages("nflfastR")
```
Tip
Single quotation marks around a name, such as `'x'`, and double quotes, such as `"x"`, are both acceptable to languages such as Python or R. Make sure the opening and closing quotes match. For example, `'x"` would not be acceptable. You may use both single and double quotes to place quotes inside of quotes. For example, in a figure caption, you might write `"Panthers' points earned"` or `'Air temperature ("true temperature")'`. Or in Python, you can use a combination of quotes later for inputs such as `"team == 'GB'"` because you’ll need to nest quotes inside of quotes.
Next, load this package as well as the `tidyverse`, which gives you tools to manipulate and plot the data:

```r
## R
library("tidyverse")
library("nflfastR")
```
Note
Base R contains dataframes as `data.frame()`. We use tibbles from the tidyverse instead, because these print nicer to screens and include other useful features. Many users consider base R’s `data.frame()` to be a legacy object, although you will likely see these objects when looking at help files and examples on the web. Lastly, you might see the `data.table` package in R. The `data.table` extension of dataframes is similar to a tibble, works better with larger data (for example, 10 GB or 100 GB files), and has a more compact coding syntax, but it comes with the trade-off of being less user-friendly compared to tibbles. In our own work, we use a `data.table` rather than a tibble or `data.frame` when we need high performance at the cost of code readability.
Once you’ve loaded the packages, you need to load the data from each play, or the play-by-play (pbp) data, for the 2021 season. Use the `load_pbp()` function from `nflfastR` and call the data `pbp_r` (the `_r` ending helps you tell that the code is from an R example in this book):

```r
## R
pbp_r <- load_pbp(2021)
```
Note
We generally include `_py` in the names of Python dataframes and `_r` in the names of R dataframes to help you identify the language for various code objects.
After loading the data as `pbp_r`, pass (or pipe) the data along to be filtered by using `|>`. Use the `filter()` function to select only data where passing plays occurred (`play_type == "pass"`) and where `air_yards` are not missing, or `NA` in R syntax (in plain English, the pass had a recorded depth). Chapter 2, Appendix B, and Appendix C cover data manipulation more, and most examples in this book use data wrangling to format data. So right now, simply type this code. You can probably figure out what the code is doing, but don’t worry about understanding it too much:

```r
## R
pbp_r_p <- pbp_r |>
  filter(play_type == 'pass' & !is.na(air_yards))
```
Now you’ll look at the average depth of target (aDOT), or mean air yards per pass, for every quarterback in the NFL in 2021 who threw 100 or more passes with a designated depth. To avoid mixing up multiple players who have the same name, which happens more often than you’d think, you’ll summarize by both player ID and player name.
First, group by both the `passer_id` and `passer`. Then summarize to calculate the number of plays (`n()`) and mean air yards per pass (`adot`) per player. Also, filter to include only players with 100 or more plays and to remove any rows without a passer name (specifically, those with missing or `NA` values).
With this and the previous example commands, the function `is.na(passer)` checks whether each value in the `passer` column is `NA` and returns `TRUE` for entries with an `NA` value. Appendix B covers this logic and terminology in greater detail. Next, an exclamation point (`!`) negates this expression, so that you keep rows where a value is present. As an aside, we, the authors, find the use of double negatives confusing as well. Lastly, arrange by the `adot` values and then print all (or infinity, `Inf`) rows:
```r
## R
pbp_r_p |>
  group_by(passer_id, passer) |>
  summarize(
    n = n(),
    adot = mean(air_yards)
  ) |>
  filter(n >= 100 & !is.na(passer)) |>
  arrange(-adot) |>
  print(n = Inf)
```
Resulting in:
```
# A tibble: 42 × 4
# Groups:   passer_id [42]
   passer_id  passer               n  adot
   <chr>      <chr>            <int> <dbl>
 1 00-0035704 D.Lock             110 10.2
 2 00-0029263 R.Wilson           400  9.89
 3 00-0036945 J.Fields           268  9.84
 4 00-0034796 L.Jackson          378  9.34
 5 00-0036389 J.Hurts            473  9.19
 6 00-0034855 B.Mayfield         416  8.78
 7 00-0026498 M.Stafford         740  8.51
 8 00-0031503 J.Winston          161  8.32
 9 00-0029604 K.Cousins          556  8.23
10 00-0034857 J.Allen            708  8.22
11 00-0031280 D.Carr             676  8.13
12 00-0031237 T.Bridgewater      426  8.04
13 00-0019596 T.Brady            808  7.94
14 00-0035228 K.Murray           515  7.94
15 00-0036971 T.Lawrence         598  7.91
16 00-0036972 M.Jones            557  7.90
17 00-0033077 D.Prescott         638  7.81
18 00-0036442 J.Burrow           659  7.75
19 00-0023459 A.Rodgers          556  7.73
20 00-0031800 T.Heinicke         491  7.69
21 00-0035993 T.Huntley          185  7.68
22 00-0032950 C.Wentz            516  7.64
23 00-0029701 R.Tannehill        554  7.61
24 00-0037013 Z.Wilson           382  7.57
25 00-0036355 J.Herbert          671  7.55
26 00-0033119 J.Brissett         224  7.55
27 00-0033357 T.Hill             132  7.44
28 00-0028118 T.Taylor           149  7.43
29 00-0030520 M.Glennon          164  7.38
30 00-0035710 D.Jones            360  7.34
31 00-0036898 D.Mills            392  7.32
32 00-0031345 J.Garoppolo        511  7.31
33 00-0034869 S.Darnold          405  7.26
34 00-0026143 M.Ryan             559  7.16
35 00-0032156 T.Siemian          187  7.13
36 00-0036212 T.Tagovailoa       387  7.10
37 00-0033873 P.Mahomes          780  7.08
38 00-0027973 A.Dalton           235  6.99
39 00-0027939 C.Newton           126  6.97
40 00-0022924 B.Roethlisberger   647  6.76
41 00-0033106 J.Goff             489  6.44
42 00-0034401 M.White            132  5.89
```
The `adot` value, a commonly used measure of quarterback aggressiveness, gives a quantitative approach to ranking quarterbacks by their aggression, as measured by mean air yards per pass (can you think of other ways to measure aggressiveness that pass depth alone leaves out?). Look at the results and think: do they make sense to you, or are you surprised, given your personal opinions of quarterbacks?
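As one possible answer to that question, here is a hedged sketch of an alternative measure: the share of a passer’s attempts thrown 15 or more air yards (a deep-ball rate), reusing `pbp_r_p` from above. The 15-yard threshold is our arbitrary choice, not an official definition:

```r
## R: an alternative aggressiveness measure (a sketch); the 15-yard
## threshold for a "deep" pass is an arbitrary choice
pbp_r_p |>
  group_by(passer_id, passer) |>
  summarize(n = n(),
            deep_rate = mean(air_yards >= 15)) |>
  filter(n >= 100 & !is.na(passer)) |>
  arrange(-deep_rate) |>
  print(n = Inf)
```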
nfl_data_py in Python
In Python, the `nfl_data_py` package by Cooper Adams exists as a clone of the R `nflfastR` package for data. To use the data from this package, first import the `pandas` package with the alias (or short nickname) `pd` for working with data, and import the `nfl_data_py` package as `nfl`:

```python
## Python
import pandas as pd
import nfl_data_py as nfl
```
Next, tell Python to import the data for 2021 (Chapter 2 shows how to import multiple years). Note that you need to include the year in a Python list as `[2021]`:

```python
## Python
pbp_py = nfl.import_pbp_data([2021])
```
As with the R code, filter the data in Python (pandas calls filtering a query). Python allows you to readily pass the filter criteria (`filter_crit`) into `query()` as an object, and we have you do this to save line space. Then group by `passer_id` and `passer` before aggregating the data by using a Python dictionary (`dict()`, or `{}` for short) with the `.agg()` function:
```python
## Python
filter_crit = 'play_type == "pass" & air_yards.notnull()'
pbp_py_p = (
    pbp_py.query(filter_crit)
    .groupby(["passer_id", "passer"])
    .agg({"air_yards": ["count", "mean"]})
)
```
The pandas package also requires reformatting the column names via the `list()` function and collapsing the header from two rows to a single row via `map()`. Next, filter to passers with more than 100 recorded attempts by using `query()`, sort by the mean of the air yards with `sort_values()`, and print the outputs (`to_string()` allows all the outputs to be printed):
```python
## Python
pbp_py_p.columns = list(map("_".join, pbp_py_p.columns.values))

sort_crit = "air_yards_count > 100"
print(
    pbp_py_p.query(sort_crit)
    .sort_values(by=["air_yards_mean"], ascending=[False])
    .to_string()
)
```
This results in:
```
                             air_yards_count  air_yards_mean
passer_id  passer
00-0035704 D.Lock                        110       10.154545
00-0029263 R.Wilson                      400        9.887500
00-0036945 J.Fields                      268        9.835821
00-0034796 L.Jackson                     378        9.341270
00-0036389 J.Hurts                       473        9.190275
00-0034855 B.Mayfield                    416        8.776442
00-0026498 M.Stafford                    740        8.508108
00-0031503 J.Winston                     161        8.322981
00-0029604 K.Cousins                     556        8.228417
00-0034857 J.Allen                       708        8.224576
00-0031280 D.Carr                        676        8.128698
00-0031237 T.Bridgewater                 426        8.037559
00-0019596 T.Brady                       808        7.941832
00-0035228 K.Murray                      515        7.941748
00-0036971 T.Lawrence                    598        7.913043
00-0036972 M.Jones                       557        7.901257
00-0033077 D.Prescott                    638        7.811912
00-0036442 J.Burrow                      659        7.745068
00-0023459 A.Rodgers                     556        7.730216
00-0031800 T.Heinicke                    491        7.692464
00-0035993 T.Huntley                     185        7.675676
00-0032950 C.Wentz                       516        7.641473
00-0029701 R.Tannehill                   554        7.606498
00-0037013 Z.Wilson                      382        7.565445
00-0036355 J.Herbert                     671        7.554396
00-0033119 J.Brissett                    224        7.549107
00-0033357 T.Hill                        132        7.439394
00-0028118 T.Taylor                      149        7.429530
00-0030520 M.Glennon                     164        7.378049
00-0035710 D.Jones                       360        7.344444
00-0036898 D.Mills                       392        7.318878
00-0031345 J.Garoppolo                   511        7.305284
00-0034869 S.Darnold                     405        7.259259
00-0026143 M.Ryan                        559        7.159213
00-0032156 T.Siemian                     187        7.133690
00-0036212 T.Tagovailoa                  387        7.103359
00-0033873 P.Mahomes                     780        7.075641
00-0027973 A.Dalton                      235        6.987234
00-0027939 C.Newton                      126        6.968254
00-0022924 B.Roethlisberger              647        6.761978
00-0033106 J.Goff                        489        6.441718
00-0034401 M.White                       132        5.886364
```
Hopefully, this chapter whetted your appetite for using math to examine football data. We glossed over some of the many topics you will learn about in future chapters, such as sorting, summarizing, and cleaning data. You have also had a chance to compare Python and R for basic tasks for working with data. Appendix B also dives deeper into the air-yards data to cover basic statistics and data wrangling.
Data Science Tools Used in This Chapter
This chapter covered the following topics:
- Obtaining data from one season by using the `nflfastR` package, either directly in R or via the `nfl_data_py` package in Python
- Using `filter()` in R or `query()` in Python to select and create a subset of data for analysis
- Aggregating data by groups, using `summarize()` with the help of `group_by()` in R, and `agg()` with the help of `groupby()` in Python
- Printing dataframe outputs to your screen to help you look at data
- Removing missing data by using `is.na()` in R or `notnull()` in Python
Suggested Readings
If you get really interested in analytics without the programming, here are some sources we read to develop our philosophy and strategies for football analytics:
- *The Hidden Game of Football: A Revolutionary Approach to the Game and Its Statistics* by Bob Carroll et al. (University of Chicago Press, 2023). Originally published in 1988, this cult classic introduces the numerous ideas that were later formulated into the cornerstone of what has become modern football analytics.
- *Moneyball: The Art of Winning an Unfair Game* by Michael Lewis (W.W. Norton & Company, 2003). Lewis describes the rise of analytics in baseball and shows how the stage was set for other sports. The book helps us think about how modeling and data can help guide sports. A movie was made of this book as well.
- *The Signal and the Noise: Why So Many Predictions Fail, but Some Don’t* by Nate Silver (Penguin, 2012). Silver describes why models work in some instances and fail in others. He draws upon his experience with poker, baseball analytics, and running the political prediction website FiveThirtyEight. The book does a good job of showing how to think quantitatively for big-picture problems without getting bogged down by the details.
Lastly, we encourage you to read the documentation for the `nflfastR` package. Diving into this package will help you better understand much of the data used in this book.