Analyze Baseball with R

A short introduction to the R language and environment.

R is a terrific piece of software: it’s stable, powerful, and, once you learn the basics, easy to use. It handles everything from simple calculations and charts to complex visualizations and statistical models. This hack gives you enough of an overview to start on studies that would be difficult or impossible to do in a tool like Excel.

Let’s start by taking a look at the R environment. R includes a toolbar with some commonly used operations; a console window; and windows showing graphical output, help, edit windows, and other results. See Figure 4-2 for an illustration. The R environment looks a little different on Mac OS, Linux, and other Unix variants, but the language and tools are the same.

Notice the window with the > prompts and the messages. This is the console window. This is the primary way you communicate with R. Just type an expression in the window and press Return; R responds with results and errors when appropriate.

The GUI includes a lot of familiar operations: you can save and load files; cut, copy, and paste things; and get help. The most interesting feature is the packages menu. R packages are similar to browser plug-ins because they extend the functionality of R. The GUI lets you load packages that are stored locally, or install and update packages from the Internet.

Calculations in R

Let’s start with a few simple examples. You can type mathematical expressions in R, and R will return the results.

	> 5
	[1] 5
	> 5/6
	[1] 0.8333333
	> 1 == 2
	[1] FALSE
	> 2^3 + (4 * 5)
	[1] 28

(Notice the two equals signs [==] in the third expression. In R, == tests whether two values are equal, so that expression asks “does 1 equal 2?” The answer is, of course, FALSE.)

Assignment in R

Everything you type in R returns a result of some sort. You can give the answer a name and refer to it later. This process is called assignment. The named object is called a variable. To do this, type some_name <- some_value. To see the value a variable is assigned, just type its name. Here is an example of assigning and using variables:

	> earned_runs <- 5
	> innings <- 7
	> ERA <- earned_runs / innings * 9
	> ERA
	[1] 6.428571
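One difference from a spreadsheet is worth seeing in action: a variable holds the value that was calculated at the time of assignment, not the formula. The following is a minimal sketch reusing the numbers above; the name ERA_before_recalc is made up for illustration.

```r
earned_runs <- 5
innings <- 7
ERA <- earned_runs / innings * 9

# Changing an input afterward does not update ERA;
# ERA still holds the result computed with innings = 7
innings <- 9
ERA_before_recalc <- ERA   # hypothetical name, used only to keep the old value

# Re-run the calculation to pick up the new number of innings
ERA <- earned_runs / innings * 9

print(ERA_before_recalc)   # the old value, based on 7 innings
print(ERA)                 # the new value, based on 9 innings
```

If you are used to Excel cells recalculating automatically, this is the habit to unlearn: in R you rerun the expression yourself.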

Arrays

Often, you want to do a calculation with several values at once. You can group a set of values into an ordered collection (called a vector in R, and also known as an array or column) and use it just as you would use a single value. R applies the calculation to each element in turn:

	> strikeouts <- c(290, 265, 264, 251, 239)
	> inningspitched <- c(245.6666, 228, 237, 225, 196)
	> strikeouts_perinning <- strikeouts / inningspitched
	> strikeouts_perinning
	[1] 1.180462 1.162281 1.113924 1.115556 1.219388

You can also mix arrays with single values:

	> strikeouts_per_nine <- strikeouts_perinning * 9
	> strikeouts_per_nine
	[1] 10.62415 10.46053 10.02532 10.04000 10.97449

Oh, by the way, R also supports strings (values made up of characters, such as a name), not just numbers:

	> players <- c("R Johnson", "J Santana", "B Sheets", "J Schmidt", "O Perez")
	> players
	[1] "R Johnson" "J Santana" "B Sheets"  "J Schmidt" "O Perez"
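Once the numbers and the names are in parallel vectors, you can use one to look things up in the other. The sketch below assumes the vectors defined above; the names first_player, best, best_player, and over_ten_and_a_half are made up for illustration.

```r
strikeouts <- c(290, 265, 264, 251, 239)
inningspitched <- c(245.6666, 228, 237, 225, 196)
players <- c("R Johnson", "J Santana", "B Sheets", "J Schmidt", "O Perez")

strikeouts_per_nine <- strikeouts / inningspitched * 9

# Square brackets pick out elements by position (positions start at 1)
first_player <- players[1]

# which.max() returns the position of the largest value
best <- which.max(strikeouts_per_nine)
best_player <- players[best]

# A comparison on a vector gives one TRUE/FALSE per element,
# and that result can be used inside the brackets as a filter
over_ten_and_a_half <- players[strikeouts_per_nine > 10.5]

print(best_player)
print(over_ten_and_a_half)
```

This works only because both vectors list the pitchers in the same order, which is one reason to move on to data frames.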

Data Frames

Suppose that you have several columns of associated information—say, a table of data. R includes a data frame that lets you store and manipulate a table of values. Data frames are similar to spreadsheets (though more like database tables) because they let you organize and group information into tables. Here is how you define a data frame from a set of columns:

	> earned_runs <- c(71, 66, 71, 80, 65)
	> strikeout_leaders <- data.frame(players, earned_runs, strikeouts, inningspitched)
	> strikeout_leaders
	  players  earned_runs strikeouts inningspitched
	1 R Johnson         71        290       245.6666
	2 J Santana         66        265       228.0000
	3 B Sheets          71        264       237.0000
	4 J Schmidt         80        251       225.0000
	5 O Perez           65        239       196.0000

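You can refer to specific vectors within a data frame by name, using the $ operator:

	> strikeout_leaders$players
	[1] R Johnson J Santana B Sheets  J Schmidt O Perez
	Levels: B Sheets J Santana J Schmidt O Perez R Johnson

The Levels line appears because the version of R used here stored text columns as factors (a type for categorical values). Since R 4.0, data.frame() keeps text columns as plain strings unless you pass stringsAsFactors=TRUE, so on a current installation you will see the names in quotes and no Levels line.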
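The $ operator also works on the left side of an assignment, which gives you a way to add a calculated column. The following is a small sketch using the data frame above; the column name ERA is my own choice, not something defined elsewhere in this hack.

```r
players <- c("R Johnson", "J Santana", "B Sheets", "J Schmidt", "O Perez")
earned_runs <- c(71, 66, 71, 80, 65)
strikeouts <- c(290, 265, 264, 251, 239)
inningspitched <- c(245.6666, 228, 237, 225, 196)
strikeout_leaders <- data.frame(players, earned_runs, strikeouts, inningspitched)

# Assigning to a column name that does not exist yet adds a new column.
# The arithmetic is done row by row, just as with plain vectors.
strikeout_leaders$ERA <-
  strikeout_leaders$earned_runs / strikeout_leaders$inningspitched * 9

print(strikeout_leaders)
```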

Comments

Comments allow you to leave notes for yourself and others about what the program is doing. Comments start with a hash (#) sign and run to the end of the line:

	> # copy strikeout leaders to so_leaders and change names
	> # of columns to abbreviations
	> so_leaders <- strikeout_leaders
	> names(so_leaders) <- c('NAME', 'ER', 'SO', 'IP')

Functions

R contains many built-in functions. Each function call is an expression of the form f(a, b, c,…). The items between the parentheses (a, b, c,…) are the arguments to the function. Here are some simple examples:

	> # the cosine function
	> cos(0)
	[1] 1
	> # the exp(x) function, which returns e^x
	> exp(1)
	[1] 2.718282
	> # the (natural) log function
	> log(exp(7))
	[1] 7

Some functions in R accept different numbers of arguments, and you can pass arguments by name to make clear what each one means:

	> log(x=1000, base=10)
	[1] 3
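To make the argument rules concrete, here are four ways of calling log() side by side. This is just a sketch; the variable names on the left are made up for illustration.

```r
# Arguments given by name, as in the example above
by_name <- log(x = 1000, base = 10)

# The same call with arguments given by position (x first, then base)
by_position <- log(1000, 10)

# Named arguments can appear in any order
reordered <- log(base = 10, x = 1000)

# Leaving out base falls back to the default, the natural logarithm
natural <- log(1000)

print(c(by_name, by_position, reordered))
print(natural)
```

The first three calls give the same answer; only the last differs, because it uses a different base.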

Other functions in R can open windows showing graphics or other information. For example, a convenient tool for editing the contents of a data frame (or just looking at what it contains) is the edit() function. Here is an example of how to use this function:

	> strikeout_leaders_edited <- edit(strikeout_leaders)

Notice that this function does not change the original strikeout_leaders data frame but returns a result assigned to strikeout_leaders_edited.

Some functions do different things with different types of arguments. One example is the summary() function, which returns statistical summary information about an object. It returns different results for columns and data frames.

	> summary(earned_runs)
	  Min. 1st Qu.  Median   Mean 3rd Qu.    Max.
	  65.0    66.0    71.0   70.6    71.0    80.0
	> summary(strikeout_leaders)
	  players    earned_runs     strikeouts       inningspitched
	B Sheets :1  Min.   :65.0    Min.   :239.0    Min.   :196.0
	J Santana:1  1st Qu.:66.0    1st Qu.:251.0    1st Qu.:225.0
	J Schmidt:1  Median :71.0    Median :264.0    Median :228.0
	O Perez  :1  Mean   :70.6    Mean   :261.8    Mean   :226.3
	R Johnson:1  3rd Qu.:71.0    3rd Qu.:265.0    3rd Qu.:237.0
	             Max.   :80.0    Max.   :290.0    Max.   :245.7
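R decides which version of summary() to run by looking at the class of the object you hand it. The sketch below uses class() to show what R sees in each case; col_summary and df_summary are names made up for illustration.

```r
players <- c("R Johnson", "J Santana", "B Sheets", "J Schmidt", "O Perez")
earned_runs <- c(71, 66, 71, 80, 65)
strikeouts <- c(290, 265, 264, 251, 239)
inningspitched <- c(245.6666, 228, 237, 225, 196)
strikeout_leaders <- data.frame(players, earned_runs, strikeouts, inningspitched)

# class() reports what kind of object something is
print(class(strikeout_leaders$strikeouts))   # "numeric"
print(class(strikeout_leaders))              # "data.frame"

# The same function name, but a different result for each kind of object
col_summary <- summary(strikeout_leaders$strikeouts)   # six numbers for one column
df_summary <- summary(strikeout_leaders)               # a table, one column per variable

# Individual values can be pulled out of a column summary by name
print(col_summary[["Median"]])
```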

Moreover, some functions have side effects (that is, they do more than just return a value). For example, the edit() function opens a window that allows a value (such as a data frame) to be edited. The plot() function prints a graph to a separate window.

We will use only a few functions in this book. Here is a short table of the most useful functions and the most common arguments:

Table 4-1.

Function: help()
Arguments: topic
Description: Returns a description of the function.
Example: help(summary)

Function: summary()
Arguments: object, [optional args]
Description: Returns statistical summary information about an object. (See help(summary) for more details.)
Example: summary(earned_runs)

Function: subset()
Arguments: x, subset, select
Description: Returns a subset of an object, such as a data frame. x is the object to subset, subset is the description of which rows to keep, and select is an (optional) list of columns to keep.
Example: subset(earned_runs, inningspitched > 200)

Function: read.table()
Arguments: file, header, sep, col.names, [more args]
Description: Reads values from a text file into a data frame. header says whether the first line of the file holds the column names. Values are separated by the value sep. col.names contains a list of column names.
Example: batting <- read.table("batting.csv", header=TRUE, sep=",")

Function: merge()
Arguments: x, y, by
Description: Merges rows from two data frames (x and y) into a single data frame wherever the values in the column named by by match.
Example: plyrstats <- merge(batting, fielding, by="playerID")
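Here is a short sketch that chains three of these functions together. To keep it self-contained, it reads from a block of inline text rather than a real file such as batting.csv (read.table() accepts a text argument for this), and it reuses the pitching numbers from earlier in this hack. The names csv_text, pitching, runs, combined, and workhorses are made up for illustration.

```r
# Inline text standing in for the contents of a comma-separated file
csv_text <- "players,strikeouts,inningspitched
R Johnson,290,245.6666
J Santana,265,228
B Sheets,264,237
J Schmidt,251,225
O Perez,239,196"

# header = TRUE: the first line holds column names; sep = ",": values are comma separated
pitching <- read.table(text = csv_text, header = TRUE, sep = ",")

# A second data frame that shares the players column
runs <- data.frame(
  players = c("R Johnson", "J Santana", "B Sheets", "J Schmidt", "O Perez"),
  earned_runs = c(71, 66, 71, 80, 65)
)

# Match up rows from the two data frames wherever players agrees
combined <- merge(pitching, runs, by = "players")

# Keep only pitchers with more than 200 innings, and only two of the columns
workhorses <- subset(combined, inningspitched > 200,
                     select = c(players, earned_runs))

print(combined)
print(workhorses)
```

Note that merge() sorts the result by the matching column by default, so the rows of combined come out in alphabetical order by player rather than in the original order.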

Graphics in R

One of R’s best features is its support for many different types of plots. Here is a simple example of a plot, using the data we defined earlier:

	> barplot(strikeouts, names.arg=players)
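Most plotting functions take extra arguments for titles and axis labels. The following sketch dresses up the bar plot a little, using the strikeouts-per-nine-innings figures calculated earlier; the title, axis label, and axis range are my own choices, and bar_midpoints is a name made up for illustration.

```r
players <- c("R Johnson", "J Santana", "B Sheets", "J Schmidt", "O Perez")
strikeouts <- c(290, 265, 264, 251, 239)
inningspitched <- c(245.6666, 228, 237, 225, 196)
strikeouts_per_nine <- strikeouts / inningspitched * 9

# barplot() draws the chart and also returns the horizontal position of each bar
bar_midpoints <- barplot(strikeouts_per_nine,
                         names.arg = players,
                         main = "Strikeouts per nine innings",
                         ylab = "SO per 9 IP",
                         ylim = c(0, 12))

# Use those positions to write each value just above its bar (pos = 3 means "above")
text(bar_midpoints, strikeouts_per_nine,
     labels = round(strikeouts_per_nine, 2), pos = 3)
```

When run from the R console, the chart appears in a graphics window; when run as a script without a display, R writes it to a file instead.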

For some cool demonstrations of the graphics functions in R, try typing this:

	> demo(graphics)

Hacking the Hack

If you’re new to R, you might find the default environment a bit daunting. John Fox, a statistics professor at McMaster University, developed a tool called R Commander to help his students use R without too much of a programming background. I recommend this tool for learning R because it shows you the R commands that it generates in one window and the output in other windows. For an illustration, see Figure 4-4.

Figure 4-4. R Commander

You can get more information about the R Commander package from http://socserv.mcmaster.ca/jfox/Misc/Rcmdr. This tool is available as an R package called Rcmdr. For instructions on installing and loading R packages, see “Get R and R Packages” [Hack #31]. R Commander works on Windows, but it might not run on other platforms. (While I was writing this book, I tried testing it on Mac OS X and could not get it to run.)

See Also

R includes lots of other features and functions. For more information about the R language and system, see http://cran.r-project.org/manuals.html.

R is based on the S language, which was developed at Bell Labs in the 1980s and has since been purchased by Insightful Corporation. Insightful offers a commercial version of the S language, called S-Plus, that includes many more modeling, graphing, and analysis features. It also includes a very easy-to-navigate GUI. The only bad thing about S-Plus is the cost: it’s a very expensive piece of software, unless you are a student or educator.
