Chapter 1. R Basics
Downloading the Software
The first thing you will need to do is download the free R software and install it on your computer. Start your computer, open your web browser, and navigate to the R Project for Statistical Computing at http://www.r-project.org. Click “download R” and then choose one of the mirror sites close to you. (The R software is stored on many computers around the world, not just one. Because they all contain the same files, and they all look the same, they are called “mirror” sites. You can choose any one of those computers.) Click the site address and a page will open from which you can select the version of R that will run on your computer’s operating system. If your computer can run the latest version of R—3.0 or higher—that is best. However, if your computer is several years old and cannot run the most up-to-date version, get the latest one that your computer can run. There might be a few small differences from the examples in this book, but most things should work.
Follow the instructions and you should have R installed in a short time. This is base R, but there are thousands (this is not an exaggeration) of add-on “packages” that you can download for free to expand the functionality of your R installation. Depending on your particular needs, you might not add any of these, but you might be delightfully surprised to discover that there are capabilities you could not have imagined and now absolutely must have.
Try Some Simple Tasks
If you are using Windows or OS X, you can click the “R” icon on your desktop to start R, or, on Linux or OS X, you can start by typing R as a command in a terminal window. This will open the console. This is a window in which you type commands and see the results of many of those commands, although commands to create graphs will, in most cases, open a new window for the resulting graph. R displays a prompt, the greater-than symbol (>), when it is ready to accept a command from you. The simplest use of R is as a calculator. So, after the prompt, type a mathematical expression to which you want an answer:
> 12/4
[1] 3
>
Here, we asked for “12 divided by 4.” R responded with “3,” and then displayed another prompt, showing that it is ready for the next problem. The [1] before the answer is an index. In this case, it just shows that the answer begins with the first number in a vector. There is only one number in this example, but sometimes there will be multiple numbers, so it is helpful to know where the set of numbers begins. If you do not understand the index, do not worry about it for now; it will become clearer after seeing more examples. The division sign (/) is called an operator. Table 1-1 presents the symbols for standard arithmetic operators.
Operator | Operation | Example |
---|---|---|
+ | Addition | 3 + 4 = 7 or 3+4 (i.e., with no spaces) |
- | Subtraction | 5 - 2 = 3 |
* | Multiplication | 100*2.5 = 250 |
/ | Division | 20/5 = 4 |
^ or ** | Exponent | 3^2 = 9 or 3**2 = 9 |
%% | Remainder of division | 5 %% 2 = 1 (5/2 = 2 with remainder of 1) |
%/% | Divide and round down | 5 %/% 2 = 2 (5/2 = 2.5, round down, = 2) |
You can use parentheses as in ordinary arithmetic, to show the order in which operations are performed:
> (4/2)+1
[1] 3
> 4/(2+1)
[1] 1.333333
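A few more expressions, offered here as a quick sketch rather than as part of the original examples, show how the operators in Table 1-1 combine. Exponentiation is done before multiplication and division, which are done before addition and subtraction; parentheses override that order.

```r
# Remainder and integer division are two halves of the same division problem
17 %% 5         # remainder: 2
17 %/% 5        # whole-number quotient: 3

# Order of operations: ^ first, then * and /, then + and -
2 + 3 * 4^2     # 4^2 = 16, 3 * 16 = 48, 2 + 48 = 50
(2 + 3) * 4^2   # parentheses force the addition first: 5 * 16 = 80

# ^ and ** give the same result
3^2 == 3**2     # TRUE
```

When in doubt about the order, adding parentheses costs nothing and makes the intent obvious.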
Try another problem:
> sqrt(57)
[1] 7.549834
This time, arithmetic was done with a function; in this case, sqrt(). Table 1-2 lists some commonly used arithmetic functions.
Function | Operation |
---|---|
cos() | Cosine |
sin() | Sine |
tan() | Tangent |
sqrt() | Square root |
log() | Natural logarithm |
exp() | Exponential, inverse of natural logarithm |
sum() | Sum (i.e., total) |
mean() | Mean (i.e., average) |
median() | Median (i.e., the middle value) |
min() | Minimum |
max() | Maximum |
var() | Variance |
sd() | Standard deviation |
The functions take arguments. An argument is a sort of modifier that you use with a function to make more specific requests of R. So, rather than simply requesting a sum, you might request the sum of particular numbers; or rather than simply drawing a line on a graph, you might use an argument to specify the color of the line or the width. The argument, or arguments, must be in parentheses after the function name. If you need help in using a function—or any R command—you can ask for assistance:
> help(sum)
R will open a new window with information about the specified function and its arguments. Here is a shortcut to get exactly the same response:
> ?sum
Be aware that R is case sensitive, so “help” and “Help” are not equivalent! Spaces, however, are not relevant, so the preceding command could just as well be the following:
> ? sum
Sometimes, as in the sqrt() example, there is only one argument. Other times, a function operates on a group of numbers, called a vector, as shown here:
> sum(3,2,1,4)
[1] 10
In this case, the sum() function found the total of the numbers 3, 2, 1, and 4. You cannot always type all of the numbers directly into a function call, as in the preceding example. Usually you will need to create the vector first. Try this:
> x1 <- c(1,2,3,4)
After you enter this command, nothing happens! Actually, nothing happens that you can see. Any time the special operator made of the two symbols, < and -, appears, the name to the left of this operator is given the value of the expression to the right of the operator. (Newer versions of R allow the use of one symbol, =, to accomplish the same thing. After Chapter 1, we will use the simpler form as well.) In this case, a new vector was created, which the user called x1. R is an object-oriented language, and the vector x1 is an object in your workspace.
Creating a new vector requires typing the letter “c” in front of the parenthesis preceding the numbers in the vector. See what happens when you type the following:
> x1
The set of numbers 1, 2, 3, 4 has been saved with a name of x1. Typing the name of the vector instructs R to print the values of x1. You can ask R to do various kinds of operations on that vector at any time. For example, the command:
> mean(x1)
returns, as evidenced by printing to the screen, the mean, or average, of the numbers in the vector x1. Try using some of the other functions in Table 1-2 to see some other things R can do.
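As a starting point for that experiment, here is a short sketch that applies several of the Table 1-2 functions to the same four numbers. The comments give the values you should expect for this particular vector.

```r
x1 <- c(1, 2, 3, 4)

sum(x1)      # total of the four values: 10
mean(x1)     # average: 2.5
median(x1)   # middle value (halfway between 2 and 3): 2.5
min(x1)      # smallest value: 1
max(x1)      # largest value: 4
var(x1)      # sample variance
sd(x1)       # sample standard deviation, the square root of var(x1)

# sum() accepts either separate numbers or a single vector
sum(3, 2, 1, 4) == sum(c(3, 2, 1, 4))   # TRUE
```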
Create another object, this time a single number:
> pi <- 3.14
At any time, you can get a list of all the objects presently in your workspace by using the following command:
> ls()
And, you can use any or all of the objects in a new computation:
> newvar <- pi*x1
This creates yet another object named newvar.
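It can help to look at what such a computation actually produces. The sketch below multiplies every element of the vector by a single number. One caution: R already ships with a built-in constant called pi, so an object of the same name in your workspace hides the built-in value until you remove it. The name my_pi is a hypothetical choice for this illustration, not something used elsewhere in the book.

```r
x1 <- c(1, 2, 3, 4)

# A hypothetical name such as my_pi leaves R's built-in pi untouched
my_pi <- 3.14
newvar <- my_pi * x1   # each element of x1 is multiplied by 3.14
newvar                 # 3.14 6.28 9.42 12.56

# If pi has been reassigned in the workspace, removing that copy makes the
# built-in constant visible again
pi <- 3.14
rm(pi)
pi                     # back to the built-in value
```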
User Interface
The examples you have seen so far are all command-line instructions. In other words, you directed R what to do by typing command words. This is not the only way to interface with R. The basic installation of R has some graphical user interface (GUI, pronounced “GOO-ee”) capabilities, too. The GUI refers to the point-and-click interface that you have probably come to appreciate with other applications you use. The problem is that each of the types of installation—Windows, OS X, and Linux—has somewhat different GUI capabilities. OS X is a little “GUI-er” than the others, and you may quickly decide that you prefer to issue a lot of commands this way. Whichever operating system you are using has a menu at the top of the console window. Before you enter important data, experiment a little to see what point-and-click commands you can use.
This book uses the command-line interface because it is the same for all three versions of R—Windows, OS X, and Linux—so only one explanation is necessary, and you can easily move from one computer to another. Listing code—that is, a set of command lines—is far easier and terser than trying to explain every menu choice and mouse click. Further, learning R this way helps you to understand the logic of the software a little better. Finally, the command language is more precise than point-and-click direction and affords the user greater control and power.
Installing a Package: A GUI Interface
No matter which operating system you are using, you can download a free “frontend” program that will provide a GUI for you. There are several available. After you have learned a little more about R, and appreciate its considerable usefulness, you might be ready to try one of these GUI interfaces. For example, earlier I mentioned that a large number of packages are available that you can add to R; one of them is a well-designed GUI called “R Commander.” If you are connected to the Internet, try the following command:
> install.packages("Rcmdr", dependencies=TRUE)
R will download this package and any other packages that are necessary to make R Commander work. The packages will be permanently saved on your computer, so you will not need to install them again. Every time you open R, if you want to use R Commander, you will need to load the package this way:
> library(Rcmdr)
We are all different. For some of us, the command language is great. Others, who dislike R’s command-line interface, might find R Commander just the thing to make R their favorite computer tool. You can produce many of the graphs in this book by using R Commander, but you can’t produce all of them. If you want to try R Commander, you can find additional information in Appendix C.
To retrieve a complete list of the packages available, use this command:
> available.packages()
You can learn a lot more about these packages, by topic, from CRAN Task Views at http://cran.r-project.org/web/views/.
You can see a list of all packages, by name, by going to http://cran.r-project.org/web/packages/available_packages_by_name.html.
To get help on the package you just downloaded, type the following:
> library(help=Rcmdr)
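Before installing anything, you might want to check whether a package is already on your computer. The following is a minimal sketch; the helper name is_installed is made up for this illustration, and it relies only on the installed.packages() function from base R.

```r
# Hypothetical helper: TRUE if the named package is installed, FALSE otherwise
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

is_installed("stats")   # TRUE: stats ships with every R installation

# Print a reminder only when R Commander is missing
if (!is_installed("Rcmdr")) {
  message("Rcmdr is not installed yet; use install.packages() to get it")
}
```

Checking first avoids downloading a large package, and everything it depends on, a second time.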
Data Structures
You can put data into objects that are organized or “structured” in various ways. We have already worked with one type of structure, the vector. You can think of a vector as one-dimensional: a row of elements or a column of elements. A vector can contain any number of elements, from one to as high a number as your computer’s memory can hold. The elements in a vector can be of type numeric; character, with alphabetic, numeric, and special characters; or logical, containing TRUE or FALSE values. All of the elements of a vector must be of the same type. Here are some examples of vector creation:
> x <- c(14,6.7,5.1,-8)                    #numeric
> name <- c("Lou","Mary","Rhoda","Ted")    #character/quotes needed
> test <- c(TRUE,TRUE,TRUE,FALSE,TRUE)     #logical/caps needed
Note
Anything that appears after the octothorpe (#) character is a comment. This is information or notes intended for us to read, but it will be ignored by R. (Being a musician, I prefer sharp for this symbol.) It is a good idea to get in the habit of putting comments into code to remind you of why you did a particular thing and help you to fix problems or expand upon a good idea when you come back to your program later. It is also a good idea to read the comments in the R code examples throughout the book.
The data frame is the main kind of structure with which we will work. It is a two-dimensional object, with rows and columns. You can think of it as a box with column vectors in it, or as a rectangular dataset of rows and columns. For better understanding, see the next section on sample datasets and the exercise on reading CO2 emissions data into R. A data frame can include column vectors of all the same type or any combination of types.
R has other structures, such as matrices, arrays, and lists, which will not be discussed here.
You can use the str() function to find out what structure any given object has:
> str(x)
 num [1:4] 14 6.7 5.1 -8
> str(name)
 chr [1:4] "Lou" "Mary" "Rhoda" "Ted"
> str(test)
 logi [1:5] TRUE TRUE TRUE FALSE TRUE
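The rule that all elements of a vector must share one type has a practical consequence: if you mix types, R quietly converts everything to a single common type. The sketch below shows what happens; the object names are illustrative.

```r
mixed <- c(14, "Lou", TRUE)   # numeric, character, and logical together
mixed                         # "14" "Lou" "TRUE": everything became character
class(mixed)                  # "character"

nums_and_logicals <- c(14, TRUE, FALSE)
nums_and_logicals             # 14 1 0: TRUE and FALSE became 1 and 0
class(nums_and_logicals)      # "numeric"
```

If a column of numbers unexpectedly behaves like text, a stray non-numeric entry in the vector is a likely culprit.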
Sample Datasets
The base R package includes some sample datasets that will be helpful to illustrate the graphical tools we will learn about. To see what datasets are available on your computer, type this command:
> data()
Ensure that the empty parentheses follow the command; otherwise, you will not get the expected result. Many more datasets are available. Nearly all additional packages contain sample datasets. To see a description of a particular dataset that has come with base R or that you have downloaded, just use the help command. For instance, to get some information about the airquality dataset, such as a brief description, its source, references, and so on, type:
> ?airquality
Look at the first six observations in the dataset by using the following:
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
This dataset is a data frame. There are 153 rows of data, each row representing air quality measurements (e.g., Ozone, Solar.R, and Wind) taken on one day. The head() command by default prints out the names of the variables followed by the first six rows of data, so that we can see what the data looks like. Had we wanted to see a different number of rows (for example, 25), we could have typed the following:
> head(airquality,25)
Had we wanted to see the last four rows of the dataset, we could have typed this command:
> tail(airquality,4)
Each row has a row number and the values of six variables; that is, six measurements taken on that day. The first row, or first day, has the values 1, 41, 190, 7.4, 67, 5, 1. The values of the first variable, Ozone, for the first six days are 41, 36, 12, 18, NA, 28. This is an example of a rectangular dataset or flat file. Most statistical analysis programs require data to be in this format.
Notice that among the numbers in the dataset, you can see the “NA” entries. This is the standard R notation for “not available” or “missing.” You can handle these values in various ways. One way is to delete the rows with one or more missing values and do the calculation with all the other rows. Another way is to refuse to do the calculation and return an error message. Some procedures offer the user a means to specify which method to use. It is also possible to impute, or estimate, a value for a missing value and use the estimate in a computation. Treatment of missing values is a complex and controversial subject and not to be taken lightly. Kabacoff (2011) has a good introductory chapter on handling missing values in R.
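To make two of these strategies concrete, here is a small sketch using the airquality data. It relies on the na.rm argument of mean() and on the na.omit() function, both part of base R; the object name complete is an illustrative choice. Note that mean() signals the problem by returning NA rather than by stopping with an error.

```r
mean(airquality$Ozone)                 # NA: missing values are not skipped by default
mean(airquality$Ozone, na.rm = TRUE)   # mean of the non-missing Ozone values only

sum(is.na(airquality$Ozone))           # how many days have no Ozone reading

# Keep only the rows that have no missing value in any column
complete <- na.omit(airquality)
nrow(airquality)                       # 153 rows in the full dataset
nrow(complete)                         # fewer rows once incomplete days are dropped
```

Dropping rows is simple, but it discards the valid measurements recorded on those days, which is one reason the choice of method deserves some thought.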
There are two ways to access the data. The first method is to use the attach() command, issue some commands with variable names, and then issue the detach() command, as in the following example:
> attach(airquality)
> table (Temp)       # get counts of Temp values
> mean (Temp)        # find the average Temp
> plot(Wind,Temp)    # make a scatter plot of Wind and Temp
> detach(airquality)
The advantage of this method is that, if you are going to do several steps, it is not necessary to type the dataset name over and over again. The second method is to specify whatever analysis you want by using a combination of the dataset name and variable name, separated by a dollar sign ($). For example, if we wanted to do just this:
> attach(airquality)
> plot(Wind,Temp)
> detach(airquality)
We could use the equivalent code:
> plot(airquality$Wind,airquality$Temp)
The advantage of this method is that if you are calling upon several datasets in quick succession, it is not necessary to use many attach and detach statements.
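For completeness, base R also offers the with() function, which the discussion above does not cover. The sketch below is only a hint at that alternative: it lets you use bare variable names for a single expression without attaching anything.

```r
# with() evaluates one expression using the columns of the data frame
with(airquality, mean(Temp))

# Same result as the dollar-sign form
mean(airquality$Temp)
```

Because nothing stays attached afterward, there is no detach() step to forget.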
The Working Directory
When using R, you will often want to read data from a file into R, or write data from R to a file. For instance, you might have some data that you created using a spreadsheet, a statistical package such as SAS or SPSS, or a text editor, and you want to analyze that data using R. Alternatively, you will often create an R dataset that you want to save and use again. Those files must be stored somewhere in your computer’s file structure. With each read or write operation, it is possible to specify a (frequently long) path to the precise file containing the data you want to read or the place where you will write the data. This can be cumbersome, so R has a working directory, or default location for files. In other words, if you do not instruct R where to find a particular file, it will just assume that you mean it is in the working directory. Likewise, if you do not specify where to save something, R will automatically write it in the working directory. You can find your current working directory with this command:
> getwd()
Suppose that you got the response that follows (your actual result will be quite different, of course!):
[1] "/Users/yourname/Desktop/"
The last folder in the chain (i.e., the last name on the righthand side) is the place where R looks for files and writes files unless you direct it to look elsewhere. You can change the working directory by using the setwd() command. You might want to create a new folder specifically for the use of R, or even specifically for your exercises with this book. Call it something that clearly suggests its purpose, such as “R folder” or “R graphical data.” Assuming you have created a folder called “R things” within the folder “Desktop,” you can then issue the following command:
> setwd("/Users/yourname/Desktop/R things")
From this point on, R will consider the folder “R things” to be your working directory, until the next time you give a setwd() command or shut down R by typing q(), for “quit.” If you do not want to have to set the working directory every time you start R, see the section “Sourcing a Script” to learn how to do this.
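If you would like to experiment with these commands without touching your real folders, the following sketch works inside R's temporary directory. The folder name is taken from the example above; the variable names old_dir and new_dir are illustrative.

```r
old_dir <- getwd()                     # remember where we started

# Create a scratch folder inside R's temporary directory
new_dir <- file.path(tempdir(), "R things")
dir.create(new_dir, showWarnings = FALSE)

setwd(new_dir)                         # make it the working directory
basename(getwd())                      # "R things"

setwd(old_dir)                         # go back to the original location
```

Saving the old location first is a useful habit, because setwd() does not keep track of where you came from.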
Putting Data into R
You now know how to use the sample datasets that come with various R packages. This is a tremendous resource for learning to use R, but you are learning R because you want to do graphical analysis of your own data. The method you choose to put your data into R will depend on several factors:
- How large your dataset is
- Whether the data already exists as a data file in any one of various forms
- How comfortable you are with using tools outside of R to create a file
- How much time you have to devote to data entry
- Your threshold for pain ;)
If you are not especially interested in data entry because you expect to use datasets that have already been created as spreadsheets, statistical package datasets, ASCII files, or other types of data files, you should skim the remainder of this section and consult Appendix E for the data file type of interest.
Typing into a Command Line
The most direct way to enter data into R is to type, from a command line, a statement creating a vector, as you have already done. If your need is to analyze one or a few fairly short vectors, that is probably the easiest thing to do.
Exercise 1-1.
Backblaze, a data backup company, runs about 25,000 disk drives and reports on survival rates (in percent) of hard drives. It showed the following annual survival rates for its drives (read from a graph; source: http://bit.ly/1KVU57t):
year   rate
1      94    # (i.e., after one year, 94% of drives still work)
2      92
3      90
4      80
Create two vectors by using the following commands:
> year <- c(1,2,3,4)
> rate <- c(94,92,90,80)
Be sure that you enter the numbers in the proper order; for example, if 1 is first in the year vector, 94 must be first in the rate vector, and so on. You can examine the relationship of these two vectors by using this command:
> plot(year,rate)
Most graphic commands open a new window. If you have several open applications, you might miss it and be forced to look for it.
The plot statement in the previous code snippet called the plot() function and instructed it to do an analysis on the two arguments, year and rate. The graph we just made is a simple one, but it is possible to make very elaborate graphs with R. The plot on the right side of Figure 1-1 shows a few ways in which you could customize the basic plot. We will examine many such options throughout this book. You can enter the ?plot command to see a long list of available options.
You could combine the two vectors, year and rate, into a new data frame, mydata, as shown here:
> mydata <- data.frame(year, rate)
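Once the data frame exists, a few commands let you confirm that it holds what you expect. The sketch below repeats the two vectors so that it can be run on its own; the use of diff() at the end is an extra illustration, not part of the exercise.

```r
year <- c(1, 2, 3, 4)
rate <- c(94, 92, 90, 80)
mydata <- data.frame(year, rate)

str(mydata)         # 4 observations of 2 numeric variables
mydata$rate         # the rate column: 94 92 90 80
mydata[4, ]         # the fourth row: year 4, rate 80

# Change in survival rate from one year to the next
diff(mydata$rate)   # -2 -2 -10
```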
Using the Data Editor
If your data is just a little more complex or larger, you could use the simple data editor from the R console. Even if you do not enter your data this way, it is a good thing to know about the editor because someday (or maybe lots of days) you might need to fix an occasional problem data point in an object in your R workspace. I suspect that for most people it will be an unnecessary effort to try to use the editor for data entry. Read this section to learn some terms and to see how to save a file. You will probably prefer to use your favorite spreadsheet program for data entry, but you might need to use the editor if you do not have a spreadsheet program. See the section “Reading from an External File” to learn how to read your spreadsheet data into R.
Exercise 1-2
The data presented in Table 1-3 (from the US Energy Information Administration) concerns worldwide carbon dioxide emissions over a recent seven-year period. You will enter it into R by using the built-in data editor, but let us see what is in this dataset first.
Year | North America | Central/South America | Europe | Eurasia | Middle East | Africa | Asia/Oceania |
---|---|---|---|---|---|---|---|
2004 | 16.2 | 2.4 | 7.9 | 8.5 | 7.1 | 1.1 | 2.7 |
2005 | 16.2 | 2.5 | 7.9 | 8.5 | 7.6 | 1.2 | 2.9 |
2006 | 15.9 | 2.5 | 7.9 | 8.7 | 7.7 | 1.1 | 3.1 |
2007 | 15.9 | 2.6 | 7.8 | 8.6 | 7.6 | 1.1 | 3.2 |
2008 | 15.4 | 2.6 | 7.7 | 8.9 | 7.9 | 1.2 | 3.3 |
2009 | 14.2 | 2.6 | 7.1 | 8 | 8.3 | 1.1 | 3.5 |
2010 | 14.5 | 2.7 | 7.2 | 8.4 | 8.4 | 1.1 | 3.6 |
(source: http://1.usa.gov/1R6sj99)
The top row in Table 1-3 is header information, naming each of the variables recorded. Each row contains all the information gathered during one year. Each row is said to be a statistical unit. Social scientists usually call the row a case, whereas natural scientists most often refer to the row as an observation. Computer professionals usually call the row a record. Each of the columns is called a variable, or in the case of computer science, a field. The emissions dataset has seven rows (observations) and eight variables: the year, and the amount of emissions from each of seven regions in the study.
The editor looks like a spreadsheet and has some of the features of a good spreadsheet, but is not as convenient to use as Excel or Numbers. It is also easy to lose your changes if you are not careful. To begin, choose an object name and assign this name to a new data frame. There are several ways to do this. I find the safest way is to name each variable, identify its type, and specify how many rows:
> emissions <- data.frame(Year=numeric(7),N_Amer = numeric(7),
     CS_Amer=numeric(7), Europe=numeric(7),Eurasia=numeric(7),
     Mid_East=numeric(7),Africa=numeric(7),
     Asia_Oceania=numeric(7))
This creates an empty data frame, called emissions (every value is initially 0). To open the editor, call the edit() function and assign its result to an object that will hold the edited data frame:
> emissions <- edit(emissions)
Remember, emissions is empty. By calling the object “emissions” in the preceding command, you are telling R to overwrite the empty data frame with whatever edited data you enter. Enter the data by double-clicking the cell that you want to write/edit. When you are done, click the upper-left corner of the spreadsheet in OS X or the “X” in the upper-right corner in Windows. Do not click Stop, which is on the edit window in OS X or at the top of the screen in Windows. If you click Stop, you will lose any changes. After the data is entered, check carefully to ensure that there are no errors. If you see an error, just double-click the cell that you want to fix and type the corrected number. If necessary, you can use the previous command again to go back to the editor and fix any problems. Save this data frame so that you can use it again later without the need to retype it:
> save (emissions,file="emiss.rda")
The preceding command writes the emissions data frame into a file called emiss.rda in the working directory. You can retrieve the data by using the following command, assuming that you still have the same working directory:
> load("emiss.rda")
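To convince yourself that save() and load() really do round-trip an object, you can try the following sketch. It uses a two-row stand-in built from the first two rows of Table 1-3 and writes to a temporary file rather than to the working directory; the names emiss_demo and path are illustrative.

```r
# A small stand-in for the emissions data frame
emiss_demo <- data.frame(Year = c(2004, 2005), N_Amer = c(16.2, 16.2))

path <- tempfile(fileext = ".rda")   # a throwaway file
save(emiss_demo, file = path)

rm(emiss_demo)                       # the object is gone from the workspace...
exists("emiss_demo")                 # FALSE

load(path)                           # ...and load() restores it under its old name
emiss_demo
```

Notice that load() does not need to be assigned to anything: the object comes back with the name it had when it was saved.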
Reading from an External File
You might already have a favorite tool that you use for data entry; for many people this is a spreadsheet program, but it also could be a text editor. I like Numbers on my Mac, but Excel or another spreadsheet will work just as well. The general approach is to create the file in the spreadsheet program and save it to your working directory. After it’s there, you can read it into R for analysis.
Exercise 1-3
Prolific English composer Edward Elgar (1857–1934) is, perhaps, most famous for two celebrated works: “Pomp and Circumstance,” the processional march for innumerable graduation ceremonies; and the “Enigma Variations,” for symphony orchestra. Although the entire latter work is a popular part of symphony programs, the extraordinarily beautiful “Nimrod” variation is often performed by itself, not only by orchestras, but also by other ensembles (musical groups) or soloists.
One of the most fundamental questions one must ask before performing a musical work is, “What should the tempo be?” In other words, “How fast should it be played?” Although the composer usually gives an indication, some works have received a wide range of interpretations, even among the most highly regarded musicians. Learning how other musicians perform the work can be quite instructive to someone planning her own performance. The “Nimrod” tempo data presented in Table 1-4 comes from a number of recorded performances that were available on YouTube on November 9, 2013.
Performer | Medium | Time | Level |
---|---|---|---|
Barenboim–Chicago SO | so | 240 | p |
Solti–London Phil | so | 204 | p |
Davis | so | 270 | p |
Remembrance2009 | cb | 236 | p |
Belcher | org | 254 | p |
Bish | org | 232 | p |
ColdstrGuards | cb | 239 | p |
Pallhuber–3 Lions BB | bb | 257 | a |
Bernstein–BBC | so | 315 | p |
Dudamel–SBolivarSym | so | 239 | p |
John | org | 252 | p |
Sunshine Brass | bb | 186 | a |
Mahidol Sym Band | cb | 173 | a |
Hills | org | 240 | p |
Grimethorpe CollB | bb | 200 | p |
Barbirolli_Halle O | so | 200 | p |
Stokowski | so | 244 | p |
Boult–London SO | so | 211 | p |
Kindl–Marktoberdorfer BB | bb | 238 | a |
Carter–Charlotte CB | cb | 196 | a |
Cord–IndianaU | bb | 188 | a |
Mack–SUNYFredonia CB | cb | 160 | a |
U Akron CB | cb | 193 | a |
Akron Youth Sym | so | 188 | a |
BP–Ostwestfalen | cb | 198 | a |
Santarsola–MoldovaPO | so | 320 | p |
Klumpp_NWD PO | so | 187 | p |
Burke–MancunianWinds | cb | 257 | a |
US Army Field Band | cb | 235 | p |
EE–Phonograph | so | 186 | p |
Niemczyk–NWC O | so | 169 | a |
Allentoff–Brockport SO | so | 200 | a |
The Nimrod dataset has 32 rows (cases/observations) and 4 columns (variables). This data will become a data frame in R.
Nimrod codebook
All but the simplest datasets need a “codebook,” which offers an explanation of the meaning of each of the values of the variables. The codebook for the Nimrod dataset is as follows:
performer
- Name of both conductor and ensemble, if available. At least one must be available for inclusion in the study.
medium
- bb brass band
- cb concert band
- org organ solo
- so symphony orchestra
time
- Performance time, in seconds, from first note to last note, leaving out announcements, tuning, applause, etc. Proxy measure for tempo; i.e., assumes same tempo throughout.
level (proficiency level of the performers)
- a amateur (or student)
- p professional
The variable time is a quantitative variable; that is, it’s a measurement of an amount. You can use quantitative variables in arithmetic, so one could calculate the sum or the average of the variable time. These are R numeric vectors, as discussed in the section “Data Structures”. All the other variables in this particular dataset are categorical variables; i.e., the observations are assigned to categories. Some people refer to categorical variables as qualitative or nominal variables. These are R character vectors. We cannot calculate the average of medium, because the values bb, cb, and so on are not numbers; calculation does not even make sense. There are some things we can do with categorical variables, though, such as finding the frequency of bb or of cb. We might also use the values of categorical variables to form groups. So, for instance, we might break the dataset into parts, according to the values of level, so we could compare the average time in the amateur group to average time in the professional group.
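The distinction can be tried out on a handful of rows before the full dataset is entered. The sketch below types in six rows from Table 1-4 by hand (Barenboim, Solti, Pallhuber, Sunshine Brass, Mahidol, and Hills) and uses the base R functions table() and tapply(); the object name by_level is illustrative.

```r
# Six rows copied from Table 1-4
medium <- c("so", "so", "bb", "bb", "cb", "org")
time   <- c(240, 204, 257, 186, 173, 240)
level  <- c("p", "p", "a", "a", "a", "p")

mean(time)      # arithmetic works because time is numeric
table(medium)   # frequencies: bb 2, cb 1, org 1, so 2

# Average time within each proficiency level
by_level <- tapply(time, level, mean)
by_level        # a: about 205.3, p: 228
```

With only six rows, these averages say nothing about the full dataset; the point is simply how a categorical variable defines the groups.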
You can enter the data in one of the following ways:
- Type the data into your favorite spreadsheet program and save (export) the spreadsheet to your working directory as a .csv file, with the name Nimrod.Tempo.csv. R can read other file types, but .csv seems to be the easiest and the least prone to error. Then open R and type the following command:
  > Nimrod <- read.csv("Nimrod.Tempo.csv",header=TRUE)
  If you want to read a file that does not have a header, use header=FALSE.
- If you want to read Excel files without converting them to .csv files, there is a package called XLConnect that is meant for exactly this purpose. XLConnect can do many other tasks, such as editing a spreadsheet and writing R data to an Excel file. You will not be able to use this package if you have an old version of R (before version 3.0). The code that follows shows how to read the Nimrod data when it has been saved as an Excel file with the name Nimrod.xls:
  > install.packages ("XLConnect")
  > library (XLConnect)
  > Nimrod2 <- readWorksheetFromFile("Nimrod.xls", sheet = 1, header = TRUE)
  You do not actually need to have Excel installed on your computer to use this package. There are many datasets, freely available from government agencies and sundry other sources, that you can download in Excel format. See Appendix E for more information on this topic. You can copy them and read them into R for your own analysis with XLConnect. This package can read or write .xls or the newer .xlsx formats. You can find complete documentation at http://cran.r-project.org/web/packages/XLConnect/XLConnect.pdf.
- Use a text editor or word processor to create a text file called Nimrod.Tempo.txt that uses spaces as separators between values. The file can be read as follows:
  > Nimrod <- read.table("Nimrod.Tempo.txt", sep = "", header=TRUE)
If you find yourself in a situation that the preceding discussion of methods for putting data into R did not cover, consult the R help file, “R Data Import/Export.” This file is included in the “R Help” that is part of the base R installation. After you have read the data into R using any one of the aforementioned methods, check to see if it worked by using one of the following:
> Nimrod          # types out complete dataset
> head(Nimrod)    # types out first 6 rows
> fix(Nimrod)     # opens Nimrod data in editor
The final option will open the editor (see Figure 1-2) so that you can check the data or change data values, if necessary.
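If you do not have a spreadsheet file handy, you can still practice the .csv route. The sketch below writes three rows from Table 1-4 to a temporary file and reads them back with read.csv(); the file location and the names csv_path and sample_rows are illustrative.

```r
# Write a header line and three data rows to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("performer,medium,time,level",
             "Davis,so,270,p",
             "Belcher,org,254,p",
             "U Akron CB,cb,193,a"),
           csv_path)

# Read the file back into a data frame and check it
sample_rows <- read.csv(csv_path, header = TRUE)
str(sample_rows)         # 3 observations of 4 variables
mean(sample_rows$time)   # (270 + 254 + 193) / 3 = 239
```

Looking at the str() output right after reading a file is a quick way to catch a column that was read as text when you expected numbers.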
You can also give R commands to analyze the data in various ways, such as shown here:
> mean(Nimrod$time)
[1] 222.0938
> table(Nimrod$medium)    # get counts within each medium

 bb  cb org  so
  5   9   4  14
And you can create some cool graphs, which we will get to in due course.
You can also ask R for some general information about the dataset Nimrod:
> summary(Nimrod)
                  performer  medium        time       level
 Akron Youth Sym       : 1   bb : 5   Min.   :160.0   a:13
 Allentoff-Brockport SO: 1   cb : 9   1st Qu.:191.8   p:19
 Barbirolli_Halle O    : 1   org: 4   Median :221.5
 Barenboim-ChicagoSO   : 1   so :14   Mean   :222.1
 Belcher               : 1            3rd Qu.:241.0
 Bernstein-BBC         : 1            Max.   :320.0
 (Other)               :26
Save the Nimrod data frame the same way you did the emissions dataset (but with a different filename, of course), because you will need to retrieve it for a later exercise:
> save (Nimrod,file="Nimrod.rda")
You can retrieve it later by using the load() command:
> load("Nimrod.rda")
You can find more information about reading and importing external files in Appendix E.
Sourcing a Script
Up to this point, we have typed single-line commands. Most of the time, this will work just fine. There might be instances, however, when you want to perform a sequence of commands and repeat the entire sequence. This can get to be quite tedious if the sequence is very long or you want to repeat it many times. Fortunately, R can work with scripts. A script is a list of commands, set up in the order in which you want them to be performed. You can create a script by using a text editor and save it in a file. Then, you can source the script, which means to retrieve the script and execute the saved commands.
To see how this works, imagine that you are updating the Nimrod data on an ongoing basis. You add a few new observations from time to time in an Excel spreadsheet and would like to do some analysis in R to see where things stand with the latest data included. The list of commands for this analysis that follows requires that you have previously installed a couple of packages. If you are not sure of what packages you have installed on your computer, you can find out by using the command:
> installed.packages()
If you do not have gmodels and XLConnect, install them now:
> install.packages("gmodels")
> install.packages("XLConnect")
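The output of installed.packages() can be long, so it may be easier to ask R directly which of the packages you need are missing. The sketch below does this with base R only; missing_pkgs is a hypothetical helper name, not a function supplied by R.

```r
# Return the subset of 'pkgs' that is not yet installed.
# 'missing_pkgs' is a hypothetical helper, not part of base R.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_pkgs(c("gmodels", "XLConnect"))
to_install  # character(0) means both are already installed

# Uncomment to install whatever is missing (requires an Internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```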
Now, here is a list of commands that you might use to carry out this analysis. Note that when we use a block of commands, we will usually not precede each one with the R prompt, > (see footnote 1):
# The following group of commands is a script
library(gmodels)    # required to use the CrossTable command
library(XLConnect)  # must have installed XLConnect
Nimrod2 <- readWorksheetFromFile("Nimrod.xls", sheet = 1, header = TRUE)
attach(Nimrod2)
CrossTable(medium, level, prop.r = FALSE, prop.c = FALSE,
           prop.t = FALSE, prop.chisq = FALSE)
# above command prints table with counts in each cell,
# but no percents
perf_time <- summary(time)  # save summary output
title = "Summary of performance times:"
cat(title, "\n", "\n")      # print title and 2 linefeeds
print(perf_time)            # print results of summary(time)
detach(Nimrod2)
It would be a bit of a bother to key in these exact same commands every time you wanted to see results. So, I recommend that you use an editor to create a file that contains the preceding commands. A text editor is provided in R. In most versions of R, you can access it from the File menu at the upper-left corner of the R console. Choose New Document or New Script to open a text window, and enter the commands. Save the edited script in the working directory, using the name NimTotals.R for this example. Then, use the following command to execute all of the commands in the file:
> source("NimTotals.R")

   Cell Contents
|-------------------------|
|                       N |
|-------------------------|

Total Observations in Table:  32

             | level
      medium |         a |         p | Row Total |
-------------|-----------|-----------|-----------|
          bb |         4 |         1 |         5 |
-------------|-----------|-----------|-----------|
          cb |         6 |         3 |         9 |
-------------|-----------|-----------|-----------|
         org |         0 |         4 |         4 |
-------------|-----------|-----------|-----------|
          so |         3 |        11 |        14 |
-------------|-----------|-----------|-----------|
Column Total |        13 |        19 |        32 |
-------------|-----------|-----------|-----------|

Summary of performance times:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  160.0   191.8   221.5   222.1   241.0   320.0
The CrossTable() command in your script created a contingency table or cross tabulation. At the top of the table is a header row, which includes values for the variable level. The column on the left gives the names of the values for the variable medium. The first row below the header shows information for “bb” (brass bands). There are four brass bands that are “a,” or amateur groups, and one that is “p,” or professional. The column on the right and the row on the bottom give totals for the respective rows or columns. For example, the Row Total column shows that there are five brass bands of all kinds. Statisticians call the totals marginal values or just marginals.
Below the table, you will find summary information for the variable time, the performance time. We see a minimum time of 160 seconds and a maximum time of 320 seconds. There are two measures of the center of the distribution of time: the mean, or ordinary average, where all the numbers are added and then the sum is divided by the total number of numbers; and the median, which is the middle value, with half of the numbers lower and half higher. Finally, we have the first quartile, 191.8 (the point at which one quarter of the numbers are lower), and the third quartile, 241 (the point at which three-quarters of the numbers are lower).
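If gmodels is not available, a similar table of counts with marginals can be sketched in base R with table() and addmargins(). The medium and level vectors below are rebuilt from the cell counts in the table printed above. The vector x is made up purely to illustrate the summary measures and does not contain the real performance times.

```r
# Rebuild the medium and level columns from the cell counts printed above.
medium <- rep(c("bb", "cb", "org", "so"), times = c(5, 9, 4, 14))
level  <- c(rep(c("a", "p"), times = c(4, 1)),   # bb
            rep(c("a", "p"), times = c(6, 3)),   # cb
            rep(c("a", "p"), times = c(0, 4)),   # org
            rep(c("a", "p"), times = c(3, 11)))  # so

counts <- table(medium, level)      # cross tabulation of counts
with_totals <- addmargins(counts)   # append marginals, labeled "Sum"
with_totals

# Illustrative values only (not the real performance times).
x <- c(160, 190, 220, 240, 320)
mean(x)                     # sum of the values divided by how many there are
median(x)                   # middle value
quantile(x, c(0.25, 0.75))  # first and third quartiles
```

The marginals appear in a row and a column labeled Sum, which play the same role as Row Total and Column Total in the CrossTable() output.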
You might also find it convenient to write a script containing library() and setwd() commands so that you can, with one command, execute many such commands that you would otherwise need to enter separately. If you have downloaded several packages that you use frequently, it might be a good idea to load them all in one step rather than trying to remember when you need a particular package. Even though it is not a great inconvenience to issue a source() command each time you start R, some people prefer having R source such a script automatically. This is possible, but the method is a little different for each platform. In OS X, at the top of the screen, open the R menu and choose Preferences. There you can specify a working directory that will apply every time you start R, and you can drag and drop a file containing a script that will be sourced at startup. In Windows, you will need to find the .Rprofile or Rprofile.site file and edit it to include the commands that you want to execute at startup. To see examples, try this:
> ?Startup
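However it ends up being run, a startup script is an ordinary R file. The sketch below writes a tiny script to a temporary file and sources it, so you can see the mechanism without editing any real configuration files. The file contents and the object name greeting are assumptions for illustration only.

```r
# Write a tiny two-line script to a temporary file.
# The file contents and the name 'greeting' are illustrative assumptions.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  'library(stats)   # stand-in for the packages you use often',
  'greeting <- "packages loaded"'
), script_path)

# Execute every command in the file, in order.
source(script_path)
greeting
```

In a real startup script you would replace the stand-in lines with your own library() and setwd() commands.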
User-Written Functions
Sourcing a script is a great tool when you need to repeat a sequence of commands exactly. However, there may be times when you want to do some procedure repeatedly, but not always on the same variables or same arguments. If you wanted to use the script we created in the previous section, but not always on the same file, you could write your own function that would let you choose which file to retrieve and analyze.
The general format for a user-written function is as follows:
name <- function(argument1, argument2, ...) {commands}
Suppose that you want to name your function “update” and have it retrieve an Excel file that you will name each time you use the function. The code that follows, which is almost the same as the script in the previous section, would do this. The argument fn appears in the function statement and in the Nimrod2 statement, indicating that whatever argument is supplied by the user in the function call will be substituted in the Nimrod2 command when R executes the commands:
# a user-written function, called "update"
update <- function(fn) {
  library(gmodels)
  library(XLConnect)
  Nimrod2 <- readWorksheetFromFile(fn, sheet = 1, header = TRUE)
  attach(Nimrod2)
  CrossTable(medium, level, prop.r = FALSE, prop.c = FALSE,
             prop.t = FALSE, prop.chisq = FALSE)
  # print table with counts in each cell, but no percents
  perf_time = summary(time)  # save summary output
  title = "Summary of performance times:"
  cat(title, "\n", "\n")     # print title and 2 linefeeds
  print(perf_time)
  detach(Nimrod2)
}
To use this function, you must first save it, as you would save any R script, and then load it or source it. You can then issue any number of commands until you are ready to call the function, which you would do in the following way, where myfile.xls is the name of the Excel file that you wish to analyze:
> update("myfile.xls") # filename in quotes because myfile.xls
                       # is a character string, not the name of an R object
Of course, you might want to analyze a different file the next time. Just substitute the name of the new file. You can also create simple mathematical functions or quite complex programs, such as one that produces a special type of graph, as we will see later.
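As a small illustration of a simple mathematical function, here is a sketch that converts a time in seconds to a minutes:seconds label. The name to_min_sec is hypothetical, chosen for this example, and the inputs are the performance times reported in the summary output earlier.

```r
# Convert a time in seconds to a "minutes:seconds" label.
# 'to_min_sec' is a hypothetical helper written for this illustration.
to_min_sec <- function(seconds) {
  seconds <- round(seconds)  # work in whole seconds
  mins <- seconds %/% 60     # whole minutes
  secs <- seconds %% 60      # leftover seconds
  sprintf("%d:%02d", as.integer(mins), as.integer(secs))
}

to_min_sec(160)            # shortest performance time reported earlier
to_min_sec(c(222.1, 320))  # the function also works on a vector
```

Because the arithmetic operators and sprintf() are vectorized, the same function handles a single number or a whole column of times.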
A Taste of Things to Come
Figure 1-3 displays several graphs based on the Nimrod data. All of these types of graphs will be discussed in the following chapters, and you should be able to produce any of them, and many more, by the time you finish this book.
Exercise 1-4
If you had trouble entering data in “Exercise 1-2” or “Exercise 1-3”, try entering the simple dataset from “Exercise 1-1” by one of the following methods. First, here is the data again:
Year | Rate |
---|---|
1 | 94 |
2 | 92 |
3 | 90 |
4 | 80 |
Method 1: spreadsheet
Open your spreadsheet program. Enter the data into five rows (including the header) and two columns, just as the data is laid out here. Export the file into your working directory (see the section “The Working Directory”) as a .csv file with the name simple1.csv. Then, create the new data frame mydata:
> mydata = read.csv("simple1.csv",header=TRUE)
> mydata
  year rate
1    1   94
2    2   92
3    3   90
4    4   80
5   NA   NA
In this case, there was a blank row in the spreadsheet, so the last row of the R data frame has missing values. You can fix this by using the nrows argument to read in only the specified number of rows:
> mydata= read.csv("simple1.csv",header=TRUE,nrows=4)
> mydata
  year rate
1    1   94
2    2   92
3    3   90
4    4   80
If you had an extra (blank) row before the header, here would be the result:
> mydata
   X  X.1  X.2 X.3
1 NA year rate  NA
2 NA    1   94  NA
3 NA    2   92  NA
4 NA    3   90  NA
5 NA    4   80  NA
You could use the skip argument to ignore the first row:
> mydata= read.csv("simple1a.csv",header=TRUE,skip=1, nrows=4)
> mydata
   X year rate X.1
1 NA    1   94  NA
2 NA    2   92  NA
3 NA    3   90  NA
4 NA    4   80  NA
In this last example, we have extra columns. This is not a big problem, because we could simply ignore them. The important thing is that the two vectors of interest, year and rate, have the right number of rows. If you want to delete one of the useless columns, you can do it this way:
> mydata$X = NULL
> mydata
  year rate X.1
1    1   94  NA
2    2   92  NA
3    3   90  NA
4    4   80  NA
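The blank-row case in Method 1 can be reproduced without a spreadsheet by passing the file contents to read.csv() through its text argument. This is only a sketch: the string csv_text stands in for simple1.csv and assumes that the spreadsheet exported its blank row as a line of empty fields.

```r
# A stand-in for simple1.csv with a trailing blank spreadsheet row,
# assumed here to be exported as a line of empty fields.
csv_text <- "year,rate\n1,94\n2,92\n3,90\n4,80\n,\n"

with_blank <- read.csv(text = csv_text, header = TRUE)
with_blank   # the fifth row contains only NA values

trimmed <- read.csv(text = csv_text, header = TRUE, nrows = 4)
trimmed      # only the four data rows

# Alternative: read everything, then drop rows that are entirely missing.
cleaned <- with_blank[rowSums(!is.na(with_blank)) > 0, ]
cleaned
```

The second approach can be handy when you do not know in advance how many data rows the file contains.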
Method 2: text
Open your word processor or text editor. Type in the data with a space between each entry and the next on a line, and a carriage return at the end of each line, like so:
year rate
1 94
2 92
3 90
4 80
Extra spaces should not matter, but the data should be saved as plain text, not rich text. If your word processor allows you to save a .txt file, use the Save As command to save your file into your working directory, with the name simple2.txt. Otherwise, you will probably need to use the Export command, again using the name simple2.txt. Read the data into R and create the data frame newdata by using the following command:
> newdata = read.table("simple2.txt",sep="",header=TRUE)
> newdata
  year rate
1    1   94
2    2   92
3    3   90
4    4   80
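To see for yourself that extra spaces do not matter, you can try the following sketch, which supplies the text directly instead of reading a file. The string txt is an assumption standing in for simple2.txt, with deliberately uneven spacing.

```r
# Same data as simple2.txt, with deliberately uneven spacing.
txt <- "year   rate\n1 94\n2    92\n3 90\n4  80\n"

# sep = "" tells read.table() to treat any run of white space as one separator.
newdata <- read.table(text = txt, sep = "", header = TRUE)
newdata
str(newdata)   # shows the type of each column
```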
1 Many of the remaining examples of code will be written as scripts, without the > prompt at the beginning of each line. Furthermore, long commands, such as the CrossTable() command in the example, are often broken up over several lines; this makes reading them a little easier.