Access Databases Directly from Excel or R

If you store your data in a database, this hack shows you how to fetch it from R or Excel.

In “Move Data from a Database to Excel” [Hack #19] , I show several ways to export data from a database program into an Excel spreadsheet. This hack shows you another way to access data using ODBC (which stands for Open Database Connectivity). ODBC provides a standard interface for different programs to connect to databases, and it allows you to use the right tool for each job. As shown in Figure 4-5, you can do data selection and manipulation using SQL and then analyze the data in a tool such as R or Excel.

ODBC

Figure 4-5. ODBC

To use ODBC for MySQL, first you must install the MySQL ODBC drivers on your computer. (MyODBC is available for Windows, Mac OS X, Linux, and other platforms.) You can download the files from http://dev.mysql.com/downloads/connector/odbc/3.51.html. The simplest way to install MyODBC is to download the precompiled binaries and then run the install program to install the drivers.

If you don’t want to use MySQL at all, you might not have to. If you have Microsoft Access installed on your Windows XP machine, you should already have the Microsoft Access ODBC drivers installed. Plus, the Baseball Archive database is available as a Microsoft Access database.

If you’re using Mac OS X, you might have some problems with the MyODBC drivers, depending on the versions of MySQL, MyODBC, and Mac OS X that you are using. I found that the drivers from Actual Technologies worked somewhat better, but you have to pay for the software. See http://www.actualtechnologies.com for more information.

Now, you need to configure an ODBC connection for your database. Under the Start menu, go to Programs → Administrative Tools → Data Sources (ODBC); you will also find this in the Administrative Tools section of the Control Panel. Once you’ve opened the configuration manager, you will see a list of user data sources. If you are using MySQL, you need to add a MySQL ODBC connection. Click the Add…button. Select MySQL ODBC 3.51 Driver from the list in the New Dialog box, and then click Finish.

Here’s an example of what you would enter to access a MySQL database of player and team statistics from the Baseball DataBank database [Hack #10] :

Data source name

This is the name you will use to refer to the database—a completely separate name from the underlying database itself. I used bballdata.

Description

The description of the data source, which can be whatever you like.

Server

The name of the machine on which the MySQL server resides. In most cases, this will be your local machine, referred to as localhost.

User

The username of the account that you use to log in. Use the name you created when you set up the database. (In “Get a MySQL Database of Player and Team Statistics” [Hack #10] , I chose jadler.)

Password

The password for the account. (In “Get a MySQL Database of Player and Team Statistics” [Hack #10] , I chose P@ssw0rd.)

Database

The name of the database to which you would like to connect. This is the name of the database in MySQL. (In “Get a MySQL Database of Player and Team Statistics” [Hack #10] , I used bbdatabank.)

Figure 4-6 shows the configuration screen.

Configuring an ODBC connection

Figure 4-6. Configuring an ODBC connection

If you are using Microsoft Access, you need to add an Access ODBC connection. This is just as simple. Click the Add…button, select Microsoft Access Driver from the list, and choose the Access database file.

Use ODBC in R

R is a terrific language for calculations, models, and visualization, but SQL [Hack #8] is a better choice for data storage and retrieval. Once you’ve installed the ODBC drivers and configured a data source, you’re almost ready to use R. The last preparatory step is to install RODBC, the package [Hack #31] that “ODBC-enables” R. Start R, select Packages → Load Package…, and then select RODBC from the list of options. Once you’ve installed the package, you must load it using the command library(RODBC) each time you use R.

With the package installed, you’re ready to crunch data with R. The RODBC package includes many commands; for more information, see the RODBC manual or type help(RODBC). Here are a few key functions (and arguments to those functions; I’m omitting a few to keep this easy) that you may find useful:

Table 4-2. 

Function name

Arguments

Description

odbcConnect

<ODBC data source>, [uid=<username>], [pwd=<password>], etc.

Connects to the ODBC data source named in the first argument (using the name you assigned to it in the Data Sources administrative tool). User ID and password are optional. Returns an object that you can use with other ODBC functions.

odbcGetInfo

<channel>

Returns information about the ODBC connection.

sqlTables

<channel>

Lists the tables available from the ODBC data source.

sqlFetch

<channel>, <table name>, [max=<max rows>]

Returns a data frame containing the data from the specified table name. Optionally, you can specify a limited number of rows.

sqlQuery

<channel>, <query>

Returns a data frame containing the results of the SQL query.

Here’s a sample run of this program, using the commands that we described earlier. In this example, we start by opening a “channel” in R for accessing the database using the odbcConnect function. Next, we look at some information about the database (using the odbcGetInfo function), and we get a set of tables available on the database. Finally, we copy a few observations into R using the sqlFetch function and display the results to make sure everything worked:

	>library(RODBC)
	>channel <- odbcConnect('bballdata')
	>odbcGetInfo(channel)
	      DBMS_Name         DBMS_Ver            Driver_ODBC_Ver
	      "MySQL"           " 4.0.21-standard"  "03.51"
	      Data_Source_Name  Driver_Name         Driver_Ver
	      "bballdata"       "myodbc3.dll"       "03.51.09"
	      ODBC_Ver          Server_Name
	      "03.52.0000"      "192.168.0.3 via TCP/IP"
	>sqlTables(channel)
	 TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE     REMARKS
	1 bballdata   events      TABLE MySQL table
	>events <- sqlFetch(channel, "events", max=9)
	>events
	                gid AB Pitches     Play
	1  gid_2004_01_15_aasmlb_nasmlb_1  0
	2  gid_2004_01_15_aasmlb_nasmlb_1  0
	3  gid_2004_01_15_aasmlb_nasmlb_1  1  .BFE   HR/7/L
	4  gid_2004_01_15_aasmlb_nasmlb_1  2  .CFX     63/G
	5  gid_2004_01_15_aasmlb_nasmlb_1  3  .SBX      8/F
	6  gid_2004_01_15_aasmlb_nasmlb_1  4
	7  gid_2004_01_15_aasmlb_nasmlb_1  4  ..BD     S9/G
	8  gid_2004_01_15_aasmlb_nasmlb_1  5
	9  gid_2004_01_15_aasmlb_nasmlb_1  5   .B1.D D8/F.1-3
	                                                 PlayByPlay
	1
	2
	3  Carlos Febles homers (1) on a line drive to left field.
	4  Kimera Bartee grounds out, shortstop Kevin Polcovich to first baseman
	   Kevin Young.
	5  Geronimo Berroa flies out to center fielder Rich Becker.
	6  National Manager Walter Alston ejected by HP umpire Mark Carlson.
	7  Mark Whiten singles on a ground ball to right fielder Rob Ducey.
	8  American first baseman Larry Barnes left the game due to an injured
	      chest.
	9  Larry Barnes doubles (1) on a fly ball to center fielder Rich Becker.
	   Mark Whiten to 3rd.

Use ODBC in Excel

The cool thing about ODBC is that it gives you lots of options for how to work with the data. Here’s how to use the Baseball DataBank database in Excel. Open Excel and click Data → Import External Data → New Database Query. Select the bballdata ODBC source in the Choose Data Source window (Figure 4-7).

Choose Data Source dialog box

Figure 4-7. Choose Data Source dialog box

If you deselect the “Use the Query Wizard to create/edit queries” box, you can skip directly to the Microsoft Query application.

The first step of the Query Wizard allows you to select columns, as shown in Figure 4-8. I chose to select all columns from the Batting table and a few columns from the Master table (nameFirst, nameLast, birthYear, birthMonth, and birthDay). Next, the wizard allows you to choose to filter data, as shown in Figure 4-9. I chose to select player-seasons after the year 2000. After that, you can choose how you would like to sort columns (this is not pictured here). I chose not to sort the results. Finally, you can choose to save the query, return the results in Excel, create an OLAP cube, or edit the query using Microsoft Query (this step is also not shown). I chose to open the results in the Microsoft Query application.

Choosing columns in the Query wizard

Figure 4-8. Choosing columns in the Query wizard

Figure 4-10 shows a screenshot of the Microsoft Query application. If you are familiar with Microsoft Access, you’ll notice that Microsoft Query looks very similar. (See “Make Your Own Stats Book” [Hack #11] for more about Microsoft Access.) You can move around the order of the columns, add and subtract columns, join columns, add more tables, sort the results, filter the results, and do dozens of other things with this program. (Alternatively, you can click the SQL button and enter your SQL directly [Hack #16] .)

Once you’re satisfied with the query, go to the File menu in Microsoft Query and select Return Data to Microsoft Office Excel. Excel will then ask where you’d like to store the results (e.g., the current worksheet, a new worksheet, etc.). Once you’ve inserted the data, you can use Excel to slice, dice, or sort it in any way you want. Figure 4-11 shows the resulting spreadsheet.

Filtering data through the Query Wizard

Figure 4-9. Filtering data through the Query Wizard

Microsoft Query application

Figure 4-10. Microsoft Query application

ODBC Query results in Excel

Figure 4-11. ODBC Query results in Excel

Hacking the Hack

If you’re not on Windows, RODBC can be tough to work with sometimes. (For example, I could not get RODBC to work at all on a Mac.) RMySQL, an alternative package, might be a better solution.

To install RMySQL from source, on Mac OS X, you can use the following set of R commands. You will need to have MySQL installed locally in /usr/local/mysql. The first command sets two environment variables so that the package builds correctly. The next two commands install the DBI package and the RMySQL package, respectively.

	>print(Sys.putenv("PKG_CPPFLAGS"="-I/usr/local/mysql/include",
	+ "PKG_LIBS"="-L/usr/local/mysql/lib -L/usr/lib -lmysqlclient"))
	>install.packages('DBI')
	>install.packages('RMySQL')

Tip

You can also use RMySQL on Windows. To do this, download the package from http://stat.bell-labs.com/RS-DBI/download/index.html and use the Packages → “Install Package(s) from Local Zip Files” menu item in R to load it.

To use this package in your R programs, you can use a set of commands, as shown here:

	># load libraries
	>library(DBI)
	>library(RMySQL)
	># load MySQL driver into R and start connection
	>drv <-dbDriver("MySQL")
	>con <- dbConnect(drv, username="jadler", password="P@ssw0rd",
	+  dbname="bbdatabank", host="localhost")

Then, you can submit queries to a MySQL database directly from R and fetch the results into R tables. Here’s an example:

	>res<-dbSendQuery(con, "select * from teams");
	>teams<-fetch(res,n=-1);
	>summary(teams);

	  idxTeams          yearID        lgID
	Min.   :   1.0   Min.   :1871   Length:2475
	1st Qu.: 619.5   1st Qu.:1915   Class :character
	Median :1238.0   Median :1953   Mode  :character
	Mean   :1238.0   Mean   :1948
	3rd Qu.:1856.5   3rd Qu.:1982
	Max.   :2475.0   Max.   :2004

	idxTeamsFranchises    divID               Rank
	Min.   :  1.00     Length:2475        Min.   : 1.000
	1st Qu.: 29.00     Class :character   1st Qu.: 2.000
	Median : 57.00     Mode  :character   Median : 4.000
	Mean   : 54.63                        Mean   : 4.273
	3rd Qu.: 82.00                        3rd Qu.: 6.000
	Max.   :120.00                        Max.   :13.000

	  G             Ghome                 W
	Min.   :  6.0   Min.   : 44.00   Min.   :  0.00
	1st Qu.:153.0   1st Qu.: 77.00   1st Qu.: 64.50
	Median :155.0   Median : 79.00   Median : 76.00
	Mean   :148.8   Mean   : 78.06   Mean   : 73.92
	3rd Qu.:162.0   3rd Qu.: 81.00   3rd Qu.: 87.00
	Max.   :165.0   Max.   : 84.00   Max.   :116.00
	                NA's   :399.00

	  L                 DivWin               WCWin
	Min.   :  4.00   Length:2475        Length:2475
	1st Qu.: 64.00   Class :character   Class :character
	Median : 75.00   Mode  :character   Mode  :character
	Mean   : 73.92
	3rd Qu.: 86.00
	Max.   :134.00

	    LgWin              WSWin                    R
	Length:2475          Length :2475           Min.   : 24.0
	Class :character     Class  :character      1st Qu.:600.5
	Mode  :character     Mode   :character      Median :684.0
	                     Mean   : 676.7
	                     3rd Qu.: 760.0
	                     Max.   :1220.0

	     AB             H            2B               3B
	Min.   : 211   Min.   :  33   Min.   :  3.0  Min.   : 0.0
	1st Qu.:5074   1st Qu.:1279   1st Qu.:187.0  1st Qu.: 32.0
	Median :5339   Median :1384   Median :223.0  Median : 45.0
	Mean   :5090   Mean   :1334   Mean   :219.7  Mean   : 49.4
	3rd Qu.:5496   3rd Qu.:1463   3rd Qu.:259.0  3rd Qu.: 63.0
	Max.   :5781   Max.   :1783   Max.   :373.0  Max.   :150.0

	     HR               BB             SO
	Min.   :  0.00   Min.   :  0.0   Min.   :   0.0
	1st Qu.: 35.00   1st Qu.:416.5   1st Qu.: 482.0
	Median : 96.00   Median :490.0   Median : 674.0
	Mean   : 93.15   Mean   :468.8   Mean   : 680.9
	3rd Qu.:139.00   3rd Qu.:554.0   3rd Qu.: 906.5
	Max.   :264.00   Max.   :835.0   Max.   :1399.0
	                 NA's   :120.0

	     SB               CS            HBP
	Min.   :  0.0   Min.   :  0.00   Min.   :  29.00
	1st Qu.: 62.0   1st Qu.: 37.00   1st Qu.:  51.00
	Median : 97.0   Median : 48.00   Median :  58.00
	Mean   :114.7   Mean   : 51.31   Mean   :  59.37
	3rd Qu.:149.0   3rd Qu.: 61.00   3rd Qu.:  67.00
	Max.   :581.0   Max.   :191.00   Max.   :  95.00
	NA's   :144.0   NA's   :859.00   NA's   :2325.00

	     SF               RA            ER
	Min.   :  25.00 Min.   :  34.0   Min.   :  25.0
	1st Qu.:  40.00 1st Qu.: 599.0   1st Qu.: 484.0
	Median :  46.50 Median : 680.0   Median : 579.0
	Mean   :  46.91 Mean   : 676.7   Mean   : 558.5
	3rd Qu.:  52.00 3rd Qu.: 761.0   3rd Qu.: 655.0
	Max.   :  75.00 Max.   :1252.0   Max.   :1023.0
	NA's   :2325.00

	     ERA              CG            SHO
	Length:2475        Min.   :  1.00   Min.   : 0.000
	Class :character   1st Qu.: 25.00   1st Qu.: 6.000
	Mode  :character   Median : 53.00   Median : 9.000
	                   Mean   : 56.59   Mean   : 9.622
	                   3rd Qu.: 82.00   3rd Qu.:13.000
	                   Max.   :148.00   Max.   :32.000

	     SV            IPouts          HA              HRA
	Min.   : 0.00   Min.   : 162   Min.   :  49   Min.   :  0.00
	1st Qu.: 7.00   1st Qu.:4050   1st Qu.:1277   1st Qu.: 38.00
	Median :20.00   Median :4182   Median :1384   Median : 98.00
	Mean   :21.34   Mean   :3980   Mean   :1334   Mean   : 93.15
	3rd Qu.:34.00   3rd Qu.:4332   3rd Qu.:1467   3rd Qu.:136.00
	Max.   :68.00   Max.   :4518   Max.   :1993   Max.   :241.00

	     BBA             SOA            E              DP
	Min.   :  0.0   Min.   :   0.0 Min.   : 47.0  Min.   : 18.0
	1st Qu.:418.0   1st Qu.: 481.0 1st Qu.:126.0  1st Qu.:124.0
	Median :491.0   Median : 664.0 Median :155.0  Median :144.0
	Mean   :469.2   Mean   : 676.1 Mean   :198.0  Mean   :139.0
	3rd Qu.:556.0   3rd Qu.: 895.0 3rd Qu.:231.0  3rd Qu.:160.0
	Max.   :827.0   Max.   :1404.0 Max.   :639.0  Max.   :217.0
	                               NA's   :317.0

	     FP              name              park
	Length:2475       Length:2475        Length:2475
	Class :character  Class :character   Class :character
	Mode  :character  Mode  :character   Mode  :character

	   attendance       BPF             PPF
	Min.   :   6088  Min.   : 54.0   Min.   : 54.0
	1st Qu.: 468365  1st Qu.: 97.0   1st Qu.: 97.0
	Median : 963895  Median :100.0   Median :100.0
	Mean   :1170516  Mean   :100.2   Mean   :100.2
	3rd Qu.:1705375  3rd Qu.:103.0   3rd Qu.:103.0
	Max.   :4483350  Max.   :131.0   Max.   :129.0
	NA's   :    279

Get Baseball Hacks now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.