BUY THIS BOOK
Add to Cart

Print Book $24.99


Safari Books Online

What is this?

Add to UK Cart

Print Book £17.50

What is this?

Looking to Reprint this content?


Baseball Hacks
Baseball Hacks Tips & Tools for Analyzing and Winning with Statistics

By Joseph Adler
Price: $24.99 USD
£17.50 GBP

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Basics of Baseball
More than most other sports, baseball is a game of numbers. Everything that happens can be measured, and people try to measure everything that can happen. They've done this for a long, long time. If you are so inclined, you can find detailed descriptions of games played when Ulysses S. Grant was president of the United States. It is the sense of endless possibilities for exploration and discovery in this data, along with the sheer love of the game, that drives passionate fans to devote countless hours to the smallest details of baseball.
Many fans love the history of baseball and want to compare the top players of today to the top players of the past. Some fans like to look up specific games from many years ago, maybe looking for the play-by-play for Sandy Koufax's perfect game or the box score for the first game their dad brought them to. Others wonder about how the game works, wanting to know if clutch hitters really exist, if sacrifice bunts are a good strategy, or if platooning batters is a good idea. Another group of fans just wants to know how their favorite players are doing this year, or they want to pick a winning fantasy team for next year.
This book will help you do these things, and it will tell you where to find the answers to questions like these. It will tell you where to find stats on past games, from wins and losses by teams all the way down to specific pitches. Many diligent fans from places like Retrosheet and Baseball DataBank work hard to collect historical data, and they make it available free of charge. This book shows you how to find and use their data, and how to analyze it with powerful and—thanks to the open source software movement—free tools that would have cost thousands of dollars just a few years ago.
Before diving headfirst into baseball statistics, databases, and programming, this chapter starts with the fundamentals: the rules of the game and of scoring a baseball game. I'll also give a few tips for easily getting baseball information off the Internet.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Hacks 1–7: Introduction
More than most other sports, baseball is a game of numbers. Everything that happens can be measured, and people try to measure everything that can happen. They've done this for a long, long time. If you are so inclined, you can find detailed descriptions of games played when Ulysses S. Grant was president of the United States. It is the sense of endless possibilities for exploration and discovery in this data, along with the sheer love of the game, that drives passionate fans to devote countless hours to the smallest details of baseball.
Many fans love the history of baseball and want to compare the top players of today to the top players of the past. Some fans like to look up specific games from many years ago, maybe looking for the play-by-play for Sandy Koufax's perfect game or the box score for the first game their dad brought them to. Others wonder about how the game works, wanting to know if clutch hitters really exist, if sacrifice bunts are a good strategy, or if platooning batters is a good idea. Another group of fans just wants to know how their favorite players are doing this year, or they want to pick a winning fantasy team for next year.
This book will help you do these things, and it will tell you where to find the answers to questions like these. It will tell you where to find stats on past games, from wins and losses by teams all the way down to specific pitches. Many diligent fans from places like Retrosheet and Baseball DataBank work hard to collect historical data, and they make it available free of charge. This book shows you how to find and use their data, and how to analyze it with powerful and—thanks to the open source software movement—free tools that would have cost thousands of dollars just a few years ago.
Before diving headfirst into baseball statistics, databases, and programming, this chapter starts with the fundamentals: the rules of the game and of scoring a baseball game. I'll also give a few tips for easily getting baseball information off the Internet.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Score a Baseball Game
This book shows you a lot of neat things you can do with computers and databases, but it's important to remember that all of those numbers come from the same place: a person with a score sheet.
Many fans watch baseball passively, chatting with friends, sipping beer, and trying to make sense of the first base coach's signals. But at any ballpark, you'll find fans of all ages taking part in a ritual as old as baseball itself: keeping score.
Some people, like Yankees fan Nancy Smith, have kept score at every game for 40 years. "For some reason, the people I've met at the stadium over the years were fascinated by [my] keeping score," she said in a New York Times article published June 1, 2005. "But I'm fascinated when I don't see people keeping score; it's a part of the game."
This hack describes a popular method for keeping score. It uses the scorebooks sold at ballparks and gives a good visual record of everything that happened in a game. Even if you never plan to score a game yourself, knowing how a game is scored helps you understand where statistics come from.
At a ballpark, most programs come with a score sheet for the game that resembles the score sheet in Figure 1-1 (from http://www.baseballscorecard.com/images/scorecard.gif). I've kept the description here a little vague because there are many variations on scoring methods. You're welcome to include more or less information when you score a game or to record it differently.
The purpose of scoring a game is to record everything that happened during the game so that you can calculate player statistics. As you can see in Figure 1-1, the score sheet has spaces to record information about the game, to list all of the players' names, and to record a description of every play in the game.
This score sheet lets you record the results for one team. Note that you need two of these score sheets for a game: one for the home team and one for the opposing team. By switching score sheets after every half inning (three outs), you can record everything that happens during a game. Here's how to use a score sheet.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Make a Box Score from a Score Sheet
Create a box score to summarize player performance.
Although a score sheet [Hack #1] describes everything that happened during a game, it doesn't tell you directly how each player performed. This hack explains how to use a score sheet to calculate a box score, a neat way to summarize how each player performed during a game.
Box scores are a traditional way to summarize baseball games, dating back to Chadwick in the 19th century. A box score captures a lot of information about the game. For each batter, it captures the number of at bats (AB), hits (H), runs (R), runs batted in (RBI), doubles (2B), home runs (HR), stolen bases (SB), and sacrifice hits (SH). For each pitcher, it includes the number of innings pitched, hits, runs, earned runs, walks, and strikeouts. It also includes notes on the number of players left on base; the winning, losing, and saving pitchers; and some other miscellaneous information.
As an example, here is the box score from the October 2, 1978 playoff game between the Yankees and the Red Sox:
	Game of 10/2/1978 -- New York at Boston (D)
	
	 New York          AB  R  H  RBI    Boston             AB  R  H  RBI
	Rivers M, cf        2  1  1    0    Burleson R, ss      4  1  1    0
	Blair P, ph-cf      1  0  1    0    Remy  J, 2b         4  1  2    0
	Munson T, c         5  0  1    1    Rice J, rf          5  0  1    1
	Piniella L, rf      4  0  1    0    Yastrzemski C, lf   5  2  2    2
	Jackson R, dh       4  1  1    1    Fisk C, c           3  0  1    0
	Nettles G, 3b       4  0  0    0    Lynn F, cf          4  0  1    1
	Chambliss C, 1b     4  1  1    0    Hobson B, dh        4  0  1    0
	White R, lf         3  1  1    0    Scott G, 1b         4  0  2    0
	Thomasson G, lf     0  0  0    0    Brohamer J, 3b      1  0  0    0
	Doyle B, 2b         2  0  0    0    Bailey B, ph        1  0  0    0
	Spencer J, ph       1  0  0    0    Duffy F, 3b         0  0  0    0
	Stanley F, 2b       1  0  0    0    Evans D, ph         1  0  0    0
	Dent B, ss          4  1  1    3
	                   -- -- --   --                       -- -- --   --
	                   35  5  8    5                       36  4 11    4

	New York      000 000 410 -- 5
	Boston        010 001 020 -- 4

	 New York         IP  H R ER BB SO
	Guidry R (W)      6.1 6 2  2  1  5
	Gossage R (S)     2.2 5 2  2  1  2

	 Boston            IP  H  R  ER  BB  SO
	Torrez M (L)      6.2  5  4   4   3   4
	Stanley B*        0.1  2  1   1   0   0
	Hassler A         1.2  1  0   0   0   2
	Drago D           0.1  0  0   0   0   0
	 * Pitched to 1 batter in 8th

	LOB -- New York 6, Boston 9
	2B  -- Rivers M, Scott G, Burleson R, Munson T, Remy J
	HR  -- Yastrzemski C, Dent B, Jackson R
	SB  -- Rivers M 2
	SH  -- Brohamer J, Remy J
	PB  -- Munson T
	T   -- 2:52
	A   -- 32925
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Keep Score, Project Scoresheet–Style
Use the Project Scoresheet system to record the plays during a baseball game.
About 20 years ago, a group of baseball fans started Project Scoresheet. This was a collaborative effort by baseball fans to record every play of every game. Bill James started this project out of frustration because the Elias Sports Bureau (the official statistics provider) refused to make this data available to fans. So, in true hacker tradition, the members of the project designed a neat, simple scoring method that was perfect for describing plays to computers. Today this system lives on as the preferred encoding method of Retrosheet. In fact, it's even used by STATS, Inc. and the Elias Sports Bureau.
With Project Scoresheet record keeping, you write down a code each time something happens during the game, in chronological order. You don't need two score sheets (one for each team) or special scoring paper. Even better, you don't need to use any weird symbols—you can just write codes using the letters of the alphabet. This makes Project Scoresheet descriptions ideal for computer use because they're easy to enter, email, and save to a database. The disadvantage of this system is that it's a little tough to compute a box score by hand or manually read a game description. After you get the hang of this system (and you learn all the codes, which admittedly is a big hurdle), you may find it faster and easier than traditional scoring.
Even if you never plan to use this system, you might find it helpful to read this hack because Retrosheet uses this system to describe play-by-play data. We'll revisit Project Scoresheet–style scoring throughout the book. For example, once you're familiar with the system, you can use it to understand historical play-by-play information [Hack #14] .
Each play code in this system captures several types of information:
Fielding of balls in play
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Follow Pitches During a Game
Baseball can be boring unless you understand what pitch is being thrown, and why.
People who don't watch baseball all the time think it's a boring game. They don't think much happens; every minute or so, the pitcher throws a ball and the batter swings, and that's it. Over time, fans realize that a lot is going on; there is a lot of strategy involved in pitching to each batter. This hack helps you follow this battle—the subtlest and most elegant part of the game.
In addition to making the game a lot more interesting, understanding and following pitching strategies raises a number of interesting questions to ponder as you watch. Is the pitcher throwing the ball where he should, or is he throwing it at the wrong spot? Is he having trouble with a certain pitch? Does it seem like the other team always knows a certain pitch is coming? Finally, watching the pitcher and catcher can also tell you about who is making the decisions. Usually, the catcher selects the pitches, but there are exceptions. You can learn a lot by watching.
The pitcher's primary goal is to try to fool the hitter into swinging in the wrong place at the wrong time. There are a lot of ways to do this. Sometimes the pitcher wants to throw strikes that look like balls (so the batter doesn't swing at a called strike) or balls that look like strikes (so the batter swings at the ball and misses). Other times, the pitcher wants to throw the ball where the batter can't hit it well (often, this means a high, inside fastball). At still other times, the pitcher throws the ball at different speeds to fool the hitter into swinging too early or too late. Pitching strategy can be subtle and complicated. Pitchers often throw one pitch (say, a slider on the outside corner) to set up another pitch (say, a slider outside).
Good pitchers (usually catchers, actually) remember what they threw to a batter in previous innings, games, or even seasons! Professional ball players and coaches spend hours studying videos before each game to try to forecast what a pitcher will throw to them, and when. Catchers spend hours reading notes and watching videos to figure out what pitches hitters expect pitchers to throw, and at what times, so they can adjust their strategies. And then ball players adjust their hitting strategies to what they think are the pitcher's new strategies…. You get the idea.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Follow the Game Online
Find some cool statistics and charts about baseball players on the Internet.
If you have a quick question about baseball (such as, "What was Reggie Jackson's batting average in 1977?"), some great web sites out there can help you. Even if you have the data in other places (I have more than 2 GB worth of baseball databases on my computer and a dozen phonebook-size statistics books), you might still want to look up player statistics on the Web.
There are many good web sites for finding statistics on current players. Here are a few of my favorites:
Major League Baseball
The best place to start is MLB.com (http://www.mlb.com), the official web site of Major League Baseball. Here you'll find the "official" statistics (as tabulated by the Elias Sports Bureau) for all current and past players. MLB.com lets you look at spray charts (diagrams showing where every ball was hit by a player) for all major league players. If you're curious why the defensive players shift positions for certain players, this can help answer your question.
ESPN
Like most web sites, ESPN.com (http://www.espn.com) has current statistics on every baseball player. But ESPN adds a unique twist: it includes a number of sabermetric stats, including park factors and DIPS. (I explain a lot of these statistics in Chapter 5.) If you subscribe to the web site, you'll get access to more content, including commentary from scouting agencies and a print subscription to ESPN The Magazine.
Baseball Reference
The Baseball Reference web site (http://www.baseballreference.com) is one of my favorite sources for information. This web site is based on the data from the Baseball DataBank (see "Get a MySQL Database of Player and Team Statistics"
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Add Baseball Searches to Firefox
Add a few baseball web sites to the default list of search engines and search instantly for information on baseball players.
Do you use Firefox? Do you like to look up information about baseball players? Start searching the easy way by adding a few baseball web sites to the Firefox search dialog box. As shown in Figure 1-13, you can pick different search engines to use in the dialog box. Even cooler, you can add other search engines to the list. This hack shows you how to add some baseball-specific search engines to the search bar.
Figure 1-13: Search bar search engine selection in Firefox
You can find a list of search engine plug-ins at http://mycroft.mozdev.org. This web site lets you search for different search engine plug-ins. Searching for "baseball" in the "Find search plugins" section of the site produces results like the ones shown in Figure 1-14. If you click on one of the links shown, a dialog box will appear asking whether you want to add that particular search engine to the search bar. If you click OK, the search engine will be added to the list in Firefox.
Figure 1-14: Baseball search engines
After adding the search engines you want, they will appear in the choices in the search bar, as shown in Figure 1-15.
Now, you can quickly look up information about baseball players. For example, suppose you want statistics on a player named Rodriguez. Type Rodriguez into the search bar, pick a search engine (such as Baseball Prospectus), and press the Enter key to search. Firefox will quickly load the search results, like the ones shown in Figure 1-16.
If you want to do even more, look at the tools available from Baseball Prospectus, at
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Find Images of Stadiums
Unlike most other sports, baseball fields are not uniform. Old parks like Fenway Park, Wrigley Field, and Yankee Stadium have short outfield walls and kinky corners where balls get lost. Here's an easy way to get an aerial view.
The easiest way to look at stadiums is through Google Maps (http://maps.google.com). To find a stadium, search by the name of the ballpark inside the city. For example, search for "SBC Park, San Francisco, CA" to find the ballpark where the San Francisco Giants play. Click the Satellite button in the top corner of the map and zoom in to take a closer look at the ballpark. Figure 1-17 shows the results.
Figure 1-17: SBC Park in Google Maps
Google Maps will try to find the address and will return a list of possibilities. Sometimes the first link will be correct, but sometimes you'll have to check other items on the list. (For example, "Petco Park, San Diego, CA" doesn't return the right result at the top.)
Google has released a cool new service called Google Earth that extends the aerial views of Google Maps with better images and lots of additional features. You can download a free version from http://earth.google.com. My favorite trick with Google Earth is to measure distances inside ballparks. Curious how far a home run ball was hit? Check it out with Google Earth.
For example, at SBC Park in San Francisco, Barry Bonds often hit balls out of the ballpark into McCovey Cove. How far did he actually have to hit the balls for them to reach the water? Figure 1-18 illustrates how to calculate this using Google Earth.
Figure 1-18: Using Google Earth to measure how far a ball was hit
Here's how to find out:
  1. Search for the ballpark by typing sbc park, san francisco, ca into the search box and clicking the Search button.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Baseball Games from Past Years
This chapter explains where to get information about baseball games and baseball players and tells you how to store this information for easy lookup. These hacks explain how to find or make box scores, player statistics, and even play-by-play data. You can get data about games from 1871 through last season. Several different groups of baseball fans have worked hard to compile and digitize this information, making it easy for any baseball fan to look up scores and stats, plan their fantasy team, or research baseball.
We use three primary types of data in this book. Here's a short explanation of each one, and how they all fit together:
Play-by-play
The most detailed data we have is "play-by-play" or "event" data. The event files include information about every play in a game: every at bat, every stolen base, and sometimes every pitch.
Game logs
Game logs include a summary of each game: playing conditions, scores, and starting players.
Player and team statistics
Player statistics include statistics for each player on each team in each season. These files include offensive statistics like hits, home runs, and stolen bases; pitching statistics like batters faced, strikeouts, and earned runs; and fielding statistics like putouts, assists, and errors. Team statistics summarize this information for each team.
When you have detailed play-by-play descriptions for each game, you can derive game logs from those descriptions. (See "Make Box Scores or Database Tables from Play-by-Play Data with Retrosheet Tools" [Hack #15] for information on a tool that does this.)
Similarly, you can derive player and team statistics from play-by-play data. "Make a Box Score from a Score Sheet"
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Hacks 8–23: Introduction
This chapter explains where to get information about baseball games and baseball players and tells you how to store this information for easy lookup. These hacks explain how to find or make box scores, player statistics, and even play-by-play data. You can get data about games from 1871 through last season. Several different groups of baseball fans have worked hard to compile and digitize this information, making it easy for any baseball fan to look up scores and stats, plan their fantasy team, or research baseball.
We use three primary types of data in this book. Here's a short explanation of each one, and how they all fit together:
Play-by-play
The most detailed data we have is "play-by-play" or "event" data. The event files include information about every play in a game: every at bat, every stolen base, and sometimes every pitch.
Game logs
Game logs include a summary of each game: playing conditions, scores, and starting players.
Player and team statistics
Player statistics include statistics for each player on each team in each season. These files include offensive statistics like hits, home runs, and stolen bases; pitching statistics like batters faced, strikeouts, and earned runs; and fielding statistics like putouts, assists, and errors. Team statistics summarize this information for each team.
When you have detailed play-by-play descriptions for each game, you can derive game logs from those descriptions. (See "Make Box Scores or Database Tables from Play-by-Play Data with Retrosheet Tools" [Hack #15] for information on a tool that does this.)
Similarly, you can derive player and team statistics from play-by-play data. "Make a Box Score from a Score Sheet" [Hack #2] explains how to calculate player statistics for each game. Over the course of a season, you can just add up a player's hits, at bats, home runs, and other statistics to find his player statistics for a season.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Get and Install MySQL
Get the free MySQL database and install it on your PC.
I hate to start this chapter with a hack about a piece of software instead of a hack about baseball statistics, but most of the hacks in this chapter require a database. This hack shows you how to install MySQL, a free database. If you already have a database (like Microsoft Access), you can skip ahead to the next hack.
A database organizes information into tables. Each table is organized into a set of columns that describe different attributes of the data, and a set of rows (called records) that contain different items of data. For example, a database of batting statistics might have columns representing player names, at bats, and hits, and rows representing players. Every value in a column is of the same type and can be referred to by the name of the column. In many respects, a table is similar to data in a spreadsheet.
In this book, we manipulate data with several pieces of software—Microsoft Excel, Perl, and R. You're probably wondering why I'm suggesting yet another way to manipulate data. Why do you need a database if you already have a spreadsheet? A database has several important advantages over a spreadsheet:
Strong types
In a spreadsheet, you can enter any value anywhere at any time. (Or, you can use difficult procedures to restrict the contents of cells.) This can be convenient, but it can also lead to mistakes. In a database, you precisely define what is allowed in every column. Computer science types call this strong typing. The advantage of this approach is that it allows you to check for garbage data automatically. For example, if you're expecting a column of batting averages (a number) and, for some reason, a player's name slips in because of a data entry error, the database can automatically warn you of the problem.
Transactions
In a spreadsheet, it is easy to accidentally delete or add data. (In the past, I've spent weeks searching for an incorrectly entered value.) In a database, values are added or deleted through insert, update, or delete queries. Moreover, all queries can be logged, providing a record of what changed. The main advantage of this is that if you accidentally change or delete some information, you can recover it automatically by rolling back to the previous values.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Get an Access Database of Player and Team Statistics
Get a free database of historical baseball data from the Internet (covering every major league game from 1871 through today) in Microsoft Access format.
Suppose you want to know the average ERA for your fantasy league. Or maybe you want to settle a bet with a friend about which players got on base most often during the 1970s. Or perhaps you want to show that Jim Thome was a clutch hitter for the Phillies in 2004. To do this, you'll need to find some statistics.
Sometimes you can find the statistics you want from MLB.com, the Baseball Reference, or the Baseball Prospectus (see "Follow the Game Online" [Hack #5] for some suggestions), but other times you won't find exactly what you want. You might be able to find this information online by spending hours tediously searching for the raw data, cutting and pasting the data into a spreadsheet, and producing the stats you want in the form you want.
I think that it's often easier to find the stats you want if you have your own database. This hack shows you the easiest way I know to get a database of baseball players. These databases include the total statistics, by year, for each baseball player. Later in this book, I'll show how to find records by game and play-by-play information. I'll also show how to get data in two common database formats: Microsoft Access and MySQL, the popular (and free!) open source database.
The information in these databases is identical, so you can pick whichever format is easiest for you. (When I wrote this book, I used a MySQL database containing the Baseball DataBank information. If you want to follow along with all the hacks in this book, you'll probably find it easiest to use the MySQL version. However, I do include some tips for using Microsoft Access. The idea of this book is to do things the easy way: if Microsoft Access is the easiest tool for you, go ahead and use it.)
If you are using Microsoft Windows and have Microsoft Access, you will probably find this the easiest place to start. You can download the file from the Baseball Archive web site at
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Get a MySQL Database of Player and Team Statistics
Get a free database of historical baseball data from the Internet (covering every major league game from 1871 through today) in MySQL format.
If you don't have Microsoft Access (and you don't want to buy it), or if you are a more experienced database user, you might prefer to use MySQL. The web site http://www.baseball-databank.org offers the same database that the Baseball Archive web site offers, but as a MySQL dump file.
Even if you have Microsoft Access, I recommend that you try MySQL, because it's faster, more standards compliant, more flexible, and relatively painless once it's up and running. It's also easier to use with other software, which is important for many hacks in this book. For instructions on getting MySQL, see "Get and Install MySQL" [Hack #8] . For suggestions on how to make it easier to work with MySQL, see "Get a GUI for MySQL" [Hack #18] . Finally, see "Use SQL to Explore Game Data" [Hack #16] for more information about the SQL language.
After you have MySQL installed, here's how to get the files and load them into your database.
You can get the file you need from http://www.Baseball-DataBank.org. From the web site, just download the file labeled "Database in MySQL form" and save it to your local disk. (You can download the 2004 database file from http://www.baseball-databank.org/files/BDB-sql-2005-08-02.sql.zip. Or, you can check for a more current version and use it instead. Just be sure to change the filename in the instructions I'm outlining here.) This database is produced by a volunteer effort led by Sean Forman and Peter Kreutzer.
If you are using Mac OS X, Linux, or another system with standard Unix commands installed, download the file using the following command:
	% curl http://www.baseball-databank.org/files/BDB-sql-2005-08-02.sql.zip\
	> BDB-sql-2005-08-02.sql.zip
	 % Total  % Received % Xferd Average Speed  Time  Time   Time Current
	                Dload Upload  Total  Spent  Left Speed
	100 6495k 100 6495k  0   0  587k   0 0:00:11 0:00:11 --:--:-- 623k
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Make Your Own Stats Book
Produce your own book of baseball statistics using Microsoft Access reports.
Every year, publishers release a set of new phonebook-size directories of baseball statistics. You can find the exact same statistics in free sources like the Baseball DataBank and Baseball Archive databases. (You can't get any accompanying insight, though, which is why it can still be worth buying these books.)
I usually find electronic databases more convenient than printed copies, but sometimes paper is better. For example, if your fantasy baseball league does a live draft, you might find a prioritized list of players. This hack tells you how to use Microsoft Access to print a professional-looking statistics report.
To use this hack, you will need Microsoft Access. For the example here, we'll only print batting statistics. We'll limit this further to show only:
  • Players who played in the last year
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Get Perl
Perl is a powerful tool for processing data. Here is how to get a free version for your PC.
Many hacks in this book require tedious, repetitive, and boring work: fetching files from the Internet, running programs to transform them, and saving them to databases. I hate doing tedious work, so I get my computer to do it for me. I usually use Perl for tasks like these, and I have chosen to use Perl in many of the hacks provided in this book.
Perl is a very powerful scripting tool that we'll use to download files from the Internet, parse files to extract information, and write output to a database. (You can use it for many other things, but we won't.)
If you want to run Perl scripts, this hack explains how to do that. If you're interested in understanding how my scripts work and want to modify them to do other things, see "Learn Perl" [Hack #13] for more information about the Perl language.
If you're using Linux or the Mac OS, you probably already have Perl installed on your computer. (Just make sure you install all the modules we'll need, which I'll describe shortly.) But if you use Windows, you'll need to install Perl yourself. Since Perl is essential to running many of the hacks in this book, here is how to get it and install it.
Installing Perl on Windows is fairly straightforward, and you can do it with a minimum of fuss using ActiveState's ActivePerl, a free version of Perl for Windows. The first step is to download the installation files from http://www.activestate.com/Products/ActivePerl. Follow the download links and download the installer program. (I recommend the MSI version because it includes uninstall functionality.)
Once you've downloaded the file, you need to run the install program. The first screen of the install program has no options; just click Next. The second one asks you to accept a license agreement to continue; if you don't object to the agreement, click Next. The third screen presents custom setup options. Just accept the defaults and click Next. The fourth screen asks if you want to "Enable PPM3 to send profile info to ASPN." It's up to you whether you want to select this option (I didn't select it); once you've made your choice, click Next. The next screen shows several options but doesn't allow you to check any, so just click Next to start the installation. Once the installation process completes, click the Finish button.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Learn Perl
This is a short introduction to Perl.
Many of the hacks in this book use Perl to process text files. I'm sure that many of you have never used Perl (or any programming language, for that matter), so I want to give a short introduction. This introduction should be enough to let you write really simple programs, to understand the sample programs in this book, and to modify the sample programs to do different things. If you don't want to understand my scripts, modify my scripts, or write your own scripts, you can safely skip this hack.
Perl programs are usually called Perl scripts because they are interpreted in real time by the Perl program. (Perl is an example of an interpreted language. Other languages you might have heard of, like C++ and Java, are compiled languages. Here's a quick analogy: suppose the computer is a baker. The Perl program is similar to a recipe to make cake from scratch, and the compiled C++ program is similar to a cake mix. You can bake a cake from either recipe, but the cake mix eliminates steps for you ahead of time.)
This section introduces some basic concepts in Perl, showing constructs we'll use many times in this book.

Section 2.7.1.1: Statements.

Perl statements tell the computer to do something. As a simple example, let's write a short program to print something:
	print "hello world\n";
(Notice the semicolon at the end of the statement. You need to put a semicolon after the end of every statement in Perl. Oh, and the \n means "start a new line.") If you save this to a file called helloworld.pl, you can run the program like this:
	% perl helloworld.pl
	hello world
We'll get back to statements soon, but first we need to introduce variables.

Section 2.7.1.2: Variables.

In Perl, you can give a convenient name to a value and then use it later. The named thing is called a variable. (If that statement bothers you, you know too much to be reading this chapter. This book is about saving time! Don't read this. Skip to another hack.) Here are a couple of examples:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Get Historical Play-by-Play Data
Get data on every play in MLB games from Retrosheet.
Retrosheet is a nonprofit organization dedicated to collecting and computerizing statistics on Major League Baseball games. David W. Smith started Retrosheet in 1994, and it has become the best source for historical baseball information. (David W. Smith won the prestigious 2005 SABR Bob Davids Award for his work.) Retrosheet provides lots of information on its web site, including old box scores, play-by-play game descriptions, and game summaries. Best of all, you can download datafiles that you can use for your own analysis.
Retrosheet supplies some tools for reading these files (see "Make Box Scores or Database Tables from Play-by-Play Data with Retrosheet Tools" [Hack #15] for more information), but this hack explains how to read these files. This can be very useful if you find the tools hard to run, or if you want to look for information that's not available through the standard tools.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Make Box Scores or Database Tables from Play-by-Play Data with Retrosheet Tools
Turn historical event files into box scores, summary statistics, and events suitable for database import using Chadwick and the DiamondWare tools.
It's pretty tough to work directly with the Retrosheet event files. They contain a description of everything that happened, but they're dense and hard to read. Two sets of tools can make things easier. These tools turn the event files into box scores and tables suitable for databases. There's even a new GUI-driven tool that can help you explore these files.
These tools don't include any data, nor do they fetch it from the Internet. Before using them, you'll need to get some baseball data. See "Get Historical Play-by-Play Data" [Hack #14] for more information.
A company called DiamondWare developed a set of three tools for Microsoft Windows for processing Retrosheet–style event files. These are:
BGAME
The BGAME tool will import an event file and output a CSV file with one line per game containing summary statistics about each game. These statistics include home team, visiting team, runs scored, starting pitchers, umpires, starting lineups, and other information about each game.
BOX
The BOX tool generates a set of box scores from a historical event file. Many users will find a one-page box score easier to understand than 200 lines of event codes.
BEVENT
The BEVENT tool will import an event file and output a CSV file with one event per line, suitable for import into a database. This program can output very detailed information about each event, including which player was at each base, which runners were on base, whether bases were stolen, flags for double plays, the result of the at bat, and many other fields.
You can get these Retrosheet tools from
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Use SQL to Explore Game Data
Understand databases and learn to read and write SQL queries.
In this book, I use a SQL database to query and store data. Specifically, I use MySQL. MySQL is a relational database, not unlike Microsoft Access, Oracle, Microsoft SQL Server, and IBM DB2, or any number of other expensive products that companies run on big, back-room servers. This hack describes the basics of databases and introduces Simple Query Language (SQL), which is the way you communicate with a database.
I suggest that you use MySQL for three reasons. First, it's free. Second, a MySQL database of baseball data that you can easily download and install is available free of charge. And finally, MySQL is very stable and robust, so you are assured fast performance and few errors.
If you're using another database, the basic techniques described in this book will still work for you. However, other databases might use slightly different notations for tasks such as adding users, creating tables, and managing your database.
To begin using MySQL, you need to run the mysql program from a command line. (The database server is called MySQL. It includes a command-line tool called mysql that you can use to query your MySQL database.) On Microsoft Windows, go to the Start menu, select Programs → MySQL → MySQL Server 4.1 → MySQL Command Line Client. A window will open prompting you for a password. Type in your password and press Enter. You should see a prompt like this:
	Enter password: ******
	Welcome to the MySQL monitor. Commands end with ; or \g.
	Your MySQL connection id is 8 to server version: 4.1.7-nt

	Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

	mysql>
At the prompt, type use <database name> to select a database. For example, to use the Baseball DataBank database described in "Get a MySQL Database of Player and Team Statistics" [Hack #10] , your interaction with MySQL would look like this:
	mysql> use bbdatabank;
	Database changed
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Use Microsoft Access to Run SQL Queries
Write queries in Microsoft Access using the SQL language.
Most Microsoft Access users like to build queries graphically, and I must admit that I also like to do that. But it's very hard to explain how to build a query in print. (See "Make Your Own Stats Book" [Hack #11] for an example.) SQL is a very concise way to explain database queries, so that is the way I explain queries in this book.
Luckily, Microsoft Access lets you enter queries as SQL code. This hack explains how to enter SQL queries in Microsoft Access.
To begin, start Microsoft Access and open the database file you want to query. You can do this by double-clicking the file in Windows Explorer or by starting Microsoft Access and choosing Open Database.
As an example, I chose to show batting averages by player, using the Baseball Archive database. (See "Get an Access Database of Player and Team Statistics" [Hack #9] for instructions on how to get this database.)
To create a new query, select Query from the Insert menu. A dialog box will appear, asking you how you want to create a new query. Select Design View. This will open a query design window in the default format. To get a SQL dialog box, select SQL View from the View menu. You will see a window like the one shown in Figure 2-9 . Running the query shown in Figure 2-9 generates the results shown in Figure 2-10.
Figure 2-9: Access SQL edit window
If you want, you can get Access to show the SQL query in the normal graphical edit window. To do this, choose Design View from the View menu. Figure 2-11 shows the design view for the earlier query.
Not all SQL features work in Microsoft Access, but if you want to use Microsoft Access, it is an alternative to MySQL. Specifically, if you are using Microsoft Access, you won't be able to use subqueries. Here is a way to work around this:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Get a GUI for MySQL
Get a graphical database tool to make it easier to query a MySQL database.
If you are reluctant to use MySQL because its command-line interface is unfamiliar and nonintuitive, this hack is for you. MySQL (the company) makes an easy-to-use graphical query tool for MySQL (the database). This tool, called MySQL Query Browser, lets you explore a database through a point-and-click interface, edit queries, and save results to files that you can open in a spreadsheet.
You can get MySQL Query Browser from http://dev.mysql.com/downloads/query-browser. This web site includes easy-to-use installer programs for Windows, Linux, and Mac OS X. See the web site for download and installation instructions.
Most users should probably download the standard installer program for Windows, the RPM packages for Linux, or the DMG files for Mac OS X. These versions will make installation easiest for most users. Unless you have unusual requirements and you know what you're doing, download and install the standard versions.
When you first open MySQL Query Browser, you will be prompted to enter information about your database connection. If you created a database using the instructions in "Get a MySQL Database of Player and Team Statistics" [Hack #10] , you can follow along with this example.
Enter your username and the bbdatabank default schema. (Optionally, you can enter a name in the Stored Connection field to name these parameters. This lets you choose the connection from a menu the next time you run the program.) Figure 2-12 shows the connection screen.
Let's try a simple query to show off the query browser interface. Select "New query tab" from the File menu. A new tab will appear in the query browser window. As a simple example, let's look at the Batting table. Drag-and-drop the Batting table from the Schemata list to the query window at the top of the page. The SQL code SELECT * FROM 'batting' b will appear in this window. Click the Execute button, and the tool will show the contents of the table in the results window, as shown in Figure 2-13.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Move Data from a Database to Excel
Get data out of your database and into Microsoft Excel for easy analysis.
In "Get and Install MySQL" [Hack #8] , I listed a bunch of reasons why databases can be better than spreadsheets: database fields are strongly typed, database changes are transactional, database queries can be used to select data, database joins can be used to merge data, database relationships can be defined between data, and database file sizes can be very large. For storing and fetching data, I also think databases are better.
However, for analyzing and inspecting data, I prefer a spreadsheet. You can do a lot of things easily with spreadsheets that you can't do easily with a database—for instance, plot charts, create pivot tables, and do statistical calculations. In this hack, I'll show you a few ways to get data out of a database and into Microsoft Excel.
Spreadsheets are very powerful, but they also have some limitations. Here are a few suggestions on how to select and format your data to make the most useful spreadsheet:
Table sizes
Make sure there are fewer than 255 columns and 65,534 rows in the table. The limit is 65,535 rows, but you need to save one for column headers!
Counts versus ratios
In general, I think it is better to export counts than averages. Here's an example that illustrates why this is.
Suppose you want to calculate team batting averages from individual player statistics. Would it be better to export hits (H) and at bats (AB), or averages (AVG)? Team batting average usually means the total number of hits by a team, divided by the total number of at bats. The easiest way to do this is to sum H and AB for a team, and calculate the team AVG from there. If you have only the AVG for each player, you cannot calculate the team average correctly. (You can calculate this directly from the individual batting averages of individual players if you weight the results by AB, but that's ugly and complicated.)
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Load Baseball Data into MySQL
Use a Perl script to generate MySQL statements automatically for loading baseball data into MySQL.
Most baseball data on the Web is available as flat files: text files where a tab, a comma, or another character separates each element. For example, here is the AwardsManagers table from the Baseball Archive data:
	"managerID","awardID","yearID","lgID","tie","notes"
	"larusto01m","Mgr of the year",1983,"AL",,
	"lasorto01m","Mgr of the year",1983,"NL",,
	"andersp01m","Mgr of the year",1984,"AL",,
	…
The first line contains the header information (the name of each column). Each subsequent line represents a different manager award for a different league. Text fields are enclosed in quotes and are separated by commas. The fields are (from left to right): ID code, award name, year, league, a field that indicates ties, and notes.
As you can see, it is easy for a person to read these files. Often, it is convenient to create or edit files like these. For example, I show how to create formatted text files like these in "Make Box Scores or Database Tables from Play-by-Play Data with Retrosheet Tools" [Hack #15] . However, it's much easier to read these files using a database because it lets you search easily for specific lines or summarize the information in a file.
To load a text file like this into a database, you have to define an appropriate table type [Hack #16] . Writing a statement like this can be tedious, especially if you don't know the length of each field. If you guess field lengths incorrectly, you will either waste storage space or truncate some of the fields.
This hack presents a Perl script that will automatically read a text file and create SQL import code. This script works pretty simply. First, it reads the command-line options (specifying the delimiter, whether there is a header line, and the input filename). Next, the script reads each line of the file, measuring the length of each field and checking the type of each field. If it finds any nonnumerical values in a line, it sets the field type to character. Finally, it prints out some SQL code to define the table and load the text file. This script accepts three options:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Load Retrosheet Game Logs
Content preview·Buy PDF of this chapter|