Baseball Hacks by Joseph Adler The unconfirmed error reports are from readers. They have not yet been approved or disproved by the author or editor and represent solely the opinion of the reader. Here's a key to the markup: [page-number]: serious technical mistake {page-number}: minor technical mistake : important language/formatting problem (page-number): language change or minor formatting problem ?page-number?: reader question or request for clarification This page was updated April 26, 2007. UNCONFIRMED errors and comments from readers: {24} Bottom of the page; In the baserunning info '.1-H;2-H;3-H', the order seems to be reversed. Instead, it should be '.3-H;2-H;1-H'. This would match the REGEXP example on p.113 (30) Last paragraph of Step 5; First sentence currently reads "There are a lot of reasons why a pitcher might shake off a batter." In context, it appears that it should read "There are a lot of reasons why a pitcher might shake off a 'catcher/catcher's signs'." {34} 4th para; Conquery link no longer works from baseball prospectus site. It appears as though that search tool may have been removed. {51} Step 5; Before you can "show tables;" the reader must first: mysql> SELECT bbdatabank; {65} 3; The line of code is given as: perl -e "print \"hello world!\n\";" This code works for ActivePerl, but not for Cygwin, advocated on the following page. Cygwin apparently wants the exclamation point to be escaped. The following code will work for both versions: perl -e "print \"hello world\!\n\";" {85} SQL INSERT command; The last four values of the INSERT command are inserting 3-letter team codes into the table. In the prototype table above, those codes are defined as being two letters (CHAR(2) datatype). Also, in the printed table those teams are defined as CO, HO, AT and PH. When the INSERT command is run, it will simply truncate those team names, causing no real harm to the example. However, mySQL will return those 4 warnings. To clean up the code, simply remove the last letter from the team column on those last 4 values. That will remove the warnings and make the command match the table definition exactly. {94} Subqueries in Access; A couple mistakes here: In the first SELECT statement: In Access 2003, at least, the expression (H + 2B + 2 * 3B + 3 * HR) must be written as (H + [2B] + 2 * [3B] + 3 * HR) which for consistency might as well be ([H] + [2B] + 2 * [3B] + 3 * [HR]). Next, the field AB appears twice in the list; it only needs to be there once. In the second SELECT statement: The nickname t is used in the WHERE clause, but is not defined. The clause 'from slugging_inner_query' should read 'from slugging_inner_query t'. Finally, a semicolon ends the second SELECT statement, but not the first. This doesn't matter as Access will simply add in the semicolon anyway. However, for consistency, these should match. [109] line 20 of 'rosters.pl' code; On line 20 of the rosters.pl code (which also appears in the hack_22_rosters.pl code that can be downloaded from the web site), the regular expression for verifying the Retrosheet ID is incorrect. The code reads: if (/[a-z]{5}\d{3}/) { Retrosheet ID's can include a dash when a player's last name is shorter than four characters (like Derrek Lee, lee-d002). The line should read: if (/[a-z-]{5}\d{3}/) { The dash in the first character class will match players like Lee or Jason Bay, etc. [109] rosters.pl; In line 5 of the code, print "retroID,lastName,firstName,bats,throws,team,pos\n"; should read: print "year,retroID,lastName,firstName,bats,throws,team,pos\n"; Otherwise, all of the columns will be shifted off by one (and position information will not make it into the database). {109} Paragraph starting "Notice that..."; It says that you called the output files pbp.csb and pbp2k.csv, but the script outputs pbp1960-1992.csv and pbp2000-2004.csv. It's trivial to rename the files, except that the name pbp.csv is used to dump the program output (which is not really needed) as stated in the previous paragraph. So, in order to avoid confusion, either the "debug" output of the script should be called something else, or the script should just output the files with the names that will be used later. [109] translate.pl code; The changes to the code in translate.pl for Windows (listed in the confirmed errata list) are not enough to get this script working on Windows XP. I had to make the following changes to get this code to work on Windows. (Note this fix uses the free 7-zip archive tool): Change from this: print `cat all_hdr.txt > $outfile`; print `cat all_hdr.txt > $outfile2k`; to this: print `type all_hdr.txt > $outfile`; print `type all_hdr.txt > $outfile2k`; Change from this: print `unzip -qq -o $archive`; to this: print `7z x -y $archive`; Change from this: print `./BEVENT.EXE -y $century$year -f 0-96 $file >> $outfile`; to this: print `BEVENT.EXE -y $century$year -f 0-96 $file >> $outfile`; Change from this: print `./BEVENT.EXE -y $century$year -f 0-96 $file >> $outfile2k`; to this: print `BEVENT.EXE -y $century$year -f 0-96 $file >> $outfile2k`; {110} Creating a play-by-play database and tables; In the console commands to the MySQL server, there is a GRANT ALL command on the database pbp. However, there is not creation of that pbp database. Readers can follow earlier examples to do this, but the following code shown on the page will not run as expected without this database already having been created. [141] Running the Hack; When running this on Windows, using ActivePerl, the DBD-mysql module package was not yet installed. In order to run this you will need to load the Perl Package Manager (as in Hack #12). Then type "install DBD-mysql". This will download the DBI and DBD-mysql packages, which will allow the perl to run and connect to the database. (The DBD-mysql package should probably be included in Hack #12 on page 66 in the list of hidden pacakges. I do not know if users of other operating systems will need to separately download this as well.) (141) last paragraph; 'You should run this script exactly once per day...' implies running the bootstrap script, load_db.pl, once a day as opposed to running the update script, update_db.pl, once a day. [146] last paragraph; This is not a mistake, but a technical update. For the 2006 season, GameDay has replaced the players.txt file with the players.xml file. Both files were available from Opening Day through 4/25/06, but as of 4/26/06 only the XML file exists. [147] 1st code comment, middle of the page; Contents of pbp can no longer be fetched. Pitch-by-pitch data has not been available as of the 2006 All-Star game, 7/11/06. Cannot find pitch location information stored anywhere, except for final batter of a game. [159] SQL Select Query; All references to a field named playerID should instead be to idxLahman. {160} Step 3; "Drop Row Fields Here" should read "Drop Category Fields Here" (at least in Excel 2003). Also the field "playerID" is actually the field "idxLahman". [163] Bottom of the page; The command: install.packages(pkgs=list("Rcmdr") should be: install.packages(pkgs="Rcmdr") (184-185) Graphs and exposition; The shapes of the graphs match the 2003 data, but the teams labels are incorrect. So while, the description of the teams in the exposition is accurate, the graphs do not match. Also, the graphs use the abbreviations CHC and LAD for example while the exposition uses CHN and LAN (as in the database). Correct graphs from the current database will fix the problems. {191} Step 1; At least in version 2.3.1 of R for Windows, you will need to go to "Select Repositories..." from the Packages menu and include Bioconductor. Then, "hexbin" will appear in the "Install Packages" list. [224] in "The Formula" paragraph for calculating TB; The formula in words is a single counts for 1, DOUBLES COUNT FOR 2, triples for 3 and homeruns for 4.... TB = 1*singles +2*doubles+3*triples+4*homeruns = H + 2B + 2 * 3B + 3 * HR shouldn't this last line read: = H + 2* 2B + 3 * 3B + 4 * HR if I am correct then this formula is wrong in several places in the text.