Spider Baseball Sites for Data

Sometimes the only way to get the data you want is to pull it directly from the source.

While I was writing this book, I came across the following request on the Retrosheet mailing list:

I’m going to be doing the Fans’ Scouting Report for a third
year, but this time, I want to do it during the year.
I’m looking to get the following information for 2005
for all players, as of the all-star break:
Team,playerID,player name,pos,innings
Anyone who can help, please send me a note offlist.
(playerid being whatever your data source is).
Thanks, Tom

Basically, Tom needed just a subset of the data available on the MLB.com site. Grabbing data from web pages so that you can reuse it for other purposes is such a common task that it has its own name: spidering. A spidering program reads a web page, pulls out just the parts you want, and throws away the rest.

Web pages are written in a language called HyperText Markup Language (HTML). An HTML file mixes the text of the page with tags that tell your web browser how to structure and display that text. Here is a short sample file that shows how this works:

	<html>
	<head>
	<title>Baseball Sites</title>
	</head>
	<body>
	<h1> Baseball Web Sites </h1>
	This book describes many different baseball web sites. Here are a
	few of my favorites:<br>
	<a href="http://www.baseball1.com">The Baseball Archive</a><br>
	<a href="http://www.retrosheet.org">Retrosheet</a><br>
	<a href="http://www.mlb.com">MLB.com</a><br>
	</body>
	</html>

The <html> tags tell the ...
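To make the idea of "pulling out just the parts you want" concrete, here is a minimal sketch that reads the sample page above and keeps only the links, discarding the rest. It is written in Python with the standard library's html.parser module, which is a choice made here for illustration and not necessarily the tool this hack goes on to use. The names SAMPLE_PAGE, LinkCollector, and extract_links are hypothetical, and the page is stored as a string so that the sketch runs without a network connection.

```python
from html.parser import HTMLParser

# The sample page from the text, stored as a string so the sketch runs offline.
SAMPLE_PAGE = """<html>
<head>
<title>Baseball Sites</title>
</head>
<body>
<h1> Baseball Web Sites </h1>
This book describes many different baseball web sites. Here are a
few of my favorites:<br>
<a href="http://www.baseball1.com">The Baseball Archive</a><br>
<a href="http://www.retrosheet.org">Retrosheet</a><br>
<a href="http://www.mlb.com">MLB.com</a><br>
</body>
</html>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: keeps (url, link text) pairs and ignores the rest."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (url, text) pairs
        self._href = None   # href of the <a> tag we are currently inside, if any
        self._text = []     # pieces of text seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase and attrs as (name, value) pairs.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a href=...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(page):
    """Return a list of (url, link text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_PAGE):
        print(f"{text}: {url}")
```

Run as a script, this prints one line per link in the sample page, each showing the link text followed by its address. A real spider would fetch the page over the network first and would usually look for table cells or other tags instead of links, but the pattern of reacting to the tags you care about and ignoring everything else stays the same.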
