Beautiful Data by Toby Segaran, Jeff Hammerbacher

How Did We Get the Data?

Once we decided that we were interested in real estate sales, the search for data began. Data searches are not always successful, so we felt particularly lucky when we found weekly sales of residential real estate (houses, apartments, condominiums, etc.) for the Bay Area produced by the San Francisco Chronicle at http://www.sfgate.com/homesales/. We felt even luckier when we realized that we didn't have to extract the data by parsing web pages, because it was already available in a machine-readable format.

Each human-readable (HTML web page) weekly summary is built from a text file that looks like this:

rowid: 1
county: Alameda County
city: Alameda
newcity: 1
zip: 94501
street: 1220 Broadway
price: $509,000
br: 4
lsqft: 4420
bsqft: 1834
year: 1910
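To make the record layout concrete, here is a minimal Python sketch of how such "key: value" lines might be read into dictionaries. It is not the authors' code. It assumes that a file holds many records one after another and that each record starts with a rowid line; the function names parse_records and price_to_int are hypothetical.

```python
def parse_records(text):
    """Split 'key: value' lines into one dict per record.

    Assumption: every record begins with a 'rowid' line.
    """
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that have no "key: value" shape
        key = key.strip()
        value = value.strip()
        if key == "rowid" and current:
            # A new rowid marks the start of the next record.
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def price_to_int(price):
    """Turn a price string such as '$509,000' into an integer."""
    return int(price.replace("$", "").replace(",", ""))


# The sample record shown above.
sample = """rowid: 1
county: Alameda County
city: Alameda
newcity: 1
zip: 94501
street: 1220 Broadway
price: $509,000
br: 4
lsqft: 4420
bsqft: 1834
year: 1910
"""

records = parse_records(sample)
```

All values are kept as strings at this stage; fields such as price need a separate cleaning step before they can be treated as numbers.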

The data for each week is available at a URL of the form http://www.sfgate.com/c/a/<year>/<month>/<day>/REHS.tbl. This is pretty convenient and only requires generating a list of all Sundays from the first on record, 2003/04/27 (which we found on the archive page), to the most recent (at the time of analysis), 2008/11/16. With this list of dates in hand, we generated a list of URLs in the correct format and downloaded them with the Unix command-line tool wget. We used wget because it can easily resume where it left off if interrupted.
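The date and URL generation described here can be sketched in a few lines of Python. This is an illustration, not the authors' script. The zero-padded month and day in the URL are an assumption based on the way the dates are written above, and the names URL_TEMPLATE, weekly_dates, and weekly_urls are hypothetical.

```python
from datetime import date, timedelta

# Assumption: month and day are zero-padded, as in "2003/04/27".
URL_TEMPLATE = "http://www.sfgate.com/c/a/{year:04d}/{month:02d}/{day:02d}/REHS.tbl"


def weekly_dates(first, last):
    """Yield every date from first to last (inclusive) in 7-day steps."""
    current = first
    while current <= last:
        yield current
        current += timedelta(days=7)


def weekly_urls(first, last):
    """Build one URL per weekly date between first and last."""
    return [
        URL_TEMPLATE.format(year=d.year, month=d.month, day=d.day)
        for d in weekly_dates(first, last)
    ]


# First Sunday on record and the most recent one at the time of analysis.
sundays = list(weekly_dates(date(2003, 4, 27), date(2008, 11, 16)))
urls = weekly_urls(date(2003, 4, 27), date(2008, 11, 16))
```

Writing the list to a file, one URL per line, allows it to be handed to wget with its -i option. Because every file is named REHS.tbl, an option such as -x (which recreates the directory hierarchy locally) would be one way to keep the weekly files apart.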

With all the data on a local computer, the next step was to convert the data into a standard format. We often use the CSV (comma-separated values) format; it is easy to generate ...
