Once we decided that we were interested in real estate sales, the search for data began. Data searches are not always successful, so we felt particularly lucky when we found weekly summaries of Bay Area residential real estate sales (houses, apartments, condominiums, etc.) published by the San Francisco Chronicle at http://www.sfgate.com/homesales/. We felt even luckier when we discovered that we didn't have to extract the data by parsing web pages: the data was already available in a machine-readable format.
Each human-readable (HTML web page) weekly summary is built from a text file that looks like this:
rowid: 1
county: Alameda County
city: Alameda
newcity: 1
zip: 94501
street: 1220 Broadway
price: $509,000
br: 4
lsqft: 4420
bsqft: 1834
year: 1910
The data for each week is available at a URL of the form http://www.sfgate.com/c/a/<year>/<month>/<day>/REHS.tbl. This is pretty convenient and only requires generating a list of all Sundays from the first on record, 2003/04/27 (which we found on the archive page), to the most recent (at the time of analysis), 2008/11/16. With this list of dates in hand, we generated a list of URLs in the correct format and downloaded them with the Unix command-line tool wget. We used wget because it can easily resume where it left off if interrupted.
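The date arithmetic is simple because both endpoint dates are Sundays, so stepping in seven-day increments enumerates every weekly file. A minimal sketch of the URL generation (the URL pattern and the two endpoint dates come from the text; the function name is ours):

```python
from datetime import date, timedelta

def sunday_urls(start=date(2003, 4, 27), end=date(2008, 11, 16)):
    """Yield one REHS.tbl URL per Sunday from start to end, inclusive.

    Both defaults fall on a Sunday, so stepping by 7 days stays on Sundays.
    """
    d = start
    while d <= end:
        # strftime fills in the zero-padded year/month/day of each Sunday.
        yield d.strftime("http://www.sfgate.com/c/a/%Y/%m/%d/REHS.tbl")
        d += timedelta(days=7)

urls = list(sunday_urls())
```

Writing the resulting list to a file, one URL per line, lets `wget -c -i urls.txt` fetch them all and resume after an interruption.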
With all the data on a local computer, the next step was to convert the data into a standard format. We often use the csv (comma-separated values) format; it is easy to generate ...
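One way to sketch that conversion, assuming each record is a run of `key: value` pairs with the field names shown in the sample above (the helper names are ours): values such as "Alameda County" contain spaces, so we split on the known field names rather than on whitespace.

```python
import csv
import re

# Field names taken from the sample record in the text.
FIELDS = ["rowid", "county", "city", "newcity", "zip", "street",
          "price", "br", "lsqft", "bsqft", "year"]

# Split on "key:" for known keys only; listing "city" before "newcity"
# is safe because re tries the leftmost match position first.
KEY_RE = re.compile(r"(%s):\s*" % "|".join(FIELDS))

def parse_record(text):
    """Turn one 'key: value' record into a dict keyed by field name."""
    parts = KEY_RE.split(text)[1:]   # drop any text before the first key
    keys, values = parts[::2], [v.strip() for v in parts[1::2]]
    return dict(zip(keys, values))

record = parse_record(
    "rowid: 1 county: Alameda County city: Alameda newcity: 1 "
    "zip: 94501 street: 1220 Broadway price: $509,000 br: 4 "
    "lsqft: 4420 bsqft: 1834 year: 1910")

# Write parsed records out as CSV with a header row.
with open("sales.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(record)
```

A real run would loop `parse_record` over every record in every downloaded file; prices like `$509,000` would still need the dollar sign and commas stripped before numeric analysis.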