Mirror image. (source: Avi Schwab on Flickr, used with permission).

In this special episode of the Data Show, O'Reilly's Jenn Webb speaks with Maxwell Ogden, director of Code for Science and Society. Recently, Ogden and Code for Science have been working on the ongoing rescue of data.gov and assisting with other data rescue projects, such as Data Refuge; they’re also the nonprofit developers supporting Dat, a data versioning and distribution manager, which came out of Ogden's work making government and scientific data open and accessible.

Here are some highlights from their conversation:

Backing up data.gov

One of the projects I've been doing is making the first-ever backup of all of data.gov. It's not all of the data in the government; it's all the data that has been published. There's a policy in the federal government that if you're an agency, you're supposed to tell data.gov about all the open data you have, but they kind of rely on those agencies to self-report. In a couple of years, they've built a pretty big catalog of a couple million data sets across, I think, 1,000 federal departments. They've been focusing on getting more and more of these data sets listed, but they've never actually made an archival backup or done an analysis of how much of the data is actually accessible.

We think it's really important to do that while the data is still online. If we can make a snapshot as close as possible to January 20th, when the new president comes in, then in the future, if something changes, we can refer back to what it used to be. For a lot of those files, it's unclear if they've ever been programmatically downloaded and archived. We're trying to create a kind of history, or version control, of all the data that's on data.gov, as well as doing continuous monitoring on it to see if anything changes. ... There's a lot of data out there, and not all of it is on data.gov.
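The snapshot-then-monitor idea Ogden describes can be sketched with nothing more than content hashing: record a digest for every file at snapshot time, then diff later snapshots against it. The sketch below is a generic illustration under that assumption; the directory layout and function names are hypothetical, not data.gov's actual tooling or Dat's implementation.

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(data_dir: Path) -> dict:
    """Map each file's path (relative to the archive root) to its digest."""
    return {str(p.relative_to(data_dir)): checksum(p)
            for p in sorted(data_dir.rglob("*")) if p.is_file()}

def diff(old: dict, new: dict) -> dict:
    """Report files added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(k for k in set(old) & set(new)
                           if old[k] != new[k]),
    }
```

Running `snapshot` on a schedule and diffing against the January 20th baseline is exactly the "refer back to what it used to be" property: any file that disappears or silently changes shows up in the diff.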

Making copies of open data is about access, not just preservation

If there's a data set that some repository is hosting on, say, Amazon Web Services, and we can distribute it using the Svalbard Project and the Data Silo nodes, then when somebody goes to download that data set, they can use all the volunteer upload bandwidth to get a copy for free, and it'll be faster than downloading it from the one server that's getting hammered. We can kill two birds with one stone: a distributed archive helps with redundancy and trust, and it also relieves the load on the original server.
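The trust half of that claim rests on content addressing: if the publisher lists a hash for the data, any volunteer node can serve a copy, and the downloader verifies the bytes against the published hash, so a mirror cannot tamper undetected. The sketch below illustrates that general idea only; it is not Dat's actual wire protocol, and `fetch_verified` and the mirror callables are hypothetical names.

```python
import hashlib

def address(data: bytes) -> str:
    """Content address: the SHA-256 digest of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()

def fetch_verified(expected: str, mirrors) -> bytes:
    """Try untrusted mirrors in turn; accept the first response whose
    digest matches the address published by the original source."""
    for get in mirrors:
        data = get()
        if address(data) == expected:
            return data
    raise ValueError("no mirror served bytes matching the expected hash")
```

Because verification happens on the downloader's side, it doesn't matter whether the copy came from the original server or a volunteer node, which is what lets volunteer bandwidth safely stand in for the one server getting hammered.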

I was talking with the OpenfMRI Project, which hosts a couple hundred brain-imaging data sets. They spend about $5,000 a year just on bandwidth, and they're a fairly niche community; they don't have millions of downloads or anything like that. It's just that the data sets they host are, on average, quite large, so with every download they're spending a couple of bucks on bandwidth that they could be putting toward their research.

Code for Science’s three constituencies

I would say, in general, our three focus areas are access to research data for scientists, access to public data for journalists, and access to government data for civic hackers or governments that want to publish data themselves. Government, science, and journalism are the three most exciting public data areas, and I think they need much better tools.