Until recently, government data made its way to the Internet primarily through central planning: civil servants gathered the raw data generated by their work, processed and analyzed it to make maps, reports, and other informative products, and offered these to citizens seeking insight into school performance, crime in their neighborhoods, or the status of proposed laws. But a new, more dynamic approach is now emerging—one that enlists private actors as allies in making government information available and useful online.
A portion of this chapter was previously published as “Government Data and the Invisible Hand,” Yale Journal of Law & Technology, Vol. 11, 2009.
When the Web was born, computational and network resources were so expensive that building large-scale websites required substantial institutional investment. These inherent limits made government the only free provider of much online civic information, and kept significant troves of data off the Web entirely, trapped in high-end proprietary information services or dusty file cabinets. Government officials picked out what they thought to be the most critical and useful information, and did their best to present it usefully.
Costs for storage and processing have plummeted, but another shift, less well known, is at least as important: the tools that let people develop new websites are easier to use, and more powerful and flexible, than ever before. Most citizens have never heard of the new high-level computer languages and coding “frameworks” that automate the key technical tasks involved in developing a new website. Most don’t realize that resources such as bandwidth and storage can be bought for pennies at a time, at scales ranging from tiny to massive, with no upfront investment. And most citizens will never need to learn about these things—but we will all, from the most computer-savvy to the least tech-literate, reap the benefits of these developments in the civic sphere. By reducing the amount of knowledge, skill, and time it takes to build a new civic tool, these changes have put institutional-scale online projects within the reach of individual hobbyists—and of any voluntary organization or business that empowers such people within its ranks.
These changes justify a new baseline assumption about the public response to government data: when government puts data online, someone, somewhere, will do something innovative and valuable with it.
Private actors of all different stripes—businesses and nonprofit organizations, activists and scholars, and even individual volunteers—have begun to use new technologies on their own initiative to reinvent civic participation. Joshua Tauberer, a graduate student in linguistics, is an illustrative example. In 2004, he began to offer GovTrack.us, a website that mines the Library of Congress’s (LOC) THOMAS system to offer a more flexible tool for viewing and analyzing information about bills in Congress (see Chapter 18). At that time, THOMAS was a traditional website, so Tauberer had to write code to decipher the THOMAS web pages and extract the information for his database. He not only used this database to power his own site, but also shared it with other developers, who built popular civic sites such as OpenCongress and MAPLight (see Chapter 20), relying on his data. Whenever the appearance or formatting of THOMAS’s pages changed, Tauberer had to rework his code. Like reconstructing a table of figures by measuring the bars on a graph, this work was feasible, but extremely tedious and, ultimately, needless. In recent years, with encouragement from Tauberer and other enthusiasts, THOMAS has begun to offer computer-readable versions of much of its data, and this has made tools such as GovTrack easier to build and maintain than ever before.
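Screen scraping of this kind can be sketched in a few lines of Python. The HTML below is invented for illustration; real THOMAS pages were far more complex, and any change to their markup would break a parser like this, which is exactly the fragility described above.

```python
from html.parser import HTMLParser

class BillTitleParser(HTMLParser):
    """Collects the text of every <a> tag whose href mentions a bill.

    A minimal, hypothetical stand-in for the scraping code GovTrack
    once needed to extract structured data from HTML pages.
    """
    def __init__(self):
        super().__init__()
        self._in_bill_link = False
        self.bill_titles = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href", "")
        if tag == "a" and "bill" in href:
            self._in_bill_link = True

    def handle_data(self, data):
        if self._in_bill_link:
            self.bill_titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_bill_link = False

# A toy page standing in for a legislative listing.
page = '<ul><li><a href="/bill/hr1">H.R. 1: Example Act</a></li></ul>'
parser = BillTitleParser()
parser.feed(page)
print(parser.bill_titles)  # → ['H.R. 1: Example Act']
```

The parser depends entirely on incidental markup details (that bill links contain "bill" in their URLs), so a cosmetic redesign of the pages silently breaks it.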
Making government data public should always include putting it online, where it is more available and useful to citizens than in any other medium. But deciding that certain data should be published online is the beginning, not the end, of an important series of choices.
Not all publishing is equal: the way data is formatted and delivered makes a big difference. Public sector leaders interested in supporting this trend should look for the formats and approaches that best enable robust and diverse third-party reuse. Such a publishing strategy is powerful because it allows citizens themselves to decide how best to interact with civic data. Government-produced reports, charts, and analyses can be very valuable, but it is essential to also publish the underlying data itself in a computer-friendly format that makes it easy for the vibrant community of civic technologists to make and share a broad range of tools for public engagement.
Innovation is most likely to occur when data is available for free over the Internet in open, structured, machine-readable formats for anyone to download in bulk, meaning all at once. Structured formats such as XML make it easy for any third party to process and analyze government information at minimal cost. Internet delivery using standard protocols such as HTTP can offer immediate access to this data for developers. Each set of government data should be uniquely addressable on the Internet in a known, permanent location. This permanent address allows both third-party services and ordinary citizens to refer back to the primary, unmodified data source as provided by the government.
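As a small illustration of why structure matters, the sketch below parses a toy XML data set with Python's standard library. The element names, attributes, and values are hypothetical, not drawn from any real government schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured bulk data; the schema is invented for illustration.
bulk_xml = """
<bills>
  <bill id="hr-1" status="introduced">
    <title>Example Act</title>
    <sponsor>Rep. Smith</sponsor>
  </bill>
  <bill id="s-2" status="passed">
    <title>Sample Act</title>
    <sponsor>Sen. Jones</sponsor>
  </bill>
</bills>
"""

root = ET.fromstring(bulk_xml)
# Because the data is structured, a query takes a couple of lines
# instead of a fragile screen scraper.
passed = [b.findtext("title") for b in root.findall("bill")
          if b.get("status") == "passed"]
print(passed)  # → ['Sample Act']
```

The same query against an HTML page would require parsing presentation markup and would break whenever the page layout changed.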
Public government data should be provided in this format in a timely manner. As new resources are added to a given data set, or changes are made, government should also provide data feeds, using open protocols such as RSS, to notify the public about incremental additions or changes. However, a feed that provides updates is of limited value unless the existing body of information that is being modified can itself be downloaded in full. These principles are not ours alone—they are consistent with a number of other recommendations, including the Open Government Working Group’s list of eight desirable properties for government data.
In an environment with structured data, questions about what to put on the home page become decisions for the public affairs department. Technical staff members in government, whose hard work makes the provision of underlying data possible, will have the satisfaction of seeing their data used widely—rather than lamenting interfaces that can sometimes end up hiding valuable information from citizens.
Third-party innovators provided with government data in this way will explore more advanced features, beyond simple delivery of data. A wide range of motivations will drive them forward, including nonprofit public service, volunteer enthusiasm, political advocacy, and business objectives. Examples of the features they may explore include:
The best search facilities go beyond simple text matching to support features such as multidimensional searches, searches based on complex Boolean (AND/OR) queries, and searches for ranges of dates or other values. They may account for synonyms or other equivalences among data items, or suggest ways to refine or improve the search query, as some of the leading web search services already do.
RSS, which stands for Really Simple Syndication, is a simple technology for notifying users of events and changes, such as the creation of a new item or an agency action. The best systems could adapt the government’s own feeds (or other offerings) of raw data to offer more specialized RSS feeds for individual data items, for new items in a particular topic or department, for replies to a certain comment, and so on. Users can subscribe to any desired feeds using RSS reader software, and new items will be delivered to them automatically. The set of feeds that can be offered is limited only by users’ taste for tailored notification services.
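A third party could build one of these specialized feeds by filtering a government-provided raw feed. The sketch below does this with Python's standard library; the feed contents and category scheme are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A toy raw RSS feed standing in for a government agency's feed of actions.
feed = """
<rss version="2.0"><channel>
  <title>Agency Actions</title>
  <item><title>EPA: Rule proposed</title><category>EPA</category></item>
  <item><title>DOT: Comment period opened</title><category>DOT</category></item>
  <item><title>EPA: Hearing scheduled</title><category>EPA</category></item>
</channel></rss>
"""

channel = ET.fromstring(feed).find("channel")
# Keep only items for one department -- the kind of tailored feed a
# third-party site could offer on top of the government's raw feed.
epa_items = [item.findtext("title") for item in channel.findall("item")
             if item.findtext("category") == "EPA"]
print(epa_items)  # → ['EPA: Rule proposed', 'EPA: Hearing scheduled']
```

A real service would republish the filtered items as a new RSS feed for subscribers, but the filtering step is the essential idea.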
Government data, especially data about government actions and processes, often triggers news coverage and active discussion online. An information service can accompany government data with links to, or excerpts from, these outside sources to give readers context into the data and reactions to it.
To put an agency’s data in context, a site might combine that data with other agencies’ data or with outside sources. For example, MAPLight.org combines the voting records of members of Congress with information about campaign donations to those members. Similarly, the nonprofit group ProPublica offers a map showing the locations of financial institutions that have received funds from the Treasury Department’s Troubled Asset Relief Program (TARP).
A site that provides data is a natural location for discussion and user-generated information about that data; this offers one-stop shopping for sophisticated users and helps novices put data in context. Such services often require a human moderator to erase off-topic and spam messages and to enforce civility. The First Amendment may make it difficult for government to perform this moderation function, but private sites face no such problem, and competition among sites can deter biased moderation.
Often, large data sets are best understood by using sophisticated visualization tools to find patterns in the data. Sites might offer users carefully selected images to convey these patterns, or they might let users control the visualization tool to choose exactly which data to display and how. Visualization is an active field of research and no one method is obviously best; presumably sites would experiment with different approaches.
Machine-learning algorithms can often analyze a body of data and infer rules for classifying and grouping data items. By automating the classification of data, such models can aid search and foster analysis of trends.
Another approach to filtering and classification is to leverage users’ activities. By asking each user to classify a small amount of data, or by inferring information from users’ activities on the site (such as which items a user clicks), a site might be able to classify or organize a large data set without requiring much work from any one user.
Exactly which of these features to use in which case, and how to combine advanced features with data presentation, is an open question. Private parties might not get it right the first time, but we believe they will explore more approaches and will recover more rapidly than government will from the inevitable missteps. This collective learning process, along with the improvement it creates, is the key advantage of our approach. Nobody knows what is best, so we should let people try different offerings and see which ones win out. For those desiring to build interactive sites, the barriers to entry are remarkably low once government data is conveniently available. New sites can easily iterate through many designs, and adapt to user feedback. The people who ultimately benefit from these investments are not just the small community of civic technologists, but also the much larger group of citizens who seek to use the Web to engage with their government.
Once third parties become primary interfaces for crucial government information, people will inevitably ask whether the presented data is authentic. Citizens may wonder whether some of the sites that provide data in innovative ways are distorting the data. Slight alterations to the data could carry major policy implications, and could be hard for citizens to detect.
To lower the barrier for building trustworthy third-party sites, government should provide authentication for all published bulk data sets so that anyone who encounters the data can verify its authenticity. Since government is the original publisher of the data, and citizens seek assurance that a third party has not altered the data, government is the only party that can provide a useful digital signature for its data. While other publishing tasks can be left open for many actors, only government itself can provide meaningful authentication.
The ideal way to provide such authentication is through National Institute of Standards and Technology (NIST) standard “digital signatures.” Government should sign entire data sets, which will allow any downloader to check that the “signed” data set was published by the government and not altered in transit. The advantage of digital signatures is that they allow third parties to republish a trustworthy mirror of the same signed data set. Innovators who download the signed data set, from either a third-party source or the government’s own server, can trust that it is authentic if its attached signature is valid. Enabling trustworthy third-party mirrors can significantly reduce the government’s server and bandwidth costs associated with hosting the primary copy.
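The verification step can be illustrated in miniature. The Python sketch below checks a downloaded data set against a published SHA-256 digest; this is a simplified stand-in, since a real deployment would use a NIST-approved public-key signature (produced with a cryptographic library) so that the published value itself cannot be forged.

```python
import hashlib

def verify_dataset(data: bytes, published_digest: str) -> bool:
    """Return True if the data matches the digest the government published.

    A simplified stand-in for full digital-signature verification: a
    digest proves integrity only if it is obtained from a trusted source,
    whereas a public-key signature can be verified by anyone who holds
    the government's published key.
    """
    return hashlib.sha256(data).hexdigest() == published_digest

# A toy data set; in practice this would be the downloaded bulk file.
dataset = b"bill_id,status\nhr-1,introduced\n"
digest = hashlib.sha256(dataset).hexdigest()  # stands in for the signed value

print(verify_dataset(dataset, digest))         # untampered mirror copy passes
print(verify_dataset(dataset + b"x", digest))  # altered copy is rejected
```

Because verification needs only the data and the government-published value, a citizen can safely download the data set from any third-party mirror.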
But just authenticating at the data-set level is not enough. Government must also make it possible for citizens to verify, down to a reasonable granularity, the authenticity of individual elements that were picked out from the larger set. If signing individual elements is overly burdensome, government can alternatively publish individual data elements over a secure web connection (HTTPS). A third-party website offering crime statistics, for example, could link to specific data elements on the secure government website. This would make it easy for citizens to verify that the statistics for their own neighborhoods represent authentic government data, without having to download and verify the entire bulk data set on which the website is built.
There are a number of ways to support data authentication at each level—digital signatures and secure web connections are just two possibilities—and each agency, perhaps with the input of outsiders, should determine which option provides the best trade-off between efficiency and usability in each circumstance.
An alternative approach to bulk data, and one that is sometimes mentioned as an equivalent solution, is for government to provide a data application programming interface (API). An API is like a 411 telephone directory service that people can call to ask for specific information about a particular person or business. The directory operator looks up the answer in the telephone book and replies to the caller. In the same way, computers can “call” an API and query it for specific information, in this case, from a government database that is otherwise inaccessible to the public, and the API responds with an answer once it is found. Whether a third-party website uses an API or hosts its own copy of the government data is an architectural question that is not likely to be directly observable by the website’s end users.
APIs can be excellent, disappointing, or anywhere in between, but generally speaking, providing an API does not produce the same transformative value as providing the underlying data in bulk. While APIs can enable some innovative third-party uses of data, they constrain the range of possible outcomes by controlling what kinds of questions can be asked about the data. A very poorly designed API, for example, might not offer access to certain portions of the underlying data because the API builder considered those data columns to be unimportant. A better API might theoretically permit access to all of the data, but may not allow users to get the desired data out efficiently. For instance, an API for local spending might be able to return lists of all projects by industry sector, but might lack the functionality to return a list of all projects funded within a particular zip code, or all projects contracted to a particular group of companies. Because of API design decisions, a user who wants this information would face a difficult task: she would need to find or develop a list of all possible sectors, query the API for each one, and then manually filter the aggregate results by zip code or contractor.
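The workaround described above can be made concrete with a short sketch. The "API" here is simulated by a local Python function, and the sector names, record fields, and query behavior are all hypothetical.

```python
# Toy records standing in for a government spending database.
PROJECTS = [
    {"name": "Bridge repair", "sector": "construction", "zip": "08540"},
    {"name": "Fiber rollout", "sector": "telecom", "zip": "94110"},
    {"name": "Road paving", "sector": "construction", "zip": "94110"},
]

def api_projects_by_sector(sector):
    """Stands in for the only query the hypothetical API supports."""
    return [p for p in PROJECTS if p["sector"] == sector]

# Because the API cannot filter by zip code directly, the user must
# enumerate every sector, make one call per sector, and filter locally.
all_sectors = ["construction", "telecom"]  # a list the user must compile herself
results = []
for sector in all_sectors:
    results.extend(api_projects_by_sector(sector))

in_94110 = [p["name"] for p in results if p["zip"] == "94110"]
print(sorted(in_94110))  # → ['Fiber rollout', 'Road paving']
```

With bulk data, the zip-code filter would be a single local query; against the constrained API, the user must effectively re-download the whole data set one sector at a time.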
APIs and finished, user-facing websites face the same fundamental limit for the same reason: both require a designer to decide on a single monolithic interface for the data. Even with the best of intentions, these top-down technical decisions can only limit how citizens can interact with the underlying data. Past experience shows that, in these situations, interested developers will struggle to reconstruct a complete copy of the underlying data in a machine-readable way, imposing a high cost in terms of human capital and creating a risk of low data quality. The task would be like reconstructing the phone book by calling 411—“First, I want the last names starting with Aa….” Moreover, APIs and websites are likely more expensive for government to develop and maintain, as compared to simply publishing copies of the raw data and allowing third parties to host mirrors.
If government releases the data first in bulk, citizens will not be restricted to just the approved interfaces. Since APIs, like websites, do serve a useful purpose in efficient data delivery, developers will build their own APIs on top of bulk data sets that best suit their own needs and those of downstream users. Indeed, a number of nonprofit groups have already built and are now offering public APIs for data the government has published in bulk form. OMB Watch, for example, combines multiple government contract and grant databases into a single “FedSpending” API that other developers use for their own sites. The National Institute on Money in State Politics offers a “Follow the Money” API which provides convenient access to its comprehensive state-level campaign finance data set (see Chapter 19).
Government should seek to ease any friction that limits developers’ ability to build these tailor-made solutions. Only with bulk data can government harness the creativity and innovation of the open market and leverage the power of the Internet to bring all kinds of information closer to citizens. In the long run, as the tools for interacting with data continue to improve and become increasingly intuitive, we may reach a state in which citizens themselves interact directly with data without needing any intermediary.
Of course, beyond publishing data, government might also decide to build finished websites, and to build APIs. But publishing data in bulk must be government’s first priority as an information provider. The success of a government is measured, ultimately, by the opportunities it provides to its citizens. By publishing its data in a form that is free, open, and reusable, government will empower citizens to dream up and implement their own innovative ideas of how to best connect with their government.
David G. Robinson is a J.D. candidate in the class of 2012 at Yale Law School. Before arriving at Yale, David helped launch Princeton’s Center for Information Technology Policy, serving as the Center’s first associate director. He holds an A.B. in philosophy from Princeton, and a B.A. in philosophy and politics from Balliol College, Oxford, where he was a Rhodes Scholar.
Harlan Yu is a Ph.D. student in computer science and the Center for Information Technology Policy at Princeton University. His research is in information security, privacy, and technology public policy. His recent work in open government includes the development of RECAP, a tool that helps the public liberate federal court documents from PACER (Public Access to Court Electronic Records). He received his B.S. in electrical engineering and computer sciences (EECS) from UC Berkeley in 2004 and his M.A. in computer science from Princeton University in 2006.
Edward W. Felten is professor of computer science and public affairs, and director of the Center for Information Technology Policy, at Princeton University. His research interests include computer security and privacy, civic technologies, and technology policy. He received his Ph.D. in computer science and engineering from the University of Washington in 1993.