Chapter 28. Toads on the Road to Open Government Data

Bill Schrier

One of the latest trends in governing is exposing many of the records and much of the data collected by governments for public viewing, analysis, scrutiny, and use. This trend started decades ago, with the federal Freedom of Information Act (FOIA) of 1966 and state and local equivalents such as Washington state’s Public Disclosure Act. The trend has recently accelerated with the election of President Barack Obama, who has promised an “open and accountable” government.

While open government advocates applaud this movement, and it has many notable benefits, there are also noticeable “toads” obstructing the road to an open government future. Some of these toads are implicit in the nature and culture of government. Others represent simple resistance to change. Still others present troubling ethical issues.

What Is Government?

Government is about services and geography and information. Governments should provide services which are difficult or impossible for the public to provide for themselves, or which are hard to purchase from private businesses. It is difficult, or at least troubling, to envision a police force or fire department operating as a for-profit business. Regulatory agencies such as those issuing building permits or enforcing food safety codes also are a natural fit for government. Of course, one can envision—or experience—a private water company, ambulance company, or even for-profit parks department. Still, those are natural monopolies best served at least by a nonprofit model, and probably by government.

Government is also about geography. Cities and counties and states define themselves by their geographic boundaries. Sometimes this geography gives rise to odd anomalies. You can be in a town where the prices of most goods are 9% higher than the sticker because of sales tax, and travel a few hundred feet to a place where you pay no sales tax and must have someone else pump your gasoline for you. But such is the nature of city limits and county boundaries and state lines. Complicated technology systems (e.g., 911) are built to guarantee that the proper service is dispatched from the proper government: city police, for example, versus county sheriff’s deputies versus the state highway patrol.

All of this is based on the geographical nature of governments. Technology—specifically, the Internet and the World Wide Web—is making boundaries less important. Do you care whether the recycling (or dumping) of your TV is handled by your city or your county or your state? No, but that difference is a major issue for the governments involved. Meanwhile, you simply want to go to a website and find out how to recycle.

In this way, government is also about information, because not only do governments need data to provide services, but they also thrive on data about services and about their constituents, and on turning that data into more or less usable information.

Data Collection

Most cities collect data in a variety of ways, and the most fundamental way is the phone call. A requestor (citizen, customer, constituent, member of the public, complainant) calls 911 for emergencies or calls 311 (in some more enlightened communities) for anything else. This starts the massive engine of data collection about the call and service. A simple call about a microwave oven left in the street can generate a huge amount of data collection. Is it in the street or on the curb? Is it a traffic hazard or, indeed, has it caused a collision? Who should pick it up and dispose of it—the streets department, solid waste department, or police department? Who left it there? Did they break a law? Should they be fined?

Usually, the person requesting a service must, at a minimum, provide a name, a phone number, and a detailed description of the request, including a specific geographic location. Often, the requestor will need to provide a lot more information, such as a date of birth, Social Security number, or home address, to get a license or permit or piece of identification.

Beyond service calls, governments collect or generate a wide variety of data from a whole host of other sources. For example:

  • Detailed financial information about payments received from companies and people.

  • Business location, nature of business, ownership, business income, payroll, and a wide variety of other information about each business operating within the particular government’s grasp.

  • Statistics such as the number of cars passing through a particular intersection or the number of people living in a given census tract. Indeed, the amount of statistical information collected about economic, personal, and governmental activity probably far exceeds the data collected about individual complaints or requests for service.

Exposing the Soul of Government

The information in government databases is vast and, indeed, is probably the “soul” of government. Certainly, in a democracy all of this information is owned in common by the governed. And most of it should be freely turned over to anyone who asks. So, what’s the problem? Why isn’t all the data collected by government freely available and posted on websites for anyone to take and use?

I believe there are probably seven general reasons most data owned by governments is still locked in our virtual vaults:

  • Privacy and legal restrictions

  • The culture of bureaucracies and homeland security

  • Ancient media

  • Proprietary and medieval databases

  • Ethically questionable information (privacy)

  • Ethically questionable information (sharing)

  • Cost

Privacy and Legal Restrictions

This restriction is the easiest to understand, although not necessarily to interpret or put into action.

Clearly, we do not want to “make open” any information restricted by law or prone to criminal use. Health records are certainly private, for example, as is personal information such as date of birth and Social Security number in combination with legal name and residence. This latter combination can be used to obtain credit and steal identities. The names of victims and often the names of accused are private, as are active criminal investigation files. Conversations between attorneys and clients, including between an assistant city attorney and clients such as department directors who are conducting personnel investigations or hiring/disciplining employees, are private.

The difficulties with releasing information and records are primarily the myriad laws and rules which protect privacy. There are federal, state, and local laws. There is case law. There are special laws such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA). Interpreting which laws apply often requires legal opinions from a government-employed attorney. Then, redacting or removing protected information from records requires considerable time, effort, and expense by government workers.

There are amazing sets of twists and turns and incongruities.

For one example, investigations of misconduct or discrimination are generally, and surprisingly, part of the public record. In the Seattle Transportation Department in 2008, for instance, certain employees claimed discrimination in job promotions and other personnel actions. The department hired an outside law firm and spent $800,000 investigating the issues. The entire file was opened as a public record, although the names of individual employees were redacted. One twist in this case is that employees who were interviewed as part of the investigation were notified prior to the release of the record, so that they could have hired a personal attorney to sue the city to block the release.

In another situation, the city of Seattle has an ordinance, dating from the 1970s, restricting the police from collecting information about the activities of people other than for criminal investigations. This ordinance was enacted partially as a reaction to the FBI activities of 1975 and earlier, where that agency, under J. Edgar Hoover, collected vast files on the private lives of the people of the United States, both prominent and unknown. One twist in this Seattle ordinance is that the Seattle police not only are restricted from collecting such information, but also are restricted from obtaining it (via, say, an automated link) if it was collected by another agency.

In still another situation, an employee requested all records referencing her name, including performance evaluations, emails about her, and notes in supervisors’ files (files kept by supervisors about an employee’s performance). This single request resulted in a search or scan of files and email messages held on desktop, server, and mainframe computers, plus a physical search of paper files to find all the relevant material. Then it all had to be redacted to remove the names of other employees or other protected information before it was released.
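The redaction step in a request like this can be sketched in a few lines of Python. This is only an illustration under assumptions: the `redact` function, the sample names, and the Social Security number pattern are hypothetical and are not drawn from any system the city uses. Deciding which fields are actually protected still requires legal review.

```python
import re

# Assumed pattern: U.S. Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text, protected_names, marker="[REDACTED]"):
    """Replace SSN-like strings and the listed names with a marker."""
    redacted = SSN_PATTERN.sub(marker, text)
    for name in protected_names:
        # re.escape keeps punctuation in a name from being read as regex syntax.
        redacted = re.sub(re.escape(name), marker, redacted, flags=re.IGNORECASE)
    return redacted


# Hypothetical record text; the requesting employee's own name is left intact.
sample = "Jane Doe met with John Smith (SSN 123-45-6789) about the evaluation."
print(redact(sample, protected_names=["John Smith"]))
```

A simple pattern match like this only catches what it is told to look for, which is part of why redaction by government workers remains slow and expensive.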

In the end, simply determining which laws might protect data and information held by the government is a daunting task. Often, it is easier to let information sit rather than make this determination and open the information to public scrutiny.

The Culture of Bureaucracies and Homeland Security

Information is power. It is in the nature of bureaucracies to be both protective of their information and fearful of its release. This fear gained new legal and emotional standing after the terrorist attacks on September 11, 2001, laws resulting from those attacks, and the creation of giant federal bureaucracies such as the U.S. Department of Homeland Security. A whole new class of restrictions on sharing of information was enacted for fear that certain kinds of information might be an aid to future terrorists.

Specific examples of such information include:

  • Plans for buildings on file in building departments.

  • Locations of public communications infrastructure such as fiber optic cable for phones and data systems. There have been “terrorist” incidents in both Bellingham, Washington, and Santa Clara County, California, where cables have been intentionally cut and 911 service interrupted.

  • Plans for and even the location of other infrastructure such as water pipelines, electrical lines, highway bridges, and microwave and radio towers (their locations are obvious, but what services are provided on any particular tower are not obvious).

  • Government plans for protecting such infrastructure or responding to emergencies.

While this set of restrictions generally seems to make sense, the climate of fear gripping the nation after September 11 added a whole other set of issues. For example, the city of Seattle had published, on its website, the location and nature of all calls to which the fire department and its medical service were dispatched. After September 11, publication of that information was curtailed, probably without specific legal authority, because of Seattle’s “home city security” concerns.

Many cities do restrict making public the locations and nature of 911 calls—police or fire. A domestic violence call to a specific address could, if revealed, fuel additional attacks on the victim. But is there a good reason to withhold information about thefts or barking dogs or even assaults? Shouldn’t people know what is going on in their neighborhoods? On the other hand, what if releasing that information depresses housing values or sales in that neighborhood, or causes discriminatory lending practices?

Many officials inside government also fear misinterpretation of data, and crime data specifically. If the geographic location of all crimes is made public in a data feed, it certainly would be possible to draw lines around certain locations and declare that crime is increasing (or decreasing) in those locations. People outside government might draw erroneous correlations from the data, especially when compared with other information such as census data or anecdotal information, e.g., “This is a high crime neighborhood because many legal and illegal immigrants live here.”

Note

The book Inside Bureaucracy by Anthony Downs (Waveland Press, 1993) contains further reading on the culture of government—or indeed, the culture of any large bureaucracy, public or private.

Ancient Media

Large quantities of government information are still stored on ancient media. In many cases, these are maps with locations of infrastructure or photographs. There are also filing cards or paper files with building permits and other permitting activity, criminal case files, legal opinions, and a variety of other data. In some cases, the older paper media have been moved to “newer” microfilm or microfiche, which—in these days of fully electronic and digital records—makes them even harder to access! In a few cases, data might be stored on magnetic tape, audio tape, or floppy disks, although those media are too short-lived to really be repositories for significant amounts of information.

In most cases, I think, the information held on ancient media which is most needed for current government operations has already been digitized. Over time, this sort of information will become less and less relevant and important.

Perhaps we need a “Google Books” project for government!

Proprietary and Medieval Databases

A related issue is data which is presently in electronic format but which is held in proprietary and ancient databases. In some cases, the schemas (designs or plans) for those databases never existed or have been lost. In other cases, the vendor that sold the database considered its format to be proprietary, usually to guarantee the vendor’s income stream in consulting fees to create reports to pull data from the databases. I refer to such vendor behavior as “medieval” because most software companies today freely give schemas and database structures to the governments that pay for the software.

Here are a few examples of this problem:

Electronic mail archives

Many different email systems have come and gone over the years. Today, just a few systems are in common use, such as Microsoft’s Exchange/Outlook and IBM’s Domino. But email archives stored in older and less common systems (e.g., Novell’s GroupWise) may not convert to a newer format, or may continue to be held in the older format even when an organization converts to a newer email system. Even with newer email systems, the email message stores and archive stores may be scattered around on desktop and server computers throughout the enterprise, making it difficult to collect and expose the messages.

Older versions

In many cases, governments installed computer systems for specific tasks, such as records management for a police department or customer billing for a water department. In the (seemingly) never-ending economic and budget cycles, it is often tempting for a government to stop paying maintenance to the software vendor on such systems. As the vendor comes out with newer versions of the system, the government doesn’t have a license for the new version, and doesn’t upgrade. So, the software continues to work, spitting out reports and bills. But the government can’t take advantage of features in the new versions which allow greater portability and exposure of the data.

Custom software applications

Before 2000, it was very common for enterprises to write custom software applications for particular business problems. In other words, rather than buy commercial off-the-shelf (COTS) software, an enterprise would custom-program a system in COBOL or another programming language. These systems often were not well documented, and they are hard to modify and very hard to “open up” for a public data feed.

Proliferation of databases

Another problem is the sheer proliferation of databases in government. This is especially a problem in that some software became “too” easy to use. Individual employees, with minimal training, could create databases in FoxPro or Microsoft Access and use them for specific purposes. In 2008, the city of Seattle planned to upgrade the entire city government to Microsoft Office 2007. One component of Office 2007 is Access 2007. But Access 2007 formats are quite different from previous Access versions. We did a scan for older Access databases to determine the magnitude of the effort to do the conversion. In a single city department we found more than 25,000 Access databases, 15 times the number of employees in that department! Now, most of those databases were undoubtedly old or out of use, but this does illustrate the proliferation problem.
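A scan like the one described for Access databases might look roughly like the following Python sketch. It is not the tool the city used, which the text does not describe. The function name and the demonstration files are hypothetical, and it assumes that databases can be recognized by file extension alone (`.mdb` for pre-2007 Access files, `.accdb` for the Access 2007 format).

```python
import tempfile
from collections import Counter
from pathlib import Path

# Assumption: file extension is a good enough signal for an inventory.
ACCESS_EXTENSIONS = {".mdb", ".accdb"}


def count_access_databases(root):
    """Walk the tree under `root` and count Access files by extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in ACCESS_EXTENSIONS:
            counts[path.suffix.lower()] += 1
    return counts


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_root:
    base = Path(demo_root)
    (base / "permits.mdb").touch()
    (base / "archive").mkdir()
    (base / "archive" / "budget.MDB").touch()
    (base / "archive" / "notes.txt").touch()
    demo_counts = count_access_databases(base)
    print(dict(demo_counts))
```

Counting files is the easy part. Deciding which of the databases found are still in use, and who owns them, is where the real effort lies.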

Ethically Questionable Information (Privacy)

Legal restrictions notwithstanding, whole sets of information are ethically troubling to expose. Here are some examples:

  • In the city of Seattle, a local radio/TV station (KIRO) requested the full name, employment date, and date of birth of every city employee. KIRO was trying to determine how many employees might retire from city service, and when. But full legal name and date of birth are two of the three pieces of information (the third is Social Security number) that are necessary to steal employee identities. After considering legal action to prevent the release, the city determined that it had to release the information to KIRO.

  • Most elected officials and city departments maintain lists and databases of email addresses for use in contacting constituents. This is public information and is a common target for public disclosure requests. While requestors are not supposed to use such information for commercial purposes (e.g., sending penis- or breast-enlargement emails to those constituents), that’s hard to prevent. Typically, I advise departments to turn such information over as a paper record so that at least the requestor needs to manually enter all the addresses!

  • Employees have a right to file grievances or complaints of harassment or discrimination. Usually, governments hire outside private companies to investigate such complaints, and then render a report. But as I mentioned earlier, such investigations are public records and are open to disclosure after redaction. This is ethically troubling, at least. Employees are much less likely to discuss the details of issues if they know their identities could be inferred or revealed after the investigation ends. Public disclosure, in this case, could have a chilling effect on the investigation of harassment and discrimination. Often, governments will try to protect employees by having the private investigatory agency keep all the original source material (i.e., interview notes) and turn over only a summary report to the government, thereby preventing its release.

  • Voicemail messages are held electronically. Most governments do not, I believe, keep such messages for any length of time. But such messages could well be considered public record and made public upon request.

Ethically Questionable Information (Sharing)

More and more government data is collected and held in “open” formats—ones which are easily shared. Certainly, that data could be shared publicly on websites such as Data.gov. But the data could also be shared among government agencies, and even correlated among agencies. This leads to the possibility of creating large databases of information about individuals. If you just think about the number of interactions you have with governments, this becomes a staggering amount of information: your driver’s license, moving violations, parking tickets, pet licenses, building permits, electricity usage and bill, water usage and bill, license plate number, car make and model, property taxes, and so forth.

We’ve certainly thought about creating customizable web portals for the city government, where individuals could sign in and be presented with news from their neighborhood, for example, or an opportunity to pay their electric bill. Would we also want to give them the option to pay their parking ticket or renew their pet license? Would we want to correlate this information so that an animal control officer could be dispatched to the address on a cat owner’s electrical bill and arrest her because Fluffy is unlicensed?

Laurence Millar, former CIO of New Zealand, had one interesting and partial solution to this issue: allow sharing among agencies only when explicitly authorized by the individual. This would certainly suit my taste, as I don’t want my speeding tickets with the Washington State Patrol cross-correlated with those of the Seattle police and other agencies!
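The opt-in rule can be expressed as a default-deny check: nothing is shared between two agencies unless the individual has approved that specific pairing. The Python sketch below is purely illustrative. The `Resident` class, the `may_share` function, and the agency names are hypothetical and do not describe any New Zealand or Seattle system.

```python
from dataclasses import dataclass, field


@dataclass
class Resident:
    """Hypothetical record of the agency-to-agency shares a person has approved."""
    name: str
    approved_shares: set = field(default_factory=set)

    def authorize(self, source_agency, target_agency):
        """Record explicit consent for one direction of sharing."""
        self.approved_shares.add((source_agency, target_agency))


def may_share(resident, source_agency, target_agency):
    """Default deny: sharing is allowed only with explicit authorization."""
    return (source_agency, target_agency) in resident.approved_shares


resident = Resident("Pat Example")
resident.authorize("utility_billing", "web_portal")

print(may_share(resident, "utility_billing", "web_portal"))      # approved pairing
print(may_share(resident, "utility_billing", "animal_control"))  # never approved
```

Recording consent per pair of agencies, rather than as a single yes-or-no flag, lets a resident see an electric bill on a portal without also handing an address to animal control.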

This sort of information sharing is, of course, not unique to government. There are recent reports that social networking sites (MySpace, Facebook) might “leak” information to web search engine sites (Google, Yahoo!) so that the web browsing habits of consumers can be cross-correlated with their personal information and used for a variety of purposes, such as targeted advertising, I presume.

An interesting (or terrifying) marriage of these two data sources is public information about individuals from governments on an open data feed used by the same companies that track social networking and web browsing or online purchasing habits. “Ethically troubling,” to say the least.

Cost

In some cases, the cost to keep and expose government data is just too high to make it practical. The best example of this is email. The city of Seattle has an active email store of about 5 terabytes. That’s 5,000,000,000,000 bytes. (The contents of the printed matter of the entire Library of Congress are estimated to be 10 terabytes.) We have a rule that all email is deleted after 45 days unless the individual user explicitly archives it as being a public record or otherwise valuable. We’ve been criticized that 45 days is just too short a time—the records should be kept longer. We’ve estimated that keeping email for a year would require a message store of about 30 terabytes, and an additional initial cost of $1.8 million in storage. Furthermore, it would take at least two days just to back up a message store of this size.
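The figures quoted above imply a unit cost and a backup speed that are easy to work out. The short Python calculation below uses only the numbers in the text, plus two assumptions of mine: that the $1.8 million covers just the storage added beyond the current 5 terabytes, and that a terabyte is counted in decimal units.

```python
# Figures quoted in the text.
current_store_tb = 5                # active email store today
one_year_store_tb = 30              # estimated store if email were kept for a year
added_storage_cost_usd = 1_800_000  # additional initial storage cost
backup_days = 2                     # "at least two days" to back up the larger store

# Derived figures (assumption: decimal units, 1 TB = 1,000,000 MB).
added_tb = one_year_store_tb - current_store_tb
cost_per_added_tb = added_storage_cost_usd / added_tb
seconds_in_backup_window = backup_days * 24 * 60 * 60
backup_mb_per_second = one_year_store_tb * 1_000_000 / seconds_in_backup_window

print(f"Additional storage: {added_tb} TB")
print(f"Implied cost per added TB: ${cost_per_added_tb:,.0f}")
print(f"Implied backup throughput: about {backup_mb_per_second:.0f} MB/s at most")
```

Under those assumptions, the estimate works out to $72,000 per added terabyte and a sustained backup rate of no more than roughly 174 megabytes per second.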

Clearly, there are limits on how much data should be kept and exposed in terms of the cost to taxpayers.

Conclusion

I’m an open government advocate. I believe most of the data held by government should be freely available on the Internet for use by the public who paid for its collection and storage. I list the “toads in the road” of open government in this chapter simply to demonstrate that this is not a trivial or inexpensive task. It will take some effort (and perhaps a bulldozer) to get past some of the reasons not to share data. I do believe that, over time, as systems are upgraded and replaced, most of this data will become exposed.

We also need better systems for document management, content management, and searching. Some of those systems exist (e.g., Microsoft’s SharePoint and Oracle’s Stellent). But again, they are relatively expensive and not trivial to implement.

Finally, exposing this data does have its ethical consequences, especially if all government data is open. Data collected by private companies—for example, telecommunications companies, web e-tailers, and search companies—can be cross-correlated with the government data by those companies or others for purposes of marketing, advertising, and even criminal activity.

Yes, the road to open government data is pockmarked with many toads and bumps. Yes, most of them can eventually be overcome. But do we really want to make it that easy for everyone to obtain and use that data?

About the Author

Bill Schrier is the chief technology officer (CTO) for the City of Seattle and director of the city’s Department of Information Technology (DoIT), reporting directly to Mayor Greg Nickels. Seattle has a population of about 600,000 residents and a city government of about 11,000 employees. DoIT has 215 full-time employees and a budget of $59 million. Approximately 600 employees work in information technology units throughout city government. Bill writes a blog about the intersection of information technology and government, how they sometimes collide but often influence and change each other. He tweets at http://www.twitter.com/billschrier.
