Open Government by Laurel Ruma, Daniel Lathrop

Chapter 24. My Data Can’t Tell You That

Bill Allison

In April 2009, the Bureau of Labor Statistics, in its monthly Employment Situation update, reported that some 539,000 Americans lost their jobs, pushing the total number of job losses since the recession began to more than 5.7 million. Job losses were widespread across all economic sectors; the unemployment rate rose to 8.9%.[171]

The worsening employment picture contradicted projections made by a pair of advisors to the incoming administration of President Barack Obama. On January 10, 2009, in a report titled “The Job Impact of the American Recovery and Reinvestment Plan,” Christina Romer and Jared Bernstein suggested that, with a $775 billion stimulus bill, unemployment would peak at around 7%—in the fourth quarter of 2009. In fairness to Romer and Bernstein, they also cautioned that their estimates were just that—estimates. Nevertheless, actual job losses, which occurred even though Congress passed a slightly heftier $787 billion American Recovery and Reinvestment Act, galloped along much faster than the pair of advisors said would occur if Congress had done nothing. Absent a stimulus, they warned, the unemployment rate could hit 9% sometime in late 2009. In reality, the U.S. unemployment rate reached 9.4% in May.

Around the time these job loss numbers were coming out, I started looking into the Obama administration’s efforts to make stimulus spending as transparent as possible. I began with the idea of doing a quick review of the website, Recovery.gov, which was supposed to be the place to “follow every penny.” However, as often happens to investigative reporters, it wasn’t long before I had wandered far afield and started asking what was going on, not with Recovery.gov but with the Recovery Act itself. How much money had actually gotten out to states and local communities?

It didn’t take long to track down examples of projects in the works. A press release from a member of Congress boasted about dozens of local housing authorities receiving funding, including one in Mercer County, Pennsylvania, for new construction projects.

A quick review of the data available on Recovery.gov showed no record of the project. USASpending.gov, the federal government’s one-stop shopping site for information on grants and contracts, did list the $1.7 million grant, but did not show how much, if any of it, had been spent. Looking back at previous grants, one quickly finds that while USASpending.gov lists the amount of grant money awarded to all kinds of things—thousands of public housing authorities each year—it doesn’t track when spending occurs, or when it concludes. Does that mean every single federal grant is spent down to the penny, in exactly the period of time for which it’s given? Do grantees never spend less than they’re awarded? Do they ever run out of money much sooner than expected? Do they ever have to return to the federal government for additional funds? One would never know from looking at USASpending.gov. How quickly government money actually gets to the economy—of particular interest when looking at the effectiveness of the American Recovery and Reinvestment Act—isn’t something they track; their data can’t tell you that.

The How and Why of Data Collection

In this brave new era of transparent government, more and more departments and agencies are publishing more and more of the data they collect online. Yet we are finding that, for this information to be useful, it requires a great deal of analysis and explication, and that how and why the data is gathered sometimes tells us as much about government as the information itself. And sadly, one of the things we’ll also hear when we have vital questions to answer about the economy, health care, national defense, energy policy, the environment, and education is a response I’ve heard, in various forms, when I’ve asked government officials specific questions about the numbers they collect, record, analyze, summarize, correlate, and disseminate to the public: “My data can’t tell you that.”

This isn’t a particularly original insight: investigative reporters, economists, academics, and others have long found fault with the accuracy of government data. I’ve spent most of my professional life as a researcher, reporter, and editor working with government data in one form or another, from agencies ranging from the Federal Aviation Administration to the Agriculture Department’s Food Safety and Inspection Service, from the Internal Revenue Service to the Defense Advanced Research Projects Agency. I’ve found that government officials keep track of all sorts of incredibly valuable minutiae.

Years before US Airways Flight 1549 lost engine power after hitting some birds and subsequently made an unscheduled landing in the Hudson River, civil servants were collecting from pilots and airports around the country information on incidents in which birds had interfered with the operation of aircraft. (It happens, on average, 20 times per day.) Sadly, the FAA kept that data under wraps for years. Had they made it public, perhaps biologists, pilots, statisticians, naturalists, or people who just like puzzles could have put the data to good use. Perhaps someone would have come up with countermeasures that might have kept the passengers of Flight 1549 from their wet landing.

But to do that, the data has to be reasonably accurate. And federal data can be inaccurate, misleading, or downright wrong—sometimes impossibly so. One absolute certainty we’ve been living with since the advent of commercial nuclear power is that the number of spent fuel assemblies—the highly radioactive metal rods that power nuclear plants—can, for practical purposes, only increase. They are lethal, and will remain so for thousands of years. They are mostly stored in cooling tanks at the nuclear power plants where they were used. Federal efforts to come up with a permanent solution for spent nuclear fuel—a process that began with the first atomic energy projects in the 1950s—have hit another dead end with the cancellation of the Yucca Mountain project.

Even if government can’t find a final destination for the nation’s nuclear waste, one would imagine that at least it could keep track of where it is. But that’s not the case. When examining records maintained on the buildup of spent fuel assemblies at power plants across the United States for their 1985 book, Forevermore: Nuclear Waste in America (W.W. Norton & Company), Donald L. Barlett and James B. Steele looked at the monthly reports that the Nuclear Regulatory Commission issued on the inventories of spent fuel assemblies in nuclear plants. A nuclear-generating station in Dresden, Illinois, owned at the time by the Commonwealth Edison Company of Chicago (now called ComEd, a subsidiary of the Exelon Corporation), reported having 3,512 stored assemblies in December 1982, then 1,873 in February 1983, then 2,880 in March, then 2,054 in May. Numbers which a Commonwealth Edison representative said should “constantly be going up” were bouncing both ways.

This is especially troubling when one considers that the NRC and the nuclear industry had developed an elaborate system, one in which each fuel assembly was assigned its own serial number, to make certain that the nation’s deadliest industrial waste was accurately accounted for. “When the conflicting figures…were called to the commission’s attention, a spokesman said the NRC had no explanation,” Barlett and Steele reported in their book. They added, “Because the federal government issues precise numbers on nuclear energy and waste production, we have used those figures. But they should be viewed, in every case, as nothing more than approximations.”
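The inconsistency in the Dresden figures is the kind of thing a simple automated check can surface. The sketch below assumes only that an inventory of spent fuel assemblies should never shrink from one report to the next; the function name and the month labels are illustrative, and the counts are the four quoted above.

```python
# Spent fuel assembly counts reported for the Dresden station, as quoted
# above from Barlett and Steele's reading of the NRC reports.
dresden_counts = [
    ("1982-12", 3512),
    ("1983-02", 1873),
    ("1983-03", 2880),
    ("1983-05", 2054),
]


def find_decreases(series):
    """Return (earlier, later, change) for every step where the count fell."""
    drops = []
    for (prev_label, prev_value), (label, value) in zip(series, series[1:]):
        if value < prev_value:
            drops.append((prev_label, label, value - prev_value))
    return drops


for earlier, later, change in find_decreases(dresden_counts):
    print(f"{earlier} -> {later}: inventory fell by {-change} assemblies")
```

A check like this can only say that something is wrong. It cannot say which of the reported figures, if any, is the right one.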

Federal Data: Approximations Galore

That sort of caution is a prerequisite when approaching federal data. OMB Watch, a Washington, D.C.-based nonprofit group, keeps a close eye on federal contract and grant data. They built, with a grant from Sunlight Foundation, a database called FedSpending.org—which became the model for the government’s primary site for publishing information on contracts and grants, USASpending.gov. The folks at OMB Watch noticed an anomaly in how their official government counterpart was handling companies that, through mergers, sales of business units, or spinoffs, acquire new parent companies. Halliburton, which spun off its Kellogg, Brown and Root subsidiary in 2007, was no longer listed as KBR’s parent in any of the years preceding the breakup. The government in essence backdated KBR’s emergence as a separate business, so anyone searching for Halliburton’s government contracts would find no references to those of its controversial subsidiary.

Exactly how that problem came about is unclear—it may be a programming error, or it might be a methodological problem. Data can also go awry due to its source. In 1973, journalist Jessica Mitford noted in her seminal book Kind & Usual Punishment: The Prison Business (Knopf) the impact that local officials can have on the FBI’s Uniform Crime Reports. These statistics, currently compiled from reports sent in from some 17,000 jurisdictions that include big-city police forces, state law enforcement agencies, university public safety offices, and small-town sheriffs, are used to justify tougher federal crime laws and larger budgets for law enforcement at the federal, state, and local levels. Yet the numbers are subject to wild fluctuations based on who collects them. “Much depends on the local police chief,” Mitford wrote, “thus there was an 83 percent increase in ‘major crimes known to the police’ in Chicago between 1960 and 1961 when a zealous new chief revised reporting procedures.”

Such difficulties persist to the present. Consider an August 8, 2008 article by Ryan Gabrielson of The East Valley Tribune that informed readers of a dispute between the mayor of Phoenix and a local law enforcement official.[172] Mayor Phil Gordon cited the FBI Uniform Crime Reports, which showed that violent crime rates had increased from 2006 to 2007 in Maricopa County, to criticize the county’s sheriff, Joe Arpaio, a controversial figure who, among other things, once marched prisoners from one county jail to another wearing nothing but underwear and flip-flops. Gordon was less concerned about underwear than he was about a policy Arpaio adopted in 2006, when the sheriff told his deputies to detain illegal immigrants after the Arizona state legislature passed a law authorizing local police to do so. Gordon claimed that FBI statistics showed Arpaio’s immigration enforcement diverted resources from the first duty of the sheriff’s office: protecting Maricopa County’s citizens.

Arpaio, who styles himself the “toughest sheriff in America,” quickly fired back. His office said that, in the first seven months of 2008, violent crime had fallen 10% compared to the same period from the previous year. Lisa Allen, a spokesperson for the sheriff’s office, dismissed Gordon’s claims by saying, “I don’t know where that man gets his information.” The mayor responded by invoking the prestige and authority of the nation’s top law enforcement agency: “I will rely on the FBI numbers, and not any other numbers, to judge,” he told The East Valley Tribune.

In fact, Gordon was relying on Arpaio. The ultimate sources of the FBI’s numbers are local law enforcement agencies, including the Maricopa County sheriff’s office. For the record, doubts about Arpaio’s priorities were the subject of a series of articles by Tribune reporters Ryan Gabrielson and Paul Giblin, who won a Pulitzer Prize for their investigations into the Maricopa County sheriff’s office.[173] In a series of articles published in July 2008, they found that the office was overwhelmed with increasing numbers of unsolved crimes and exploding budgets for overtime pay for immigration enforcement duties while citizens endured longer response times for emergency calls. But questioning Arpaio’s numbers—those that showed his performance had improved—would be difficult. As The East Valley Tribune noted, “In fact, there is no way to independently verify the sheriff’s office numbers. The county also does not audit or attempt to verify the statistics.”

Unaudited, unverified statistics abound in government data, particularly when outside parties—local government agencies, federal lobbyists, campaign committees—collect the data and turn it over to the government. Here is the opening paragraph of a story based on data currently available on a government website that you’ll never read, and for a very good reason:

Edward Newberry, a registered lobbyist with the powerhouse firm of Patton Boggs, violated campaign finance laws by personally contributing more than $11 million to the campaigns of a half dozen members of Congress and the presidential campaign of Sen. John McCain. Newberry, whose clients include universities, municipal governments and private companies doing business with the government, wrote checks ranging from $1 million to $2.3 million—some one thousand times more than the legal limit of $2,300 for an individual contribution—in the first five months of the 2008 election cycle.

Now, let’s profusely apologize to Mr. Newberry, who violated no law. His contributions actually ranged between $1,000 and $2,300—absolutely legal in the 2008 cycle. But when he or someone at his firm submitted to the Senate Office of Public Records his form LD-203—a relatively new report that lobbyists must file listing their campaign contributions—somehow those $1,000 and $2,300 contributions ended up with an extra “000” tacked on to the end. A corrected form was filed within a week, but the Senate database to this day contains both the original, faulty filing showing millions in contributions as well as the corrected one showing thousands—with no indication of which set of contributions is the proper one. On the plus side, the data is available in XML format, so at least all the errors are easily read by machines—anyone pulling down the raw Senate feed would get both Mr. Newberry’s actual contributions and the inflated ones. Sadly, the only way to extract meaningful information from those records is to go through them, line by line, and eyeball each entry.

That’s something the Center for Responsive Politics (CRP) has been doing for decades now with the campaign finance disclosure data published by the Federal Election Commission. Over the years, CRP has raised millions of dollars to hire dozens of researchers to literally eyeball record after record, standardizing the names of donors and their employers, matching subsidiaries to parents, and coding by industry, so that journalists and the public can make some sense of what little information federal election law requires campaigns to disclose about those who fund them (see Chapter 20).
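Some of that eyeballing can at least be prioritized by machine. The sketch below is a rough illustration, not a description of how the Senate database or CRP works: the record layout (`filing`, `recipient`, `amount`) and the sample rows are hypothetical, and the only facts taken from the text are the $2,300 individual limit for the 2008 cycle and the extra "000" pattern in the faulty LD-203 filing.

```python
LEGAL_LIMIT = 2300  # individual contribution limit in the 2008 cycle, per the text


def flag_suspect_contributions(records, limit=LEGAL_LIMIT):
    """Split records into (plausible, suspect) by comparing amount to the limit.

    A suspect record whose amount divided by 1,000 would fall within the
    limit gets a note, since that matches the extra-"000" error.
    """
    plausible, suspect = [], []
    for rec in records:
        if rec["amount"] <= limit:
            plausible.append(rec)
            continue
        note = "over limit"
        if rec["amount"] % 1000 == 0 and rec["amount"] // 1000 <= limit:
            note = "over limit; possibly 1,000 times the intended amount"
        suspect.append({**rec, "note": note})
    return plausible, suspect


# Hypothetical rows standing in for an original filing and its correction.
sample_records = [
    {"filing": "original", "recipient": "Campaign A", "amount": 2_300_000},
    {"filing": "original", "recipient": "Campaign B", "amount": 1_000_000},
    {"filing": "corrected", "recipient": "Campaign A", "amount": 2_300},
    {"filing": "corrected", "recipient": "Campaign B", "amount": 1_000},
]

plausible, suspect = flag_suspect_contributions(sample_records)
for rec in suspect:
    print(rec["filing"], rec["recipient"], rec["amount"], "-", rec["note"])
```

A filter like this narrows the pile that a person has to read. It does not replace the reading, because nothing in the data itself says which filing supersedes the other.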

Good Data Doesn’t Mean Good Results

But even meticulously going through line after line of government data can’t guarantee that one will end up with good results. Sunlight Foundation undertook a project a while back called Fortune 535. Our goal was pretty simple: we wanted to see whether some members of Congress had gotten rich during their years of public service. To do so, we tried to use the personal financial disclosure forms that members of Congress file each year. Since 1978, members have had to list each individual asset they and their spouses own, the debts they’ve incurred, their sources of outside income, and other financial information on a form that’s publicly available (one major asset is excluded: homes). By comparing each member’s first filing with the most recent filing, we reasoned, we should be able to show whose pockets had gotten deeper while in office.

We knew at the outset we’d have to consider all kinds of variables, everything from the rate of inflation and the average return on investments to whether a member’s spouse had inherited money. But what we hadn’t counted on was that the way Congress has required members to disclose their information makes it virtually impossible to answer the question we wanted to answer.

Personal financial disclosure forms have always required members to value their assets within broad ranges—say, between $1 and $1,000, $1,001 and $15,000, $15,001 and $50,000, and so on. Thus, a member won’t report 500 shares of Ford Motor Company stock, but rather Ford Motor Company stock worth somewhere between $1,001 and $15,000. In theory, one should be able to calculate lower and higher end ranges for the assets over time—a member may have been worth between $265,005 and $450,000 in 1978 and between $13,500,048 and $45,500,000 in 2007. The first problem we encountered: Congress changed those ranges several times, so forms filed before 1995 couldn’t be compared with later forms.

Let’s say a member had an asset worth $10 million in 1978. He would have reported it on his personal financial disclosure form as being worth more than $250,000. In 1995, let’s say the asset had appreciated to $19 million. It would be reported as being worth more than $5 million. Let’s say the asset depreciated—maybe it was stock in an auto company or a newspaper chain. By 2007—the year we did our Fortune 535 research—the asset is worth a little more than half of what it was in 1978; say, a little more than $5 million. While in reality the member had lost nearly half his investment (we’re not even adjusting for inflation yet), it would be reported as an asset that was worth more than $250,000 in 1978 and now is worth more than $5 million—a great return on investment if you can get it, but one that doesn’t reflect reality.
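To see how the brackets manufacture that phantom gain, here is a minimal sketch of the reporting rule. It is a simplification: the first three brackets and the two open-ended top categories (more than $250,000 and more than $5 million) come from the text, while everything between $50,001 and the top threshold is collapsed into a single placeholder bracket, which is an assumption made only to keep the example short.

```python
import bisect


def make_schedule(top_threshold):
    """Lower bounds of each bracket; the last one is open-ended.

    The middle bracket (50,001 up to the top threshold) is a placeholder,
    not the real schedule.
    """
    return [1, 1_001, 15_001, 50_001, top_threshold + 1]


SCHEDULE_1978 = make_schedule(250_000)     # top category: more than $250,000
SCHEDULE_LATER = make_schedule(5_000_000)  # top category: more than $5 million


def reported_bracket(value, schedule):
    """Return (low, high) for the bracket holding value; high is None if open-ended."""
    i = bisect.bisect_right(schedule, value) - 1
    if i < 0:
        raise ValueError("value is below the lowest reportable bracket")
    low = schedule[i]
    high = schedule[i + 1] - 1 if i + 1 < len(schedule) else None
    return low, high


true_1978, true_2007 = 10_000_000, 5_100_000
print(reported_bracket(true_1978, SCHEDULE_1978))   # floor of the 1978 filing
print(reported_bracket(true_2007, SCHEDULE_LATER))  # floor of the 2007 filing
```

The true value falls by almost half, yet the reported floor rises from just over $250,000 to just over $5 million, because the only thing that changed on paper was the top of the schedule.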

There were other problems as well. Members report the value of their assets as of a date of their choosing in the month of December. They report their debts as well, but at the high point of their liability. A member who borrows $2 million in January and pays it all off by July will still report that she had a liability of between $1 million and $5 million. Consider the example of Nancy Pelosi, the Speaker of the House and second in line for the presidency. According to her 2007 personal financial disclosure form, she and her husband are either tottering on the brink of bankruptcy with a net worth of negative $9 million, or they enjoy a robust fortune of $86 million.
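The spread between those two figures follows from simple interval arithmetic: the lowest possible net worth pairs the bottom of every asset range with the top of every liability range, and the highest possible net worth does the reverse. The sketch below uses made-up ranges, not figures from any member's actual filing.

```python
def net_worth_bounds(assets, liabilities):
    """Return (lowest, highest) net worth implied by reported (low, high) ranges."""
    lowest = sum(low for low, _ in assets) - sum(high for _, high in liabilities)
    highest = sum(high for _, high in assets) - sum(low for low, _ in liabilities)
    return lowest, highest


# Hypothetical filing: two assets and one liability, each reported as a range.
assets = [(1_000_001, 5_000_000), (15_001, 50_000)]
liabilities = [(1_000_001, 5_000_000)]

lowest, highest = net_worth_bounds(assets, liabilities)
print(f"Net worth is somewhere between ${lowest:,} and ${highest:,}")
```

Even with only three line items, the result straddles zero, which is the rags-or-riches ambiguity in miniature.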

The House Ethics Manual tells us that financial disclosure is “the preferred method of regulating possible conflicts of interest” for members of Congress, but when the information disclosed is so vague that one can’t tell whether it’s indicating rags or riches, it’s of little utility. And trying to compare disclosures year to year to track changes in net worth—to see whether a member’s official actions might have benefited his bottom line (the whole point of the system)—is a bit like trying to compare apples to oranges to glockenspiels. The data can’t tell you whether a member of the House is rich or broke, or whether a senator made or lost millions. It sometimes seems that the more reasonable the question is, the more likely it is for government data to be unable to answer it.

This brings us back to the American Recovery and Reinvestment Act and its hundreds of billions of stimulating dollars ready to reinvigorate our ailing economy—and a very simple question. As unemployment continued to rise during the first half of 2009, past 8%, past 9%, more and more people began asking, “Where is the money?”

Back in early May, when the April jobs report put the unemployment rate at 8.9%, Recovery.gov, the website that was supposed to answer that question—“every American will be able to see how and where we spend taxpayer dollars,” President Obama said when it was launched—proclaimed that $55 billion in stimulus funds had already been spent. While the site didn’t list any of the recipients of these funds, it did break it down by program. Additional Medicaid funding for states—some $29 billion worth—received the lion’s share of the early stimulus dollars. USASpending.gov, another site that tracks government spending and tracks recipients, listed the California Department of Health and Human Services, which administers the state’s Medi-Cal program, as the top recipient of health care stimulus funding through April, with just less than $3.3 billion spent…well, not really spent.

Tony Cava, a spokesperson at the Department of Health and Human Services, said that the money had been “obligated,” which means the federal government has promised to pay it out when the state asks for it. In other words, the money is sitting in an account in Washington, not in the coffers of the state—or in the accounts receivable of hospitals, the checking accounts of doctors and nurses and orderlies who treat Medi-Cal patients, or the cash registers of the businesses they patronize. When they bill for their services, the state will pass on to the federal government a bill for the federal share. “It’s not like we’re getting a check for $3.3 billion,” Cava explained.[174]

The picture is just as murky when it comes to shovel-ready projects. Search USASpending.gov for Recovery funds going to the Mercer County Housing Authority in Pennsylvania, and you’ll find an award for $1,703,727 from the Department of Housing and Urban Development’s Capital Fund Program. On March 18, 2009, local housing authorities across the country started receiving letters from HUD, informing them of the stimulus money they’d receive from the Capital Fund Program, which pays for development and modernization of public housing.

That still leaves the question of when the money will actually be spent. Jim Cassidy, director of the HUD Office of Public Housing in Pittsburgh, says the timeline goes like this: by March 17, 2010 (a year after the initial award), Mercer County Housing Authority, and the rest of the recipients of stimulus funds, are required to have legally binding contracts signed to do the work. By March 17, 2011, they must have paid out 60% of the funds, and one year later, 100%.
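Cassidy's timeline can be turned into a small calculation showing how little of an award has to be paid out early on. This is a sketch under the assumption that the milestones work exactly as paraphrased above; the constant and function names are illustrative, and the award amount is the Mercer County figure from USASpending.gov.

```python
from datetime import date

AWARD = 1_703_727  # Mercer County Housing Authority award, in dollars

CONTRACT_DEADLINE = date(2010, 3, 17)  # binding contracts must be signed
PAYOUT_MILESTONES = [                  # (deadline, cumulative percent paid out)
    (date(2011, 3, 17), 60),
    (date(2012, 3, 17), 100),
]


def required_payout(award, as_of):
    """Minimum dollars that must have been paid out by the given date."""
    pct = 0
    for deadline, milestone_pct in PAYOUT_MILESTONES:
        if as_of >= deadline:
            pct = milestone_pct
    return award * pct / 100


for day in (date(2009, 7, 14), date(2011, 3, 17), date(2012, 3, 17)):
    print(day.isoformat(), f"${required_payout(AWARD, day):,.2f}")
```

Under these rules, nothing is required to have been paid out in mid-2009, and the first payout milestone is still more than a year and a half away.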

So, how is this playing out on the ground? Beth Burkhart, the administrative director of the Mercer County Housing Authority, said they decided pretty quickly how to spend their stimulus funds—converting efficiency units in the county’s Vermeire Manor retirement homes to one-bedroom apartments. They have some firms lined up to do some of the project, but they were still accepting bids for plumbing, electrical, HVAC, and general contracting work through July 14, 2009. That means that almost none of that $1.7 million had reached the guys with the shovels, wire strippers, duct tape, and hammers.

That kind of information isn’t what government provides on USASpending.gov—instead, it tracks the amount of money awarded by government to contractors and grantees. One can find tons of data about how much stimulus money government has decided to give out, but very little on how much of it has actually been spent.

When Vice President Joe Biden released his report on the first 100 days of the stimulus program, called “100 Days, 100 Projects,” he noted in the introduction that “we have obligated more than $112 billion.” He went on to describe some of the good works that the stimulus funds had launched—the New Hampshire company hiring back workers to take on a road project, the Florida school districts hiring back teachers, the first road project in Illinois to cause new hires. Not once in his report did Biden cite Recovery.gov as a source of his information.


No citizen should have to rely on the word of Joe Biden (or any other politician) to judge the efficacy of government programs. Being able to see for oneself with a few clicks of the mouse—to know, for example, whether there’s a Superfund site near the home one is thinking of buying—is the great promise of online, transparent government. But if half the Superfund sites aren’t listed in the data, or are in the wrong place because of transposed digits in the zip codes (a common federal data problem), one might end up owning a dream home next to a toxic sludge hole.

That means that auditing government data—determining what’s collected, how it’s collected, what it’s used for, and how accurate it is—should be a priority. Certainly, government should take up the lion’s share of this work, but the public, the press, and academics also have a crucial role to play in finding bad data.

So, while it’s unlikely to top a list of voter concerns in any poll, the quality of federal data—what they get wrong and what they leave out—is rapidly becoming a critical issue for the country. We can’t create data-driven decision-making processes when the data itself is unreliable. Whether it’s bad data on crime rates, spending programs, or the disposition of nuclear waste, it’s awfully hard to make decisions when you’re basing them on faulty information.

About the Author

Bill Allison is the editorial director at the Sunlight Foundation. A veteran investigative journalist and editor for nonprofit media, Bill worked for the Center for Public Integrity for nine years, where he coauthored The Cheating of America with Charles Lewis (Harper Perennial), was senior editor of The Buying of the President 2000 (Harper Perennial), and was coeditor of the New York Times bestseller The Buying of the President 2004 (Harper Paperbacks).
