The big data sweet spot: Policy that balances benefits and risks

Deciding what data to collect is hard when consequences are unpredictable.

By Andy Oram

November 12, 2014

Ecce Homo (source: Wikimedia Commons)

A big reason why discussions of “big data” get complicated (and why policy-makers resort to vague hand-waving, as in the well-known White House executive office report) is that its ripple effects travel fast and far. Your purchase, when recorded by a data broker, affects not only the ads and deals they offer you in the future, but also the ones they offer innumerable people around the country who share some demographic with you.

Policy-making might be simple if data collectors or governments could say, “We’ll collect certain kinds of data for certain purposes and no others” — but the impacts of data collection are rarely predictable. And if one did restrict big data that way, its value would be seriously reduced.


Follow my steps: big data privacy vs collection

Data collection will explode as we learn how to connect you to different places you’ve been by the way you walk or to recognize certain kinds of diseases by your breath.

When such data exhaust is being collected, you can’t evade consequences by paying cash and otherwise living off the grid. In fact, trying to do so may disadvantage you even more: people who lack access to sophisticated technologies leave fewer tracks and therefore may not be served by corporations or governments.

A new book by Nathan Eagle and Kate Greene, Reality Mining: Using Big Data to Engineer a Better World, suggests that human beings are sitting ducks for investigation. Examples from the book include:

We’re depressingly predictable, whether a researcher is trying to guess whom we’ll spend Saturday night with or where we go each day. Thus, once researchers discover a pattern, they need relatively little real-time data to characterize our behavior.
Data from a small number of participants can characterize a whole neighborhood. This helps all kinds of local revival efforts, but you may be alarmed that the whole world knows how much crime your neighborhood has, or even how many potholes. We all want to fix these things, but we should balance out such negative findings with statistics about positive elements of neighborhoods as well.

The authors point out that some data has to be collected at larger than ideal intervals because collecting data as often as they want would drain a cell phone’s battery. As devices come on the market that recharge themselves by collecting ambient energy from their environments, this restriction will vanish. Once we can collect data at a granularity that makes intense analysis productive, we’ll go wild tracking ourselves and everybody else.

Experimental computer scientists are trying to code up some cake that we can have and eat in comfort: preserving individuals’ privacy while mining their data collectively. Two methods, synthetic data creation and differential queries, are promoted as ways to provide statistically accurate results without revealing information about any individual. Both work by supporting only a constrained set of queries.

Synthetic data is derived from records about fake individuals that, in the aggregate, match the attributes of the real data sets. For instance, synthetic and real data may show the same number of diabetic patients in each county, and the same number of people over 65. Synthetic data is valuable for researching well-known trends in public health. Its problem is that it won’t yield new and unexpected relationships because those weren’t baked into the manufactured data.
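To make the trade-off concrete, here is a minimal sketch of one naive way to manufacture synthetic records. It is an illustration under stated assumptions, not the method any particular research group uses: the field names (county, diabetic, over_65), the toy data, and the functions marginals, synthesize, and both are all hypothetical. The fake records reproduce each county's totals exactly, but because the two attributes are shuffled independently, any link between them in the real data is lost.

```python
import random
from collections import Counter

def marginals(records):
    """Count records, diabetics, and people over 65 in each county."""
    counts = {}
    for r in records:
        c = counts.setdefault(r["county"], Counter())
        c["total"] += 1
        c["diabetic"] += r["diabetic"]
        c["over_65"] += r["over_65"]
    return counts

def synthesize(records, seed=0):
    """Build fake records whose per-county counts match the real ones.

    Each attribute is shuffled on its own, so only the aggregate counts
    survive; relationships between attributes are not preserved.
    """
    rng = random.Random(seed)
    fake = []
    for county, c in marginals(records).items():
        n = c["total"]
        diabetic = [1] * c["diabetic"] + [0] * (n - c["diabetic"])
        over_65 = [1] * c["over_65"] + [0] * (n - c["over_65"])
        rng.shuffle(diabetic)
        rng.shuffle(over_65)
        fake.extend(
            {"county": county, "diabetic": d, "over_65": o}
            for d, o in zip(diabetic, over_65)
        )
    return fake

def both(records):
    """Number of people who are both diabetic and over 65."""
    return sum(1 for r in records if r["diabetic"] and r["over_65"])

# Toy "real" data (invented): in this county every diabetic is over 65.
real = (
    [{"county": "Adams", "diabetic": 1, "over_65": 1}] * 30
    + [{"county": "Adams", "diabetic": 0, "over_65": 0}] * 70
)
fake = synthesize(real)

print(marginals(real) == marginals(fake))  # the aggregate counts agree
print(both(real), both(fake))              # the joint count usually does not
```

A researcher who already knows to ask for diabetics per county gets the right answer from the fake records. One hoping to discover that diabetes clusters among the elderly would find little trace of it, because that relationship was never baked in.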

Differential queries are still mostly experimental. They offer a set of SQL or other queries that return data based on real data, but they’re designed so that you can’t isolate the information about any individual by cleverly issuing a series of slightly changed queries. Once again, they constrain researchers more than deidentified data does. (This short article nicely compares the strengths and weaknesses of deidentification, synthetic data, and differential queries.)
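The "slightly changed queries" attack, and one common defense, can be sketched in a few lines. This example assumes the Laplace mechanism from the differential privacy literature, which is one well-known way to build such queries and not necessarily what any given system uses. The patient records, the function names, and the privacy parameter epsilon = 1.0 are all invented for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with noise calibrated to epsilon.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)

# Invented records: every fourth patient is diabetic, including "p0".
patients = [{"name": f"p{i}", "diabetic": i % 4 == 0} for i in range(200)]

def is_diabetic(r):
    return r["diabetic"]

def is_diabetic_not_p0(r):
    return r["diabetic"] and r["name"] != "p0"

# With exact answers, two nearly identical queries expose p0's status.
exact_gap = sum(map(is_diabetic, patients)) - sum(map(is_diabetic_not_p0, patients))

# With noisy answers, the gap is blurred by noise comparable to its size.
noisy_gap = (
    noisy_count(patients, is_diabetic, 1.0, rng)
    - noisy_count(patients, is_diabetic_not_p0, 1.0, rng)
)

print(exact_gap)  # 1, which reveals that p0 is diabetic
print(noisy_gap)  # varies from run to run unless the seed is fixed
```

The aggregate stays useful because the noise is small relative to a count of dozens or thousands, yet large relative to the contribution of one person. The cost is that every answered query leaks a little, so systems built this way typically limit how many queries a researcher may issue, which is part of why they constrain research.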

Potential big data policy solutions

Many researchers are pushing ahead the frontiers of policy on big data. For instance, Jules Polonetsky and Omer Tene list benefits of data processing to individuals, communities, organizations, and society. With data gathered about their customers, for example, “businesses can optimize distribution methods, efficiently allocate credit, and robustly combat fraud.” And incidents that individuals report in their neighborhoods (such as through FixMyStreet) benefit the whole community.

But the rights of individuals still need to be weighed against these benefits during data collection. An opening for more individual control is provided by the FTC in its regulation of “unfair” data practices, while the EU permits individuals to object to data uses (Article 14 of its Data Protection Directive). The web page with Polonetsky and Tene’s article points to a series by multiple contributors covering rights in data collection.

In another article, Polonetsky and Tene propose opening up data to individuals (either their own data or aggregate data) and requiring organizations to state their decision-making processes. There are also technical solutions that hide private details about individuals while telling the institutions they deal with just enough to complete a transaction (for instance, that the individual is at least 18 years old).

A number of contributors in the web series want data collectors to ask the public more about how their data can be used. Ombudsmen systems might allow us to appeal the uses of data. Maybe we’ll even arbitrate or go to court to halt its use.

Ebola as a big data frontier

Right now, one of the biggest global worries concerns the spread of communicable disease, and many hope that data processing can help predict outbreaks. Although disease-carrying critters are individually unsophisticated, collectively they sometimes show surprising trends. I wonder, for instance, whether an analysis of the spread of SARS in early 2003 would have predicted Toronto as the location of the only major outbreak outside East Asia.

According to Eagle and Greene, disease tracking started with public health records that go back as far as 1948. The book compares the relative value of Twitter and Google Flu Trends (which suffered a setback one year that should be correctable), both of which can identify the spread of annual influenza outbreaks more quickly than official data reports.

Likewise, Facebook ads have been used to conduct a health survey. By asking members of the site whether they had taken the human papillomavirus (HPV) vaccine, researchers tried to demonstrate that a campaign on Facebook could collect data not easily available to traditional public health authorities. Their success reminds us what makes social media so powerful: researchers could target ads at their desired population because members of Facebook tell it so many personal details (mostly about age and location). But the sample was characterized by several levels of self-selection (the choice to join Facebook, to click on the ad, and to complete the survey), so there are reasons to doubt whether the study was valid, even though the researchers cite other studies claiming that the samples were representative of the general population.

Eagle and Greene suggest that airline flight patterns could provide clues as to where diseases caught through casual contact will spread. For insect-borne diseases, shipping patterns may prove a better indicator than flights because mosquitos can easily hop on and off boats.

The book’s focus on disease tracking turns out to be even more relevant than the authors may have thought, as shown by the growing Ebola crisis that broke out after the book’s publication. We know that there are many undiagnosed cases of Ebola, and that health delivery systems are strained (many health care workers have died) in the affected regions, where Ebola shows no regard for the jurisdictions separating various health care systems.

Mobile phone data is already being collected to try to predict areas at risk of the epidemic. Although hopes for a cure are in the news, data may do more than drugs to stem this crisis.

Ideally, we’ll find some policy that lies at a sweet spot between “Let anybody collect any data they can get their hands on” and “Let’s regulate every little step taken by any institution.” But we haven’t gotten there. We just don’t know who wins, who loses, and what ultimately will come down the stream amidst the flood of data.

This post is part of our on-going investigations into the perils of big data, the importance of building a data culture, and the value of applying techniques from design and social science to big data.
