Data, even “big data,” doesn’t stay in the same place: it wants to move. There’s a long history of language about moving data: we have had dataflow architectures, there's a great blog on visualization titled FlowingData, and Amazon Web Services has a service for moving data by the (literal) truckload. Although the scale and speed at which data moves have changed over the years, we’ve recognized the importance of flowing data ever since the earliest years of computing. If we’re going to think about the ethics of data and how it’s used, then, we can’t just think about the content of the data, or even its scale: we have to take into account how data flows.
In Privacy in Context, Helen Nissenbaum connects data’s mobility to privacy and ethics. For Nissenbaum, the important issue isn’t what data should be private or public, but how data and information flow: what happens to your data, and how it is used. Information flows are central to our expectations of privacy, and respecting those expectations is at the heart of data ethics. We give up our data all the time. It’s next to impossible to live in modern society without giving up data: we use credit cards to pay for groceries, we make reservations at restaurants, we fill prescriptions at pharmacies. And we usually have some sort of expectation that our data will be used. But those expectations include expectations about how the data will be used: who will have access to it, and for what purposes.
Problems arise when those expectations are violated. As Nissenbaum writes, "What people care most about is not simply restricting the flow of information but ensuring that it flows appropriately." The infamous Target case, in which Target outed a pregnant teenager by sending ad circulars to her home, is a great example. We all buy things, and when we buy things, we know that data is used—to send bills and to manage inventory, if nothing else. In this case, the surprise was that Target used this customer's purchase history to identify her as pregnant, and send circulars advertising products for pregnant women and new mothers to her house. The problem isn't the collection of data, or even its use; the problem is that the advertising comes from, and produces, a different and unexpected data flow. The data that’s flowing isn’t just the feed to the marketing contractor. That ad circular, pushed into a mailbox (and read by the girl’s father) is another data flow, and one that’s not expected. To be even more precise: the problem isn’t even putting an ad circular in a mailbox, but that this data flow isn’t well defined. Once the circular goes in the mailbox, anyone can read it.
Facebook’s ongoing problems with the Cambridge Analytica case aren’t problems of data theft or intrusion; they’re problems of unexpected data flows. Users who took the personality quiz "This Is Your Digital Life" didn’t expect their data to be used in political marketing, to say nothing of their friends’ data, which was exposed even though those friends never took the quiz. Facebook asked Cambridge Analytica to delete the data back in 2015, but apparently did nothing to determine whether the data was actually deleted, or shared further. Once data has started flowing, it is very difficult to stop it.
Data flows can be very complex. danah boyd, in the second chapter of It’s Complicated: The Social Lives of Networked Teens, describes the multiple contexts that teenagers use on social media, and their strategies for communicating within their groups in a public medium: in particular, their use of coded messages that are designed to be misunderstood by parents or others not in their group. They are creating strategies to control information flows that appear to be out of their control. Teens can’t prevent parents from seeing their Facebook feeds, but they can use a coded language to prevent their parents from understanding what they’re really saying.
Everyone who works with data knows that data becomes much more powerful when it is combined with data from other sources. Data that seems innocuous, like a grocery store purchase history, can be combined with geographic data, medical data, and other kinds of data to characterize users and their behavior with great precision. Knowing whether a person purchases cigarettes can be of great interest to an insurance company, as can knowing whether a cardiac patient is buying bacon. Data can also feed back on itself: increasing the police presence in some neighborhoods inevitably leads to more arrests in those neighborhoods, creating the appearance of more crime, which in turn seems to justify still more policing. Data flows have complex topologies: multiple inputs, outputs, and feedback loops. The question isn’t just where your data goes and how it will be shared; it’s also what incoming data will be mixed with your data.
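To make the point about combining sources concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the loyalty IDs, the purchase rows, the `medical` lookup, and the assumption that both datasets happen to share a key. It is not a description of how any real retailer or insurer works; it only shows how little code it takes to turn two unremarkable datasets into a profile.

```python
# Hypothetical purchase history: innocuous on its own.
purchases = [
    {"loyalty_id": "A17", "item": "cigarettes"},
    {"loyalty_id": "A17", "item": "bread"},
    {"loyalty_id": "B42", "item": "bacon"},
    {"loyalty_id": "C03", "item": "apples"},
]

# Hypothetical second source, assumed to be keyed by the same loyalty_id.
medical = {"B42": "cardiac patient"}

# Items an (imaginary) insurer might care about.
FLAGGED = {"cigarettes", "bacon"}


def combine(purchases, medical):
    """Join the two sources on loyalty_id and build one profile per person."""
    profiles = {}
    for row in purchases:
        profile = profiles.setdefault(
            row["loyalty_id"], {"items": set(), "condition": None}
        )
        profile["items"].add(row["item"])
    for loyalty_id, condition in medical.items():
        if loyalty_id in profiles:
            profiles[loyalty_id]["condition"] = condition
    return profiles


def of_interest(profiles):
    """Return the IDs of people who bought at least one flagged item."""
    return sorted(
        loyalty_id
        for loyalty_id, profile in profiles.items()
        if profile["items"] & FLAGGED
    )


profiles = combine(purchases, medical)
flagged_people = of_interest(profiles)
```

Neither input says "this cardiac patient buys bacon." The combined profile does, and that new fact is a data flow nobody was asked about.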
Nissenbaum argues that we shouldn’t be asking about absolute notions of what data should or shouldn’t be “private,” but about where the data can travel, our expectations about that travel, and what happens when data reaches its destination. That makes a lot of intuitive sense. A pharmacy or a grocery store collects a lot of data just to do business: again, it has to do billing, it has to manage stock. It has some control over how that data is remixed, shared, and commoditized. But it doesn't have control over how its partners ultimately use the data. It might be able to control what mailers its advertising agencies send out, but who's going to raise a red flag about an innocent circular advertising baby products? It can't control what an insurance company, or even a government agency, might do with that data: deny medical benefits? Send a social worker? In many cases, consumers won't even know that their privacy has been violated, let alone how or why; they'll just know that something has happened.
As developers, how can we understand and manage data flows according to our users' expectations? That's a complex question, in part because our own desires and expectations, as developers who are also users, differ from those of our users, and we can’t assume that users understand how their data might be put to work. Furthermore, enumerating and evaluating all possible flows, together with the consequences of those flows, is almost certainly computationally intractable.
But we can start asking the difficult questions, recognizing that we’re neither omniscient nor infallible. The problem facing us isn’t that mistakes will be made, because they certainly will; the problem is that more mistakes will be made, and more damage will be done, if we don’t start taking responsibility for data flows. What might that responsibility mean?
Principles for ethical data handling (and human experimentation in general) always stress "informed consent"; Nissenbaum’s discussion about context suggests that informed consent is less about usage than about data flow. The right question isn't, "can our partners make you offers about products you may be interested in?" but, "may we share your purchase data with other businesses?" (If so, what businesses?) Or perhaps, “may we combine your purchase data with other demographic data to predict your future purchases?” (If so, what other demographic data?)
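One way to read this in code: consent is recorded per flow (what data, going to whom, for what purpose) rather than as a single yes/no about "usage." The sketch below is an illustration under that assumption only. The `Flow` and `ConsentRecord` types, the field names, and the `release` helper are all hypothetical, and a real system would also need versioning, revocation, and an audit trail.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Flow:
    """One specific movement of data: what, to whom, and why."""
    data: str        # e.g. "purchase_history"
    recipient: str   # e.g. "marketing_partner"
    purpose: str     # e.g. "advertising"


@dataclass
class ConsentRecord:
    """The flows a user has explicitly agreed to. Anything else is denied."""
    user_id: str
    allowed: Set[Flow] = field(default_factory=set)

    def grant(self, flow: Flow) -> None:
        self.allowed.add(flow)

    def permits(self, flow: Flow) -> bool:
        return flow in self.allowed


def release(consent: ConsentRecord, flow: Flow, payload: dict) -> dict:
    """Hand over the payload only if this exact flow was consented to."""
    if not consent.permits(flow):
        raise PermissionError(f"{consent.user_id} has not agreed to {flow}")
    return payload


# The user agreed to billing, not to sharing with an advertising partner.
billing = Flow("purchase_history", "store", "billing")
partner_ads = Flow("purchase_history", "marketing_partner", "advertising")

consent = ConsentRecord(user_id="u1")
consent.grant(billing)
```

The design choice worth noticing is the default: a flow that was never asked about is refused, instead of being allowed because nobody objected.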
One way to prevent unexpected data flows is to delete the data before it has a chance to go anywhere. Deleted data is hard to abuse. A decade ago, data developers were saying "Save everything. Storage is cheap." We now understand that's naive. If data is collected for a purpose, it might be necessary to delete it when it has served its purpose—for example, most libraries delete records of the books a user has checked out after the books have been returned. Deleted data can’t be stolen, inadvertently shared, or demanded by a legal warrant. “Save everything” invites troublesome data flows.
But data deletion is easier said than done. The difficulty, as Facebook found out with Cambridge Analytica, is that asking someone to delete data doesn’t mean they will actually delete it. It isn’t easy to prove that data has been deleted; we don’t have auditing tools that are up to the task. In many cases, it’s not even clear what “deletion” means: does it mean that the data is removed from backups? Backups from which data is removed after-the-fact aren’t really backups; can they be trusted to restore the system to a known state? Reliable backups are an important (and infrequently discussed) part of ethical data handling, but they are also a path through which data can escape and continue to flow in the wild.
And deletion doesn’t always work in the users’ favor. Deleting data prematurely makes it difficult for a customer to appeal a decision; redress assumes we can reconstruct what happened to find an appropriate solution. Historically, it’s almost certainly true that more data has been deleted to preserve entrenched power than to preserve individual privacy. The ability to "undelete" is powerful, and shouldn't be underestimated. Data should be deleted as soon as it’s no longer needed, but no sooner—and determining when data really is no longer needed isn’t a trivial problem.
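The tension between deleting promptly and deleting too soon can be written down as a retention rule. The sketch below is one possible shape for such a rule, not a recommendation: the 30-day `APPEAL_WINDOW`, the `Record` fields, and the `under_appeal` flag are assumptions made for the example, and it says nothing about the harder problems of backups or copies held by third parties.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Assumed grace period during which a customer can still contest a decision.
APPEAL_WINDOW = timedelta(days=30)


@dataclass
class Record:
    record_id: str
    purpose_done_on: Optional[date]  # None while the data is still in use
    under_appeal: bool = False       # an open appeal blocks deletion


def purgeable(record: Record, today: date) -> bool:
    """True once the purpose is served, the window has passed, and no appeal is open."""
    if record.purpose_done_on is None or record.under_appeal:
        return False
    return today >= record.purpose_done_on + APPEAL_WINDOW


def purge(records: List[Record], today: date) -> Tuple[List[Record], List[str]]:
    """Split records into those to keep and the IDs of those to delete."""
    kept = [r for r in records if not purgeable(r, today)]
    removed = [r.record_id for r in records if purgeable(r, today)]
    return kept, removed


records = [
    Record("r1", date(2024, 1, 1)),                     # done long ago
    Record("r2", date(2024, 2, 20)),                    # still inside the window
    Record("r3", None),                                 # still needed
    Record("r4", date(2024, 1, 1), under_appeal=True),  # contested
]
kept, removed = purge(records, today=date(2024, 3, 1))
```

Here only `r1` is deleted. The other three survive for different reasons, which is the point: "no longer needed" is a policy decision that has to be spelled out, not a property the data carries with it.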
These aren’t problems to be solved in a short article. However, they are problems that we in the data community need to recognize and face. They won’t go away; they will become more serious and urgent as time goes on. How does data flow? What dams and levees can we create that will prevent data from flowing in unexpected or unwanted ways? And once we create those levees, what will happen when they break? That will inevitably be one of the most important stories of the next year.