Chapter 30. Anonymizing Data Is Really, Really Hard
Damian Gordon
Data analytics holds the promise of a more profound and complete understanding of the world around us. Many have claimed that, because data is now ubiquitous, it has finally become possible to automate everything from value creation to organizational adaptability. Achieving this requires large quantities of data about people and their behaviors. But there is a balance to be struck between the need for this very detailed data and the right of individuals to maintain their privacy. One approach to this challenge is to remove the key identifiers from a dataset, sometimes called the “name data,” which typically include fields such as Name, Address, and Social Security Number. These are the fields that would appear to uniquely identify an individual. Unfortunately, a wide range of techniques allows others to de-anonymize such data.
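To see why removing the name data alone may not be enough, consider a minimal sketch in Python. Everything in it is an assumption made for illustration: the records are invented, the remaining fields (`zip`, `birth_year`, `gender`) are hypothetical examples of what might be left behind, and the helper names `strip_name_data` and `unique_fraction` are not taken from any library. The sketch drops the name fields and then measures what fraction of the remaining rows is still one of a kind.

```python
from collections import Counter

# Hypothetical records; every value below is invented for illustration.
records = [
    {"name": "Person A", "address": "1 First St", "ssn": "000-00-0001",
     "zip": "10001", "birth_year": 1980, "gender": "F"},
    {"name": "Person B", "address": "2 Second St", "ssn": "000-00-0002",
     "zip": "10001", "birth_year": 1980, "gender": "F"},
    {"name": "Person C", "address": "3 Third St", "ssn": "000-00-0003",
     "zip": "10002", "birth_year": 1975, "gender": "M"},
    {"name": "Person D", "address": "4 Fourth St", "ssn": "000-00-0004",
     "zip": "10003", "birth_year": 1990, "gender": "F"},
]

# The fields treated as "name data" in this sketch.
NAME_FIELDS = {"name", "address", "ssn"}


def strip_name_data(record):
    """Return a copy of the record without the name fields."""
    return {k: v for k, v in record.items() if k not in NAME_FIELDS}


def unique_fraction(rows):
    """Fraction of rows whose remaining field values match no other row."""
    keys = [tuple(sorted(row.items())) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


anonymized = [strip_name_data(r) for r in records]
fraction = unique_fraction(anonymized)
print(f"{fraction:.0%} of rows are still unique after removing name data")
```

In this toy dataset, two of the four rows remain unique on the leftover fields. A unique row is not identified by itself, but anyone who holds another dataset containing the same fields alongside a name could potentially link the two.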
Some datasets can be de-anonymized by very rudimentary means; for example, some individuals in a dataset of anonymous movie reviews were identified simply by searching for similarly worded reviews on websites that are not anonymous, such as IMDb. In another case, AOL released a list of 20 million web search queries it had collected, and two ...