97 Things About Ethics Everyone in Data Science Should Know

by Bill Franks

August 2020

344 pages

10h 23m

English

O'Reilly Media, Inc.

Chapter 30. Anonymizing Data Is Really, Really Hard

Damian Gordon

University Lecturer, Technological University of Dublin

Data analytics holds the promise of a more profound and complete understanding of the world around us. Many have claimed that because of the present-day ubiquity of data, it has become possible to finally automate everything from value creation to organizational adaptability. To achieve this, large quantities of data about people (and their behaviors) are required. But there is a balance to be struck between the need for this very detailed data and the rights of individuals to maintain their privacy. One approach to dealing with this challenge is to remove some of the key identifiers from a dataset, sometimes called the “name data,” which typically includes fields such as Name, Address, and Social Security Number. Those are the features that would appear to be the key characteristics that uniquely identify an individual. Unfortunately, there is a wide range of techniques that allows others to de-anonymize such data.
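To make the limits of this approach concrete, here is a minimal sketch in Python. It assumes a dataset held as a list of dictionaries; the field names (`name`, `address`, `ssn`, `zip`, `birth_year`, `sex`) and the records themselves are invented for illustration and do not come from any real dataset. The sketch removes the "name data" and then counts how many records are still one of a kind on the fields that remain.

```python
from collections import Counter

# Hypothetical field names standing in for Name, Address, and
# Social Security Number; a real dataset will use its own labels.
NAME_DATA = {"name", "address", "ssn"}


def strip_name_data(records):
    """Return copies of the records with the direct identifiers removed."""
    return [
        {key: value for key, value in record.items() if key not in NAME_DATA}
        for record in records
    ]


def unique_fraction(records):
    """Fraction of records whose combination of field values occurs exactly once."""
    if not records:
        return 0.0
    signatures = [tuple(sorted(record.items())) for record in records]
    counts = Counter(signatures)
    singletons = sum(1 for sig in signatures if counts[sig] == 1)
    return singletons / len(records)


# Made-up records, for illustration only.
people = [
    {"name": "Person A", "address": "1 First St", "ssn": "000-00-0001",
     "zip": "10001", "birth_year": 1980, "sex": "F"},
    {"name": "Person B", "address": "2 Second St", "ssn": "000-00-0002",
     "zip": "10001", "birth_year": 1980, "sex": "F"},
    {"name": "Person C", "address": "3 Third St", "ssn": "000-00-0003",
     "zip": "10002", "birth_year": 1975, "sex": "M"},
    {"name": "Person D", "address": "4 Fourth St", "ssn": "000-00-0004",
     "zip": "10003", "birth_year": 1990, "sex": "F"},
]

stripped = strip_name_data(people)
still_unique = unique_fraction(stripped)
```

In this toy data, two of the four stripped records still carry a combination of values that no other record shares. Anyone who knows those few facts about a person from another source could single out that person's row, even though the name, address, and Social Security number are gone.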

Some datasets can be de-anonymized by very rudimentary means. For example, some individuals in a dataset of anonymous movie reviews were identified simply by searching for similarly worded reviews on websites that are not anonymous, such as IMDb. In another case, AOL released a list of 20 million web search queries it had collected, and two ...