Clojure Data Analysis Cookbook - Second Edition by Eric Rochester

Identifying and removing duplicate data

One problem when cleaning up data is dealing with duplicates. How do we find them? What do we do with them once we have them? While part of this process can be automated, merging duplicated data is often a manual task, because a person has to look at potential matches, determine whether they are duplicates, and decide what needs to be done with the overlapping data. We can code heuristics, of course, but at some point, a person needs to make the final call.
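As a rough illustration of the part that can be automated, the following sketch groups records by a normalized key and returns only the groups that contain more than one record, leaving the final decision to a person. The sample records, the choice of `:name` as the field to compare, and the helper names `normalize` and `candidate-duplicates` are all assumptions made for this example; they are not taken from the recipe itself.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample records; the field names are assumptions for illustration.
(def records
  [{:id 1 :name "Ada Lovelace"   :email "ada@example.com"}
   {:id 2 :name " ada  lovelace" :email "a.lovelace@example.com"}
   {:id 3 :name "Alan Turing"    :email "alan@example.com"}])

(defn normalize
  "Trims s, lower-cases it, and collapses runs of whitespace to one space."
  [s]
  (-> s
      str/trim
      str/lower-case
      (str/replace #"\s+" " ")))

(defn candidate-duplicates
  "Groups records by (key-fn record) and returns only the groups with more
  than one member. These are candidates for review, not confirmed duplicates."
  [key-fn records]
  (->> records
       (group-by key-fn)
       vals
       (filter #(> (count %) 1))))

;; Assumption: two records are worth a second look when their names match
;; after normalization. A real data set would need its own rule.
(def candidates
  (candidate-duplicates (comp normalize :name) records))
```

Here `candidates` holds a single group containing the records with `:id` 1 and 2. Their email addresses differ, so the code cannot say which value to keep, and a person still has to decide how to merge them.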

The first question that needs to be answered is what constitutes identity for the data. If you have two items of data, which fields do you have to look at in order to determine whether they are duplicates? Then, you must determine how close ...
