Using OpenRefine by Max De Wilde, Ruben Verborgh

Recipe 3 – detecting duplicates

In this recipe, you will learn what duplicates are, how to spot them, and why it matters.

The only type of customized facet that we left out in the previous recipe is the duplicates facet. Duplicates are records that appear twice (or more) in a dataset. Keeping identical records wastes space and can create ambiguity, so we will want to remove them. The duplicates facet is an easy way to detect them, but it has a downside: out of the box, it only works on text strings (to learn how to tweak it to work on integers as well, have a look at Appendix, Regular Expressions and GREL).

Too bad, then: we cannot use a duplicates facet on the Record ID column. The next best thing is to run ...
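
To make the idea of duplicate detection concrete outside OpenRefine, here is a minimal Python sketch. It is not how OpenRefine implements the duplicates facet; it only illustrates the principle of counting how often each value occurs and flagging those that occur more than once. The function names `flag_duplicates` and `as_text` and the sample values are hypothetical. Converting numbers to text first is one plausible workaround for numeric columns, not necessarily the one described in the Appendix.

```python
from collections import Counter


def flag_duplicates(values):
    """Return one boolean per value: True if that value occurs more than once."""
    counts = Counter(values)
    return [counts[v] > 1 for v in values]


def as_text(values):
    """Convert every value to a string (assumed workaround for numeric columns)."""
    return [str(v) for v in values]


# Hypothetical sample column of text identifiers.
text_ids = ["1001", "1002", "1001", "1003"]
text_flags = flag_duplicates(text_ids)

# Hypothetical numeric column, converted to text before checking.
numeric_ids = [7, 8, 7]
numeric_flags = flag_duplicates(as_text(numeric_ids))
```

Every row whose flag is true has at least one twin elsewhere in the column, which is the same yes/no split that a duplicates facet presents.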
