Recipe 3 – detecting duplicates
In this recipe, you will learn what duplicates are, how to spot them, and why it matters.
The only type of customized facet that we left out in the previous recipe is the duplicates facet. Duplicates are records that appear twice (or more) in a dataset. Keeping identical records wastes space and can create ambiguity, so we will want to remove them. The duplicates facet is an easy way to detect them, but it has a downside: out of the box, it only works on text strings (to learn how to tweak it to work on integers as well, have a look at Appendix, Regular Expressions and GREL).
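To make the idea concrete outside of the tool, here is a minimal Python sketch of what flagging duplicates amounts to: count how often each value occurs and mark every value seen more than once. This is an illustration only, not how the facet is implemented; the function name `flag_duplicates` and the sample IDs are made up for the example. Converting each value to text before counting is an assumption chosen to mirror the text-only limitation described above.

```python
from collections import Counter


def flag_duplicates(values):
    """Return one boolean per value: True if that value occurs more than once."""
    # Compare values as text so that integers and strings are handled alike
    # (an assumption made for this sketch).
    keys = [str(v) for v in values]
    counts = Counter(keys)
    return [counts[k] > 1 for k in keys]


# Hypothetical record IDs, stored as integers
record_ids = [101, 102, 102, 103, 101]
flags = flag_duplicates(record_ids)

for record_id, is_dup in zip(record_ids, flags):
    print(record_id, "duplicate" if is_dup else "unique")
```

Every occurrence of a repeated value is flagged, including the first one, so selecting the flagged rows shows all copies side by side before any of them is removed.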
Too bad then; we cannot use a duplicates facet on the Record ID column. The next best thing is to run ...