Datasets may contain duplicate records that often must be removed before data mining can begin. For example, the same individual may appear multiple times in a dataset with different addresses. The Distinct node, located in the Record Ops palette, checks for duplicate records and identifies the cases that appear more than once in a file so that they can be reviewed, removed, or both.
A duplicate case is one that has identical data values on one or more specified fields. Any number or combination of fields may be used to define a duplicate:
- Place a Distinct node from the Record Ops palette onto the canvas.
- Connect the Sort node ...
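
The definition of a duplicate given above (identical values on a chosen set of fields) can be illustrated outside the product with a short sketch. This is plain Python, not the Distinct node's implementation; the function name `split_distinct`, the field names, and the sample records are all hypothetical and chosen only for illustration. The sketch assumes that the first record seen for each key is kept and that later matches are treated as duplicates.

```python
def split_distinct(records, key_fields):
    """Split records into first occurrences and duplicates.

    Two records count as duplicates when they hold identical values
    on every field named in key_fields (hypothetical helper).
    """
    seen = set()
    firsts, duplicates = [], []
    for rec in records:
        # Build a hashable key from the chosen fields only.
        key = tuple(rec[field] for field in key_fields)
        if key in seen:
            duplicates.append(rec)
        else:
            seen.add(key)
            firsts.append(rec)
    return firsts, duplicates


# Hypothetical sample data: the same individual with two addresses.
customers = [
    {"id": 101, "name": "Ann Lee", "address": "12 Oak St"},
    {"id": 102, "name": "Raj Patel", "address": "4 Elm Ave"},
    {"id": 101, "name": "Ann Lee", "address": "98 Pine Rd"},
]

# Keying on "id" alone flags the third record as a duplicate.
firsts, duplicates = split_distinct(customers, ["id"])

# Keying on "id" and "address" together flags nothing, because
# no two records match on both fields.
firsts_strict, duplicates_strict = split_distinct(customers, ["id", "address"])
```

The choice of key fields decides what counts as a duplicate: keying on `id` alone treats the two records for the same individual as one case, while adding `address` to the key treats them as distinct.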