November 2017
Beginner to intermediate
366 pages
7h 59m
English
We will use compare.dedup to generate the features:
> rec.pairs <- compare.dedup(RLdata500
+                            ,blockfld = list(1, 5:7)
+                            ,strcmp = c(2,3,4)
+                            ,strcmpfun = levenshteinSim)
> summary(rec.pairs)

Deduplication Data Set

500 records
1221 record pairs

0 matches
0 non-matches
1221 pairs with unknown status

> matches <- rec.pairs$pairs
> matches[c(1:3, 1203:1204), ]
     id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match
1      1 174        1       NA 0.1428571       NA  0  0  0       NA
2      1 204        1       NA 0.0000000       NA  0  0  0       NA
3      2   7        1       NA 0.3750000       NA  0  0  0       NA
1203 448 497        1       NA 0.0000000       NA  0  0  0       NA
1204 450 477        1       NA 0.0000000       NA  0  0  0       NA
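The fractional values in columns such as lname_c1 are string similarity scores between 0 and 1. As a rough illustration of how a normalized Levenshtein similarity can be computed, here is a minimal base-R sketch. It assumes the similarity is defined as one minus the edit distance divided by the length of the longer string; lev_sim is a hypothetical helper written for this example, not a function from the RecordLinkage package, and the names passed to it are made up.

```r
# Hypothetical helper: normalized Levenshtein similarity in base R.
# Assumes sim = 1 - distance / length of the longer string.
lev_sim <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]     # Levenshtein edit distance
  1 - d / max(nchar(a), nchar(b))   # scale to [0, 1]; 1 means identical
}

lev_sim("MEYER", "MEIER")      # one substitution in five characters: 0.8
lev_sim("CARSTEN", "CARSTEN")  # identical strings: 1
```

A score of 0 means the two strings share nothing under this measure, which is how the 0.0000000 entries in the output above can be read.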
We have 500 records, so comparing every record with every other would require 500*(500-1)/2 = 124,750 pairs; in general, n records produce n(n-1)/2 pairs. In ...
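The arithmetic can be checked directly in R, together with a toy illustration of why blocking shrinks the number of candidate pairs. This is only a sketch: the toy data frame and its names are invented for the example, and it blocks on a single field, which is simpler than the blockfld setting used in the call above.

```r
# Number of unordered pairs among n records
n <- 500
all_pairs <- n * (n - 1) / 2           # same value as choose(n, 2)
all_pairs                              # 124750

# Pairs actually generated after blocking, taken from summary(rec.pairs)
blocked_pairs <- 1221
blocked_pairs / all_pairs              # roughly 1% of all possible pairs

# Toy illustration (made-up data): keep only pairs that share a first name
toy <- data.frame(id = 1:5,
                  fname = c("ANNA", "ANNA", "PETER", "PETER", "PETER"))
idx <- combn(nrow(toy), 2)             # all 10 index pairs, one per column
same_block <- toy$fname[idx[1, ]] == toy$fname[idx[2, ]]
sum(same_block)                        # 4 candidate pairs instead of 10
```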