Summary
This chapter introduced the problem of record linkage and explained why it matters. We used the R package RecordLinkage to solve record linkage problems. We began by generating string-based and phonetic features for record pairs so that they could be processed further down the pipeline to deduplicate records. We then covered expectation maximization and weights-based methods for deduplicating those record pairs. Finally, we turned to machine learning methods for deduplication: as an unsupervised method, we discussed K-means clustering, and we then used the output of the K-means algorithm to train a supervised model.
In the next chapter we go through streaming data and its challenges. We will build ...