The emp_title column transformation

The first column, emp_title, describes the employment title. However, it is not unified-there are multiple versions with the same meaning ("Bank of America" versus "bank of america") or a similar meaning ("AT&T" and "AT&T Mobility"). Our goal is to unify the labels into a basic form, detect similar labels, and replace them by a common title. The theory is the employment title has a direct impact on the ability to pay back the loan.

The basic unification of labels is a simple task-transform labels into lowercase form and drop all non-alphanumeric characters ("&" or "."). For this step, we will use the Spark API for user-defined functions:

val unifyTextColumn = (in: String) => {if (in != null) in.toLowerCase.replaceAll( ...

Get Mastering Machine Learning with Spark 2.x now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.