Missing data

The data description mentions that the sensors used for activity tracking were not fully reliable, so the results contain missing data. We need to explore these missing values in more detail to see how they can influence our modeling strategy.

The first question is how many missing values are in our dataset. We know from the data description that all missing values are marked by the string NaN (that is, not a number), which is now represented as Double.NaN in the RDD rawData. In the next code snippet, we compute the number of missing values per row and the total number of missing values in the dataset:

val nanCountPerRow = rawData.map { row =>
  // Count the NaN entries in this row
  row.foldLeft(0) { case (acc, v) => acc + (if (v.isNaN) 1 else 0) }
}
// Total number of NaN values across all rows
val nanTotalCount = nanCountPerRow.sum
...
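To make the counting logic easy to check without a Spark cluster, here is a minimal sketch that applies the same idea to a small local collection. The object name `NanCountSketch`, the `sampleRows` data, and the `rowsByNanCount` summary are hypothetical and only for illustration; they stand in for `rawData` and are not part of the original dataset or code.

```scala
object NanCountSketch {
  // Hypothetical stand-in for rawData: a local Seq instead of an RDD.
  val sampleRows: Seq[Array[Double]] = Seq(
    Array(1.0, Double.NaN, 3.0),
    Array(Double.NaN, Double.NaN, 6.0),
    Array(7.0, 8.0, 9.0)
  )

  // Number of NaN values in each row.
  val nanCountPerRow: Seq[Int] = sampleRows.map(row => row.count(_.isNaN))

  // Total number of NaN values across all rows.
  val nanTotalCount: Int = nanCountPerRow.sum

  // How many rows have a given number of missing values (illustrative extra).
  val rowsByNanCount: Map[Int, Int] =
    nanCountPerRow.groupBy(identity).map { case (k, rows) => k -> rows.size }

  def main(args: Array[String]): Unit = {
    println(s"NaN count per row: ${nanCountPerRow.mkString(", ")}")
    println(s"Total NaN count: $nanTotalCount")
    println(s"Rows by NaN count: $rowsByNanCount")
  }
}
```

The sketch uses `count(_.isNaN)` as a shorter equivalent of the `foldLeft` shown earlier; the same `map` should carry over to an RDD of `Array[Double]`. Note that a direct comparison such as `v == Double.NaN` would not work here, because NaN is never equal to itself, which is why `isNaN` is used.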
