The data description mentions that the sensors used for activity tracking were not fully reliable, so the results contain missing data. We need to explore these missing values in more detail to see how they can influence our modeling strategy.
The first question is how many missing values are in our dataset. We know from the data description that all missing values are marked by the string NaN (that is, not a number), which is now represented as Double.NaN in the RDD rawData. In the next code snippet, we compute the number of missing values per row and the total number of missing values in the dataset:
// Count the NaN entries in each row
val nanCountPerRow = rawData.map { row =>
  row.foldLeft(0) { case (acc, v) =>
    acc + (if (v.isNaN) 1 else 0)
  }
}
// Sum the per-row counts to get the dataset total
val nanTotalCount = nanCountPerRow.sum
...
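To make the counting logic easier to check by hand, here is a minimal sketch that applies the same fold to a small local collection. It assumes a hypothetical three-row sample (sampleRows) and uses a plain Seq in place of the RDD rawData, so it runs without a Spark context. The object name NanCountSketch and the naiveCount value are illustrative only.

```scala
object NanCountSketch {
  // Hypothetical sample standing in for rawData; a local Seq replaces the RDD
  val sampleRows: Seq[Array[Double]] = Seq(
    Array(1.0, Double.NaN, 3.0),
    Array(Double.NaN, Double.NaN, 2.5),
    Array(0.5, 1.5, 2.5)
  )

  // Same fold as in the text: add 1 to the accumulator for every NaN in the row
  val nanCountPerRow: Seq[Int] = sampleRows.map { row =>
    row.foldLeft(0) { case (acc, v) => acc + (if (v.isNaN) 1 else 0) }
  }

  // Total number of missing values across all rows
  val nanTotalCount: Int = nanCountPerRow.sum

  // Comparing with == never matches, because NaN is not equal to itself;
  // this is why the fold relies on isNaN
  val naiveCount: Int = sampleRows.map(_.count(_ == Double.NaN)).sum
}
```

For this sample, the per-row counts are 1, 2, and 0, giving a total of 3, while the equality-based count stays at 0. One difference to keep in mind: on a local Seq[Int] the sum is an Int, whereas calling sum on a Spark RDD of integers returns a Double.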