Mastering Machine Learning with Spark 2.x by Michal Malohlava, Max Pumperla, Alex Tellez

Missing data

The data description notes that the sensors used for activity tracking were not fully reliable, so the results contain missing values. We need to explore these missing values in more detail to see how they can influence our modeling strategy.

The first question is how many missing values are in our dataset. We know from the data description that all missing values are marked by the string NaN (that is, not a number), which is now represented as Double.NaN in the RDD rawData. In the next code snippet, we compute the number of missing values per row and the total number of missing values in the dataset:

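val nanCountPerRow = rawData.map { row =>
  // Count the NaN cells in this row
  row.foldLeft(0) { case (acc, v) => acc + (if (v.isNaN) 1 else 0) }
}
// Sum the per-row counts to get the total for the whole dataset
val nanTotalCount = nanCountPerRow.sum
...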
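To try the counting logic without a running Spark cluster, here is a minimal sketch that uses a plain Scala `Seq[Array[Double]]` as a stand-in for the `rawData` RDD. The object name `MissingDataSketch` and the three sample rows are assumptions made up for illustration; they are not part of the book's dataset. The sketch uses `count(_.isNaN)`, which should give the same per-row result as the `foldLeft` shown above.

```scala
object MissingDataSketch {
  // Hypothetical stand-in for the rawData RDD: three rows of made-up readings.
  val rawData: Seq[Array[Double]] = Seq(
    Array(1.0, Double.NaN, 3.0),
    Array(Double.NaN, Double.NaN, 0.5),
    Array(2.0, 4.0, 6.0)
  )

  // Number of NaN cells in each row (equivalent to the foldLeft version).
  val nanCountPerRow: Seq[Int] = rawData.map(row => row.count(_.isNaN))

  // Total number of NaN cells across all rows.
  val nanTotalCount: Int = nanCountPerRow.sum

  def main(args: Array[String]): Unit = {
    println(s"NaN count per row: ${nanCountPerRow.mkString(", ")}")
    println(s"Total NaN count: $nanTotalCount")
  }
}
```

The same `map` call should carry over to an RDD of rows. One difference to keep in mind: on an RDD of integers, `sum` returns a `Double`, whereas on a local `Seq[Int]` it returns an `Int`.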
