Dealing with missing data

A common problem in data preprocessing is handling missing data. Spark DataFrames, like pandas DataFrames, offer a wide range of operations that you can apply to them. For example, the easiest way to obtain a dataset composed of complete rows only is to discard every row that contains missing information. To do this on a Spark DataFrame, we first access the na attribute of the DataFrame and then call its drop method. The resulting table contains only the complete rows:

In: df.na.drop().show()

Out: +-------+------+-------+
     |balance|gender|user_id|
     +-------+------+-------+
     |    1.0|     M|      1|
     |   -0.5|     F|      2|
     |    0.0|     F|      3|
     |    3.0|     M|      5|
     +-------+------+-------+
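To make the filtering rule concrete, here is a minimal plain-Python sketch of the decision that dropping incomplete rows involves. It is not Spark code and not how Spark implements it. The function name drop_incomplete and the sample rows are hypothetical, introduced only for illustration. The how, thresh, and subset parameters are modeled on the arguments of the same names that PySpark's df.na.drop() accepts.

```python
def drop_incomplete(rows, how="any", thresh=None, subset=None):
    """Plain-Python model of row dropping (hypothetical helper, not Spark).

    rows   -- list of dicts, with None standing in for a missing value
    how    -- "any": drop a row if any considered value is missing
              "all": drop a row only if all considered values are missing
    thresh -- if given, keep rows with at least this many non-missing
              values (takes precedence over how)
    subset -- column names to consider; defaults to every column
    """
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            required = thresh
        elif how == "any":
            required = len(cols)  # every considered value must be present
        else:
            required = 1          # one present value is enough
        if non_null >= required:
            kept.append(row)
    return kept


# Hypothetical sample data; only the first row is complete.
rows = [
    {"balance": 1.0, "gender": "M", "user_id": 1},
    {"balance": None, "gender": "F", "user_id": 4},
    {"balance": None, "gender": None, "user_id": None},
]

complete_rows = drop_incomplete(rows)               # default rule: how="any"
not_empty_rows = drop_incomplete(rows, how="all")   # drop only fully empty rows
two_or_more = drop_incomplete(rows, thresh=2)       # at least two values present
gender_known = drop_incomplete(rows, subset=["gender"])
```

With the default settings only fully populated rows survive, which matches the behavior shown in the output above. The looser variants keep more rows at the cost of leaving some missing values in the result.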

If such an operation removes too many rows, ...

