Let's check whether our dataset contains any missing values:
print(df.isnull().sum())
We'll see the following output showing the number of missing values in each column:
We can see that only five rows (out of 500,000) contain missing data. At just 0.001% of the dataset, missing data is not a significant problem here, so let's go ahead and remove those five rows:
df = df.dropna()
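If you'd like to confirm numbers like these for yourself, the following is a minimal sketch of one way to do it. It uses a small made-up DataFrame called toy (the column names col_a and col_b are hypothetical and are not taken from our dataset) to count missing values per column, work out what share of rows is affected, and check how many rows remain after dropping them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset (assumption:
# the column names and values here are invented for illustration only)
toy = pd.DataFrame({
    "col_a": [7.5, np.nan, 12.0, 9.0, 4.5],
    "col_b": [1.0, 2.0, np.nan, 1.0, 3.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of rows with at least one missing value, and their share of all rows
rows_with_missing = int(toy.isnull().any(axis=1).sum())
pct_rows_missing = 100 * rows_with_missing / len(toy)

# Drop every row that contains at least one missing value
cleaned = toy.dropna()

print(missing_per_column)
print(f"{rows_with_missing} of {len(toy)} rows ({pct_rows_missing:.1f}%) have missing data")
print(f"{len(cleaned)} rows remain after dropna()")
```

Note that isnull().sum() counts missing values per column, whereas isnull().any(axis=1) counts affected rows. The two can differ when a single row has missing values in more than one column, so it is the row count that tells us how much data dropna() will remove.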
At this point, we should also check the data for outliers. In a dataset as massive as this, there are bound to be outliers, which can skew our model. Let's ...