This phase produces the final dataset for the modeling phase by joining data sources, cleaning and formatting the data, and engineering features.
It typically involves the following tasks:
- Identifying relevant datasets for model building.
- Documenting data joins and aggregations to construct the final dataset.
- Writing cleaning and formatting functions with useful arguments, so that you keep flexibility later in the project; for example, removing outliers by x%, or imputing missing values with the mean, median, or most frequent value.
- Treating outliers appropriately.
- Experimenting with feature engineering methods.
- Selecting the features. In general, there are three main methods for feature ...
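
As an illustration of the "functions with useful arguments" item in the list above, here is a minimal sketch of two parameterized cleaning helpers. The names `impute_missing` and `remove_outliers`, the use of `None` to mark missing values, and the percentile-style trimming rule are all assumptions made for this example, and plain Python lists are used only to keep it dependency-free; a real project would more likely work on data frames.

```python
from statistics import mean, median, mode


def impute_missing(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values.

    strategy is one of "mean", "median", or "most_frequent" (hypothetical names).
    """
    strategies = {"mean": mean, "median": median, "most_frequent": mode}
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy!r}")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = strategies[strategy](observed)
    return [fill if v is None else v for v in values]


def remove_outliers(values, pct=5.0):
    """Drop roughly pct% of the values from each tail, keeping the original order.

    This is a simple trimming rule chosen for illustration; ties at the
    cut-off values are kept, so slightly fewer points may be removed.
    """
    if not 0 <= pct < 50:
        raise ValueError("pct must be in [0, 50)")
    if not values:
        return []
    ordered = sorted(values)
    k = int(len(ordered) * pct / 100)  # number of points to cut per tail
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [v for v in values if low <= v <= high]


if __name__ == "__main__":
    raw = [1, 2, None, 2, 100]
    print(impute_missing(raw, strategy="median"))
    print(remove_outliers(list(range(1, 21)), pct=10))
```

Exposing `strategy` and `pct` as arguments means the same functions can be rerun with different settings when modeling results suggest revisiting the cleaning choices, without rewriting the pipeline.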