In a nutshell, here is what you need to do to build the schema for your datasets:
- Understand your data well. Identify the noise and signals.
- Capture features and labels. Identify categorical variables.
- Identify the categorical features to which one-hot encoding can be applied.
- Pay attention to missing or bad data.
- Add each feature with the Builder method that matches its type. For integer features, use addColumnInteger() for a single column or addColumnsInteger() for several columns at once. Use the corresponding Builder methods for other data types.
- Add categorical variables using addColumnCategorical().
- Call the build() method to build the schema.
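The builder steps above can be sketched in plain Java. This is a minimal, self-contained illustration of the pattern only: SchemaSketch is a hypothetical stand-in written for this example, not the DataVec Schema class, and the column names (CustomerId, Age, Tenure, Balance, Geography, Exited) are invented. Only the method names addColumnInteger(), addColumnsInteger(), addColumnCategorical(), and build() come from the list above; addColumnDouble() is assumed here as the "respective Builder method" for a floating-point feature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical miniature schema used only to illustrate the builder steps.
// It is NOT the DataVec Schema class; it just mirrors the method names in the text.
final class SchemaSketch {
    // Column name -> type description, kept in the order the columns were added.
    private final Map<String, String> columns;

    private SchemaSketch(Map<String, String> columns) {
        this.columns = Collections.unmodifiableMap(new LinkedHashMap<>(columns));
    }

    List<String> columnNames() {
        return new ArrayList<>(columns.keySet());
    }

    String typeOf(String name) {
        return columns.get(name);
    }

    int numColumns() {
        return columns.size();
    }

    static final class Builder {
        private final Map<String, String> columns = new LinkedHashMap<>();

        // Registers one column and rejects duplicate names.
        private Builder add(String name, String type) {
            if (columns.putIfAbsent(name, type) != null) {
                throw new IllegalArgumentException("Duplicate column: " + name);
            }
            return this;
        }

        // Single integer feature.
        Builder addColumnInteger(String name) {
            return add(name, "Integer");
        }

        // Several integer features in one call.
        Builder addColumnsInteger(String... names) {
            for (String name : names) {
                add(name, "Integer");
            }
            return this;
        }

        // Assumed counterpart for floating-point features.
        Builder addColumnDouble(String name) {
            return add(name, "Double");
        }

        // Categorical feature together with its allowed states.
        Builder addColumnCategorical(String name, String... states) {
            return add(name, "Categorical" + Arrays.asList(states));
        }

        SchemaSketch build() {
            return new SchemaSketch(columns);
        }
    }

    // Example schema with made-up column names; the label (Exited) is declared too.
    static SchemaSketch exampleSchema() {
        return new Builder()
                .addColumnInteger("CustomerId")
                .addColumnsInteger("Age", "Tenure")
                .addColumnDouble("Balance")
                .addColumnCategorical("Geography", "France", "Germany", "Spain")
                .addColumnInteger("Exited")
                .build();
    }
}
```

Calling SchemaSketch.exampleSchema().columnNames() returns the columns in the order they were added, which is why the sketch uses an insertion-ordered map: the order of the builder calls is meant to match the order of the columns in the dataset.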
Note that every feature in the dataset must be specified in the schema, including any feature you intend to skip or ignore. You need to remove the outlying ...