- Identify the outliers in the data: For a small dataset with just a few features, we can spot outliers/noise via manual inspection. For a dataset with a large number of features, we can first perform Principal Component Analysis (PCA) to project the data onto fewer dimensions, where outliers are easier to spot, as shown in the following code:
INDArray factor = org.nd4j.linalg.dimensionalityreduction.PCA.pca_factor(inputFeatures, projectedDimension, normalize);
INDArray reduced = inputFeatures.mmul(factor);
- Use a schema to define the structure of the data: You can download a customer churn dataset from https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling/downloads/bank-customer-churn-modeling.zip/1. The following is an example of a basic schema for that dataset:
Schema schema = new Schema.Builder() ...
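For the small-dataset case in the first step, manual inspection can be backed by a quick numeric check. The following is a minimal sketch in plain Java that flags values by z-score (distance from the mean in units of standard deviation). This is a swapped-in, simple technique rather than the PCA approach shown above. The class name `ZScoreOutliers`, the method `findOutliers`, the threshold of 2.0, and the sample credit scores are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ND4J or DataVec.
class ZScoreOutliers {

    // Returns the indices of values whose absolute z-score exceeds the threshold.
    static List<Integer> findOutliers(double[] values, double threshold) {
        List<Integer> outliers = new ArrayList<>();
        if (values.length == 0) {
            return outliers;
        }

        // Mean of the feature column.
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;

        // Population standard deviation (divides by n, not n - 1).
        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(squaredDiffs / values.length);

        // A constant column has no spread, so nothing can be flagged.
        if (std == 0.0) {
            return outliers;
        }

        for (int i = 0; i < values.length; i++) {
            double z = Math.abs(values[i] - mean) / std;
            if (z > threshold) {
                outliers.add(i);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        // Made-up sample: seven plausible credit scores and one obvious data-entry error.
        double[] creditScores = {600, 620, 610, 630, 615, 605, 625, 9999};
        System.out.println(findOutliers(creditScores, 2.0)); // prints [7]
    }
}
```

Note that with the population standard deviation, the largest possible z-score in a sample of n values is the square root of n - 1, so a common threshold such as 3.0 can never fire on very small samples. For a column of eight values, as in the sketch, the threshold has to stay below roughly 2.65.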