November 2018
Because negative examples, that is, instances of fraud, are very rare compared to positive examples, learning algorithms struggle with induction. We can help them by providing a dataset in which the shares of positive and negative examples are comparable. This can be achieved by rebalancing the dataset.
Weka has a built-in filter, Resample, which produces a random subsample of a dataset, using sampling either with replacement or without replacement. The filter can also bias the distribution toward a uniform class distribution.
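To make the idea of biasing a sample toward a uniform class distribution concrete, here is a minimal plain-Java sketch. It is not Weka's implementation of Resample; the class `UniformResampleSketch`, the method `resampleUniform`, and the representation of the dataset as an array of integer class labels are all assumptions made for illustration. It draws row indices with replacement so that every class contributes the same number of rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of Weka: draws row indices with replacement
// so that each class contributes the same number of rows to the sample.
final class UniformResampleSketch {

    /**
     * @param labels     class label of each row (assumed encoding, e.g. 0 = fraud)
     * @param sampleSize requested number of rows in the sample
     * @param seed       seed for reproducible sampling
     * @return row indices, with sampleSize / numberOfClasses draws per class
     */
    static List<Integer> resampleUniform(int[] labels, int sampleSize, long seed) {
        if (labels.length == 0) {
            throw new IllegalArgumentException("labels must not be empty");
        }
        // Group row indices by class label, keeping first-seen class order.
        Map<Integer, List<Integer>> rowsByClass = new LinkedHashMap<>();
        for (int row = 0; row < labels.length; row++) {
            rowsByClass.computeIfAbsent(labels[row], key -> new ArrayList<>()).add(row);
        }

        // Integer division: any remainder is dropped, so the result can be
        // slightly smaller than sampleSize when it is not divisible.
        int perClass = sampleSize / rowsByClass.size();
        Random random = new Random(seed);
        List<Integer> sample = new ArrayList<>(perClass * rowsByClass.size());

        // Sampling with replacement: a rare-class row may be drawn many times.
        for (List<Integer> rows : rowsByClass.values()) {
            for (int draw = 0; draw < perClass; draw++) {
                sample.add(rows.get(random.nextInt(rows.size())));
            }
        }
        return sample;
    }
}
```

For a dataset with 2 rows of one class and 8 of the other, `resampleUniform(labels, 10, 42L)` returns 10 indices, 5 pointing at each class. Sampling with replacement is what makes this possible here, since the rare class has fewer than 5 distinct rows.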
We will proceed by manually implementing k-fold cross-validation. First, we will split the dataset into k equal folds. Fold k will be used for testing, while the other folds will be used ...
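The fold split described above can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the book's code: the class `KFoldSketch` and its methods are hypothetical, rows are represented by their indices, and the folds that are not used for testing are assumed to serve as training data, as in standard k-fold cross-validation. When the number of rows is not divisible by k, fold sizes differ by at most one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: splits row indices into k folds for cross-validation.
final class KFoldSketch {

    /** Shuffles the row indices 0..numRows-1 and deals them into k folds. */
    static List<List<Integer>> split(int numRows, int k, long seed) {
        if (k < 2 || k > numRows) {
            throw new IllegalArgumentException("k must be between 2 and numRows");
        }
        List<Integer> indices = new ArrayList<>(numRows);
        for (int row = 0; row < numRows; row++) {
            indices.add(row);
        }
        // Shuffle so that folds do not depend on the original row order.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>(k);
        for (int fold = 0; fold < k; fold++) {
            folds.add(new ArrayList<>());
        }
        // Deal rows round-robin: fold sizes differ by at most one.
        for (int i = 0; i < numRows; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    /** Collects the rows of every fold except the one held out for testing. */
    static List<Integer> trainingRows(List<List<Integer>> folds, int testFold) {
        List<Integer> training = new ArrayList<>();
        for (int fold = 0; fold < folds.size(); fold++) {
            if (fold != testFold) {
                training.addAll(folds.get(fold));
            }
        }
        return training;
    }
}
```

A full cross-validation loop would call `trainingRows(folds, i)` for each `i` from 0 to k - 1, train on those rows, and evaluate on `folds.get(i)`. Note that this sketch does not stratify by class; with a heavily imbalanced dataset, some folds may contain very few rare-class rows.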