Biased datasets
Bias in sample datasets is often the result of the methods used to gather the data, a problem known as selection bias. For example, malware detectors are often trained on samples obtained from honeypots within the corporate security perimeter.
Honeypots are effective tools for gathering security information: they reveal the specific risks of tailored attacks to which the organization is exposed. However, the samples collected by honeypots are unlikely to represent all the different types of malware threats in the wild. Therefore, the use of honeypots may introduce selection bias into training datasets.
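One rough way to see the selection bias described above is to compare the mix of malware families in a honeypot-derived sample against a broader reference sample. The following is a minimal sketch, not a method taken from the text: the family labels and counts are invented for illustration, and the helper names family_shares and total_variation_distance are hypothetical.

```python
from collections import Counter


def family_shares(labels):
    """Return the fraction of samples belonging to each malware family."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0 means the two family mixes are identical; 1 means they share no family.
    """
    families = set(p) | set(q)
    return 0.5 * sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in families)


# Hypothetical labels, invented purely for illustration (assumption).
honeypot_labels = ["ransomware"] * 70 + ["trojan"] * 30
reference_labels = (
    ["ransomware"] * 25 + ["trojan"] * 35 + ["worm"] * 20 + ["spyware"] * 20
)

honeypot = family_shares(honeypot_labels)
reference = family_shares(reference_labels)

# Families present in the reference sample but never captured by the honeypot.
missing = sorted(set(reference) - set(honeypot))
tvd = total_variation_distance(honeypot, reference)

print("Families absent from honeypot data:", missing)
print("Total variation distance:", round(tvd, 2))
```

A large distance, or any family missing entirely from the honeypot sample, suggests that a detector trained only on that sample has seen a skewed picture of the threats. In practice a trustworthy reference distribution is itself hard to obtain, so a check like this is indicative at best.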
Similar considerations can be made regarding the training of anti-spam classifiers: ...