Word Choice

Earlier Bayesian filter authors made interesting choices about which words of a message to analyze. Some, as mentioned earlier, derived word stems. Others processed only the text of the message.

Yerazunis’s CRM114 tool looks at every string of numbers or text in the message; that is, all punctuation and white space are treated as “word” delimiters. A slight modification to his program decodes attachments. Because the whole message is scanned, every piece of the header is examined, including timestamps, message identification numbers, and other administrivia, in addition to more intuitively pleasing items like the sender’s email address. The training methodologies (see the next section) work to balance items that are found in both ...
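The delimiter rule described above can be sketched in a few lines of Python. This is an illustration of the idea only, not CRM114's actual code: the function name tokenize, the sample message, and the choice to treat a "word" as a run of ASCII letters and digits are all assumptions made for the example.

```python
import re

# Assumption for this sketch: a "word" is any maximal run of ASCII letters
# or digits; punctuation and white space act purely as delimiters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(raw_message):
    """Return every letter/digit run in the raw message, headers included."""
    return TOKEN_RE.findall(raw_message)


# Hypothetical message: two header lines, a blank line, then the body.
sample = (
    "From: alice@example.com\n"
    "Message-ID: <20040101.1234@example.com>\n"
    "\n"
    "Buy now!!!\n"
)

print(tokenize(sample))
```

Under this rule the header fields contribute tokens just as the body does, and an address such as alice@example.com is split into its pieces (alice, example, com) rather than kept whole.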
