Word Analysis

Bayes' rule enables calculation of the probability that a message is spam, given the observed probabilities with which various words indicated spam (or non-spam) in the past. One drawback of non-Bayesian filtering is that it lacks a “big picture” of the message: it looks only for certain keywords, addresses, or other patterns. The initial Bayesian spam filters examined only 30 words [1, 2]. Newer filters [4] look much more deeply.
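
As a rough illustration of the calculation described above, the following Python sketch combines per-word spam probabilities using Bayes' rule under a naive independence assumption with equal priors for spam and non-spam. The function name, the table word_spam_prob, the neutral value of 0.5 for unseen words, and the clamping bounds are all assumptions made for this example; they are not taken from any particular filter. The default limit of 30 words mirrors the figure quoted above.

```python
from math import prod

def spam_probability(message_words, word_spam_prob, n_words=30, unknown=0.5):
    """Combine per-word spam probabilities into one score for a message.

    word_spam_prob is a hypothetical table mapping a word to the observed
    probability that a message containing it was spam.
    """
    # Look up each distinct word; unseen words get the neutral default.
    probs = [word_spam_prob.get(w, unknown) for w in set(message_words)]
    # Clamp so that a single 0.0 or 1.0 cannot force a division by zero.
    probs = [min(0.99, max(0.01, p)) for p in probs]
    # Keep only the most "interesting" words: those farthest from 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:n_words]
    # Bayes' rule with independent words and equal priors.
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)
```

With no informative words the score stays at 0.5; strongly spam-associated words push it toward 1, and strongly legitimate words pull it back toward 0.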

One author [5] carefully determined word stems (such as reducing “mails” and “mailing” to “mail”). Graham [1, 2] took care to extend his analysis to message headers, which makes intuitive sense because certain sources of email issue only spam.
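
The two ideas above, stemming and header analysis, can be sketched as a tokenizer. This is a deliberately crude example: naive_stem strips only two suffixes and is not the stemming method used in [5], and the "name:word" tagging of header tokens is a hypothetical scheme chosen for illustration, not a documented format from [1, 2].

```python
def naive_stem(word):
    """Reduce a word to a rough stem by stripping a trailing 'ing' or 's'."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Require at least three remaining letters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokens(headers, body):
    """Return stemmed body words followed by header words tagged by field."""
    out = [naive_stem(w) for w in body.split()]
    for name, value in headers.items():
        # Tagging keeps a word in "From" distinct from the same word in the body.
        out.extend(f"{name.lower()}:{w.lower()}" for w in value.split())
    return out
```

Tagging header tokens separately lets a filter learn that a given sender or relay is a strong spam indicator even when the same string would be unremarkable in the message body.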

Bill Yerazunis, author of the spam-filtering ...
