Advanced Techniques: Cleverer Statistics

Several spam researchers recognized a problem with the raw Bayesian approach: a word that has appeared only once, in a spam message, is assigned a “spamminess” of 100%. Intuitively, this does not feel right, because a random word might appear in any email.

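Gary Robinson [7] made several extremely useful suggestions in his Linux Journal article on spam. First, he defined p(w) as the probability that an email containing the word “w” is spam:

$$p(w) = \frac{b(w)}{b(w) + g(w)}$$

where

$$b(w) = \frac{\text{number of spam emails containing } w}{\text{total number of spam emails}}$$

and similarly,

$$g(w) = \frac{\text{number of ham emails containing } w}{\text{total number of ham emails}}$$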
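As a rough illustration, the sketch below computes p(w) as b(w) / (b(w) + g(w)), where b(w) and g(w) are the fractions of spam and ham emails that contain w, following the definitions in Robinson's article. The function names (`count_words`, `p_w`), the whitespace tokenizer, and the tiny corpus are assumptions made for this example, not part of the original text. It also reproduces the problem described earlier: a word seen only once, in spam, scores 100%.

```python
from collections import Counter

def count_words(messages):
    """Count, for each word, how many messages contain it at least once."""
    counts = Counter()
    for message in messages:
        # set() so a word repeated within one message is counted once
        counts.update(set(message.lower().split()))
    return counts

def p_w(word, spam_counts, ham_counts, n_spam, n_ham):
    """Return p(w) = b(w) / (b(w) + g(w)), or None if w was never seen.

    Assumes both corpora are non-empty (n_spam > 0 and n_ham > 0).
    """
    b = spam_counts[word] / n_spam   # fraction of spam emails containing w
    g = ham_counts[word] / n_ham     # fraction of ham emails containing w
    if b + g == 0:
        return None                  # no data for this word
    return b / (b + g)

# Hypothetical training data, purely illustrative
spam = ["cheap pills xylophone", "cheap loans now"]
ham = ["meeting at noon", "cheap flights to the meeting"]

spam_counts = count_words(spam)
ham_counts = count_words(ham)

rare = p_w("xylophone", spam_counts, ham_counts, len(spam), len(ham))
common = p_w("cheap", spam_counts, ham_counts, len(spam), len(ham))

print(rare)    # 1.0: one sighting in spam is enough for a 100% score
print(common)  # about two thirds: seen in both spam and ham
```

The score for “xylophone” rests on a single message, which is exactly the kind of overconfident estimate that motivated the refinements.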

When a ...
