Advanced Techniques: Cleverer Statistics

Several spam researchers recognized a problem with the raw Bayesian approach: a word that has been seen only once, and in a spam message, is assigned a “spamminess” probability of 100%. Intuitively, this does not feel right, because a random word might appear in any email.

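Gary Robinson [7] made several extremely useful suggestions in his Linux Journal article on spam. First of all, he defined p(w) as the probability that an email containing the word “w” is spam:

$$p(w) = \frac{b(w)}{b(w) + g(w)}$$

where

$$b(w) = \frac{\text{number of spam emails containing } w}{\text{total number of spam emails}}$$

and similarly,

$$g(w) = \frac{\text{number of ham emails containing } w}{\text{total number of ham emails}}$$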
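
As a rough illustration of the single-occurrence problem, the sketch below computes p(w) under the definitions in Robinson's article, taken here as an assumption: b(w) is the fraction of spam messages containing w, g(w) is the fraction of ham messages containing w, and p(w) = b(w) / (b(w) + g(w)). The function name `spam_probability` and the tiny corpora are hypothetical, chosen only for the example.

```python
def spam_probability(word, spam_msgs, ham_msgs):
    """Return p(w) = b(w) / (b(w) + g(w)), or None if the word was never seen.

    Each message is modeled as a set of words (an assumption of this sketch),
    so a word counts at most once per message.
    """
    # b(w): fraction of spam messages that contain the word
    b = sum(word in msg for msg in spam_msgs) / len(spam_msgs)
    # g(w): fraction of ham (non-spam) messages that contain the word
    g = sum(word in msg for msg in ham_msgs) / len(ham_msgs)
    if b + g == 0:
        return None  # no evidence either way
    return b / (b + g)


# Hypothetical training corpora, two messages each.
spam = [{"cheap", "pills", "zeppelin"}, {"cheap", "offer"}]
ham = [{"meeting", "offer"}, {"lunch", "meeting"}]

print(spam_probability("zeppelin", spam, ham))  # seen once, only in spam
print(spam_probability("offer", spam, ham))     # seen equally in both
```

Here “zeppelin” occurs in a single spam message and in no ham, so its p(w) comes out as 1.0 even though one sighting is very weak evidence, while “offer” appears equally often in both corpora and scores 0.5.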

When a ...
