Chapter 4. SpamAssassin as a Learning System

SpamAssassin provides many rules that have proven useful in distinguishing spam from non-spam messages, and these rules are updated at each new release. But SpamAssassin provides more than just generic rules; it has the capability of learning about your email environment and adapting its detection behavior to maximize its accuracy in that environment.

SpamAssassin includes two adaptive systems that can be used in concert: autowhitelisting and Bayesian filtering. This chapter discusses the principles, configuration, and operation of both systems.

Autowhitelisting

SpamAssassin’s autowhitelisting algorithm learns each sender’s history of sending spam or non-spam messages and modifies the spam score of their subsequent mailings on the basis of this history. The primary goal of autowhitelisting is to reduce false positives—to make it less likely that a non-spam message will be tagged as spam—by assuming that people who send you non-spam messages will not begin to spam you. It can also reduce false negatives if a spammer consistently sends email from the same email address, but this happens infrequently enough that autowhitelisting rarely has a significant effect on false negatives.

Principles

When autowhitelisting is enabled, SpamAssassin maintains a database keyed on message senders’ email addresses and the IP addresses of their nearest untrusted relay (if any). Each time a message from a given sender is received, the message’s spam score is added to the sender’s total score in the database, and a count of the number of messages received from that sender is updated.

The average sender score—the total score divided by the number of messages received—is used to modify the spam score of new messages from that sender. Specifically, the difference between the average score and the new message’s score is multiplied by a configurable factor, and the result is added to the new message’s spam score. The effect is that when the new message has a higher spam score than average, its spam score is adjusted downward; when the new message has a lower spam score than average, its spam score is adjusted upward.

As you might expect from this explanation, the autowhitelist tests are the last ones performed by SpamAssassin. All other tests must be run first in order to have the most accurate spam score for a message before comparing it to the sender’s historical average. In addition, the score added to the sender’s history is the message’s spam score as it stood before the autowhitelist modifier was applied, not the adjusted score.
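The bookkeeping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SpamAssassin's own code: the class name AutoWhitelistSketch, the check method, and the in-memory dictionary are all hypothetical stand-ins for the real database.

```python
from collections import defaultdict


class AutoWhitelistSketch:
    """Toy model of the autowhitelist history (hypothetical, for illustration)."""

    def __init__(self, factor=0.5):
        self.factor = factor
        # (sender address, relay IP) -> [total score, message count]
        self.history = defaultdict(lambda: [0.0, 0])

    def check(self, sender, relay_ip, score):
        """Return the adjusted score, then record the unadjusted score."""
        entry = self.history[(sender, relay_ip)]
        total, count = entry
        adjusted = score
        if count > 0:
            average = total / count
            # Move the score toward the sender's average by `factor`.
            adjusted = score + (average - score) * self.factor
        # The history is fed the score as it stood before adjustment.
        entry[0] += score
        entry[1] += 1
        return adjusted


awl = AutoWhitelistSketch(factor=0.5)
for earlier_score in [2, 1, 1, 0, 2]:
    awl.check("friend@example.com", "192.0.2.1", earlier_score)

final = awl.check("friend@example.com", "192.0.2.1", 6)
print(round(final, 2))  # prints 3.6
```

Five earlier messages scoring 2, 1, 1, 0, and 2 give an average of 1.2, so a new message scoring 6 is pulled halfway toward that average and ends up at 3.6. A sender with no history is left unadjusted.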

Configuration

The most important decisions to make in autowhitelisting are how much weight SpamAssassin should put on a sender’s history of sending spam or non-spam messages and how much weight it should put on the spam score of the message it is checking.

Use the auto_whitelist_factor directive to set the multiplier that is applied to the difference between a message’s spam score and the sender’s historical average score. It can range from 0 to 1. The default factor is 0.5, which causes the final spam score to be halfway between the message’s spam score and the sender’s average score.

To put more weight on the historical average, increase the auto_whitelist_factor. When the auto_whitelist_factor is set to 1, the historical average alone will be the new message’s spam score (recall, however, that the score before autowhitelisting is performed is fed back into the system and becomes part of the new historical average).

To put less weight on the historical average, decrease the auto_whitelist_factor. When the auto_whitelist_factor is set to 0, the historical average is ignored, and the current message’s spam score will not be modified based on the sender’s past messages.

Table 4-1 illustrates the impact of several different settings for auto_whitelist_factor. Each row of the table represents a new message from the same sender. Table columns show the spam score of each message before applying an autowhitelist modifier, the sender’s historical average score, and the spam score after applying an autowhitelist modifier. In this example, the sender sends several non-spam messages and then sends a message that looks like spam to SpamAssassin (a false positive). As you can see, with autowhitelisting using factors of 0.5, 0.75, or 1, the message will not reach the usual spam threshold of 5 because of the sender’s history of non-spam messages. Without autowhitelisting (i.e., with a factor of 0), the message receives a score of 6.

Table 4-1. The impact of auto_whitelist_factor (AWF)

Message number	Message score (before autowhitelist)	Sender average score	Score after autowhitelist, AWF = 0	AWF = 0.5	AWF = 0.75	AWF = 1
1	2	(none)	2	2	2	2
2	1	2	1	1.5	1.75	2
3	1	1.5	1	1.25	1.375	1.5
4	0	1.33	0	0.67	1.00	1.33
5	2	1.0	2	1.5	1.25	1.0
6	6	1.2	6	3.6	2.4	1.2

SpamAssassin stores its autowhitelist data in database files. SpamAssassin lets Perl’s AnyDBM_File module choose which database format will be used, based on which system libraries are available. In SpamAssassin 3.0, you can control this choice by setting the auto_whitelist_db_modules option to a space-separated list of Perl database modules to be tried in order; the first module that loads successfully will be used. For example, the default module order is specified like this:

auto_whitelist_db_modules DB_File GDBM_File NDBM_File SDBM_File

How you configure autowhitelisting also depends on whether you want each user to have his own whitelist database, or whether you want to use one database in common across all users.

Configuring per-user autowhitelists

By default, SpamAssassin maintains a separate autowhitelist for each user on the system. SpamAssassin stores the autowhitelist database for a user in the auto-whitelist file in the .spamassassin subdirectory of each user’s home directory. SpamAssassin uses one of several database formats for this file, depending on what database libraries are available on the system; the Berkeley DB format is chosen when it’s available.

SpamAssassin 3.0 can also store autowhitelists in an SQL database, which is useful when users don’t have accounts on the mail server. To store addresses in SQL, you must install the DBI Perl module and an appropriate driver module for your SQL server. Common choices are DBD-mysql (for the MySQL server), DBD-Pg (for the PostgreSQL server), and DBD-ODBC (for connection to an ODBC-compliant server).

You should create a database and a user with privileges to access it. You must then create a table in the database to store the user autowhitelist. The SpamAssassin source code includes schemas for MySQL and PostgreSQL tables in the sql subdirectory. Here is the MySQL schema:

CREATE TABLE awl (
  username varchar(100) NOT NULL default '',
  email varchar(200) NOT NULL default '',
  ip varchar(10) NOT NULL default '',
  count int(11) default '0',
  totscore float default '0',
  PRIMARY KEY  (username,email,ip)
) TYPE=MyISAM;

Each row in this table specifies an autowhitelist entry for a single sender for an individual SpamAssassin user. SpamAssassin uses the columns to store the following information:

username: Stores the username or email address of the user (the latter is more useful in virtual hosting environments).
email: Stores the email address of a sender whose messages’ spam scores are being tracked.
ip: Stores the IP address of the sender.
count: Stores the total number of messages received from the sender.
totscore: Stores the total spam score of messages received from the sender.
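To make the role of each column concrete, here is a small sketch that uses Python's built-in sqlite3 module and a simplified copy of the table. It only illustrates how rows keyed on (username, email, ip) accumulate count and totscore. The helper names record_score and average_score are hypothetical, and these are not the queries SpamAssassin itself issues.

```python
import sqlite3

# Simplified SQLite rendition of the awl table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE awl (
           username TEXT NOT NULL DEFAULT '',
           email    TEXT NOT NULL DEFAULT '',
           ip       TEXT NOT NULL DEFAULT '',
           count    INTEGER DEFAULT 0,
           totscore REAL DEFAULT 0,
           PRIMARY KEY (username, email, ip)
       )"""
)


def record_score(username, email, ip, score):
    """Add one message's score to the (username, email, ip) row."""
    cur = conn.execute(
        "UPDATE awl SET count = count + 1, totscore = totscore + ? "
        "WHERE username = ? AND email = ? AND ip = ?",
        (score, username, email, ip),
    )
    if cur.rowcount == 0:  # first message from this sender for this user
        conn.execute(
            "INSERT INTO awl (username, email, ip, count, totscore) "
            "VALUES (?, ?, ?, 1, ?)",
            (username, email, ip, score),
        )


def average_score(username, email, ip):
    """Return totscore / count, or None if the sender is unknown."""
    row = conn.execute(
        "SELECT count, totscore FROM awl "
        "WHERE username = ? AND email = ? AND ip = ?",
        (username, email, ip),
    ).fetchone()
    return None if row is None else row[1] / row[0]


for s in (2, 1, 1):
    record_score("alice", "friend@example.com", "192.0", s)
```

Because username is part of the primary key, the history that alice has built up for a sender has no effect on another user's view of that same sender.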

To configure SQL support for autowhitelists, set the following configuration parameters in your systemwide configuration file (local.cf):

auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList

Configures SpamAssassin to use SQL-based autowhitelists instead of file-based autowhitelists.

user_awl_dsn DSN

Defines the data source name for the SQL database, telling spamd how it will connect to the database server. A typical DSN for the Perl DBI module is written like this:

DBI:databasetype:databasename:hostname:port

For example, to use a MySQL database named saawl running on a database server on the SpamAssassin host, the DSN would read:

DBI:mysql:saawl:localhost:3306

If the server were running PostgreSQL, the DSN would read:

dbi:Pg:dbname=saawl;host=localhost;port=5432;

user_awl_sql_username username

Defines the username that will be used to connect to the database server. This user must have permission to modify the data in the table (including inserting and deleting rows).

user_awl_sql_password password

Defines the password associated with the username that will be used to connect to the server.

user_awl_sql_table tablename

Defines the name of the table that contains autowhitelist data. The default tablename is awl.

Configuring a system-wide autowhitelist

It is often desirable to maintain a single autowhitelist for all users of a system. When users don’t have home directories, such an approach is not just desirable but may be necessary if autowhitelisting is to be used. You can configure a systemwide autowhitelist by setting the auto_whitelist_path directive to the full path of the autowhitelist database file. Set auto_whitelist_path in the systemwide configuration file. For example, to set up a systemwide autowhitelist in the file /etc/mail/spamassassin/auto-whitelist, use the following directive:

auto_whitelist_path /etc/mail/spamassassin/auto-whitelist

If SpamAssassin encounters this directive, it checks to be sure the database file exists. If the file does not exist, SpamAssassin attempts to create it. You may not want to give SpamAssassin write access to the directory you specify. One way around that is to create the file as root, change its ownership to the SpamAssassin user, and set the mode to allow read/write access, all before you add the auto_whitelist_path to your configuration file.

However you create it, the systemwide autowhitelist database file should be readable and writable by the user running SpamAssassin. Depending on your configuration, SpamAssassin may be running as root, as one of several users on the system, or as a default unprivileged user such as nobody. If you let SpamAssassin create the systemwide autowhitelist database file, you can use the auto_whitelist_file_mode directive to specify the file’s mode. It defaults to 0700 but may need to be set to 0770 or 0777 depending on your configuration, when multiple users must access the file.

Warning

Using a systemwide autowhitelist with mode 0777 (or 0770 and an inappropriate group) will enable a curious local user to learn the email addresses of message senders and their average spam scores or to modify those scores. A malicious user could modify the database to give legitimate senders a false history of spamming. In general, file modes other than 0700 should be avoided.

Using an Autowhitelist

Once the autowhitelisting system is configured, you must instruct SpamAssassin to use it. In SpamAssassin 2.63, if you invoke SpamAssassin with the spamassassin script, add the --auto-whitelist option to direct the script to consult your autowhitelist. If you invoke SpamAssassin with the spamc client, you should start spamd (the daemon) with the --auto-whitelist option to direct it to consult user autowhitelists.

SpamAssassin 3.0 contains no --auto-whitelist command-line options. Instead, autowhitelists are always used when the use_auto_whitelist configuration option is set in a user’s (or a systemwide) configuration file.

If you’ve written a Perl application that uses Mail::SpamAssassin to check messages, you can take advantage of autowhitelists, but it requires a little additional setup. You must create an address list factory, an object that generates objects to store autowhitelisted addresses, and you must associate the address list factory with your Mail::SpamAssassin object. Here is sample code that does this:

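#!/usr/bin/perl

use strict;
use warnings;
use Mail::SpamAssassin;
use Mail::SpamAssassin::DBBasedAddrList;

# Create the scanner and a database-backed address list factory.
my $spamtest = Mail::SpamAssassin->new();
my $awl = Mail::SpamAssassin::DBBasedAddrList->new();
# Associate the factory with the scanner so it can use the autowhitelist.
$spamtest->set_persistent_address_list_factory($awl);
# Now go on to use $spamtest as usual.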

Mail::SpamAssassin also provides methods for adding and removing addresses from the autowhitelist. See the manpage for more information.

You can use the spamassassin script to manipulate the contents of your autowhitelist. The following command-line options to spamassassin operate on your autowhitelist:

--add-addr-to-whitelist=emailaddress: Adds emailaddress to the autowhitelist with an initial score of -100. SpamAssassin will forget any past history associated with the address.
--add-addr-to-blacklist=emailaddress: Adds emailaddress to the autowhitelist with an initial score of 100. SpamAssassin will forget any past history associated with the address.
--remove-addr-from-whitelist=emailaddress: Removes emailaddress from the autowhitelist. SpamAssassin will forget any past history associated with the address.
--add-to-whitelist: When you pipe an email message to spamassassin --add-to-whitelist, SpamAssassin adds all email addresses found in the To, From, Cc, Reply-To, Sender, Errors-To, and Mail-Followup-To headers or in the body of the message to the autowhitelist with initial scores of -100. SpamAssassin will forget any past history associated with these addresses.
--add-to-blacklist: When you pipe an email message to spamassassin --add-to-blacklist, SpamAssassin adds all email addresses found in the To, From, Cc, Reply-To, Sender, Errors-To, and Mail-Followup-To headers or in the body of the message to the autowhitelist with initial scores of 100. SpamAssassin will forget any past history associated with these addresses. Because this behavior will probably result in the blacklisting of your own email address, this option is usually useless.
--remove-from-whitelist: When you pipe an email message to spamassassin --remove-from-whitelist, SpamAssassin removes all email addresses found in the To, From, Cc, Reply-To, Sender, Errors-To, and Mail-Followup-To headers or in the body of the message from the autowhitelist and forgets any past history associated with these addresses.

Warning

Be careful with --add-to-blacklist. A malicious spammer could send you HTML email with friendly addresses (including your own) embedded in invisible mailto: tags. Piping this message to spamassassin --add-to-blacklist causes SpamAssassin to add all of those addresses to the autowhitelist as likely spammers! Using --add-addr-to-blacklist with individual email addresses is safer.

Bayesian Filtering

SpamAssassin’s Bayesian classifier learns to distinguish the features that characterize spam from those that characterize non-spam in the messages that you receive. Properly trained, the Bayesian classifier can reduce both false positives and false negatives.

Principles

Bayesian filtering is based on Bayes’ Theorem, a statement of probability theory propounded by the Reverend Thomas Bayes in 1763. Bayes’ Theorem is important in many fields where classifying data is essential, including computer vision, psychophysics, and diagnostic decision-making in health care. SpamAssassin’s implementation is mostly based on the work of Paul Graham (archived at http://www.paulgraham.com) and Gary Robinson (http://www.garyrobinson.net).

Conceptually, Bayes’ Theorem states that the probability of some event (such as a message being spam) given a test result (such as matching a spam-checking rule) depends on the baseline probability of the event before the test result is known and on the discriminating power of the test. A corollary is that the discriminating power of a test can be measured by comparing the probability of the event given a known test result to the baseline probability before the result is known. The more the test result can increase (or decrease) the probability from baseline, the stronger the test.

Tip

Actually, SpamAssassin’s “Bayesian” system doesn’t really compute the baseline probability or frequency of spam versus non-spam messages—which some have argued means it’s not strictly Bayesian at all. Instead it assumes values that seem reasonable and useful.

In the context of spam-checking, a Bayesian approach amounts to developing potential rules and asking how much each rule, if matched, should change the system’s perception of the likelihood that a message is spam. Very strong rules come in two forms. Some are patterns that only occur in spam (and never in non-spam), thus yielding a high probability that a message that matches one of the patterns is spam. Others are patterns that only occur in non-spam (and never in spam), thus yielding a low probability that a message that matches the pattern is spam. Weaker rules—patterns found in both spam and non-spam messages but with different frequencies—result in less extreme probabilities.

To use Bayesian filtering successfully, you must have a corpus of messages that you have decided are definitely spam, a corpus of messages that you have decided are definitely non-spam, and an algorithm for analyzing the two sets of messages to develop rules and test their strength. SpamAssassin provides the algorithm and a script that you can use to identify messages as spam or non-spam in order to train the filter. It also provides a mechanism for training itself with messages that are very likely to be spam or non-spam.

The results of the SpamAssassin learning process are a set of databases. One database contains tokens (strings of 3-15 characters) that have been seen, how often each has been seen in spam and non-spam messages, and the date and time that each token last proved useful in classifying a message. During learning, tokens are derived from both the message headers (with several commonly misleading headers ignored) and message body. Tokens that haven’t been useful in a long time may be removed from the database to increase efficiency. Another database keeps track of which messages have been learned, so SpamAssassin doesn’t waste time relearning old messages.

During spam-checking, a message to be checked is split into tokens. SpamAssassin then looks up each token in the token database. Up to 150 of the most diagnostic tokens in the message are identified, and their associated predictive values are combined using one of two mathematical functions to yield a final prediction of the probability that the message is spam. This predicted probability is matched by special SpamAssassin rules that associate probability ranges with spam score modifiers.
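The selection and combining steps can be sketched as follows. This is a hedged illustration, not SpamAssassin's implementation: it takes per-token spam probabilities as given, the function names (most_diagnostic, combine_chi2, combine_naive, chi2q, clamp) are hypothetical, and the two combining formulas follow Robinson's published chi-squared approach and a Graham-style product, which may differ in detail from what SpamAssassin computes.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with even `dof` exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def most_diagnostic(probs, limit=150):
    """Keep the `limit` probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def clamp(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 so logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def combine_chi2(probs):
    """Chi-squared combining in the style of Robinson (assumed form)."""
    probs = [clamp(p) for p in most_diagnostic(probs)]
    if not probs:
        return 0.5
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


def combine_naive(probs):
    """Product-style combining that treats tokens as independent."""
    probs = [clamp(p) for p in most_diagnostic(probs)]
    if not probs:
        return 0.5
    # Work with logarithms so long token lists do not underflow.
    diff = sum(math.log(1.0 - p) for p in probs) - sum(math.log(p) for p in probs)
    if diff > 700:
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


spammy = [0.99, 0.95, 0.9]
neutral = [0.5, 0.5, 0.5]
print(combine_chi2(spammy) > 0.9, combine_chi2(neutral))
```

Tokens with probabilities near 0.5 say little either way, which is why both functions first discard everything except the most extreme values.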

Configuration

SpamAssassin’s Bayesian classifier is controlled by more than a dozen configuration directives, though only a few are regularly modified by system administrators. These are the most useful:

use_bayes

This directive controls whether the Bayesian classifier is used at all. It defaults to 1 (use Bayesian filtering). Setting it to 0 disables Bayesian filtering completely.

bayes_auto_learn, bayes_auto_learn_threshold_nonspam, bayes_auto_learn_threshold_spam

These directives configure the automatic learning system, which automatically feeds messages with very high or very low spam scores to the Bayesian classifier. The bayes_auto_learn directive enables (1) or disables (0) this feature; it is enabled by default. The threshold directives determine which messages will be automatically learned as spam or non-spam. Messages with spam scores lower than bayes_auto_learn_threshold_nonspam are learned as non-spam; this value defaults to 0.1. Messages with spam scores higher than bayes_auto_learn_threshold_spam are learned as spam; this value defaults to 12 and cannot be set lower than 6. The spam score used for making this determination does not include modifiers for the Bayesian system itself, for the autowhitelist, or for user-configured whitelists or blacklists.

bayes_ignore_header headername

This directive tells the Bayesian classifier to ignore the given header when learning or classifying messages. It is most often used when another spam-tagging system adds headers before SpamAssassin receives the message, in order to prevent the classifier from learning the other spam tag instead of the features of the actual message.

bayes_ignore_from address (SpamAssassin 3.0)

This directive prevents Bayesian classification and learning from being performed on messages sent from address and is a form of whitelisting. It’s most useful when you want to receive messages from a few senders and the messages may include tokens that would otherwise suggest spam.

You can use multiple bayes_ignore_from directives or multiple addresses in a single directive to whitelist several addresses. You can also use an asterisk (*) as a wildcard for zero or more characters and a question mark (?) as a wildcard for zero or one character, much as you would to specify filename patterns in a shell.

bayes_ignore_to address (SpamAssassin 3.0)

This directive prevents Bayesian classification and learning from being performed on messages sent to address, and is a form of whitelisting recipients. It’s useful in sitewide Bayesian filtering to prevent any learning from being performed from messages sent to postmaster, for example, who is likely to receive forwarded spam, non-spam messages discussing spam, etc. Specify addresses as you would to the bayes_ignore_from directive discussed previously.

bayes_learn_during_report

When this directive is enabled (1), messages that are reported to clearinghouses as spam with the spamassassin --report command are also learned as spam by the Bayesian classifier. This saves you an extra learning step. Set the directive to 0 to disable this feature. It is enabled by default.

bayes_path and bayes_file_mode

By default, SpamAssassin maintains separate Bayesian databases for each user on the system. The databases for a user are stored in the .spamassassin subdirectory of the user’s home directory and their names begin bayes_, such as bayes_seen and bayes_toks. These files are kept in one of several possible database formats (Berkeley DB format is generally preferred when it’s available to SpamAssassin).

Separate databases for each user are ideal for Bayesian learning because different users may receive different kinds of spam and non-spam messages. However, it is often necessary to maintain a single Bayesian database for all users of a system, either to save on disk space or because users don’t have home directories. You can configure a systemwide Bayesian database set by setting the bayes_path directive to the full path of the Bayesian database file prefix. For example, to set up systemwide Bayesian databases in the files /etc/mail/spamassassin/bayes_*, use the following directive:

bayes_path /etc/mail/spamassassin/bayes

By default, the Bayesian databases are created with mode 0700. The bayes_file_mode directive can be used to set a different file mode (e.g., 0770) if you need to share the databases among a group. This might be necessary if SpamAssassin can be invoked with the privileges of different users. Care should be taken with this directive, as a malicious user with access to the Bayesian databases can cause legitimate email to be mistagged as spam.
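The automatic learning thresholds described above under bayes_auto_learn can be pictured as a simple three-way decision. The sketch below is only an illustration of that rule with the default values quoted earlier; the function name auto_learn_decision is hypothetical, and the score passed in is assumed to already exclude the Bayesian, autowhitelist, and user whitelist or blacklist modifiers.

```python
def auto_learn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam', or None (do not learn) for a message score."""
    if spam_threshold < 6:
        # The text notes that the spam threshold cannot be set below 6.
        raise ValueError("spam threshold cannot be lower than 6")
    if score < nonspam_threshold:
        return "ham"
    if score > spam_threshold:
        return "spam"
    return None  # in between: too uncertain to learn from automatically


for score in (-1.2, 4.0, 15.5):
    print(score, auto_learn_decision(score))
```

Only messages at the extremes are learned, so a message scoring 4.0 is left for manual training.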

The following directives influence the internal workings of the Bayesian classifier. For the most part, they can be left to the default settings.

bayes_min_ham_num and bayes_min_spam_num: These directives set the minimum number of ham (non-spam) and spam messages that must be learned by SpamAssassin before it will use the predictions of the Bayesian classifier to score new messages. They default to 200 each; until 200 ham and 200 spam messages have been learned, the SpamAssassin rules that rely on the Bayesian classifier will not be applied to email.
bayes_use_hapaxes: Hapaxes are tokens that have been seen only once during learning so far. Accordingly, SpamAssassin’s concept of whether a hapax is associated with spam or ham is based on limited data and may not be reliable. On the other hand, SpamAssassin can learn hundreds or thousands of hapaxes, and using hapaxes seems to provide better accuracy, so this setting defaults to 1 (enabled).
bayes_use_chi2_combining: This directive controls which of the two mathematical functions is used to combine token probabilities into an overall message probability. When enabled (1), the approach is based on the distribution of the chi-squared statistic; when disabled (0), a so-called “naïve Bayesian” function combines the probabilities using the assumption that errors in classification from each token are independent of one another. SpamAssassin’s maintainers have found the chi-squared method more useful, and it is the default.
bayes_auto_expire and bayes_expiry_max_db_size: When bayes_auto_expire is enabled (1), SpamAssassin will automatically attempt to remove old tokens during learning when the token database exceeds bayes_expiry_max_db_size tokens. This is the default. When disabled (0), token expiration must be performed manually. Automatic expiration occurs no more than once every 12 hours.
bayes_learn_to_journal and bayes_journal_max_size: When bayes_learn_to_journal is enabled (1), SpamAssassin will store newly learned data in a journal file, rather than directly into the Bayesian databases. The journal file will be synchronized into the databases at least daily, or when the journal exceeds bayes_journal_max_size bytes (102,400 by default). Using journaling reduces disk contention for the databases, which must be exclusively locked while being updated, but results in a delay between the time a message is learned and the time the learned tokens can be used to classify further messages. Journaling might be particularly useful if the journal could be kept in a different location than the databases (e.g., on a RAM disk), but SpamAssassin 3.0 provides no directive for doing so. bayes_learn_to_journal is disabled by default.

Training

There are two main strategies for training a Bayesian classifier: train everything and train-on-error. In the train everything strategy, you train the classifier with every message that you receive. This strategy is highly responsive to changes in spam patterns but may change too quickly in response to unrelated variability in messages. In addition, it is resource intensive to scan every message. In the train-on-error strategy, you train the classifier only with messages that it has previously classified incorrectly (i.e., false positives and false negatives). This strategy is resource efficient but may not train the classifier as quickly when spam patterns change.

Based on experiments conducted by Greg Louis (and described at http://www.bgl.nu/bogofilter/), the train everything strategy appears to be more efficient for initial training. Once a suitable number of messages have been learned, however, switching to a train-on-error approach saves resources, because many fewer messages must be trained. Louis suggests that switching to train-on-error after 10,000 spam and 10,000 non-spam messages have been learned may be reasonable. You can train SpamAssassin’s Bayesian classifier with either strategy.
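The switch between the two strategies amounts to a small policy decision. Here is a minimal sketch, assuming you keep your own counts of learned messages; the function name should_train and its parameters are hypothetical, and the 10,000-message cutoff is simply the figure Louis suggests.

```python
def should_train(spam_learned, ham_learned, predicted_spam, actually_spam,
                 switch_at=10_000):
    """Decide whether to feed a message to the classifier."""
    if spam_learned < switch_at or ham_learned < switch_at:
        # Initial phase: train everything.
        return True
    # Later phase: train only when the classifier got the message wrong.
    return predicted_spam != actually_spam


# Early on, every message is used for training.
print(should_train(120, 300, predicted_spam=False, actually_spam=False))
# After the switch, only errors are used.
print(should_train(12_000, 15_000, predicted_spam=False, actually_spam=True))
```

Requiring both counts to reach the cutoff keeps the initial phase running until neither corpus is lagging behind.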

The sa-learn script is your primary interface for training the Bayesian classifier. The first step in using Bayesian filtering is collecting a corpus of messages you’ve received that you have verified are spam and a corpus that you’ve verified are non-spam. The easiest and best way to do so is to simply start saving spam you receive to one folder and any non-spam messages that you would ordinarily delete to another. The two collections of messages can either be in maildir format (in which each file contains a single message) or mbox format (in which a single file contains multiple messages).

It’s important that the messages be from the same time period; if you train SpamAssassin with a set of spam messages from 2003 and a set of non-spam messages from 2004, it will quickly learn that an effective way to detect spam is to look for messages in 2003! Similarly, forwarded spam, or messages discussing spam in your corpus (“Hey, look at this spam I just got; it’s really strange. Here it is . . .”) can result in the classifier learning artificial rules that will degrade its accuracy with normal messages.

Next, run sa-learn on each corpus, using either the --spam or --ham command-line options to specify what each corpus represents. Example 4-1 shows the process for a set of mbox files—a file of saved spam, a file of saved (non-spam) messages related to a project, and the user’s mail spool. The project files and mail spool files together form a corpus of known good messages. This example assumes that each user maintains her own Bayesian databases, so sa-learn is run by each user on her own messages.

Example 4-1. Learning from a set of mbox files

$ ls -F Mail
spam    myproject
$ sa-learn --mbox --spam Mail/spam
$ sa-learn --mbox --ham Mail/myproject
$ sa-learn --mbox --ham /var/spool/mail/$LOGNAME

Example 4-2 shows the process for a set of maildirs, again assuming that each user has his own Bayesian databases. The commands in the example are those that would be executed by each individual user. Providing a directory as an argument to sa-learn causes it to learn from every file in that directory. The example also illustrates the use of the --no-rebuild option to defer rebuilding of the databases until the --rebuild option is used. When performing learning on a large set of small files (the very essence of a maildir), deferring the expensive database-rebuilding step is more efficient than rebuilding after each file.

Example 4-2. Learning from a set of maildirs

$ ls -F mail
INBOX/    spam/    myproject/
$ sa-learn --no-rebuild --spam mail/spam
$ sa-learn --no-rebuild --ham mail/INBOX
$ sa-learn --no-rebuild --ham mail/myproject
$ sa-learn --rebuild

If you’re the sort who likes to see the progress of the training (or who worries when you run a command that takes longer than a few seconds to finish), you can add the --showdots option to cause sa-learn to print a period for each message it processes.

You can also call sa-learn on an individual file containing a mail message, or you can pipe a mail message to sa-learn’s standard input. Finally, you can put the names of mailboxes, files, or directories into a file and run sa-learn with the --folders=filename option, and it will read the file and directory names from the filename file and learn from each.

Tip

The Bayesian classifier is most effective when trained on large collections of both spam and non-spam messages. In particular, training using many spam messages and fewer non-spam messages is likely to produce an ineffective filter. Aim for a couple thousand messages of each type, collected prospectively from your personally received mail.

If you mistakenly train the Bayesian classifier that a message is spam, simply direct sa-learn to relearn it as ham; if you mistakenly learn a message as ham, you can direct sa-learn to relearn it as spam. This process is also how you later train the classifier on errors. You can also cause SpamAssassin to forget a message entirely by running sa-learn --forget on the message.

sa-learn also accepts the same --configpath /path/to/ruleset/directory, --prefspath /path/to/user_prefs, and --siteconfigpath /path/to/sitewide/directory directives that the spamassassin script does. They are described in Chapter 2.

Once your Bayesian classifier has been trained and is contributing to spam-checking, you might be curious to find out which tokens are actually being used. The sa-learn --dump type command displays that information. type can be one of these choices:

data will cause sa-learn to display all of the tokens it has learned, with their associated spam probabilities, number of occurrences in spam and ham messages, and last time used.
magic will cause sa-learn to display “magic” tokens. Although they’re stored in the database, these tokens don’t represent parts of email messages. They include such information as the number of spam and ham messages in the databases, the last time a token was used, etc.
all will cause sa-learn to display tokens of both types.

Here are the first and last five lines of sa-learn --dump data | sort -n as executed on one system:

0.000    0    110 1072880922  discussion
0.000    0    112 1071162080  HMBOX-Line:2002
0.000    0    112 1072907632  modify
0.000    0    113 1072915324  H*u:Windows
0.000    0    115 1072900545  Sender
...
1.000  310      0 1071162080  N:HEADER_NBITS
1.000  316      0 1072026198  8-bit
1.000  323      0 1071162080  HEADER_8BITS
1.000  328      0 1072026198  N:N-bit
1.000  394      0 1072910571  Forged

The first five lines show tokens that have appeared only in non-spam messages. The last five show tokens that have appeared only in spam messages. Tokens starting with H were found in headers; some headers are abbreviated with special codes starting with an asterisk (*), so H*u: means the User-Agent header. Tokens starting with N: indicate that each N in the token stands for any sequence of digits.

You can restrict which tokens are shown by sa-learn --dump by adding the --regexp regexp command-line option and providing a regular expression pattern regexp. Only tokens that match regexp will be displayed. This option is useful when you want to see the spam probability associated with specific tokens.
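When you want to filter on the numeric columns rather than on the token text, you can post-process the dump yourself. The following is a minimal sketch that assumes the five-column layout shown above (probability, spam count, ham count, last-used time, token); the function name strong_spam_tokens is hypothetical.

```shell
# Read `sa-learn --dump data` lines on standard input and print the
# tokens (with their spam counts) whose spam probability is at least $1
# and that have been seen in at least $2 spam messages.
strong_spam_tokens() {
    awk -v p="${1:-0.9}" -v n="${2:-1}" '
        $1 + 0 >= p + 0 && $2 + 0 >= n + 0 { print $5, $2 }
    '
}
```

For example, sa-learn --dump data | strong_spam_tokens 0.99 300 would list only the tokens that are both strongly spam-associated and frequently seen.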

Daily Use

When you first enable the Bayesian classifier in SpamAssassin, you will initially notice little change in the way messages are checked for spam. Once you’ve trained the classifier with enough messages, however, your spam scores for messages will begin to change substantially in two ways:

Messages will show that they are hitting SpamAssassin rules with names like BAYES_44 or BAYES_80. These rules, which can be found in the 23_bayes.cf file, are triggered when the Bayesian classifier assigns a given probability of spam to a message. For example, the BAYES_44 rule is matched when a message has a probability of spam between 0.44 and 0.4999; the BAYES_80 rule is triggered when a message has a probability of spam between 0.80 and 0.90. Rules that match on probabilities less than 0.5 lower spam scores, and those that match on probabilities greater than 0.5 raise spam scores.
Most of the non-Bayesian rules assign different scores when the classifier is trained and in use than when it is not. In many cases, non-Bayesian rules produce less extreme scores, which reflects the supposition that the Bayesian classifier should be better than static rules at distinguishing spam from non-spam.

Ongoing training

Ongoing training is essential to maintaining the performance of a Bayesian filter. As in initial training, you must continue to provide examples of both spam and non-spam messages.

As you receive messages, check each message classified as spam to be sure that it is really spam and not a false positive. If the message’s spam score is higher than the threshold for automatic learning, the message should have already been fed back into the classifier to train it. You can determine if this has happened by looking at the autolearn= section of the X-Spam-Status header added by SpamAssassin. If the message’s spam score wasn’t high enough for automatic learning, submit it to sa-learn --spam yourself. If you come across a false positive, submit it to sa-learn --ham instead.
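Checking the autolearn= field by eye becomes tedious, so a small filter can help. The sketch below is one possible way to pull the value out of a message's X-Spam-Status header, allowing for the header being folded across several lines. The function name is hypothetical, and the header layout is assumed to match the one SpamAssassin writes by default.

```shell
# Print the autolearn= value from the X-Spam-Status header of the
# message on standard input (prints nothing if the field is absent).
autolearn_status() {
    awk '
        /^$/     { exit }                              # blank line ends the headers
        /^[ \t]/ { if (in_status) hdr = hdr $0; next } # folded continuation line
                 { in_status = 0 }                     # any other line starts a new header
        tolower($0) ~ /^x-spam-status:/ { in_status = 1; hdr = $0 }
        END {
            if (match(hdr, /autolearn=[A-Za-z_]+/))
                print substr(hdr, RSTART + 10, RLENGTH - 10)
        }
    '
}
```

Running autolearn_status < message prints a value such as spam, ham, or no, which tells you whether the message still needs to be submitted by hand.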

Similarly, you can submit your non-spam messages to sa-learn --ham if their spam scores are too high for the automatic learning threshold for ham. Any spam SpamAssassin misses should definitely be submitted to sa-learn --spam.

You can make the ongoing training process more convenient in one of two common ways. If you read your email with an email client that allows you to bind commands to keys, you could define keystrokes to invoke sa-learn --ham or sa-learn --spam on the current message. Another approach is to save all spam messages into a single mail folder and all non-spam messages that you plan to delete into a second folder, and then run sa-learn on each folder (and possibly on your inbox if you keep many undeleted messages there) at the end of your mail-reading session. Users or system administrators can set up cron jobs to automate this process.
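The two-folder approach might be scripted roughly as follows. This is a sketch under stated assumptions: both folders are mbox files (hence --mbox; drop that option for maildir-style directories), and the function name and the SA_LEARN override variable are hypothetical.

```shell
# SA_LEARN is a hypothetical override; set SA_LEARN=echo for a dry run.
SA_LEARN=${SA_LEARN:-sa-learn}

# Learn from one spam folder ($1) and one ham folder ($2), skipping
# either folder if it does not exist yet.
train_from_folders() {
    spam_folder=$1
    ham_folder=$2
    [ -e "$spam_folder" ] && "$SA_LEARN" --spam --mbox "$spam_folder"
    [ -e "$ham_folder" ]  && "$SA_LEARN" --ham  --mbox "$ham_folder"
    return 0
}
```

A script that calls train_from_folders "$HOME/Mail/spam" "$HOME/Mail/ham-to-delete" (placeholder paths) could be run by hand at the end of a session or from a nightly cron job.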

Expiration and importing

Expiration and importing are two other functions of sa-learn that you will use infrequently. Expiration removes old tokens from the database, and importing updates the database if a new SpamAssassin release changes database formats.

As discussed earlier in this chapter, when bayes_auto_expire is enabled (the default), SpamAssassin’s Bayesian classifier regularly reviews its database of tokens to determine if any should be expired. Expiration is always skipped when fewer than 100,000 tokens are in the database. The automatic expiration process runs no more than once every 12 hours and only when the number of tokens exceeds bayes_expiry_max_db_size.

If you do not use bayes_auto_expire, or if you want to expire tokens manually, you can force an expiration attempt by running sa-learn --force-expire. Doing so may not actually expire any tokens; for example, no tokens will be expired when the database holds fewer than 100,000 tokens or when all of its tokens have been used recently.
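Before forcing an expiration, it can be useful to check whether the database is large enough for expiration to do anything. The sketch below reads sa-learn --dump magic output and applies the 100,000-token floor described above. It assumes the token count appears in the third column of the line ending in "non-token data: ntokens", which is how SpamAssassin 3.x prints it; verify this against your own output. Both function names are hypothetical.

```shell
# Print the token count from `sa-learn --dump magic` output on stdin.
# Assumption: the count is the third column of the "ntokens" line.
bayes_token_count() {
    awk '/non-token data: ntokens$/ { print $3 }'
}

# Report whether expiration would be skipped for lack of tokens.
expiry_possible() {
    count=$(bayes_token_count)
    if [ "${count:-0}" -lt 100000 ]; then
        echo "skip: only ${count:-0} tokens"
    else
        echo "eligible: $count tokens"
    fi
}
```

You might run sa-learn --dump magic | expiry_possible and invoke sa-learn --force-expire only when the result is eligible.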

The sa-learn --import command is used to update the Bayesian databases from their format in an older version of SpamAssassin to the current format. The release notes for new versions of SpamAssassin should tell you when running sa-learn --import is necessary. In many cases, SpamAssassin will perform importation when it automatically learns a new message, so this command may not be necessary.

Warning

The import process can be both CPU and disk intensive, especially with a large database of tokens. It is best run during off-hours or times of low system load.

Storing Bayesian Data in SQL

SpamAssassin 3.0 can optionally store per-user Bayesian data in an SQL database, which is useful when users don’t have accounts on the mail server. To store Bayesian data in SQL, you must install the DBI Perl module and an appropriate driver module for your SQL server. Common choices are DBD-mysql (for the MySQL server), DBD-Pg (for the PostgreSQL server), and DBD-ODBC (for connection to an ODBC-compliant server).

You should create a database and a user with privileges to access it. You must then create a set of tables in the database to store the Bayesian data. The SpamAssassin source code includes schemas for MySQL, PostgreSQL, and SQLite tables in the sql subdirectory. Here is the MySQL schema:

CREATE TABLE bayes_expire (
  username varchar(200) NOT NULL default '',
  runtime int(11) NOT NULL default '0',
  KEY bayes_expire_idx1 (username)
) TYPE=MyISAM;

CREATE TABLE bayes_global_vars (
  variable varchar(30) NOT NULL default '',
  value varchar(200) NOT NULL default '',
  PRIMARY KEY  (variable)
) TYPE=MyISAM;

INSERT INTO bayes_global_vars VALUES ('VERSION','2');

CREATE TABLE bayes_seen (
  username varchar(200) NOT NULL default '',
  msgid varchar(200) binary NOT NULL default '',
  flag char(1) NOT NULL default '',
  PRIMARY KEY  (username,msgid),
  KEY bayes_seen_idx1 (username,flag)
) TYPE=MyISAM;

CREATE TABLE bayes_token (
  username varchar(200) NOT NULL default '',
  token varchar(200) binary NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY  (username,token)
) TYPE=MyISAM;

CREATE TABLE bayes_vars (
  username varchar(200) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  last_expire int(11) NOT NULL default '0',
  last_atime_delta int(11) NOT NULL default '0',
  last_expire_reduce int(11) NOT NULL default '0',
  PRIMARY KEY  (username)
) TYPE=MyISAM;

For each user, these tables maintain information about token expiration (bayes_expire), messages seen (bayes_seen), tokens seen (bayes_token), and per-user configuration variables (bayes_vars). A table for global configuration variables (bayes_global_vars) is also available. The names of the columns in these tables are similar to the corresponding SpamAssassin configuration variables and indicate the data they store.

To configure SQL support for Bayesian data, set the following configuration parameters in your systemwide configuration file (local.cf):

bayes_store_module Mail::SpamAssassin::BayesStore::SQL: Configures SpamAssassin to use SQL-based storage for Bayesian data instead of file-based (DBM) storage.
bayes_sql_dsn DSN: Defines the data source name for the SQL database. See the earlier definition of user_awl_dsn for examples of how to define a DSN.
bayes_sql_username username: Defines the username that will be used to connect to the database server. This user must have permission to modify the data in the tables (including inserting and deleting rows).
bayes_sql_password password: Defines the password associated with the username that will be used to connect to the server.

SpamAssassin will now store Bayesian data learned from messages (either automatically or via sa-learn) in the SQL database and will look up tokens in this database when checking messages for a user.
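Put together, the settings might look like the fragment that the following sketch appends to a configuration file. The option names are those documented for SpamAssassin 3.0 (bayes_store_module, bayes_sql_dsn, bayes_sql_username, bayes_sql_password); the DSN, the sa_bayes account, and the password are placeholders, and the function name is hypothetical.

```shell
# Append SQL Bayes settings to the configuration file named in $1.
# The DSN, username, and password below are placeholders to replace.
write_bayes_sql_conf() {
    cat >> "$1" <<'EOF'
bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn       DBI:mysql:spamassassin:localhost
bayes_sql_username  sa_bayes
bayes_sql_password  changeme
EOF
}
```

After running write_bayes_sql_conf against your local.cf, spamassassin --lint is a quick way to confirm that the file still parses cleanly.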

SpamAssassin provides one additional configuration variable for SQL storage of Bayesian data:

bayes_sql_override_username someusername: When this directive is set, the SQL query for Bayesian data will use someusername in place of the current user’s name when adding new message data or retrieving data for message-checking. Generally, this directive should only be used in per-user configuration files so that most users have their own personal Bayesian data. In principle, you could also use it in the site-wide configuration file to create a sitewide Bayesian database, and then use it in per-user configuration files to exclude certain users from the sitewide data.

A Sitewide Bayesian Classifier

Bayesian filtering is most effective when each user maintains his own set of token databases trained from his own email. By learning about the peculiar characteristics of spam and non-spam messages received by an individual user, the Bayesian classifier becomes an effective test for future messages to that user. A pharmacist might receive a lot of legitimate email about sildenafil citrate, and having all of these messages tagged as spam (or worse) could be a serious problem.

Many sites, however, prefer to have a single set of databases for all users at the site, either to save disk space or because users do not have home directories and setting up SpamAssassin 3.0’s SQL storage is infeasible. Setting up a sitewide Bayesian classifier is possible with SpamAssassin. Perform the following steps:

Set bayes_path and bayes_file_mode in the systemwide configuration file. Be sure the directory specified in bayes_path is readable, writable, and searchable by the user that SpamAssassin will be running as, so that it can create the proper files. The bayes_file_mode should be as strict as possible, typically 0700, which is the default setting. It’s a good idea to set it explicitly, rather than rely on the default.
Provide a mechanism for users or administrators to submit messages for training. This step is the most difficult part of a sitewide Bayesian classifier. Because the database files will be owned by the user that SpamAssassin runs as, even local users typically will not be able to run sa-learn with the proper permissions to update the databases.

One solution for enabling users to submit spam messages for training is to ask users to bounce any spam they receive to a central mailbox that can be processed by a privileged script. For example, set up an email alias of spamtrap on the SpamAssassin system that pipes incoming messages to a script like that shown in Example 4-3. As an extra benefit, you can publicize the spamtrap address on public web pages or in Usenet postings and actually use it as a spam trap—spammers who harvest the address and send spam to it will find their spam fed into your learning and reporting systems.

Example 4-3. A sitewide script for learning spam

#!/bin/sh
#
# This script accepts an email message on its standard input
# and feeds it to SpamAssassin's learning and/or reporting systems.
# It is meant to be run as root or as the user who owns the
# SpamAssassin Bayesian databases.

PATH=/bin:/usr/bin:/sbin:/usr/sbin

# Three choices; enable exactly one.
# 1. Use --report alone if you have bayes_learn_during_report
# enabled. This choice is active as shipped.
spamassassin --report

# 2. Uncomment the following lines (and comment out choice 1) to
# use both sa-learn and spamassassin --report when you don't have
# bayes_learn_during_report enabled. sa-learn does not pass the
# message through, so the message is saved to a temporary file first.
# msg=$(mktemp /tmp/spamtrap.XXXXXX) || exit 1
# cat > "$msg"
# sa-learn --spam < "$msg"
# spamassassin --report < "$msg"
# rm -f "$msg"

# 3. Uncomment the following line (and comment out choice 1) to
# use sa-learn alone.
# sa-learn --spam

Warning

If you ask users to use a centralized spamtrap address, it is crucial that they bounce or redirect their messages, rather than forward their messages. A forwarded message’s headers will show the message as being sent by the forwarding user, which is not what you want the Bayesian classifier to learn! Most mail clients provide a function for redirecting a message to a new address so that it still appears to be coming from the original sender. If your mail clients add extra headers when they do this, these headers are good candidates for bayes_ignore_header. You have to test to determine which, if any, headers your mail clients add and to be sure SpamAssassin is ignoring them.

A similar solution for non-spam messages is much more difficult—for social, rather than technical, reasons. Users may well be reluctant to forward their legitimate email to any central address. Unfortunately, without a good corpus of non-spam messages, the Bayesian filter will not perform well. One possible approach is to raise the bayes_auto_learn_threshold_nonspam slightly (e.g., to 0.5 or 1.0) so that much legitimate email will be auto-learned.
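To see why a small change in this threshold matters, consider the simplified sketch below of the auto-learn decision. It deliberately ignores the additional header and body point requirements that SpamAssassin applies, its handling of scores exactly at a threshold is an assumption, and the defaults of 0.1 and 12.0 are the SpamAssassin 3.0 defaults as best I know them. The function name is hypothetical.

```shell
# Simplified sketch: decide how a message with score $1 would be
# auto-learned. Optional $2 and $3 stand in for
# bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam.
autolearn_decision() {
    score=$1
    ham_threshold=${2:-0.1}
    spam_threshold=${3:-12.0}
    awk -v s="$score" -v h="$ham_threshold" -v p="$spam_threshold" 'BEGIN {
        if (s + 0 < h + 0)       print "ham"    # below the non-spam threshold
        else if (s + 0 >= p + 0) print "spam"   # at or above the spam threshold
        else                     print "no"     # in between: not learned
    }'
}
```

With the default threshold, autolearn_decision 0.7 reports that the message is not learned; with the threshold raised to 1.0 (autolearn_decision 0.7 1.0), the same message is learned as ham.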


SpamAssassin by Alan Schwartz
