SpamAssassin

Chapter 1. Introducing SpamAssassin

The SpamAssassin system is software for analyzing email messages, determining how likely they are to be spam, and reporting its conclusions. It is a rule-based system that compares different parts of email messages with a large set of rules. Each rule adds or removes points from a message’s spam score. A message with a high enough score is reported to be spam.

Tip

SpamAssassin was a trademark of Deersoft, and Deersoft has been acquired by Network Associates. In this book, I won’t write SpamAssassin™ each time I mention it because that would be distracting, but you should assume that the trademark symbol is there.

Many spam-checking systems are available. SpamAssassin has become popular for several reasons:

It uses a large number of different kinds of rules and weights them according to their diagnosticity. Rules that have been demonstrated to be more effective at discriminating spam from non-spam email are given higher weightings.
It is easy to tune the scores associated with each rule or to add new rules based on regular expressions.
SpamAssassin can adapt to each system’s email environment, learning to recognize which senders are to be trusted and to identify new kinds of spam.
It can report spam to several different spam clearinghouses and can be configured to create spam traps—email addresses that are used only to forward spam to a clearinghouse.
It is free software, distributed under either the GNU General Public License or the Artistic License. Either license allows users to freely modify the software and redistribute their modifications under the same terms.

Example 1-1 shows a message that has been tagged as spam by SpamAssassin. Elements added by SpamAssassin include the X-Spam-* headers and the report that precedes the attached original message.

Example 1-1. A message tagged by SpamAssassin

From riverol5380503@jubii.dk Fri Nov  7 18:26:05 2003
Received: from localhost [127.0.0.1] by localhost
        with SpamAssassin (2.60 1.212-2003-09-23-exp);
        Sun, 09 Nov 2003 12:24:22 -0600
From: "brianj" <riverol5380503@jubii.dk>
To: <Undisclosed.Recipients@mailin-2.priv.cc.uic.edu>
Subject: Live your dream life!!                MPNWSTU
Date: Fri, 07 Nov 2003 15:32:41 -0800
Message-Id: <000016646728$00007347$00000042@mail3.mailnara.net>
X-Spam-Status: Yes, hits=12.9 required=5.0 tests=CLICK_BELOW,
        FORGED_MUA_EUDORA,FROM_ENDS_IN_NUMS,MISSING_OUTLOOK_NAME,
        MSGID_OUTLOOK_INVALID,MSGID_SPAM_ZEROES,NORMAL_HTTP_TO_IP,
        SUBJ_HAS_SPACES,SUBJ_HAS_UNIQ_ID autolearn=no version=2.60
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp)
X-Spam-Level: ************
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_3FAE8656.371BED4D"

This is a multi-part message in MIME format.

------------=_3FAE8656.371BED4D
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Spam detection software, running on the system has
identified this incoming email as possible spam.  The original message
has been attached to this so you can view it (if it isn't spam) or block
similar future email.  If you have any questions, see
the administrator of that system for details.

Content preview:  Do you owe large sums of money? Are you stuck with high
  interest ra{tes? We can help! You can do what tens of thousands of
  americans have done, consolidate your high interest bills into one
  easy, low interest, monthly payment. [...]

Content analysis details:   (12.9 points, 5.0 required)

 pts rule name              description
---- ---------------------- ----------------------------------------------
 1.0 SUBJ_HAS_SPACES        Subject contains lots of white space
 4.3 MSGID_SPAM_ZEROES      Spam tool Message-Id: (12-zeroes variant)
 0.9 FROM_ENDS_IN_NUMS      From: ends in numbers
 0.2 NORMAL_HTTP_TO_IP      URI: Uses a dotted-decimal IP address in URL
 0.2 SUBJ_HAS_UNIQ_ID       Subject contains a unique ID
 4.3 MSGID_OUTLOOK_INVALID  Message-Id is fake (in Outlook Express format)
 0.1 MISSING_OUTLOOK_NAME   Message looks like Outlook, but isn't
 1.9 FORGED_MUA_EUDORA      Forged mail pretending to be from Eudora
 0.0 CLICK_BELOW            Asks you to click below

The original message was not completely plain text, and may be unsafe to
open with some email clients; in particular, it may contain a virus,
or confirm that your address can receive spam.  If you wish to view
it, it may be safer to save it to a file and open it with an editor.

------------=_3FAE8656.371BED4D
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: attachment
Content-Transfer-Encoding: 8bit

Received: (qmail 25515 invoked from network); 7 Nov 2003 18:26:02 -0600
Received: from mailin-2.cc.uic.edu (HELO mailin-2.priv.cc.uic.edu) (128.248.155.213)
  by email0.cc.uic.edu with SMTP; 7 Nov 2003 18:26:02 -0600
Received: from mail3.mailnara.net (c-24-98-136-187.atl.client2.attbi.com [24.98.136.187])
        by mailin-2.priv.cc.uic.edu (8.12.10/8.12.9) with ESMTP id hA80PxJk011669;
        Fri, 7 Nov 2003 18:26:00 -0600
Message-ID: <000016646728$00007347$00000042@mail3.mailnara.net>
To: <Undisclosed.Recipients@mailin-2.priv.cc.uic.edu>
From: "brianj" <riverol5380503@jubii.dk>
Subject: Live your dream life!!                MPNWSTU
Date: Fri, 07 Nov 2003 15:32:41 -0800
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_1068251164-2528-687"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: QUALCOMM Windows Eudora Version 5.1
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.3018.1300
Content-Length: 2290
Lines: 72

Do you owe large sums of money? Are you stuck with
high interest ra{tes? We can help!

You can do what tens of thousands of americans have
done, consolidate your high interest bills into one
easy, low interest, monthly payment.

By first reducing, and then completely removing your
d+ebts, you will be able to start fresh. Why keep
dealing with the stress, headaches, and wasted money,
when you can consolidate your d+ebt and pay them off
much sooner!

Click below to learn more:

http://61.186.254.9?affiliateid=mailer10

hjeuubnfs

------------=_3FAE8656.371BED4D--

The SpamAssassin report is revealing. Despite the fact that this message includes several tricks to fool spam-checkers, such as random characters at the end and breaking up the words “rates” and “debt” with symbols, SpamAssassin identifies several suspicious characteristics and assigns a high spam score.
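
The arithmetic behind the report in Example 1-1 can be reproduced in a few lines. The following Python sketch is not SpamAssassin's own code (SpamAssassin is written in Perl); it only sums the per-rule points listed in the report and compares the total with the required score. The names `RULE_POINTS` and `classify` are illustrative, and the sketch assumes that a message counts as spam once its total reaches the threshold.

```python
from decimal import Decimal

# Points per rule, copied from "Content analysis details" in Example 1-1.
RULE_POINTS = {
    "SUBJ_HAS_SPACES": Decimal("1.0"),
    "MSGID_SPAM_ZEROES": Decimal("4.3"),
    "FROM_ENDS_IN_NUMS": Decimal("0.9"),
    "NORMAL_HTTP_TO_IP": Decimal("0.2"),
    "SUBJ_HAS_UNIQ_ID": Decimal("0.2"),
    "MSGID_OUTLOOK_INVALID": Decimal("4.3"),
    "MISSING_OUTLOOK_NAME": Decimal("0.1"),
    "FORGED_MUA_EUDORA": Decimal("1.9"),
    "CLICK_BELOW": Decimal("0.0"),
}


def classify(rule_points, required=Decimal("5.0")):
    """Return (total, is_spam, level) for the rules that matched."""
    total = sum(rule_points.values(), Decimal("0"))
    is_spam = total >= required          # assumption: threshold is inclusive
    level = "*" * int(total)             # one asterisk per whole point
    return total, is_spam, level


total, is_spam, level = classify(RULE_POINTS)
print(f"hits={total} required=5.0 spam={is_spam}")
print(f"X-Spam-Level: {level}")
```

Decimal arithmetic is used so that the one-decimal scores add up exactly, matching the 12.9 points and the twelve asterisks shown in the example headers.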

How SpamAssassin Works

There are several ways that SpamAssassin makes up its mind about a message:

The message headers can be checked for consistency and adherence to Internet standards (e.g., is the date formatted properly?).
The headers and body can be checked for phrases or message elements commonly found in spam (e.g., “MAKE MONEY FAST” or instructions on how to be removed from future mailings)—in several languages.
The headers and body can be looked up in several online databases that track message checksums of verified spam messages.
The sending system’s IP address can be looked up in several online lists of sites that have been used by spammers or are otherwise suspicious.
Specific addresses, hosts, or domains can be blacklisted or whitelisted. A whitelist can be automatically constructed based on the sender’s past history of messages.
SpamAssassin can be trained to recognize the types of spam that you receive by learning from a set of messages that you consider spam and a set that you consider non-spam. (SpamAssassin and the spam-filtering community often refer to non-spam messages as ham.)
The sending system’s IP address can be compared to the sender’s domain name using the Sender Policy Framework (SPF) protocol (http://spf.pobox.com) to determine if that system is permitted to send messages from users at that domain. This feature requires SpamAssassin 3.0.
SpamAssassin can privilege senders who are willing to expend some extra computational power in the form of Hashcash (http://www.hashcash.org). Spammers cannot do these computations and still send out huge amounts of mail rapidly. This feature requires SpamAssassin 3.0.
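
Two of the checks in this list are simple enough to sketch. The Python code below is an illustration, not SpamAssassin's implementation: `date_header_is_valid` tests whether a Date header parses at all, and `dnsbl_query_name` builds the DNS name that is conventionally queried to look up an IPv4 address in a DNS-based list (the octets reversed, followed by the list's zone). The zone `dnsbl.example.org` is a placeholder, and both function names are made up for this example. No network lookup is performed.

```python
import ipaddress
from email.utils import parsedate_to_datetime


def date_header_is_valid(value):
    """True if the Date: header value parses as an Internet message date."""
    try:
        parsedate_to_datetime(value)
        return True
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return False


def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the name to query for an IPv4 address: reversed octets + zone."""
    addr = ipaddress.IPv4Address(ip)      # raises ValueError if not IPv4
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Values taken from the headers in Example 1-1.
print(date_header_is_valid("Fri, 07 Nov 2003 15:32:41 -0800"))
print(dnsbl_query_name("24.98.136.187"))
```

A real checker would go on to resolve the generated name and treat a successful answer as a listing; that step is omitted here so the sketch runs offline.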

SpamAssassin combines message format validation, content-filtering, and the ability to consult network-based blacklists. Filtering systems require little user intervention and introduce little delay into the process of sending and receiving email. There are other approaches to preventing spam, each of which comes with its own advantages and disadvantages (and many of which can be used in addition to, as well as in place of, SpamAssassin).

In a challenge/response system, the system holds all messages from unknown senders and sends them a reply message with a unique code or set of instructions (the challenge). The senders must reply to the challenge in some fashion that verifies their email addresses and (generally speaking) proves that they are human beings, rather than an automated bulk mail program (the response). After a successful response, the system allows messages from the sender to be accepted, rather than holding them.

In greylisting systems, the mail server initially returns a temporary SMTP (Simple Mail Transfer Protocol) failure code to messages from new senders or sending systems. If the sending system attempts to resend the message after a reasonable time period, the mail server accepts the message and subsequent messages from the sending host. Because spammers are likely either to treat the temporary failure as a permanent failure, or to attempt to deliver messages continually during the greylisting time period, their messages are not received.
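
The greylisting decision can be sketched as a small state machine. The Python class below is a toy model under stated assumptions: it keys new attempts on the triplet of client IP, envelope sender, and recipient (a common choice in greylisting implementations, though not specified above), it uses reply code 451 for the temporary failure and 250 for acceptance, and the 300-second delay is arbitrary. The class and parameter names are hypothetical.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay     # seconds a new triplet must wait
        self.clock = clock             # injectable so tests can fake time
        self.first_seen = {}           # triplet -> time of first attempt
        self.passed_hosts = set()      # hosts that have retried correctly

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply code: 250 to accept, 451 to defer."""
        if client_ip in self.passed_hosts:
            return 250
        triplet = (client_ip, sender, recipient)
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            self.passed_hosts.add(client_ip)
            return 250
        return 451


now = [0]                              # fake clock for the demonstration
greylist = Greylist(min_delay=300, clock=lambda: now[0])
attempt = ("192.0.2.10", "sender@example.com", "user@example.org")
print(greylist.check(*attempt))        # first attempt is deferred
now[0] = 600
print(greylist.check(*attempt))        # retry after the delay is accepted
```

A sender that retries continually inside the delay window keeps receiving 451, and one that never retries is never accepted, which is the behavior the paragraph above relies on.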

In time-limited address systems, users generate unique variations of their email address to include in different web forms, email messages, newsgroup postings, etc. Addresses may be valid only for a limited time or may be valid until revoked by the user. In these systems, if a user receives spam at one of his addresses, he can usually identify the company that spammed him (or sold his address to a spammer), and he can quickly invalidate the address to prevent further spam.
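
One way such addresses could be generated is sketched below in Python. This is a hypothetical scheme invented for illustration, not the format of any particular product: the local part carries a label naming who was given the address, an expiry date, and a short HMAC tag so that the mail system can reject addresses it did not issue. `SECRET`, `make_address`, and `is_valid` are all assumed names, and labels must not contain a period.

```python
import hashlib
import hmac
from datetime import date

SECRET = b"replace-with-a-private-key"   # hypothetical per-user secret


def _tag(user, label, stamp):
    """Short keyed digest binding the user, label, and expiry together."""
    msg = f"{user}+{label}.{stamp}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:8]


def make_address(user, domain, label, expires):
    """Build user+label.YYYYMMDD.tag@domain, valid through `expires`."""
    stamp = expires.strftime("%Y%m%d")
    return f"{user}+{label}.{stamp}.{_tag(user, label, stamp)}@{domain}"


def is_valid(address, today, revoked=frozenset()):
    """Accept only unexpired, unrevoked addresses with a matching tag."""
    local, _, _domain = address.partition("@")
    user, _, extension = local.partition("+")
    parts = extension.split(".")
    if len(parts) != 3:
        return False
    label, stamp, tag = parts
    if label in revoked:
        return False
    if not hmac.compare_digest(tag, _tag(user, label, stamp)):
        return False
    return today.strftime("%Y%m%d") <= stamp


address = make_address("alice", "example.org", "shopsite", date(2004, 1, 31))
print(address)
print(is_valid(address, date(2003, 11, 9)))
print(is_valid(address, date(2003, 11, 9), revoked={"shopsite"}))
```

Because the label is embedded in the address, spam arriving at it points back to whoever was given that address, and adding the label to the revoked set invalidates it immediately.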

In micropayment systems, senders must pay a small fee for each message they send, making large-scale spam runs costly. In some of these systems, the micropayment is refunded when the recipient determines that the message is in fact non-spam. (SpamAssassin 3.0 supports a variation of micropayments in the form of Hashcash, in which the payment is made in processing time rather than money.)
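
The idea of paying in processing time can be illustrated with a simplified proof-of-work. The Python sketch below follows the general principle behind Hashcash (find a string whose SHA-1 hash begins with a required number of zero bits), but the stamp here is only `resource:counter`; real Hashcash stamps carry additional fields, and this is not a compatible implementation. The function names and the 12-bit difficulty are assumptions chosen so the example finishes quickly.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource, bits=12):
    """Search for a counter whose stamp hashes to `bits` leading zero bits."""
    for counter in count():
        stamp = f"{resource}:{counter}"
        digest = hashlib.sha1(stamp.encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return stamp


def verify(stamp, resource, bits=12):
    """Checking a stamp costs one hash, however long minting took."""
    if not stamp.startswith(resource + ":"):
        return False
    digest = hashlib.sha1(stamp.encode()).digest()
    return leading_zero_bits(digest) >= bits


stamp = mint("user@example.org")
print(stamp, verify(stamp, "user@example.org"))
```

Minting takes about 2^bits hash attempts on average while verification takes one, so each extra bit doubles the sender's cost without changing the recipient's.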

Organization of SpamAssassin

At heart, SpamAssassin is a set of modules written in the Perl programming language, along with a Perl script that accepts a message on standard input and checks it using the modules. For higher-performance applications, SpamAssassin also includes a daemonized version of the spam-checker and a client program in C that can accept a message on standard input and check it with the daemon.
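
Because both the Perl script and the C client read a message on standard input, calling either one from another program amounts to running a filter. The Python sketch below shows that pattern with a stand-in command that simply echoes its input, so it runs without SpamAssassin installed. In standard distributions the script is named `spamassassin` and the client `spamc`; treat those names, and the assumption that the checked message comes back on standard output, as things to confirm against your installation.

```python
import subprocess
import sys


def filter_message(raw_message, command):
    """Send a message to `command` on stdin and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=raw_message,
        capture_output=True,
        check=True,            # raise if the command exits with an error
    )
    return result.stdout


# Stand-in filter: a Python one-liner that copies stdin to stdout unchanged.
ECHO = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

message = b"Subject: test\n\nHello\n"
print(filter_message(message, ECHO).decode())
```

On a system with SpamAssassin installed, passing `["spamc"]` in place of `ECHO` would be the corresponding call to the daemon's client, under the assumptions stated above.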

Most of SpamAssassin’s behavior is controlled through a systemwide configuration file and a set of per-user configuration files. The per-user configuration can also be stored in an SQL database.

Tip

For a great deal more about Perl, check out Learning Perl, by Randal L. Schwartz and Tom Phoenix, or Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant, both from O’Reilly.

Mailers and SpamAssassin

Although it’s possible to run SpamAssassin manually on a single message, SpamAssassin becomes really useful when all incoming messages are scanned automatically. There are several ways that this can be done.

Figure 1-1 shows a typical mail transmission. The sending system connects to the recipient’s mail transport agent (MTA) and transmits the message. If the message is destined for a user on the MTA’s system, the MTA hands the message off to the local mail delivery agent (MDA), which is responsible for storing the message in a user’s mailbox. Users may log into the system and read their mail directly from their mailboxes (as is typical on multiuser Unix systems), or, if the system runs the appropriate servers, users may download their mail using a mail client that supports POP (Post Office Protocol) or IMAP (Internet Message Access Protocol).

Figure 1-1. A typical mail transmission

SpamAssassin can be run in three fundamental places: at the MTA, at the MDA, and as a POP proxy. Each has advantages and disadvantages.

Scanning at the MTA

Some MTAs provide a way for incoming messages to be passed through a filter during the SMTP transaction; others can pass messages through a filter after the SMTP transaction is complete. Spam-checking is one kind of filtering that can be usefully performed at the MTA; virus-checking is another. In many cases, sophisticated filtering daemons have been developed for specific MTAs, and these daemons are capable of calling SpamAssassin to perform spam checks.

Because all email destined for users on the system must pass through the MTA, it is a natural place for centralized spam-checking. If you run a gateway MTA that delivers mail to several internal systems, you can perform spam-checking at the gateway MTA to limit the amount of spam that any internal server will receive.

In addition to tagging messages that appear to be spam, MTA-based filters can often take other actions, such as blocking a message (either refusing to complete the SMTP transaction or discarding it after the SMTP transaction has taken place) or redirecting it to a quarantine area. If the MTA is already running a filtering system to do virus-checking, spam-checking can usually be performed by the same filter and share some of the overhead associated with filtering.

A disadvantage of scanning at the MTA alone is that the MTA filtering system may not be able to access per-user preferences for scanning if the filter does not have access to the recipient information, if the recipient is at another host, or if the message is destined for multiple users on the same system.

Scanning at the MDA

On many Unix systems, the mail delivery agent is procmail, which can submit messages to SpamAssassin and act on the results. This is the most common way to install SpamAssassin by itself, because it does not require any MTA-specific filter interfaces.

This configuration maximizes flexibility. Systemwide SpamAssassin rules can be applied to all incoming messages, and users can supplement or modify them with their own per-user SpamAssassin configuration, because, by definition, the MDA always knows the recipient to which it is delivering the message. Users who are proficient in writing procmail recipes gain complete control over the disposition of messages marked as likely spam; procmail can be instructed to discard them, file them in a separate mailbox, modify message headers, or take many other actions.
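
The kind of decision a delivery agent makes after SpamAssassin has tagged a message can be sketched as follows. This is Python rather than a procmail recipe, and it is only an illustration: it reads the `X-Spam-Flag` and `X-Spam-Level` headers shown in Example 1-1 and picks a destination. The function name, the mailbox names, and the cutoff of 15 asterisks for discarding are hypothetical.

```python
from email import message_from_string


def choose_mailbox(raw_message, discard_level=15):
    """Pick a destination from the headers SpamAssassin adds."""
    msg = message_from_string(raw_message)
    if msg.get("X-Spam-Flag", "").strip().upper() != "YES":
        return "inbox"
    # X-Spam-Level carries one asterisk per whole point of the score.
    stars = msg.get("X-Spam-Level", "").count("*")
    return "discard" if stars >= discard_level else "spam"


tagged = (
    "Subject: Live your dream life!!\n"
    "X-Spam-Flag: YES\n"
    "X-Spam-Level: ************\n"
    "\n"
    "Click below to learn more:\n"
)
print(choose_mailbox(tagged))
```

Keeping the discard cutoff well above the tagging threshold means that only the highest-scoring messages are dropped unseen, while borderline ones remain available for review in a separate mailbox.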

The downside of this configuration is that spam-checking is applied only after a message has been received by the system and has consumed some system resources. Another disadvantage is that spam-checking must be set up on every system that has local recipients, rather than at a single centralized MTA gateway.

Scanning with a POP Proxy

POP mail users who want the benefits of SpamAssassin on mail servers that don’t provide it can use a proxy to perform spam-checking. The proxy runs on the client computer and integrates with the POP mail reader to scan messages as they are downloaded via POP.

The best known POP proxy for SpamAssassin on Windows systems is SAproxy by Stata Labs. SAproxy Pro is a commercial product, but the source code is freely available under the same terms as SpamAssassin itself for administrators who wish to compile it and provide it to their users.

Proxies are the most decentralized approach to spam-checking and require the mail server to be liberal in accepting messages so that each user’s proxy can apply their own standards. This may increase the storage load on the mail server. On the other hand, proxies completely remove the computational load from the mail server, as all spam-checking is performed by the client.

Scanning at Multiple Places

It’s entirely possible to run SpamAssassin at two or even all three of the places discussed in the previous sections. An MTA-based filter could use SpamAssassin with conservative settings to refuse messages that are highly suspicious. An MDA filter on the same system could apply a more liberal (and per-user) definition of spam in order to tag messages for users who read their mail on the server itself. Finally, POP users could apply their own spam-checking by running SAproxy on their client machines.

The Politics of Scanning

If you’re an ISP that provides email service, many of your users will want, and perhaps even demand, spam-tagging or spam-filtering of their incoming email. Other users, however, may not want their email tagged or filtered because they don’t get much spam, don’t perceive the spam they receive to be a problem, or are concerned about the possibility of a real message being mistakenly tagged as spam.

Before you implement systemwide or sitewide spam-checking, consider carefully the needs of your users and your responsibilities toward them. At minimum, you must inform users (and would-be users) of any unconditional spam-checking you perform on their email. Better yet is to provide spam-tagging only for those users who opt to turn it on. Best of all is to enable each user to configure their own settings and threshold for how spam is recognized. This is doubly important if you not only tag messages for users but actually filter or block spam for them.

SpamAssassin is an excellent tool for distinguishing spam and non-spam email, but only if you’ve determined that your users want you to distinguish the two.


SpamAssassin by Alan Schwartz