CSL: LearnSpam

LearnSpam: A System to Automate SpamAssassin Spam Filter Training

In the current state of the War on Spam, SpamAssassin's optional Bayesian classification is one of the best methods for improving spam recognition and filtering. We've set up the LearnSpam system to make this feature easy to use.

Setting up and using SpamAssassin's Bayesian Spam Filtering (BSF) is technical and tedious, requires ongoing time and attention on a per-user basis, and can consume a lot of your home directory disk quota. That's why we've created the LearnSpam system: to try to make this approach much more practicable for busy department members.

Typically, BSF requires:

This is a lot to expect every user to do on an ongoing basis, plus the spam can occupy a lot of your home disk space! LearnSpam will not totally free you of all involvement, but it will reduce it substantially, and it will also reduce the amount of your disk space that gets used to store spam.

Note: SpamAssassin's BSF programs create and maintain a database in your ~/.spamassassin/ directory; this will take up from 5 to 10 MB of your home directory disk space.
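
To see how much of your quota the database is actually using, something like the following sketch can be run from a Unix prompt. The function name sa_db_usage is only an illustration and is not part of LearnSpam; du is the standard disk-usage command.

```shell
# sa_db_usage [DIR]: print the disk space used by the SpamAssassin
# per-user directory (defaults to ~/.spamassassin).
sa_db_usage() {
    dir="${1:-$HOME/.spamassassin}"
    if [ -d "$dir" ]; then
        du -sh "$dir"
    else
        echo "$dir does not exist yet"
    fi
}

sa_db_usage
```

If the directory has not been created yet (for example, before BSF has ever run for your account), the function just says so.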

If you're in a hurry, you can skip the rest of this page and just run the command learnspam from a Unix prompt; much of this will be explained as the program leads you through the setup.

LearnSpam structure:

There are several components to LearnSpam:

What to do:

To set up LearnSpam, you will need to run the learnspam program from a Unix window on the trusted network:

      % learnspam

When you run learnspam for the first time, it will provide a lot of the same information you're reading here, will prompt you for the needed information, and will then record your responses to a configuration file. The configuration file will then be used by the various LearnSpam programs.

When you run learnspam again, it will run the BSF learning programs. Most users will only need to run learnspam once; after the initial setup, the system will run it automatically.

When you run learnspam the first time, you will need to provide:

After that, you will need to save both ham and spam on a regular basis. And that's all! (But see below for some guidelines on ham and spam.)

The spam that you save will be regularly moved by the system into a system spam repository. Each user can read only their own spam there, but the Lab staff will also keep a large collection of spam in the repository that is available to everyone. So it is not strictly necessary that you save any (or much) spam at all, which is why this part is optional. But the BSF will probably work better for you if it has some of your own spam to learn from.

For ham, you can just provide the names of some mailboxes to which you regularly save legitimate (i.e., non-spam) e-mail, and/or you can save legitimate e-mail that you would otherwise delete into a separate mailbox called, e.g., "ham". Your ham will not be moved or rotated by the system; it will still occupy your own home directory space.

Remember: It is very important to train the BSF system with ham, too, not just spam!

The details:

The first time you run learnspam, it will set up your configuration file at:

        ~/.spamassassin/learnspam.conf

You will be able to edit that file at any time to add or remove ham and spam mailboxes for LearnSpam to use.

You will also be able to use the configuration file to turn this system off and on for your account. If the system is on (the default setup), then everything happens automatically on a regular basis:

If the system is off, then you can manage your own spam mailboxes, and run learnspam or the BSF programs yourself however you wish.
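
This page does not name the BSF programs. In a standard SpamAssassin installation the training tool is sa-learn, so the sketch below assumes that is what you would run by hand. The function name train_bayes and the SA_LEARN override are illustrative only, and the mailboxes are assumed to be in mbox format.

```shell
# train_bayes HAM_MBOX SPAM_MBOX: feed one ham mailbox and one spam
# mailbox to SpamAssassin's Bayesian trainer.
# Set SA_LEARN=echo for a dry run that only prints the arguments.
train_bayes() {
    sa_learn="${SA_LEARN:-sa-learn}"
    # Teach the database what legitimate mail looks like ...
    "$sa_learn" --ham --mbox "$1" &&
    # ... and then what spam looks like.
    "$sa_learn" --spam --mbox "$2"
}
```

For example, train_bayes ~/mail/saved-messages ~/mail/spam would train from two mailboxes with those (hypothetical) names. Afterwards, sa-learn --dump magic prints summary counters for the database, including how many ham and spam messages it has learned.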

Advanced users:

You can tailor this system to be as automatic or as manual as you like. You can set the automatic moves and learning to "off", and you can run movespam and learnspam by hand, or set up your own cron jobs. See movespam -h and learnspam -h for additional details.

About ham and spam:

Bayesian spam filtering does not live by spam alone!   ;-)

It is very important to teach the BSF system what legitimate e-mail looks like, too. You'll need to provide the names of one or more ham mailboxes that preferably have e-mail saved to them fairly regularly, and that together contain between 1000 and 5000 messages. It is probably best to avoid mailboxes that contain a lot of large attachments (e.g., Word documents), since these will take longer to process and will add little value.
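
To check whether your candidate ham mailboxes fall in that range, you can count their messages. The sketch below assumes the mailboxes are in traditional mbox format, where each message begins with a line starting "From ". A body line that starts the same way and was not escaped by your mail software would be counted as an extra message. The function name count_messages is only an illustration.

```shell
# count_messages MBOX...: print the number of messages in each mbox
# file, followed by the total across all of them.
count_messages() {
    total=0
    for box in "$@"; do
        # grep -c prints the number of matching lines; fall back to 0
        # if the file is missing or unreadable.
        n=$(grep -c '^From ' "$box" 2>/dev/null) || n=${n:-0}
        printf '%s: %s\n' "$box" "$n"
        total=$((total + n))
    done
    printf 'total: %s\n' "$total"
}
```

For example, count_messages ~/mail/saved-messages ~/mail/ham (hypothetical mailbox names) prints one line per mailbox and a final total line.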

The spam setup can actually be a little more complicated than the ham setup, since you might need to specify two separate groupings:

The simplest (but not necessarily the best) option for these is to just use the system spam repository, and not worry about saving spam at all (this will be presented as an option). Alternatively, you can provide responses for the two groupings above, including the use of the system spam repository.

If you don't want any spam mailboxes automatically moved by the system, then you can enter nothing for the first grouping. The advantages of letting the system handle your spam are that the spam mailboxes you've specified will periodically be moved to the system repository, so they will not take up space in your home directory, and that the older spam in the repository will periodically be expired (newer spam is better than older spam for BSF training).

The second grouping is the spam that will actually be used for BSF training. It is quite acceptable to just use the system spam repository for this. If you do that, then your filtering will be trained from a combination of your saved (moved) spam in the repository and a considerable collection maintained there by the Lab staff.

These choices for ham and spam mailboxes, system repository use, etc., can always be changed later by editing your ~/.spamassassin/learnspam.conf file with a text editor.

LearnSpam uses two timestamp files to keep track of when it last processed your ham and spam, and will not reprocess a mailbox that has not been modified since the last run. Therefore, if you have a large ham mailbox (or a large spam mailbox that is not moved automatically), it makes sense to occasionally save the older messages to a secondary file. For example, if you have a large mailbox named saved-messages used for ham processing, then the older messages could occasionally be saved to saved-messages-old. The secondary file changes only when you add to it, so most runs can skip it and reprocess only the smaller, active mailbox, which greatly speeds up BSF training.
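
The general technique of skipping unmodified mailboxes can be sketched as follows. This is not LearnSpam's actual code: the function names and the timestamp file path are assumptions for illustration. The comparison uses the shell's -nt ("newer than") file test, which is widely supported (bash, dash, ksh) although not strictly POSIX.

```shell
# needs_processing MAILBOX STAMP: succeed if MAILBOX should be
# processed, i.e. there is no timestamp file yet, or MAILBOX has
# been modified more recently than the timestamp file.
needs_processing() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# process_if_changed MAILBOX STAMP: stand-in for a training run that
# only reports what it would do, then records the time of the run.
process_if_changed() {
    if needs_processing "$1" "$2"; then
        echo "processing $1"
        touch "$2"
    else
        echo "skipping $1 (unchanged)"
    fi
}
```

With this kind of check, a large mailbox that has not been touched since the last run is skipped, while any mailbox that has received new messages is processed again.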

Managing your spam:

Here is a recommendation for managing your spam:

If you need help setting up your filtering, please contact the Lab staff.

Differences between "spam to move" and "spam to use":

"Spam to move" is one or more mailboxes that will be moved to the system spam repository. These should consist of known spam only.

"Spam to use" is one or more mailboxes (and/or the system spam repository, see above) that will be used by the filter training programs; this is what will make your spam filtering more successful. These should consist of known spam only.

Most commonly, a user will just specify one or two of their mailboxes as "spam to move", and then just specify the system repository (it will be presented as an option) as the "spam to use", when prompted by the learnspam program.

You can list the same mailboxes in both categories, maintain your own long-term spam repository, or use some combination of the two. But we recommend that you follow the common usage outlined in the previous paragraph.

It's automatic:

The LearnSpam system will arrange to move your spam and run the BSF learning automatically. By default, these features will be turned on. You can change the settings anytime by editing your ~/.spamassassin/learnspam.conf file with a text editor. Also, removing or renaming your configuration file will disable any actions by LearnSpam for your account.
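
Disabling LearnSpam by renaming the configuration file, and re-enabling it later, might look like the sketch below. The configuration path is the one given on this page; the ".disabled" suffix and the function names are arbitrary choices for illustration.

```shell
# Location of the LearnSpam configuration file, as given on this page.
conf="$HOME/.spamassassin/learnspam.conf"

# Rename the file so that LearnSpam takes no action for this account.
learnspam_disable() { mv "$conf" "$conf.disabled"; }

# Restore the original name to turn LearnSpam back on.
learnspam_enable() { mv "$conf.disabled" "$conf"; }
```

Renaming rather than deleting keeps your mailbox choices, so nothing has to be re-entered when you turn the system back on.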


Please send any comments, corrections, or suggestions to the Lab staff.