Spam is the popular term for junk email, also known more formally as unsolicited bulk mail.
Mail filtering refers to the ability to have actions automatically performed on your email before you read it. Such actions are triggered if the filtering software recognizes certain patterns in a message.
A common example of mail filtering is where messages from mailing lists you subscribe to are delivered into folders other than your inbox, avoiding cluttering the inbox up and allowing the lists to be read at leisure.
SpamAssassin is a popular open-source software package that applies a variety of textual and other tests to messages in order to estimate the likelihood that they are spam. This likelihood is represented as a number, the spam score. SpamAssassin assigns a score to each message it sees, which can subsequently be used to determine the message's disposition.
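As an illustration of how the score can be read back from a message, here is a minimal Python sketch. It assumes the score is recorded in the stock SpamAssassin X-Spam-Status header in the form "score=N"; the markup used on a particular server may differ. The function name spam_score and the example message are hypothetical.

```python
import re
from email import message_from_string
from typing import Optional


def spam_score(raw_message: str) -> Optional[float]:
    """Return the spam score recorded in a message, or None if there is none.

    Assumes the stock "X-Spam-Status: ..., score=N ..." header format.
    """
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None
    # Scores may be negative and usually carry one decimal place.
    match = re.search(r"\bscore=(-?\d+(?:\.\d+)?)", status)
    return float(match.group(1)) if match else None


# Invented example message, for illustration only.
EXAMPLE = (
    "From: sender@example.com\n"
    "Subject: Hello\n"
    "X-Spam-Status: No, score=1.3 required=5.0 tests=HTML_MESSAGE\n"
    "\n"
    "Body text.\n"
)

print(spam_score(EXAMPLE))  # prints 1.3
```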
It is impossible for SpamAssassin (or any other software package) to determine automatically for certain whether or not a message is spam. The indications that it searches for in messages can appear in nonspam messages too. Spams are usually advertisements of one sort or another; many legitimate mailing lists pay for themselves by carrying advertising in their messages, and these adverts often use the same sort of language as spam. Similar language can also appear in legitimate marketing messages and alerts that people have signed up to receive. Newsletters often use the same presentational techniques (formatting, colour, etc) that are commonly used in spam. It's even possible that a single message will be regarded as spam by one person and nonspam by another. In the end it's a personal judgement. We can't make that judgement for you, but we can supply information to help in making that judgement, and the facilities to use it.
The spam score assigned to any message is not a certain judgement but is instead an estimate of the likelihood that it's spam. The higher the score, the more likely it is that it's spam, and the lower the score the less likely it is. But it's quite possible for a nonspam message to score highly and for a spam message to receive a low score.
Since the software can't determine for sure if a message is spam or not, you can't use it to "magically" make all spam disappear while leaving all nonspam mail unaffected. What you can do is to use it to keep your inbox relatively free of the clutter and annoyance of the floods of spam that the Internet is nowadays full of, by having filtering software divert the messages that are likely spam into a mail folder other than your inbox.
Apart from the annoyance of it all, the biggest problems with spam are the danger of deleting a legitimate message along with all the spam in the inbox, and the time that is required to do this cleaning up carefully. By using mail filters to look at SpamAssassin's scores on incoming messages and to divert those that are likely spam to another mail folder so they don't appear in the inbox at all, you should be able to largely avoid these problems. Some spam will of course be given low scores by SpamAssassin, and so make it into your inbox, but the use of filtering should greatly reduce the amount that makes it through, and make it that much easier to deal with.
SpamAssassin tries very hard to make sure that nonspam messages receive low scores, so as to minimize the possibility that they are filed along with the real spam. But as we have explained, it's impossible to do this with complete accuracy while still catching enough spam to make the whole process useful, so it's pretty certain that a legitimate message will occasionally be misfiled. This means that from time to time you have to look at the folder where likely spam is filed to make sure that it really is all spam. Anything that isn't spam can be filed back in the inbox, and the rest, which you judge to be spam, can then be safely deleted. Although this does not relieve you of the necessity of dealing with spam, it does mean that you only have to do it once in a while rather than continually, and it minimizes the chances of discarding a nonspam.
Having filtered with the SpamAssassin markup for a while, and no messages having been incorrectly filed in the likely spam folder, it's tempting to think you should be able to automatically delete all such messages. We can't recommend that you do this, as it assumes that the situation remains stable over time. The nature of spam changes as spammers attempt to circumvent the latest versions of anti-spam software. To keep up with this, we may modify SpamAssassin's rules, and certainly we will be installing new versions, as it seems useful to do so. While SpamAssassin is carefully tuned with a hopefully representative sample of email to minimize the chance of falsely filing legitimate messages, the effect of such changes on an individual's mail is unpredictable, as it may well deviate significantly from the sample. It's also impossible to predict the effect on messages that you haven't seen before - new mailing list subscriptions, a first contact, and so on. Even blindly cleaning out filed spam that's older than, say, two weeks means that you have the most recent two weeks' worth to search if it seems that some message you were expecting has gone missing.
Checking your likely spam folder once a day for legitimate messages would probably be sufficient for most people.
The main control over the spam filtering is a number called the threshold. Messages rated with a score equal to or greater than the threshold are filed in the likely spam folder; messages rated less than the threshold are not affected, and will be delivered to your inbox (unless there are subsequent filtering rules in place which might affect them).
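The filing rule just described can be sketched in Python as follows. This is only an illustration: the function name file_message and the folder names are hypothetical, and the real work is done by your mail filtering software rather than by code like this.

```python
def file_message(score: float, threshold: float) -> str:
    """Return the folder a message would be filed in, given its spam score."""
    # A score equal to or greater than the threshold counts as likely spam.
    if score >= threshold:
        return "likely-spam"
    # Anything below the threshold is left alone and reaches the inbox.
    return "inbox"


print(file_message(5.0, 5.0))  # prints likely-spam
print(file_message(4.9, 5.0))  # prints inbox
```

Note that a message scoring exactly the threshold is treated as likely spam.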
At any given threshold there is always a chance that a spam message is filed in your inbox (a false negative) and a chance that a nonspam is filed in the likely spam folder (a false positive). If you increase the filtering threshold, the chance of false negatives increases, so more spam gets through to your inbox. At the same time, the chance of false positives decreases, so less nonspam mail is filed along with the spam.
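To make the trade-off concrete, the following Python sketch counts both kinds of error at several thresholds. The scores and spam/nonspam labels are invented purely for illustration and are not measurements of real mail; the name count_errors is hypothetical.

```python
# Hypothetical (score, is_spam) pairs, invented for illustration only.
SAMPLE = [
    (1.2, False), (3.8, False), (6.1, False),
    (4.5, True), (7.9, True), (12.3, True),
]


def count_errors(messages, threshold):
    """Return (false_negatives, false_positives) at the given threshold."""
    # False negative: spam scoring below the threshold, so it reaches the inbox.
    false_negatives = sum(
        1 for score, is_spam in messages if is_spam and score < threshold
    )
    # False positive: nonspam scoring at or above the threshold, so it is
    # filed in the likely spam folder.
    false_positives = sum(
        1 for score, is_spam in messages if not is_spam and score >= threshold
    )
    return false_negatives, false_positives


for threshold in (4, 7, 10):
    fn, fp = count_errors(SAMPLE, threshold)
    print(f"threshold {threshold}: {fn} spam in inbox, "
          f"{fp} nonspam in likely spam folder")
```

With this made-up sample, raising the threshold from 4 to 10 takes the spam reaching the inbox from 0 messages to 2, while the nonspam filed as likely spam drops from 1 message to 0.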
Where the threshold should be set depends on the sort of email that each user receives. If you receive mail that tends to score highly, such as HTML-formatted newsletters, commercial announcements, and so on, you may prefer a higher threshold to allow this mail through to your inbox while accepting that more spam will get through with it. If you receive only relatively "clean" mail, you may prefer a lower threshold. You may also prefer a lower threshold if you are prepared to check your likely spam folder often, while a higher threshold would allow you to check it less often.
So it is up to each user to set a threshold that meets their needs as well as can be achieved.
A threshold of around 4 seems to be used by many SpamAssassin users, so you may wish to start there and adjust it if you are not happy with the results. A more conservative approach for those who are concerned with nonspam messages being filtered as likely spam would be to start with a threshold of 10 and adjust down if too much spam is still getting through to the inbox. After running with the initial threshold for a while, by observing how much spam is still making it through to your inbox and scanning your likely spam folder for any legitimate messages being misfiled, you should be able to judge whether a higher or lower value is better for you. You should then adjust the threshold by a small amount (we would recommend only one or two points at a time) and again observe what the results are.
In this way, you should gradually arrive at a threshold that is best for you and the particular character of the email you receive. This will be different for different users, but we would imagine that typically many people would arrive at a threshold of between 5 and 10. You should also be able to develop some feel for how often you need to check your likely spam folder for any misfiled nonspam messages. This might be several times a day, or it might be every other day, but we would not recommend that you entirely ignore the contents. For one thing, there is always the chance of nonspam ending up there. For another, it will grow and eat into your disk quota, so will require periodic cleaning out.
You may have regular sources of mail such as mailing lists, company announcements, or notifications from services such as Amazon or eBay, which send you messages that tend to score highly. Having such messages mixed into your general mail flow may require you to use a higher spam threshold than you might like, in order to avoid them being filed in your likely spam folder. This results in more spam getting through to your inbox. If you could trap these messages before they reached the spam test, you might be able to lower your spam threshold and keep more spam out of your inbox. If the mail filtering software you use allows, this can be achieved by filtering such messages prior to the spam test, or by specifying messages that are not tested at all.
The general technique is to install filtering rules for such mail before the point where high scoring messages are filtered to your likely spam folder. If you use a powerful filtering agent such as procmail on Unix servers or those provided by some mail clients, this would work in exactly the same way as rules you may already have for filing mailing lists in separate folders. These rules are installed before the rule that tests the spam score. See the documentation for your filtering client.
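The importance of rule order can be sketched in Python as an ordered list of rules in which the first match wins. This is not procmail syntax or any particular client's rule format; the message fields (list_id, score), the folder names, the threshold value, and the function names are all hypothetical.

```python
THRESHOLD = 5.0  # hypothetical spam threshold


def is_newsletter(msg):
    """Match a mailing list you subscribe to (hypothetical list identifier)."""
    return msg.get("list_id") == "news.lists.example.org"


def is_likely_spam(msg):
    """Match any message scoring at or above the threshold."""
    return msg["score"] >= THRESHOLD


# Order matters: the mailing-list rule comes before the spam-score rule.
RULES = [
    (is_newsletter, "lists/news"),
    (is_likely_spam, "likely-spam"),
]


def deliver(msg, rules=RULES):
    """Return the folder chosen by the first matching rule, else the inbox."""
    for matches, folder in rules:
        if matches(msg):
            return folder
    return "inbox"


newsletter = {"list_id": "news.lists.example.org", "score": 8.2}
print(deliver(newsletter))                         # prints lists/news
print(deliver(newsletter, list(reversed(RULES))))  # prints likely-spam
```

With the rules in the intended order, the high-scoring newsletter is filed in its own folder; with the order reversed, the same message is caught by the spam rule first.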
If you use the IMP webmail interface for the staffmail or sms servers, a simplified facility known as whitelisting (the opposite idea to that of a blacklist) is provided. This allows you to specify that mail from certain addresses is not tested for its spam score and so is never filtered to the likely spam folder. Such messages may of course be subsequently filtered by another rule.
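The idea behind whitelisting can be illustrated with a short Python sketch. It does not describe how IMP implements the facility; the addresses and the function name is_whitelisted are hypothetical, and matching on the address in the From header is an assumption made for the example.

```python
from email.utils import parseaddr

# Hypothetical whitelist of sender addresses, stored in lower case.
WHITELIST = {"orders@shop.example.com", "announce@example.org"}


def is_whitelisted(from_header: str, whitelist=WHITELIST) -> bool:
    """Return True if the From header's address is on the whitelist.

    A whitelisted message would skip the spam-score test altogether.
    """
    # parseaddr splits 'Display Name <user@host>' into (name, address).
    _, address = parseaddr(from_header)
    # Compare case-insensitively, since senders vary in capitalisation.
    return address.lower() in whitelist


print(is_whitelisted("Shop Orders <Orders@Shop.Example.com>"))  # prints True
print(is_whitelisted("stranger@example.net"))                   # prints False
```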
Questions or Concerns: email@example.com