Stopping Spam

August 2003

There are many ways to fight spam. Which works best? So far the best single solution is probably Bayesian filtering. But you don't have to choose just one. Many of the following solutions could be used in combination.

Index:

Complaining to Spammers' ISPs
Mail Server Blacklists
Signature-Based Filtering
Bayesian Filtering

Complaining to Spammers' ISPs

Good: Raises cost of spamming.
Bad: Laborious.
Role: Partial solution, for experts.

This was the original spam solution. Believe it or not, complaints can have an effect. True, spammers expect to be shut down, and already have fresh accounts lined up. Constantly switching providers is just a cost of doing business. But the faster they get booted due to complaints, the greater this cost becomes.

Complaining effectively is difficult. Most spammers forge the headers of their emails to disguise their origin. You have to learn to interpret headers to understand where a spam really came from.
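As a rough illustration of what reading headers involves, here is a minimal Python sketch that pulls out the chain of Received lines from a message. The sample message, its host names, and the function name received_chain are all made up for the example. Each server that relays a mail adds its own Received line at the top, so the lines added by servers you trust are the reliable ones; anything below them could have been forged by the sender.

```python
from email import message_from_string

# A made-up message; hosts and addresses are placeholders, not real spam.
RAW = """\
Received: from mx.example.net (mx.example.net [192.0.2.10])
\tby mail.example.org with ESMTP; Mon, 4 Aug 2003 10:00:02 -0400
Received: from forged.example.com (unknown [198.51.100.7])
\tby mx.example.net with SMTP; Mon, 4 Aug 2003 10:00:01 -0400
From: "Friend" <friend@example.com>
Subject: Guaranteed offer

Body text.
"""

def received_chain(raw_message):
    """Return the Received headers, newest (closest to you) first."""
    msg = message_from_string(raw_message)
    # Collapse the folded (multi-line) header values onto one line each.
    return [" ".join(h.split()) for h in msg.get_all("Received", [])]

for hop in received_chain(RAW):
    print(hop)
```

In this example the second line is the interesting one: the name the sender claimed (forged.example.com) may be a lie, but the address in brackets was recorded by the receiving server.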

Another option is to complain to the ISP hosting the site advertised in the spam (or, if the ISP is a spam hosting service, their upstream provider). But again, it can take some effort to figure out who this is.

Mail Server Blacklists

Good: Block spam right at the server.
Bad: Incomplete, sometimes irresponsible.
Role: A first pass to eliminate up to 50% of spam early on.

Groups of volunteers maintain blacklists of mail servers either used by spammers, or that have security holes that would let spammers use them. Some ISPs subscribe to such blacklists, and automatically refuse any mail from servers on them.
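The text above doesn't say how a mail server consults a blacklist. One common arrangement, assumed here, is to publish the list through DNS: the server reverses the octets of the connecting IP address, appends the blacklist's zone name, and treats any answer as "listed". The sketch below shows that convention. The zone dnsbl.example.org is a placeholder, not a real blacklist, and the function names are invented for the example.

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address")
    # 192.0.2.10 in zone dnsbl.example.org -> 10.2.0.192.dnsbl.example.org
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """True if the zone returns an address for this IP, meaning it is listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No such name (or the lookup failed): treat as not listed.
        return False

print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Note that this sketch treats a failed lookup the same as "not listed", which errs on the side of accepting mail.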

Blacklists have two downsides. One is that they never manage to list more than about half the servers that spam comes from. Another is that a blacklist is only as good as the people running it. Some blacklists are run by vigilantes who shoot first and ask questions later. Using the wrong blacklist could mean bouncing a lot of legitimate mail.

Blacklists are useful at the ISP level, as long as you (a) use a responsible one (if there are any) and (b) don't expect it to be more than a first cut at the problem.

Signature-Based Filtering

Good: Rarely blocks legitimate mail.
Bad: Catches only 50-70% of spam.
Role: A first-pass filter on big email services.

Signature-based filters work by comparing incoming email to known spams. Brightmail does it by maintaining a network of fake email addresses. Any email sent to these addresses must be spam. So when they see the same email sent to an address they're protecting, they know they can filter it out.

In order to tell whether two emails are the same, these systems calculate "signatures" for them. One way to calculate a signature for an email would be to assign a number to each character, then add up all the numbers. It would be unlikely that a different email would have exactly the same signature.
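The add-up-the-characters scheme just described can be sketched in a few lines of Python. This is only the toy version from the paragraph above, not what any real product does, and the function name and sample texts are made up.

```python
def signature(text):
    """Toy signature: add up the numeric code of every character."""
    return sum(ord(ch) for ch in text)

known_spam = "Guaranteed results! Buy now."
incoming = "Guaranteed results! Buy now."

# Identical text always produces an identical signature.
print(signature(incoming) == signature(known_spam))
```

A plain sum ignores the order of the characters, so two texts that are rearrangements of each other would collide. A real system would presumably use a stronger function; a cryptographic hash such as hashlib.sha256 would be one option.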

The way to attack a signature-based filter is to add random stuff to each copy of a spam, to give it a distinct signature. When you see random junk in the subject line of a spam, that's why it's there-- to trick signature-based filters.
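To see why this works, here is a small sketch using the toy sum-of-characters signature. The helper add_junk and the sample text are invented for the example. Each copy of the spam gets a few random letters appended, and none of the copies matches the signature of the original any more.

```python
import random
import string

def signature(text):
    """Toy signature: add up the numeric code of every character."""
    return sum(ord(ch) for ch in text)

def add_junk(text, rng, length=6):
    """Append a space and some random lowercase letters, as a spammer might."""
    junk = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return text + " " + junk

rng = random.Random(0)  # seeded only so the example is repeatable
spam = "Guaranteed results! Buy now."
copies = [add_junk(spam, rng) for _ in range(3)]

for copy in copies:
    # Every copy reads the same to a human but no longer matches the original.
    print(copy, signature(copy) == signature(spam))
```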

The spammers have always had the upper hand in the battle against signature-based filters. As soon as the filter developers figure out how to ignore one kind of random insertion, the spammers switch to another. So signature-based filters have never had very good performance.

Bayesian (aka Statistical) Filtering

Good: Catch 99% to 99.9% of spam, low false positives.
Bad: Have to be trained.
Role: Best current solution for individual users.

Bayesian filters are the latest in spam filtering technology. They recognize spam by looking at the words (or "tokens") each email contains.

A Bayesian filter starts with two collections of mail, one of spam and one of legitimate mail. For every word in these emails, it calculates a spam probability based on the proportion of spam occurrences. In my own email, "Guaranteed" has a spam probability of 98%, because it occurs mostly in spam; "This" has a spam probability of 43%, because it occurs about equally in spam and legitimate mail; and "deduce" has a spam probability of only 3%, because it occurs mostly in legitimate email.
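Here is a minimal sketch of that calculation, assuming the simplest reading of "proportion of spam occurrences": compare how often a word turns up per spam with how often it turns up per legitimate mail. The whitespace tokenizer, the function names, and the clamping to the range 0.01 to 0.99 are assumptions made for the example, not details given above.

```python
from collections import Counter

def tokenize(text):
    """Split on whitespace. Case is kept, so "This" and "this" differ."""
    return text.split()

def word_probabilities(spam_mails, good_mails):
    """Map each word to an estimated probability that a mail containing it is spam."""
    spam_counts = Counter(w for m in spam_mails for w in tokenize(m))
    good_counts = Counter(w for m in good_mails for w in tokenize(m))
    n_spam = max(len(spam_mails), 1)
    n_good = max(len(good_mails), 1)
    probs = {}
    for word in set(spam_counts) | set(good_counts):
        s = spam_counts[word] / n_spam   # occurrences per spam
        g = good_counts[word] / n_good   # occurrences per legitimate mail
        p = s / (s + g)
        # Clamp (an assumption) so no single word is treated as absolute proof.
        probs[word] = min(0.99, max(0.01, p))
    return probs

spam = ["Guaranteed offer This", "Guaranteed prize"]
good = ["This is the report", "deduce the result"]
probs = word_probabilities(spam, good)
for word in ("Guaranteed", "This", "deduce"):
    print(word, probs[word])
```

With a corpus this tiny the numbers are extreme. The 98%, 43%, and 3% figures quoted above come from a real collection of mail.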

When a new mail arrives, the filter collects the 15 or 20 words whose spam probabilities are furthest (in either direction) from a neutral 50%, and calculates from these an overall probability that the email is a spam.
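The text doesn't give the formula for the overall probability. The sketch below assumes the usual naive Bayes combination, which treats the words as independent evidence. The function names, the default of 15 words, and the neutral 0.5 given to words the filter has never seen are likewise assumptions.

```python
import math

def most_interesting(words, probs, n=15, unknown=0.5):
    """Return the n probabilities furthest from a neutral 0.5."""
    # Each distinct word counts once; unseen words get the neutral value.
    scored = [probs.get(w, unknown) for w in set(words)]
    scored.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return scored[:n]

def spam_probability(words, probs, n=15):
    """Combine the most interesting word probabilities into one score.

    Assumes every probability is strictly between 0 and 1.
    """
    top = most_interesting(words, probs, n)
    spam_product = math.prod(top)
    good_product = math.prod(1.0 - p for p in top)
    return spam_product / (spam_product + good_product)

# Word probabilities taken from the examples quoted in the text.
probs = {"Guaranteed": 0.98, "This": 0.43, "deduce": 0.03}
print(spam_probability(["Guaranteed", "This"], probs))
print(spam_probability(["deduce", "This"], probs))
```

Because the products multiply evidence in both directions, one strongly innocent word can offset one strongly guilty one.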

Because they learn to distinguish spam from legitimate mail by looking at the actual mail sent to each user, Bayesian filters are extremely accurate, and adapt automatically as spam evolves.

Bayesian filters vary in performance. As a rule you can count on filtering rates of 99%. Some, like SpamProbe, deliver filtering rates closer to 99.9%.

Bayesian filters are particularly good at avoiding "false positives"-- legitimate email misclassified as spam. This is because they consider evidence of innocence as well as evidence of guilt. A Bayesian filter is unlikely to reject an otherwise innocent email that happens to contain the word "sex", as a rule-based filter might.

The disadvantage of Bayesian filters is that they need to be trained. The user has to tell them whenever they misclassify a mail. Of course, after the filter has seen a couple hundred examples, it rarely guesses wrong, so in the long term there is little extra work involved.
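One way training might work, sketched under the assumption that the filter has already filed each mail under its own guess: when the user flags a mistake, the mail's words are taken out of the wrong collection and added to the right one. The class and method names here are hypothetical.

```python
from collections import Counter

class TrainingData:
    """Word counts for the two collections of mail a filter learns from."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.mails = {"spam": 0, "good": 0}

    def learn(self, text, label):
        """Add a mail's words to the collection named by label."""
        self.counts[label].update(text.split())
        self.mails[label] += 1

    def unlearn(self, text, label):
        """Remove a mail's words from the collection named by label."""
        self.counts[label].subtract(text.split())
        # Unary plus drops words whose count has fallen to zero or below.
        self.counts[label] = +self.counts[label]
        self.mails[label] -= 1

    def correct(self, text, wrong_label):
        """Move a misclassified mail from wrong_label to the other collection."""
        right_label = "good" if wrong_label == "spam" else "spam"
        self.unlearn(text, wrong_label)
        self.learn(text, right_label)

data = TrainingData()
data.learn("Guaranteed offer", "good")    # the filter guessed wrong
data.correct("Guaranteed offer", "good")  # the user says it was spam
print(data.counts["spam"]["Guaranteed"], data.counts["good"]["Guaranteed"])
```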

Another disadvantage of Bayesian filters is that they're new. The technology only became widespread in 2003. Most commercial spam filters are still rule-based.