Gnus Manual: The problem of spam

9.16.1 The problem of spam

First, some background on spam.

If you have access to e-mail, you are familiar with spam (technically termed UCE, Unsolicited Commercial E-mail). Simply put, it exists because e-mail delivery is very cheap compared to paper mail, so only a very small percentage of people need to respond to an UCE to make it worthwhile to the advertiser. Ironically, one of the most common spams is the one offering a database of e-mail addresses for further spamming. Senders of spam are usually called spammers, but terms like vermin, scum, sociopaths, and morons are in common use as well.

Spam comes from a wide variety of sources. It is simply impossible to dispose of all spam without discarding useful messages. A good example is the TMDA system, which requires senders unknown to you to confirm themselves as legitimate senders before their e-mail can reach you. Without getting into the technical side of TMDA, a downside is clearly that e-mail from legitimate sources may be discarded if those sources can’t or won’t confirm themselves through the TMDA system. Another problem with TMDA is that it requires its users to have a basic understanding of e-mail delivery and processing.

The simplest approach to filtering spam is filtering, at the mail server or when you sort through incoming mail. If you get 200 spam messages per day from ‘random-address@vmadmin.com’, you block ‘vmadmin.com’. If you get 200 messages about ‘VIAGRA’, you discard all messages with ‘VIAGRA’ in the message. If you get lots of spam from Bulgaria, for example, you try to filter all mail from Bulgarian IPs.

This, unfortunately, is a great way to discard legitimate e-mail. The risks of blocking a whole country (Bulgaria, Norway, Nigeria, China, etc.) or even a continent (Asia, Africa, Europe, etc.) from contacting you should be obvious, so don’t do it if you have the choice.

In another instance, the very informative and useful RISKS digest has been blocked by overzealous mail filters because it contained words that were common in spam messages. Nevertheless, in isolated cases, with great care, direct filtering of mail can be useful.

Another approach to filtering e-mail is the distributed spam processing, for instance DCC implements such a system. In essence, N systems around the world agree that a machine X in Ghana, Estonia, or California is sending out spam e-mail, and these N systems enter X or the spam e-mail from X into a database. The criteria for spam detection vary—it may be the number of messages sent, the content of the messages, and so on. When a user of the distributed processing system wants to find out if a message is spam, he consults one of those N systems.

Distributed spam processing works very well against spammers that send a large number of messages at once, but it requires the user to set up fairly complicated checks. There are commercial and free distributed spam processing systems. Distributed spam processing has its risks as well. For instance legitimate e-mail senders have been accused of sending spam, and their web sites and mailing lists have been shut down for some time because of the incident.

The statistical approach to spam filtering is also popular. It is based on a statistical analysis of previous spam messages. Usually the analysis is a simple word frequency count, with perhaps pairs of words or 3-word combinations thrown into the mix. Statistical analysis of spam works very well in most of the cases, but it can classify legitimate e-mail as spam in some cases. It takes time to run the analysis, the full message must be analyzed, and the user has to store the database of spam analysis. Statistical analysis on the server is gaining popularity. This has the advantage of letting the user Just Read Mail, but has the disadvantage that it’s harder to tell the server that it has misclassified mail.

Fighting spam is not easy, no matter what anyone says. There is no magic switch that will distinguish Viagra ads from Mom’s e-mails. Even people are having a hard time telling spam apart from non-spam, because spammers are actively looking to fool us into thinking they are Mom, essentially. Spamming is irritating, irresponsible, and idiotic behavior from a bunch of people who think the world owes them a favor. We hope the following sections will help you in fighting the spam plague.