Spam Filtering

Like many ISPs, I have had an ongoing and frustrating battle with spam. You might think spammers would realize that the people behind system accounts such as “root,” “postmaster” and “webmaster” are probably savvy enough not to fall for the advertised schemes, and are the least likely to be infected with a virus or enticed to install unknown software. Even so, those accounts are flooded with unsolicited email. All in all, I was receiving several hundred pieces of spam each day. I have no issue with deleting those messages; in fact, it's easy and somewhat therapeutic. Unfortunately, the messages also tend to obscure the email I actually care about, particularly when the sender doesn't choose a descriptive subject.

I have tested a number of spam filters, most notably SpamAssassin and dspam, with mixed results. SpamAssassin was not consistent enough; I had a large number of false negatives and some false positives. While the system is configurable, tuning it is not simple; it isn't obvious which knob to turn to produce the desired effect. Worse, I found that it required significant system resources to run; mail was taking minutes to get through the filter during peak times. So SpamAssassin simply didn't meet my needs.

I had higher hopes for dspam. Dspam claims to be scalable and enterprise ready. While it uses a relational database on the back end, it still requires a filesystem store for certain data, something I would rather not use since the multiple SMTP servers on the network should ideally be independent of one another. The filter took some time to train, but once trained it ran fast and did an adequate job of filtering spam with very few false positives. Things looked good for about six months, after which it corrupted its database and needed to be retrained. It did the same thing after another six months. I switched databases and started running a fast hash database for storage, but encountered the same problem. It obviously lacks stability, at least as I used the tool.

The Solution — spamd

Spamd, not to be confused with dspam, which didn't work out for me, has turned out to be a fantastic first-level spam filter. The software, which is part of the OpenBSD UNIX distribution, implements a combination of blacklisting and greylisting that filters mail based on adherence to the SMTP protocol rather than by content.

The blacklist support is pretty much what one might expect — provided a list of sites, any and all email from them will be rejected. Greylisting is the process of filtering out those clients that don't adhere to the mail transfer specification. Any system that hasn't exchanged email with the network in the past month or so will be sent a temporary failure response. “Real” mail servers will accept the response and try again later. When they do, the email is processed correctly, and the remote server is added to the whitelist — a set of machines from which we accept mail without testing.
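The greylisting decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not spamd's actual code: the class name, the 25-minute minimum retry delay, and keying on the client IP alone are my assumptions (I believe the real daemon tracks more of the SMTP envelope). The four-hour greylist expiry and the roughly one-month whitelist lifetime come from the figures in this post.

```python
import time

GREY_DELAY = 25 * 60            # assumed: a retry sooner than this does not count
GREY_EXPIRY = 4 * 3600          # greylist entries lapse after four hours
WHITE_EXPIRY = 30 * 24 * 3600   # "the past month or so"


class Greylister:
    """Toy greylist keyed on client IP (hypothetical, for illustration)."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.grey = {}    # ip -> time of first attempt
        self.white = {}   # ip -> time of last accepted connection

    def check(self, ip):
        now = self.clock()

        # Known-good server: accept and refresh its whitelist entry.
        last = self.white.get(ip)
        if last is not None and now - last <= WHITE_EXPIRY:
            self.white[ip] = now
            return "accept"
        self.white.pop(ip, None)

        # Unknown server, or one whose greylist entry has lapsed: start over.
        first = self.grey.get(ip)
        if first is None or now - first > GREY_EXPIRY:
            self.grey[ip] = now
            return "tempfail"

        # Retried too quickly: keep it waiting.
        if now - first < GREY_DELAY:
            return "tempfail"

        # A patient retry inside the window: promote to the whitelist.
        del self.grey[ip]
        self.white[ip] = now
        return "accept"
```

A server that connects once and never comes back simply ages out of the greylist; one that retries after a reasonable delay is whitelisted and is not tested again until it has been quiet for a month.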

Why don't spammers pass this test? I speculate that there are three reasons. First, if you're sending 10,000,000 messages, you tend to take shortcuts. Anything that delays transmission will be ignored; time is literally money. Second, much of the spam is probably being sent from a botnet. While this is a powerful network of machines, they're desktops, not servers, and any individual machine can be powered down or rebooted at any time. That makes it even more important to send your burst of messages as fast as possible. Last, retries take memory and planning. You need to keep track of which messages were delivered, which failed, and which need to be retried. Then, after a delay, you need to resend those that had temporary failures. While this isn't a big deal for legitimate email, it is a significant complication for large quantities of spam.
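To make the third point concrete, here is a rough sketch of the bookkeeping a well-behaved sender has to do. The class, the 30-minute retry delay, and the `deliver` callback are all hypothetical; the only real-world fact relied on is that SMTP reply codes in the 2xx range mean success, 4xx mean a temporary failure, and 5xx mean a permanent one.

```python
import heapq

RETRY_DELAY = 30 * 60  # assumed retry interval, in seconds


class OutboundQueue:
    """Toy outbound mail queue that retries temporary failures."""

    def __init__(self):
        self.pending = []     # heap of (due_time, sequence, message)
        self.delivered = []
        self.failed = []
        self._seq = 0         # tie-breaker so messages are never compared

    def enqueue(self, message, due=0.0):
        heapq.heappush(self.pending, (due, self._seq, message))
        self._seq += 1

    def run(self, now, deliver):
        """Attempt every message that is due; deliver() returns an SMTP code."""
        while self.pending and self.pending[0][0] <= now:
            _, _, message = heapq.heappop(self.pending)
            code = deliver(message)
            if 200 <= code < 300:
                self.delivered.append(message)
            elif 400 <= code < 500:
                # Temporary failure (what a greylist sends): try again later.
                self.enqueue(message, due=now + RETRY_DELAY)
            else:
                self.failed.append(message)
```

None of this is hard, but it means holding every message in memory or on disk until its retry comes due, which is exactly the state a fire-and-forget spam run wants to avoid.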

Regardless of whether this speculation is accurate, the results have been impressive. Rather than the hundreds of spam messages I used to personally receive each day, I now get one. Every few days.

Entertainment

Among other functionality, spamd provides a few tools for determining whether a server is sending spam. You can, for example, record the addresses of machines that deliver to unadvertised MTAs, or that send mail to a list of non-existent accounts. Once such a server has been flagged, it is moved to a temporary blacklist; the blacklist entry expires 24 hours after the last trigger occurs.
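The trap-address idea is easy to model. The sketch below is my own illustration rather than anything taken from spamd: the class and method names are invented, and the trap address in the usage note is a placeholder. The one behaviour it copies from the description above is that each new trigger restarts the 24-hour clock.

```python
import time

TRAP_EXPIRY = 24 * 3600  # entries expire 24 hours after the last trigger


class TrapList:
    """Toy temporary blacklist driven by mail to non-existent accounts."""

    def __init__(self, trap_addresses, clock=time.time):
        self.traps = {addr.lower() for addr in trap_addresses}
        self.clock = clock
        self.last_trigger = {}   # ip -> time of most recent trap hit

    def record_delivery(self, ip, recipient):
        """Note a delivery attempt; return True if it hit a trap address."""
        if recipient.lower() in self.traps:
            self.last_trigger[ip] = self.clock()
            return True
        return False

    def is_blacklisted(self, ip):
        last = self.last_trigger.get(ip)
        if last is None:
            return False
        if self.clock() - last >= TRAP_EXPIRY:
            del self.last_trigger[ip]   # quiet for a full day: forgive it
            return False
        return True
```

With `TrapList(["nobody@example.invalid"])`, a server that keeps hammering the trap address never leaves the list, while one that stops is released a day after its last attempt.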

Better yet, when a blacklisted machine attempts to send email, the connection becomes very, very slow. One character per second slow. And then, if the sender was foolish enough to remain connected, the mail is rejected anyway. Watching those logs is my new favorite hobby.
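The slow-motion treatment amounts to writing each reply one character at a time with a pause in between. A minimal sketch, assuming a `send` callback that writes to the client's socket; the function name and the reply text in the usage note are made up for illustration and are not spamd's actual wording:

```python
import time


def stutter(send, text, delay=1.0, sleep=time.sleep):
    """Dribble `text` to the client one character at a time.

    `send` is any callable that writes a string to the connection;
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    for ch in text:
        send(ch)
        sleep(delay)
```

Calling `stutter(conn_write, "550 Your mail is not wanted here\r\n")`, where `conn_write` stands in for whatever writes to the socket, would hold the sender for over half a minute just to deliver the bad news, and every other line of the SMTP conversation costs them the same way.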

For those who are interested, statistics on the spam filter's performance over the previous 24 hours are available at http://www.wtech-llc.com/spam.html. It is interesting to watch the greylist graph; there will be a periodic storm of email from unrecognized servers, but nearly all of it will expire from the greylist after four hours, the retry cutoff time.
