Filtering Spam With Lamson

Lamson has initial support for filtering spam with the SpamBayes spam filter library. What Lamson provides is a set of easy-to-use decorators that you attach to your state functions to indicate that you want spam filtered. It also uses the standard SpamBayes configuration files and database formats, so if you have an existing SpamBayes setup you should be able to use it right away.

Using lamson.spam

Lamson gives you a simple decorator to place on any state functions that should block spam. Typically you do not want spam filtering on your entire application, since that would prevent legitimate registrations and put too much burden on your system. It’s better to put spam filtering on the “insider” parts, and to have confirmation emails on “outsider” pieces.

Instead, what you want is to mark your “choke points” as filtering spam using lamson.spam.spam_filter, so that anyone who sends spam is put into a “spam black hole”.

Here’s a trivial example where the user is in the POSTING state, and you want everything to work like normal, but if they spam then you flip them into a SPAMMING state.

from lamson.routing import route
from lamson.spam import spam_filter

@route(".+")
def SPAMMING(message):
    # the spam black hole
    pass

@route("(anything)@(host)", anything=".+", host=".+")
@spam_filter("run/spamdb", "run/.hammierc", "run/spam", next_state=SPAMMING)
def POSTING(message, **kw):
    print "Ham message received."
    ... 

The line to look at is the spam_filter line, which tells Lamson to:

  1. Use the SpamBayes training database run/spamdb for detection.
  2. Use the SpamBayes run/.hammierc file for configuration (optional, and ignored if it is not there).
  3. Use run/spam as the dumping ground for anything classified as spam.
  4. Transition to next_state if the sender sends a spam message. This is optional, but very helpful.

With this, the spam_filter then wraps your state function, and every message is fed to SpamBayes. If SpamBayes says it’s spam then Lamson will dump it into your run/spam and transition to SPAMMING *without running your POSTING state*.
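The wrapping behavior described above can be sketched in a few lines of plain Python. This is a simplified stand-in and not Lamson’s actual implementation: the spam_gate class, the is_spam callable, and the plain list used as a queue are all hypothetical names, and no SpamBayes is involved. It only shows the idea of dumping the message, returning the next state, and skipping the wrapped state function.

```python
import functools

# Simplified stand-in for the wrapping behavior; NOT Lamson's real spam_filter.
# spam_gate, is_spam and the list-based queue are hypothetical.
class spam_gate(object):
    def __init__(self, is_spam, spam_queue, next_state=None):
        self.is_spam = is_spam          # callable deciding whether a message is spam
        self.spam_queue = spam_queue    # anything with append(); a list here
        self.next_state = next_state    # state to move the sender into on spam

    def __call__(self, fn):
        @functools.wraps(fn)
        def wrapper(message, **kw):
            if self.is_spam(message):
                # Keep the spam for later inspection and skip the state function.
                self.spam_queue.append(message)
                return self.next_state
            return fn(message, **kw)
        return wrapper


def SPAMMING(message, **kw):
    # the spam black hole: stay here
    return SPAMMING


spam_queue = []

@spam_gate(lambda m: "viagra" in m.lower(), spam_queue, next_state=SPAMMING)
def POSTING(message, **kw):
    # normal handling; stay in POSTING
    return POSTING
```

Calling POSTING("hello there") runs the state function as usual, while POSTING("cheap viagra") puts the message in spam_queue and hands back SPAMMING without touching the body of POSTING.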

Once you are in this new SPAMMING state (or any state you like) you can do whatever you want. You can leave them there, or you can have an external tool that lets you un-block someone. Pretty much any spam handling scheme you want is available.

Since your spam is placed into a queue you can inspect it later and check for any accidentally miscategorized mail, then use the SpamBayes tools to retrain for the misdetection.

Lamson only treats mail as spam when SpamBayes marks it as actual spam: it looks at the ‘X-Spambayes-Classification’ header and checks whether the value starts with ‘spam’. If the value is ‘unsure’ or ‘ham’, the message is let through.
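The header check described above can be sketched with nothing but the standard library. This is an illustration of the rule, not Lamson’s code: the function name looks_like_spam is hypothetical, and it assumes the message has already been run through SpamBayes so the header is present.

```python
from email.message import Message

def looks_like_spam(message):
    """True only when the SpamBayes classification header starts with 'spam'."""
    classification = message.get("X-Spambayes-Classification", "")
    return classification.startswith("spam")

def with_classification(value):
    """Build a bare message carrying the given classification (for the demo)."""
    msg = Message()
    msg["X-Spambayes-Classification"] = value
    return msg

spam_result = looks_like_spam(with_classification("spam"))
unsure_result = looks_like_spam(with_classification("unsure"))
ham_result = looks_like_spam(with_classification("ham"))
missing_result = looks_like_spam(Message())  # no header at all
```

Only the first message is flagged. In this sketch a message with no header at all is also let through.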

Effectiveness

I’ve been running a variant of this since the middle of May 2009 and it works great. The code I run is a custom version that fits the weirdness of my email setup but the principles are the same. I’m currently using the above spam filtering, some gray listing, and a few other tricks to block most of my incoming spam.

With all of these spam-blocking measures I’ve managed to cut my spam down to about 2-3 messages a day out of the 100-200 I receive. The majority of the “spam” that gets through is actually email that’s classified as “unsure”, which I then use to retrain SpamBayes to make it stronger.

However, that’s my personal server, so in the case of a Lamson application you’ll want to be careful that your spam blocking activities don’t prevent too much legitimate use.

Changing What “Spam” Means

You can also change how spam is determined by sub-classing lamson.spam.spam_filter and doing your own implementation of the spam method.
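As one illustration of a different definition of “spam”, the predicate below also blocks mail that SpamBayes is unsure about. It is a standalone sketch rather than Lamson code: strict_spam and BLOCKED are hypothetical names, it assumes the message already carries the X-Spambayes-Classification header, and wiring it into a spam_filter subclass is left out because the exact signature of the spam method is not shown here.

```python
from email.message import Message

# Hypothetical, stricter rule: block both 'spam' and 'unsure'.
BLOCKED = ("spam", "unsure")

def strict_spam(message, blocked=BLOCKED):
    """True when the classification header starts with any blocked label."""
    classification = message.get("X-Spambayes-Classification", "")
    # str.startswith accepts a tuple of prefixes
    return classification.startswith(blocked)

msg = Message()
msg["X-Spambayes-Classification"] = "unsure"
unsure_blocked = strict_spam(msg)
```

A rule like this blocks more aggressively, so it only makes sense where a false positive is cheap to recover from.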

Using SpamBayes

An important point about SpamBayes is that it comes with all the command line tools you need to configure and train your database using a corpus of spam you might have. All Lamson needs to do is read this database to determine whether a message is spam or not.

With mutt, I save the message to “=spam”, which places the spam in Mail/spam along with all of the others. Then I run this command:

sb_mboxtrain.py -s ~/Mail/spam -d run/spamdb

This goes through the spam mailbox, and any emails that SpamBayes has not already classified get used for training.

SpamBayes comes with other commands you can read about on their site (if you can find it).

Autotraining

Lamson doesn’t support “autotraining” directly, since it’s not clear in each situation what is obviously spam. In my personal setup I know that any email not for registered users is obviously spam, so I can autotrain those.

If you want to implement autotraining for a part of your application, then look at the API for lamson.spam.Filter and simply use it in the right state function.
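The hard part of autotraining is deciding what counts as “obviously spam”, so here is a minimal sketch of the rule from my setup: anything not addressed to a registered user. The names REGISTERED_USERS and should_autotrain_as_spam and the example addresses are hypothetical. The actual training call would go through lamson.spam.Filter, which is deliberately not shown here, so check its API before wiring this in.

```python
from email.utils import parseaddr

# Hypothetical list of known users; a real application would look these up.
REGISTERED_USERS = {"alice@example.com", "bob@example.com"}

def should_autotrain_as_spam(to_header, registered=REGISTERED_USERS):
    """True when the recipient is not a registered user (the 'obvious spam' rule)."""
    _, address = parseaddr(to_header)
    return address.lower() not in registered

known = should_autotrain_as_spam("Alice <Alice@Example.com>")
unknown = should_autotrain_as_spam("sales@example.com")
```

A state function could call this on the message’s To header and only train on the message when it returns True.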

Configuration

Finally, the above sample code is not the best way to configure the spam filter. It’s better to put the configuration in config/settings.py and simply reference it from there.

In your config/settings.py put this:

SPAM = {'db': 'run/spamdb', 'rc': 'run/spamrc', 'queue': 'run/spam'}

Then change your handler code to be this:

from config.settings import SPAM

@spam_filter(SPAM['db'], SPAM['rc'], SPAM['queue'], next_state=SPAMMING)
def START(message, **kw):
    # this is the better way to do your config
    pass

With that you can then change up the configuration as needed in your deployments without having to change your code.