All Back and Moved In

7/11/2007

It’s great to be married! It’s kind of hard to re-start a blog with a typical post after a complete change in circumstances, but the show must go on.

Despite my continuous “file cabinet accessories” (yes, I didn’t know they existed until recently either) problems, server stuff is going well. Right now the battle is over spam filtering, which is a tough little nut to crack in many cases.

SpamAssassin (SA) is not as good as I thought it was… especially with the setup I’m running (essentially a command-line mail client that runs every five minutes and forwards mail to an aggregated local mailbox). I believe this is because spam filters like to look at where mail is coming from. Since the most recent hop for every message is my own machine, SA tends to think the best of things.

So, like everyone else who is not greylisting (somewhat evil) or collaboratively filtering (fairly ineffective in my experience), I went Bayesian. I trained with a few thousand of Kristin’s and my archived legitimate mails, augmenting that dataset with a few thousand more from a recent, publicly available spam repository.

It was a total disaster.

The most obvious pharmacy mails were getting passed with a very big thumbs-up from the filter. So, I dumped the wordlist and retrained with about 10-20 mails from Kristin’s inbox and only the spam that has arrived since. Difficult spam is getting a 0.5000 score (probably best expressed as “perfectly unsure”), and somewhat repetitive stuff (even with some random noise added by the bots) is already getting pared out. I would stress that this is a very small training dataset, and the accuracy is already fairly impressive at this point.
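For the curious, here is a rough sketch of why a Bayesian filter lands on exactly 0.5 when it has nothing to go on. This is a toy, not the filter I am actually running: the class name TinyBayes, the whitespace tokenizer, and the 0.01/0.99 clamps are all my own assumptions for illustration. Each word gets a spam probability from the training counts, words never seen before get a neutral 0.5, and the per-word probabilities are combined into one score.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy Bayesian spam scorer (hypothetical; for illustration only)."""

    def __init__(self):
        self.spam_tokens = Counter()  # word -> number of spam mails containing it
        self.ham_tokens = Counter()   # word -> number of legitimate mails containing it
        self.spam_count = 0
        self.ham_count = 0

    def train(self, text, is_spam):
        # Count each distinct word once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_count += 1

    def token_prob(self, token):
        s = self.spam_tokens[token]
        h = self.ham_tokens[token]
        if s + h == 0:
            return 0.5  # never seen: no evidence either way
        spam_rate = s / max(self.spam_count, 1)
        ham_rate = h / max(self.ham_count, 1)
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp so a single word can never be absolute proof.
        return min(max(p, 0.01), 0.99)

    def score(self, text):
        tokens = set(text.lower().split())
        log_spam = sum(math.log(self.token_prob(t)) for t in tokens)
        log_ham = sum(math.log(1 - self.token_prob(t)) for t in tokens)
        # Equivalent to prod(p) / (prod(p) + prod(1 - p)), done in log space.
        diff = max(min(log_ham - log_spam, 700.0), -700.0)  # avoid exp() overflow
        return 1 / (1 + math.exp(diff))


f = TinyBayes()
f.train("cheap pills pharmacy online", is_spam=True)
f.train("lunch on friday with kristin", is_spam=False)
print(f.score("cheap pharmacy"))   # close to 1
print(f.score("lunch friday"))     # close to 0
print(f.score("zzz qqq"))          # only unseen words: perfectly unsure
```

A mail made up entirely of words the filter has never seen contributes nothing to either side, so the combined score sits dead center. That is the “perfectly unsure” case, and it also shows why a small wordlist leaves a lot of mail in the middle until more training data arrives.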

I wrote a custom shell script that, when invoked for a certain user name, will empty the user’s “Junk” folder into the trash can after training the spam filter. This could be run on a scheduled basis, but for now I prefer to invoke it by hand.
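The real thing is a shell script tied to my particular setup, so I won’t paste it, but the idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name train_and_empty_junk is made up, the folders are assumed to hold one message per file, and train_spam stands in for whatever call hands a message to your filter as spam.

```python
import shutil
from pathlib import Path


def train_and_empty_junk(junk_dir, trash_dir, train_spam):
    """Train the filter on every message in junk_dir, then move it to trash_dir.

    train_spam is a hypothetical callable that takes the raw message bytes.
    Returns the number of messages processed.
    """
    junk = Path(junk_dir)
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)

    moved = 0
    for msg in sorted(p for p in junk.iterdir() if p.is_file()):
        # Train first: if training raises, the mail stays in Junk for next time.
        train_spam(msg.read_bytes())
        shutil.move(str(msg), str(trash / msg.name))
        moved += 1
    return moved
```

Training before moving is the one design choice that matters: if the filter chokes on a message, nothing is lost and the next run picks it up again.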

I’m very pleased with all of this. More updates to follow as the dataset increases.