Bayesian filter for blog comments

I don’t get much comment spam myself right now (maybe a message a week or so), but the problem is definitely getting worse.

For Movable Type installations, there are several solutions available, such as an option to provide a “delete this comment” link with every “new comment” email, and a combined URL blocker/comment hider technique. Also, some people have proposed collaborative blacklists, or collaborative authentication for comment posters.

I’m surprised that no-one seems to have suggested Bayesian filtering for comments, though. I get about 15-20 spam messages via email every day, but the SpamBayes plugin for Outlook routes almost all of them straight into a “Spam” folder. I never see them in my inbox. Maybe one or two messages in a hundred make it through the filter, and I haven’t had any false positives for ages. It doesn’t involve maintaining blacklists, and it’s a lot less effort than deleting every single junk message.

In Movable Type, you could have a “bayesfilter” property on the MTComments template tag: <MTComments bayesfilter="1">. All comments would have to pass through the filter, and only those that were not spam would make it on to the page.

You’d need some additional mechanism to “train” the system, and somewhere to put the statistical knowledge base the filter uses to tell spam from genuine comments. Finally, you’d need a way of correcting the system after the initial training, so that any spam that does make it through can be deleted with prejudice, and so that false positives can be corrected.
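To make the moving parts concrete, here is a minimal sketch of the three pieces described above: training, a place to keep the statistical knowledge base, and correction after the fact. It is written in Python rather than as a Movable Type plugin, and everything in it (the CommentFilter class, the JSON file, the 0.9 threshold) is my own assumption for illustration. It is a simplified naive Bayes classifier, not the algorithm SpamBayes actually uses.

```python
import json
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class CommentFilter:
    """Hypothetical, simplified naive Bayes comment filter (illustration only)."""

    def __init__(self):
        # Number of trained comments containing each word, per label.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        # Number of trained comments per label.
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one comment as 'spam' or 'ham'.

        Correction uses the same call: when a comment is misjudged,
        train it again under the right label.
        """
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Estimate the probability that a comment is spam."""
        # Prior log-odds, with add-one smoothing so an empty filter works.
        log_odds = math.log(
            (self.doc_counts["spam"] + 1) / (self.doc_counts["ham"] + 1)
        )
        for word in tokenize(text):
            # Smoothed fraction of spam/ham comments containing this word.
            p_spam = (self.word_counts["spam"][word] + 1) / (self.doc_counts["spam"] + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (self.doc_counts["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp before converting to a probability to avoid overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def is_spam(self, text, threshold=0.9):
        """Hide a comment only when the filter is fairly sure (assumed cutoff)."""
        return self.spam_probability(text) >= threshold

    def save(self, path):
        """Write the knowledge base to a JSON file."""
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"words": self.word_counts, "docs": self.doc_counts}, fh)

    @classmethod
    def load(cls, path):
        """Read a knowledge base previously written by save()."""
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        obj = cls()
        obj.word_counts = {label: Counter(c) for label, c in data["words"].items()}
        obj.doc_counts = data["docs"]
        return obj


# Initial training on a few hand-labelled comments (made-up examples).
comment_filter = CommentFilter()
comment_filter.train("buy cheap pills now", "spam")
comment_filter.train("cheap zip code database, buy today", "spam")
comment_filter.train("great post, I agree with your point", "ham")
comment_filter.train("interesting idea about movable type plugins", "ham")
```

In a real setup, the template tag would call something like is_spam() for each comment before rendering it, and the “this is spam” and “this is not spam” links would call train() with the right label and then save() the updated knowledge base.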

This would be a nice anti-spam system for comments. It would involve a Movable Type plugin, and some hacking on the Movable Type application itself. Unfortunately I don’t have time to do this right now, and even if I did have time, I’ve sworn off perl. (Did you know that “perl” is an anagram of “pain”?) But I wonder if the Lazyweb could do it for me, or if the nice people at Six Apart would be so kind as to include this feature in MT Pro?

2 Replies to “Bayesian filter for blog comments”

  1. Have you considered banning any comment with a URL hotlink in it? Or something like that? Most (though obviously not all) spam has one, and I would have thought most messages from people commenting on your posts don’t need to.

  2. Some of the comment spam has URLs in the body, but some spammers seem to be satisfied with putting their URL in the “home page” field. The body of the message will typically read something like “I like this blog very much”, or “I agree completely. I’ve written about the same thing on my own web site.” They could almost be normal comments, but the fact that the URLs go to porn sites or zip code database resellers makes me doubt it.

    I could withdraw the “home page” field, but I’ve found it useful and fun to be able to follow links back to people who have commented here. Likewise, when I read comments on other blogs, I like to see who is posting there. It’s part of the blogging “experience,” and I’d be reluctant to lose it.

    Then again, I may change my mind if the spamming here gets worse… 🙂
