Bayesian filter for blog comments

I don’t get much comment spam myself right now (maybe a message a week or so), but the problem is definitely getting worse.

For Movable Type installations, there are several solutions available, such as an option to provide a “delete this comment” link with every “new comment” email, and a combined URL blocker/comment hider technique. Also, some people have proposed collaborative blacklists, or collaborative authentication for comment posters.

I’m surprised that no-one seems to have suggested Bayesian filtering for comments, though. I get about 15-20 spam messages via email every day, but the SpamBayes plugin for Outlook routes almost all of them straight into a “Spam” folder. I never see them in my inbox. Maybe one or two messages in a hundred make it through the filter, and I haven’t had any false positives for ages. It doesn’t involve maintaining blacklists, and it’s a lot less effort than deleting every single junk message.

In Movable Type, you could have a “bayesfilter” property on the MTComments template tag: <MTComments bayesfilter="1">. All comments would have to pass through the filter, and only those that were not spam would make it on to the page.

You’d need some additional mechanism to “train” the system, and somewhere to put the statistical knowledge base the filter uses to tell spam from genuine comments. Finally, you’d need a way of correcting the system after the initial training, so that any spam that does make it through can be deleted with prejudice, and so that false positives can be corrected.
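To make those three pieces concrete (training, a knowledge base, and later correction), here is a minimal sketch in Python. It is a plain naive Bayes classifier with Laplace smoothing, not the algorithm SpamBayes actually uses, and it has nothing to do with Movable Type’s plugin API. The class name CommentFilter, its method names, and the “spam”/“ham” labels are all made up for illustration.

```python
import math
import re
from collections import Counter


class CommentFilter:
    """Toy naive Bayes comment classifier (illustrative sketch only)."""

    def __init__(self):
        # The "statistical knowledge base": word counts and comment counts per class.
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        # Initial training: add one labelled comment to the knowledge base.
        self.words[label].update(self.tokenize(text))
        self.docs[label] += 1

    def correct(self, text, wrong_label, right_label):
        # Later correction: undo a bad classification, then retrain correctly.
        self.words[wrong_label].subtract(self.tokenize(text))
        self.docs[wrong_label] -= 1
        self.train(text, right_label)

    def spam_probability(self, text):
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed prior and per-word likelihoods, summed in log space.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            total_words = sum(self.words[label].values())
            for token in self.tokenize(text):
                count = self.words[label][token]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam); clamp to avoid overflow in exp().
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

A template tag like the one suggested above would then only render comments whose spam_probability falls below some threshold (0.5, say), and the “delete with prejudice” button would call correct() on whatever slipped through.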

This would be a nice anti-spam comments system. It would involve a Movable Type plugin, and some hacking to the Movable Type application itself. Unfortunately I don’t have time to do this right now, and even if I did have time, I’ve sworn off perl. (Did you know that “perl” is an anagram of “pain”?) But I wonder if the Lazyweb could do it for me, or if the nice people at Six Apart would be so kind as to include this feature in MT Pro?

Sidelined Protagonist Syndrome

Sidelined Protagonist Syndrome (SPS) is what happens when a writer gets to the end of a story, finds that the Protagonist doesn’t have the means to resolve (or even influence) the final conflict themselves, and therefore pulls in an Outside Agency to do it for them. The Protagonist may skulk around the periphery of the action and deliver a running commentary on events, or they may get called in for the mopping-up scene, where they find out how the Outside Agency put the pieces together and finally came through to pull the Protagonist’s nuts out of the fire.

Key questions to ask to find out if a story is suffering from SPS:

  • If the Outside Agency had not stepped in, would the final conflict have turned out the same way, or would the outcome have been completely different?
  • Did the Protagonist issue direct instructions for the Outside Agency to act, or did the Agency come in of their own accord? (Having the Outside Agency ignore dire warnings from the Protagonist, only to come through in the end, may offset the worst effects of SPS.)
  • Once the Outside Agency stepped in, did they need the help of the Protagonist in order to emerge victorious, or was the Protagonist just another concerned onlooker (aka JAFO)?

The worst case of SPS I’ve come across recently was Vitals by Greg Bear. Nasty. If you can think of any, please zap ’em in the comment section.

One for the laydeez

Alex still cries when I drop him off at nursery in the mornings. Or at least, he does most of the time. He was okay-ish this Monday–no all-out despair and wailing–but on Tuesday I saw a side of him that doesn’t usually come out when I leave him there.

He was looking exceptionally cool that morning: blue long-sleeved shirt, tan corduroys, new brown boots, denim jacket, and orange sunglasses. After I took off his jacket and shades, he poked his head through the doorway into the toddler room and scoped out the joint. Then he walked back to me and put his hands up in the universal “cuddle” gesture. So far, so normal.

I picked him up, and together we walked into the room. He rested his chin on my shoulder, but he didn’t seem as upset as he often does. One of the nursery assistants came up to us, said hello, and asked Alex if he wanted a cuddle from her. He nodded quietly, and clambered over into her arms.

I was just starting to wave goodbye to him, when I noticed two young girls, both a little older than Alex, maybe 3 or 3 and a half, walking over. They waved at him and said “Alex! Alex! Hello Alex!”

He looked down at them, and then back at me. His early morning frown slowly turned into a sly, cheeky grin, as if to say, “Look what I’ve found.”

There he was, my two-year-old son, already at the centre of attention of gangs of adoring older women. He may be Super Tantrum Toddler Man at home, but he can turn the charm all the way up to 11 when he wants to.

What a boy.

Old music

No way is it ten years since Sting released his album Ten Summoner’s Tales. Nuh-uh. I mean, 1993? Come on. That’s ten years ago. I think someone has been in to our house, and doctored the sleeves on our CDs. Whoever it was is very good at it. I can hardly tell the difference between the rest of the text on the CD sleeve, and the fictitious release date they inserted. Still, it creeps me out that someone broke into our house just to change that one date…

…wait a minute…

They’ve been at my Sarah McLachlan CDs as well! No way was Surfacing released back in 1997. Nuh-uh. I don’t believe it.

Dutch car license plates and traffic control

Dutch license plate: 01-LX-RP

The picture above shows a Dutch car license plate. Notice anything strange about it? Something odd about the letters “R” and “P”? The strokes of the letters are not completely joined up.

I’ve been trying to dig up some information about this typographic oddity, but I haven’t found anything concrete yet. My best guess is that styling these two letters this way makes computerised license plate identification easier. I’m no expert on optical character recognition, but I’m sure that every little visual cue that distinguishes one character from another helps to improve accuracy and processing speed. License plates must still remain easily readable by humans, so small adjustments, like the gaps in the letters “R” and “P”, are probably a good compromise.

If you’re just taking a “wet film” image, or a digital snapshot of a car and its plates for later processing, then the speed and accuracy of identification are important, but not critical. If the computer system cataloguing the photos has trouble reading a given plate, it can flag an exception, and get a human being to check its work.

However, the Dutch police and Ministry of Transport are experimenting with traffic control systems where speed and accuracy of identification are critical. On the A-2 motorway between Amsterdam and Utrecht, they no longer just measure a car’s speed with a simple radar or laser speed camera. They use a connected network of cameras and detectors to identify cars at multiple places, so that the system is aware of the car’s average speed over that stretch of road.

It’s no longer any good just slowing down if you know a speed camera is up ahead, and then speeding up immediately afterwards. Because the system measures your average speed, you have to keep your speed down over the whole stretch of road.
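The arithmetic behind this is simple enough to sketch in Python. Everything in the example is assumed for the sake of illustration: the function name, the 10 km section length, and the 100 km/h limit are not figures from the real A-2 system.

```python
def average_speed_kmh(distance_km, entry_time_s, exit_time_s):
    """Average speed between two camera sightings of the same plate."""
    elapsed_s = exit_time_s - entry_time_s
    if elapsed_s <= 0:
        raise ValueError("exit sighting must come after entry sighting")
    return distance_km * 3600 / elapsed_s


# Hypothetical numbers: a 10 km monitored section with a 100 km/h limit.
SECTION_KM = 10
LIMIT_KMH = 100

# A car seen at the entry camera at t=0 s and at the exit camera at t=300 s.
speed = average_speed_kmh(SECTION_KM, 0, 300)
print(speed > LIMIT_KMH)  # this car is flagged, however slowly it passed each camera
```

What the car was doing at the moment it passed either camera never enters into the calculation. Only the time between the two sightings counts.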

Because you now need two cameras to monitor a stretch of road (one at the start, one at the end), you will be taking twice the number of images, and processing twice the number of license plates. If the accuracy of recognition remains the same as before, then you will also be generating twice as many exceptions, which requires twice as much human input to correct.

Also, because you have to correlate multiple data points, you have to identify the license plate correctly at both the entry and exit point in order to measure the car’s average speed.

Looking at this from the point of view of a motorist, say you have a 90% chance of being accurately identified at each speed camera. If you have to be identified by both cameras, then you have two chances of slipping through that 10% gap. The overall probability of getting caught therefore comes down to just 81%. (Note: 90% is just a number pulled out of thin air to make the point. The real-life identification rate will be different.)

Looking at it from the point of view of the system designers, if you want to maintain a 90% accuracy rate over the two cameras combined, then you have to increase the accuracy of each individual camera to about 95%. So accuracy becomes a hot issue.
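Both figures come out of the same bit of arithmetic, sketched here in Python. It assumes the two cameras succeed or fail independently of each other, which may well not hold in practice (a muddy plate is muddy at both cameras). The function names are made up for this example.

```python
def combined_rate(per_camera, cameras=2):
    """Chance of a correct read at every camera, assuming independent reads."""
    return per_camera ** cameras


def required_per_camera(target, cameras=2):
    """Per-camera accuracy needed to hit a combined target rate."""
    return target ** (1 / cameras)


print(f"{combined_rate(0.90):.2%}")        # 81.00%
print(f"{required_per_camera(0.90):.2%}")  # 94.87%
```

So the motorist’s 81% is just 0.9 × 0.9, and the designers’ “about 95%” is the square root of 0.9.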

Speed of processing also becomes much more important, because you now have to identify the license plate of every car that passes the camera, not just the ones radar tells you are breaking the speed limit. The number of license plates that need to be processed takes an enormous leap upwards….

…which brings us back to the need for accuracy. Because the more identifications you make, the more error reports will be generated, and the more manpower will be needed to verify them. If the system is not accurate enough, it either becomes a manpower nightmare, or you end up with enough violators slipping through the net that it’s no more effective than single-point speed cameras.

Does it work? Oh yes. Here are some statistics, from the Dutch Institute for Traffic Care (ITC):

  • 99.7% accuracy over the course of 24/7 operation
  • Number of speeding violations dropped by 90%, from 6% of traffic to 0.6%
  • Average speed on the monitored stretches of road has dropped from 72mph (115km/h) to 66mph (105km/h)

Accident rates have dropped, and congestion has decreased.

The gaps in the letters “R” and “P” on Dutch number plates are therefore just one element in a very interesting and highly advanced transport strategy.

BT Midband: Just like ordinary dial-up, only without the good bits (part 3)

Well, after all the initial hassles of getting my parents hooked up to BT’s Midband/ISDN service (see parts 1 and 2 of this story), it’s finally working. And guess what? Once it’s up and running, it’s actually pretty good.

When I last left my parents, they had a freshly installed ISDN box with all the relevant sockets and wires…but no service. The main problems were that BT Midband doesn’t support Windows 2000 server through its USB interface, and that there was too much line noise on the ISDN line for a non-Win2K machine to connect at all.

Since that last time, they have had a BT engineer out who sorted out the line noise. On Wednesday I went up to visit them again, armed with an ISDN Terminal Adapter (TA). A TA is the only way you will get Midband to work with Windows 2000 Server. I bought a bog-standard BT Speedway PCI card. I could have got a non-branded card for less money, but after all of the earlier problems, I wanted something that would give me the least chance of being incompatible.

The card was fairly easy to install in the server. The installation process tries to get you to install lots of voice, fax, voicemail, and data transfer widgets, but all I really needed were the device drivers to make it act like a modem. Once I had that going, I set up the dial-up networking connection, crossed my fingers, and….

…it worked! First time! Yay!
