Bogosity
I have been jawing around about cobbling a few pet technologies together to make an interesting/uninteresting filter for RSS news. The impetus is the whole “Web 2.0” thing, where user-rated content seems to be all the rage. The problem I see with user-rated content is that the ratings only have context and applicability for you if your opinions align with those of the other users.
This leads me to the problem of going to Digg and having to wade through postings, perhaps 10% of them, of the “OMG!!! Bush has a booger. LOL!!! Worst President EVER!!!” ilk, despite my general interest in the technology and “oddly enough” stories typically featured on that site (more so than Fark, etc.).
So, I had this idea: combine a caching RSS parser with a Bayesian spam filter, starting with no training corpus at all. Mark interesting stories as “not spam” and uninteresting stories as “spam,” and keep training the filter that way until it starts making predictions.
I originally thought to do this with Python (thanks to its somewhat holy union of easy memory management, an http client, and an XML parser). However, PHP features somewhat unholier capabilities on these vectors, and I plugged in MagpieRSS last night for a grab-and-go RSS dirty work agent.
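For the curious, the Magpie side really is grab-and-go. A minimal sketch (the feed URL is just an example, and the variable names are mine, not the prototype's):

```php
<?php
// MagpieRSS does the HTTP fetch, XML parsing, and caching in one call.
require_once 'rss_fetch.inc';

// Any RSS feed URL works here; Digg is just the running example.
$rss = fetch_rss('http://digg.com/rss/index.xml');

foreach ($rss->items as $item) {
    // Title plus description is the text we will hand to the classifier.
    $text = $item['title'] . "\n" . strip_tags($item['description']);
    echo $item['link'] . "\n";
}
?>
```

fetch_rss() returning a ready-made array of items is most of the dirty work right there.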
I used the venerable bogofilter, apt in both name and function, to perform the Bayesian classification. The thing is running as a local web application on joey.home.local, which is magically running here at home.
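Training it from PHP is a one-liner per story: pipe the item text to bogofilter with -n to register it as not-spam (interesting) or -s to register it as spam (uninteresting). A rough sketch, with a hypothetical function name:

```php
<?php
// Register one feed item with bogofilter as interesting (ham) or not (spam).
function train($text, $interesting)
{
    // Per the bogofilter manpage: -n registers non-spam, -s registers spam.
    $flag = $interesting ? '-n' : '-s';
    $proc = popen("bogofilter $flag", 'w');
    fwrite($proc, $text);  // bogofilter reads the message on stdin
    pclose($proc);
}
?>
```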
The prototype of the whole thing runs in about 20 lines of PHP. It’s about as inefficient as it gets: each RSS feed item forks a separate bogofilter instance from the shell (yeah, I know, the phrase “orders of magnitude slower” comes to mind). There’s also no foreseeable way to keep multiple databases, as in a separate database of Bayesian classifications for each user in a large group of users. That completely defeats the stated purpose of personalization (except for me alone), but this is just a mock-up built along the path of least resistance.
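The fork-per-item classification looks roughly like this; per the bogofilter manpage the exit status carries the verdict (0 for spam, 1 for not-spam, 2 for unsure), so there is no output to parse. A sketch under those assumptions, not the actual prototype:

```php
<?php
// Classify one item by forking a bogofilter instance and reading its exit status.
function classify($text)
{
    $spec = array(0 => array('pipe', 'r'));  // we only need to feed stdin
    $proc = proc_open('bogofilter', $spec, $pipes);
    fwrite($pipes[0], $text);
    fclose($pipes[0]);
    // Exit status: 0 = spam (uninteresting), 1 = ham (interesting), 2 = unsure.
    return proc_close($proc);
}
?>
```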
The verdict is that the darn thing refuses to classify new content until the database has been built up enough to be robust. These filters count on a large corpus of both spam and not-spam for training before being turned loose in the wild. Since I’m not using one (I can’t: not everything that isn’t spam is intrinsically interesting), it will take a while to know whether this is even a good idea.