spambayes

Spam Bayesian filtering

General Plugin info
Add Spam Bayesian filtering to your weblog to catch spam
Author: xiffy
Current Version: 1.1.0 (beta)
Download: SpamBayes version 1.1.0
older:
SpamBayes version 1.0.5
SpamBayes version 1.0.4
SpamBayes version 1.0.3
SpamBayes version 1.0.2
SpamBayes version 1.0.1
SpamBayes version 1.0
Code: not available; please download
Demo: none
Forum Thread: Plugin announced

This plugin adds Bayesian spam filtering to your weblog. It can be an effective tool against comment and trackback spam, but you need to understand the basic principles before you consider using it. Bayesian filtering is not the answer to every spam problem. The first problem you will encounter is the need for a trained filter, and only you can train your filter effectively. That’s why this plugin does not come with any pre-installed filters. I considered shipping one, but it’s impossible to determine whether you would consider certain words spam or not.

Spam Bayesian: the principles

So let’s see what Bayesian spam filtering can do for you. Assume that you have trained your filter; the plugin provides some tools to feed it both ‘ham’ and ‘spam’ 1).

What happens when Spam Bayes receives a message for checking? It determines two probabilities. The first is whether the message could be considered ham, based on word frequencies. The second is whether the message could be considered spam, again based on word frequencies. That’s why you need to train SpamBayes with both ham and spam messages: the word frequencies can only be determined after you have taught it enough words to decide which category a message most likely belongs to.
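The idea of comparing a ham probability with a spam probability can be sketched in a few lines of Python. This is only an illustration of the general technique (a naive Bayes classifier with add-one smoothing), not the plugin's actual code; the class name TinyBayes and all method names are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBayes:
    """Hypothetical toy filter; NOT the plugin's real implementation."""

    def __init__(self):
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.doc_counts = {"ham": 0, "spam": 0}

    def train(self, text, category):
        """Add the words of one message to the given category."""
        self.word_counts[category].update(tokenize(text))
        self.doc_counts[category] += 1

    def log_score(self, text, category):
        """Log-probability of the message under one category."""
        counts = self.word_counts[category]
        total_words = sum(counts.values())
        vocabulary = set(self.word_counts["ham"]) | set(self.word_counts["spam"])
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[category] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so an unseen word never yields log(0).
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        return score

    def classify(self, text):
        """Return the more likely category for the message."""
        if min(self.doc_counts.values()) == 0:
            raise ValueError("train with both ham and spam first")
        ham = self.log_score(text, "ham")
        spam = self.log_score(text, "spam")
        return "spam" if spam > ham else "ham"


bayes = TinyBayes()
bayes.train("thanks for the great post about gardening", "ham")
bayes.train("cheap pills buy cheap pills online", "spam")
print(bayes.classify("buy cheap pills"))
print(bayes.classify("great post thanks"))
```

Note that classify refuses to answer until both categories have been trained, which mirrors the point made above: without both ham and spam examples there is nothing to compare.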

Spam Bayes training

You should be aware by now of the importance of training. After installation the plugin menu shows an option to train all available comments as ham: ‘Train HAM (not spam) with all comments’. You can use this option to feed the filter all your available comments as ham. Make sure you don’t have any spam comments on your site, otherwise you will feed the filter the wrong messages as ham.

You can undo this action with the option ‘Remove all comments from the HAM (not spam)’.

Once you’ve trained your filter with your comments you can remove the menu item from the list so you won’t accidentally click this option again. To do this, choose ‘Spam Bayes Options’ from the menu and set ‘Show SpamBayes train all ham in menu?’ to no.

At this moment there are two ways to train your filter with spam messages.

  1. The most important one is Spam Bayes training in the menu. You can put any text you like in the textarea and tell the filter what type of message you are feeding it.
  2. The second option is only available if you have logging enabled. The overview of logged events shows two links beside each logged event: one for training ham and one for training spam.

If you accidentally choose the wrong category for a training session you can undo this. Choose ‘Spam Bayes untraining’ from the menu. You’ll see all trained messages, except the trained comments, which are invisible in the overview (you can untrain those as explained earlier). When you want a document untrained, simply click the ‘untrain’ link: the document is deleted from the list and all the word counts are updated to reflect the untraining of the document.
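Why untraining needs the stored document can be shown with a small Python sketch. It assumes that each trained document is kept next to the word counts, so that its words can be subtracted again later. The class TrainingStore and its methods are hypothetical names for illustration and do not reflect the plugin's real database layout.

```python
from collections import Counter


class TrainingStore:
    """Hypothetical store of trained documents plus per-category word counts."""

    def __init__(self):
        self.documents = {}  # doc_id -> (category, text)
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self._next_id = 1

    def train(self, text, category):
        """Remember the document and add its words to the category counts."""
        doc_id = self._next_id
        self._next_id += 1
        self.documents[doc_id] = (category, text)
        self.word_counts[category].update(text.lower().split())
        return doc_id

    def untrain(self, doc_id):
        """Forget the document and subtract its words from the counts."""
        category, text = self.documents.pop(doc_id)
        self.word_counts[category].subtract(text.lower().split())
        # Adding an empty Counter drops entries whose count fell to zero.
        self.word_counts[category] += Counter()


store = TrainingStore()
wrong = store.train("buy cheap pills", "ham")  # oops, wrong category
store.untrain(wrong)                           # undo the mistake
store.train("buy cheap pills", "spam")         # train it correctly
print(dict(store.word_counts["ham"]), dict(store.word_counts["spam"]))
```

Because untrain looks up the original text by its id, throwing away the stored documents would save space but make this kind of undo impossible.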

Spam Bayes logging

When enabled (defaults to no) all events that are captured by the plugin are logged. This logging is done in a separate database table and contains all the information that the plugin receives. You can view the events from the administration interface. The view is limited to 10 events per page. Each logged event has two handy links to train either spam or ham. That way you can quickly train the filter if a message is considered spam when it isn’t or when a message is considered ham and it isn’t.

Remember that although Spam Bayes might capture spam events, they are not automatically added to the filter. So even if Spam Bayes categorised a message correctly, it is still a good idea to feed the filter some of the correct guesses as well.

Spam Bayes Options

There are a couple of options available for this plugin.

  • Score at which point we should consider a text as spam? This sets the score a message needs to reach before it is considered spam.
  • To which URL should spammers be redirected? If a message is posted and considered spam, Spam Bayes will redirect the spammer to this URL. Defaults to http://127.0.0.1/ You could redirect them to http://yoursite.com/explain.htm and explain on that page what happened.
  • Which words should not be taken into consideration? This is a list of common words and words that would lead to false positives. On my site for example I had a lot of spamming for .org domains. This led the filter to believe that all .org domains should be considered spam. Adding ‘org’ to the ignore list resolved this issue.
  • Show SpamBayes train all ham in menu? If this option is enabled (default) there are two extra menu items available for Spam Bayes, giving you the option to submit all available comments on your weblog as HAM messages. When set to no, the menu options won’t be visible.
  • Show SpamBayes in quickmenu? If set to yes, Spam Bayes will show up in the left menu of your Nucleus Administration, giving you quick and easy access to the Spam Bayes administration. Default: no
  • Use SpamBayes action logging? If set to yes, Spam Bayes will log all its actions to the database. This is a very convenient option but it can be harmful. When a spam run hits your site, a lot of MySQL queries are done for Spam Bayes, and logging intensifies this MySQL usage. The second downside of this option is space. The complete message is stored when logging is done, so a big spam run could easily add 10MB of data to your database. (You can of course clear the log from the administration interface.) Defaults to no.
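How the score threshold, the redirect URL and the ignore list could work together is sketched below in Python. Everything here is an assumption made for illustration: the averaging score, the threshold value 0.9, the ignore list contents and the function names are invented and are not taken from the plugin.

```python
import re

# Hypothetical option values, standing in for the plugin options above.
IGNORED_WORDS = {"the", "and", "org"}
SPAM_THRESHOLD = 0.9
REDIRECT_URL = "http://127.0.0.1/"


def relevant_tokens(text, ignored=IGNORED_WORDS):
    """Tokenize the text and drop every word on the ignore list."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in ignored]


def spam_score(tokens, spam_probability, default=0.5):
    """Toy score: average per-word spam probability, 0.5 for unknown words."""
    if not tokens:
        return default
    return sum(spam_probability.get(t, default) for t in tokens) / len(tokens)


def handle_message(text, spam_probability, threshold=SPAM_THRESHOLD,
                   ignored=IGNORED_WORDS, redirect_url=REDIRECT_URL):
    """Redirect the poster when the score reaches the threshold."""
    score = spam_score(relevant_tokens(text, ignored), spam_probability)
    if score >= threshold:
        return ("redirect", redirect_url)
    return ("accept", None)


# Made-up per-word spam probabilities, as a trained filter might hold them.
probabilities = {"org": 0.99, "pills": 0.99, "gardening": 0.1, "tips": 0.1}
print(relevant_tokens("Gardening tips at example.org"))
print(handle_message("Gardening tips at example.org", probabilities))
print(handle_message("pills pills", probabilities))
```

With ‘org’ on the ignore list, the high spam probability learned for that word no longer influences messages that merely link to a .org domain.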

Things undone

My current wishlist for the plugin. Maybe I’ll implement these in a next version, maybe I’ll never get round to it. You can leave a note if you’d like to see one of these (or something I haven’t thought of) implemented.

  • Won’t do for version 1.0: maybe a ‘clear reference documents’ option. This would make it impossible to untrain the filter, but it would save some database space.

Version History

06 sept 2006 : Initial Release (1.0)
10 sept 2006 : Small update (1.0.1)

  • NaN problem solved. When huge spam messages with highly ranked spam words came round it was possible to get a NaN count. These were luckily identified as spam. Now calculation stops when Infinity looms ...
  • Now showing ham spam totals when viewing the spam log
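How a NaN can appear, and how stopping early avoids it, can be illustrated with a small Python sketch. It assumes the score is built by multiplying per-word odds, which is a common approach but not necessarily what the plugin does; the function names and the cap value are hypothetical.

```python
import math


def combined_odds(word_probs, cap=None):
    """Multiply per-word spam odds p / (1 - p), with each p strictly between 0 and 1.

    If cap is given, stop as soon as the running product exceeds it.
    """
    odds = 1.0
    for p in word_probs:
        odds *= p / (1.0 - p)
        if cap is not None and odds > cap:
            break  # the verdict is already clear; stop before overflow
    return odds


def spam_score(word_probs, cap=None):
    """Turn the combined odds back into a score between 0 and 1."""
    odds = combined_odds(word_probs, cap)
    return odds / (1.0 + odds)


# A huge message made of 300 words that each look 99.9% spammy.
huge_message = [0.999] * 300

unguarded = spam_score(huge_message)          # product overflows to infinity
guarded = spam_score(huge_message, cap=1e100)

print(math.isnan(unguarded))  # infinity divided by infinity is NaN
print(guarded)
```

Without the cap the product overflows to infinity and the final division yields NaN. With the cap the loop exits while the number is still finite, and the message still gets a score very close to 1.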

15 sept 2006 : Small update (1.0.2)

  • All new functionality concerns the log facility of SpamBayes, so the upgrade is only relevant if you have logging enabled. There are now two types of filter that can be applied to your view. Besides the known ‘ham’ and ‘spam’ filter, the event type is added to the form so you can select only trackbacks, comments, referrers, mailtoafriends etc. The number of types depends on the number of different plugins that actually call SpamBayes. So if you have trackback installed, it will be added as a log type to the list. The same goes for mailtoafriend etc. Only types that have one or more logged events will show up in the list.
  • Deleting logs is now reduced to two buttons: clear all, or clear the currently filtered logs. The latter clears, for example, all trackback / referrer / comment spam from the logs while keeping all the other logs for further investigation.
  • Train / Untrain links take you back to the current view instead of to page 1 with all filters disabled.

All this functionality has been added for my own benefit. I know I get a lot of spam (10,000 logged events on a weekly basis) and this way I can quickly scan all logged events to see whether there are any false positives in one type or the other. The default log screen became unusable with over 200 events or so.

19 sept 2006 : Small update (1.0.3)

  • Logging was enabled by default and the option did not do anything. This bug was spotted by Verbal jam
  • Added the option to train all untrained comments. This solves both the timeout problems you can have with the train all option, as well as the need to update your filter with comments added to your site after you first trained your filter.

26 sept 2006 : Bugfix (1.0.4)

  • Fixed problem of getting plugin options in PHP version 4 (thank you pepiino)

10 oct 2006 : Small update (1.0.5)

  • Updating the probabilities is now done automatically. It does not cause as large a CPU and DB hit as expected, and users tend to forget to click the option in the menu.

07 jan 2007 : Huge logging overhaul (1.1)

  • Added explain function. This will show the individual word scores of a logged event.
  • Added the option to publish a false positive to your weblog and item. The timestamp will be the moment the comment was logged by SpamBayes.
  • Amount per page added as an extra field on the log page to enhance browsing.
  • Added batch events: Spam, Ham or delete for selected log items.

Plugin review

NP_SpamBayes version 1.1.0 works with Nucleus CMS 3.31 - 2007-10-29 admun

1) Ham is considered the opposite of spam in the spam-fighting world. So ham is what you would like to see as a comment or trackback to your items. Spam, on the other hand, is unsolicited (bulk) messages posted on your weblog as a comment or trackback.
 
spambayes.txt · Last modified: 2007/10/29 16:22 by admun