[Mimedefang] Re: SPAM/HAM Trap

Wed May 23 16:24:59 EDT 2007

From: "Daniel Aquino" <mr.danielaquino at gmail.com>

> > * First and foremost, you should understand some issues related to email archiving:
> >
> > The privacy of your email clients - you should coordinate such actions with the company manager(s),
> > and I recommend also to inform all the users about it.
>   

> Well I'm not going to read the emails...  I just want to collect some
> detected spam/ham to train bayes.. I do have automatic training
> enabled but doesn't bayes need a kick start ?

Starting with an empty database and building it from scratch is good enough.
On a low-volume system it takes a few days to automatically learn 200 ham + 200 spam,
and on a high-volume system it should take only a few hours.
So you don't need a kick start, even on a new machine.

> > It is recommended to train bayes against the actual and current email traffic,
> > not against historic private or public corpus.
>   
> So does automatic learning work even before bayes has 200 emails ?

Automatic learning starts working as soon as you enable it.
Bayes scoring is held back until the database reaches the minimum, which by default is 200 ham + 200 spam.

> How can I verify that the bayes training is taking place ?

Use the following commands:

    man sa-learn
    sa-learn --dbpath /home/defang/.spamassassin --dump magic

(Please check my syntax for mistakes, and set dbpath to fit your system.)
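
To make the check above repeatable, the dump can be parsed by a small script. The following is only a rough sketch in Python: it assumes the usual layout of the "--dump magic" output, where the third column holds the value and the line ends in "non-token data: nspam" or "non-token data: nham". The sample text and its numbers are made up for illustration, and the helper names (parse_magic, bayes_is_scoring) are hypothetical, not part of SpamAssassin.

```python
import re

# Sample text in the layout that `sa-learn --dump magic` usually prints.
# The numbers are invented for illustration; they are not from a real system.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        152          0  non-token data: nspam
0.000          0        431          0  non-token data: nham
0.000          0      28764          0  non-token data: ntokens
"""

# Assumed layout: four whitespace-separated columns, value in the third,
# followed by "non-token data: <name>".
MAGIC_LINE = re.compile(r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+?)\s*$")


def parse_magic(dump_text):
    """Return a dict mapping each 'non-token data' name to its integer value."""
    values = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


def bayes_is_scoring(values, min_spam=200, min_ham=200):
    """True once both learned counts reach the minimum (200 + 200 by default)."""
    return values.get("nspam", 0) >= min_spam and values.get("nham", 0) >= min_ham


if __name__ == "__main__":
    counts = parse_magic(SAMPLE_DUMP)
    print("spam learned:", counts.get("nspam"), "ham learned:", counts.get("nham"))
    print("Bayes scoring active:", bayes_is_scoring(counts))
```

On a real system you would feed the function the actual command output instead of SAMPLE_DUMP. If nspam and nham grow between two runs, training is taking place.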

> > You can also send all emails to the same journal at localhost address,
> > then use a delivery filter (procmail, cyrus sieve, etc) to sort the messages into
> > different folders using the X-SpamScore header.
>   

> Wouldn't the multi user approach be easier ?

That is up to you. Choose whichever approach works best for your setup.
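
If you go with the single journal mailbox from the quoted suggestion, the sorting step could look roughly like the Python sketch below. It is not a replacement for procmail or sieve, only an illustration of the idea. It assumes that the X-SpamScore header value starts with a numeric score, and the threshold of 5.0 and the function name classify are my own assumptions; adjust them to whatever your filter actually writes.

```python
import re
from email import message_from_string

# Assumption: the header value begins with a number such as "12.4" or "-1.2",
# optionally followed by other text (stars, rule names, ...).
SCORE_PREFIX = re.compile(r"\s*(-?\d+(?:\.\d+)?)")


def classify(raw_message, threshold=5.0, header="X-SpamScore"):
    """Return 'spam', 'ham', or 'unknown' for one raw message."""
    msg = message_from_string(raw_message)
    value = msg.get(header)
    if value is None:
        return "unknown"          # the header was never added
    match = SCORE_PREFIX.match(value)
    if match is None:
        return "unknown"          # header present but not numeric
    return "spam" if float(match.group(1)) >= threshold else "ham"


if __name__ == "__main__":
    spam = "X-SpamScore: 12.4 (************)\nSubject: cheap pills\n\nbody\n"
    ham = "X-SpamScore: -1.2\nSubject: meeting\n\nbody\n"
    print(classify(spam), classify(ham))
```

Messages that come back as "unknown" are better left for a human to look at than filed into either folder.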

I think that you also have to understand one major point here:

Automatic bayes training works without any need to collect messages and 
take manual actions.
So if you have planned to collect the messages and run "sa-learn" with a 
script - what's the point?
It can be done "on the fly" as the message is scanned the first time!

Collecting the messages is only useful for other purposes, such as:
* Correcting auto learn mistakes, by manual (human) sorting of messages 
and then running sa-learn against the sorted corpus or selected messages 
that were false-positive or false-negative.
* Releasing blocked messages (for example false-positive).
* Other reasons for archiving mails unrelated to bayes training such as 
forensic security investigations, troubleshooting, company archiving 
policy, etc.
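
For the first item in the list above (correcting auto-learn mistakes), a manual relearn over hand-sorted folders can be sketched as follows. This Python snippet only builds the sa-learn command lines and prints them, so nothing is executed until you have reviewed them. The folder paths are hypothetical placeholders, and dbpath must be set to fit your system.

```python
import shlex


def relearn_commands(dbpath, spam_dir, ham_dir):
    """Build the sa-learn command lines for a hand-sorted corpus (dry run only)."""
    base = ["sa-learn", "--dbpath", dbpath]
    return [
        base + ["--spam", spam_dir],  # messages a human confirmed as spam
        base + ["--ham", ham_dir],    # messages a human confirmed as ham
    ]


if __name__ == "__main__":
    # Hypothetical folders holding the manually sorted false negatives/positives.
    commands = relearn_commands(
        "/home/defang/.spamassassin",
        "/var/tmp/corrections/spam",
        "/var/tmp/corrections/ham",
    )
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Run the printed commands as the user that owns the bayes database.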

Again - if you plan to use those messages for automatic bayes training, 
then don't.
Use bayes auto learn instead.

Yizhar Hurwitz
http://yizhar.mvps.org