[Mimedefang] summary of site-wide bayes with SA 2.5 and MD?

Thu Feb 27 12:30:00 EST 2003

> > From then on it should use autolearn to build the bayes db.  (SA score
> > of -2 or less = ham, SA score of 15 or more = spam)

> What is the theoretical basis to justify auto-learning?  I do not think
> it makes sense.  SA already "correctly" categorizes the mail, so there's
> no point in modifying the Bayes statistics.  And on the odd chance
> that an outlying e-mail is accidentally misclassified, you pollute
> your statistical pool, and make SA more likely to misclassify such
> e-mail in the future.

> The whole point of learning is that you teach the discriminator
> the "absolute truth" as decided by a human being.  Auto-learning,
> in my opinion, violates that important principle.

Is this opinion based on having thought long and hard about it, perhaps having built a test classifier and read the material on the SpamBayes website?

(To paraphrase some deep discussions there: I think their point is that if you only train on *wrong* classifications, it takes the system a long time to improve. It's well worth a read.)

However, I agree that a lot of this stuff is unintuitive, and I would agree that auto-learning is unattractive.  There is also some evidence that over-training hurts accuracy (as is common in many statistical systems).  Still, hang around the SpamBayes list for some pretty good discussion backed up with trials on real data.
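
For reference, the threshold rule quoted at the top (score of -2 or less = ham, 15 or more = spam) boils down to something like the following. This is only a Python sketch of the rule as described in the quote, not SpamAssassin's actual code; the function and constant names are made up for illustration.

```python
from typing import Optional

# Thresholds as stated in the quoted message; the names are hypothetical.
HAM_MAX_SCORE = -2.0    # score of -2 or less is learned as ham
SPAM_MIN_SCORE = 15.0   # score of 15 or more is learned as spam


def autolearn_label(score: float) -> Optional[str]:
    """Return 'ham' or 'spam' if the score is extreme enough to learn from,
    or None if the message should be left out of auto-learning."""
    if score <= HAM_MAX_SCORE:
        return "ham"
    if score >= SPAM_MIN_SCORE:
        return "spam"
    return None


# Illustrative scores only, not real data.
scores = [-3.1, -2.0, 0.4, 7.9, 15.0, 22.5]
labels = [autolearn_label(s) for s in scores]
print(labels)
```

Anything scoring between the two thresholds is never auto-learned under this rule, so only the mail SA scores at the extremes feeds the Bayes db.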