[Mimedefang] summary of site-wide bayes with SA 2.5 and MD?
Edward Wildgoose
Edward.Wildgoose at FRMHedge.com
Thu Feb 27 12:30:00 EST 2003
> > From then on it should use autolearn to build the bayes db. (SA score
> > of -2 or less = ham, SA score of 15 or more = spam)
> What is the theoretical basis to justify auto-learning? I do not think
> it makes sense. SA already "correctly" categorizes the mail, so there's
> no point in modifying the Bayes statistics. And on the odd chance
> that an outlying e-mail is accidentally misclassified, you pollute
> your statistical pool, and make SA more likely to misclassify such
> e-mail in the future.
> The whole point of learning is that you teach the discriminator
> the "absolute truth" as decided by a human being. Auto-learning,
> in my opinion, violates that important principle.
Is this opinion based on having thought long and hard about it, perhaps having built a test classifier and read the material on the SpamBayes website?
(To paraphrase some in-depth discussions there: I think their point is that if you train only on *wrong* classifications, the system takes a long time to improve. It's well worth a read.)
However, I agree that a lot of this stuff is unintuitive, and I would agree that auto-learning is unattractive. There is also some evidence that over-training hurts accuracy (as is common in many statistical systems). Still, hang around the SpamBayes list for some pretty good discussion backed up with trials on real data.
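For reference, the autolearn rule quoted at the top of this message amounts to a simple threshold check on the SA score. Here is a minimal sketch in Python, assuming only the two thresholds given in the quote (-2 or less = ham, 15 or more = spam). The function and constant names are made up for illustration; this is not SpamAssassin's actual code or configuration.

```python
from typing import Optional

# Thresholds taken from the quoted message; names are hypothetical.
HAM_THRESHOLD = -2.0    # score of -2 or less is learned as ham
SPAM_THRESHOLD = 15.0   # score of 15 or more is learned as spam


def autolearn_label(score: float) -> Optional[str]:
    """Return "ham" or "spam" if the score is extreme enough to
    auto-learn from, or None if the message should be left alone."""
    if score <= HAM_THRESHOLD:
        return "ham"
    if score >= SPAM_THRESHOLD:
        return "spam"
    # Everything in between is too uncertain to feed to the Bayes db.
    return None


if __name__ == "__main__":
    for s in (-3.5, -2.0, 4.0, 15.0, 22.1):
        print(s, autolearn_label(s))
```

Note that only the extreme scores are learned from, which is exactly where the objection above applies: a misclassified outlier that lands beyond either threshold would still be fed into the statistics.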