[Mimedefang] Bayesian analysis (was re: site-wide...)

Thu Feb 27 12:38:01 EST 2003

On Thu, 27 Feb 2003, Edward Wildgoose wrote:

> Is this an opinion based on having thought long and hard about it,
> perhaps built a test classifier and read all the stuff at the spam
> bayes website...?

No, it's just intuition.

> (To paraphrase some deep discussions there: I think their point is
> that if you only train on *wrong* classifications then it takes the
> system a long time to improve - It's well worth a read)

Oh, you definitely want to train on correct classifications.  But you
don't want to risk mis-training on wrong classifications.  I realize
the risk is really low, but it's there.
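
To make that concrete, here is a minimal sketch of training only on
messages the existing rules score confidently, so that borderline mail
(the mail most likely to be misclassified) never reaches the trainer.
The function name, the thresholds, and the whitespace tokenizer are all
hypothetical, chosen for illustration; this is not how SpamAssassin or
any particular filter does it.

```python
from collections import Counter

# Hypothetical thresholds: only scores far from the decision boundary are trusted.
SPAM_TRAIN_MIN = 12.0   # assumed: well above a typical "is spam" cutoff
HAM_TRAIN_MAX = 0.5     # assumed: well below it


def auto_train(messages, spam_tokens=None, ham_tokens=None):
    """Update token counts from (rule_score, text) pairs.

    Messages with a score between the two thresholds are skipped, so a
    wrong rule-based verdict on a borderline message cannot mis-train
    the token database.
    """
    spam_tokens = Counter() if spam_tokens is None else spam_tokens
    ham_tokens = Counter() if ham_tokens is None else ham_tokens
    skipped = 0
    for score, text in messages:
        # Naive tokenizer (an assumption): lowercase, split on whitespace,
        # count each token at most once per message.
        tokens = set(text.lower().split())
        if score >= SPAM_TRAIN_MIN:
            spam_tokens.update(tokens)
        elif score <= HAM_TRAIN_MAX:
            ham_tokens.update(tokens)
        else:
            skipped += 1
    return spam_tokens, ham_tokens, skipped
```

The trade-off is that the widest possible "skip" band minimizes
mis-training but also starves the database of exactly the hard cases.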

Has anyone run tests to see if Bayesian analysis is even worth doing?
SpamAssassin (SA) by itself is very good, and if you augment it with a
custom rule whenever an e-mail is misclassified, I bet you can do as
well in real-world tests as a pure Bayesian filter.  Training Bayesian
filters is a pain in the neck, and if you have per-user databases, disk
usage grows enormously.

I'm also skeptical that Bayesian analysis is scalable.  We're trying
to sell our commercial CanIt-PRO solution to customers with upwards of
25K users, and I just can't see maintaining 25,000 separate Bayesian
databases as being feasible or even desirable.  I also don't see one
site-wide database being much better than plain-vanilla SpamAssassin.
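
As a back-of-the-envelope sketch of the per-user storage concern: the
per-user database size below is purely an assumed figure for
illustration, not a measurement, and the helper name is hypothetical.
Only the 25,000-user count comes from the paragraph above.

```python
def total_db_size_gib(users, per_user_mib):
    """Total storage in GiB for one Bayesian database per user."""
    return users * per_user_mib / 1024


USERS = 25_000              # from the text above
ASSUMED_PER_USER_MIB = 5    # hypothetical size of one user's token database

total = total_db_size_gib(USERS, ASSUMED_PER_USER_MIB)
print(f"{USERS} users x {ASSUMED_PER_USER_MIB} MiB = {total:.1f} GiB")
```

Plug in whatever per-user size you actually observe; the total scales
linearly with both numbers.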

Someone please prove me wrong. :-)

--
David.