Spamconference (was: Re: [Mimedefang] MIMEDefang/Bogofilter)

Wed Feb 5 12:52:25 EST 2003

On Wed, 5 Feb 2003, Michael Sofka wrote:

> The presenter was Michael Salib.  He used a method called LMMSE (Linear
> Mean Square Estimation according to my notes--I'm not sure where the
> extra M comes from).  The function finds ``optimal weights for a linear
> combination of heuristic tests.'' (Quoting my notes.)  The LMMSE is
> used for detection over noisy channels in electronics.  (I assume the
> function is a discriminant to separate spam from ham, as with SA.)

OK, but the objective function (one would assume) would be:

	MINIMIZE F = Wn * Nn + Wp * Np

Where Nn is the number of false negatives, Np is the number of false positives,
and Wn and Wp are weights that trade off how much you fear false positives
versus how much you're irritated by false negatives.  Unfortunately, the
Nn and Np are not purely linear functions of the rule weights; the problem
is really a mixed-integer-linear-programming problem, because of the
threshold that determines whether SA calls something spam or ham.
There are well-known ways to solve MILPs approximately, but finding the
truly optimal solution is NP-hard.
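
To make the threshold point concrete, here is a minimal sketch in Python.
Everything in it is an assumption for illustration only: the function name
weighted_errors, the three-rule corpus, the weights and the threshold of 5.0
are all made up, and none of it is SA's actual code or rule set.  It just
evaluates F for a few values of one rule weight.

```python
# Hypothetical illustration only; not SpamAssassin's code, rules, or scores.

def weighted_errors(weights, threshold, messages, w_fn=1.0, w_fp=10.0):
    """Return F = w_fn * Nn + w_fp * Np for one choice of rule weights.

    messages is a list of (hits, is_spam) pairs, where hits is a tuple of
    0/1 flags recording which rules fired on that message.
    """
    n_fn = n_fp = 0
    for hits, is_spam in messages:
        # The score is linear in the rule weights...
        score = sum(w * h for w, h in zip(weights, hits))
        # ...but the spam/ham decision is a hard threshold on that score.
        called_spam = score >= threshold
        if is_spam and not called_spam:
            n_fn += 1   # false negative: spam that got through
        elif called_spam and not is_spam:
            n_fp += 1   # false positive: ham that was flagged
    return w_fn * n_fn + w_fp * n_fp


# Made-up corpus: (which of three rules fired, is the message really spam?)
MESSAGES = [
    ((1, 1, 0), True),
    ((1, 0, 1), True),
    ((0, 1, 0), False),
    ((0, 0, 1), False),
    ((1, 0, 0), False),
]

# Sweep the first rule's weight while holding the other two at 2.0.
for w0 in (1.0, 2.0, 2.9, 3.0, 4.0, 5.0):
    print(w0, weighted_errors((w0, 2.0, 2.0), 5.0, MESSAGES))
```

With this toy data F stays at 2.0 until the first weight reaches 3.0, drops
to 0.0, and then jumps to 10.0 at 5.0 when a ham message crosses the
threshold.  F is piecewise constant in the weights, so it has no useful
gradient, and that is why the exact problem ends up needing integer
(0/1) variables per message.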

I took a look at Salib's slides; his results for SA seemed much lower
than real-world results.  He claims that his method lets you calculate
weights quickly, so you can constantly be tuning the SA rule weights for
your particular e-mail scheme.  I'm rather skeptical, given that ultimately
you're trying to solve a MILP, and there are no known fast ways
to do that.

--
David.