Spamconfernece (was: Re: [Mimedefang] MIMEDefang/Bogofilter)

Thu Feb 6 10:11:01 EST 2003

On Wednesday 05 February 2003 12:25, David F. Skoll wrote:
> On Wed, 5 Feb 2003, Michael Sofka wrote:
> > The presenter was Michael Salib.  He used a method called LMMSE (Linear
> > Mean Square Estimation according to my notes--I"m not sure where the
> > extra M comes from).  The function finds ``optimal weights for a linear
> > combination of heuristic tests.'' (Quoting my notes.)  The LMMSE is
> > used for detection over noisy channels in electronics.  (I assume the
> > function is a discriminate to separate spam from ham., as with SA.)
>
> OK, but the objective function (one would assume) would be:
>
> 	MINIMIZE F = Wn * Nn + Wp * Np
>
> Where Nn is the number of false negatives, Np is the number of false
> positives, and Wn and Wp are weights that trade off how much you fear false
> positives versus how much you're irritated by false negatives. 
> Unfortunately, the Nn and Np are not purely linear functions of the rule
> weights; the problem is really a mixed-integer-linear-programming problem,
> because of the threshold that determines whether SA calls something spam or
> ham.
> There are well-known ways to solve MILPs approximately, but finding the
> truly optimal solution is NP-complete.

True, and various solutions have different time vs. accuracy tradeoffs. 
Salib's point, as you note is, that he can adjust weights more frequently, 
and get about the same results. For an individual, the marginal return on a 
more accurate GA solution may be outweighed by the more frequent weight 
adjustment of a faster solution. (For a server with 10,000 accounts, the 
tradeoff may be different.)

GAs, NN, etc. excel when linear methods get stuck in local minima. It is 
impossible to tell from the SA weights if this is often the case, and how bad 
the local minima is. (But, Salib suggests it isn't that bad.)

My guess is that SA amounts to a sub-set of a statistical filter (the 
patterns), augmented with penalties for black lists, and header hygiene 
violations.

I remain agnostic, and return to the large point of `more data is needed.' 
That, and a systematic approach (an algebra of spam detection) might reveal a 
fast algorithm for a good-enough solution.  At least until spammers adjust.

In the end, I suspect ISPs will have to subcontract spam rules, just as they
now subcontract virus detection patterns.

Mike

-- 
Michael D. Sofka              sofkam at rpi.edu
C&CT Sr. Systems Programmer    AFS/DFS, email, usenet, TeX, epistemology.
Rensselaer Polytechnic Institute, Troy, NY.  http://www.rpi.edu/~sofkam/