[Mimedefang] learner indicated ham

Tue Aug 12 13:08:03 EDT 2014

On 11 Aug 2014, at 10:22, Justin Edmands wrote:

> Bill,
> Thank you very much for the response. The detail is much appreciated.
> As Ged mentioned, not vague, helpful to say the least. The part about
> highly trusted rules caught my attention:
>
> "Another way to increase autolearning without going all the way to the
> "learn on error" behavior is to flag rules that you trust highly as
> "autolearn_force" so that messages matching them won't ever be
> excluded from autolearning based on the existing Bayes DB disagreeing
> with the deterministic rules."
>
> I think these will get me started:
>
> tflags URIBL_DBL_SPAM autolearn_force
> tflags URIBL_JP_SURBL autolearn_force
> tflags URIBL_BLACK autolearn_force
> tflags INVALID_DATE autolearn_force
>
> Any others that are definites?

That's a hard question for anyone to answer without knowing your 
mailstream's quirks. I can't tell you who your users are and what sort 
of mail they want that matches which rules. The default SA rules have 
mostly low scores because they are all individually highly error-prone.

I'm especially wary about putting too much trust in individual rules 
because I get lots of mail that talks about spam, often with things like 
lists of evil domains that trigger URIBL rules. And INVALID_DATE shows 
up in a surprising number of ethically upstanding but technically sordid 
messages (e.g. Terminix customer notices.) This is why I reserve 
autolearn_force for meta-rules, since it carries a risk of turning a few 
false positives into a bad Bayes DB. The specific example of what I 
described that I can share is this locally-defined rule:

describe URIBL_MULTI1 Multiple URIBL hits
meta URIBL_MULTI1 URIBL_DBL_SPAM + URIBL_RED + URIBL_BLACK + URIBL_SBL + URIBL_WS_SURBL + URIBL_OB_SURBL + URIBL_JP_SURBL + URIBL_SC_SURBL > 2
score URIBL_MULTI1 10
tflags URIBL_MULTI1 autolearn_force

That means that if 3 or more of 8 different URIBL tests hit on a 
message, I tack on an extra 10 points and override the learner 
protections. I should add a note of warning by example: last week a 
thread in the Postfix users list was started with a message including a 
long list of spammer domains, causing the original message and any that 
fully quoted it to match *6* of those URIBLs. If your mailstream 
includes mail discussing spam, you have to take precautions to protect 
from such things ruining your Bayes DB.
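
One possible precaution, given only as a minimal sketch and not as 
part of the configuration described above: move autolearn_force onto 
a second meta-rule that does not fire for mail from a list known to 
discuss spam. The names __LOCAL_LIST_POSTFIX and URIBL_MULTI1_LEARN 
are hypothetical, and the List-Id pattern is an assumption that should 
be checked against real headers from the list before use.

```
# Hypothetical sub-rule: mail arriving via a list that discusses spam.
# The List-Id value is an assumption; verify it against actual traffic.
header __LOCAL_LIST_POSTFIX List-Id =~ /<postfix-users\.postfix\.org>/i

# Hypothetical companion rule: same condition as URIBL_MULTI1, minus list mail.
describe URIBL_MULTI1_LEARN Multiple URIBL hits, not from a spam-discussion list
meta URIBL_MULTI1_LEARN URIBL_MULTI1 && !__LOCAL_LIST_POSTFIX
score URIBL_MULTI1_LEARN 0.01
tflags URIBL_MULTI1_LEARN autolearn_force
```

With this arrangement the "tflags URIBL_MULTI1 autolearn_force" line 
would be dropped, so URIBL_MULTI1 still adds its score everywhere 
while only URIBL_MULTI1_LEARN overrides the learner protections. It 
does nothing for spam discussions that arrive by other routes.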

My other autolearn_force rules are also meta-rules that bundle multiple 
rules, but I unfortunately cannot freely share their details as the 
constituent rules come from private (i.e. encumbered) sources. The 
general process I use is to look for clusters of rules (positive OR 
negative) that often hit together on mail that gets a Bayes score in the 
opposite direction. Before SA 3.4 I just set high scores on those 
meta-rules to ensure rejection, but autolearn_force improves on that.
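
The shape of such a cluster rule can be sketched as follows. This is 
an illustration under stated assumptions, not one of the private 
rules: every LOCAL_* name is a placeholder for a rule that has been 
observed to hit together with the others on mail that Bayes 
misjudged, and the scores and thresholds are arbitrary.

```
# Placeholder cluster that tends to hit on spam Bayes considers hammy.
describe LOCAL_CLUSTER_SPAM Cluster of rules that co-occur on missed spam
meta LOCAL_CLUSTER_SPAM LOCAL_RULE_A + LOCAL_RULE_B + LOCAL_RULE_C > 1
score LOCAL_CLUSTER_SPAM 5

# Placeholder cluster in the negative direction (ham Bayes considers spammy).
describe LOCAL_CLUSTER_HAM Cluster of rules that co-occur on misjudged ham
meta LOCAL_CLUSTER_HAM LOCAL_RULE_X && LOCAL_RULE_Y
score LOCAL_CLUSTER_HAM -3
tflags LOCAL_CLUSTER_HAM nice

# Only request forced autolearning on SA 3.4 or later; on older
# versions the rules above act through their scores alone.
if (version >= 3.004000)
  tflags LOCAL_CLUSTER_SPAM autolearn_force
  tflags LOCAL_CLUSTER_HAM nice autolearn_force
endif
```

The version guard is a design choice for configurations shared 
between old and new installs; on a 3.4-only system the tflags lines 
could simply be written unconditionally.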