[Mimedefang] SpamAssassin via mimedefang is slow

Sun Nov 9 11:00:47 EST 2008

Jeff Rife wrote:
> This is an O(n) operation requiring a full table scan, because there is 
> no index you can use.  The only way the database can return the 
> "closest" hash is to compute the XOR of the new hash with *every* hash 
> in the database.
I agree that without a column holding the bit count a full table scan 
would be needed. However, I keep an extra column in the database 
containing the total number of 1 bits in each hash. This column is 
indexed and can range from 0 to 256 (as the hash is 256 bits). If my 
message hash has, say, 200 bits set and I am looking for a difference 
of at most 10 bits, I can select only the rows whose bit count is 
between 190 and 210, and process only those. A full table scan is no 
longer needed, because two hashes that differ in at most 10 bits cannot 
differ in bit count by more than 10. Of course, you still have to scan 
every record in that range, so with just a few rows in the database the 
gain is negligible. But on a database with every possible combination 
the gain is huge. Also, the closest hit will, on average, be among the 
rows with the smallest difference in bit count.
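
A minimal sketch of the bit-count prefilter described above. The post 
names neither the database nor the schema, so SQLite, the table name 
`hashes`, the column names, and the helper functions are all 
assumptions made for illustration. The indexed range query narrows the 
candidates, and the XOR distance is then computed only for those rows.

```python
import sqlite3

HASH_BITS = 256  # hash width stated in the post


def popcount(value: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def open_db() -> sqlite3.Connection:
    """In-memory table with an indexed bit-count column (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hashes (hash TEXT PRIMARY KEY, bits INTEGER NOT NULL)")
    db.execute("CREATE INDEX hashes_bits ON hashes (bits)")
    return db


def add_hash(db: sqlite3.Connection, value: int) -> None:
    """Store a hash as 64 hex digits together with its bit count."""
    db.execute(
        "INSERT OR IGNORE INTO hashes (hash, bits) VALUES (?, ?)",
        (format(value, "064x"), popcount(value)),
    )


def closest(db: sqlite3.Connection, value: int, max_diff: int):
    """Return (hash, distance) for the nearest stored hash within max_diff bits.

    Only rows whose bit count lies within max_diff of the query's bit count
    are fetched; any hash outside that range must differ in more bits.
    Returns None when no stored hash is close enough.
    """
    bits = popcount(value)
    rows = db.execute(
        "SELECT hash FROM hashes WHERE bits BETWEEN ? AND ?",
        (bits - max_diff, bits + max_diff),
    )
    best = None
    for (hex_hash,) in rows:
        stored = int(hex_hash, 16)
        distance = popcount(value ^ stored)  # Hamming distance via XOR
        if distance <= max_diff and (best is None or distance < best[1]):
            best = (stored, distance)
    return best
```

The range query is only a filter: rows inside the window can still be 
further away than `max_diff` bits, which is why the XOR check remains.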

> I'm averaging less than 2 seconds per virus plus SA scan, so there is 
> no way this would help significantly.
Oh I'm not saying that it would help you in this case, only that it 
helps me a lot.

> You need to look into methods that keep you from getting to the data 
> phase at all.  Only 25% of outside connections to my servers get to the 
> data phase, and half of those are from known good sender/recipient/IP 
> address tuples.  I scan those anyway because my load is light, but you 
> could skip them (or just run SA on a small percentage).
I do need more methods to keep "evil" messages from getting to the 
data phase, but we don't use graylisting. Our users are spoiled, and 
starting a graylist on our production servers at this point would make 
people call the support line, because customers can't seem to grasp 
that mail is not instantaneous.

> If you *must* accept the data before you do anything, you can still 
> skip SA scanning by rejecting/tempfailing at that point.  If you are 
> running spam scans on more than about 30% of your connections, that's 
> way too much.
Yes, about 35% of the connections are scanned. I could reduce this by 
tweaking the cutoff points of my cache and my scanners: my hash 
database blocks at a score of, say, 15 with 10 messages seen, and my 
SpamAssassin drops (after learning the message, of course) everything 
above 9 or so. By tweaking this I would probably only have to scan 20% 
or less, but my hash caching is still in beta, so I don't want to push 
the tweaking too far at this point.

> Since you are using so few records in the hash table, what you are 
> probably stopping with this hash is the spam runs that send the same 
> thing.  Those just don't get through for me.  As an example, here's a 
> bunch that never came back after the greylist tempfail (no rcpt_to to 
> protect privacy):
My hash database is a kind of graylist (only I tempfail at the data 
phase, not the RCPT TO phase as graylisting does), except that it 
triggers on bad messages only, not on every connection we receive as 
graylisting does (well, it has to learn the whitelisted senders, but 
I'm working on that). You are right that this code stops spam runs. It 
does: a spam run comes in, in the worst case the first 10 messages are 
scanned (and dropped), and all the others are tempfailed. Works 
wonders :).
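
A rough sketch of that flow, under stated assumptions. The thresholds 
are the ones quoted in this post (score 15, 10 messages seen, 
SpamAssassin cutoff around 9). Everything else is hypothetical: the 
`CacheEntry` class, the `handle_message` function, the use of an 
average score per hash, and the `scan` callback that stands in for a 
real SpamAssassin run. It is not the actual filter code.

```python
from dataclasses import dataclass

# Thresholds quoted in the post ("say 15", "10 messages seen", "9 or so").
CACHE_BLOCK_SCORE = 15.0
CACHE_MIN_SEEN = 10
SA_DROP_SCORE = 9.0


@dataclass
class CacheEntry:
    """Per-hash statistics (hypothetical structure)."""
    seen: int = 0
    total_score: float = 0.0

    @property
    def average(self) -> float:
        return self.total_score / self.seen if self.seen else 0.0


def handle_message(cache: dict, key: str, scan) -> str:
    """Return 'tempfail', 'drop' or 'accept' for one message at the data phase.

    key  -- hypothetical cache key for the matched message hash
    scan -- callable returning a spam score; it is only called when the
            cache cannot decide on its own
    """
    entry = cache.setdefault(key, CacheEntry())
    if entry.seen >= CACHE_MIN_SEEN and entry.average >= CACHE_BLOCK_SCORE:
        return "tempfail"  # known-bad hash: skip the expensive scan
    score = scan()
    entry.seen += 1
    entry.total_score += score
    return "drop" if score > SA_DROP_SCORE else "accept"
```

With a spam run that always scores 20, the first 10 messages are 
scanned and dropped, and every later one is tempfailed without a scan.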

-- 
Michiel Brandenburg