[Mimedefang] SpamAssassin via mimedefang is slow
apex at xepa.nl
Sun Nov 9 11:00:47 EST 2008
Jeff Rife wrote:
> This is an O(n) operation requiring a full table scan, because there is
> no index you can use. The only way the database can return the
> "closest" hash is to compute the XOR of the new hash with *every* hash
> in the database.
I agree that without storing the number of set bits a full table scan
would be needed. However, I keep an extra column in the database
containing the total number of 1s in each hash. This column is indexed
and can range from 0 to 256 (as the hash is 256 bits). If my message
hash has, say, 200 bits set and I am looking for a difference of at most
10 bits, I can select only the rows that have between 190 and 210 bits
set, and process only those. This is safe because two hashes that differ
in at most 10 bits cannot differ by more than 10 in their bit counts. A
full table scan is no longer needed, because only part of the table is
read. Of course you still have to compute the XOR for every selected
row, and if you have just a few rows in the database the gain is
negligible. But on a database containing every possible combination the
gain would be huge. Also, the closest hit will, on average, be found
among the rows with the smallest difference in bit count.
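
To make the bit-count prefilter concrete, here is a minimal sketch in
Python with an in-memory SQLite table. It is not my production code:
the table name `hashes`, the column `bits`, and the helper names are
made up for illustration, and hashes are stored as hex text just to
keep the example short.

```python
import sqlite3

def popcount(h: int) -> int:
    """Number of 1 bits in the hash."""
    return bin(h).count("1")

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return popcount(a ^ b)

def make_db() -> sqlite3.Connection:
    # Hypothetical schema: one row per hash plus its indexed bit count.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hashes (hash TEXT PRIMARY KEY, bits INTEGER NOT NULL)")
    db.execute("CREATE INDEX idx_bits ON hashes (bits)")
    return db

def add_hash(db: sqlite3.Connection, h: int) -> None:
    db.execute(
        "INSERT OR IGNORE INTO hashes (hash, bits) VALUES (?, ?)",
        (format(h, "064x"), popcount(h)),
    )

def find_near(db: sqlite3.Connection, h: int, max_diff: int):
    """Return (distance, hash) pairs within max_diff bits, closest first."""
    p = popcount(h)
    # The indexed range query discards rows that cannot possibly match.
    rows = db.execute(
        "SELECT hash FROM hashes WHERE bits BETWEEN ? AND ?",
        (p - max_diff, p + max_diff),
    )
    hits = []
    for (hex_hash,) in rows:
        candidate = int(hex_hash, 16)
        d = hamming(h, candidate)  # the XOR is still needed per candidate
        if d <= max_diff:
            hits.append((d, candidate))
    return sorted(hits)
```

Note that the bit-count range only narrows the candidates. A row can
have the same number of set bits and still be far away, so the XOR
check on the selected rows stays.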
> I'm averaging less than 2 seconds per virus plus SA scan, so there is
> no way this would help significantly.
Oh I'm not saying that it would help you in this case, only that it
helps me a lot.
> You need to look into methods that keep you from getting to the data
> phase at all. Only 25% of outside connections to my servers get to the
> data phase, and half of those are from known good sender/recipient/IP
> address tuples. I scan those anyway because my load is light, but you
> could skip them (or just run SA on a small percentage).
I do need more methods to keep "evil" messages from reaching the data
phase, but we don't use graylisting. Our users are spoiled, and
starting a graylist now on our production servers would make people
call the support line, because customers can't seem to grasp that mail
is not instantaneous.
> If you *must* accept the data before you do anything, you can still
> skip SA scanning by rejecting/tempfailing at that point. If you are
> running spam scans on more than about 30% of your connections, that's
> way too much.
Yeah, about 35% of the connections are scanned. I could reduce this by
tweaking the cutoff points of my cache and my scanners: my hash
database blocks at a score of, say, 15 with 10 messages seen, and my
SpamAssassin drops (after learning it, of course) everything above 9 or
so. By tweaking this I would probably only have to scan 20% or less,
but my hash caching is still kind of beta, so I don't want to push the
tweaking too far at this point.
> Since you are using so few records in the hash table, what you are
> probably stopping with this hash is the spam runs that send the same
> thing. Those just don't get through for me. As an example, here's a
> bunch that never came back after the greylist tempfail (no rcpt_to to
> protect privacy):
My hash database is a kind of graylist (only I tempfail at the data
phase, not at the RCPT TO phase as a graylist does), except that it
triggers on bad messages only, not on all connections we receive as
graylisting does (well, it has to learn the white senders, but I'm
working on that). You are right that this code stops spam runs, it
does: a spam run comes in, in the worst case the first 10 messages are
scanned (and dropped), and all the others are tempfailed. Works
wonders :).
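
Roughly, the decision at the data phase looks like the sketch below.
This is a simplified illustration in Python, not the actual filter
code: the names are hypothetical, the thresholds are the "say 15 with
10 messages seen" and "above 9 or so" figures from above, and using
the mean score per hash is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable

HASH_BLOCK_SCORE = 15.0  # "a score of say 15"
HASH_MIN_SEEN = 10       # "with 10 messages seen"
SA_DROP_SCORE = 9.0      # SpamAssassin drops "all above 9 or so"

@dataclass
class CacheEntry:
    """Per-hash statistics (hypothetical layout)."""
    seen: int = 0
    total_score: float = 0.0

    @property
    def mean_score(self) -> float:
        # Assumption: the cached score is the mean over messages seen.
        return self.total_score / self.seen if self.seen else 0.0

def decide(entry: CacheEntry, scan: Callable[[], float]) -> str:
    """Return 'tempfail', 'drop' or 'accept' for one incoming message."""
    if entry.seen >= HASH_MIN_SEEN and entry.mean_score >= HASH_BLOCK_SCORE:
        return "tempfail"  # known bad hash: no scan needed
    score = scan()         # the expensive SpamAssassin run
    entry.seen += 1
    entry.total_score += score
    return "drop" if score > SA_DROP_SCORE else "accept"
```

With these numbers a spam run costs at most 10 scans per hash; every
later message with a matching hash is tempfailed without touching the
scanner.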