[Mimedefang] SpamAssassin via mimedefang is slow
apex at xepa.nl
Sat Nov 8 17:52:37 EST 2008
Jeff Rife wrote:
> On 8 Nov 2008 at 18:42, Michiel Brandenburg wrote:
>> Basically Nilsimsa is a sounds like hash. Ie it makes a 256 bit
>> fingerprint of any binary data, like the message :). Now if you scan 2
>> messages and they are similar the hash will not look the same in hex but
>> looking at them in their binary form they would look the same. So ..
>> count_amount_of_1_bits((hash message 1) XOR (hash message 2)) = A
>> if A is smaller than say 10 ( so the hashes differ in 10 places or less
>> ) the message is nearly the same.
> OK, so it turns into an O(n) algorithm, where you need to retrieve each
> hash you have already computed, then compare the hash of the current
> message against that. After that, you add it to the database for
> future messages.
Well not really you can do XOR's in the database itself meaning u can
tell the database to return the 1st record closest to the hash you are
looking for with a maximum difference of say 10 bits. This is one query
resulting in 1 or 0 hits. You update the same hash as the one you found
and insert it if it was not found also resulting in 1 query. So all you
would need is one compute hash, 1 select and 1 insert (or update on key
collision). This would mean a table search .. but I have found a way to
speed this up by adding an extra column in the database containing the
total amount of 1 bits in each hash this would work as follows.
Message M's hash has m 1 bits.
Search the database for hashes that have say 10 differences ( a table
search ) order by the difference and return the 1st one [ pseudo: select
* from `table` where ((`hash` XOR M) <= 10 ) order by (`hash` XOR M) asc
limit 1 ]
Now if you were to also add a statement only looking at rows that have
m-10 till m+10 bits in them will speed up the query dramatically (and as
an extra bonus the mysql key cache kicks in now), as a full table search
is no longer needed and mysql (in my case) can use the index on the
amount_of_bits column [ pseudo: select * from `table` where ((`hash` XOR
M) <= 10 ) AND (difference in bits of hash and m is between 10 and -10)
order by (`hash` XOR M) asc limit 1 ] .
> I don't think anything that has to retrieve every record in a database
> table will be significantly better than just running SA, since I'm
> averaging about 2 seconds to run all the content scans (virus, SA,
> etc.). >
See above only one record (or no record) is actually returned. But this
method is WAY faster .. to running SA and the rest of the heavy
scanners. In my work cluster 1 mimedefang per server is usually busy
handling about 1.5 messages per sec on average ( while the other 19 are
just ideling or not even spawned). My heavy network enabled scanners
take about 3-5 seconds per message if I did not have this type of
"prescanner" I would need a lot more power to handle the same amount.
> In theory, it could help a much larger site than mine, but as the site
> gets busier, you would need to keep even more hashes for longer periods
> of time, so it's probably a wash.
My database at home holds 6000 hashes, the one at work (cluster) holds
9000 hashes, while the volume that the work cluster processes is a
factor 100 or more than my home machine. The amount of hashes you would
need to keep is not dependent on volume but more on the "type" of
message you receive. With this in mind the chance that a server will
see the same type of message 2 times or more will depend on the volume
of traffic it receives. So you are right .. more volume over a server
the higher the benefits of this type of approach, but then this is also
true for the bayes filter of spamassassin which on my work cluster holds
about 6 million records.
The hashes are kept in the database for a certain amount of time from
last being updated. When a certain threshold is reached I tempfail that
message and it will expire after a certain time. It's not the best
approach I know but hey .. it's kept my servers alive we halved our
needed infrastructure while doubling the amount of messages it can handle.
If someone wants more information on how to do this .. feel free to
contact me offlist :) If any other ppl have nice tweaks to weed out
spam in a way not usually talked about lets have it :)
More information about the MIMEDefang