[Mimedefang] SpamAssassin via mimedefang is slow

Michiel Brandenburg apex at xepa.nl
Sat Nov 8 17:52:37 EST 2008


Jeff Rife wrote:
> On 8 Nov 2008 at 18:42, Michiel Brandenburg wrote:
> 
>> Basically Nilsimsa is a sounds like hash. Ie it makes a 256 bit 
>> fingerprint of any binary data, like the message :). Now if you scan 2 
>> messages and they are similar the hash will not look the same in hex but 
>> looking at them in their binary form they would look the same.  So .. 
>> count_amount_of_1_bits((hash message 1) XOR (hash message 2)) = A
>> if A is smaller than say 10 ( so the hashes differ in 10 places or less 
>> ) the message is nearly the same.
> 
> OK, so it turns into an O(n) algorithm, where you need to retrieve each 
> hash you have already computed, then compare the hash of the current 
> message against that.  After that, you add it to the database for 
> future messages.
Well not really you can do XOR's in the database itself meaning u can 
tell the database to return the 1st record closest to the hash you are 
looking for with a maximum difference of say 10 bits. This is one query 
resulting in 1 or 0 hits.  You update the same hash as the one you found 
and insert it if it was not found also resulting in 1 query.  So all you 
would need is one compute hash, 1 select and 1 insert (or update on key 
collision).  This would mean a table search .. but I have found a way to 
speed this up by adding an extra column in the database containing the 
total amount of 1 bits in each hash this would work as follows.

Message M's hash has m 1 bits.
Search the database for hashes that have say 10 differences ( a table 
search ) order by the difference and return the 1st one [ pseudo: select 
* from `table` where ((`hash` XOR M) <= 10 ) order by (`hash` XOR M) asc 
limit 1 ]
Now if you were to also add a statement only looking at rows that have 
m-10 till m+10 bits in them will speed up the query dramatically (and as 
an extra bonus the mysql key cache kicks in now), as a full table search 
is no longer needed and mysql (in my case) can use the index on the 
amount_of_bits column [ pseudo: select * from `table` where ((`hash` XOR 
M) <= 10 ) AND (difference in bits of hash and m is between 10 and -10) 
order by (`hash` XOR M) asc limit 1 ] .

> I don't think anything that has to retrieve every record in a database 
> table will be significantly better than just running SA, since I'm 
> averaging about 2 seconds to run all the content scans (virus, SA, 
> etc.). >
See above only one record (or no record) is actually returned. But this 
method is WAY faster .. to running SA and the rest of the heavy 
scanners. In my work cluster 1 mimedefang per server is usually busy 
handling about 1.5 messages per sec on average ( while the other 19 are 
just ideling or not even spawned). My heavy network enabled scanners 
take about 3-5 seconds per message if I did not have this type of 
"prescanner" I would need a lot more power to handle the same amount.

> In theory, it could help a much larger site than mine, but as the site 
> gets busier, you would need to keep even more hashes for longer periods 
> of time, so it's probably a wash.
My database at home holds 6000 hashes, the one at work (cluster) holds 
9000 hashes, while the volume that the work cluster processes is a 
factor 100 or more than my home machine. The amount of hashes you would 
need to keep is not dependent on volume but more on the "type" of 
message you receive.  With this in mind the chance that a server will 
see the same type of message 2 times or more will depend on the volume 
of traffic it receives.  So you are right .. more volume over a server 
the higher the benefits of this type of approach, but then this is also 
true for the bayes filter of spamassassin which on my work cluster holds 
about 6 million records.
The hashes are kept in the database for a certain amount of time from 
last being updated. When a certain threshold is reached I tempfail that 
message and it will expire after a certain time.  It's not the best 
approach I know but hey .. it's kept my servers alive we halved our 
needed infrastructure while doubling the amount of messages it can handle.

If someone wants more information on how to do this .. feel free to 
contact me offlist :)  If any other ppl have nice tweaks to weed out 
spam in a way not usually talked about lets have it :)

--
Michiel Brandenburg



More information about the MIMEDefang mailing list