[Mimedefang] SpamAssassin via mimedefang is slow

Sat Nov 8 12:42:35 EST 2008

Jeff Rife wrote:
> On 8 Nov 2008 at 0:53, Michiel Brandenburg wrote:
> 
>> Another tip: take a look at Digest::Nilsimsa (in my implementation I can 
>> detect 60% of the spam at the data phase without resorting to heavy 
>> scanners, like spamassassin, and temp fail it).
> 
> OK, I took a look at it.
> 
> It seems to be just another digest system, although I don't understand 
> what it uses to generate the hash.  But, once you have a hash, I can't 
> see how that would magically detect spam.
Basically, Nilsimsa is a "sounds like" hash, i.e. it makes a 256-bit 
fingerprint of any binary data, such as the message :). If you hash two 
similar messages, the hex digests will not look the same, but compared 
bit by bit they are nearly identical. So:
count_amount_of_1_bits((hash message 1) XOR (hash message 2)) = A
If A is at most, say, 10 (so the hashes differ in 10 bit positions or 
fewer), the messages are nearly the same.
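
To make the bit-counting comparison above concrete, here is a minimal 
sketch in Python. It assumes the two fingerprints are available as 
64-character hex strings; the function names and the default threshold 
of 10 are only illustrative, not part of any library.

```python
def bit_difference(hex_a: str, hex_b: str) -> int:
    """Count the bits that differ between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    # XOR leaves a 1 bit wherever the two digests disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def nearly_same(hex_a: str, hex_b: str, max_diff: int = 10) -> bool:
    """True when the digests differ in max_diff bit positions or fewer."""
    # max_diff=10 mirrors the "say 10" example; tune it for your own mail.
    return bit_difference(hex_a, hex_b) <= max_diff
```

Calling nearly_same() on the digests of two messages then answers the 
"is this nearly the same message" question without any heavy scanning.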

The flow of the data part is as follows:
1. Create the hash.
2. Look up the hash somewhere to see if we have a match within X 
differences.
3. If we have a result, that result has a spam score of more than Ya, 
and we have seen the message Za times: drop it, bounce it, whatever :)
4. If we have a result, the score is lower than Yb, and we have seen 
the message Zb times: accept without scanning (it's not spam). BTW, I'm 
not doing this, but you might.
5. If we don't have a result, scan it and place it in the cache.
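
The five steps above can be sketched roughly like this. It is only an 
illustration in Python with an in-memory list as the cache; the names 
(CacheEntry, decide, remember), the default thresholds standing in for 
X, Ya, Za, Yb and Zb, and the choice to bump the "seen" counter on 
every match are all assumptions, not a description of my filter.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    digest: str   # 64 hex characters (256-bit fingerprint)
    score: float  # spam score recorded when the message was scanned
    seen: int     # how many near-identical messages have arrived


def bit_difference(hex_a, hex_b):
    # Number of bit positions in which the two digests differ.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def find_match(cache, digest, max_diff):
    """Step 2: return the closest cached entry within max_diff bits."""
    best, best_diff = None, max_diff + 1
    for entry in cache:
        diff = bit_difference(entry.digest, digest)
        if diff < best_diff:
            best, best_diff = entry, diff
    return best


def decide(cache, digest, max_diff=10, spam_score=5.0, spam_seen=3,
           ham_score=1.0, ham_seen=3):
    """Return "reject", "accept" or "scan" for an already-hashed message."""
    entry = find_match(cache, digest, max_diff)
    if entry is None:
        return "scan"                      # step 5: unknown, scan it
    entry.seen += 1                        # assumption: count this sighting
    if entry.score > spam_score and entry.seen >= spam_seen:
        return "reject"                    # step 3: known spam
    if entry.score < ham_score and entry.seen >= ham_seen:
        return "accept"                    # step 4: known clean (optional)
    return "scan"                          # matched, but not enough evidence


def remember(cache, digest, score):
    """Step 5, second half: store the scanner's verdict in the cache."""
    cache.append(CacheEntry(digest, score, 1))
```

A caller would hash the message (step 1), call decide(), and only on 
"scan" run the heavy scanner and pass its score to remember().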

> Now, it does appear that it might be able to recognize it as similar to 
> spam you have already received, but there is zero documentation on how 
> to do that.  And, it would mostly be useful as a distributed database, 
> but there doesn't appear to be anything like that available, either.
Well, the result set would not be that large :) At home I have about 8k 
records; at work (accepting waaaay more messages) about 10k.

> So, how does one use this in the real world?
> 
> BTW, only 25% of connections to my mail servers result in SA running, 
> so I don't think it will help me much, but it's probably worth some 
> time to look at.
I was referring to all connections that reach the data phase (so valid 
connections to real users). I would otherwise have to scan all of them, 
and now I scan only 40% of them.

Hope that helps :)

PS: Not all databases can handle a binary XOR of 256 bits, so watch out.
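
One possible workaround for that, sketched in Python: split the 256-bit 
fingerprint into four 64-bit words and add up the per-word bit counts. 
The helper names are made up for illustration, and whether your 
database can XOR and count bits on 64-bit integer columns is something 
you would have to check for yourself.

```python
def split_words(hex_digest):
    """Split a 64-character hex digest into four unsigned 64-bit integers."""
    if len(hex_digest) != 64:
        raise ValueError("expected 64 hex characters (256 bits)")
    # Every 16 hex characters are one 64-bit word, most significant first.
    return [int(hex_digest[i:i + 16], 16) for i in range(0, 64, 16)]


def bit_difference_words(words_a, words_b):
    """Sum the differing bits word by word; equals the full 256-bit count."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words_a, words_b))
```

The total is the same as XORing the whole 256 bits at once, because 
XOR and bit counting both work independently on each bit position.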
--
Michiel Brandenburg