[Mimedefang] SpamAssassin via mimedefang is slow
Michiel Brandenburg
apex at xepa.nl
Sat Nov 8 12:42:35 EST 2008
Jeff Rife wrote:
> On 8 Nov 2008 at 0:53, Michiel Brandenburg wrote:
>
>> Another tip: take a look at Digest::Nilsimsa (in my implementation I can
>> detect 60% of the spam at the data phase without resorting to heavy
>> scanners, like spamassassin, and temp fail it).
>
> OK, I took a look at it.
>
> It seems to be just another digest system, although I don't understand
> what it uses to generate the hash. But, once you have a hash, I can't
> see how that would magically detect spam.
Basically Nilsimsa is a "sounds like" hash, i.e. it makes a 256-bit
fingerprint of any binary data, like the message :). If you hash two
messages that are similar, the hashes will not look the same in hex, but
compared bit by bit they are nearly identical. So:
count_amount_of_1_bits((hash message 1) XOR (hash message 2)) = A
If A is at most, say, 10 (so the hashes differ in 10 bit positions or
fewer), the messages are nearly the same.
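The bit-count comparison above can be sketched in a few lines of Python. This is only an illustration: the function names and the two example digests are made up, and it assumes the hashes are available as 64-character hex strings (256 bits); it does not call Digest::Nilsimsa itself.

```python
def bit_difference(hex_a, hex_b):
    """Count the bit positions in which two equal-length hex digests differ."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    # XOR leaves a 1 bit wherever the two digests disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def nearly_same(hex_a, hex_b, max_diff=10):
    """True if the digests differ in max_diff bit positions or fewer."""
    return bit_difference(hex_a, hex_b) <= max_diff


# Two made-up 256-bit digests that differ only in the last byte.
digest_a = "00" * 32
digest_b = "00" * 31 + "07"  # 0x07 has three 1 bits

print(bit_difference(digest_a, digest_b), nearly_same(digest_a, digest_b))
```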
The flow of the data part is as follows:
1. Create the hash.
2. Look up the hash somewhere to see if we have a match with at most X
differing bits.
3. If we have a result, that result has a spam score of more than Ya,
and we have seen the message Za times: drop it, bounce it, whatever :)
4. If we have a result, the score is lower than Yb, and we have seen
the message Zb times: accept without scanning (it's not spam). BTW, I'm
not doing this, but you might.
5. If we don't have a result, scan it and place it in the cache.
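A minimal sketch of that flow, under several assumptions: the cache is a plain in-memory list searched linearly, `scan` is a hypothetical callable standing in for the heavy scanner and returning a spam score, and the thresholds are arbitrary. What happens to a match that does not reach the thresholds is not spelled out in the steps; here it is simply scanned again. The optional step 4 is left out.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    digest: int    # 256-bit fingerprint as an integer
    score: float   # spam score from the full scan (hypothetical scale)
    seen: int = 1  # how many near-identical messages were seen


def find_match(cache, digest, max_diff=10):
    """Step 2: return the first cached entry within max_diff differing bits."""
    for entry in cache:
        if bin(entry.digest ^ digest).count("1") <= max_diff:
            return entry
    return None


def handle(cache, digest, scan, spam_score=5.0, min_seen=3):
    entry = find_match(cache, digest)
    if entry is None:
        # Step 5: unknown message, run the heavy scan and cache the result.
        cache.append(Entry(digest, scan()))
        return "scanned-new"
    entry.seen += 1
    if entry.score > spam_score and entry.seen >= min_seen:
        # Step 3: known spam seen often enough, no heavy scan needed.
        return "reject"
    # Assumption: a match below the thresholds is still scanned.
    scan()
    return "scanned-known"
```

With `min_seen=3`, the first two copies of a spam message still go through the scanner and only the third is rejected from the cache alone.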
> Now, it does appear that it might be able to recognize it as similar to
> spam you have already received, but there is zero documentation on how
> to do that. And, it would mostly be useful as a distributed database,
> but there doesn't appear to be anything like that available, either.
Well, the result set would not be that large :) At home I have about 8k
records; at work (accepting waaaay more messages) about 10k.
> So, how does one use this in the real world?
>
> BTW, only 25% of connections to my mail servers result in SA running,
> so I don't think it will help me much, but it's probably worth some
> time to look at.
I was referring to all connections ending up in the data phase (so valid
connections to real users). Without this I would have to scan all of
them; now I scan only 40% of them.
Hope that helps :)
PS: not all databases can handle a binary xor of 256 bits so watch out.
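One possible workaround, offered as an assumption and not as what is done in the setup described above: store the 256-bit hash as eight 32-bit words, XOR and bit-count each word separately, and add up the counts. The sketch below only checks that the per-word sum equals the full 256-bit distance; the function names are illustrative.

```python
def split_words(digest, words=8, bits=32):
    """Split an integer digest into `words` chunks of `bits` bits each."""
    mask = (1 << bits) - 1
    return [(digest >> (i * bits)) & mask for i in range(words)]


def distance_by_words(a, b):
    """Sum the per-word bit differences of two 256-bit digests."""
    return sum(
        bin(x ^ y).count("1")
        for x, y in zip(split_words(a), split_words(b))
    )


a = int("ab" * 32, 16)
b = int("cd" * 32, 16)
# The word-by-word sum matches the distance computed on the whole value.
print(distance_by_words(a, b) == bin(a ^ b).count("1"))
```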
--
Michiel Brandenburg