[Mimedefang] Caching the results of a SpamAssassin scan

Thu Apr 3 00:48:01 EST 2003

I probably shouldn't type this email while it's late and I'm tired, but
hopefully I'll be able to keep my thoughts straight here and make some
sense...

I've been tuning performance on my about-to-go-live mail server.  Just for
kicks I decided to remove certain portions of my MIMEDefang filter to see
where the intense processing was really happening.  It was no surprise that
the SpamAssassin scan was the CPU hog.  So much so that the rest of the
script is insignificant by comparison.

While testing my server I simulated load by sending a group of 5 messages 3
times each to 10 different accounts on the system.  I sent them in such a
way that each message (all 150 of them) would be delivered via a separate
SMTP connection.  As a result, my load average went up, MIMEDefang ran out
of slaves, and many of the messages were temp-failed.

This got me to thinking.  I know that most MTAs will split a message up by
domain when sending a message that has multiple recipients and try to
deliver the same message to a particular domain as one SMTP session.  But I
expect that spammers would often run a script that would open a separate
connection per recipient, effectively flooding my server the same way my
test did.

Even though these were all separate SMTP sessions, every copy of a given
message had the exact same message-ID.  I thought it would be really nice
if SpamAssassin could cache the results of its scan for a particular
message, and use the cached result if the same message-ID came through
within a specified interval.

I've seen this technique used on web pages to cache the results of dynamic
pages that change, but infrequently enough that they would benefit from
caching.  I would imagine it going something like this:

1. SpamAssassin receives a message and extracts the message-ID.  It looks in
its cache directory for a file with the same message-ID whose modtime is
within the interval set (say, the last 24 hours).  If it finds one, it gets
the results from the file rather than re-scanning the message.  (It could
also delete any file with that message-ID whose modtime is more than 24
hours old, or a separate cron job could run to clear out old cache files.)
2. If it doesn't find it, it scans the message as normal, saves the result
in the cache directory, and returns it.
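
To make the two steps above concrete, here is a rough sketch in Python.  It
is not SpamAssassin or MIMEDefang code: cached_scan, cache_path and the scan
callback are hypothetical names made up for illustration, and the scan result
is assumed to be a plain string.  The message-ID is hashed before being used
as a file name, since it comes from the sender and may contain characters
that are unsafe in a path.

```python
import hashlib
import os
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # the "say, 24 hours" interval from step 1


def cache_path(cache_dir, message_id):
    """Map a message-ID to a cache file name (hypothetical helper)."""
    # Hash the ID so sender-supplied characters never end up in a path.
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def cached_scan(cache_dir, message_id, message, scan,
                ttl=CACHE_TTL_SECONDS, now=None):
    """Return the cached result for message_id, or run scan(message).

    `scan` stands in for the expensive scan; it is assumed to take the
    message and return a string.
    """
    now = time.time() if now is None else now
    path = cache_path(cache_dir, message_id)

    # Step 1: reuse the cache file if its modtime is within the interval.
    try:
        if now - os.path.getmtime(path) < ttl:
            with open(path, encoding="utf-8") as fh:
                return fh.read()
        os.remove(path)  # stale entry: drop it and fall through to a scan
    except FileNotFoundError:
        pass  # no entry yet, or another process removed it first

    # Step 2: scan as normal, save the result, and return it.
    result = scan(message)
    os.makedirs(cache_dir, exist_ok=True)
    tmp = "%s.tmp.%d" % (path, os.getpid())
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(result)
    os.replace(tmp, path)  # atomic rename so readers never see a partial file
    return result
```

One caveat with keying on the message-ID alone: the sender picks it, so two
different messages could carry the same ID and the second would get the
first one's result.  Hashing the body along with the ID would avoid that, at
the cost of reading the whole message before the lookup.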

I apologize if this has been discussed here before, but I didn't see it in
the archives.  This is probably something that is better suited for a
SpamAssassin list, but it is indirectly related to MIMEDefang, so I thought
I'd see if anyone had any thoughts.  TIA...

___________________________________________
Michael Sims
Project Analyst - Information Technology
Crye-Leike Realtors
Office: (901)758-5648  Pager: (901)769-3722
___________________________________________