[Mimedefang] Problem is happening right now (was: My MD install went wacko)

Wed Jun 11 14:49:00 EDT 2003

On 11 Jun 2003, Bill Randle wrote:

> I'm afraid I don't have an answer for you. My configuration is
> similar to yours and it's been working for close to a week now
> without problems. As I said, in my case it was the upgrade to
> MIMEDefang 2.43-BETA-5, but I didn't see anything obvious in
> diffing the mimedefang code that would make things start working -
> and in your case, that didn't seem to make any difference anyway.
> 
> I'm not even sure of the exact failure mechanism. E.g., because it's
> hung, do the child mimedefang processes do their normal idle timeout
> thing and exit? If so, what prevents new processes from being spawned?
> Is the initial "timeout before data read" error the cause or just a
> symptom of it already being in the failure state? I suspect the later
> messages "to error state" and "init failed to open" are symptoms of
> it already being hung, but what causes it to get there in the first
> place?

Best I can tell is that they don't idle timeout and respawn (or just die).  
Now I don't know the answer to this question but I'd assume that it's the
multiplexor that keeps track of the slave processes for gathering stats
used in idle time and total number of messages processed, not the slaves
themselves.  That's my assumption and it could certainly be wrong.  If 
that's the case then it's quite possible that the slave processes are 
still working like normal and that it's the multiplexor that's caught in a 
stupor.  Sendmail calls the multiplexor and not the individual slaves so 
if the multiplexor goes mentally MIA then the slaves are just left there 
to rot.  Sound right?

> I'm beginning to wonder if it isn't some kind of race condition, as
> changing various pieces of the overall filtering process seems to make
> a difference in how often it happens. At one time, it was happening
> to me as often as every few minutes for a short period of time, then
> would go 30-40 minutes before hanging again. This is with an incoming
> mail load of 800-1200 messages/hour. As I said, each time I changed
> something it seemed to improve: tmpfs for /var/spool/MIMEDefang,
> SA-2.60 CVS, newer kernel, MIMEDefang 2.43-BETA-5. That's what makes
> me think it's a timing / timeout problem. But where?

When I first noticed this problem it was happening a couple times an hour.  
I'd have a term tailing the maillog and I'd keep an eye out for the
patterns that indicated it was hung.  When it got really bad I'd simply
have a cron restart MD and Sendmail every 5 minutes.  Oddly enough the
problem was frequently happening seconds before the 5 minute cron ran.  
Then after I first replied to you I had a good two days of no problems.  
That included time when I was manually bouncing thousands of queued spam
to the FTC and NANAS.  Now all the box handles is my own mailing list 
mail, mail from all my ISP's customers scoring >= 10, and mail for my 
spamtraps.  The latter two get munged and auto-forwarded to the FTC and 
NANAS as well as reported via Pyzor and Razor.
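
Watching the maillog by eye could be scripted instead of restarting blindly
every 5 minutes.  What follows is only a minimal Python sketch, assuming the
three message fragments quoted earlier in this thread are what actually shows
up in the maillog when MD is hung.  The names HANG_PATTERNS, hung_lines and
looks_hung are made up for illustration, and the exact log wording on a given
box may differ.

```python
import re

# Substrings taken from the messages quoted in this thread. The exact wording
# in a real maillog may differ, so treat these as assumptions.
HANG_PATTERNS = [
    "timeout before data read",
    "to error state",
    "init failed to open",
]

# One compiled alternation of the escaped substrings.
_HANG_RE = re.compile("|".join(re.escape(p) for p in HANG_PATTERNS))


def hung_lines(lines):
    """Return the log lines that contain any hang-indicating substring."""
    return [line for line in lines if _HANG_RE.search(line)]


def looks_hung(lines, threshold=1):
    """True if at least `threshold` suspicious lines appear in `lines`."""
    return len(hung_lines(lines)) >= threshold
```

A cron job could feed the last few hundred maillog lines to looks_hung() and
restart MD and Sendmail only when it returns True.  That would also leave a
record of exactly which lines triggered each restart.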

One thing I did notice today was that SA wasn't being compiled and
installed from source nightly like I'd set it up to.  In 2.60 the
configure script asks you to define what website you want users to be
directed to for information about the spam filtering.  Their hope is that
the individual mail admins will set up their own pages to answer the
questions of their users directly rather than directing them to SA's
website.  Unfortunately there isn't a configure switch I can use to define
CONTACT_ADDRESS in a script.  I now have to configure it by hand or whip up
an expect script.  I asked the SA devs to add this option to the configure 
scripts.

> Oh, one other thing I changed that seemed to incrementally help (based
> on a posting someone made to this list, or maybe the SA list): I
> increased the sendmail Timeout.quit (confTO_QUIT) to 5m, Timeout.misc
> (confTO_MISC) to 4m and Timeout.control (confTO_CONTROL) to 4m.

http://www.sendmail.org/m4/tweaking_config.html

I'm not sure if TO_QUIT and TO_MISC have much bearing on milters, do they?  
TO_CONTROL might though.  I'm not sure.  I'll give them a try.  I can't 
figure out how everything worked peachy and then suddenly BOOM.  Now I 
have a problem.  Odd.
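
For anyone else trying the same thing, the three settings go into sendmail.mc
as m4 define lines.  The Python sketch below just prints those lines using the
values Bill quoted.  The helper names mc_define and render_timeouts are
hypothetical, and the values are a starting point rather than a known fix.

```python
# Timeout values as quoted in this thread (Timeout.quit, Timeout.misc,
# Timeout.control). Whether they affect milters at all is an open question.
TIMEOUTS = {
    "confTO_QUIT": "5m",
    "confTO_MISC": "4m",
    "confTO_CONTROL": "4m",
}


def mc_define(name, value):
    """Render one sendmail.mc define using m4's backquote/apostrophe quoting."""
    return "define(`%s', `%s')dnl" % (name, value)


def render_timeouts(timeouts=TIMEOUTS):
    """Render all defines, one per line, in the order given."""
    return "\n".join(mc_define(k, v) for k, v in timeouts.items())


print(render_timeouts())
```

The printed lines would be pasted into sendmail.mc, after which sendmail.cf
has to be rebuilt and Sendmail restarted for them to take effect.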

> My next step (if it hadn't started working) would have been to
> instrument the mimedefang code with additional debug print messages
> to see if I could find why the child processes were dying off and
> not restarting.

Ah.  This would be beyond my skills.  I wonder if there's a debug mode for 
the multiplexor...  I had increased my milter log level to 13 after I 
first replied to you but I could never catch MD going nuts so I never got 
to see if there was anything useful there.

> Sorry I can't be much more help. You might try the sendmail timeouts,
> though, and see if they make any difference.

You've given me a lot more ideas than I would have come up with on my own.  
Thanks!

Justin