[Mimedefang] odd behavior under load from boucnes

Mon Dec 5 12:27:04 EST 2005

Hi Folks!

I'm having an issue with mimedefang running on one of our systems, namely
our listserv machine.  It appears that whenever we get a high number of
boucnes from a list post, we see a good number of slaves get started by
the multiplexor, they'll process a message or two, and then seem to loose
communication with the miltiplexor and be killed causing their connection
to get tempfailed.

A log entry for one slave looks like this:
> Dec  5 10:48:51 box mimedefang-multiplexor[22221]: stats 1133797731.857 StartSlave slave=30 nslaves=17 nbusy=16 reason="About to execute command 'relayok 65.54.248.219 omc1-s17.bay6.hotmail.com '"
> Dec  5 10:48:52 box sendmail[14980]: jB5FmY9I014980: from=<>, size=2614, class=0, nrcpts=1, msgid=<10512051107.AA12213364 at unitedwayhallcounty.org>, proto=ESMTP, daemon=MTA, relay=204-15-55-37.redclayinteractive.com [204.15.55.37] (may be forged)
> Dec  5 10:48:59 box mimedefang[15043]: MDLOG,jB5FmY9I014980,mail_in,,,<>,<owner-hallcolibrary-l at listserv.uga.edu>, Undeliverable Mail
> Dec  5 10:49:08 box sendmail[14980]: jB5FmY9I014980: Milter add: header: X-Scanned-By: Digested by UGA Mail Gateway on 128.192.1.75
> Dec  5 10:49:07 box mimedefang[22235]: jB5FmY9I014980: Filter time is 14099ms
> Dec  5 10:49:18 box sendmail[15191]: jB5FmY9I014980: to="|lsv_amin spool -t owner-hallcolibrary-l", ctladdr=<owner-hallcolibrary-l at LISTSERV.UGA.EDU> (8/0), delay=00:00:26, xdelay=00:00:01, mailer=prog, pri=32924, dsn=2.0.0, stat=Sent
> Dec  5 10:49:19 box mimedefang-multiplexor[22221]: handleSlaveReceivedCommand(35): Timeout or error: Flag = 3: Connection timed out
> Dec  5 10:49:19 box mimedefang-multiplexor[22221]: Killing busy slave 35 (pid 15043): Error talking to slave process
> Dec  5 10:49:19 box mimedefang-multiplexor[22221]: stats 1133797759.076 KillSlave slave=35 nslaves=25 nbusy=23 reason="Error talking to slave process"
> Dec  5 10:49:20 box mimedefang-multiplexor[22221]: Reap: Killed slave 35 (pid 15043) exited due to SIGTERM/SIGKILL as expected.
> Dec  5 10:49:20 box mimedefang-multiplexor[22221]: Slave 35 resource usage: req=2, scans=1, user=1.770, sys=0.110, nswap=0, majflt=593, minflt=6091, maxrss=0, bi=0, bo=0

As you can see, the filter runs fine and mail is delivered, but then seems
to hang.  Generally, we'd see this happen under any form of load until
building mimedefang using the --add-pthread-flag so it gets linked with
NPTL.  Now it only happens during bounce storms (30ish concurrent
connections) and is much quicker to heal its' self.  Load for the system
during this time gets high, around 6ish.

I guess my questions are:
  - Where would/does the handleSlaveReceivedCommand come from and what does
    this mean?  (Google yields very little other than many people asking
    this same question :)

  - Since building with --add-pthread-flag helped some, this seems to me
    like a threading bug, but could it be a config issue?

  - Where would one go about looking for more information about this?

System is 
- RHEL3, patched, with a SMP kernel
- mimedefang-2.53 built with the --add-pthread-flag for NPTL compliance

   The multiplexor is started like so:
   - LOG_FILTER_TIME=yes
   - MX_USER=defang
   - MX_RELAY_CHECK=yes
   - MX_LOG=yes
   - SYSLOG_FACILITY=mail
   - MX_SLAVE_DELAY=3
   - MX_MIN_SLAVE_DELAY=0
   - MX_STATS_SYSLOG=yes
   - MX_REQUESTS=500
   - MX_MINIMUM=10
   - ALLOW_NEW_CONNECTIONS_TO_QUEUE=yes
   - MX_MAXIMUM=50
   - MX_STATUS_UPDATES=yes
   - MX_IDLE=300
   - MX_BUSY=600
   - MX_QUEUE_SIZE=30
   - MX_QUEUE_TIMEOUT=30
   - MD_EXTRA="-x 'Digested by UGA Mail Gateway'"

   The mimedefang-filter is a slightly modified version of the sample one
   distributed, mainly, filter_multipart() simple returns action_accept()
   at the beginning because, at this begining point, we're not interested
   in manipulating message bodies.

- clamav/clamd 0.87.1
   - using message_contains_virus_clamd() in mimedefang-filter
- spamassassin-3.0.4
   - AWL and Bayes are turned off to prevent locking issues
- ProLiant ML370 G2 (2x 1.2Ghz p3)
- 4GB RAM
- 3 Disk RAID5 of 72k RPM disks, on which everything sits including the
   MIMEDefang spool.

Thanks for any help y'all can give.

-n

-- 
-------------------------------------------
nathan hruby <nhruby at uga.edu>
uga enterprise information technology services
core services  support
-------------------------------------------
"In 1972 a crack commando unit was sent to
  prison by a military court for a crime they
  didn't commit...."