More imapd Storms
[posted 2000/04/28]
Mainly because of our ongoing mail server crisis, it's been another harrowing week. To make a long story short, we are still seeing imapd storms, but less frequently. The previously described ProcNumChk PIKT scripts are doing their job, archiving imperiled user mail files and killing off excess imapds. One interim "fix"--making our mail server an NIS slave server (so that it could serve NIS to itself)--blew up in our faces. A documented bug introduced in the Solaris 2.7 NIS code gives rise, under occasional and still mysterious circumstances, to "ypserv storms"--where ypserv processes multiply like crazy to the point of a system crash. Moving this NIS slave service off to a newly built Solaris 2.6 machine fixed that problem. Oy vey!!
This is a followup to the ProcNumChk scripts I posted earlier in the week. Below, you will see a revised ProcNumChkEmergency and a new ProcNumChkRed. For the latter, I have created a new "Red Alert" that runs on the mail server every five minutes, give or take one minute (because I have set the timing drift to 1). Note that moscow is our mail server and nantes is our new NIS server.
#if moscow | nantes

ProcNumChkEmergency

        init
                status active
                level emergency
                task "Report unusually high numbers of per-user processes."
                input proc "=ps -eo user,comm | =sort | =uniq -c"
                dat #count 1
                dat $user 2
                dat $proc 3

        rule    // for gathering diagnostic stats
                if #count >=
# if moscow
                             20
# elseif nantes
                             10
# endif
                        output log "=logdir/ProcNumChkEmergency.log" $inline
                fi

        rule    // report if the per-user proc count exceeds a certain
                // threshold
# if moscow
                if $proc =~ "imapd"
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "sendmail"
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "http"      // httpd|httpsd
                        if #count >= 40
                                output mail $inline
                        fi
                        next
                fi
# endif // moscow
                // the default case (including "ypserv")
                if #count >=
# if moscow
                             20
# elseif nantes
                             10
# endif
                        output mail $inline
                        next
                fi

#endif // moscow | nantes
-------------------------------------------------------------------------------
#if moscow

ProcNumChkRed   // with higher thresholds than ProcNumChkEmergency, and
                // additional corrective steps; runs more often, too

        init
                status active
                level emergency
                task "Report unusually high numbers of per-user processes."
                input proc "=ps -eo user,comm | =sort | =uniq -c"
                dat #count 1
                dat $user 2
                dat $proc 3

        rule    // for gathering diagnostic stats
                if #count >= 30
                        output log "=logdir/ProcNumChkRed.log" $inline
                fi

        rule    // report if the per-user proc count exceeds a certain
                // threshold
                // in the case of exceedingly high imapd counts,
                // archive the user mail file and kill off all imapd's
                // for that user also
                if $proc =~ "imapd"
                        if #count >= 90
                                output mail $inline
                                =archive_mail_file($user, #true())
                                =kill_user_proc("imapd", $user, #count,
                                                90, #true())
                        fi
                        next
                fi
                if $proc =~ "sendmail"
                        if #count >= 90
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "http"      // httpd|httpsd
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                // the default case (including "ypserv")
                if #count >= 30
                        output mail $inline
                        next
                fi

#endif // moscow
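For readers who don't read PIKT fluently, here is a rough Python restatement of the counting and threshold logic in ProcNumChkRed. It is only a sketch, not part of the PIKT setup: the names count_user_procs, over_threshold, RED_THRESHOLDS, and RED_DEFAULT are made up for illustration, and the numbers are simply copied from the script above.

```python
import re
from collections import Counter

# Thresholds copied from ProcNumChkRed. Patterns are searched as regexps,
# mirroring =~ in the script, so "http" also matches httpd and httpsd.
RED_THRESHOLDS = [("imapd", 90), ("sendmail", 90), ("http", 60)]
RED_DEFAULT = 30  # the default case (including "ypserv")


def count_user_procs(ps_lines):
    """Tally (user, command) pairs, like `ps -eo user,comm | sort | uniq -c`."""
    counts = Counter()
    for line in ps_lines:
        fields = line.split(None, 1)
        if len(fields) == 2:  # ignore blank or malformed lines
            counts[(fields[0], fields[1].strip())] += 1
    return counts


def over_threshold(counts, thresholds=RED_THRESHOLDS, default=RED_DEFAULT):
    """Return sorted (user, command, count) rows that would trigger mail."""
    alerts = []
    for (user, comm), n in counts.items():
        limit = default
        for pattern, value in thresholds:
            if re.search(pattern, comm):
                limit = value
                break  # first matching case wins, as with `next` in the script
        if n >= limit:
            alerts.append((user, comm, n))
    return sorted(alerts)


# Hypothetical sample data, not real output from moscow.
sample = ["alice imapd"] * 95 + ["bob imapd"] * 12 + ["root ypserv"] * 31
print(over_threshold(count_user_procs(sample)))
# [('alice', 'imapd', 95), ('root', 'ypserv', 31)]
```

Swapping in the ProcNumChkEmergency numbers (60, 60, 40, and a default of 20 on moscow) gives the lower-threshold variant.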
The ProcNumChkRed script refers to the following two new macros (defined in macros.cfg):
archive_mail_file(U, M) // archive a user's mail file
        // (U) is the user (e.g., $user)
        // (M) is whether or not to output mail
        // (e.g., #true())
        if -e "/var/mail/(U)"
                exec wait "=cp -p /var/mail/(U) /var/mail/arc/(U)" .
                          "." . $text(#now())
                if (M)
                        output mail "saved user mail file as /var/mail/arc/(U)" .
                                    "." . $text(#now())
                fi
        fi
        if -e "/var/mail/." . (U) . ".pop"
                // #now()+1 to guarantee no conflict
                // with the preceding cp
                exec wait "=cp -p /var/mail/." . (U) .
                          ".pop /var/mail/arc/(U)" .
                          "." . $text(#now()+1)
                if (M)
                        output mail "saved user mail file as /var/mail/arc/(U)" .
                                    "." . $text(#now()+1)
                fi
        fi
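As a cross-check on what archive_mail_file does, the same steps can be sketched in Python. This is an illustration under assumptions, not the macro itself: it assumes #now() yields an integer timestamp, it takes the spool directory as a parameter so it can be tried outside /var/mail, and shutil.copy2 only approximates `cp -p` (it keeps mode and timestamps, not ownership). Like the macro, it expects the arc subdirectory to exist already.

```python
import os
import shutil
import time


def archive_mail_file(user, spool="/var/mail", now=None):
    """Copy a user's mailbox and .pop file (if present) into <spool>/arc.

    Returns the list of archive paths written.
    """
    now = int(time.time()) if now is None else now
    arc_dir = os.path.join(spool, "arc")
    # (source file, timestamp suffix); the .pop copy gets now+1 so the two
    # archive names cannot collide, as in the macro.
    candidates = [
        (os.path.join(spool, user), now),
        (os.path.join(spool, "." + user + ".pop"), now + 1),
    ]
    saved = []
    for src, stamp in candidates:
        if os.path.exists(src):
            dst = os.path.join(arc_dir, "%s.%d" % (user, stamp))
            shutil.copy2(src, dst)  # roughly `cp -p`
            saved.append(dst)
    return saved
```

Returning the list of saved paths stands in for the macro's optional "saved user mail file as ..." mail message.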
///////////////////////////////////////////////////////////////////////////////
kill_user_proc(P, U, C, T, M)   // kill off all instances of process for a
        // given user if the instance count exceeds
        // a given threshold
        // (P) is the process name (e.g., "imapd")
        // (U) is the user (e.g., $user)
        // (C) is the instance count (e.g., #count)
        // (T) is the instance threshold (e.g., 100)
        // (M) is whether or not to output mail
        // (e.g., #true())
        if (C) >= (T)
                set #killcount = (C)
                while #killcount > 0
                        set #killcount = 0
                        do #popen(KILL, "=ps -eo pid,user,comm", "r")
                        while #read(KILL) > 0
                                if #parse($readline) != 3
                                        cont
                                fi
                                // save in case we do a
                                // subsequent regexp op
                                set $p = $1
                                set $u = $2
                                set $c = $3
                                if $u eq (U) && $c eq (P)
                                        exec wait "=kill -9 $p"
                                        set #killcount += 1
                                fi
                        endwhile
                        do #pclose(KILL)
                endwhile
                if (M)
                        output mail "killed all " . (U) . " " . (P) .
                                    " processes"
                fi
        fi
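The repeat-until-clean loop in kill_user_proc (rescan the process table, kill every match, and go around again until a pass finds nothing) may be easier to follow in a Python sketch. This is an illustration only. The lister and killer parameters and the max_passes cap are additions that do not appear in the macro: the first two make the function testable without killing anything, and the cap guards against spinning forever on a process that refuses to die.

```python
import os
import signal
import subprocess


def list_procs():
    """Return (pid, user, command) tuples from `ps -eo pid,user,comm`."""
    out = subprocess.run(["ps", "-eo", "pid,user,comm"],
                         capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) == 3:
            rows.append((int(fields[0]), fields[1], fields[2].strip()))
    return rows


def kill_user_proc(proc, user, count, threshold,
                   lister=list_procs, killer=os.kill, max_passes=10):
    """Kill every `proc` owned by `user` if count >= threshold.

    Returns the total number of kill signals sent.
    """
    total = 0
    if count < threshold:
        return total
    killed = count  # force at least one pass, like `set #killcount = (C)`
    passes = 0
    while killed > 0 and passes < max_passes:
        killed = 0
        passes += 1
        for pid, owner, comm in lister():
            if owner == user and comm == proc:
                try:
                    killer(pid, signal.SIGKILL)
                except ProcessLookupError:
                    continue  # exited between the ps and the kill
                killed += 1
        total += killed
    return total
```

Called as kill_user_proc("imapd", "alice", 95, 90), it would rescan until no imapd owned by alice remains (or the pass cap is reached), much as the macro reruns ps until #killcount stays at zero.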
These are still works in progress, and they still have an ad hoc quality about them. I haven't yet considered applying them beyond the two machines (moscow, the mail server, and nantes, the new NIS server) and generalizing them accordingly. Also, I'd like to make further use of macros here, but I want to do things sensibly and after some reflection.
The important point is that they are doing the job and helping us to cope with these imapd storms.
For more examples, see Developer's Notes.