More imapd Storms
[posted 2000/04/28]
Mainly because of our ongoing mail server crisis, it's been another harrowing week. To make a long story short, we are seeing more imapd storms, but at a reduced frequency. The afore-described ProcNumChk PIKT scripts are doing their job, archiving imperiled user mail files and killing off excess imapds. One interim "fix"--making our mailserver an NIS slave server (so that it could NIS serve itself)--blew up in our faces. A documented bug introduced in the Solaris 2.7 NIS code gives rise, under occasional and still mysterious circumstances, to "ypserv storms"--where ypserv processes will multiply like crazy to the point of a system crash. Moving this NIS slave service off to a newly made Solaris 2.6 machine fixed that problem. Oy vey!!
This is a follow-up to the ProcNumChk scripts I posted earlier in the week. Below, you will see a revised ProcNumChkEmergency and a new ProcNumChkRed. For the latter, I have created a new "Red Alert" that runs on the mail server every five minutes, give or take one minute (because I have set the timing drift to 1). Note that moscow is our mail server and nantes is our new NIS server.
#if moscow | nantes

ProcNumChkEmergency

        init
                status active
                level emergency
                task "Report unusually high numbers of per-user processes."
                input proc "=ps -eo user,comm | =sort | =uniq -c"
                dat #count 1
                dat $user 2
                dat $proc 3

        rule    // for gathering diagnostic stats
                if #count >=
# if moscow
                             20
# elseif nantes
                             10
# endif
                        output log "=logdir/ProcNumChkEmergency.log" $inline
                fi

        rule    // report if the per-user proc count exceeds a certain
                // threshold
# if moscow
                if $proc =~ "imapd"
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "sendmail"
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "http"      // httpd|httpsd
                        if #count >= 40
                                output mail $inline
                        fi
                        next
                fi
# endif // moscow
                // the default case (including "ypserv")
                if #count >=
# if moscow
                             20
# elseif nantes
                             10
# endif
                        output mail $inline
                        next
                fi

#endif // moscow | nantes

-------------------------------------------------------------------------------

#if moscow

ProcNumChkRed   // with higher thresholds than ProcNumChkEmergency, and
                // additional corrective steps; runs more often, too

        init
                status active
                level emergency
                task "Report unusually high numbers of per-user processes."
                input proc "=ps -eo user,comm | =sort | =uniq -c"
                dat #count 1
                dat $user 2
                dat $proc 3

        rule    // for gathering diagnostic stats
                if #count >= 30
                        output log "=logdir/ProcNumChkRed.log" $inline
                fi

        rule    // report if the per-user proc count exceeds a certain
                // threshold
                // in the case of exceedingly high imapd counts,
                // archive the user mail file and kill off all imapd's
                // for that user also
                if $proc =~ "imapd"
                        if #count >= 90
                                output mail $inline
                                =archive_mail_file($user, #true())
                                =kill_user_proc("imapd", $user, #count, 90, #true())
                        fi
                        next
                fi
                if $proc =~ "sendmail"
                        if #count >= 90
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "http"      // httpd|httpsd
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                // the default case (including "ypserv")
                if #count >= 30
                        output mail $inline
                        next
                fi

#endif // moscow
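For readers who don't know PIKT, the counting-and-threshold logic of ProcNumChkRed can be sketched in Python. This is only a rough illustration, not how PIKT itself works internally: the function names and the dictionary layout are my own, the thresholds are copied from the script above, and a plain substring test stands in for the =~ match (equivalent for these literal patterns).

```python
from collections import Counter

# Thresholds taken from the ProcNumChkRed rules; the dict layout is an
# illustrative assumption, not a PIKT construct.
RED_THRESHOLDS = {"imapd": 90, "sendmail": 90, "http": 60}
DEFAULT_THRESHOLD = 30  # the default case (including "ypserv")


def count_procs(ps_lines):
    """Tally (user, command) pairs, like `ps -eo user,comm | sort | uniq -c`."""
    counts = Counter()
    for line in ps_lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        counts[(parts[0], parts[1].strip())] += 1
    return counts


def threshold_for(proc):
    """Return the alert threshold for a command name."""
    for pattern, limit in RED_THRESHOLDS.items():
        if pattern in proc:  # "http" covers both httpd and httpsd
            return limit
    return DEFAULT_THRESHOLD


def over_threshold(counts):
    """List (user, command, count) for every pair at or above its threshold."""
    return sorted(
        (user, proc, n)
        for (user, proc), n in counts.items()
        if n >= threshold_for(proc)
    )
```

As in the PIKT script, the `ps` header line is simply counted once and never reaches any threshold, so it needs no special handling.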
The ProcNumChkRed script refers to the following two new macros (defined in macros.cfg):
archive_mail_file(U, M)         // archive a user's mail file
                                // (U) is the user (e.g., $user)
                                // (M) is whether or not to output mail
                                //     (e.g., #true())
        if -e "/var/mail/(U)"
                exec wait "=cp -p /var/mail/(U) /var/mail/arc/(U)" . "." . $text(#now())
                if (M)
                        output mail "saved user mail file as /var/mail/arc/(U)" . "." . $text(#now())
                fi
        fi
        if -e "/var/mail/." . (U) . ".pop"
                // #now()+1 to guarantee no conflict
                // with the preceding cp
                exec wait "=cp -p /var/mail/." . (U) . ".pop /var/mail/arc/(U)" . "." . $text(#now()+1)
                if (M)
                        output mail "saved user mail file as /var/mail/arc/(U)" . "." . $text(#now()+1)
                fi
        fi

///////////////////////////////////////////////////////////////////////////////

kill_user_proc(P, U, C, T, M)   // kill off all instances of process for a
                                // given user if the instance count exceeds
                                // a given threshold
                                // (P) is the process name (e.g., "imapd")
                                // (U) is the user (e.g., $user)
                                // (C) is the instance count (e.g., #count)
                                // (T) is the instance threshold (e.g., 100)
                                // (M) is whether or not to output mail
                                //     (e.g., #true())
        if (C) >= (T)
                set #killcount = (C)
                while #killcount > 0
                        set #killcount = 0
                        do #popen(KILL, "=ps -eo pid,user,comm", "r")
                        while #read(KILL) > 0
                                if #parse($readline) != 3
                                        cont
                                fi
                                // save in case we do a
                                // subsequent regexp op
                                set $p = $1
                                set $u = $2
                                set $c = $3
                                if $u eq (U) && $c eq (P)
                                        exec wait "=kill -9 $p"
                                        set #killcount += 1
                                fi
                        endwhile
                        do #pclose(KILL)
                endwhile
                if (M)
                        output mail "killed all " . (U) . " " . (P) . " processes"
                fi
        fi
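The two ideas in these macros (timestamped archive names that cannot collide, and rescanning until a pass finds nothing left to kill) can be sketched in Python as follows. Treat this as a hedged illustration rather than a translation: the function names are hypothetical, `now` is assumed to be an integer timestamp like the one #now() appears to yield, the process lister and killer are passed in as plain callables, and the `max_passes` guard is my own addition that the macro does not have.

```python
def archive_names(user, now, mail_dir="/var/mail"):
    """Return (source, destination) pairs for the mailbox and the .pop file.

    The second destination uses now + 1 so the two archive copies
    never share a name, mirroring the #now()+1 trick in the macro.
    """
    arc_dir = f"{mail_dir}/arc"
    return [
        (f"{mail_dir}/{user}", f"{arc_dir}/{user}.{now}"),
        (f"{mail_dir}/.{user}.pop", f"{arc_dir}/{user}.{now + 1}"),
    ]


def kill_until_gone(list_procs, kill, user, proc, max_passes=10):
    """Scan and kill repeatedly until a scan finds no matching process.

    list_procs() must return (pid, user, command) tuples and kill(pid)
    must terminate one process; both are injected so the loop can be
    exercised without touching real processes. max_passes is a
    hypothetical safety limit, not part of the original macro.
    """
    total = 0
    for _ in range(max_passes):
        killed = 0
        for pid, owner, command in list_procs():
            if owner == user and command == proc:
                kill(pid)
                killed += 1
        if killed == 0:
            break  # a clean pass: nothing left to kill
        total += killed
    return total
```

In real use, `list_procs` would presumably wrap the output of `ps -eo pid,user,comm` and `kill` would send SIGKILL, as the macro does with `kill -9`. The repeated scan matters during a storm because new imapds can appear while the previous pass is still running.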
These are still works in progress, and they still have an ad hoc quality about them. I haven't yet considered applying them beyond the two machines (moscow, the mail server, and nantes, the new NIS server) and generalizing them accordingly. Also, I'd like to make greater use of macros here, but I want to do things sensibly and after some reflection.
The important point is that they are doing the job and helping us to cope with these imapd storms.
For more examples, see Developer's Notes.