We are in the midst of an email crisis. It stems from merging our alumni mail operation (32,000+ users) with our main mail operation (4,000+ current users), and from rolling out new mail services such as Web-based email access.
I'll spare you most of the details, except these: We are running IMP, the "Imap webMail Program", put out by the Horde Project (www.horde.org). A bug in our version of IMP (2.0) has the imapd, under occasional and still mysterious circumstances, spawning instances of itself every second or so. The good news is that this is a known bug that will be fixed in the next version of IMP (still in beta). The bad news is that, for a handful of users, we are seeing occasional "imapd storms" with per-user imapd counts reaching into the dozens, hundreds, and sometimes even thousands. The highest recorded count so far is 3816! =:0
Not only do these imapd storms risk losing the user's mail file, they also imperil the entire system, as you can imagine.
What to do? Ideally, we/they fix the underlying software, or come up with some configuration tweaks to allay the problem. But discovering and implementing these takes time. In the meantime, PIKT to the rescue!
The following is a new PIKT script we have put into operation on our mail server to deal with these problems:
ProcNumChkEmergency

    init
        status active
        level emergency
        task "Report unusually high numbers of per-user procs."
        input proc "=ps -eo user,comm | =sort | =uniq -c"
        dat #count 1
        dat $user 2
        dat $proc 3

    rule    // for gathering diagnostic stats

        if #count >= 20
            output log "=logdir/ProcNumChkEmergency.log" $inline
        fi

    rule    // report if the per-user proc count exceeds a certain
            // threshold; in the case of exceedingly high imapd
            // counts, kill off all imapd's for that user also

        if $proc =~ "sendmail"
            if #count >= 60
                output mail $inline
            fi
        elseif $proc =~ "imapd"
            if #count >= 60
                output mail $inline
                // archive the user's mail file
                if -e "/var/mail/$user"
                    exec wait "=cp -p /var/mail/$user /var/mail/arc/$user" . "." . $text(#now())
                    output mail "saved user mail file as /var/mail/arc/$user" . "." . $text(#now())
                fi
                // #now()+1 to guarantee no conflict
                // with the preceding cp
                if -e "/var/mail/." . $user . ".pop"
                    exec wait "=cp -p /var/mail/." . $user . ".pop /var/mail/arc/$user" . "." . $text(#now()+1)
                    output mail "saved user mail file as /var/mail/arc/$user" . "." . $text(#now()+1)
                fi
                // kill off all imapd's if count is exceedingly
                // high
                if #count >= 100
                    set #killcount = #count
                    while #killcount > 0
                        set #killcount = 0
                        do #popen(KILL, "=ps -eo pid,user,comm", "r")
                        while #read(KILL) > 0
                            if #split($readline) != 3
                                cont
                            fi
                            // save in case we do a
                            // subsequent regexp op
                            set $p = $1
                            set $u = $2
                            set $c = $3
                            if $u eq $user && $c eq "imapd"
                                exec wait "=kill -9 $p"
                                set #killcount += 1
                            fi
                        endwhile
                        do #pclose(KILL)
                    endwhile
                    output mail "killed all $user imapd processes"
                fi
            fi
        elseif $proc =~ "httpsd|httpd"
            if #count >= 40
                output mail $inline
            fi
        else    // all other procs
            if #count >= 20
                output mail $inline
            fi
        fi
This is a work-in-progress. I'd like to render some common elements as macros, and perhaps refer to a *.obj file matching proc names to instance counts. Still, this bandaid is working well enough in the short run.
The input statement

    input proc "=ps -eo user,comm | =sort | =uniq -c"

yields input like

     1 harpo    imapd
     1 root     /opt/apache/bin/httpd
     1 root     /opt/apache_secure/bin/httpsd
    34 root     /usr/lib/sendmail
    10 root     /opt/apache/bin/httpd
     2 daemon   /usr/lib/ab2/dweb/sunos5/bin/dwhttpd
    10 webown   /opt/apache_secure/bin/httpsd
     1 groucho  imapd
     1 chico    imapd
     1 zeppo    imapd
    ...
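For readers who don't have PIKT handy, here is a rough Python sketch of what that pipeline computes: a tally of identical (user, command) pairs. This is not how PIKT does it internally; the function name count_user_procs and the sample lines are made up for illustration.

```python
from collections import Counter

def count_user_procs(ps_lines):
    """Tally (user, command) pairs, like `ps -eo user,comm | sort | uniq -c`."""
    counts = Counter()
    for line in ps_lines:
        fields = line.split(None, 1)  # user first, the rest is the command
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        user, comm = fields[0], fields[1].strip()
        counts[(user, comm)] += 1
    return counts

# Hypothetical ps output lines (header line already removed)
sample = ["groucho imapd", "root /usr/lib/sendmail", "groucho imapd", "harpo imapd"]
counts = count_user_procs(sample)
for (user, comm), n in sorted(counts.items()):
    print(n, user, comm)
```

Each printed line corresponds to one input line of the PIKT script, with the count, user, and proc fields in the same order as the dat statements.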
The first rule logs some diagnostic stats, stats we will study later to discern a pattern (hopefully).
In the second rule, for certain processes of special interest, if their per-user process count exceeds the threshold, we send alert mail reporting that.
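The threshold logic of the second rule can be sketched as a small lookup table, which is roughly what a *.obj file of proc names and counts would give us. The Python below is only an illustration, and it assumes that PIKT's =~ behaves like an unanchored regular-expression search; the names THRESHOLDS, alert_threshold, and should_alert are invented here. The numbers are the ones in the script above.

```python
import re

# (pattern, threshold) pairs, in the order the rule tests them
THRESHOLDS = [
    (r"sendmail", 60),
    (r"imapd", 60),
    (r"httpsd|httpd", 40),
]
DEFAULT_THRESHOLD = 20  # all other procs

def alert_threshold(proc):
    """Return the alert threshold for the first pattern that matches proc."""
    for pattern, limit in THRESHOLDS:
        if re.search(pattern, proc):  # assumption: =~ is an unanchored match
            return limit
    return DEFAULT_THRESHOLD

def should_alert(count, proc):
    """True when the per-user count has reached the proc's threshold."""
    return count >= alert_threshold(proc)

print(should_alert(34, "/usr/lib/sendmail"))  # 34 is below the sendmail threshold
print(should_alert(404, "imapd"))             # 404 is well above the imapd threshold
```

Keeping the patterns in one table means a new process of interest is a one-line addition rather than another elseif branch.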
In the case of imapd only, once the count reaches 60 we archive the user's mail file by cp'ing it (and its .$user.pop variant, if one exists) to the /var/mail/arc directory, with a timestamp appended to the file name.
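The archive naming is worth spelling out: each copy gets the epoch time as a suffix, and the .pop copy uses the time plus one so the two names never collide. Here is a minimal Python sketch of the same idea, not the PIKT code itself. The function names and the mail_dir and now parameters are hypothetical, and shutil.copy2 only approximates cp -p (it preserves timestamps and mode, not ownership).

```python
import os
import shutil
import time

def archive_targets(user, mail_dir="/var/mail", now=None):
    """Return (source, destination) pairs for the user's two possible mail files."""
    stamp = int(time.time()) if now is None else int(now)
    return [
        (f"{mail_dir}/{user}", f"{mail_dir}/arc/{user}.{stamp}"),
        # stamp + 1 guarantees no conflict with the preceding copy
        (f"{mail_dir}/.{user}.pop", f"{mail_dir}/arc/{user}.{stamp + 1}"),
    ]

def archive_mail(user, mail_dir="/var/mail", now=None):
    """Copy whichever mail files exist into the arc directory; return the saved paths."""
    saved = []
    for src, dst in archive_targets(user, mail_dir, now):
        if os.path.exists(src):
            shutil.copy2(src, dst)
            saved.append(dst)
    return saved

for src, dst in archive_targets("groucho", now=956639822):
    print(src, "->", dst)
```

The first destination printed matches the file name in the sample alert further down.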
If the per-user process count reaches 100, it's time to kill that user's imapd's. We do a "ps -eo pid,user,comm", and for every $user-imapd pair, we "kill -9" the corresponding pid.
Because more imapd's may have spawned in the time it takes to kill off the first batch, we keep looping until the #killcount is 0--i.e., there were no kills done in the last go-around.
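The kill-until-quiet loop can be sketched outside PIKT as follows. This Python version takes the process lister and the kill action as parameters so it can be tried against a simulated process table; on a real system they might wrap a ps call and os.kill with SIGKILL. The name kill_until_quiet and the max_passes safety cap are additions for this sketch and are not in the PIKT script.

```python
def kill_until_quiet(user, comm, list_procs, kill, max_passes=50):
    """Repeat list-and-kill passes until a pass kills nothing; return total kills."""
    total = 0
    for _ in range(max_passes):  # assumption: cap the passes as a safety net
        killed = 0
        for pid, proc_user, proc_comm in list_procs():
            if proc_user == user and proc_comm == comm:
                kill(pid)
                killed += 1
        total += killed
        if killed == 0:  # no kills in the last go-around, so we are done
            break
    return total

# Simulated process table: one extra groucho imapd spawns during the first pass
table = {101: ("groucho", "imapd"), 102: ("groucho", "imapd"), 103: ("harpo", "imapd")}
respawn = [(104, ("groucho", "imapd"))]

def fake_list():
    return [(pid, u, c) for pid, (u, c) in table.items()]

def fake_kill(pid):
    del table[pid]
    if respawn:
        new_pid, entry = respawn.pop()
        table[new_pid] = entry

total = kill_until_quiet("groucho", "imapd", fake_list, fake_kill)
print(total, table)
```

In the simulation the second pass catches the late arrival, and the third pass finds nothing left to kill, which ends the loop. Other users' imapd's are left alone.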
I can see where this might be made more efficient (e.g., pass the inner ps output through a grep of the user and imapd first). But, we've put this up quickly. It's working well enough so far.
I am also considering adding to the alarm message a dump of syslog with the relevant lines for that user-proc.
Here is a sample alert message:
                          PIKT ALERT
                   Tue Apr 25 00:17:02 2000
                            moscow

    URGENT:
        ProcNumChkUrgent
            Report unusually high numbers of per-user procs.

            404 groucho imapd

            saved user mail file as /var/mail/arc/groucho.956639822

            killed all groucho imapd processes

    _______________________________________________
    Systems mailing list
    Systems@moscow.uppity.edu
    http://moscow.uppity.edu/mailman/listinfo/systems
We still don't have an understanding of this problem, much less a fix, but at least we are not losing any more user email, and our mail server is coping. (Most users are unaware of these difficulties.)
Ultimately this is a configurational issue, in a sense, in that properly written software and/or properly tweaked configuration files (perhaps extending to base system files, not just the configuration files for IMP, imapd, etc.) would fix the problem. On the other hand, even with the best of software and "correct" configurations, zaniness might erupt from time to time, and for the life of you, you just can't figure it all out. This is why, it seems to me, you need a good system monitoring tool with auto-corrective capabilities, beyond tools that just help you with your configuration management. Of course, PIKT does both!
We had a working, stable mail setup until the alumni merge several weeks ago. Words to the wise: If it ain't broke, don't fix it!
But if it is broke, mend it in the end, and consider applying PIKT bandaids (tourniquets?) while the battle still rages.
For more examples, see Developer's Notes.