imapd Storms
[posted 2000/04/26]
We are in the midst of an email crisis stemming from the merger of our 32,000+ user alumni mail operation with our main 4,000+ current-user mail operation, and from the introduction of new mail services such as Web-based email access.
I'll spare you most of the details, except these: We are running IMP, the "Imap webMail Program", put out by the Horde Project (www.horde.org). A bug in our version of IMP (2.0) has the imapd, under occasional and still mysterious circumstances, spawning instances of itself every second or so. The good news is that this is a known bug that will be fixed in the next version of IMP (still in beta). The bad news is that, for a handful of users, we are seeing occasional "imapd storms" with per-user imapd counts reaching into the dozens, hundreds, and sometimes even thousands. The highest recorded count so far is 3816! =:0
Not only do these imapd storms risk losing the user's mail file, they also imperil the entire system, as you can imagine.
What to do? Ideally, we/they fix the underlying software, or come up with some configuration tweaks to allay the problem. But discovering and implementing these takes time. In the meantime, PIKT to the rescue!
The following is a new PIKT script we have put into operation on our mail server to deal with these problems:
ProcNumChkEmergency

    init
        status active
        level emergency
        task "Report unusually high numbers of per-user procs."
        input proc "=ps -eo user,comm | =sort | =uniq -c"
        dat #count 1
        dat $user 2
        dat $proc 3

    rule    // for gathering diagnostic stats
        if #count >= 20
            output log "=logdir/ProcNumChkEmergency.log" $inline
        fi

    rule    // report if the per-user proc count exceeds a certain
            // threshold; in the case of exceedingly high imapd
            // counts, kill off all imapd's for that user also
        if $proc =~ "sendmail"
            if #count >= 60
                output mail $inline
            fi
        elseif $proc =~ "imapd"
            if #count >= 60
                output mail $inline
                // archive the user's mail file
                if -e "/var/mail/$user"
                    exec wait "=cp -p /var/mail/$user /var/mail/arc/$user" .
                              "." . $text(#now())
                    output mail "saved user mail file as /var/mail/arc/$user" .
                                "." . $text(#now())
                fi
                // #now()+1 to guarantee no conflict
                // with the preceding cp
                if -e "/var/mail/." . $user . ".pop"
                    exec wait "=cp -p /var/mail/." . $user .
                              ".pop /var/mail/arc/$user" .
                              "." . $text(#now()+1)
                    output mail "saved user mail file as /var/mail/arc/$user" .
                                "." . $text(#now()+1)
                fi
                // kill off all imapd's if count is exceedingly
                // high
                if #count >= 100
                    set #killcount = #count
                    while #killcount > 0
                        set #killcount = 0
                        do #popen(KILL, "=ps -eo pid,user,comm", "r")
                        while #read(KILL) > 0
                            if #split($readline) != 3
                                cont
                            fi
                            // save in case we do a
                            // subsequent regexp op
                            set $p = $1
                            set $u = $2
                            set $c = $3
                            if    $u eq $user
                               && $c eq "imapd"
                                exec wait "=kill -9 $p"
                                set #killcount += 1
                            fi
                        endwhile
                        do #pclose(KILL)
                    endwhile
                    output mail "killed all $user imapd processes"
                fi
            fi
        elseif $proc =~ "httpsd|httpd"
            if #count >= 40
                output mail $inline
            fi
        else // all other procs
            if #count >= 20
                output mail $inline
            fi
        fi
This is a work-in-progress. I'd like to render some common elements as macros, and perhaps refer to a *.obj file matching proc names to instance counts. Still, this bandaid is working well enough in the short run.
Some elaboration:
The input statement
input proc "=ps -eo user,comm | =sort | =uniq -c"
yields input like
       1 harpo    imapd
       1 root     /opt/apache/bin/httpd
       1 root     /opt/apache_secure/bin/httpsd
      34 root     /usr/lib/sendmail
      10 root     /opt/apache/bin/httpd
       2 daemon   /usr/lib/ab2/dweb/sunos5/bin/dwhttpd
      10 webown   /opt/apache_secure/bin/httpsd
       1 groucho  imapd
       1 chico    imapd
       1 zeppo    imapd
       ...
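For readers without a PIKT install handy, here is a rough Python sketch of what that ps/sort/uniq pipeline computes: one count per user-command pair. The function name count_user_procs and the sample text are made up for illustration, and the sketch assumes ps prints a header line beginning with USER.

```python
from collections import Counter

def count_user_procs(ps_output):
    """Count processes per (user, command) pair, like `sort | uniq -c`."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        user, comm = fields[0], fields[1].strip()
        if user == "USER":
            continue  # skip the ps header line (an assumption about ps output)
        counts[(user, comm)] += 1
    return counts

# Illustrative input only, in the shape of `ps -eo user,comm` output.
sample = """USER COMMAND
groucho imapd
groucho imapd
root /usr/lib/sendmail
groucho imapd
"""

for (user, comm), n in sorted(count_user_procs(sample).items()):
    print(n, user, comm)
```

Each printed line has the same three fields (count, user, proc) that the dat statements in the script pick apart.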
The first rule logs some diagnostic stats, stats we will study later to discern a pattern (hopefully).
In the second rule, if the per-user count for a process reaches its threshold (60 for sendmail and imapd, 40 for httpd and httpsd, 20 for everything else), we send alert mail reporting that.
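The threshold logic of the second rule can be sketched in Python as an ordered pattern table. This is only an illustration of the if/elseif chain in the script; the names THRESHOLDS, threshold_for, and should_alert are mine, not anything PIKT provides.

```python
import re

# (pattern, threshold) pairs, checked in order like the script's if/elseif chain.
THRESHOLDS = [
    (r"sendmail", 60),
    (r"imapd", 60),
    (r"httpsd|httpd", 40),
]
DEFAULT_THRESHOLD = 20  # all other procs

def threshold_for(proc):
    """Return the alert threshold for a process name or path."""
    for pattern, limit in THRESHOLDS:
        if re.search(pattern, proc):  # unanchored match, like =~ in the script
            return limit
    return DEFAULT_THRESHOLD

def should_alert(count, proc):
    """True when the per-user count has reached the threshold for this proc."""
    return count >= threshold_for(proc)

print(should_alert(34, "/usr/lib/sendmail"))  # False: 34 is below 60
print(should_alert(404, "imapd"))             # True: 404 is at least 60
```

A table like this is also one way to picture the proc-name-to-count mapping that might eventually move out of the script body.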
In the case of imapd only, once the count reaches 60, we also archive the user's mail file by cp'ing it (and its .$user.pop variant, if present) to the /var/mail/arc directory, with the current epoch time appended to the file name.
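The archive naming scheme looks like this as a minimal Python sketch. The function archive_mailbox and its parameters are hypothetical; shutil.copy2 stands in for cp -p (it preserves timestamps and mode, but not ownership). Unlike the script, which calls #now() separately each time, the sketch takes the timestamp once.

```python
import shutil
import time
from pathlib import Path

def archive_mailbox(user, mail_dir, arc_dir, now=None):
    """Copy a user's mail file(s) into arc_dir with an epoch-time suffix.

    The .pop variant gets timestamp + 1 so its archive name cannot
    collide with the main mail file's archive name.
    """
    stamp = int(time.time()) if now is None else int(now)
    mail_dir, arc_dir = Path(mail_dir), Path(arc_dir)
    candidates = [
        (mail_dir / user, stamp),
        (mail_dir / f".{user}.pop", stamp + 1),
    ]
    saved = []
    for source, suffix in candidates:
        if source.exists():
            target = arc_dir / f"{user}.{suffix}"
            shutil.copy2(source, target)  # roughly `cp -p`
            saved.append(target)
    return saved
```

On the mail server described here, the call would be something like archive_mailbox("groucho", "/var/mail", "/var/mail/arc"); try it against scratch directories first.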
If the per-user process count reaches 100, it's time to kill that user's imapd's. We do a "ps -eo pid,user,comm", and for every $user-imapd pair, we "kill -9" the corresponding pid.
Because more imapd's may have spawned in the time it takes to kill off the first batch, we keep looping until the #killcount is 0--i.e., there were no kills done in the last go-around.
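The kill-until-quiet loop can be sketched in Python as follows. This is an illustration under assumptions, not a translation of the PIKT runtime: the names list_procs and kill_all are mine, and the max_passes guard is an addition the script does not have, there only so a process that never disappears cannot spin the loop forever.

```python
import os
import signal
import subprocess

def list_procs():
    """Return (pid, user, comm) tuples parsed from `ps -eo pid,user,comm`."""
    out = subprocess.run(["ps", "-eo", "pid,user,comm"],
                         capture_output=True, text=True, check=True).stdout
    procs = []
    for line in out.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) == 3:  # like the script, ignore lines that don't split in 3
            procs.append((int(fields[0]), fields[1], fields[2]))
    return procs

def kill_all(user, comm, list_procs=list_procs, kill=os.kill, max_passes=100):
    """Repeat ps-and-kill passes until a pass kills nothing; return total kills."""
    total = 0
    for _ in range(max_passes):  # guard not present in the original script
        killed = 0
        for pid, owner, name in list_procs():
            if owner == user and name == comm:
                try:
                    kill(pid, signal.SIGKILL)  # same effect as `kill -9`
                except ProcessLookupError:
                    continue  # process already gone between ps and kill
                killed += 1
        if killed == 0:
            break  # nothing new spawned since the last pass
        total += killed
    return total
```

With the defaults, kill_all("groucho", "imapd") really sends SIGKILL, so pass stand-in list_procs and kill functions when experimenting.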
I can see where this might be made more efficient (e.g., pass the inner ps output through a grep of the user and imapd first). But, we've put this up quickly. It's working well enough so far.
I am also considering adding to the alarm message a dump of syslog with the relevant lines for that user-proc.
Here is a sample alert message:
                          PIKT ALERT
                   Tue Apr 25 00:17:02 2000
                            moscow

    URGENT:
        ProcNumChkUrgent
            Report unusually high numbers of per-user procs.

            404 groucho imapd

            saved user mail file as /var/mail/arc/groucho.956639822

            killed all groucho imapd processes

    _______________________________________________
    Systems mailing list
    Systems@moscow.uppity.edu
    http://moscow.uppity.edu/mailman/listinfo/systems
We still don't have an understanding of this problem, much less a fix, but at least we are not losing any more user email, and our mail server is coping. (Most users are unaware of these difficulties.)
Ultimately this is a configurational issue, in a sense, in that properly written software and/or properly tweaked configuration files (perhaps extending to base system files, not just the configuration files for IMP, imapd, etc.) would fix the problem. On the other hand, even with the best of software and "correct" configurations, zaniness might erupt from time to time, and for the life of you, you just can't figure it all out. This is why, it seems to me, you need a good system monitoring tool with auto-corrective capabilities, beyond tools that just help you with your configuration management. Of course, PIKT does both!
We had a working, stable mail setup until the alumni merge several weeks ago. Words to the wise: If it ain't broke, don't fix it!
But if it is broke, mend it in the end, and consider applying PIKT bandaids (tourniquets?) while the battle still rages.
For more examples, see Developer's Notes.