imapd Storms

[posted 2000/04/26]

We are in the midst of an email crisis stemming from merging our 32,000+ user alumni mail operation with our main 4,000+ current-user mail operation, and from providing new mail services such as Web-based email access.

I'll spare you most of the details, except these:  We are running IMP, the "Imap webMail Program", put out by the Horde Project (www.horde.org).  A bug in our version of IMP (2.0) has the imapd, under occasional and still mysterious circumstances, spawning instances of itself every second or so.  The good news is that this is a known bug that will be fixed in the next version of IMP (still in beta).  The bad news is that, for a handful of users, we are seeing occasional "imapd storms" with per-user imapd counts reaching into the dozens, hundreds, and sometimes even thousands.  The highest recorded count so far is 3816!   =:0

Not only do these imapd storms risk losing the user's mail file, they also imperil the entire system, as you can imagine.

What to do? Ideally, we/they fix the underlying software, or come up with some configurational tweaks to allay the problem.  But discovering and implementing these takes time.  In the meantime, PIKT to the rescue!

The following is a new PIKT script we have put into operation on our mail server to deal with these problems:

ProcNumChkEmergency

        init
                status active
                level emergency
                task "Report unusually high numbers of per-user procs."
                input proc "=ps -eo user,comm | =sort | =uniq -c"
                dat #count 1
                dat $user  2
                dat $proc  3

        rule    // for gathering diagnostic stats
                if #count >= 20
                        output log "=logdir/ProcNumChkEmergency.log" $inline
                fi

        rule    // report if the per-user proc count exceeds a certain
                // threshold; in the case of exceedingly high imapd
                // counts, kill off all imapd's for that user also
                if $proc =~ "sendmail"
                        if #count >= 60
                                output mail $inline
                        fi
                elseif $proc =~ "imapd"
                        if #count >= 60
                                output mail $inline
                                // archive the user's mail file
                                if -e "/var/mail/$user"
                                        exec wait "=cp -p /var/mail/$user
                                                  /var/mail/arc/$user" .
                                                  "." . $text(#now())
                                        output mail "saved user mail file as
                                                    /var/mail/arc/$user" .
                                                    "." . $text(#now())
                                fi
                                // #now()+1 to guarantee no conflict
                                // with the preceding cp
                                if -e "/var/mail/." . $user . ".pop"
                                        exec wait "=cp -p /var/mail/." . $user .
                                                  ".pop /var/mail/arc/$user" .
                                                  "." . $text(#now()+1)
                                        output mail "saved user mail file as
                                                    /var/mail/arc/$user" .
                                                    "." . $text(#now()+1)
                                fi
                                // kill off all imapd's if count is exceedingly
                                // high
                                if #count >= 100
                                        set #killcount = #count
                                        while #killcount > 0
                                                set #killcount = 0
                                                do #popen(KILL, "=ps -eo
                                                          pid,user,comm", "r")
                                                while #read(KILL) > 0
                                                        if #split($readline) != 3
                                                                cont
                                                        fi
                                                        // save in case we do a
                                                        // subsequent regexp op
                                                        set $p = $1
                                                        set $u = $2
                                                        set $c = $3
                                                        if    $u eq $user
                                                           && $c eq "imapd"
                                                               exec wait "=kill
                                                                          -9 $p"
                                                               set #killcount += 1
                                                        fi
                                                endwhile
                                                do #pclose(KILL)
                                        endwhile
                                        output mail "killed all $user
                                                     imapd processes"
                                fi
                        fi
                elseif $proc =~ "httpsd|httpd"
                        if #count >= 40
                                output mail $inline
                        fi
                else    // all other procs
                        if #count >= 20
                                output mail $inline
                        fi
                fi

This is a work-in-progress.  I'd like to render some common elements as macros, and perhaps refer to a *.obj file matching proc names to instance counts.  Still, this bandaid is working well enough in the short run.

Some elaboration:

The input statement

                input proc "=ps -eo user,comm | =sort | =uniq -c"

yields input like

      1    harpo imapd
      1     root /opt/apache/bin/httpd
      1     root /opt/apache_secure/bin/httpsd
     34     root /usr/lib/sendmail
     10     root /opt/apache/bin/httpd
      2   daemon /usr/lib/ab2/dweb/sunos5/bin/dwhttpd
     10   webown /opt/apache_secure/bin/httpsd
      1  groucho imapd
      1    chico imapd
      1    zeppo imapd
      ...
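
For readers who don't speak PIKT, here is a rough Python sketch of what the three dat statements amount to: field 1 is the count, field 2 the user, field 3 the proc.  The function name parse_proc_line is made up for illustration, and this is only an approximation of how PIKT splits its input, not a description of PIKT internals.

```python
def parse_proc_line(line):
    """Split one 'uniq -c' output line into (count, user, proc).

    Returns None for lines that do not have three fields or whose
    first field is not a number.
    """
    fields = line.split(None, 2)  # split on whitespace, at most 3 fields
    if len(fields) != 3 or not fields[0].isdigit():
        return None
    count, user, proc = fields
    return int(count), user, proc.strip()


# A few lines taken from the sample input above.
sample = """\
      1    harpo imapd
     34     root /usr/lib/sendmail
     10   webown /opt/apache_secure/bin/httpsd
"""
records = [parse_proc_line(line) for line in sample.splitlines()]
```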

The first rule logs every per-user process count of 20 or more.  These are diagnostic stats we will study later to discern a pattern (hopefully).

In the second rule, if a per-user process count reaches its threshold (60 for sendmail and imapd, 40 for httpd and httpsd, 20 for everything else), we send alert mail reporting that.
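
The if/elseif chain is really a lookup table in disguise, which is roughly what a *.obj file matching proc names to instance counts would give us.  A minimal Python sketch of that idea follows, assuming that =~ behaves like an unanchored regular-expression search (the sendmail pattern has to match /usr/lib/sendmail, after all).  The names THRESHOLDS, alert_threshold and should_alert are hypothetical; only the patterns and numbers come from the script.

```python
import re

# (pattern, threshold) pairs, checked in the same order as the
# script's if/elseif chain.  The first matching pattern wins.
THRESHOLDS = [
    (re.compile(r"sendmail"), 60),
    (re.compile(r"imapd"), 60),
    (re.compile(r"httpsd|httpd"), 40),
]
DEFAULT_THRESHOLD = 20  # all other procs


def alert_threshold(proc):
    """Return the alert threshold that applies to this proc name."""
    for pattern, limit in THRESHOLDS:
        if pattern.search(proc):
            return limit
    return DEFAULT_THRESHOLD


def should_alert(count, proc):
    """True if the per-user count has reached the threshold for proc."""
    return count >= alert_threshold(proc)
```

Note that, under this assumption, a name like dwhttpd also falls under the httpd threshold, since the pattern is not anchored.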

In the case of imapd only, if the count reaches 60, we also archive the user's mail file by cp'ing it (and its .$user.pop variant, if present) to the /var/mail/arc directory, with the current time appended to the file name.
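
The naming scheme is easier to see outside the nested PIKT code.  Here is a small Python sketch of it: the main mail file is archived under the current time, and the .pop variant under the current time plus one so the two copies cannot collide.  The function archive_plan and its parameters are invented for illustration; it only computes the source and destination pairs and leaves the actual copying to the caller.

```python
import os


def archive_plan(user, now, exists=os.path.exists,
                 spool="/var/mail", arcdir="/var/mail/arc"):
    """Return a list of (source, destination) pairs to copy.

    'now' is an integer timestamp.  'exists' is injectable so the
    plan can be checked without touching a real mail spool.
    """
    plan = []
    main_file = "%s/%s" % (spool, user)
    pop_file = "%s/.%s.pop" % (spool, user)
    if exists(main_file):
        plan.append((main_file, "%s/%s.%d" % (arcdir, user, now)))
    if exists(pop_file):
        # now + 1 keeps this name distinct from the copy above
        plan.append((pop_file, "%s/%s.%d" % (arcdir, user, now + 1)))
    return plan


# Pretend both files exist, using the timestamp from a real alert.
plan = archive_plan("groucho", 956639822, exists=lambda path: True)
```

In a real script each pair could then be handed to something like shutil.copy2, which tries to preserve file metadata in roughly the way cp -p does.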

If the per-user imapd count reaches 100, it's time to kill that user's imapd's.  We do a "ps -eo pid,user,comm", and for every $user-imapd pair, we "kill -9" the corresponding pid.

Because more imapd's may have spawned in the time it takes to kill off the first batch, we keep looping until the #killcount is 0--i.e., there were no kills done in the last go-around.
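
Stripped of the PIKT plumbing, the loop looks something like the Python sketch below.  The process lister and the killer are passed in as functions so the logic can be exercised against a fake process table.  The name kill_until_quiet is hypothetical, and the max_passes cap is an addition of this sketch, not something the PIKT script has.

```python
def kill_until_quiet(user, command, list_procs, kill, max_passes=1000):
    """Repeat ps-and-kill passes until a pass finds nothing to kill.

    list_procs() returns an iterable of (pid, user, comm) tuples;
    kill(pid) terminates one process.  Returns the total kill count.
    max_passes is a safety cap (an assumption of this sketch).
    """
    total = 0
    for _ in range(max_passes):
        killed = 0
        for pid, owner, comm in list_procs():
            if owner == user and comm == command:
                kill(pid)
                killed += 1
        if killed == 0:
            break  # no kills in the last go-around, so we are done
        total += killed
    return total


# Fake process table: groucho's imapd respawns once, then stays dead.
snapshots = iter([
    [(101, "groucho", "imapd"), (102, "groucho", "imapd"), (103, "chico", "imapd")],
    [(104, "groucho", "imapd"), (103, "chico", "imapd")],
    [(103, "chico", "imapd")],
])
killed_pids = []
total = kill_until_quiet("groucho", "imapd",
                         lambda: next(snapshots), killed_pids.append)
```

On a live system, list_procs would wrap the ps command and kill would send signal 9, for example via os.kill(pid, signal.SIGKILL).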

I can see where this might be made more efficient (e.g., pass the inner ps output through a grep of the user and imapd first).  But we put this up quickly, and it's working well enough so far.

I am also considering adding to the alarm message a dump of syslog with the relevant lines for that user-proc.

Here is a sample alert message:

                                PIKT ALERT
                         Tue Apr 25 00:17:02 2000
                                  moscow

URGENT:
    ProcNumChkUrgent
        Report unusually high numbers of per-user procs.

         404  groucho imapd
        saved user mail file as /var/mail/arc/groucho.956639822
        killed all groucho imapd processes

_______________________________________________
Systems mailing list
Systems@moscow.uppity.edu
http://moscow.uppity.edu/mailman/listinfo/systems

We still don't have an understanding of this problem, much less a fix, but at least we are not losing any more user email, and our mail server is coping.  (Most users are unaware of these difficulties.)

Ultimately this is a configurational issue, in a sense, in that properly written software and/or properly tweaked configuration files (perhaps extending to base system files, not just the configuration files for IMP, imapd, etc.) would fix the problem.  On the other hand, even with the best of software and "correct" configurations, zaniness might erupt from time to time, and for the life of you, you just can't figure it all out.  This is why, it seems to me, you need a good system monitoring tool with auto-corrective capabilities, beyond tools that just help you with your configuration management.  Of course, PIKT does both!

We had a working, stable mail setup until the alumni merge several weeks ago.  Words to the wise:  If it ain't broke, don't fix it!

But if it is broke, mend it in the end, but consider applying PIKT bandaids (tourniquets?) while the battle still rages.

For more examples, see Developer's Notes.

 
Page best viewed at 1024x768 or greater.   Page last updated 2008-02-27.   This site is PIKT® powered.
PIKT® is a registered trademark of the University of Chicago.   Copyright © 1998-2008 Robert Osterlund. All rights reserved.