More imapd Storms

[posted 2000/04/28]

Mainly because of our ongoing mail server crisis, it's been another harrowing week.  To make a long story short, we are still seeing imapd storms, but at a reduced frequency.  The ProcNumChk PIKT scripts described earlier are doing their job, archiving imperiled user mail files and killing off excess imapds.  One interim "fix"--making our mail server an NIS slave server (so that it could NIS serve itself)--blew up in our faces.  A documented bug introduced in the Solaris 2.7 NIS code gives rise, under occasional and still mysterious circumstances, to "ypserv storms", in which ypserv processes multiply like crazy to the point of a system crash.  Moving this NIS slave service off to a newly made Solaris 2.6 machine fixed that problem.  Oy vey!!

This is a follow-up to the ProcNumChk scripts I posted earlier in the week.  Below, you will see a revised ProcNumChkEmergency and a new ProcNumChkRed.  For the latter, I have created a new "Red Alert" that runs on the mail server every five minutes, give or take one minute (because I have set the timing drift to 1).  Note that moscow is our mail server and nantes is our new NIS server.

#if moscow | nantes

ProcNumChkEmergency

        init
                status active
                level emergency
                task "Report unusually high numbers of per-user processes."
                input proc "=ps -eo user,comm | =sort | =uniq -c"
                dat #count 1
                dat $user  2
                dat $proc  3

        rule    // for gathering diagnostic stats
                if #count >=
#  if moscow
                             20
#  elseif nantes
                             10
#  endif
                        output log "=logdir/ProcNumChkEmergency.log" $inline
                fi

        rule    // report if the per-user proc count exceeds a certain
                // threshold
#  if moscow
                if $proc =~ "imapd"
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "sendmail"
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "http"      // httpd|httpsd
                        if #count >= 40
                                output mail $inline
                        fi
                        next
                fi
#  endif  // moscow
                // the default case (including "ypserv")
                if #count >=
#  if moscow
                             20
#  elseif nantes
                             10
#  endif
                        output mail $inline
                        next
                fi

#endif  // moscow | nantes
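
For readers less familiar with PIKT, the input stage of the script above can be sketched in Python.  This is only an illustration, not part of the PIKT configuration: the function names (proc_counts, to_log) and the sample data are hypothetical, and the threshold of 20 is the moscow value from the first rule.

```python
from collections import Counter


def proc_counts(ps_lines):
    """Emulate `ps -eo user,comm | sort | uniq -c`.

    Returns (count, user, proc) tuples, matching the script's three `dat`
    fields (#count, $user, $proc), sorted by user and then command.
    """
    pairs = Counter()
    for line in ps_lines:
        fields = line.split()
        if len(fields) == 2:          # skip blank or malformed lines
            pairs[(fields[0], fields[1])] += 1
    return [(n, user, proc) for (user, proc), n in sorted(pairs.items())]


def to_log(records, threshold=20):
    """Apply the first rule: keep records at or above the log threshold."""
    return [rec for rec in records if rec[0] >= threshold]


if __name__ == "__main__":
    # Hypothetical sample standing in for real `ps` output
    sample = ["alice imapd"] * 25 + ["bob sendmail"] * 3 + ["root ypserv"] * 20
    for count, user, proc in to_log(proc_counts(sample)):
        print(count, user, proc)
```

Each tuple corresponds to one input line as the PIKT rules see it, so every rule fires once per (user, process) pair rather than once per process.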

-------------------------------------------------------------------------------

#if moscow

ProcNumChkRed   // with higher thresholds than ProcNumChkEmergency, and
                // additional corrective steps; runs more often, too

        init
                status active
                level emergency
                task "Report unusually high numbers of per-user processes."
                input proc "=ps -eo user,comm | =sort | =uniq -c"
                dat #count 1
                dat $user  2
                dat $proc  3

        rule    // for gathering diagnostic stats
                if #count >= 30
                        output log "=logdir/ProcNumChkRed.log" $inline
                fi

        rule    // report if the per-user proc count exceeds a certain
                // threshold
                // in the case of exceedingly high imapd counts,
                // archive the user mail file and kill off all imapd's
                // for that user also
                if $proc =~ "imapd"
                        if #count >= 90
                                output mail $inline
                                =archive_mail_file($user, #true())
                                =kill_user_proc("imapd", $user, #count,
                                                 90, #true())
                        fi
                        next
                fi
                if $proc =~ "sendmail"
                        if #count >= 90
                                output mail $inline
                        fi
                        next
                fi
                if $proc =~ "http"      // httpd|httpsd
                        if #count >= 60
                                output mail $inline
                        fi
                        next
                fi
                // the default case (including "ypserv")
                if #count >= 30
                        output mail $inline
                        next
                fi

#endif  // moscow
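
The way the two ProcNumChkRed rules combine for a single input line can be summarized with a small Python sketch.  It is a hedged paraphrase of the script above, not something PIKT runs: red_actions is a hypothetical name, the action strings are just labels, and the substring tests stand in for the =~ matches.

```python
def red_actions(proc, count):
    """Return the actions ProcNumChkRed would take for one input line."""
    actions = []
    if count >= 30:                     # first rule: diagnostic logging
        actions.append("log")
    # Second rule: the first matching process pattern decides the outcome,
    # mirroring the "next" statements in the script.
    if "imapd" in proc:
        if count >= 90:
            actions += ["mail", "archive_mail_file", "kill_user_proc"]
    elif "sendmail" in proc:
        if count >= 90:
            actions.append("mail")
    elif "http" in proc:                # httpd|httpsd
        if count >= 60:
            actions.append("mail")
    elif count >= 30:                   # default case (including ypserv)
        actions.append("mail")
    return actions


if __name__ == "__main__":
    for proc, count in [("imapd", 95), ("imapd", 45), ("ypserv", 30)]:
        print(proc, count, red_actions(proc, count))
```

Note that an imapd count between 30 and 89 is logged but not mailed: the imapd branch ends with "next", so the lower default threshold never applies to it.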

The ProcNumChkRed script refers to the following two new macros (defined in macros.cfg):

archive_mail_file(U, M)         // archive a user's mail file
                                // (U) is the user (e.g., $user)
                                // (M) is whether or not to output mail
                                //     (e.g., #true())
                                if -e "/var/mail/(U)"
                                        exec wait "=cp -p /var/mail/(U)
                                                   /var/mail/arc/(U)" .
                                                  "." . $text(#now())
                                        if (M)
                                                output mail "saved user
                                                             mail file as
                                                             /var/mail/arc/(U)" .
                                                             "." . $text(#now())
                                        fi
                                fi
                                if -e "/var/mail/." . (U) . ".pop"
                                        // #now()+1 to guarantee no conflict
                                        // with the preceding cp
                                        exec wait "=cp -p /var/mail/." . (U) .
                                                  ".pop /var/mail/arc/(U)" .
                                                  "." . $text(#now()+1)
                                        if (M)
                                                output mail "saved user
                                                             mail file as
                                                             /var/mail/arc/(U)" .
                                                            "." . $text(#now()+1)
                                        fi
                                fi
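
The naming scheme used by archive_mail_file can be tried out in isolation with the Python sketch below.  It assumes that #now() yields a seconds-resolution timestamp; the spool and archive parameters are hypothetical stand-ins for /var/mail and /var/mail/arc, and shutil.copy2 only approximates cp -p (it preserves times and mode, not ownership).

```python
import os
import shutil
import tempfile
import time


def archive_mail_file(user, spool, archive, now=None):
    """Copy a user's mailbox and .pop file into the archive directory.

    Returns the list of archive paths written.
    """
    now = int(time.time()) if now is None else now
    saved = []
    # (source file, timestamp suffix); the +1 keeps the second copy's
    # name distinct from the first, as in the macro
    candidates = [
        (os.path.join(spool, user), now),
        (os.path.join(spool, "." + user + ".pop"), now + 1),
    ]
    for src, stamp in candidates:
        if os.path.exists(src):
            dst = os.path.join(archive, "%s.%d" % (user, stamp))
            shutil.copy2(src, dst)
            saved.append(dst)
    return saved


if __name__ == "__main__":
    # Demonstrate against throwaway directories, never the real spool
    spool, archive = tempfile.mkdtemp(), tempfile.mkdtemp()
    for name in ("alice", ".alice.pop"):
        with open(os.path.join(spool, name), "w") as handle:
            handle.write("sample\n")
    print(archive_mail_file("alice", spool, archive, now=1000))
```

Because both copies land under the same user.timestamp pattern, the one-second offset is what prevents the .pop copy from overwriting the mailbox copy.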

///////////////////////////////////////////////////////////////////////////////

kill_user_proc(P, U, C, T, M)   // kill off all instances of process for a
                                // given user if the instance count exceeds
                                // a given threshold
                                // (P) is the process name (e.g., "imapd")
                                // (U) is the user (e.g., $user)
                                // (C) is the instance count (e.g., #count)
                                // (T) is the instance threshold (e.g., 100)
                                // (M) is whether or not to output mail
                                //        (e.g., #true())
                                if (C) >= (T)
                                        set #killcount = (C)
                                        while #killcount > 0
                                                set #killcount = 0
                                                do #popen(KILL, "=ps -eo
                                                          pid,user,comm", "r")
                                                while #read(KILL) > 0
                                                        if #parse($readline) != 3
                                                                cont
                                                        fi
                                                        // save in case we do a
                                                        // subsequent regexp op
                                                        set $p = $1
                                                        set $u = $2
                                                        set $c = $3
                                                        if    $u eq (U)
                                                           && $c eq (P)
                                                                exec wait
                                                                  "=kill -9 $p"
                                                                set #killcount
                                                                  += 1
                                                        fi
                                                endwhile
                                                do #pclose(KILL)
                                        endwhile
                                        if (M)
                                                output mail "killed all " . (U) .
                                                            " " . (P) .
                                                            " processes"
                                        fi
                                fi
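
The repeat-until-clean loop in kill_user_proc is sketched below in Python.  The process lister and the kill function are injected as parameters (list_procs and kill, both hypothetical names) so that the loop can be exercised against a fake process table instead of real signals; this is an illustration of the control flow, not a replacement for the macro.

```python
def kill_user_proc(proc, user, count, threshold, list_procs, kill):
    """Kill every `proc` owned by `user` once `count` reaches `threshold`.

    `list_procs()` returns (pid, user, comm) tuples and `kill(pid)` ends one
    process.  Returns the number of kill calls made.
    """
    total = 0
    if count < threshold:
        return total
    killed = count                      # forces at least one scan, like the macro
    while killed > 0:
        killed = 0
        for pid, owner, comm in list_procs():
            if owner == user and comm == proc:
                kill(pid)
                killed += 1
        total += killed
    return total


if __name__ == "__main__":
    # Fake process table standing in for `ps -eo pid,user,comm`
    table = {101: ("alice", "imapd"), 102: ("alice", "imapd"),
             103: ("bob", "imapd")}

    def list_procs():
        return [(pid, u, c) for pid, (u, c) in table.items()]

    def kill(pid):
        table.pop(pid, None)

    print(kill_user_proc("imapd", "alice", 95, 90, list_procs, kill))
    print(sorted(table))
```

Rescanning until a pass finds nothing catches processes spawned while the previous sweep was running.  The flip side is that the loop never ends if a matching process cannot be killed, which is worth keeping in mind before generalizing the macro.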

These are still works in progress, and they still have an ad hoc quality about them.  I haven't yet considered applying them beyond the two machines (moscow, the mail server, and nantes, the new NIS server) and generalizing them accordingly.  Also, I'd like to extend the use of macros here, but I want to do things sensibly and after some reflection.

The important point is that they are doing the job and helping us to cope with these imapd storms.

For more examples, see Developer's Notes.

 
Page best viewed at 1024x768 or greater.   Page last updated 2019-01-12.   This site is PIKT® powered.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.