System Downages

[posted 1999/11/17]

One of our mission critical systems crashed last night.  The PIKT SysDownEmergency alarm caught this and dutifully sent alert e-mail to the sysadmins, but we (I) didn't read the e-mail until about a half hour later, whereupon there ensued a scramble to call and page persons closer to the scene...  It's a long story.  Suffice it to say that now is the time to chip away some more at that PIKT ToDo Mountain and implement paging.

The following system downages example will be a bit convoluted.  It makes use of several optional PIKT components that, once understood, make my life (at any rate) easier.

First, I created a new #include file, /pikt/lib/configs/systems/misscritsys_systems.cfg, with the following content:

        vienna berlin moscow warsaw munich milan athens2

Then I created a new host group in systems.cfg:

misscritsys
        members
#include <systems/misscritsys_systems.cfg>

Note that support for #include files in systems.cfg is a new feature, available only in the latest developer's release (currently pikt-991114, aka pikt-dev or pikt-1.8.0pre), which can be downloaded from the PIKT Web site.  This developer's release should work fine on all supported systems except AIX and IRIX.

(If you are wondering why #ifdef's and #if's are forbidden in systems.cfg, this is why:  piktc processes systems.cfg first, defines.cfg second.  piktc can't deal with #ifdef's in systems.cfg because it doesn't know about any #define's yet.  Also, we can't make use of #if's yet because we are still in the process of defining our systems.)

I also created a new macro in macros.cfg that invokes this very same #include file:

misscritsys
#  include <systems/misscritsys_systems.cfg>

I will need to specify mission critical systems in both #if preprocessor statements and =misscritsys macro references.  With the single #include file, I have to specify these systems in just one place.
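
As a rough illustration of why one list can serve both purposes, here is a small Python analogy (not PIKT syntax).  The host names are the ones from the #include file above.  The names in_group and matches_macro are made up for this sketch, and the plain substring test only stands in for the =~ match used in the alarm scripts.

```python
# Python analogy only; this is not PIKT syntax.
# The host names are those listed in misscritsys_systems.cfg above.
INCLUDE_TEXT = "vienna berlin moscow warsaw munich milan athens2"

# Use 1: treat the list as a host group (roughly what "#if misscritsys" tests).
misscritsys_group = set(INCLUDE_TEXT.split())

# Use 2: treat the list as macro text (roughly what "=misscritsys" expands to).
misscritsys_macro = INCLUDE_TEXT


def in_group(host: str) -> bool:
    """Hypothetical helper: is this host a member of the host group?"""
    return host in misscritsys_group


def matches_macro(host: str) -> bool:
    """Hypothetical helper mirroring the padded test " =misscritsys " =~ " $host ".

    Padding both sides with spaces keeps "athens" from matching "athens2".
    """
    return f" {host} " in f" {misscritsys_macro} "
```

Both helpers read from the same INCLUDE_TEXT, so adding or removing a host changes the group and the macro together.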

In alerts.cfg, I run the more important alerts more frequently for the mission critical systems.  So, for the EMERGENCY, Urgent, and Critical alerts, I made use of the misscritsys host group as follows, for example:

EMERGENCY               // things that require immediate attention

#if misscritsys
        timing          10,25,40,55 * * * *
#else
        timing          10,40 * * * *
#endif
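
To make the two schedules concrete, here is a short Python sketch that works out the gap between runs.  It assumes the first field of the timing line is a cron-style list of minutes; minute_gaps is a name invented for this example.

```python
def minute_gaps(minute_field: str) -> list[int]:
    """Return the gaps, in minutes, between consecutive runs in a repeating hour."""
    minutes = sorted(int(m) for m in minute_field.split(","))
    # Wrap around to the first run of the following hour.
    following = minutes[1:] + [minutes[0] + 60]
    return [later - earlier for earlier, later in zip(minutes, following)]


print(minute_gaps("10,25,40,55"))  # [15, 15, 15, 15]
print(minute_gaps("10,40"))        # [30, 30]
```

Under that assumption, the mission critical systems are checked every 15 minutes and everything else every 30.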

I also added another #define in defines.cfg giving me the ability to turn paging on and off globally at will:

page TRUE               // if TRUE, then issue pages, else keep silent

Now, on to the heart of the matter.  Before, I had a single SysDownEmergency alarm to alert us about system downages.  For various reasons, I decided it was better to break this into two separate alarms, a revised SysDownEmergency and a new SysDownWarning:

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDownEmergency

        init
                status active
                level emergency
                task "Detect system crashes, or systems going off the network"
                input file "=hostinfo_obj"
                dat $host 1
                // ignore the rest of the fields in HostInfo.obj
                keys $host

        begin
                set $timeout = "20"     // yes, string var here
                =set_timenow
                =set_hr
                =set_dow
                // bypass weekly reboot period
                if =reboot_period
                        quit
                endif

        rule    // exclude systems known to be down
                if " =downsys " =~ " $host "
                        next
                endif

        rule    // report if system goes down; repeat only if system goes up
                // then back down again; for certain mission-critical systems,
                // report every time (issue repeated nagmail), also page but
                // just once per downage incident
#  if linux | freebsd
                if $command("=ping -c 1 $host | =tail -2 | =head -1")
                        =~ " 0% packet loss"
#  elif hpux
                if $command("=ping $host -n 1 | =tail -2 | =head -1")
                        =~ " 0% packet loss"
#  elif solaris | sunos
                if $command("=ping $host $timeout") =~ "is alive"
#  endif
                        set $state = "+"
                else
                        set $state = "-"
                        if " =misscritsys " =~ " $host "
                                output mail "$host is down, or off the network"
#  ifdef page
                                if    ! #defined(%state)
                                   || $state ne %state
                                        exec wait "echo '$host is down' |
                                                   =mailx -s '$host is down'
                                                   pagemozart\@egbdf
                                                   pagebrahms\@egbdf
                                                   pageliszt\@egbdf"
                                endif
#  endifdef
                        elseif    ! #defined(%state)
                               || $state ne %state
                                output mail "$host is down, or off the network"
                        endif
                endif

#endif  // piktmaster

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDownWarning

        init
                status active
                level warning
                task "Detect systems down or off the network"
                input file "=hostinfo_obj"
                dat $host 1
                // ignore the rest of the fields in HostInfo.obj

        begin
                set $timeout = "20"     // yes, string var here

        rule    // report if system doesn't respond to ping
#  if linux | freebsd
                if $command("=ping -c 1 $host | =tail -2 | =head -1")
                        =~ " 0% packet loss"
#  elif hpux
                if $command("=ping $host -n 1 | =tail -2 | =head -1")
                        =~ " 0% packet loss"
#  elif solaris | sunos
                if $command("=ping $host $timeout") =~ "is alive"
#  endif
                        // do nothing
                else
                        output mail "$host is down, or off the network"
                endif

#endif  // piktmaster

///////////////////////////////////////////////////////////////////////////////

Some commentary:

In SysDownEmergency, the begin section aborts the alarm during our weekly reboot period.  (For maintenance and general system checks, we reboot many of our systems once weekly.  Other sites like to keep their systems up as long as possible, then brag about their uptimes.  Your mileage may vary.)

The first rule has us bypass systems known to be down for an extended period.

In the second rule, we determine whether systems are up by means of the ping command.  Now, a system might be pingable but still hosed.  Eventually, I will create other auxiliary alarms to give us advance warning when systems are sick but still short of totally dead.
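
One detail in the ping tests is easy to miss: the pattern is " 0% packet loss", with a leading space.  The Python sketch below shows why that matters.  The two summary lines are illustrative only (real ping output varies by platform), and looks_up is a name invented for this example.

```python
import re

# Illustrative summary lines only; real ping output varies by platform.
UP_LINE = "1 packets transmitted, 1 received, 0% packet loss, time 0ms"
DOWN_LINE = "1 packets transmitted, 0 received, 100% packet loss, time 0ms"

PADDED = re.compile(r" 0% packet loss")    # the pattern used in the alarms
UNPADDED = re.compile(r"0% packet loss")   # the same pattern minus the space


def looks_up(summary_line: str) -> bool:
    """Hypothetical helper: treat the host as up only on a padded match."""
    return PADDED.search(summary_line) is not None
```

Without the leading space, the pattern would also match the tail of "100% packet loss", and a dead system would be reported as up.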

If a system is up (pingable), we set the $state to "+" and eventually store that in the history database.  If a system is down, we set $state to "-" and store that for later recall.

I strongly suggest that whenever you reference a history (%) value, you first check whether it is #defined().  If you don't, the history mechanism might not work as you expect.

For down systems, if they are members of the misscritsys (mission critical systems) set, we always emit nagmail, every time this alarm runs.  If we have #define'd page (and we have, in defines.cfg), and if the system's $state has changed from "+" (last time) to "-" (this time), we issue a page just this once to the sysadmins.  (It only now occurs to me that I could shorten the addressee list to a single "pagesysadmins" address, where "pagesysadmins" is a mail alias resolving to the individual personal page aliases.)  We implement paging via mail aliases, which resolve to a special executable set up for this purpose.  (Again, your mileage may vary.)

For non mission critical systems, we don't page.  Instead, we send out e-mail if a system is newly down.  Contrast this to the nagmail (sent each and every time) for the mission critical systems.
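
The mail and page decisions described above can be summarized in a short Python sketch.  This is an analogy to the second rule of SysDownEmergency, not a translation of it.  The function name actions_for and the "mail"/"page" labels are invented here, and previous=None stands in for the case where %state is not yet #defined.

```python
from typing import Optional

# Host names taken from misscritsys_systems.cfg above.
MISSCRITSYS = {"vienna", "berlin", "moscow", "warsaw", "munich", "milan", "athens2"}


def actions_for(host: str, is_up: bool, previous: Optional[str],
                paging: bool = True) -> tuple[str, list[str]]:
    """Return (new_state, actions) for one host on one alarm run.

    previous is the state stored on the prior run ("+" or "-"), or None
    when there is no history yet (the '! #defined(%state)' case).
    paging plays the role of the global 'page' #define.
    """
    if is_up:
        return "+", []
    state = "-"
    changed = previous is None or previous != state
    actions = []
    if host in MISSCRITSYS:
        actions.append("mail")       # nagmail on every run while down
        if paging and changed:
            actions.append("page")   # page once per downage incident
    elif changed:
        actions.append("mail")       # a single mail when newly down
    return state, actions
```

For example, actions_for("vienna", False, "+") yields a mail and a page, a second run with previous "-" yields mail only, and a host outside the set (say, a hypothetical "paris") gets one mail when it first goes down and nothing afterwards.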

Note the following differences in the much simpler SysDownWarning alarm:  There is no reboot_period bypass, because we run this alarm just once at a different time of day.  There is no downsys bypass, so that we can get a daily reminder of all system downages.  Finally, in all cases, mission critical and not so, a warning message is sent, but just once a day, because this alarm is scheduled to run just once a day.

This is a suggested example of how you might handle the SysDown problem.  Modify to suit your needs.  (Your mileage *will* vary.)

For more examples, see Developer's Notes.

 
Page best viewed at 1024x768 or greater.   Page last updated 2019-01-12.   This site is PIKT® powered.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.