System Failures

[posted 1999/11/24]

This is a followup to the message sent out last week about our revised SysDownEmergency alarm.

The system that inspired the SysDown revision has crashed, for still undetermined reasons, several times in the last month.  In every case, and independent of the earlier SysDown alarm (which informed us of the downage via e-mail because the system was not pingable), we knew about the system failure because this system is our main applications server, and NFS crossmounts were hanging all across our domain.

I would be online, and most everything would lock up.  When I tried my usual rlogin (or rsh) to get onto the server, that hung also.  Fortunately, on most occasions, I was able to ssh commands to the sick system, e.g., 'sync; reboot'.  (sshd runs independently of inetd.  Perhaps it was inetd that was hosed.)  Unfortunately, there were a couple of times when sshd was hosed, too, and one of us sysadmins had to drive down to the site and reboot the system firsthand.

If we can add a test of RPC services to SysDown, this might give us advance warning of total system failures.

Here is a suitably revised SysDown alarm:

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDownEmergency

        init
                status active
                level emergency
                task "Detect system crashes, or systems going off the network"
                input file "=hostinfo_obj"
                dat $host 1
                // ignore the rest of the fields in HostInfo.obj
                keys $host

        begin
                set $timeout = "20"     // yes, string var here
                =set_timenow
                =set_hr
                =set_dow
                // bypass weekly reboot period
                if =reboot_period
                        quit
                endif

        rule    // exclude systems known to be down
                if " =downsys " =~ " $host "
                        next
                endif

        rule    // for mission-critical systems, check the state of their
                // rpc services (which are often the first sign of system
                // trouble); report if rpc services are down, also page
                // but just once per downage incident; if rpc services are
                // down, we might still be able to ssh to the sick machine,
                // for example to issue a reboot command (this invariably
                // fixes rpc problems, and usually other problems as well)
                if    " =misscritsys " =~ " $host "
                   && " seville " !~ " $host "  // no rsh services
                        // the $command() below returns the hostname followed
                        // by a linefeed, hence the '!~ $host'; we could
                        // also do:
                        // if $chop($command(...)) ne $host
                        if $command("=maxtime $timeout =rsh $host hostname |
                                     =behead") !~ $host
                                set $state = "-"
                                output mail "$host is sick, possibly down,
                                             or off the network"
#  ifdef page
                                if    ! #defined(%state)
                                   || $state ne %state
                                        exec wait "echo '$host is sick/down' |
                                                   =mailx -s
                                                   '$host is sick/down'
                                                   pagemozart\@egbdf
                                                   pagebrahms\@egbdf
                                                   pageliszt\@egbdf"
                                endif
#  endifdef
                                next    // bypass ping test for this host
                        endif
                endif

        rule    // report if system goes down; repeat only if system goes up
                // then back down again; for certain mission-critical systems,
                // report every time (issue repeated nagmail), also page but
                // just once per downage incident
#  if linux | freebsd
                if $command("=ping -c 1 $host | =tail -2 | =head -1")
                        =~ " 0% packet loss"
#  elif hpux
                if $command("=ping $host -n 1 | =tail -2 | =head -1")
                        =~ " 0% packet loss"
#  elif solaris | sunos
                if $command("=ping $host $timeout") =~ "is alive"
#  endif
                        set $state = "+"
                else
                        set $state = "-"
                        if " =misscritsys " =~ " $host "
                                output mail "$host is down, or off the network"
#  ifdef page
                                if    ! #defined(%state)
                                   || $state ne %state
                                        exec wait "echo '$host is down' |
                                                   =mailx -s '$host is down'
                                                   pagemozart\@egbdf
                                                   pagebrahms\@egbdf
                                                   pageliszt\@egbdf"
                                endif
#  endifdef
                        elseif    ! #defined(%state)
                               || $state ne %state
                                output mail "$host is down, or off the network"
                        endif
                endif

#endif  // piktmaster

///////////////////////////////////////////////////////////////////////////////

The only difference from the SysDown version sent out last week is the addition of the second rule.

For the "mission critical" systems (specified in the systems.cfg #include file <systems/misscritsys_systems.cfg>; see last week's message for more detail), except for any systems that don't support RPC services because we have commented them out of inetd.conf (in this case, seville only), we issue an 'rsh $host hostname' command to the current host.  Because the host's RPC services might be down, and since we don't want this alarm to hang indefinitely, we put a '=maxtime $timeout ...' wrapper around the rsh command.  And because =maxtime issues a 'spawn rsh <host> hostname' as its first line of output, we have to =behead that first line.
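To make the purpose of the timeout wrapper concrete, here is a rough Python sketch of the same idea.  It is not how =maxtime is implemented (that macro's definition is not shown in this note); it only illustrates running a command under a time limit and treating a hang as "no answer".  The function name run_with_timeout and the stand-in commands HEALTHY and HUNG are hypothetical, as is the hostname 'vienna'.

```python
import subprocess
import sys

# Hypothetical stand-ins for 'rsh $host hostname': one answers at once,
# the other hangs the way a sick server's rsh would.
HEALTHY = [sys.executable, "-c", "print('vienna')"]
HUNG = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_with_timeout(argv, timeout_secs):
    """Run argv; return its stdout, or "" if it hangs or cannot be started."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            timeout=timeout_secs,   # the child is killed once this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable command yields no output at all
        return ""
    return result.stdout
```

With a wrapper like this, run_with_timeout(HUNG, 1) comes back empty after about a second instead of blocking the whole alarm run, which is the property the alarm depends on.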

Interestingly, if a system responds, the $command() returns the hostname with an extra linefeed tacked on the end.  We can either $chop() that off and do a direct string compare (using 'ne'), or we can attempt a regexp match on the $command() output (using '!~').
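The linefeed pitfall is easy to reproduce outside Pikt.  The Python sketch below is only an analogy: rstrip("\n") stands in for $chop(), and re.search() stands in for the regexp match.  The function names and the hostname 'vienna' are made up for illustration.

```python
import re


def answered_exact(output, host):
    """Strip the trailing linefeed, then compare (like $chop() plus 'ne')."""
    return output.rstrip("\n") == host


def answered_regex(output, host):
    """Look for the hostname anywhere in the output (like a regexp match)."""
    return re.search(re.escape(host), output) is not None


# What a healthy host would send back: its name plus a linefeed
reply = "vienna\n"

# A naive comparison fails because of the extra linefeed ...
naive_ok = (reply == "vienna")

# ... while both approaches described above succeed
exact_ok = answered_exact(reply, "vienna")
regex_ok = answered_regex(reply, "vienna")
```

One difference worth noting in this sketch: the regexp form is looser, so a reply of "vienna2\n" would also count as a match for 'vienna', whereas the strip-and-compare form would reject it.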

Before we move on, note one more thing:  the $command() includes a 'hostname', not a '=hostname', since we want the hostname command on the remote system, not on the invoking piktmaster.  By referencing 'hostname' we get the hostname command in the remote system's root PATH.  If this is not sufficient, we could reference a database of hostname commands that we set up in our hosts.cfg (or perhaps figure out a way to get the appropriate hostname command from the <macros/unixcmds_*_macros.cfg> databases).

Moving on, if we don't get the current host's hostname back from the $command() (if we get back, for example, the empty string), we conclude that the system's RPC services at least are down.  We then follow the sequence described in last week's message, but with the following nuances: (1) as before with a failed ping, we get paged only once (saying the system is "sick/down"), but, unlike in the failed ping case, we get e-mailed only once (coding this to send out "nagmail" in the sick system case is left as an exercise for the reader); (2) if the system's RPC services are sick, there is no sense also doing the ping test; (3) if the RPC services are healthy, we go on to do the ping test for that host anyway as a doublecheck (note that the 'next' command to bypass the ping test is inside the "sick system" if ... endif).
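The "page just once per downage incident" logic comes down to comparing the current state with the previous one.  Here is a small Python sketch of that test, assuming (as the script's usage suggests) that %state holds whatever $state was on the previous run and is undefined on the very first run.  The names should_page and pages_for are hypothetical.

```python
def should_page(current_state, previous_state):
    """Mirror of '! #defined(%state) || $state ne %state'."""
    return previous_state is None or current_state != previous_state


def pages_for(history):
    """Given one '+' or '-' per alarm run, list the runs that would page."""
    paged = []
    previous = None                 # no previous state on the first run
    for run, state in enumerate(history):
        # The page test is only reached when the host is sick or down
        if state == "-" and should_page(state, previous):
            paged.append(run)
        previous = state
    return paged
```

For a history of down, down, down, up, down, down, only the first run of each downage would page, so pages_for(["-", "-", "-", "+", "-", "-"]) gives runs 0 and 4.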

A funny thing happened on the way to fine-tuning this alarm.  I missed the subtlety about the extra linefeed tacked onto the end of the remote hostname output, whereupon my initial tests were sending off pages right and left (well, seven for each test anyway, for each of our seven mission critical systems).

I quickly vi'ed the EMERGENCY.alt file and employed a little-known feature of Pikt scripts.  By changing 'status active' to 'status inactive'--in the .alt file, not in the alarms.cfg!--I was able to turn the SysDown alarm off temporarily.  (The rest of the alarms in the EMERGENCY alert package continued to run during the debug.)

Then I cp'ed the EMERGENCY.alt file to Test.alt and proceeded to edit and debug the Test.alt directly--again bypassing the alarms.cfg file. Once I had the Test.alt done just right, I mirrored the modifications in the alarms.cfg, reinstalled EMERGENCY.alt (using 'piktc -iv +A EMERGENCY +H piktmaster'), then rm'ed the Test .alt, .hst & .log files.

We are now well prepared to deal with the possible impending downage of our critical systems.  With luck, that second rule will activate in time to give us the chance to at least issue a remote reboot command, sparing us the ordeal of having to drive over to the site and reboot firsthand.

This is another good example of PIKT's flexibility and extensibility.  Note how we are able to tailor the SysDown alarm exactly to our situation and liking.  This flexibility and extensibility is one of the things that make PIKT such a useful tool, IMHO.

For more examples, see Developer's Notes.

 
Home | FAQ | News | Intro | Samples | Tutorial | Reference | Software
Developer's Notes | Licensing | Authors | Pikt-Users | Pikt-Workers | Related Projects | Site Index | Privacy Policy | Contact Us
Page best viewed at 1024x768 or greater.   Page last updated 2019-01-12.   This site is PIKT® powered.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.