System Failures
[posted 1999/11/24]
This is a followup to the message sent out last week about our revised SysDownEmergency alarm.
The system that inspired the SysDown revision has crashed, for still undetermined reasons, several times in the last month. In every case, and independent of the earlier SysDown alarm (which informed us of the downage via e-mail because the system was not pingable), we knew about the system failure because this system is our main applications server, and NFS crossmounts were hanging all across our domain.
I would be online, and most everything would lock up. When I tried my usual rlogin (or rsh) to get onto the server, that hung also. Fortunately, on most occasions, I was able to ssh commands to the sick system, e.g., 'sync; reboot'. (sshd runs independently of inetd. Perhaps it was inetd that was hosed.) Unfortunately, there were a couple of times when sshd was hosed, too, and one of us sysadmins had to drive down to the site and reboot the system firsthand.
If we can add a test of RPC services to SysDown, this might give us advance warning of total system failures.
Here is a suitably revised SysDown alarm:
///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDownEmergency

        init
                status active
                level emergency
                task "Detect system crashes, or systems going off the network"
                input file "=hostinfo_obj"
                dat $host 1     // ignore the rest of the fields in HostInfo.obj
                keys $host

        begin
                set $timeout = "20"     // yes, string var here
                =set_timenow
                =set_hr
                =set_dow
                // bypass weekly reboot period
                if =reboot_period
                        quit
                endif

        rule    // exclude systems known to be down
                if " =downsys " =~ " $host "
                        next
                endif

        rule    // for mission-critical systems, check the state of their
                // rpc services (which are often the first sign of system
                // trouble); report if rpc services are down, also page
                // but just once per downage incident; if rpc services are
                // down, we might still be able to ssh to the sick machine,
                // for example to issue a reboot command (this invariably
                // fixes rpc problems, and usually other problems as well)
                if    " =misscritsys " =~ " $host "
                   && " seville " !~ " $host "  // no rsh services
                        // the $command() below returns the hostname followed
                        // by a linefeed, hence the '!~ $host'; we could
                        // also do:
                        //   if $chop($command(...)) ne $host
                        if $command("=maxtime $timeout =rsh $host hostname | =behead") !~ $host
                                set $state = "-"
                                output mail "$host is sick, possibly down, or off the network"
# ifdef page
                                if ! #defined(%state) || $state ne %state
                                        exec wait "echo '$host is sick/down' | =mailx -s '$host is sick/down' pagemozart\ pagebrahms\ pageliszt\"
                                endif
# endifdef
                                next    // bypass ping test for this host
                        endif
                endif

        rule    // report if system goes down; repeat only if system goes up
                // then back down again; for certain mission-critical systems,
                // report every time (issue repeated nagmail), also page but
                // just once per downage incident
# if linux | freebsd
                if $command("=ping -c 1 $host | =tail -2 | =head -1") =~ " 0% packet loss"
# elif hpux
                if $command("=ping $host -n 1 | =tail -2 | =head -1") =~ " 0% packet loss"
# elif solaris | sunos
                if $command("=ping $host $timeout") =~ "is alive"
# endif
                        set $state = "+"
                else
                        set $state = "-"
                        if " =misscritsys " =~ " $host "
                                output mail "$host is down, or off the network"
# ifdef page
                                if ! #defined(%state) || $state ne %state
                                        exec wait "echo '$host is down' | =mailx -s '$host is down' pagemozart\ pagebrahms\ pageliszt\"
                                endif
# endifdef
                        elseif ! #defined(%state) || $state ne %state
                                output mail "$host is down, or off the network"
                        endif
                endif

#endif  // piktmaster

///////////////////////////////////////////////////////////////////////////////
The only difference from the SysDown version sent out last week is the addition of the second rule.
For the "mission critical" systems (specified in the systems.cfg #include file <systems/misscritsys_systems.cfg>; see last week's message for more detail), we issue an 'rsh $host hostname' command to the current host. We skip any systems (in this case seville only) that don't support RPC services (because we have commented them out of inetd.conf). Because the host's RPC services might be down, and since we don't want this alarm to hang indefinitely, we put a '=maxtime $timeout ...' wrapper around the rsh command. And because =maxtime issues a 'spawn rsh <host> hostname' as its first line of output, we have to =behead that first line.
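The timeout-and-behead idea can be sketched outside of Pikt. The Python below is only an illustration, not how =maxtime or =behead are actually implemented. The helper names run_with_timeout and behead, and the hostname vienna, are hypothetical, and a local Python child process stands in for the remote rsh call.

```python
import subprocess
import sys


def behead(text):
    """Drop the first line of text (e.g. a wrapper's 'spawn ...' banner)."""
    return "".join(text.splitlines(keepends=True)[1:])


def run_with_timeout(cmd, timeout):
    """Run cmd and return its stdout, or "" if it does not finish in time."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return ""  # treat a hung command like a host that never answered
    return result.stdout


# Stand-in for a wrapped 'rsh <host> hostname': prints a banner line, then
# the hostname. A real check would run the remote command instead.
fake_remote = [
    sys.executable, "-c",
    "print('spawn rsh vienna hostname'); print('vienna')",
]
# Stand-in for a remote command that hangs because the host is sick.
hung_remote = [sys.executable, "-c", "import time; time.sleep(30)"]

answer = behead(run_with_timeout(fake_remote, timeout=20))
no_answer = behead(run_with_timeout(hung_remote, timeout=1))
```

In this sketch a healthy host leaves the hostname plus its trailing linefeed in answer, while a hung command leaves no_answer empty, which is the case the alarm treats as "sick".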
Interestingly, if a system responds, the $command() returns the hostname with an extra linefeed tacked on the end. We can either $chop() that off and do a direct string compare (using 'ne'), or we can attempt a regexp match on the $command() output (using '!~').
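The difference between the two comparisons is easy to see in a small Python sketch. This is an analogy only, not Pikt code: rstrip("\n") stands in for $chop(), re.search for the regexp match, and the hostname vienna and the helper host_is_sick are hypothetical.

```python
import re

host = "vienna"     # hypothetical hostname
reply = "vienna\n"  # what a healthy host sends back, linefeed included

# A direct compare fails because of the trailing linefeed.
naive_equal = (reply == host)

# Stripping the linefeed first makes the direct compare work.
chopped_equal = (reply.rstrip("\n") == host)

# A regexp (substring) match does not care about the linefeed.
regex_match = re.search(re.escape(host), reply) is not None


def host_is_sick(reply, host):
    """True when the reply, minus its trailing linefeed, is not the hostname."""
    return reply.rstrip("\n") != host
```

One design point worth weighing: the regexp match is looser than the exact compare, since a reply of "vienna2" would also match "vienna". The chop-and-compare form avoids that.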
Before we move on, note one more thing: the $command() includes a 'hostname', not a '=hostname', since we want the hostname command on the remote system, not on the invoking piktmaster. By referencing 'hostname' we get the hostname command in the remote system's root PATH. If this is not sufficient, we could reference a database of hostname commands that we set up in our hosts.cfg (or perhaps figure out a way to get the appropriate hostname command from the <macros/unixcmds_*_macros.cfg> databases).
Moving on, if we don't get the current host's hostname back from the $command() (if we get back, for example, the empty string), we conclude that the system's RPC services at least are down. We then follow the sequence described in last week's message, but with the following nuances: (1) as before with a failed ping, we get paged only once (saying the system is "sick/down"), but we get e-mailed only once (coding this to send out "nagmail" in the sick system case is left as an exercise for the reader); (2) if the system's RPC services are sick, there is no sense also doing the ping test; (3) if the RPC services are healthy, we go on to do the ping test for that host anyway as a doublecheck (note that the 'next' command to bypass the ping test is inside the "sick system" if ... endif).
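The "page just once per downage incident" test, '! #defined(%state) || $state ne %state', can be modeled in a few lines of Python. This is a sketch under the assumption that %state holds the value $state had on the previous run for the same host, and is undefined on the first run. The function names and the sample history are hypothetical.

```python
from typing import Optional


def should_page(previous: Optional[str], current: str) -> bool:
    """Model of '! #defined(%state) || $state ne %state'."""
    return previous is None or current != previous


def pages_sent(states):
    """For each run's state ("+" up, "-" down), report whether a page goes out."""
    previous = None  # no previous state on the very first run
    pages = []
    for current in states:
        # The script only pages inside the "down" branch, hence the "-" check.
        pages.append(current == "-" and should_page(previous, current))
        previous = current
    return pages


# Up, then down for three runs, back up, then down again.
history = ["+", "-", "-", "-", "+", "-"]
pages = pages_sent(history)
```

In this model a page goes out only on the run where the host first turns "-", and again only after it has come back up and gone down a second time.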
A funny thing happened on the way to fine-tuning this alarm. I missed the subtlety about the extra linefeed tacked onto the end of the remote hostname output, whereupon my initial tests were sending off pages right and left (well, seven for each test anyway, for each of our seven mission critical systems).
I quickly vi'ed the EMERGENCY.alt file and employed a little-known feature of Pikt scripts. By changing 'status active' to 'status inactive'--in the .alt file, not in the alarms.cfg!--I was able to turn the SysDown alarm off temporarily. (The rest of the alarms in the EMERGENCY alert package continued to run during the debug.)
Then I cp'ed the EMERGENCY.alt file to Test.alt and proceeded to edit and debug the Test.alt directly--again bypassing the alarms.cfg file. Once I had the Test.alt done just right, I mirrored the modifications in the alarms.cfg, reinstalled EMERGENCY.alt (using 'piktc -iv +A EMERGENCY +H piktmaster'), then rm'ed the Test .alt, .hst & .log files.
We are now well prepared to deal with the possible impending downage of our critical systems. With luck, that second rule will activate in time to give us the chance to at least issue a remote reboot command, sparing us the ordeal of having to drive over to the site and reboot firsthand.
This is another good example of PIKT's flexibility and extensibility. Note how we are able to tailor the SysDown alarm exactly to our situation and liking. This flexibility and extensibility is one of the things that makes PIKT such a useful tool, IMHO.
For more examples, see Developer's Notes.