System Downages
[posted 1999/11/17]
One of our mission critical systems crashed last night. The PIKT SysDownEmergency alarm caught this and dutifully sent alert e-mail to the sysadmins, but we (I) didn't read the e-mail until about a half hour later, whereupon there ensued a scramble to call and page persons closer to the scene... It's a long story. Suffice it to say that now is the time to chip away some more at that PIKT ToDo Mountain and implement paging.
The following system downages example will be a bit convoluted. It makes use of several optional PIKT components that, once understood, make my life (at any rate) easier.
First, I created a new #include file, /pikt/lib/configs/systems/misscritsys_systems.cfg, with the following content:
vienna berlin moscow warsaw munich milan athens2
Then I created a new host group in systems.cfg:
misscritsys
        members
#include <systems/misscritsys_systems.cfg>
Note that support for #include files in systems.cfg is a new feature, available only in the latest developer's release (currently pikt-991114, aka pikt-dev or pikt-1.8.0pre), which can be downloaded from the PIKT Web site. This developer's release should work fine on all supported systems except AIX and IRIX.
(If you are wondering why #ifdef's and #if's are forbidden in systems.cfg, this is why: piktc processes systems.cfg first, defines.cfg second. piktc can't deal with #ifdef's in systems.cfg because it doesn't know about any #define's yet. Also, we can't make use of #if's yet because we are still in the process of defining our systems.)
I also created a new macro in macros.cfg that invokes this very same #include file:
misscritsys
# include <systems/misscritsys_systems.cfg>
I will need to specify mission critical systems in both #if preprocessor statements and =misscritsys macro references. With the single #include file, I have to specify these systems in just one place.
In alerts.cfg, I run the more important alerts more frequently for the mission critical systems. So, for the EMERGENCY, Urgent, and Critical alerts, I made use of the misscritsys host group as follows, for example:
EMERGENCY       // things that require immediate attention
#if misscritsys
        timing          10,25,40,55 * * * *
#else
        timing          10,40 * * * *
#endif
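To see what the two timing lines buy us, here is a rough sketch in Python (not PIKT) that compares them. It assumes the first timing field is minutes past the hour, as in cron, and it only handles plain comma-separated lists like the ones above. The function names are made up for illustration.

```python
def minute_runs(minute_field: str) -> list[int]:
    """Expand a comma-separated minute field into a sorted list of minutes."""
    return sorted(int(m) for m in minute_field.split(","))


def max_gap(minutes: list[int]) -> int:
    """Longest wait, in minutes, between consecutive runs (wrapping the hour)."""
    gaps = [later - earlier for earlier, later in zip(minutes, minutes[1:])]
    gaps.append(minutes[0] + 60 - minutes[-1])  # last run of the hour to the first of the next
    return max(gaps)


for label, field in (("misscritsys", "10,25,40,55"), ("all others", "10,40")):
    runs = minute_runs(field)
    print(f"{label}: {len(runs)} runs/hour, longest gap {max_gap(runs)} min")
```

Under those assumptions, a mission critical system is checked at least every 15 minutes, while everything else can go up to 30 minutes between checks.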
I also added another #define in defines.cfg giving me the ability to turn paging on and off globally at will:
page TRUE // if TRUE, then issue pages, else keep silent
Now, on to the heart of the matter. Before, I had a single SysDownEmergency alarm to alert us about system downages. For various reasons, I decided it was better to break this into two separate alarms, a revised SysDownEmergency and a new SysDownWarning:
///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDownEmergency

        init
                status active
                level emergency
                task "Detect system crashes, or systems going off the network"
                input file "=hostinfo_obj"
                dat $host 1     // ignore the rest of the fields in HostInfo.obj
                keys $host

        begin
                set $timeout = "20"     // yes, string var here
                =set_timenow
                =set_hr
                =set_dow
                // bypass weekly reboot period
                if =reboot_period
                        quit
                endif

        rule    // exclude systems known to be down
                if " =downsys " =~ " $host "
                        next
                endif

        rule    // report if system goes down; repeat only if system goes up
                // then back down again; for certain mission-critical systems,
                // report every time (issue repeated nagmail), also page but
                // just once per downage incident
# if linux | freebsd
                if $command("=ping -c 1 $host | =tail -2 | =head -1") =~ " 0% packet loss"
# elif hpux
                if $command("=ping $host -n 1 | =tail -2 | =head -1") =~ " 0% packet loss"
# elif solaris | sunos
                if $command("=ping $host $timeout") =~ "is alive"
# endif
                        set $state = "+"
                else
                        set $state = "-"
                        if " =misscritsys " =~ " $host "
                                output mail "$host is down, or off the network"
# ifdef page
                                if ! #defined(%state) || $state ne %state
                                        exec wait "echo '$host is down' | =mailx -s '$host is down' pagemozart\
                                                                                       pagebrahms\
                                                                                       pageliszt\"
                                endif
# endifdef
                        elseif ! #defined(%state) || $state ne %state
                                output mail "$host is down, or off the network"
                        endif
                endif

#endif  // piktmaster

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDownWarning

        init
                status active
                level warning
                task "Detect systems down or off the network"
                input file "=hostinfo_obj"
                dat $host 1     // ignore the rest of the fields in HostInfo.obj

        begin
                set $timeout = "20"     // yes, string var here

        rule    // report if system doesn't respond to ping
# if linux | freebsd
                if $command("=ping -c 1 $host | =tail -2 | =head -1") =~ " 0% packet loss"
# elif hpux
                if $command("=ping $host -n 1 | =tail -2 | =head -1") =~ " 0% packet loss"
# elif solaris | sunos
                if $command("=ping $host $timeout") =~ "is alive"
# endif
                        // do nothing
                else
                        output mail "$host is down, or off the network"
                endif

#endif  // piktmaster

///////////////////////////////////////////////////////////////////////////////
Some commentary:
In SysDownEmergency, the begin section aborts the alarm during our weekly reboot period. (For maintenance and general system checks, we reboot many of our systems once weekly. Other sites like to keep their systems up as long as possible, then brag about their uptimes. Your mileage may vary.)
The first rule has us bypass systems known to be down for an extended period.
In the second rule, we determine whether systems are up by means of the ping command. Now, a system might be pingable but still hosed. Eventually, I will create other auxiliary alarms to give us advance warning when systems are sick but still short of totally dead.
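One detail in the ping tests is easy to miss: the pattern is " 0% packet loss" with a leading space, so a line reporting "100% packet loss" does not match. Here is a small Python sketch of the same up/down test, outside PIKT. The function name, the style labels, and the sample ping lines are my own illustrations; real ping output varies by platform and version.

```python
def is_up(ping_output: str, style: str) -> bool:
    """Decide whether a host is up from one line of ping output.

    style "loss"  : summary-line check (the linux, freebsd, hpux branches)
    style "alive" : "is alive" check (the solaris, sunos branch)
    """
    if style == "loss":
        # The leading space keeps "100% packet loss" from matching.
        return " 0% packet loss" in ping_output
    if style == "alive":
        return "is alive" in ping_output
    raise ValueError(f"unknown ping style: {style}")


# Illustrative sample lines only, not captured from the systems in this note.
print(is_up("1 packets transmitted, 1 received, 0% packet loss, time 0ms", "loss"))
print(is_up("1 packets transmitted, 0 received, 100% packet loss, time 0ms", "loss"))
print(is_up("vienna is alive", "alive"))
```

Without the leading space, the second sample line would count as "up", which is exactly the wrong answer for a dead host.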
If a system is up (pingable), we set the $state to "+" and eventually store that in the history database. If a system is down, we set $state to "-" and store that for later recall.
I strongly suggest that whenever you reference a history (%) value, you first check whether it is #defined(). If you don't, the history mechanism might not work as you expect.
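The "is it defined?" check matters most on the very first run, when there is no previous value to compare against. As a rough illustration in Python (not PIKT), with a plain dict standing in for the history database and a hypothetical function name, the test "! #defined(%state) || $state ne %state" behaves something like this:

```python
def state_changed(history: dict[str, str], host: str, state: str) -> bool:
    """Return True if the host's state is new or differs from last time.

    A missing entry plays the role of an undefined %state, and is
    treated as a change, as in the alarm's test.
    """
    previous = history.get(host)  # None when there is no history yet
    changed = previous is None or previous != state
    history[host] = state         # remember this run's state for next time
    return changed


history: dict[str, str] = {}
print(state_changed(history, "vienna", "-"))  # first run, no history yet
print(state_changed(history, "vienna", "-"))  # still down, same as last time
print(state_changed(history, "vienna", "+"))  # back up
```

Only the first and third calls count as a change. The middle call is the "still down" case where we do not want to page again.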
For down systems, if they are members of the misscritsys (mission critical systems) set, we always emit nagmail, every time this alarm runs. If we have #define'd page (and we have, in defines.cfg), and if the system's $state has changed from "+" (last time) to "-" (this time), we issue a page just this once to the sysadmins. (It only now occurs to me that I could shorten the addressee to "pagesysadmins", where "pagesysadmins" is a mail alias resolving to the individual personal page aliases.) We implement paging via mail aliases, which resolve to a special executable set up for this purpose. (Again, your mileage may vary.)
For non-mission-critical systems, we don't page. Instead, we send out e-mail only if a system is newly down. Contrast this with the nagmail (sent each and every time) for the mission critical systems.
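Putting the mail and page decisions for a down system side by side, here is a loose Python sketch (not PIKT). The function names are hypothetical, "paris" is a made-up host that is not in the mission critical list, and "changed" stands for the test "! #defined(%state) || $state ne %state".

```python
MISSCRITSYS = "vienna berlin moscow warsaw munich milan athens2"


def is_member(host: str, members: str) -> bool:
    """Whole-word membership test, like " =misscritsys " =~ " $host "."""
    # Padding both sides with spaces keeps "athens" from matching "athens2".
    return f" {host} " in f" {members} "


def actions_for_down_host(host: str, changed: bool, page_enabled: bool,
                          members: str = MISSCRITSYS) -> list[str]:
    """List what the alarm would do for a host that failed the ping test."""
    if is_member(host, members):
        actions = ["mail"]            # nagmail on every run
        if page_enabled and changed:
            actions.append("page")    # page once per downage incident
        return actions
    return ["mail"] if changed else []  # others: mail only when newly down


print(actions_for_down_host("vienna", changed=True, page_enabled=True))
print(actions_for_down_host("vienna", changed=False, page_enabled=True))
print(actions_for_down_host("paris", changed=True, page_enabled=True))
print(actions_for_down_host("paris", changed=False, page_enabled=True))
```

The page_enabled flag corresponds to whether page is #define'd, so flipping it off silences the pager while leaving the nagmail in place.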
Note the following differences in the much simpler SysDownWarning alarm: There is no reboot_period bypass, because we run this alarm just once at a different time of day. There is no downsys bypass, so that we can get a daily reminder of all system downages. Finally, in all cases, mission critical and not so, a warning message is sent, but just once a day, because this alarm is scheduled to run just once a day.
This is a suggested example of how you might handle the SysDown problem. Modify to suit your needs. (Your mileage *will* vary.)
For more examples, see Developer's Notes.