Systems Down Macro
The systems_down_alarms_macros.cfg file defines a script macro for reporting when systems go down or off the network and, optionally, when they come back up.
///////////////////////////////////////////////////////////////////////////////
//
// systems_down_alarms_macros.cfg
//
///////////////////////////////////////////////////////////////////////////////

systems_down(SYS)

        init
                status =piktstatus
                level =piktlevel
                task "Report systems down or off the network"
                input proc "(SYS)"
                dat $host 1
                keys $host

        begin
                // initialize the mission-critical systems list
                =initmisscrit
                if $alert() =~ "DownSystems|DownServers|DownClients"
                        set #interactive = #true()
                else
                        set #interactive = #false()
                fi

        rule    // determine if host is mission-critical
                =setmisscrit

        rule
                =bypass_server_reboots

        rule    // if high-level alert, bypass non mission-critical systems
                if =highlevelalert
                        if ! #misscrit
                                set $state = %state
                                next
                        fi
                fi

        rule    // if low-level alert, bypass mission-critical systems
                if ! =highlevelalert
                        if #misscrit
                                set $state = %state
                                next
                        fi
                fi

#ifdef debug
        rule
                output $host
#elsedef
        rule
                if #interactive
                        output $host
                        output =newline
                fi
#endifdef

        rule    // initialize messages
                set $dnmsg = "$host is down, or off the network (ping failure)"
                set $upmsg = "$host is (back) up"

        rule    // initially, assume system is up
                set $state = "+"

        rule    // ping the host
                // do the initial poll quickly
                if =pingfail($host, 1, 1)
                        // if the first ping failed, try again with more
                        // retries and longer timeouts
                        if =pingfail($host, 3, 5)
                                set $state = "-"
                        fi
                fi

        rule
                if $state eq "+"
#ifdef verbose
                        // for high-level alerts, report if systems back up
                        if =highlevelalert
                                // if state was "-", is now "+", report change
                                if #defined(%state) && $state ne %state
                                        output mail $upmsg
                                fi
                        fi
#endifdef
                        next
                fi

        // only down hosts after this point

        rule    // report downages for interactive scripts
                if #interactive
                        output $dnmsg
                        output =newline
                        next
                fi

        // non-interactive scripts after this point

        rule    // page, but only periodically, if highest-level alert
                if =highestlevelalert
                        =hourly(=page($dnmsg, =allpager, =always), )
                fi

        rule    // for all systems, always report new downages
                if $state ne %state
                        output mail $dnmsg
                        next
                fi

        rule    // for missioncritical systems, report continuing downages,
                // but only periodically
                if #misscrit
                        =every_four_hours(output mail $dnmsg, )
                        if =highestlevelalert
                                =hourly(=output_other_mail(SYSDOWN, 'PIKT SysDown', =sysadmins, $dnmsg), )
                        fi
                        next
                fi

        end
                quit

///////////////////////////////////////////////////////////////////////////////
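For readers less familiar with Pikt script, the core decision flow of the macro can be approximated in Python. This is only an illustrative sketch, not part of PIKT: the function names, the action labels, and the `ping_fails` callback are hypothetical. It leaves out the mission-critical bypass rules and the actual throttling done by =hourly and =every_four_hours, and it treats a missing previous state as a change. The numeric arguments are copied from the two =pingfail calls; what each number means is not stated here.

```python
from typing import Callable, List, Optional

UP, DOWN = "+", "-"


def probe_state(host: str, ping_fails: Callable[[str, int, int], bool]) -> str:
    """Return DOWN only if both the quick poll and the slower follow-up poll fail."""
    # The numbers mirror =pingfail($host, 1, 1) and =pingfail($host, 3, 5);
    # their exact meaning (retries, timeout) is an assumption left to the callback.
    if ping_fails(host, 1, 1) and ping_fails(host, 3, 5):
        return DOWN
    return UP


def report_actions(state: str, prev_state: Optional[str], *,
                   interactive: bool = False, misscrit: bool = False,
                   high_level: bool = False, highest_level: bool = False,
                   verbose: bool = False) -> List[str]:
    """List the (hypothetically named) reporting actions for one host."""
    actions: List[str] = []
    if state == UP:
        # Recoveries are reported only in verbose builds, for high-level alerts,
        # and only when a previous state exists and differs.
        if verbose and high_level and prev_state is not None and state != prev_state:
            actions.append("mail up")
        return actions
    # Only down hosts after this point.
    if interactive:
        return ["print down"]
    if highest_level:
        actions.append("page down (hourly)")
    if state != prev_state:
        # New downages are always mailed.
        actions.append("mail down")
        return actions
    if misscrit:
        # Continuing downages are reported only for mission-critical hosts.
        actions.append("mail down (every four hours)")
        if highest_level:
            actions.append("other mail down (hourly)")
    return actions
```

For example, `report_actions(DOWN, UP)` yields a single "mail down" action, while `report_actions(DOWN, DOWN)` for a host that is not mission-critical yields no actions at all.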
You might invoke the =systems_down() macro in your alarms.cfg file as follows:
///////////////////////////////////////////////////////////////////////////////
//
// downage_alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDown

        =systems_down(=piktc -L -H down)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster | piktmistress

ACDown

        =systems_down(=piktc -L +H ac)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster | piktmistress

PowerDown

        =systems_down(=piktc -L +H power)

#endif

///////////////////////////////////////////////////////////////////////////////
where 'down' is a host group of known down systems (specified in down_systems.cfg), 'ac' is a host group of networked air-conditioning systems, and 'power' is a host group of networked power supply systems.
Because monitoring air-conditioning and power-unit downages is so vital, we run these scripts on both the piktmaster system and the so-called 'piktmistress', an alias (specified in systems.cfg) for a system that backs up the piktmaster for certain crucial functions.
In the SysDown macro invocation, the macro argument '=piktc -L -H down' is a call to piktc to list all systems except for known down systems. Similarly, the macro arguments in ACDown and PowerDown have piktc list the air-conditioning and power systems, respectively.
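The include/exclude selection described above can be pictured with a small Python sketch. It is not how piktc is implemented; the function `select_hosts`, the host list, and the group contents are all hypothetical and serve only to show the difference between excluding a group (as with '-H down') and selecting a group (as with '+H ac').

```python
from typing import Dict, List, Set


def select_hosts(all_hosts: List[str], groups: Dict[str, Set[str]], *,
                 include: str = "", exclude: str = "") -> List[str]:
    """Keep hosts in the `include` group (if given) and drop those in `exclude`."""
    chosen = [h for h in all_hosts if not include or h in groups[include]]
    return [h for h in chosen if not exclude or h not in groups[exclude]]


# Hypothetical inventory and host groups, for illustration only.
hosts = ["oslo", "manchester", "kiev", "ac1", "ups1"]
groups = {"down": {"kiev"}, "ac": {"ac1"}, "power": {"ups1"}}

sysdown_targets = select_hosts(hosts, groups, exclude="down")  # like '-H down'
acdown_targets = select_hosts(hosts, groups, include="ac")     # like '+H ac'
```

In this sketch every host outside the 'down' group is polled by SysDown, while ACDown polls only the members of 'ac'.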
Output from this script might look like this:
                                URGENT:
                                SysDown
                 Report systems down or off the network

oslo is down, or off the network (ping failure)

manchester is down, or off the network (ping failure)

kiev is (back) up
For more examples, see Samples.