Systems Down Macro
The systems_down_alarms_macros.cfg file defines a script macro, =systems_down(), that reports when systems go down or drop off the network and, optionally (when verbose is defined), when they come back up.
///////////////////////////////////////////////////////////////////////////////
//
// systems_down_alarms_macros.cfg
//
///////////////////////////////////////////////////////////////////////////////
systems_down(SYS)

        init
                status =piktstatus
                level =piktlevel
                task "Report systems down or off the network"
                input proc "(SYS)"
                dat $host 1
                keys $host

        begin
                // initialize the mission-critical systems list
                =initmisscrit
                if $alert() =~ "DownSystems|DownServers|DownClients"
                        set #interactive = #true()
                else
                        set #interactive = #false()
                fi

        rule // determine if host is mission-critical
                =setmisscrit

        rule
                =bypass_server_reboots

        rule // if high-level alert, bypass non mission-critical systems
                if =highlevelalert
                        if ! #misscrit
                                set $state = %state
                                next
                        fi
                fi

        rule // if low-level alert, bypass mission-critical systems
                if ! =highlevelalert
                        if #misscrit
                                set $state = %state
                                next
                        fi
                fi

#ifdef debug
        rule
                output $host
#elsedef
        rule
                if #interactive
                        output $host
                        output =newline
                fi
#endifdef

        rule // initialize messages
                set $dnmsg = "$host is down, or off the network (ping failure)"
                set $upmsg = "$host is (back) up"

        rule // initially, assume system is up
                set $state = "+"

        rule // ping the host
                // do the initial poll quickly
                if =pingfail($host, 1, 1)
                        // if the first ping failed, try again with more
                        // retries and longer timeouts
                        if =pingfail($host, 3, 5)
                                set $state = "-"
                        fi
                fi

        rule
                if $state eq "+"
#ifdef verbose
                        // for high-level alerts, report if systems back up
                        if =highlevelalert
                                // if state was "-", is now "+", report change
                                if #defined(%state) && $state ne %state
                                        output mail $upmsg
                                fi
                        fi
#endifdef
                        next
                fi

        // only down hosts after this point

        rule // report downages for interactive scripts
                if #interactive
                        output $dnmsg
                        output =newline
                        next
                fi

        // non-interactive scripts after this point

        rule // page, but only periodically, if highest-level alert
                if =highestlevelalert
                        =hourly(=page($dnmsg, =allpager, =always), )
                fi

        rule // for all systems, always report new downages
                if $state ne %state
                        output mail $dnmsg
                        next
                fi

        rule // for mission-critical systems, report continuing downages,
                // but only periodically
                if #misscrit
                        =every_four_hours(output mail $dnmsg, )
                        if =highestlevelalert
                                =hourly(=output_other_mail(SYSDOWN, 'PIKT SysDown', =sysadmins, $dnmsg), )
                        fi
                        next
                fi

        end
                quit
///////////////////////////////////////////////////////////////////////////////
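The ping rule in the macro probes each host in two stages: a quick first poll, then a slower retry only if the first poll fails, so that a single dropped packet does not mark a host as down. As a rough illustration of that escalation outside Pikt, here is a minimal Python sketch. The names probe_state and ping_fails are hypothetical, and reading the two numeric arguments of =pingfail as a retry count and a timeout is an assumption based on the macro's comments; the sketch does not reproduce =pingfail itself.

```python
from typing import Callable

# Hypothetical probe signature (an assumption, not the =pingfail macro):
# (host, retries, timeout_seconds) -> True if the ping FAILED.
PingFails = Callable[[str, int, int], bool]


def probe_state(host: str, ping_fails: PingFails) -> str:
    """Return "+" if the host answers either probe, "-" if both probes fail."""
    state = "+"                        # initially, assume the system is up
    if ping_fails(host, 1, 1):         # quick first poll
        if ping_fails(host, 3, 5):     # more patient second poll
            state = "-"
    return state


if __name__ == "__main__":
    # Stub probe for demonstration: "oslo" never answers,
    # "kiev" misses only the quick poll, everything else answers at once.
    def fake_ping_fails(host: str, retries: int, timeout: int) -> bool:
        if host == "oslo":
            return True
        if host == "kiev":
            return retries == 1
        return False

    for name in ("oslo", "kiev", "vienna"):
        print(name, probe_state(name, fake_ping_fails))
```

Passing the probe in as a function keeps the escalation logic separate from the way a ping is actually sent, so the sketch can be exercised without touching the network.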
You might invoke the =systems_down() macro in your alarms.cfg file as follows:
///////////////////////////////////////////////////////////////////////////////
//
// downage_alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////
#if piktmaster

SysDown

        =systems_down(=piktc -L -H down)

#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster | piktmistress

ACDown

        =systems_down(=piktc -L +H ac)

#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster | piktmistress

PowerDown

        =systems_down(=piktc -L +H power)

#endif
///////////////////////////////////////////////////////////////////////////////
where 'down' is a host group of known down systems (specified in down_systems.cfg), 'ac' is a host group of networked air-conditioning systems, and 'power' is a host group of networked power supply systems.
Because monitoring air conditioning and power unit downages is so vital, we run these scripts on both the piktmaster system and the so-called 'piktmistress', an alias (specified in systems.cfg) for a system that backs up the piktmaster for certain crucial functions.
In the SysDown macro invocation, the macro argument '=piktc -L -H down' is a call to piktc to list all systems except the known down systems. Similarly, the macro arguments in ACDown and PowerDown have piktc list the air conditioning and power systems, respectively.
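The difference between the two argument forms is exclusion versus inclusion: '-H down' removes a group from the full host list, while '+H ac' keeps only the members of a group. The following Python sketch models just that selection step. It is not piktc; the function select_hosts, the host names, and the group contents are all made up for illustration.

```python
from typing import Dict, List, Optional


def select_hosts(
    all_hosts: List[str],
    groups: Dict[str, List[str]],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Start from one group (or every host), then drop an excluded group."""
    # Inclusion: restrict to the named group, otherwise take every host.
    selected = list(groups[include]) if include is not None else list(all_hosts)
    # Exclusion: remove every member of the named group.
    if exclude is not None:
        dropped = set(groups[exclude])
        selected = [h for h in selected if h not in dropped]
    return selected


if __name__ == "__main__":
    # Hypothetical inventory; none of these names come from a real site.
    hosts = ["vienna", "oslo", "kiev", "ac1", "ac2", "ups1"]
    groups = {"down": ["oslo"], "ac": ["ac1", "ac2"], "power": ["ups1"]}

    print(select_hosts(hosts, groups, exclude="down"))   # SysDown-style list
    print(select_hosts(hosts, groups, include="ac"))     # ACDown-style list
    print(select_hosts(hosts, groups, include="power"))  # PowerDown-style list
```

Excluding the 'down' group keeps hosts that are already known to be down from being reported again on every run.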
Output from this script might look like this:
URGENT:
SysDown
Report systems down or off the network
oslo is down, or off the network (ping failure)
manchester is down, or off the network (ping failure)
kiev is (back) up
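The sample output mixes two kinds of message: downage reports and a recovery notice. Which one a host produces depends on its current state compared with the state remembered from the previous run. The Python sketch below is a simplified model of that decision, under several assumptions: the function name and its flags are hypothetical, a missing previous state is treated as different from "-", and the periodic paging and reminder mail for continuing downages are left out.

```python
from typing import Optional


def report(
    host: str,
    state: str,
    previous: Optional[str],
    *,
    interactive: bool = False,
    verbose: bool = False,
    high_level: bool = False,
) -> Optional[str]:
    """Return the message to send for this host, or None for no message."""
    up_msg = f"{host} is (back) up"
    down_msg = f"{host} is down, or off the network (ping failure)"

    if state == "+":
        # Recovery notices: verbose mode and high-level alerts only, and only
        # when a remembered state exists and differs from the current one.
        if verbose and high_level and previous is not None and previous != state:
            return up_msg
        return None

    # Only down hosts reach this point.
    if interactive:
        return down_msg          # interactive runs always show the downage
    if state != previous:
        return down_msg          # new downage: always reported
    return None                  # continuing downage: reminders not modeled


if __name__ == "__main__":
    print(report("oslo", "-", "+"))
    print(report("kiev", "+", "-", verbose=True, high_level=True))
    print(report("vienna", "+", "+", verbose=True, high_level=True))
```

Comparing against the remembered state is what keeps a host that stays down from generating mail on every run.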
For more examples, see Samples.