Service Downage Macro
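The service_downage macro, defined in service_downage_alarms_macros.cfg, is a script macro that reports when services go down (or, optionally, come back up) on remote systems. Its arguments are the alert level (L), the service name used in messages (S), the command that supplies the list of hosts to check (I), the test that is true when the service has failed (T), a pattern of alert names for which downages are written to the alert output rather than mailed (A), and a short description of the failed check (M).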
service_downage(L, S, I, T, A, M)
    init
        status =piktstatus
        level (L)
        task "Report (S) service downages on remote systems"
        input proc "(I)"
        dat $host 1
        keys $host
    begin
        set $missioncritical = $command("=piktc -L +H missioncritical | =oneline")
#ifdef debug
        output $missioncritical
#endifdef
    rule
        set $dnmsg = "$host's (S) services are down ((M))"
        set $upmsg = "$host's (S) services are back up"
    rule
        if $alert() =~~ "red|emergency|server"
            if " $missioncritical " !~ " $host "
                set $state = %state
                next
            fi
        fi
    rule
        if $alert() =~~ "client"
            if " $missioncritical " =~ " $host "
                set $state = %state
                next
            fi
        fi
#ifdef debug
    rule
        output $host
#elsedef
    rule
        if $alert() =~ "(A)"
            output $host
            output $newline()
        fi
#endifdef
    rule    // initially, assume the service is up
        set $state = "+"
    rule
        =bypass_server_reboots
    rule
        if (T)
            if $alert() =~ "(A)"
                output $dnmsg
                output =newline
            else
                set $state = "-"
                // for all systems, always report new downages
                if $state ne %state
                    output mail $dnmsg
                // but for missioncritical systems, report
                // continuing downages only periodically
                elsif " $missioncritical " =~ " $host "
                    =every_four_hours(output mail $dnmsg, )
                fi
            fi
            next
        fi
    rule    // for missioncritical systems, if state was "-",
            // is now "+", then report change
        if " $missioncritical " =~ " $host "
           && $state ne %state
            output mail $upmsg
        fi
    end
        quit
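The reporting decisions in the last two rules can be easier to follow as a small state machine. The following is a minimal sketch in Python, not Pikt, and only models the mail branch of the macro: the function name decide and its parameters are hypothetical, reminder_due stands in for whatever =every_four_hours evaluates to, and the =bypass_server_reboots step, the (A) output branch, and the first run (when no previous state exists) are not modeled.

```python
from typing import Optional, Tuple


def decide(host: str, service: str, detail: str, failed: bool,
           prev_state: str, mission_critical: bool,
           reminder_due: bool = False) -> Tuple[str, Optional[str]]:
    """Return (new_state, message_to_mail_or_None) for one host.

    "+" means the service is up, "-" means it is down, mirroring
    $state / %state in the macro. All names here are illustrative.
    """
    down_msg = f"{host}'s {service} services are down ({detail})"
    up_msg = f"{host}'s {service} services are back up"

    if failed:
        state = "-"
        if state != prev_state:
            # New downage: always reported, for every system.
            return state, down_msg
        if mission_critical and reminder_due:
            # Continuing downage: only mission-critical systems,
            # and only when the periodic reminder is due.
            return state, down_msg
        return state, None

    state = "+"
    if mission_critical and state != prev_state:
        # Recovery is reported for mission-critical systems only.
        return state, up_msg
    return state, None


if __name__ == "__main__":
    print(decide("helsinki", "RPC", "rpcinfo -p failure",
                 failed=True, prev_state="+", mission_critical=False))
    print(decide("rouen", "SSH", "telnet to port 22 failure",
                 failed=False, prev_state="-", mission_critical=True))
```

In this model a non-mission-critical host produces exactly one message per downage, while a mission-critical host also produces periodic reminders and a recovery notice.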
You might invoke the =service_downage() macro in an alarms.cfg file (here, network_alarms.cfg) as follows:
///////////////////////////////////////////////////////////////////////////////
//
// network_alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
RpcDown
    =service_downage(warning, RPC, =piktc -L +H pikt -H down sick, =rpcfail($host),
                     DownRpc|DownRpcServers|DownRpcClients, rpcinfo -p failure)
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
SshDown
    =service_downage(warning, SSH, =piktc -L +H pikt -H down sick, =sshfail($host),
                     DownSsh, telnet to port 22 failure)
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
SmtpDown
    =service_downage(warning, SMTP, =piktc -L +H pikt -H nosmtp down sick, =smtpfail($host),
                     DownSmtp, telnet to port 25 failure)
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
HttpDown
    =service_downage(info, HTTP, =piktc -L +H webserver -H down sick, =httpfail($host),
                     DownHttp, telnet to port 80 failure)
#endif
///////////////////////////////////////////////////////////////////////////////
where 'down' is a host group of known down systems (specified in down_systems.cfg) and 'sick' is a host group of known "sick" systems (systems up and running but somehow impaired).
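To see what one of these invocations turns into, the positional substitution of (L), (S), (I), (T), (A), and (M) can be modeled in a few lines. This is a sketch in Python that assumes plain textual replacement of each placeholder; it is not PIKT's actual preprocessor, and the names PARAMS and expand are hypothetical.

```python
import re

# Parameter names in the order they appear in the macro header.
PARAMS = ("L", "S", "I", "T", "A", "M")


def expand(template: str, args: list) -> str:
    """Replace each (X) placeholder with the matching positional argument.

    A single regex pass is used so that text inserted by one argument
    is never re-scanned for further placeholders.
    """
    values = dict(zip(PARAMS, args))
    return re.sub(r"\(([LSITAM])\)", lambda m: values[m.group(1)], template)


# Arguments of the RpcDown invocation, split at the commas.
rpc_args = [
    "warning",
    "RPC",
    "=piktc -L +H pikt -H down sick",
    "=rpcfail($host)",
    "DownRpc|DownRpcServers|DownRpcClients",
    "rpcinfo -p failure",
]

if __name__ == "__main__":
    for line in ('level (L)',
                 'task "Report (S) service downages on remote systems"',
                 'input proc "(I)"',
                 'if (T)',
                 'set $dnmsg = "$host\'s (S) services are down ((M))"'):
        print(expand(line, rpc_args))
```

Note that the doubled parentheses in ((M)) leave one literal pair around the substituted text, which is why the downage message ends in "(rpcinfo -p failure)".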
Output from these alarms might look like this:
URGENT:
RpcDownUrgent
Report RPC downages on remote systems
helsinki's RPC services are down (rpcinfo -p failure)
URGENT:
SshDownUrgent
Report SSH service downages on remote systems
helsinki's SSH services are down (telnet to port 22 failure)
rouen's SSH services are back up
For more examples, see Samples.