Service Downage Macro
The service_downage_alarms_macros.cfg file defines a script macro, service_downage, for reporting when services go down (or, optionally, come back up) on remote systems.
service_downage(L, S, I, T, A, M)

        init
                status =piktstatus
                level (L)
                task "Report (S) service downages on remote systems"
                input proc "(I)"
                dat $host 1
                keys $host

        begin
                set $missioncritical = $command("=piktc -L +H missioncritical | =oneline")
#ifdef debug
                output $missioncritical
#endifdef

        rule
                set $dnmsg = "$host's (S) services are down ((M))"
                set $upmsg = "$host's (S) services are back up"

        rule
                if $alert() =~~ "red|emergency|server"
                        if " $missioncritical " !~ " $host "
                                set $state = %state
                                next
                        fi
                fi

        rule
                if $alert() =~~ "client"
                        if " $missioncritical " =~ " $host "
                                set $state = %state
                                next
                        fi
                fi

#ifdef debug
        rule
                output $host
#elsedef
        rule
                if $alert() =~ "(A)"
                        output $host
                        output $newline()
                fi
#endifdef

        rule
                // initially, assume the service is up
                set $state = "+"

        rule
                =bypass_server_reboots

        rule
                if (T)
                        if $alert() =~ "(A)"
                                output $dnmsg
                                output =newline
                        else
                                set $state = "-"
                                // for all systems, always report new downages
                                if $state ne %state
                                        output mail $dnmsg
                                // but for missioncritical systems, report
                                // continuing downages only periodically
                                elsif " $missioncritical " =~ " $host "
                                        =every_four_hours(output mail $dnmsg, )
                                fi
                        fi
                        next
                fi

        rule
                // for missioncritical systems, if state was "-",
                // is now "+", then report change
                if " $missioncritical " =~ " $host " && $state ne %state
                        output mail $upmsg
                fi

        end
                quit
You might invoke the =service_downage() macro in your alarms.cfg file as follows:
///////////////////////////////////////////////////////////////////////////////
//
// network_alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////

#if piktmaster

RpcDown

        =service_downage(warning, RPC, =piktc -L +H pikt -H down sick, =rpcfail($host), DownRpc|DownRpcServers|DownRpcClients, rpcinfo -p failure)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SshDown

        =service_downage(warning, SSH, =piktc -L +H pikt -H down sick, =sshfail($host), DownSsh, telnet to port 22 failure)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SmtpDown

        =service_downage(warning, SMTP, =piktc -L +H pikt -H nosmtp down sick, =smtpfail($host), DownSmtp, telnet to port 25 failure)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

HttpDown

        =service_downage(info, HTTP, =piktc -L +H webserver -H down sick, =httpfail($host), DownHttp, telnet to port 80 failure)

#endif

///////////////////////////////////////////////////////////////////////////////
where 'down' is a host group of known down systems (specified in down_systems.cfg) and 'sick' is a host group of known "sick" systems (systems up and running but somehow impaired).
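To make the mapping between the six macro parameters and an invocation concrete, the fragment below is a hand-expanded sketch of the SshDown call, substituting warning for (L), SSH for (S), the =piktc host listing for (I), =sshfail($host) for (T), DownSsh for (A), and "telnet to port 22 failure" for (M). It is an illustration only, not actual preprocessor output: it shows just the lines that reference a parameter, it leaves nested macros such as =piktstatus, =piktc, and =sshfail unexpanded, and the explanatory comments are additions that do not appear in the macro itself.

```
        init
                status =piktstatus
                // (L) is the alert level
                level warning
                // (S) names the service in the task description
                task "Report SSH service downages on remote systems"
                // (I) is the command that lists the hosts to check
                input proc "=piktc -L +H pikt -H down sick"
                dat $host 1
                keys $host

        rule
                // (S) and (M) fill in the report messages
                set $dnmsg = "$host's SSH services are down (telnet to port 22 failure)"
                set $upmsg = "$host's SSH services are back up"

        rule
                // (T) is the test that is true when the service is down
                if =sshfail($host)
                        // (A) is the pattern matched against the alert name
                        if $alert() =~ "DownSsh"
                                output $dnmsg
                                output =newline
                        else
                                set $state = "-"
                                // remainder of the rule as in the macro definition
                        fi
                        next
                fi
```

Read this way, adding a check for another service comes down to supplying a different host listing, failure test, alert-name pattern, and message text, while the state tracking and reporting rules stay in the shared macro.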
Output from this script might look like, for example:
                                URGENT:
                                RpcDownUrgent
                                Report RPC downages on remote systems

helsinki's RPC services are down (rpcinfo -p failure)

                                URGENT:
                                SshDownUrgent
                                Report ssh service downages on remote systems

helsinki's SSH services are down (telnet to port 22 failure)
rouen's SSH services are back up
For more examples, see Samples.