Reporting Multiple Problems
(NOTE: Some of the techniques shown or described on this page--marked in purple--require new features in the latest official PIKT 1.19.0 release (pikt-current.tar.gz) that are unavailable in any previous version.)
From a problem reporting perspective, some system and network failures are much like others. We could write separate monitor scripts to report, for example, system downages and failures of various network services, such as RPC, SSH, SMTP, HTTP, etc. But by clever use of PIKT macros, we can write so-called script macros--generic, customizable, and reusable Pikt scripts that will, in many cases, save us much duplicate effort.
    service_downage(S, I, T, M)

        init
            status =piktstatus
            level =piktlevel
            task "Report (S) service downages on remote systems"
            input proc "(I)"
            dat $host 1
            keys $host

        begin
            // set the $missioncritical list
            set $missioncritical = $command("=piktc -L +H missioncritical | =oneline")

        rule
            // if this script is invoked from within a high-priority
            // alert, and if the current host is not mission-critical,
            // set the host's state to its previous state, and proceed
            // to the next host
            if $alert() =~~ "red|emergency|server"
                if " $missioncritical " !~ " $host "
                    set $state = %state
                    next
                endif
            endif

        // for high-priority alerts, only mission-critical systems after
        // this point

        // for lower-priority alerts, all systems after this point

        rule
            // initially, assume the service is up
            set $state = "+"

        rule
            // test the service, and report if down
            if (T)
                set $state = "-"
                // for all systems, always report new downages
                if $state ne %state
                    output mail "$host's (S) services are down ((M))"
                // but for missioncritical systems, report
                // continuing downages only periodically
                elsif " $missioncritical " =~ " $host "
                    =periodically(output mail "$host's (S) services are down ((M))", , 240)
                endif
                next
            endif

    #ifdef verbose
        rule
            // for missioncritical systems, if state was "-",
            // is now "+", then report change
            if " $missioncritical " =~ " $host " && $state ne %state
                output mail "$host's (S) services are back up"
            endif
    #endifdef

In the init section, rather than hard-code the script status and level as, say, "active" and "emergency" (or "urgent" or ...), we reference special built-in, predefined PIKT macros
    status =piktstatus
    level =piktlevel

whose values are set (for consistency's sake) for the entire group in alerts.cfg, as in the Emergency group's specification
    Emergency
        ...
        status active
        level emergency
        ...

In the task statement, we utilize a PIKT macro argument. That is, if we invoke this service_downage() macro (in alarms.cfg) as:
    =service_downage(SMTP, ...)

when installed, the script task statement would read:
    task "Report SMTP service downages on remote systems"

So, the "S" in "service_downage(S, I, T, M)" maps to any "(S)" instances in the macro definition. (Enclosing an identifier within parentheses in a macro definition signifies an argument substitution. The parentheses are stripped out at the point of macro argument substitution and when the script is installed.)
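To make the substitution rule concrete, here is a minimal Python sketch of it. This is not how PIKT is implemented; the function name expand_macro and the single-pass regular-expression approach are assumptions made purely for illustration. The parameter names and argument values are taken from the SMTP invocation used on this page.

```python
import re


def expand_macro(params, body, args):
    """Replace each "(P)" in body with the matching argument, dropping the parentheses.

    Hypothetical helper: it models the substitution rule described above,
    not PIKT's actual macro processor.
    """
    if len(params) != len(args):
        raise ValueError("expected %d arguments, got %d" % (len(params), len(args)))
    mapping = dict(zip(params, args))
    # Match only a parenthesized parameter name such as "(S)";
    # any other parentheses in the body are left alone.
    names = "|".join(re.escape(p) for p in params)
    pattern = re.compile(r"\((" + names + r")\)")
    return pattern.sub(lambda m: mapping[m.group(1)], body)


params = ["S", "I", "T", "M"]
args = [
    "SMTP",
    "=piktc -L +H pikt -H nosmtp down sick",
    "=smtpfail($host)",
    "telnet to port 25 failure",
]

print(expand_macro(params, 'task "Report (S) service downages on remote systems"', args))
print(expand_macro(params, "$host's (S) services are down ((M))", args))
```

In the second line, only the inner "(M)" is an argument reference, so the outer pair of parentheses survives. That is consistent with installed messages ending in "(telnet to port 25 failure)".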
Similarly, if the service_downage() macro is invoked as:
    =service_downage(..., =piktc -L +H pikt -H nosmtp down sick, ...)

the second macro argument would be substituted for the "(I)" in the macro definition, and the actual installed script would have:
    input proc "=piktc -L +H pikt -H nosmtp down sick"

More about the 'keys $host' statement in a moment...
In the script begin section, we set a variable, $missioncritical, to a piktc $command() output that might look something like:
    genoa dublin stockholm rome madrid kiev

that is, the list of site mission-critical systems. (The missioncritical host group is defined in systems.cfg.)
In the first script rule, in the outer if-endif we check the alert context. In the inner if-endif, if the $missioncritical list doesn't pattern match the current $host, we set the system state to what it was before (i.e., when the script was last run), and proceed to the next input line, that is, the next host.
"$state" is the current system state. "%state" is an example of a so-called PIKT history variable. Here is where the 'keys $host' statement comes into play. Keying on $host, PIKT looks up in the script history database the previous $state of the current host and assigns that value to %state. PIKT handles this value persistence automatically by way of history variables. Just be sure to provide the appropriate keys statement and to reference the variable with the "%" prefix, and PIKT takes care of the details behind the scenes for you. (Read more in the PIKT documentation on history variables.)
In the second script rule, we initialize the current state to "+", that is, initially assume that the service is up.
In the third script rule, we test to see if the service is up for the current host. If, for example, we invoke the service_downage() macro as:
    =service_downage(..., =smtpfail($host), ...)

this third macro argument would be substituted for the "(T)" in the macro definition, and the actual installed script would have:
    if =smtpfail($host)

=smtpfail() is defined in macros.cfg as:
    smtpfail(H)     // test if can't connect to system's port 25,
                    // where (H) is the host
        $command("=echo '.close' | =telnet -e . -a (H) 25 | =tail -n +3 | =head -n 1") !~~ "connected to (H)"

Without explaining exactly how these work, we show here some similar service test macros defined in macros.cfg:
    ///////////////////////////////////////////////////////////////////////////////
    ///////////////////////////////////////////////////////////////////////////////

    // service test macros

    ///////////////////////////////////////////////////////////////////////////////

    pingfail(H, R, T)   // test if system doesn't respond to ping, where
                        // (H) is the host, (R) is the retries, (T) is timeout
        $command("=ping -c (R) -w (T) (H) | =tail -n 2 | =head -n 1") =~ "100% packet loss"

    ///////////////////////////////////////////////////////////////////////////////

    rpcfail(H)      // test if system doesn't respond to a rpcinfo request,
                    // where (H) is the host
        $command("=rpcinfo -p (H) 2>/dev/null | =head -n 2 | =tail -n 1") !~~ "100000.+tcp.+111.+portmapper"

    ///////////////////////////////////////////////////////////////////////////////

    sshfail(H)      // test if can't connect to system's port 22,
                    // where (H) is the host
        $command("=echo '.close' | =telnet -e . -a (H) 22 | =tail -n +3 | =head -n 1") !~~ "connected to (H)"

    httpfail(H)     // test if can't connect to system's port 80,
                    // where (H) is the host
        $command("=echo '.close' | =telnet -e . -a (H) 80 | =tail -n +3 | =head -n 1") !~~ "connected to (H)"

    ///////////////////////////////////////////////////////////////////////////////
    ///////////////////////////////////////////////////////////////////////////////

(There may be better, more reliable, but more complicated availability tests for each of these services. Roll your own as needed.)
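As one possible "roll your own" variant, the sketch below tests a TCP port with a direct socket connection instead of parsing telnet output. This is a swapped-in technique, not something the macros above do. The function name port_fail and the default timeout are assumptions, and how you would hook such a helper into a Pikt script is left open here.

```python
import socket


def port_fail(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port cannot be established.

    Hypothetical helper: refused connections, timeouts, and name-resolution
    errors all surface as OSError and are all counted as failures.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False
    except OSError:
        return True
```

For example, port_fail("basel", 25) would play roughly the role of =smtpfail($host) for the host basel. A successful connect only shows that something is listening on the port, not that the service behind it is healthy.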
If $host fails the service test, we set this service's $state to "-", that is, "down".
If '$state ne %state', that is, the service state has changed since the last time (the script was run), we report the downage.
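The comparison of $state with %state can be modeled in a few lines of Python. This is only a sketch of the idea: the class HistoryStore and the function new_downages are invented names, the store lives in memory rather than in a history database, and how PIKT itself treats a history value that was never recorded is not modeled.

```python
class HistoryStore:
    """Keeps the last recorded value of each variable, per key (here: per host)."""

    def __init__(self):
        self._values = {}

    def previous(self, key, name):
        # Plays the role of reading %name for the current key.
        # Returns None if nothing was recorded on an earlier run.
        return self._values.get((key, name))

    def record(self, key, name, value):
        # Plays the role of saving $name when processing of this key ends.
        self._values[(key, name)] = value


def new_downages(store, states):
    """Return hosts that are "-" now and were not "-" on the previous run."""
    reported = []
    for host, state in states.items():
        if state == "-" and state != store.previous(host, "state"):
            reported.append(host)
        store.record(host, "state", state)
    return reported


store = HistoryStore()
print(new_downages(store, {"basel": "-", "moscow": "+"}))  # first run
print(new_downages(store, {"basel": "-", "moscow": "-"}))  # second run
```

The first run reports basel. The second run reports only moscow, because basel's state is unchanged from the run before.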
For mission-critical systems only, if the service state has not changed, that is, the service is still down, we only report the downage periodically using the macro
    periodically(A1, A2, M)     // (A1) is the reporting action
                                // (A2) is the non-reporting action, if any
                                // (M) is the number of minutes in the period
                                // (=driftfactor, e.g. 30 seconds, is a tolerance
                                // factor correcting for slight differences in
                                // timing)
        set #tv = #now()
        if    ! #defined(%tv)
           || (#tv - %tv >= (M)*60 - =driftfactor)
            (A1)
        else
            (A2)
            set #tv = %tv
        fi

So, the statement
    =periodically(output mail "$host's (S) services are down ((M))", , 240)

says to report the service downage (for mission-critical systems) only once every 240 minutes (four hours).
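The timing test inside periodically() can be sketched in Python as follows. The function name should_report and the 30-second default for the drift tolerance are assumptions for illustration; in the macro, the tolerance comes from =driftfactor.

```python
def should_report(now, last_report, period_minutes, drift_seconds=30):
    """Decide whether a periodic report is due.

    Returns (report, remembered): whether to report now, and the timestamp
    to remember for the next run. All times are in seconds.
    """
    due = period_minutes * 60 - drift_seconds
    if last_report is None or now - last_report >= due:
        # Report, and remember this run as the start of a new period.
        return True, now
    # Not yet due: keep remembering the time of the last actual report,
    # which mirrors "set #tv = %tv" in the else branch of the macro.
    return False, last_report


start = 1_000_000
print(should_report(start, None, 240))                    # never reported before
print(should_report(start + 3600, start, 240))            # one hour later
print(should_report(start + 240 * 60 - 15, start, 240))   # 15 seconds short of four hours
```

The third call still reports, because 15 seconds falls within the drift tolerance. Without the tolerance, a run that starts slightly early would have to wait one more full run interval before reporting.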
The hourly() and daily() macros apply the same logic with fixed periods of 60 and 1440 minutes:

    hourly(A1, A2)
        set #tv60 = #now()
        if    ! #defined(%tv60)
           || (#tv60 - %tv60 >= 60*60 - =driftfactor)
            (A1)
        else
            (A2)
            set #tv60 = %tv60
        fi

    daily(A1, A2)
        set #tv1440 = #now()
        if    ! #defined(%tv1440)
           || (#tv1440 - %tv1440 >= 1440*60 - =driftfactor)
            (A1)
        else
            (A2)
            set #tv1440 = %tv1440
        fi

In the script's final rule, we report--for mission-critical systems only (and only if we have installed this script in verbose mode)--when the service comes back up.
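Putting the four rules together, the per-host decision flow of service_downage() can be summarized in a short Python sketch. The function name evaluate and its parameters are invented for illustration. The period_due flag stands in for the =periodically() timer, and treating =~~ as a case-insensitive match on the alert name is an assumption.

```python
import re

# Assumption: the alert-name test is a case-insensitive regular-expression match.
HIGH_PRIORITY = re.compile(r"red|emergency|server", re.IGNORECASE)


def evaluate(host, is_down, previous, alert, critical_hosts,
             period_due=False, verbose=False):
    """Return (state, message) for one host; message is None if nothing is mailed."""
    critical = host in critical_hosts

    # Rule 1: in a high-priority alert, skip hosts that are not
    # mission-critical and carry their previous state forward.
    if HIGH_PRIORITY.search(alert) and not critical:
        return previous, None

    # Rule 2: assume the service is up until the test says otherwise.
    state = "+"

    # Rule 3: a failed service test marks the host as down.
    if is_down:
        state = "-"
        if state != previous:
            return state, host + " is down"   # new downage: always report
        if critical and period_due:
            return state, host + " is down"   # continuing downage: periodic reminder
        return state, None

    # Rule 4 (verbose installs only): report recovery of mission-critical hosts.
    if verbose and critical and state != previous:
        return state, host + " is back up"
    return state, None


critical_hosts = {"genoa", "dublin"}
print(evaluate("basel", True, "+", "Urgent", critical_hosts))
print(evaluate("basel", True, "+", "Emergency", critical_hosts))
print(evaluate("genoa", False, "-", "Urgent", critical_hosts, verbose=True))
```

The second call returns the previous state with no message: basel is not mission-critical, so a high-priority alert skips it entirely.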
With the script macro in place, each service alarm reduces to a single macro invocation, as in this service_alarms.cfg:

    ///////////////////////////////////////////////////////////////////////////////
    //
    // service_alarms.cfg
    //
    ///////////////////////////////////////////////////////////////////////////////

    #if piktmaster

    SysDown

        =service_downage(PING, =piktc -L +H pikt -H down sick, =pingfail($host\, 3\, 5), ping failure)

    #endif

    ///////////////////////////////////////////////////////////////////////////////

    #if piktmaster

    RpcDown

        =service_downage(RPC, =piktc -L +H pikt -H down sick, =rpcfail($host), rpcinfo -p failure)

    #endif

    ///////////////////////////////////////////////////////////////////////////////

    #if piktmaster

    SshDown

        =service_downage(SSH, =piktc -L +H pikt -H down sick, =sshfail($host), telnet to port 22 failure)

    #endif

    ///////////////////////////////////////////////////////////////////////////////

    #if piktmaster

    SmtpDown

        =service_downage(SMTP, =piktc -L +H pikt -H nosmtp down sick, =smtpfail($host), telnet to port 25 failure)

    #endif

    ///////////////////////////////////////////////////////////////////////////////

    #if piktmaster

    HttpDown

        =service_downage(HTTP, =piktc -L +H webserver -H down sick, =httpfail($host), telnet to port 80 failure)

    #endif

    ///////////////////////////////////////////////////////////////////////////////

You might group these service downage scripts in alerts.cfg under Urgent:
    Urgent
        ...
        alarms
            ...
    #if piktmaster
            SysDown
            RpcDown
            SshDown
            SmtpDown
            HttpDown
    #endif
            ...
Here is a sample problem report for these scripts:
                                PIKT ALERT
                          Wed May 9 15:15:25 2007
                                  berlin

    URGENT:
        SysDown
            Report PING service downages on remote systems

            basel's PING services are down (ping failure)

    URGENT:
        RpcDown
            Report RPC service downages on remote systems

            basel's RPC services are down (rpcinfo -p failure)

    URGENT:
        SshDown
            Report SSH service downages on remote systems

            basel's SSH services are down (telnet to port 22 failure)

    URGENT:
        SmtpDown
            Report SMTP service downages on remote systems

            basel's SMTP services are down (telnet to port 25 failure)

            belgrade's SMTP services are down (telnet to port 25 failure)

            moscow's SMTP services are down (telnet to port 25 failure)

Generalizing Pikt scripts by way of macros is a very powerful technique that eases our PIKT management and shortens our configuration considerably.