Reporting Multiple Problems

(NOTE: Some of the techniques shown or described on this page--marked in purple--require new features in the latest official PIKT 1.19.0 release (pikt-current.tar.gz) that are unavailable in any previous version.)

From a problem reporting perspective, some system and network failures are much like others. We could write separate monitor scripts to report, for example, system downages and failures of various network services, such as RPC, SSH, SMTP, HTTP, etc. But by clever use of PIKT macros, we can write so-called script macros--generic, customizable, and reusable Pikt scripts that will, in many cases, save us much duplicate effort.

For example, here is a Pikt script in the form of a generalized PIKT macro for reporting network service failures:

service_downage(S, I, T, M)

        init
                status =piktstatus
                level =piktlevel
                task "Report (S) service downages on remote systems"
                input proc "(I)"
                dat $host 1
                keys $host

        begin   // set the $missioncritical list
                set $missioncritical = $command("=piktc -L +H missioncritical | =oneline")

        rule    // if this script is invoked from within a high-priority
                // alert, and if the current host is not mission-critical,
                // set the host's state to its previous state, and proceed
                // to the next host
                if $alert() =~~ "red|emergency|server"
                        if " $missioncritical " !~ " $host "
                                set $state = %state
                                next
                        endif
                endif

        // for high-priority alerts, only mission-critical systems after
        // this point
        // for lower-priority alerts, all systems after this point

        rule    // initially, assume the service is up
                set $state = "+"

        rule    // test the service, and report if down
                if (T)
                        set $state = "-"
                        // for all systems, always report new downages
                        if $state ne %state
                                output mail "$host's (S) services are down ((M))"
                        // but for missioncritical systems, report
                        // continuing downages only periodically
                        elsif " $missioncritical " =~ " $host "
                                =periodically(output mail
                                              "$host's (S) services are down ((M))", , 240)
                        endif
                        next
                endif

#ifdef verbose
        rule    // for missioncritical systems, if state was "-",
                // is now "+", then report change
                if    " $missioncritical " =~ " $host "
                   && $state ne %state
                        output mail "$host's (S) services are back up"
                endif
#endifdef

In the init section, rather than hard-code the script status and level as, say, "active" and "emergency" (or "urgent" or ...), we reference special built-in, predefined PIKT macros

                status =piktstatus
                level =piktlevel

whose values are set (for consistency's sake) for the entire group in alerts.cfg, as in the Emergency group's specification

Emergency

        ...

        status          active
        level           emergency

        ...

In the task statement, we utilize a PIKT macro argument. That is, if we invoke this service_downage() macro (in alarms.cfg) as:

        =service_downage(SMTP, ...)

when installed, the script task statement would read:

                task "Report SMTP service downages on remote systems"

So, the "S" in "service_downage(S, I, T, M)" maps to any "(S)" instances in the macro definition. (Enclosing an identifier within parentheses in a macro definition signifies an argument substitution. The parentheses are stripped out at the point of macro argument substitution and when the script is installed.)

Similarly, if the service_downage() macro is invoked as:

        =service_downage(..., =piktc -L +H pikt -H nosmtp down sick, ...)

the second macro argument would be substituted for the "(I)" in the macro definition, and the actual installed script would have:

                input proc "=piktc -L +H pikt -H nosmtp down sick"

More about the 'keys $host' statement in a moment...

In the script begin section, we set a variable, $missioncritical, to a piktc $command() output that might look something like:

        genoa dublin stockholm rome madrid kiev

that is, the list of site mission-critical systems. (The missioncritical host group is defined in systems.cfg.)

In the first script rule, the outer if-endif checks the alert context. In the inner if-endif, if the $missioncritical list doesn't pattern match the current $host, we set the system state to what it was before (i.e., when the script was last run) and proceed to the next input line, that is, the next host. (Both strings are padded with spaces so that a host name matches only as a whole word in the list, not as a fragment of a longer name.)

"$state" is the current system state. "%state" is an example of a so-called PIKT history variable, and here is where the 'keys $host' statement comes into play. Keying on $host, PIKT looks up the previous $state of the current host in the script history database and assigns that value to %state. PIKT handles this persistence automatically: just provide the appropriate keys statement, refer to the variable with the "%" prefix, and PIKT takes care of the details behind the scenes. (The PIKT documentation on history variables has more.)

In the second script rule, we initialize the current state to "+", that is, initially assume that the service is up.

In the third script rule, we test to see if the service is up for the current host. If, for example, we invoke the service_downage() macro as:

        =service_downage(..., =smtpfail($host), ...)

this third macro argument would be substituted for the "(T)" in the macro definition, and the actual installed script would have:

                if =smtpfail($host)

=smtpfail() is defined in macros.cfg as:

smtpfail(H)     // test if can't connect to system's port 25,
                // where (H) is the host
                $command("=echo '.close' | =telnet -e . -a (H) 25 | =tail -n +3 |
                          =head -n 1") !~~ "connected to (H)"

Without explaining exactly how these work, we show here some similar service test macros defined in macros.cfg:

///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////

// service test macros

///////////////////////////////////////////////////////////////////////////////

pingfail(H, R, T)
                // test if system doesn't respond to ping, where
                // (H) is the host, (R) is the retries, (T) is timeout
                $command("=ping -c (R) -w (T) (H) | =tail -n 2 |
                          =head -n 1") =~ "100% packet loss"

///////////////////////////////////////////////////////////////////////////////

rpcfail(H)      // test if system doesn't respond to a rpcinfo request,
                // where (H) is the host
                $command("=rpcinfo -p (H) 2>/dev/null | =head -n 2 |
                          =tail -n 1") !~~ "100000.+tcp.+111.+portmapper"

///////////////////////////////////////////////////////////////////////////////

sshfail(H)      // test if can't connect to system's port 22,
                // where (H) is the host
                $command("=echo '.close' | =telnet -e . -a (H) 22 | =tail -n +3 |
                          =head -n 1") !~~ "connected to (H)"

///////////////////////////////////////////////////////////////////////////////

httpfail(H)     // test if can't connect to system's port 80,
                // where (H) is the host
                $command("=echo '.close' | =telnet -e . -a (H) 80 | =tail -n +3 |
                          =head -n 1") !~~ "connected to (H)"

///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////

(There may be better, more reliable, but more complicated availability tests for each of these services. Roll your own as needed.)
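
Since smtpfail(), sshfail(), and httpfail() differ only in the port number, one way to roll your own is a single parameterized macro. The following is a sketch only: portfail() and its (P) argument are hypothetical names, not part of the standard macros.cfg, and the test is the same telnet-based check used above, with the same limitations.

```pikt
portfail(H, P)  // hypothetical: test if can't connect to system's TCP port,
                // where (H) is the host, (P) is the port number
                $command("=echo '.close' | =telnet -e . -a (H) (P) | =tail -n +3 |
                          =head -n 1") !~~ "connected to (H)"
```

When such a two-argument macro is passed as an argument to another macro, its comma would need escaping, as this page does for =pingfail(): for example, =portfail($host\, 143).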

If $host fails the service test, we set this service's $state to "-", that is, "down".

If '$state ne %state', that is, if the service state has changed since the script was last run, we report the downage.

For mission-critical systems only, if the service state has not changed, that is, the service is still down, we report the downage only periodically, using the macro

periodically(A1, A2, M) // (A1) is the reporting action
                        // (A2) is the non-reporting action, if any
                        // (M) is the number of minutes in the period
                        // (=driftfactor is a tolerance factor in seconds,
                        // for example 30, correcting for slight
                        // differences in timing)
                        set #tv = #now()
                        if ! #defined(%tv) || (#tv - %tv >= (M)*60 - =driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv = %tv
                        fi

So, the statement

                                =periodically(output mail
                                              "$host's (S) services are down ((M))", , 240)

says to report the service downage (for mission-critical systems) only once every 240 minutes (four hours).
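
To make the timing concrete, this is roughly what the =periodically() call would become for the SmtpDown invocation shown further below, assuming plain textual substitution of the three arguments. It is a hand expansion, not actual piktc output. The empty second argument leaves the else branch with no non-reporting action, and =driftfactor is left unexpanded because its value is not shown on this page.

```pikt
                        set #tv = #now()
                        // 240*60 = 14400 seconds, less the drift tolerance
                        if ! #defined(%tv) || (#tv - %tv >= 240*60 - =driftfactor)
                                output mail "$host's SMTP services are down (telnet to port 25 failure)"
                        else
                                // no report this time: carry the old timestamp
                                // forward, so the period runs from the last report
                                set #tv = %tv
                        fi
```

Resetting #tv to %tv in the else branch is what keeps the clock from restarting on every script run: the stored timestamp advances only when a report is actually sent.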

Here are two other useful, standard report scheduling macros:

hourly(A1, A2)
                        set #tv60 = #now()
                        if ! #defined(%tv60) || (#tv60 - %tv60 >= 60*60 - =driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv60 = %tv60
                        fi

daily(A1, A2)
                        set #tv1440 = #now()
                        if ! #defined(%tv1440) || (#tv1440 - %tv1440 >= 1440*60 - =driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv1440 = %tv1440
                        fi
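
Other periods follow the same pattern. As a sketch, a weekly variant might look like the following. The weekly() macro and its #tv10080 variable are hypothetical, modeled directly on hourly() and daily() above rather than taken from the standard macros.cfg.

```pikt
weekly(A1, A2)          // hypothetical, modeled on hourly() and daily();
                        // 10080 = 7*24*60 minutes
                        set #tv10080 = #now()
                        if ! #defined(%tv10080) || (#tv10080 - %tv10080 >= 10080*60 - =driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv10080 = %tv10080
                        fi
```

Each scheduling macro uses its own timestamp variable (#tv, #tv60, #tv1440), presumably so that different schedules used in the same script do not overwrite one another's history.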

In the script's final rule, we report--for mission-critical systems only (and only if we have installed this script in verbose mode)--when the service comes back up.

Here is how we might invoke this script, in alarms.cfg or, better, in one of its #include files, service_alarms.cfg:

///////////////////////////////////////////////////////////////////////////////
//
// service_alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SysDown

        =service_downage(PING, =piktc -L +H pikt -H down sick,
                         =pingfail($host\, 3\, 5), ping failure)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

RpcDown

        =service_downage(RPC, =piktc -L +H pikt -H down sick,
                         =rpcfail($host), rpcinfo -p failure)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SshDown

        =service_downage(SSH, =piktc -L +H pikt -H down sick,
                         =sshfail($host), telnet to port 22 failure)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

SmtpDown

        =service_downage(SMTP, =piktc -L +H pikt -H nosmtp down sick,
                         =smtpfail($host), telnet to port 25 failure)

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

HttpDown

        =service_downage(HTTP, =piktc -L +H webserver -H down sick,
                         =httpfail($host), telnet to port 80 failure)

#endif

///////////////////////////////////////////////////////////////////////////////
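
As a check on the substitution rules, here is roughly how the macro's third rule would read for SmtpDown after first-level argument substitution. This is a hand expansion, not actual piktc output, and nested macros such as =smtpfail() and =periodically() are left unexpanded.

```pikt
        rule    // test the service, and report if down
                if =smtpfail($host)
                        set $state = "-"
                        // for all systems, always report new downages
                        if $state ne %state
                                output mail "$host's SMTP services are down (telnet to port 25 failure)"
                        // but for missioncritical systems, report
                        // continuing downages only periodically
                        elsif " $missioncritical " =~ " $host "
                                =periodically(output mail
                                              "$host's SMTP services are down (telnet to port 25 failure)", , 240)
                        endif
                        next
                endif
```

Note that "((M))" becomes "(telnet to port 25 failure)": only the inner parentheses mark the argument and are stripped, while the outer pair survives as literal text in the message.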

You might group these service downage scripts in alerts.cfg under Urgent:

Urgent

        ...

        alarms

                ...

#if piktmaster
                SysDown
                RpcDown
                SshDown
                SmtpDown
                HttpDown
#endif

                ...

Here is a sample problem report for these scripts:

                                PIKT ALERT
                         Wed May  9 15:15:25 2007
                                  berlin

URGENT:
    SysDown
        Report PING service downages on remote systems

        basel's PING services are down (ping failure)

URGENT:
    RpcDown
        Report RPC service downages on remote systems

        basel's RPC services are down (rpcinfo -p failure)

URGENT:
    SshDown
        Report SSH service downages on remote systems

        basel's SSH services are down (telnet to port 22 failure)

URGENT:
    SmtpDown
        Report SMTP service downages on remote systems

        basel's SMTP services are down (telnet to port 25 failure)
        belgrade's SMTP services are down (telnet to port 25 failure)
        moscow's SMTP services are down (telnet to port 25 failure)

Generalizing Pikt scripts by way of macros is a powerful technique: it eases PIKT management and shortens our configuration considerably.
