Reporting a Problem
In a typical configuration, PIKT is primarily used to report problems. For any given problem, the normal procedure is:
- On the central control system (the so-called "piktmaster"), you write a Pikt script to identify and report the problem.
- You also specify the problem's priority level, how often to run the monitoring script, who is to receive the problem reports, and so on.
- You run a command to install and schedule the script on one or more client systems (the so-called PIKT "slaves").
- If the problem occurs, it is reported, usually by e-mail, to you, the PIKT administrator (the so-called "piktadmin"), and possibly to other interested persons.
- You read the problem report, and take action, or not, depending on the problem's severity and urgency.
For example, one common problem is an excessively high system load average. If the load average is unusually high, the following LoadAverage script will tell you so. LoadAverage is shown here more or less as it would appear on the piktmaster system (in the central alarms.cfg configuration file) and on the slave system(s) (in their Critical.alt files):
    LoadAverage

        init
            status active
            level critical
            task "Report perilously high system load averages"
            input proc "/usr/bin/uptime"
            dat $a1  $-2
            dat $a5  $-1    // unused
            dat $a15 $      // unused

        rule    // dispose of trailing comma, and set value
            set #la1 = #value($chop($a1,1))

        rule    // report if exceeds threshold
            if #la1 >= 2.0
                output mail "uptime - $trim($inlin)"
            endif
Pikt scripts, much like Awk scripts, come in sections: init, begin, rule, and/or end. Syntax combines elements of Awk, Perl, and Bourne shell (Bash). Comments are denoted by '//' and '/* */'. When done right, Pikt scripts should be clear, straightforward, and easy to read.
In this script's init section, we indicate the script's status (active, as opposed to inactive, or some other possible status setting), its priority level (in this case, it's of critical importance), its task description, and its input. Normally, input is a file or the output of some process. For LoadAverage, the script input is the output of the process "/usr/bin/uptime", for example:
    10:15:03 up 35 days, 20:27, 6 users, load average: 5.10, 3.71, 1.79

We then assign some values: the third-to-last output field (the "5.10,"), which is the one-minute load average, goes to $a1, and the second-to-last and last output fields, the five-minute and fifteen-minute load averages, go to $a5 and $a15, respectively. (In the LoadAverage script, $a5 and $a15 are, in fact, not used.)
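To make the field numbering concrete, here is a minimal sketch in Python (not Pikt) of picking whitespace-separated fields by counting back from the end of the line. It assumes, as the dat statements above suggest, that `$` means the last field, `$-1` the one before it, and so on. The helper name `field_from_end` is hypothetical and is not part of PIKT.

```python
# Illustration only: a Python stand-in for end-relative field references.
UPTIME_LINE = "10:15:03 up 35 days, 20:27, 6 users, load average: 5.10, 3.71, 1.79"

def field_from_end(line, offset):
    """Return the whitespace-separated field `offset` places before the last one."""
    fields = line.split()
    return fields[len(fields) - 1 - offset]

a1 = field_from_end(UPTIME_LINE, 2)   # corresponds to: dat $a1  $-2
a5 = field_from_end(UPTIME_LINE, 1)   # corresponds to: dat $a5  $-1
a15 = field_from_end(UPTIME_LINE, 0)  # corresponds to: dat $a15 $

print(a1, a5, a15)  # 5.10, 3.71, 1.79
```

Note that the first two values still carry their trailing commas, which is why the script has to chop one character from $a1 before treating it as a number.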
After the init section (and an optional begin section), we enter an implicit input loop (again, much like Awk). For each line of input, we consider a series of rules--things to check, actions to take, etc. Besides the $a1 (and $a5 and $a15) assignment(s) mentioned previously, the current input line is also assigned to the built-in variable $inlin (aka $inline, $inputline) by default.
In the first rule, we need to chop the trailing comma from the $a1 and convert it to a numerical value. For these purposes, we use the standard Pikt functions $chop() and #value(). We assign this value to a new variable, #la1.
In the next rule, we test whether #la1 meets or exceeds some threshold, say 2.0. If so, we send e-mail to that effect. (We use the standard $trim() function to trim any leading or trailing spaces from the uptime output.)
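The logic of the two rules can be sketched in Python as follows. This is only an approximation of what the text describes, not how the Pikt interpreter works: the helpers `chop` and `at_or_above_threshold` are hypothetical stand-ins for $chop(), #value(), and the `>=` test, and `str.strip()` stands in for $trim().

```python
# Illustration only: Python approximation of the two LoadAverage rules.

def chop(text, n=1):
    """Drop the last n characters of text (stand-in for $chop())."""
    return text[:-n] if n > 0 else text

def at_or_above_threshold(a1, threshold=2.0):
    """Strip the trailing comma, convert to a number, and compare."""
    la1 = float(chop(a1, 1))  # stand-in for #value($chop($a1,1))
    return la1 >= threshold

line = "  10:15:03 up 35 days, 20:27, 6 users, load average: 5.10, 3.71, 1.79 "
a1 = line.split()[-3]  # third-to-last field, "5.10,"

message = None
if at_or_above_threshold(a1):
    # The script mails this text; here it is only printed.
    message = "uptime - " + line.strip()
    print(message)
```

A value exactly equal to the threshold triggers a report, because the script compares with `>=` rather than `>`.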
The problem alert e-mail might look like the following:
    PIKT ALERT
    Mon Mar 5 10:15:04 2007
    calgary

    CRITICAL:
        LoadAverage
            Report perilously high system load averages

            uptime - 10:15:03 up 35 days, 20:27, 6 users, load average: 5.10, 3.71, 1.79
We might embellish this script in various ways, adding more rules to do things such as:
- setting the reporting threshold differently depending on the system, the day of the week, or the time of day
- specifying more than one threshold, considering also the $a5 and $a15
- reporting lower load averages to just the piktadmin, and higher load averages to other sysadmins besides
- after the initial report, reporting anew only if the load average is increasing
- logging the load average(s), to syslog or some script-specific log file
- showing 'top' or 'ps' output in the same problem report e-mail in order to provide some diagnostic context for the high system load
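As one illustration of these refinements, the "report anew only if the load average is increasing" idea might be reasoned about as in the Python sketch below. It is not Pikt code, and the function name `should_report` is hypothetical. How the previous run's value would be remembered between runs is deliberately left out.

```python
# Illustration only: decide whether a new report is warranted.

def should_report(current, previous, threshold=2.0):
    """Report when the threshold is first reached, then only on increases.

    `previous` is the load average seen on the prior run, or None if
    there was no prior run.
    """
    if current < threshold:
        return False            # nothing to report
    if previous is None or previous < threshold:
        return True             # initial report
    return current > previous   # repeat only if things are getting worse

print(should_report(5.10, None))  # True
print(should_report(5.10, 5.10))  # False
print(should_report(6.25, 5.10))  # True
```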
Suppose we want to check the system load average frequently, every 15 minutes say. We could add the LoadAverage script to the Critical alerts group, defined (in the piktmaster's alerts.cfg file) as:
    Critical
        timing   0,15,30,45 * * * *
        mailcmd  "mailx -s 'PIKT Alert on =pikthostname: Critical' =sysadmins"
        alarms
            LoadAverage
            ...
Timings are similar to cron's. Here we have scheduled the "Critical" scripts, including LoadAverage (and possibly others), to run every 15 minutes throughout the day and night. In the mailcmd, we send e-mail for the =pikthostname system to the =sysadmins. =pikthostname and =sysadmins are examples of PIKT macros. We'll have more to say about PIKT macros later.
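To see why this timing means "every 15 minutes", the Python sketch below expands the minute field of a cron-style specification. It is an illustration, not PIKT's scheduler: the helper `minute_matches` is hypothetical and handles only `*` and comma-separated lists, not ranges or step values.

```python
# Illustration only: expand the minute field of a cron-style timing.

def minute_matches(field, minute):
    """True if a minute field ('*' or a comma list) matches the given minute."""
    if field == "*":
        return True
    return minute in {int(part) for part in field.split(",")}

timing = "0,15,30,45 * * * *"
minute_field = timing.split()[0]  # first of the five fields

run_minutes = [m for m in range(60) if minute_matches(minute_field, m)]
print(run_minutes)  # [0, 15, 30, 45]
```

The remaining four fields are all `*`, so the alert group runs at those four minutes of every hour, every day.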
On the central piktmaster system, we would install and schedule the Critical scripts, including the LoadAverage script, on all host systems using the command:
    # piktc -ierv +A Critical +H all
    ...
    processing calgary...
    installing file(s)...
    Critical.alt installed
    Critical enabled
    (re)starting daemon (piktd)...
    daemon (re)started
    ...

where the '-i' stands for "install", '-e' enables (schedules), '-r' restarts the PIKT slave's piktd daemon, and '-v' says to be verbose when doing all of this. The '+A' stands for "alerts" and the '+H' signifies "host" systems.
piktc is the central command-and-control and preprocessor program, used to install and schedule Pikt scripts, to stop and (re)start PIKT daemons, and do many, many other things. Other PIKT programs include: pikt, the Pikt script interpreter; piktd, the daemon that runs the alert scripts periodically; piktc_svc, the piktc service daemon; and still others.
Reporting the system load average is just one among many different problems you might have PIKT monitoring and reporting. For other examples, continue reading the Introduction and/or visit the Samples pages.