Reporting Policies
PIKT's primary mission is to report problems. To be effective, PIKT problem reports should be timely and get to the right people at a pace they can deal with. When configured well, PIKT tells you what you need to know when you need to know it, no more, no less. When configured badly, PIKT doesn't tell you enough, or worse, it tells you way, way too much. So rather than people appreciating PIKT, they come to resent it. At best, PIKT can be a useful and even essential part of your enterprise computing infrastructure. At worst, PIKT can be an annoying nuisance.
Setting the right reporting policies is crucial to PIKT's success and acceptance throughout the enterprise. Following is an extended discussion of ways you might manage the information flow and configure PIKT report messaging "well".
There are many different ways you could organize your PIKT e-mail macros. (Indeed, one might even disregard macros entirely and instead use hard-coded e-mail addresses within alerts.cfg and elsewhere.) If you opt for the e-mail macro approach, this example pikt_mail_macros.cfg file shows one way you might arrange it:
    ///////////////////////////////////////////////////////////////////////////////
    //
    // pikt mail macros - pikt mail routing
    //
    ///////////////////////////////////////////////////////////////////////////////

    // piktadmin

    byrd            byrd\
    piktadmin       =byrd   // pikt head honcho

    ///////////////////////////////////////////////////////////////////////////////

    // everyone else

    dowland         dowland\
    telemann        telemann\
    tartini         tartini\
    josquin         desprez.gmail\
    boyce
    #if missioncritical
                    boyce\
    #else
                    =piktnullchar
    #endif

    ///////////////////////////////////////////////////////////////////////////////

    // mail groups

    sysadmins       =dowland =boyce =piktadmin
    coders          =telemann =tartini =josquin

    ///////////////////////////////////////////////////////////////////////////////

    // the various pikt- macros, the addresses used in the alerts.cfg mailcmd

    pikt-emergency          =sysadmins =coders
    sysadmins-urgent        =sysadmins
    coders-urgent           =coders =piktadmin
    sysadmins-critical      =sysadmins
    coders-critical         =coders =piktadmin
    sysadmins-warning       =sysadmins
    coders-warning          =coders =piktadmin
    pikt-notice             =piktadmin
    pikt-info               =piktadmin
    pikt-admin              =piktadmin
    pikt-debug              =piktadmin
    pikt-test               =piktadmin

    ///////////////////////////////////////////////////////////////////////////////

In this #include file's first section, we set some macros for the piktadmin. 'byrd' is a macro defined as the e-mail address 'byrd\'. In the second macro definition, =byrd is the designated piktadmin.
In the second section, we define the e-mail macros for everyone else. "Everyone else" means everybody else on the staff that cares to, or needs to, receive PIKT report e-mail. There may be many other staff members in the organization who have no interest in receiving any PIKT e-mails.
boyce is a special case. Staff member boyce only cares to receive report e-mail for the mission-critical systems. For all other systems, we set his e-mail macro to the =piktnullchar, effectively a blank.
In the #include file's third section, we aggregate individual e-mail macros into two main groupings, the sysadmins and the coders. A bit later, you will see how, using these macros, we send coding-related alert e-mails just to the coders, and alert e-mails related to systems administration just to the sysadmins.
In the fourth section, we set the e-mail macros referenced in the alerts.cfg mailcmd settings. Here is a sample reference from alerts.cfg:
    SysAdminsUrgent
            ...
            mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Urgent' =sysadmins-urgent"

Returning to the pikt_mail_macros.cfg #include file, for the highest-priority alert--Emergency--we specify that everyone should receive e-mails.
=sysadmins should receive the SysAdminsUrgent, SysAdminsCritical, and SysAdminsWarning e-mails, and =coders should receive the CodersUrgent, CodersCritical, and CodersWarning e-mails.
The =piktadmin alone should receive all other alert e-mails.
We want the piktadmin to receive all PIKT alert e-mails. Where =coders is specified, we also tack on =piktadmin. We don't also tack on =piktadmin to =sysadmins, because the latter macro includes the piktadmin in its macro definition.
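The routing described above can be traced with a short Python sketch. This is only an illustration of how nested macro references resolve, not PIKT's own macro processor. The macro table copies the names from pikt_mail_macros.cfg, the example.org addresses are placeholders, and the sketch assumes a mission-critical host (so boyce resolves to a real address rather than =piktnullchar).

```python
# Hypothetical macro table mirroring pikt_mail_macros.cfg on a
# mission-critical host; the addresses are placeholders.
MACROS = {
    "byrd": "byrd@example.org",
    "dowland": "dowland@example.org",
    "boyce": "boyce@example.org",
    "telemann": "telemann@example.org",
    "tartini": "tartini@example.org",
    "josquin": "josquin@example.org",
    "piktadmin": "=byrd",
    "sysadmins": "=dowland =boyce =piktadmin",
    "coders": "=telemann =tartini =josquin",
    "pikt-emergency": "=sysadmins =coders",
    "sysadmins-urgent": "=sysadmins",
    "coders-urgent": "=coders =piktadmin",
}


def expand(name, macros=MACROS):
    """Recursively expand a macro name into a flat list of addresses."""
    recipients = []
    for token in macros[name].split():
        if token.startswith("="):
            # A leading '=' marks a reference to another macro.
            recipients.extend(expand(token[1:], macros))
        else:
            recipients.append(token)
    return recipients


for alert in ("sysadmins-urgent", "coders-urgent", "pikt-emergency"):
    print(alert, "->", ", ".join(expand(alert)))
```

Expanding each alert macro shows the piktadmin's address appearing exactly once in every list: through =sysadmins where that group is used, and through the explicit =piktadmin where only =coders is used.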
Moving on to alerts.cfg, here are two alert groups, one for high-priority alerts of interest to the sysadmins, and the other for high-priority alerts intended for the coders:
    ///////////////////////////////////////////////////////////////////////////////

    SysAdminsUrgent // stuff deserving nearly immediate attention

    #if missioncritical
            timing          15 * * * 1-5            // mon-fri
                            15 */2 * * 0,6          // sat-sun
    #else
            timing          15 6-18 * * 1-5         // mon-fri
    #endif
            mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Urgent' =sysadmins-urgent"
            alarms
                            // stuff of interest to the sysadmins
                            ...
                            DmesgScan       // reporting redflag items only
                            ...

    ///////////////////////////////////////////////////////////////////////////////

    CodersUrgent    // stuff deserving nearly immediate attention

    #if missioncritical
            timing          45 * * * 1-5            // mon-fri
    #else
            timing          45 6-18 * * 1-5         // mon-fri
    #endif
            mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Urgent' =coders-urgent"
            alarms
                            // stuff of interest to the coders
                            ...
                            DmesgScan       // reporting redflag items only
                            ...

    ///////////////////////////////////////////////////////////////////////////////

In words:
On mission-critical systems, Monday through Friday, sysadmins receive SysAdminsUrgent alerts hourly, at 15 minutes past the hour, and coders receive CodersUrgent alerts hourly, at 45 minutes after the hour.
For non mission-critical systems, Monday through Friday, sysadmins receive SysAdminsUrgent alerts hourly, at 15 minutes past the hour, but from 6 AM until 6 PM only. For coders, it is similar, except they again receive their alerts at 45 minutes after the hour.
On Saturdays and Sundays, for mission-critical systems only, sysadmins receive alerts every two hours. On these days, no alerts are sent for non mission-critical systems. Coders receive no weekend alerts.
Both sysadmins and coders receive DmesgScan alerts, adapting at run-time to report red-flagged items only. By reporting much the same things--the latest troublesome dmesg entries, to sysadmins on the one hand and coders on the other--a half hour apart, we achieve better dispersed, more nearly round-the-clock coverage without unduly bothering either the sysadmins or the coders. Contrast this setup with one where sysadmins (and coders) receive DmesgScan redflag alerts at both 15 and 45 minutes after the hour.
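To check a timing line against the descriptions above, a small Python sketch can expand the fields. It assumes the timing fields follow ordinary cron conventions (minute, hour, day of month, month, weekday, with 0 = Sunday), which matches the "mon-fri" and "sat-sun" comments in the stanzas. The function names are hypothetical, and this is not how PIKT itself schedules alerts.

```python
def expand_field(field, low, high):
    """Expand one cron-style field ('*', '6-18', '*/2', '6-18/2', '0,6')
    into a sorted list of the values it matches."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step))
    return sorted(values)


def fires(timing, minute, hour, weekday):
    """Return True if a 'min hour dom month dow' line matches the given time.

    Day of month and month are ignored because the examples leave them
    as '*'. weekday uses cron numbering: 0 = Sunday ... 6 = Saturday.
    """
    m, h, _dom, _mon, dow = timing.split()
    return (minute in expand_field(m, 0, 59)
            and hour in expand_field(h, 0, 23)
            and weekday in expand_field(dow, 0, 6))


# Hours covered by the non mission-critical weekday schedule.
print(expand_field("6-18", 0, 23))
# Hours covered by the mission-critical weekend schedule.
print(expand_field("*/2", 0, 23))
```

Note that under these assumptions a range such as 6-18 includes hour 18, so the last weekday SysAdminsUrgent alert on a non mission-critical system goes out at 6:15 PM.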
Here is the alerts.cfg stanza for the Critical alerts, which are at a priority level just below the Urgent alerts:
    ///////////////////////////////////////////////////////////////////////////////

    SysAdminsCritical       // important stuff, but not highest priority

    #if missioncritical
            timing          30 6-18 * * 1-5         // mon-fri
                            30 6,12,18 * * 0,6      // sat-sun
    #else
    # ifndef holiday
            timing          30 6-18/2 * * 1-5       // mon-fri
    # elsedef
            timing          =piktnever
    # endifdef
    #endif
            mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Critical' =sysadmins-critical"
            alarms
                            // stuff of interest to the sysadmins
                            ...
                            DmesgScan       // reporting yellowflag items only
                            ...

    ///////////////////////////////////////////////////////////////////////////////

In words:
On mission-critical systems, Monday through Friday, sysadmins receive SysAdminsCritical alerts hourly at 30 minutes past the hour, but from 6 AM until 6 PM only.
On non mission-critical systems, Monday through Friday, sysadmins receive SysAdminsCritical alerts every other hour at 30 minutes past the hour, from 6 AM to 6 PM only.
On Saturdays and Sundays, for mission-critical systems only, sysadmins receive alerts three times a day, at 6:30 AM, 12:30 PM, and 6:30 PM.
Coders receive no Critical alerts at all on the weekend.
In defines.cfg, we have set a "holiday" define in this way:
    holiday         FALSE   // are we in a holiday period (e.g., xmas or easter)?
                            // set this to TRUE when entering a holiday period,
                            // then re-enable all alerts to set up special
                            // restricted holiday schedule; after the holiday,
                            // set back to FALSE, then re-enable (hence reschedule)
                            // all alerts
Moreover, if we have taken care to reconfigure our PIKT setup before a holiday by means of

    # piktc -evr +D holiday +A all -H down sick

then, because we have set holiday to TRUE by means of the '+D holiday' at the command line, SysAdminsCritical alerts are effectively timed to be sent "never" for the non mission-critical systems. (Note that the '-r' restart option is needed to force piktd to restart and reread its configuration.)
So Critical alerts are sent on holidays for mission-critical systems only. (Whether the sysadmins are paying attention at all on holidays, say by checking their work e-mail from home, is another question altogether. But at least they are not being pestered by non mission-critical alerts on holidays.)
After the holiday, we should remember to re-enable (and thereby reschedule) all alerts by issuing the command
    # piktc -evr -D holiday +A all -H down sick

(The '-D holiday' is not really necessary--we could leave it out--because holiday defaults to FALSE in defines.cfg.)
Okay, so now we have a setup that routes the appropriate alert e-mails to the appropriate staff members at the appropriate times. What else can we do to fine-tune our reporting policies? Answer: Plenty. Following are just a few among many techniques we might use to pinpoint our PIKT alert messaging.
In alerts.cfg, PIKT is set to run alarm scripts, and send alert e-mails to groups of e-mail recipients, at varying times of the day and days of the week. We can also, at run time and depending on the timing, have Pikt scripts quit, use 'output log' instead of 'output mail', or take other steps to avoid sending alert messages at the wrong times.
Here are some scheduling macros to do just that:
    ///////////////////////////////////////////////////////////////////////////////
    // scheduling macros
    ///////////////////////////////////////////////////////////////////////////////

    night           ( #hour() < 6 )
    morning         ( #hour() >= 6 && #hour() < 12 )
    afternoon       ( #hour() >= 12 && #hour() < 18 )
    evening         ( #hour() >= 18 )

    ///////////////////////////////////////////////////////////////////////////////

    offhours(H)     // between 10 PM and 6 AM
                    ((H) >= 22 || (H) < 6)

    allhours(H)     // any time of the day or night
                    #true()

    ///////////////////////////////////////////////////////////////////////////////

    sunday          ( #weekday() == 1 )
    monday          ( #weekday() == 2 )
    tuesday         ( #weekday() == 3 )
    wednesday       ( #weekday() == 4 )
    thursday        ( #weekday() == 5 )
    friday          ( #weekday() == 6 )
    saturday        ( #weekday() == 7 )

    weekend         ((=friday && =evening) || =saturday || =sunday)

    ///////////////////////////////////////////////////////////////////////////////

    bypass_evening
                    if =evening
                            quit
                    fi

    ///////////////////////////////////////////////////////////////////////////////

    bypass_weekend
                    if =weekend
                            quit
                    fi

    ///////////////////////////////////////////////////////////////////////////////

    output_by_time(L)
                    if ! =weekend
                            output mail "(L)"
                    else
                            output log "(L)"
                    endif

    ///////////////////////////////////////////////////////////////////////////////

    reboot_period(D, H)     // (D) is currently unused
    #if tabletserver
                    (    (H) >= 1   // between 1 AM
                      && (H) <  2   // and 2 AM
                    )
    #else
                    ( (H) >= 25 )   // i.e., never
    #endif

    ///////////////////////////////////////////////////////////////////////////////

    bypass_reboots
                    if =reboot_period(#weekday(), #hour())
                            next
                    fi

    ///////////////////////////////////////////////////////////////////////////////

    bypass_day_rollover
                    if    ( #hour() == 23 && #minute() >= 30 )
                       || ( #hour() ==  0 && #minute() <  30 )
                            quit
                    fi

    ///////////////////////////////////////////////////////////////////////////////

There are many other such macros you could devise. These just give you a taste for what is possible.
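As a way of checking the boundaries these scheduling macros draw, here is a minimal Python sketch of a few of them. The function names are hypothetical. The only assumption beyond the macros themselves is the weekday numbering, 1 = Sunday through 7 = Saturday, as the =sunday and =saturday definitions imply.

```python
from datetime import datetime


def pikt_weekday(when):
    """Map a datetime to 1 = Sunday ... 7 = Saturday, as in the macros."""
    return when.isoweekday() % 7 + 1


def is_evening(when):
    """6 PM or later, like =evening."""
    return when.hour >= 18


def is_weekend(when):
    """Friday evening, or any time on Saturday or Sunday, like =weekend."""
    day = pikt_weekday(when)
    return (day == 6 and is_evening(when)) or day in (7, 1)


def is_offhours(when):
    """Between 10 PM and 6 AM, like =offhours(#hour())."""
    return when.hour >= 22 or when.hour < 6


def in_day_rollover(when):
    """Within 30 minutes either side of midnight, like =bypass_day_rollover."""
    return ((when.hour == 23 and when.minute >= 30)
            or (when.hour == 0 and when.minute < 30))


def destination(when):
    """Mirror =output_by_time(): mail during the week, log on the weekend."""
    return "log" if is_weekend(when) else "mail"


# 5 January 2024 was a Friday.
print(destination(datetime(2024, 1, 5, 17, 0)))   # Friday afternoon
print(destination(datetime(2024, 1, 5, 19, 0)))   # Friday evening
```

Friday at 5 PM still counts as a working period, while Friday at 7 PM already falls under the weekend rule, so the same message would be mailed in the first case and only logged in the second.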
Here is a useful and standard PIKT define, verbose:
    verbose
    #if new
                    TRUE    // if TRUE, output mail about routine execs, such as
                            // "deleting <this>" or "truncating <that>"; usually
                            // set this to FALSE; but occasionally set this to
                            // TRUE to get a fuller report of all that PIKT is
                            // doing silently, behind-the-scenes
    #else
                    FALSE
    #endif

And here is one way you might use it, in macros.cfg:
    // the verbose define controls whether certain routine messages get emailed
    // or thrown away; in earlier versions of PIKT, this conditionality was
    // handled in this way in alarms.cfg:
    //
    //   #ifdef verbose
    //         output mail "truncated $inlin"
    //   #endifdef
    //
    // with the macros below, we can now achieve the same effect by replacing
    // the above three lines with just this one line:
    //
    //         =outputmail "truncated $inlin"

    #ifdef verbose
    outputmail      output mail
    #elsedef
    outputmail      output log "/dev/null"
    #endifdef

    // if verbose is not defined (is set to FALSE), the message is logged to
    // /dev/null, that is, thrown away

And here is a more straightforward example use of verbose:
    #ifdef verbose
            rule    // for missioncritical systems, if state was "-",
                    // is now "+", then report change
                    if    " $missioncritical " =~ " $host "
                       && $state ne %state
                            output mail "$host is back up"
                    endif
    #endifdef

So, if you have too much PIKT messaging, consider putting '#ifdef verbose ... #endifdef' wrappers around some of your 'output mail' statements. Note that, in defines.cfg, you can set the verbose define on a per-system basis. So, in the example above, we automatically set verbose to TRUE on all new systems (where 'new' is defined in systems.cfg). We might also set verbose to TRUE on mission-critical systems but leave it set to FALSE everywhere else.
A major potential problem is endlessly repeating alert e-mails. Do you really need to be reminded hour after hour that system munich is down? Or day after day that this or that file is old or out-of-date? (Maybe you need to know, or be reminded of, these things, just not quite so often or repeatedly.)
To report something just once daily, you might do something like this:
    once_daily(A)   // (A) is some action, which could be a single Pikt
                    // statement, or many (including complex control structures)
                    // this macro assumes an earlier set #hr = #hour() statement
                    if #hr < %hr
                            (A)
                    fi

You might use the =once_daily() macro in a Pikt script this way:
    =once_daily(output mail "$sys is down")

Here is a define, akin to verbose, set in defines.cfg:
    stifle
    #if missioncritical | new
                    FALSE   // by default, limit how often certain relatively
                            // unimportant warnings get sent; from time to time,
                            // undefine stifle so that we may get a complete set
                            // of warnings
    #else
                    TRUE
    #endif

And following is how we might make use of the stifle define in macros.cfg. (Note that in this example, #fa refers to a "file age".)
    ///////////////////////////////////////////////////////////////////////////////

    // the stifle define controls how often certain routine messages ("nagmail")
    // are sent; in earlier versions of PIKT, this conditionality was handled
    // in this way in alarms.cfg:
    //
    //   #ifdef stifle
    //         if #fa % 7 == 0     // report only every 7 days
    //                 output mail "orphaned?: $inline"
    //         endif
    //   #elsedef
    //         output mail "orphaned?: $inline"
    //   #endifdef
    //
    // with the macros below, also with the stifle define, we can now achieve
    // the same effect by replacing the above seven lines with just these
    // three lines:
    //
    //         if #fa % =stifle(7) == 0    // report only every 7 days
    //                 output mail "orphaned?: $inline"
    //         endif

    #ifdef stifle
    stifle(N)       (N)
    #elsedef
    stifle(N)       1
    #endifdef

    ///////////////////////////////////////////////////////////////////////////////

The =once_daily() and =stifle() techniques are rather crude. Here are some more sophisticated macro techniques:
    driftfactor     300     // fudge factor to allow for alert timing drift

    // in all cases, (A1) refers to the action to be taken each period,
    // and (A2) refers to some default action taken at all other times
    // ((A2) is usually blank)

    periodically(A1, A2, M)
                    set #tv = #now()
                    if ! #defined(%tv) || (#tv - %tv >= (M)*60 - \=driftfactor)
                            (A1)
                    else
                            (A2)
                            set #tv = %tv
                    fi

    hourly(A1, A2)
                    set #tv60 = #now()
                    if ! #defined(%tv60) || (#tv60 - %tv60 >= 60*60 - \=driftfactor)
                            (A1)
                    else
                            (A2)
                            set #tv60 = %tv60
                    fi

    every_two_hours(A1, A2)
                    set #tv120 = #now()
                    if ! #defined(%tv120) || (#tv120 - %tv120 >= 120*60 - \=driftfactor)
                            (A1)
                    else
                            (A2)
                            set #tv120 = %tv120
                    fi

    every_four_hours(A1, A2)
                    set #tv240 = #now()
                    if ! #defined(%tv240) || (#tv240 - %tv240 >= 240*60 - \=driftfactor)
                            (A1)
                    else
                            (A2)
                            set #tv240 = %tv240
                    fi

    daily(A1, A2)
                    set #tv1440 = #now()
                    if ! #defined(%tv1440) || (#tv1440 - %tv1440 >= 1440*60 - \=driftfactor)
                            (A1)
                    else
                            (A2)
                            set #tv1440 = %tv1440
                    fi

Here is an example invocation of =hourly():
            rule    // page if the temp is greater than or equal to higher
                    // threshold, but only once every hour
                    if #envtemp >= #pagelim[#unit]
                            =hourly(set $pagemsg = $upper("AC$text(#unit): envtemp $text(#envtemp) >= pagelim $text(#pagelim[#unit])!")
                                    =page($pagemsg\, =pageaddr, =allhours(#now())), )
                    fi

Here are ways to use =hourly() and =periodically():
            rule    // report unusually high process count
                    if #procnum >= #procnumlim
                            // only report if proc count is rising
    #if missioncritical
                            =hourly(if #procnum > %procnum
                                            output mail "Unusually high process count: $text(#procnum)"
                                    fi, )
    #else
                            =periodically(if #procnum > %procnum
                                                  output mail "Unusually high process count: $text(#procnum)"
                                          fi, , 240)
    #endif
                    fi

In the case of non mission-critical systems, it says to report if the process count is rising, but at most every 240 minutes (4 hours). (Note the blank (A2) actions in all =hourly() and =periodically() macro calls above.)
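The throttling logic shared by =periodically() and its relatives can be restated in Python as a rough sketch. The class and method names are hypothetical, and timestamps are plain seconds. The attribute last_fired stands in for the %tv value carried over from the previous run.

```python
DRIFT_SECONDS = 300  # same fudge factor as the driftfactor macro


class Periodic:
    """Allow an action at most once per interval, tolerating timing drift."""

    def __init__(self, minutes, drift=DRIFT_SECONDS):
        self.threshold = minutes * 60 - drift
        self.last_fired = None  # plays the role of %tv from the previous run

    def due(self, now):
        """Return True, and remember `now`, if the action should fire."""
        if self.last_fired is None or now - self.last_fired >= self.threshold:
            self.last_fired = now
            return True
        # Not due: keep the old timestamp, as 'set #tv = %tv' does, so the
        # interval is measured from the last firing, not from the last run.
        return False


hourly = Periodic(60)
for now in (0, 1800, 3400, 5200):  # run times in seconds
    print(now, hourly.due(now))
```

The run at 3400 seconds is 200 seconds short of a full hour, yet it still fires because the drift allowance lowers the threshold to 3300 seconds. Without that allowance, a script that starts a little early would skip its slot and the report would slip to the following run.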
'output mail', the standard means of sending alert e-mail from within a Pikt script, sends e-mail to all recipients specified in the mailcmd in alerts.cfg. Rather than 'output mail' to a larger group, we could instead use the following special mail routing macro within Pikt scripts to send e-mail to specially designated individuals:
    ///////////////////////////////////////////////////////////////////////////////

    output_other_mail(P, S, R, L)   // output conditional mail to addressee(s)
                                    // beyond those specified in the alert
                                    // mailcmd; we don't #pclose() the (P)
                                    // proc handle at the end, instead letting
                                    // pikt do it, enabling us to make this
                                    // a one-liner macro
                                    // (P) is the proc handle name (e.g., MAIL)
                                    // (S) is the subject (e.g., 'check this out')
                                    // (R) is the recipient (e.g., byrd\)
                                    // (L) is the line (e.g., $inline)
                    if ! #defined(#isopen(P))
                            set #isopen(P) = #false()
                    fi
                    if ! #isopen(P)
                            if #popen((P), "=mailx -a 'From: piktadmin' -s (S) (R)", "w") != #err()
                                    set #isopen(P) = #true()
                            else
                                    output mail "\#popen() failure for: =mailx -s (S) (R)"
                                    quit
                            fi
                    fi
                    do #write((P), (L))

    ///////////////////////////////////////////////////////////////////////////////

Here is a sample invocation of the =output_other_mail macro, from a script to scan dmesg:
    #if systemssys
            rule
                    if $inlin =~~ "segfault"
                            if $alert() =~~ "coders"
    # if telemannsys
                                    =output_other_mail(DMESGSCAN, 'PIKT Dmesg Errors on =pikthostname', =piktadmin =telemann, $inlin)
    # elsif tartinisys
                                    =output_other_mail(DMESGSCAN, 'PIKT Dmesg Errors on =pikthostname', =piktadmin =tartini, $inlin)
    # elsif josquinsys
                                    =output_other_mail(DMESGSCAN, 'PIKT Dmesg Errors on =pikthostname', =piktadmin =josquin, $inlin)
    # endif
                            fi
                            next
                    fi
    #endif  // systemssys

So, the effect of this is, on each code development system, to send segfault messages just to the individual coder who owns that system (and to the piktadmin). For example, if a program under development segfaults on josquin's system, only josquin and the piktadmin are told about it.
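The open-once, write-many pattern behind =output_other_mail can be sketched in Python as follows. This is an illustration only: LazyMailer and fake_opener are hypothetical names, and an in-memory buffer stands in for the pipe to the mail command so that the sketch sends nothing.

```python
import io


class LazyMailer:
    """Open one outgoing message per handle on first use, then append to it."""

    def __init__(self, opener):
        self.opener = opener  # callable(subject, recipients) -> writable object
        self.handles = {}     # handle name -> open stream

    def write(self, handle, subject, recipients, line):
        # Open the stream only the first time this handle is seen,
        # as the #isopen(P) flag does in the macro.
        if handle not in self.handles:
            self.handles[handle] = self.opener(subject, recipients)
        self.handles[handle].write(line + "\n")


opened = []  # records each (subject, recipients) pair the stand-in opener sees


def fake_opener(subject, recipients):
    """Stand-in for piping to a mail command; returns an in-memory buffer."""
    opened.append((subject, recipients))
    return io.StringIO()


mailer = LazyMailer(fake_opener)
for text in ("segfault at 0", "segfault at 8"):
    mailer.write("DMESGSCAN", "PIKT Dmesg Errors", ["piktadmin", "josquin"], text)

print(len(opened), "message opened")
print(mailer.handles["DMESGSCAN"].getvalue(), end="")
```

Both lines end up in a single message, so the recipient gets one e-mail per script run rather than one per matching dmesg line.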
Because you want PIKT to report problems in a timely manner, alarm scripts must run more or less frequently. And because some people are busy or inattentive, or home sick, or away on vacation, you need to broadcast PIKT report e-mails to some extent. If you are not careful, though, PIKT might barrage you, the piktadmin, and everyone else on your staff with endless sysadmin "spam". But there are ways to fight back. This page has given you some weapons to use in the fight.