Reporting Policies

PIKT's primary mission is to report problems.  To be effective, PIKT problem reports should be timely and get to the right people at a pace they can deal with.  When configured well, PIKT tells you what you need to know when you need to know it, no more, no less.  When configured badly, PIKT doesn't tell you enough, or worse, it tells you way, way too much.  So rather than people appreciating PIKT, they come to resent it.  At best, PIKT can be a useful and even essential part of your enterprise computing infrastructure.  At worst, PIKT can be an annoying nuisance.

Setting the right reporting policies is crucial to PIKT's success and acceptance throughout the enterprise.  Following is an extended discussion of ways you might manage the information flow and configure PIKT report messaging "well".

There are many ways to organize your PIKT e-mail macros.  (Indeed, you might disregard macros entirely and instead hard-code e-mail addresses within alerts.cfg and elsewhere.)  If you opt for the e-mail macro approach, this example pikt_mail_macros.cfg file shows one way you might arrange it:

///////////////////////////////////////////////////////////////////////////////
//
// pikt mail macros - pikt mail routing
//
///////////////////////////////////////////////////////////////////////////////

// piktadmin

byrd            byrd\@acme.com

piktadmin       =byrd         // pikt head honcho

///////////////////////////////////////////////////////////////////////////////

// everyone else

dowland         dowland\@acme.com
telemann        telemann\@acme.com
tartini         tartini\@acme.com
josquin         desprez.gmail\@acme.com

boyce
#if missioncritical
                boyce\@acme.com
#else
                =piktnullchar
#endif

///////////////////////////////////////////////////////////////////////////////

// mail groups

sysadmins               =dowland =boyce =piktadmin
coders                  =telemann =tartini =josquin

///////////////////////////////////////////////////////////////////////////////

// the various pikt- macros, the addresses used in the alerts.cfg mailcmd

pikt-emergency          =sysadmins =coders

sysadmins-urgent        =sysadmins
coders-urgent           =coders =piktadmin

sysadmins-critical      =sysadmins
coders-critical         =coders =piktadmin

sysadmins-warning       =sysadmins
coders-warning          =coders =piktadmin

pikt-notice             =piktadmin
pikt-info               =piktadmin
pikt-admin              =piktadmin
pikt-debug              =piktadmin

pikt-test               =piktadmin

///////////////////////////////////////////////////////////////////////////////
In this #include file's first section, we set some macros for the piktadmin.  'byrd' is a macro defined as the e-mail address 'byrd\@acme.com'.  In the second macro definition, =byrd is the designated piktadmin.

In the second section, we define the e-mail macros for everyone else.  "Everyone else" means everybody else on the staff that cares to, or needs to, receive PIKT report e-mail.  There may be many other staff members in the organization who have no interest in receiving any PIKT e-mails.

boyce is a special case.  Staff member boyce only cares to receive report e-mail for the mission-critical systems.  For all other systems, we set his e-mail macro to the =piktnullchar, effectively a blank.

In the #include file's third section, we aggregate individual e-mail macros into two main groupings, the sysadmins and the coders.  A bit later, you will see how, using these macros, we send coding-related alert e-mails just to the coders, and alert e-mails related to systems administration just to the sysadmins.

In the fourth section, we set the e-mail macros referenced in the alerts.cfg mailcmd settings.  Here is a sample reference from alerts.cfg:

SysAdminsUrgent

        ...

        mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Urgent' =sysadmins-urgent"
Returning to the pikt_mail_macros.cfg #include file, for the highest-priority alert--Emergency--we specify that everyone should receive e-mails.

=sysadmins should receive the SysAdminsUrgent, SysAdminsCritical, and SysAdminsWarning e-mails, and =coders should receive the CodersUrgent, CodersCritical, and CodersWarning e-mails.

The =piktadmin alone should receive all other alert e-mails.

We want the piktadmin to receive all PIKT alert e-mails.  Where =coders is specified, we also tack on =piktadmin.  We don't also tack on =piktadmin to =sysadmins, because the latter macro includes the piktadmin in its macro definition.
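
To check who ends up on each distribution list, it can help to expand the macros by hand.  The following is a minimal sketch in Python, not PIKT syntax: the MACROS table copies a subset of the definitions above (with the escaped '\@' written as a plain '@'), the expand() helper is hypothetical, and boyce is given his mission-critical definition.

```python
# Hypothetical Python model of the e-mail macros above.  PIKT performs
# the real expansion itself; this only mimics it for inspection.
MACROS = {
    "byrd": "byrd@acme.com",
    "piktadmin": "=byrd",
    "dowland": "dowland@acme.com",
    "telemann": "telemann@acme.com",
    "tartini": "tartini@acme.com",
    "josquin": "desprez.gmail@acme.com",
    "boyce": "boyce@acme.com",  # assumes a mission-critical system
    "sysadmins": "=dowland =boyce =piktadmin",
    "coders": "=telemann =tartini =josquin",
    "pikt-emergency": "=sysadmins =coders",
    "sysadmins-urgent": "=sysadmins",
    "coders-urgent": "=coders =piktadmin",
    "pikt-notice": "=piktadmin",
}


def expand(name, macros=MACROS):
    """Recursively expand a macro name into a flat list of addresses."""
    addresses = []
    for token in macros[name].split():
        if token.startswith("="):
            # A reference to another macro: expand it in place.
            addresses.extend(expand(token[1:], macros))
        else:
            addresses.append(token)
    return addresses


if __name__ == "__main__":
    for alert in ("pikt-emergency", "sysadmins-urgent",
                  "coders-urgent", "pikt-notice"):
        print(alert, "->", " ".join(expand(alert)))
```

In this model every alert macro expands to a list that contains byrd\@acme.com, which is the property the paragraph above describes: the piktadmin sees everything, either directly or by way of =sysadmins.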

Moving on to alerts.cfg, here are two alert groups, one for high-priority alerts of interest to the sysadmins, and the other for high-priority alerts intended for the coders:

///////////////////////////////////////////////////////////////////////////////

SysAdminsUrgent         // stuff deserving nearly immediate attention

#if missioncritical
        timing
                        15      *       * * 1-5  // mon-fri
                        15      */2     * * 0,6  // sat-sun
#else
        timing
                        15      6-18    * * 1-5  // mon-fri
#endif

        mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Urgent' =sysadmins-urgent"

        alarms          // stuff of interest to the sysadmins
                        ...
                        DmesgScan       // reporting redflag items only
                        ...

///////////////////////////////////////////////////////////////////////////////

CodersUrgent            // stuff deserving nearly immediate attention

#if missioncritical
        timing
                        45      *       * * 1-5  // mon-fri
#else
        timing
                        45      6-18    * * 1-5  // mon-fri
#endif

        mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Urgent' =coders-urgent"

        alarms          // stuff of interest to the coders
                        ...
                        DmesgScan       // reporting redflag items only
                        ...

///////////////////////////////////////////////////////////////////////////////
In words:

On mission-critical systems, Monday through Friday, sysadmins receive SysAdminsUrgent alerts hourly, at 15 minutes past the hour, and coders receive CodersUrgent alerts hourly, at 45 minutes after the hour.

For non mission-critical systems, Monday through Friday, sysadmins receive SysAdminsUrgent alerts hourly, at 15 minutes past the hour, but from 6 AM until 6 PM only.  For coders, it is similar, except they again receive their alerts at 45 minutes after the hour.

On Saturdays and Sundays, for mission-critical systems only, sysadmins receive alerts every two hours.  On these days, no alerts are sent for non mission-critical systems.  Coders receive no weekend alerts.

Both sysadmins and coders receive DmesgScan alerts, adapting at run-time to report red-flagged items only.  By reporting much the same things--the latest troublesome dmesg entries, to sysadmins on the one hand and coders on the other--a half hour apart, we achieve better dispersed, more nearly round-the-clock coverage without unduly bothering either the sysadmins or the coders.  Contrast this setup with one where sysadmins (and coders) receive DmesgScan redflag alerts at both 15 and 45 minutes after the hour.

Here is the alerts.cfg stanza for the Critical alerts, which are at a priority level just below the Urgent alerts:

///////////////////////////////////////////////////////////////////////////////

SysAdminsCritical       // important stuff, but not highest priority

#if missioncritical
        timing
                        30      6-18    * * 1-5  // mon-fri
                        30      6,12,18 * * 0,6  // sat-sun
#else
#  ifndef holiday
        timing
                        30      6-18/2  * * 1-5  // mon-fri
#  elsedef
        timing          =piktnever
#  endifdef
#endif

        mailcmd         "=mailx -s 'PIKT Alert on =pikthostname:
                         Critical' =sysadmins-critical"

        alarms          // stuff of interest to the sysadmins
                        ...
                        DmesgScan       // reporting yellowflag items only
                        ...

///////////////////////////////////////////////////////////////////////////////
In words:

On mission-critical systems, Monday through Friday, sysadmins receive SysAdminsCritical alerts hourly at 30 minutes past the hour, but from 6 AM until 6 PM only.

On non mission-critical systems, Monday through Friday, sysadmins receive SysAdminsCritical alerts every other hour, from 6 AM to 6 PM only.

On Saturdays and Sundays, for mission-critical systems only, sysadmins receive alerts at six-hour intervals: at 6:30 AM, 12:30 PM, and 6:30 PM.

Coders receive no Critical alerts at all on the weekend.
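
To see exactly when a timing line fires, you can expand its hour field by hand.  Here is a minimal Python sketch that does so, assuming the timing fields follow the usual cron conventions ('*', ranges, lists, and '/step'); the helper names expand_field() and run_hours() are hypothetical and not part of PIKT.

```python
def expand_field(field, lo, hi):
    """Expand one cron-style field: '*', 'a-b', 'a,b,c', optional '/step'."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        values.update(range(start, end + 1, step))
    return sorted(values)


def run_hours(timing):
    """Return the HH:MM run times in one day for a five-field timing line.

    Only the minute and hour fields are used; the minute is assumed to be
    a single number, as in all the timing lines shown in this section.
    """
    minute, hour, _dom, _month, _dow = timing.split()
    return ["%02d:%02d" % (h, int(minute)) for h in expand_field(hour, 0, 23)]


if __name__ == "__main__":
    for timing in ("15 6-18 * * 1-5",      # SysAdminsUrgent, not mission-critical
                   "15 */2 * * 0,6",       # SysAdminsUrgent, weekend
                   "30 6-18/2 * * 1-5",    # SysAdminsCritical, not mission-critical
                   "30 6,12,18 * * 0,6"):  # SysAdminsCritical, weekend
        hours = run_hours(timing)
        print(timing, "->", len(hours), "runs:", " ".join(hours))
```

Under these assumptions, note that the range 6-18 is inclusive at both ends, so "6 AM until 6 PM" means 13 runs per day, not 12, and 6-18/2 means 7 runs.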

In defines.cfg, we have set a "holiday" define in this way:

holiday         FALSE   // are we in a holiday period (e.g., xmas or easter)?
                        // set this to TRUE when entering a holiday period,
                        // then re-enable all alerts to set up special
                        // restricted holiday schedule; after the holiday,
                        // set back to FALSE, then re-enable (hence reschedule)
                        // all alerts

Moreover, suppose we have taken care to reconfigure our PIKT setup before a holiday by means of

# piktc -evr +D holiday +A all -H down sick
Because the '+D holiday' on the command line sets holiday to TRUE, SysAdminsCritical alerts are effectively timed to be sent "never" for the non mission-critical systems.  (Note that the '-r' restart option is needed to force piktd to restart and reread its configuration.)

So Critical alerts are sent on holidays for mission-critical systems only.  (Whether or not the sysadmins are paying attention at all on holidays--say by checking their work e-mail from home--is another question altogether.  But at least they are not being pestered by non mission-critical alerts on holidays.)

After the holiday, we should remember to re-enable (and thereby reschedule) all alerts by issuing the command

# piktc -evr -D holiday +A all -H down sick
(The '-D holiday' is not really necessary--we could leave it out--because holiday defaults to FALSE in defines.cfg.)
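
The nested #if / #ifndef logic in the SysAdminsCritical stanza reduces to a small decision table.  The following Python sketch spells it out; the function name is hypothetical, and the string "never" merely stands in for the =piktnever macro.

```python
NEVER = "never"  # stands in for the =piktnever macro


def sysadmins_critical_timing(missioncritical, holiday=False):
    """Mirror the #if missioncritical / #ifndef holiday branches above."""
    if missioncritical:
        # Mission-critical systems ignore the holiday define entirely.
        return ["30 6-18 * * 1-5", "30 6,12,18 * * 0,6"]
    if not holiday:
        return ["30 6-18/2 * * 1-5"]
    return [NEVER]


if __name__ == "__main__":
    for mc in (True, False):
        for hol in (False, True):
            print("missioncritical=%s holiday=%s ->" % (mc, hol),
                  sysadmins_critical_timing(mc, hol))
```

Only one of the four combinations, a non mission-critical system during a holiday, silences the alert.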

Okay, so now we have a setup that routes the appropriate alert e-mails to the appropriate staff members at the appropriate times.  What else can we do to fine tune our reporting policies?  Answer:  Plenty.  Following are just a few among many techniques we might use to pinpoint our PIKT alert messaging.

In alerts.cfg, PIKT is set to run alarm scripts, and send alert e-mails to groups of e-mail recipients, at varying times of the day and days of the week.  We can also have Pikt scripts decide at run time, depending on the time of day or day of the week, to quit, to 'output log' instead of 'output mail', or otherwise to avoid sending alert messages at the wrong times.

Here are some scheduling macros to do just that:

///////////////////////////////////////////////////////////////////////////////

// scheduling macros

///////////////////////////////////////////////////////////////////////////////

night                            ( #hour() <   6 )
morning         ( #hour() >=  6 && #hour() <  12 )
afternoon       ( #hour() >= 12 && #hour() <  18 )
evening         ( #hour() >= 18 )

///////////////////////////////////////////////////////////////////////////////

offhours(H)     // between 10 PM and 6 AM
                ((H) >= 22 || (H) < 6)

allhours(H)     // any time of the day or night
                #true()

///////////////////////////////////////////////////////////////////////////////

sunday          ( #weekday() == 1 )
monday          ( #weekday() == 2 )
tuesday         ( #weekday() == 3 )
wednesday       ( #weekday() == 4 )
thursday        ( #weekday() == 5 )
friday          ( #weekday() == 6 )
saturday        ( #weekday() == 7 )

weekend         ((=friday && =evening) || =saturday || =sunday)

///////////////////////////////////////////////////////////////////////////////

bypass_evening
                if =evening
                        quit
                fi

///////////////////////////////////////////////////////////////////////////////

bypass_weekend
                if =weekend
                        quit
                fi

///////////////////////////////////////////////////////////////////////////////

output_by_time(L)
                if ! =weekend
                        output mail "(L)"
                else
                        output log  "(L)"
                endif

///////////////////////////////////////////////////////////////////////////////

reboot_period(D, H)     // (D) is currently unused
#if tabletserver
                        (    (H)     >= 1       // between 1 AM
                          && (H)     <  2       // and     2 AM
                        )
#else
                        ( (H) >= 25 )           // i.e., never
#endif

///////////////////////////////////////////////////////////////////////////////

bypass_reboots
                        if =reboot_period(#weekday(), #hour())
                                next
                        fi

///////////////////////////////////////////////////////////////////////////////

bypass_day_rollover
                if    ( #hour() == 23 && #minute() >=  30 )
                   || ( #hour() == 0  && #minute() <   30 )
                        quit
                fi

///////////////////////////////////////////////////////////////////////////////
There are many other such macros you could devise.  These just give you a taste for what is possible.
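
If you want to sanity-check conditions like these before installing them, you can model them outside PIKT.  The sketch below restates a few of the scheduling macros as Python predicates.  It is only an illustration: the function names are hypothetical, and the day numbering (Sunday = 1 through Saturday = 7) is taken from the macro definitions above.

```python
from datetime import datetime


def pikt_weekday(dt):
    """Day number as the macros above assume: Sunday = 1 ... Saturday = 7."""
    return dt.isoweekday() % 7 + 1


def is_evening(dt):
    return dt.hour >= 18


def is_offhours(dt):
    # Between 10 PM and 6 AM.
    return dt.hour >= 22 or dt.hour < 6


def is_weekend(dt):
    # Friday evening, or any time Saturday or Sunday.
    wd = pikt_weekday(dt)
    return (wd == 6 and is_evening(dt)) or wd in (7, 1)


def in_day_rollover(dt):
    # The half hour on either side of midnight.
    return (dt.hour == 23 and dt.minute >= 30) or (dt.hour == 0 and dt.minute < 30)


if __name__ == "__main__":
    friday_night = datetime(2019, 1, 11, 19, 0)   # a Friday, 7 PM
    print(is_weekend(friday_night), is_offhours(friday_night))
```

Writing the conditions out this way makes the boundary cases easy to probe, for example that the weekend begins at 6 PM on Friday rather than at midnight.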

Here is a useful and standard PIKT define, verbose:

verbose
#if new
                TRUE    // if TRUE, output mail about routine execs, such as
                        // "deleting <this>" or "truncating <that>"; usually
                        // set this to FALSE; but occasionally set this to
                        // TRUE to get a fuller report of all that PIKT is
                        // doing silently, behind-the-scenes
#else
                FALSE
#endif
And here is one way you might use it, in macros.cfg:
// the verbose define controls whether certain routine messages get emailed
// or thrown away; in earlier versions of PIKT, this conditionality was
// handled in this way in alarms.cfg:
//
// #ifdef verbose
//              output mail "truncated $inlin"
// #endifdef
//
// with the macros below, we can now achieve the same effect by replacing
// the above three lines with just this one line:
//
//              =outputmail "truncated $inlin"

#ifdef verbose
outputmail      output mail
#elsedef
outputmail      output log "/dev/null"
#endifdef

// if verbose is not defined (is set to FALSE), the message is logged to
// /dev/null, that is, thrown away
And here is a more straightforward example of using verbose:
#ifdef verbose
        rule    // for missioncritical systems, if state was "-",
                // is now "+", then report change
                if    " $missioncritical " =~ " $host "
                   && $state ne %state
                        output mail "$host is back up"
                endif
#endifdef
So, if you have too much PIKT messaging, consider putting '#ifdef verbose ... #endifdef' wrappers around some of your 'output mail' statements.  Note that, in defines.cfg, you can set the verbose define on a per-system basis.  So, in the example above, we automatically set verbose to TRUE on all new systems (where 'new' is defined in systems.cfg).  We might also set verbose to TRUE on mission-critical systems but leave it set to FALSE everywhere else.

A major potential problem is endlessly repeating alert e-mails.  Do you really need to be reminded hour after hour that system munich is down?  Or day after day that this or that file is old or out-of-date?  (Maybe you do need to know, or be reminded of, these things, just not quite so often.)

To report something just once daily, you might do something like this:

once_daily(A)
                // (A) is some action, which could be a single Pikt
                // statement, or many (including complex control structures)
                // this macro assumes an earlier set #hr = #hour() statement
                if #hr < %hr
                        (A)
                fi
You might use the =once_daily() macro in a Pikt script this way:
                =once_daily(output mail "$sys is down")
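
The test '#hr < %hr' compares the current hour with the hour recorded on the previous run, so it is true only on the first run after midnight.  Here is a small Python model of that comparison.  It is a sketch, not PIKT code: the function name is hypothetical, and it assumes the action does not fire on the very first run, when no previous value exists.

```python
def once_daily_runs(hours):
    """Return the indices of the runs at which the action would fire.

    `hours` holds the hour of day at each successive run of the script.
    """
    fired = []
    prev = None  # plays the role of %hr, the value from the previous run
    for i, hr in enumerate(hours):
        if prev is not None and hr < prev:
            fired.append(i)  # the hour wrapped around: a new day has begun
        prev = hr
    return fired


if __name__ == "__main__":
    # A script run hourly for two days, starting at 8 PM.
    hourly = [h % 24 for h in range(20, 20 + 48)]
    print(once_daily_runs(hourly))

    # A script run once a day, always at 6 AM: the hour never decreases.
    print(once_daily_runs([6, 6, 6]))
```

The second case shows a limitation of this model of the technique: it relies on the script running more than once a day, since an alarm that always runs at the same hour never sees the hour decrease.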
Here is a define, akin to verbose, set in defines.cfg:
stifle
#if missioncritical | new
                FALSE   // by default, limit how often certain relatively
                        // unimportant warnings get sent; from time to time,
                        // undefine stifle so that we may get a complete set
                        // of warnings
#else
                TRUE
#endif
And following is how we might make use of the stifle define in macros.cfg.  (Note that in this example, #fa refers to a "file age".)
///////////////////////////////////////////////////////////////////////////////

// the stifle define controls how often certain routine messages ("nagmail")
// are sent; in earlier versions of PIKT, this conditionality was handled
// in this way in alarms.cfg:
//
// #ifdef stifle
//              if #fa % 7 == 0                 // report only every 7 days
//                      output mail "orphaned?: $inline"
//              endif
// #elsedef
//                      output mail "orphaned?: $inline"
// #endifdef
//
// with the macros below, also with the stifle define, we can now achieve
// the same effect by replacing the above seven lines with just these
// three lines:
//
//              if #fa % =stifle(7) == 0        // report only every 7 days
//                      output mail "orphaned?: $inline"
//              endif

#ifdef stifle
stifle(N)       (N)
#elsedef
stifle(N)       1
#endifdef

///////////////////////////////////////////////////////////////////////////////
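
The =stifle() trick works because any whole number modulo 1 is 0.  The following Python sketch models the two expansions; the names are hypothetical, and the `stifled` flag stands in for whether the stifle define is set.

```python
def stifle(n, stifled):
    """Model of the =stifle(N) macro: N when stifling, otherwise 1."""
    return n if stifled else 1


def should_report(file_age_days, stifled, every=7):
    """Model of the test 'if #fa % =stifle(7) == 0'."""
    return file_age_days % stifle(every, stifled) == 0


if __name__ == "__main__":
    ages = range(1, 22)
    print([a for a in ages if should_report(a, stifled=True)])
    print(len([a for a in ages if should_report(a, stifled=False)]))
```

With stifling on, the warning goes out only when the file age is a multiple of 7 days; with it off, the modulus is 1 and the warning goes out on every run.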
The =once_daily() and =stifle() techniques are rather crude.  Here are some more sophisticated macro techniques:
driftfactor             300     // fudge factor to allow for alert timing drift

// in all cases, (A1) refers to the action to be taken each period,
// and (A2) refers to some default action taken at all other times
// ((A2) is usually blank)

periodically(A1, A2, M)
                        set #tv = #now()
                        if ! #defined(%tv) || (#tv - %tv >= (M)*60 - \=driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv = %tv
                        fi

hourly(A1, A2)
                        set #tv60 = #now()
                        if ! #defined(%tv60) || (#tv60 - %tv60 >= 60*60 - \=driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv60 = %tv60
                        fi

every_two_hours(A1, A2)
                        set #tv120 = #now()
                        if ! #defined(%tv120) || (#tv120 - %tv120 >= 120*60 - \=driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv120 = %tv120
                        fi

every_four_hours(A1, A2)
                        set #tv240 = #now()
                        if ! #defined(%tv240) || (#tv240 - %tv240 >= 240*60 - \=driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv240 = %tv240
                        fi

daily(A1, A2)
                        set #tv1440 = #now()
                        if ! #defined(%tv1440) || (#tv1440 - %tv1440 >= 1440*60 - \=driftfactor)
                                (A1)
                        else
                                (A2)
                                set #tv1440 = %tv1440
                        fi
Here is an example invocation of =hourly():
        rule    // page if the temp is greater than or equal to higher
                // threshold, but only once every hour
                if #envtemp >= #pagelim[#unit]
                        =hourly(set $pagemsg = $upper("AC$text(#unit):
                                envtemp $text(#envtemp) >= pagelim $text(#pagelim[#unit])!")
                                =page($pagemsg\, =pageaddr, =allhours(#now())), )
                fi
Here are ways to use =hourly() and =periodically():
        rule    // report unusually high process count
                if #procnum >= #procnumlim
                        // only report if proc count is rising
#if missioncritical
                        =hourly(if #procnum > %procnum
                                output mail "Unusually high process count:
                                             $text(#procnum)"
                                fi, )
#else
                        =periodically(if #procnum > %procnum
                                      output mail "Unusually high process count:
                                                   $text(#procnum)"
                                      fi, , 240)
#endif
                fi
In the case of non mission-critical systems, this says to report if the process count is rising, but at most once every 240 minutes (4 hours).  (Note the blank (A2) actions in all =hourly() and =periodically() macro calls above.)
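
The throttling in =periodically() and its fixed-period siblings can be modeled in a few lines of Python.  This is a sketch under stated assumptions, not PIKT code: the function name is hypothetical, times are plain seconds, and the `last` variable plays the role of the previous-run value %tv.

```python
DRIFT = 300  # seconds, as in the driftfactor macro


def throttle(run_times, period_min, drift=DRIFT):
    """Return the run times (in seconds) at which action (A1) would fire."""
    fired = []
    last = None  # plays the role of %tv
    for now in run_times:
        if last is None or now - last >= period_min * 60 - drift:
            fired.append(now)
            last = now  # remember when the action last fired
        # Otherwise `last` is left alone, which is what the macro's
        # 'set #tv = %tv' in the else branch accomplishes.
    return fired


if __name__ == "__main__":
    # An hourly alert whose runs drift a few seconds early or late.
    runs = [0, 3598, 7201, 10795]
    print(throttle(runs, 60))           # with the 300-second allowance
    print(throttle(runs, 60, drift=0))  # with no allowance
```

The comparison with drift=0 shows why the fudge factor is there: a run that starts even two seconds early would otherwise be judged "too soon" and skipped, halving the effective reporting rate.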

'output mail', the standard means of sending alert e-mail from within a Pikt script, sends e-mail to all recipients specified in the mailcmd in alerts.cfg.  Rather than 'output mail' to a larger group, we could instead use the following special mail routing macro within Pikt scripts to send e-mail to specially designated individuals:

///////////////////////////////////////////////////////////////////////////////

output_other_mail(P, S, R, L)   // output conditional mail to addressee(s)
                                // beyond those specified in the alert
                                // mailcmd; we don't #pclose() the (P)
                                // proc handle at the end, instead letting
                                // pikt do it, enabling us to make this
                                // a one-liner macro
                                // (P) is the proc handle name (e.g., MAIL)
                                // (S) is the subject (e.g., 'check this out')
                                // (R) is the recipient (e.g., byrd\@acme.com)
                                // (L) is the line (e.g., $inline)

                if ! #defined(#isopen(P))
                        set #isopen(P) = #false()
                fi
                if ! #isopen(P)
                        if #popen((P), "=mailx -a 'From: piktadmin' -s (S) (R)",
                                       "w") != #err()
                                set #isopen(P) = #true()
                        else
                                output mail "\#popen() failure for: =mailx -s (S) (R)"
                                quit
                        fi
                fi
                do #write((P), (L))

///////////////////////////////////////////////////////////////////////////////
Here is a sample invocation of the =output_other_mail macro, from a script to scan dmesg:
#if systemssys
        rule
                if $inlin =~~ "segfault"
                        if $alert() =~~ "coders"
#  if telemannsys
                                =output_other_mail(DMESGSCAN,
                                                   'PIKT Dmesg Errors on =pikthostname',
                                                   =piktadmin =telemann, $inlin)
#  elsif tartinisys
                                =output_other_mail(DMESGSCAN,
                                                   'PIKT Dmesg Errors on =pikthostname',
                                                   =piktadmin =tartini, $inlin)
#  elsif josquinsys
                                =output_other_mail(DMESGSCAN,
                                                   'PIKT Dmesg Errors on =pikthostname',
                                                   =piktadmin =josquin, $inlin)
#  endif
                        fi
                        next
                fi
#endif  // systemssys
So the effect, on each code development system, is to send segfault messages just to that system's owner (and to the piktadmin).  For example, if a program under development segfaults on josquin's system, only he and the piktadmin are told about it.
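
The core of =output_other_mail() is an "open on first use, then reuse" pattern: the mail pipe is opened when the first matching line appears, and later lines are written to the same pipe, so each recipient gets one message rather than one per line.  Here is a minimal Python sketch of that pattern.  The class and the fake opener are hypothetical, and appending a newline to each line is an assumption of the sketch.

```python
import io


class OtherMail:
    """Open each mail pipe on first use and reuse it for later lines."""

    def __init__(self, opener):
        self._opener = opener  # callable(subject, recipients) -> writable
        self._pipes = {}       # handle name -> already-open pipe

    def write(self, handle, subject, recipients, line):
        if handle not in self._pipes:
            # First line for this handle: open the pipe once.
            self._pipes[handle] = self._opener(subject, recipients)
        self._pipes[handle].write(line + "\n")


# A stand-in opener that records what was opened instead of sending mail.
opened = []


def fake_opener(subject, recipients):
    buf = io.StringIO()
    opened.append((subject, recipients, buf))
    return buf


mail = OtherMail(fake_opener)
mail.write("DMESGSCAN", "PIKT Dmesg Errors", "byrd@acme.com", "segfault at 0")
mail.write("DMESGSCAN", "PIKT Dmesg Errors", "byrd@acme.com", "segfault at 8")
```

After the two writes, the opener has been called once and the single buffer holds both lines.  In a real setting the opener would start the mail command instead, much as the macro's #popen() call does.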

Because you want PIKT to report problems in a timely manner, alarm scripts must run more or less frequently.  And because some people are busy or inattentive, or home sick, or away on vacation, you need to broadcast PIKT report e-mails to some extent. If you are not careful, though, PIKT might barrage you, the piktadmin, and everyone else on your staff with endless sysadmin "spam".  But there are ways to fight back.  This page has given you some weapons to use in the fight.

Page best viewed at 1024x768 or greater.   Page last updated 2019-01-12.   This site is PIKT® powered.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.