Developing Alarm Scripts

[posted 1999/09/03]

What follows is a short narrative of how I go about developing alarm scripts.  Some of it may be obvious.

A while ago, we received the following alert from one of our Solaris systems:

                                PIKT ALERT
                         Wed Sep  1 08:30:13 1999
                                 cologne

CRITICAL:
    DfChkCritical
        Detect filesystem near-full situations

        Filesystem /var on /dev/dsk/c0t0d0s3 is 96% full, 2282 Kb left

        43563   /var/crash
        4671    /var/sadm
        4113    /var/pikt
        444     /var/adm
        131     /var/snmp
        85      /var/cron
        46      /var/spool
        30      /var/dmi
        22      /var/log
        14      /var/yp

That 43 MB in /var/crash came as a surprise.  Out of curiosity, I then issued the following command to check out our other Solaris systems:

# piktc -xv +C "=dusk /var/crash" +H solaris -H downsys

/var/crash was empty on most systems, held 20-40 MB on several other systems, and exceeded 300 MB on one system!
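
For readers without the PIKT macros at hand, the per-directory listing in the alert can be approximated in plain shell.  This is only a sketch: it assumes that =dusk expands to something like "du -sk", which the text above does not confirm, and the function name usage_by_dir is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of PIKT): list each entry directly under a
# directory with its size in KB, largest first, much like the alert listing.
usage_by_dir() {
    dir=$1
    # du -sk prints "<kilobytes><TAB><path>" for each argument;
    # sort -rn orders the lines by the leading number, descending.
    du -sk "$dir"/* 2>/dev/null | sort -rn
}

# Example (read-only): usage_by_dir /var
```

Because the function only reads, it is safe to try on any system before deciding whether an alarm is needed.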

This called for development of another alarm:

///////////////////////////////////////////////////////////////////////////////

#if solaris

RemoveCrashFilesNotice  // before implementing this alarm, old /var/crash
                        // files had accumulated to as much as 300 MB on some
                        // of our systems

        init
                status active
                level notice
                task "Remove old system crash files"
                input proc "=find /var/crash -type f -exec =ll {} \\;"
                =lldata

        rule    // rm if more than one week old
                if #fileage($mon,$date,$time) > 7
                        exec "=rm $name"
                endif

#endif  // solaris

///////////////////////////////////////////////////////////////////////////////
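
Outside of PIKT, the effect of this rule can be approximated with find alone.  The sketch below is not the alarm itself; it assumes that #fileage() yields a file's age in days, and the function name remove_old_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-alone approximation of the alarm's rule: delete regular
# files under a directory that were last modified more than N days ago.
remove_old_files() {
    dir=$1
    days=${2:-7}
    # Nothing to do if the directory does not exist.
    [ -d "$dir" ] || return 0
    # -mtime +N matches files whose age, counted in whole 24-hour periods,
    # is greater than N.
    find "$dir" -type f -mtime +"$days" -exec rm -f {} \;
}

# Example: remove_old_files /var/crash 7
```

Replacing "-exec rm -f {} \;" with "-print" gives a dry run that only lists what would be removed.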

To test this, I first added it to the Test alert (in alerts.cfg on the piktmaster, vienna) and installed Test on cologne:

vienna# piktc -iv +A Test +H cologne

Then, on cologne, I ran it manually:

cologne# pikt +A Test

(Note that, when developing or debugging an alarm script, you can also edit the .alt files directly on the clients.)

After verifying that it worked properly (by running "ls -l /var/crash/cologne"), I then, on the piktmaster, deleted the temporary Test alert from cologne:

vienna# piktc -tv +A Test +H cologne

processing cologne...
disabling alert(s)...
Test disabled
deleting file(s)...
Test.alt deleted
deleting file(s)...
Test.hst deleted
deleting file(s)...
Test.log deleted

Then, in alerts.cfg (on vienna), I moved RemoveCrashFilesNotice from the Test alert to the Notice alert, taking care to put the appropriate #if wrapper around it:

Notice

...

        alarms

...

#if solaris
                        RemoveCrashFilesNotice
#endif

Finally, I installed the amended Notice alert on all solaris systems:

vienna# piktc -iv +A Notice +H solaris -H downsys

So, with RemoveCrashFilesNotice now in place, that's one less reason for /var to fill up.

[Here is a follow-up to the previous installment.]

Yesterday I walked through the process of creating a new alarm, RemoveCrashFilesNotice.  Today, we received a flurry of alert messages like this one:

                                PIKT ALERT
                         Thu Sep  2 09:55:50 1999
                                  minsk

DEBUG:
    PiktNoticeLogChk
        Detect pikt log errors

        /usr/bin/find: cannot open /var/crash: No such file or directory
        Sep  2 03:52:20 WARNING: in scan(), RemoveCrashFilesNotice, no input data

This sort of debug message is standard.  Two adjustments are in order.

First, we modify the alarm script by adding "2>/dev/null" to the end of the input proc statement:

///////////////////////////////////////////////////////////////////////////////

#if solaris

RemoveCrashFilesNotice  // before implementing this alarm, old /var/crash
                        // files had accumulated to as much as 300 MB on some
                        // of our systems

        init
                status active
                level notice
                task "Remove old system crash files"
                input proc "=find /var/crash -type f -exec =ll {} \\; 2>/dev/null"
                =lldata

        rule    // rm if more than one week old
                if #fileage($mon,$date,$time) > 7
                        exec "=rm $name"
                endif

#endif  // solaris

///////////////////////////////////////////////////////////////////////////////

This will suppress the

        /usr/bin/find: cannot open /var/crash: No such file or directory

log entry.
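
The effect of the redirection is easy to check in plain shell.  The following is a minimal sketch, with a made-up function name (quiet_find), showing that find on a missing directory then writes nothing at all.  The command still produces no input, which is why the "no input data" warning needs separate handling.

```shell
#!/bin/sh
# Hypothetical helper: run find on a directory, discarding its error messages.
# If the directory is missing, find still exits non-zero, but nothing is
# written to stderr, so nothing can end up in a log.
quiet_find() {
    find "$1" -type f 2>/dev/null
}

# Example: quiet_find /var/crash
```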

Next, we go to PiktNoticeLogChk in alarms.cfg and add "RemoveCrashFilesNotice" to the list of alarms where empty input is permissible:

        rule    // a check against a badly formed input command
                // that results in no input
                if    $inline =~ "WARNING:.+no input data"
                   && $inline !~ "SysMsgScanNotice|
                                          NumberedPacctFileNotice|
                                          NumberedSyslogFileNotice|
                                          TmpChkNotice|
                                          DfChkNotice|MailChkNotice|
                                          MailFileChkNotice|
                                          SpoolChkDateNotice|MailArcChkNotice|
                                          MajordomoArcChkNotice|
                                          RemoveCrashFilesNotice"
                        output mail $inline
                        next
                endif
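
The logic of this rule, match the warning and then exclude the exempted alarms, can be tried out on sample log lines with grep.  This is a sketch only: the exemption list is abbreviated, and the function name unexpected_no_input is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: print "no input data" warnings, except those from
# alarms where empty input is expected (abbreviated list, for illustration).
EXEMPT='SysMsgScanNotice|TmpChkNotice|DfChkNotice|RemoveCrashFilesNotice'

unexpected_no_input() {
    # First grep keeps the warnings; second grep drops the exempted alarms.
    grep -E 'WARNING:.+no input data' | grep -Ev "$EXEMPT"
}

# Example: unexpected_no_input < some-pikt-log-file
```

Feeding it the warning shown in the alert above should print nothing, since RemoveCrashFilesNotice is on the list.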

(It should be clear that a reinstall with

   piktc -iv +A Notice Debug +H all

is required after these edits.)

With these two adjustments, we should see no more debug messages like the one above.

Again, this is standard operating procedure for me.

For more examples, see Developer's Notes.

 
Page last updated 2019-01-12.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.