Developing Alarm Scripts
[posted 1999/09/03]
What follows is a short narrative of how I go about developing alarm scripts. Some of it may be obvious, some of it not.
A while ago, we received the following alert from one of our Solaris systems:
                                PIKT ALERT
                         Wed Sep 1 08:30:13 1999
                                 cologne

CRITICAL:
    DfChkCritical
        Detect filesystem near-full situations

        Filesystem /var on /dev/dsk/c0t0d0s3 is 96% full, 2282 Kb left

            43563 /var/crash
             4671 /var/sadm
             4113 /var/pikt
              444 /var/adm
              131 /var/snmp
               85 /var/cron
               46 /var/spool
               30 /var/dmi
               22 /var/log
               14 /var/yp
That 43 MB in /var/crash came as a surprise. Out of curiosity, I then issued the following command to check out our other Solaris systems:
# piktc -xv +C "=dusk /var/crash" +H solaris -H downsys
/var/crash was empty on most systems, held 20-40 MB on several other systems, and exceeded 300 MB on one system!
This called for development of another alarm:
///////////////////////////////////////////////////////////////////////////////

#if solaris

RemoveCrashFilesNotice

        // before implementing this alarm, old /var/crash
        // files had accumulated to as much as 300 MB on some
        // of our systems

        init
                status active
                level notice
                task "Remove old system crash files"
                input proc "=find /var/crash -type f -exec =ll {} \\;"
                =lldata

        rule    // rm if more than one week old
                if #fileage($mon,$date,$time) > 7
                        exec "=rm $name"
                endif

#endif  // solaris

///////////////////////////////////////////////////////////////////////////////
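For readers who do not know PIKT script, the logic of this alarm can be sketched in Python. This is only an illustration, not what PIKT runs internally. It assumes that #fileage yields a file's age in days, and the names old_files and remove_old_files are hypothetical. The dry_run default is my own precaution so that nothing is deleted by accident.

```python
import os
import time

SECONDS_PER_DAY = 86400


def old_files(root, max_age_days=7, now=None):
    """Return regular files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    found = []
    # os.walk yields nothing when root does not exist, so a missing
    # directory simply produces an empty list.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Mirror "find -type f": regular files only, no symlinks.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getmtime(path) < cutoff:
                found.append(path)
    return sorted(found)


def remove_old_files(root, max_age_days=7, dry_run=True, now=None):
    """Delete the old files (unless dry_run) and return the list of paths."""
    victims = old_files(root, max_age_days, now)
    if not dry_run:
        for path in victims:
            os.remove(path)
    return victims
```

Calling remove_old_files("/var/crash") only lists the candidates. Passing dry_run=False actually removes them.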
To test this, I first added it to the Test alert (in alerts.cfg on the piktmaster, vienna) and installed Test on cologne:
vienna# piktc -iv +A Test +H cologne
Then, on cologne, I ran it manually:
cologne# pikt +A Test
(Note that, when developing or debugging an alarm script, you can also edit the .alt files directly on the clients.)
After verifying that it worked properly (with "ls -l /var/crash/cologne"), I went back to the piktmaster and deleted the temporary Test alert from cologne:
vienna# piktc -tv +A Test +H cologne

processing cologne...
disabling alert(s)...
Test disabled
deleting file(s)...
Test.alt deleted
deleting file(s)...
Test.hst deleted
deleting file(s)...
Test.log deleted
Then, in alerts.cfg (on vienna), I moved RemoveCrashFilesNotice from the Test alert to the Notice alert, taking care to put the appropriate #if wrapper around it:
Notice
        ...
        alarms
                ...
#if solaris
                RemoveCrashFilesNotice
#endif
Finally, I installed the amended Notice alert on all solaris systems:
vienna# piktc -iv +A Notice +H solaris -H downsys
So, with RemoveCrashFilesNotice now in place, that's one less reason for /var to fill up.
[Here is a follow-up to the previous installment.]
Yesterday I walked through the process of creating a new alarm, RemoveCrashFilesNotice. Today, we received a flurry of alert messages like this one:
                                PIKT ALERT
                         Thu Sep 2 09:55:50 1999
                                  minsk

DEBUG:
    PiktNoticeLogChk
        Detect pikt log errors

        /usr/bin/find: cannot open /var/crash: No such file or directory

        Sep 2 03:52:20 WARNING: in scan(), RemoveCrashFilesNotice, no input data
This sort of debug message is standard. Some adjustments are in order.
First, we have to modify the alarm script by adding a "2>/dev/null" to the end of the input proc statement.
///////////////////////////////////////////////////////////////////////////////

#if solaris

RemoveCrashFilesNotice

        // before implementing this alarm, old /var/crash
        // files had accumulated to as much as 300 MB on some
        // of our systems

        init
                status active
                level notice
                task "Remove old system crash files"
                input proc "=find /var/crash -type f -exec =ll {} \\; 2>/dev/null"
                =lldata

        rule    // rm if more than one week old
                if #fileage($mon,$date,$time) > 7
                        exec "=rm $name"
                endif

#endif  // solaris

///////////////////////////////////////////////////////////////////////////////
This will suppress the
/usr/bin/find: cannot open /var/crash: No such file or directory
log entry.
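The redirect hides only find's complaint. The command still produces no output when /var/crash is missing, which is why the "no input data" warning needs separate handling. Here is a minimal Python sketch of that behavior, assuming a POSIX sh and find are on the PATH. The helper name run_listing is hypothetical.

```python
import subprocess


def run_listing(directory, suppress_errors):
    """Run find on directory via sh, optionally discarding stderr in the shell."""
    command = 'find "$1" -type f'
    if suppress_errors:
        command += " 2>/dev/null"
    # The directory is passed as a positional parameter ($1) to avoid quoting problems.
    return subprocess.run(
        ["sh", "-c", command, "sh", directory],
        capture_output=True,
        text=True,
    )
```

Run against a nonexistent directory, both variants return empty stdout. Only the variant without the redirect returns any error text on stderr.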
Next, we go to PiktNoticeLogChk in alarms.cfg and add "RemoveCrashFilesNotice" to the list of alarms where empty input is permissible:
        rule    // a check against a badly formed input command
                // that results in no input
                if    $inline =~ "WARNING:.+no input data"
                   && $inline !~ "SysMsgScanNotice|
                                  NumberedPacctFileNotice|
                                  NumberedSyslogFileNotice|
                                  TmpChkNotice|
                                  DfChkNotice|MailChkNotice|
                                  MailFileChkNotice|
                                  SpoolChkDateNotice|MailArcChkNotice|
                                  MajordomoArcChkNotice|
                                  RemoveCrashFilesNotice"
                        output mail $inline
                        next
                endif
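The filtering in this rule can be sketched in Python as follows. This is an illustration only. It assumes that PIKT's =~ and !~ behave like unanchored regular-expression searches, and that the whitespace inside the alarm list is layout rather than part of the pattern. The function name should_mail is hypothetical.

```python
import re

# Log lines that report an alarm receiving no input data.
NO_INPUT = re.compile(r"WARNING:.+no input data")

# Alarms for which empty input is normal and should not be reported.
EMPTY_INPUT_OK = re.compile("|".join([
    "SysMsgScanNotice",
    "NumberedPacctFileNotice",
    "NumberedSyslogFileNotice",
    "TmpChkNotice",
    "DfChkNotice",
    "MailChkNotice",
    "MailFileChkNotice",
    "SpoolChkDateNotice",
    "MailArcChkNotice",
    "MajordomoArcChkNotice",
    "RemoveCrashFilesNotice",
]))


def should_mail(line):
    """True if the line is a no-input warning from an alarm not on the list."""
    return bool(NO_INPUT.search(line)) and not EMPTY_INPUT_OK.search(line)
```

With RemoveCrashFilesNotice on the list, the warning from the alert above no longer qualifies for mailing, while the same warning from an unlisted alarm still does.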
(It should be clear that a reinstall with
piktc -iv +A Notice Debug +H all
is required after these edits.)
I contend that with these two adjustments, we shouldn't see any more debug messages like the one above.
Again, this is standard operating procedure for me.
For more examples, see Developer's Notes.