Temporary Glitches
[posted 2000/05/10]
Successfully running validation tests on new PIKT code is not enough, because your results are only as good as your tests. You also have to run the new code in an actual production environment.
We're running the latest pikt-1.10.0pre5 code on our master machine and are suddenly seeing these alerts:
    PIKT ALERT
    Wed May 10 12:55:02 2000
    vienna

    DEBUG:
        AlertChkCritical
            Detect PIKT alert/service daemon failures/restarts/redundancies

            piktd appears to be hung/dead on rheims

            piktd appears to be hung/dead on nantes

            ...
It seems like the piktd daemon is dead everywhere!
I cp'ed Debug.alt to Test.alt, edited out every alarm script except for AlertChkCritical, then added another output line to the last rule of AlertChkCritical:
        rule
                do #split ( $time , ":" )
[new -->]       output "$time, $[1], $[2], $[3]"
                set #t = #datevalue ( #yrnow - #if ( #monnow == 1 && $mon eq "Dec" , 1 , 0 ) , #monthnumber ( $mon ) , #date )
                       + #timevalue ( #val ( $[1] ) , #val ( $[2] ) , #val ( $[3] ) )
                if #timenow - #t > #hrs * 60 * 60
                        output mail "piktd appears to be hung/dead on $sys"
                endif
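To make the arithmetic in that rule easier to follow, here is a rough Python sketch of the same staleness check. It is not PIKT code. The function names (`last_entry_time`, `appears_hung`) are hypothetical, and it assumes that $mon, #date, and $time hold the month abbreviation, day of month, and HH:MM:SS time of the last piktd_log entry. The one-hour threshold in the usage line is also an assumption, since the value of #hrs is not shown here.

```python
from datetime import datetime, timedelta

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def last_entry_time(mon, day, time_str, now):
    """Rebuild the timestamp of the last log entry, which carries no year."""
    hh, mm, ss = (int(part) for part in time_str.split(":"))
    # A "Dec" entry seen in January belongs to the previous year.
    year = now.year - (1 if now.month == 1 and mon == "Dec" else 0)
    return datetime(year, MONTHS[mon], day, hh, mm, ss)


def appears_hung(mon, day, time_str, now, hrs):
    """True if the last log entry is more than `hrs` hours old."""
    return now - last_entry_time(mon, day, time_str, now) > timedelta(hours=hrs)


# Hypothetical check at the time of the alert, with an assumed 1-hour threshold.
now = datetime(2000, 5, 10, 12, 55, 2)
print(appears_hung("May", 10, "12:41:00", now, hrs=1))  # False
```

With correctly split fields, an entry logged at 12:41:00 is only about 14 minutes old at 12:55:02, so no alert would be expected under that threshold.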
After doing a 'pikt +A Test', I got these results:
    12:41:00, :, ÿ, ÿ
    piktd appears to be hung/dead on rheims
    12:42:00, :, ÿ, ÿ
    piktd appears to be hung/dead on nantes
    ...
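The extra output line was meant to print the time followed by its three fields. As a point of comparison, this small Python sketch (not PIKT; the helper name `fields_look_valid` is hypothetical) shows what a correct split on ":" would give and how the printed values differ:

```python
def fields_look_valid(fields):
    """True if the split produced exactly three all-digit fields."""
    return len(fields) == 3 and all(field.isdigit() for field in fields)


expected = "12:41:00".split(":")
observed = [":", "ÿ", "ÿ"]  # the values printed for $[1], $[2], $[3]

print(expected)                     # ['12', '41', '00']
print(fields_look_valid(expected))  # True
print(fields_look_valid(observed))  # False
```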
I recognized this instantly as a bug in the revised #split() implementation: the fields come back as garbage, so #t is computed from bogus values and the age test fires for every host. I'll fix the bug later this week. In the meantime, we don't want to keep seeing bogus messages about supposed piktd failures. So, I commented out the last rule of AlertChkCritical and added a =remind() macro to remind me a week from now to reactivate the rule:
        rule    // if it's been more than #hrs hours since the last piktd_log
                // entry, piktd is not logging, hence appears to be hung/dead;

                =remind(2000, 5, 17, "REACTIVATE FINAL ALERTCHKCRITICAL RULE AFTER DEBUG")

//              do #split($time, ":")
//              set #t = #datevalue(#yrnow - #if(#monnow == 1 && $mon eq "Dec", 1, 0), #monthnumber($mon), #date)
//                     + #timevalue(#val($[1]), #val($[2]), #val($[3]))
//              if #timenow - #t > #hrs*60*60
//                      output mail "piktd appears to be hung/dead on $sys"
//              endif
(It perhaps goes without saying that I then reinstalled with 'piktc -iv +A Debug +H vienna'.)
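The idea behind a dated reminder can be sketched in Python. This is not the definition of PIKT's =remind() macro, which is not shown here. It only assumes that such a reminder stays silent until the given date and then emits its message on every run. The function name, the `today` parameter, and the "REMINDER:" prefix are hypothetical.

```python
from datetime import date


def remind(year, month, day, message, today=None):
    """Return the reminder text once the given date has arrived, else None."""
    today = today or date.today()
    if today >= date(year, month, day):
        return f"REMINDER: {message}"
    return None


msg = "REACTIVATE FINAL ALERTCHKCRITICAL RULE AFTER DEBUG"
print(remind(2000, 5, 17, msg, today=date(2000, 5, 10)))  # None
print(remind(2000, 5, 17, msg, today=date(2000, 5, 17)))
```

The advantage of tying the reminder to the disabled rule itself is that the note lives next to the code it concerns, so it cannot be forgotten in a separate to-do list.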
Consider a similar approach when faced with temporary glitches of this kind: disable the misbehaving rule and leave yourself a dated reminder to reactivate it.
For more examples, see Developer's Notes.