Several weeks ago, I wrote about the NISChkEmergency alarm we installed at our site. This past week, on a couple of occasions, the alarm went off, and we were paged with the message, "NIS ON ... IS SICK/DOWN!" Subsequent investigation showed that whatever the problem was, it was transitory. We think it was a false positive, caused by the alarm script running simultaneously with an NIS map update (so that ypserv was temporarily not serving, or was serving incomplete information).
How do we solve this apparent false-positive problem? The answer is to repeat the tests and report only persistent errors.
Here is a revised NISChkEmergency. Even if you don't run NIS at your site(s), and most likely you don't, you might still learn a new technique or two from the following.
NISChkEmergency

        init
                status active
                level emergency
                task "Report malfunctions in the NIS service"

        begin
                set #errors = 0         // initialize

                pause #random(60)       // up to a 1-minute pause so we don't
                                        // hit the NIS servers all at once
                                        // (although alert timing drift may
                                        // already have solved this problem)

                // add more cases as warranted

                // we now loop through this sequence of tests twice, if an
                // error is reported, and only report an NIS malfunction if
                // more than one error is detected--this to prevent some
                // observed false positives (probably because this was being
                // run simultaneously with a ypmake or yppush operation)

                set #loop = 0
                repeat
                        // this is a required test, because nextuid should
                        // be the last entry in our NIS passwd file
                        set $c = $command("=ypmatch nextuid passwd 2>&1")
                        if $c !~ "^nextuid:"
                                output mail $c
                                set #errors += 1
                        fi

                        set $c = $command("=ypmatch brahms passwd 2>&1")
                        if $c !~ "^brahms:"
                                output mail $c
                                set #errors += 1
                        fi

                        set $c = $command("=ypmatch 12357 passwd.byuid 2>&1")
                        if $c !~ "^brahms:"
                                output mail $c
                                set #errors += 1
                        fi

                        set $c = $command("=ypmatch johannes.brahms aliases 2>&1")
                        if $c !~ "^brahms@"
                                output mail $c
                                set #errors += 1
                        fi

                        set $c = $command("=ypmatch hamburg hosts 2>&1")
                        if $c !~ "^111.222"
                                output mail $c
                                set #errors += 1
                        fi

# if nisserver
                        if $command("=ypcat passwd | =tail -10 | =wc -l") !~ "10"
# else
                        if $command("=ypcat passwd | =head -20 | =wc -l") !~ "20"
# endif
                                set $c = $command("=ypcat passwd 2>&1")  // get first line (err msg) only
                                output mail $c
                                set #errors += 1
                        fi

                        if #errors == 0
                                break
                        fi

                        pause 120       // pause to give time for any ongoing
                                        // ypmake or yppush operation to
                                        // finish (if indeed that is the
                                        // cause of any false positives)
                        set #loop += 1
                until #loop >= 2 || #errors > 1

# if nismaster
                do #split($command("=wc -l /etc/NIS/passwd"))
                set #lcurr = #val($1)
                do #split($command("=wc -l /etc/NIS/passwd.nightly.backup"))
                set #lback = #val($1)
                if #lcurr < #lback - 10         // allow for limited truncation
                        output mail
                        "NIS passwd file may be too small! Current line count is $text(#lcurr), was $text(#lback) in passwd.nightly.backup."
                        set #errors += 2        // increment by 2 to ensure
                                                // its reporting below
                fi
# endif

        end
                if #errors >= 2
                        set $server = $command("=ypwhich 2>&1")
                        output mail "NIS ON $upper($server) IS SICK/DOWN!"
                fi

# ifdef page
                // page just once per downage
                if #errors >= 2 && ( ! #defined(%errors) || %errors < 2 )
                        exec wait "echo 'NIS on $server is sick/down' | =mailx -s 'NIS on $server is sick/down' =pagesysadmins"
                endif
# endifdef
This is one of those rare instances where it makes sense to employ the little-used repeat-until control structure.
Here is what we now do. We run through all the tests at least once. If there are no errors, we move on. If there are two or more errors after the first round, we conclude that the problem is real and proceed with the rest of the report. If, however, there is just one error, we pause for 120 seconds to allow any ongoing yp operations to finish, then rerun all the tests. Because #errors is not reset between rounds, an error that recurs in the second round brings the count to two and triggers the report, while a one-time error leaves the count at one. We no longer report a lone error, because a real error must be repeatable.
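For readers who don't use this alarm language, the retry logic can be sketched in Python. This is only an illustration of the control flow, not the script itself: the names count_persistent_errors, settle, max_rounds and threshold are made up for this sketch, and each check is assumed to be a function that returns True when it passes. The settle callback stands in for "pause 120" and defaults to doing nothing so the example runs instantly.

```python
from typing import Callable, Iterable

Check = Callable[[], bool]  # hypothetical: returns True when the check passes


def count_persistent_errors(
    checks: Iterable[Check],
    max_rounds: int = 2,
    threshold: int = 2,
    settle: Callable[[], None] = lambda: None,
) -> int:
    """Run every check, rerunning the whole set if anything failed.

    Errors accumulate across rounds (they are never reset), so a failure
    that recurs in the second round pushes the total to the threshold.
    """
    checks = list(checks)
    errors = 0
    rounds = 0
    while True:  # Python has no repeat-until, so emulate it with a break
        for check in checks:
            if not check():
                errors += 1
        if errors == 0:
            break  # clean round: nothing to report
        settle()  # stands in for "pause 120" in the alarm script
        rounds += 1
        if rounds >= max_rounds or errors >= threshold:
            break
    return errors


def fails_once() -> Check:
    """Build a check that fails on its first call and passes afterwards."""
    state = {"calls": 0}

    def check() -> bool:
        state["calls"] += 1
        return state["calls"] > 1

    return check


# A transient failure is seen once; a persistent one is counted twice.
transient = count_persistent_errors([fails_once(), lambda: True])
persistent = count_persistent_errors([lambda: False, lambda: True])
print(transient >= 2, persistent >= 2)
```

In real use, settle would be something like a 120-second sleep. The caller reports a malfunction only when the returned count reaches the threshold, which is why the transient case above stays silent.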
Sometimes, little adjustments like this can really make a difference.
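One other small adjustment in the script is the guard that pages "just once per downage." A rough Python sketch of that test follows. It assumes %errors holds the value #errors had on the previous run of the alarm, which is how the guard reads. The function name should_page and its parameters are invented for this illustration.

```python
from typing import List, Optional


def should_page(errors: int, previous_errors: Optional[int], threshold: int = 2) -> bool:
    """Page only when this run is failing and the previous run was not.

    previous_errors is None when there is no earlier run to compare with,
    mirroring the "! #defined(%errors)" branch of the script's condition.
    """
    failing_now = errors >= threshold
    failing_before = previous_errors is not None and previous_errors >= threshold
    return failing_now and not failing_before


# Simulate five consecutive alarm runs with these error counts.
history = [0, 2, 3, 0, 2]
pages: List[bool] = []
previous: Optional[int] = None
for current in history:
    pages.append(should_page(current, previous))
    previous = current
print(pages)
```

The outage that spans the second and third runs produces a single page, and a new page goes out only after the service has recovered and failed again.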
For more examples, see Developer's Notes.