False Positives

[posted 2000/06/24]

Several weeks ago, I wrote about the NISChkEmergency alarm we installed at our site.  This past week, on a couple of occasions, the alarm went off, and we were paged with the message, "NIS ON ... IS SICK/DOWN!"  Subsequent investigation showed that whatever the problem was, it was transitory.  We think these were false positives, caused by the alarm script running simultaneously with an NIS map update (so that ypserv was temporarily not serving, or was serving incomplete information).

How do we solve this apparent false-positives problem?  The answer is to repeat the tests and report only persistent errors.

Here is a revised NISChkEmergency.  Even if you don't run NIS at your site(s), and most likely you don't, you might still learn a new technique or two from the following.

NISChkEmergency

        init
                status active
                level emergency
                task "Report malfunctions in the NIS service"

        begin
                set #errors = 0         // initialize
                pause #random(60)       // up to a 1-minute pause so we don't
                                        // hit the NIS servers all at once
                                        // (although alert timing drift may
                                        // already have solved this problem)
                // add more cases as warranted
                // we now run through this sequence of tests a second time
                // if an error is reported, and only report an NIS
                // malfunction if more than one error is detected--this to
                // prevent some observed false positives (probably because
                // this was being run simultaneously with a ypmake or yppush
                // operation)
                set #loop = 0
                repeat
                        // this is a required test, because nextuid should
                        // be the last entry in our NIS passwd file
                        set $c = $command("=ypmatch nextuid passwd 2>&1")
                        if $c !~ "^nextuid:"
                                output mail $c
                                set #errors += 1
                        fi
                        set $c = $command("=ypmatch brahms passwd 2>&1")
                        if $c !~ "^brahms:"
                                output mail $c
                                set #errors += 1
                        fi
                        set $c = $command("=ypmatch 12357 passwd.byuid 2>&1")
                        if $c !~ "^brahms:"
                                output mail $c
                                set #errors += 1
                        fi
                        set $c = $command("=ypmatch johannes.brahms aliases 2>&1")
                        if $c !~ "^brahms@"
                                output mail $c
                                set #errors += 1
                        fi
                        set $c = $command("=ypmatch hamburg hosts 2>&1")
                        if $c !~ "^111.222"
                                output mail $c
                                set #errors += 1
                        fi
#  if nisserver
                        if $command("=ypcat passwd | =tail -10 | =wc -l") !~ "10"
#  else
                        if $command("=ypcat passwd | =head -20 | =wc -l") !~ "20"
#  endif
                                // get first line (err msg) only
                                set $c = $command("=ypcat passwd 2>&1")
                                output mail $c
                                set #errors += 1
                        fi
                        if #errors == 0
                                break
                        fi
                        pause 120       // pause to give time for any ongoing
                                        // ypmake or yppush operation to
                                        // finish (if indeed that is the
                                        // cause of any false positives)
                        set #loop += 1
                until #loop >= 2 || #errors > 1
#  if nismaster
                do #split($command("=wc -l /etc/NIS/passwd"))
                set #lcurr = #val($1)
                do #split($command("=wc -l /etc/NIS/passwd.nightly.backup"))
                set #lback = #val($1)
                if #lcurr < #lback - 10         // allow for limited truncation
                        output mail "NIS passwd file may be too small!
                                     Current line count is $text(#lcurr),
                                     was $text(#lback)
                                     in passwd.nightly.backup."
                        set #errors += 2        // increment by 2 to ensure
                                                // its reporting below
                fi
#  endif

        end
                if #errors >= 2
                        set $server = $command("=ypwhich 2>&1")
                        output mail "NIS ON $upper($server) IS SICK/DOWN!"
                fi
#  ifdef page
                // page just once per downage
                if    #errors >= 2
                   && (    ! #defined(%errors)
                        || %errors < 2
                      )
                        exec wait "echo 'NIS on $server is sick/down' | =mailx -s
                                   'NIS on $server is sick/down' =pagesysadmins"
                endif
#  endifdef

This is one of those rare instances where it makes sense to employ the little-used repeat-until control structure.

Here is what we now do:  We run through all the tests at least once.  If there are no errors, we move on.  If there are two or more errors after the first round of tests, we conclude that the problem is real and proceed with the rest of the report.  If, however, there is just one error, we pause for 120 seconds to allow any ongoing yp operations to finish, then rerun all the tests.  Because #errors is not reset between rounds, a failure that recurs in the second round brings the count to two and triggers the report, while a failure that does not recur leaves the count at one.  We no longer highlight cases of just a single error, because if an error is real, it must be repeatable in the tests.
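
To make the decision logic easier to check, here is a minimal Python model of the repeat-until loop.  It is only a sketch for reasoning about the control flow, not PIKT code and not part of the alarm.  The names nis_is_sick, run_round, pause, and scripted are hypothetical, and the nismaster file-size check (which adds 2 to the error count on its own) is left out.

```python
from typing import Callable, Iterable, List


def nis_is_sick(run_round: Callable[[], int],
                pause: Callable[[int], None],
                max_rounds: int = 2,
                settle_seconds: int = 120) -> bool:
    """Return True only if two or more errors accumulate across rounds."""
    errors = 0   # plays the role of #errors; never reset between rounds
    rounds = 0   # plays the role of #loop
    while True:
        errors += run_round()      # one pass through all the tests
        if errors == 0:
            break                  # clean pass: nothing to report
        pause(settle_seconds)      # let any ypmake/yppush finish
        rounds += 1
        if rounds >= max_rounds or errors > 1:
            break                  # the "until" condition
    return errors >= 2             # threshold tested in the end section


def scripted(results: Iterable[int]) -> Callable[[], int]:
    """Hypothetical helper: replay a fixed error count for each round."""
    it = iter(results)
    return lambda: next(it)


if __name__ == "__main__":
    cases = [("healthy", [0]),
             ("transient", [1, 0]),
             ("persistent", [1, 1]),
             ("several at once", [3])]
    for name, results in cases:
        pauses: List[int] = []     # records pause lengths instead of sleeping
        sick = nis_is_sick(scripted(results), pauses.append)
        print(f"{name}: sick={sick}, pauses={pauses}")
```

In this model, a lone transient failure yields no report, a failure that recurs on the second pass does, and several failures in one pass are reported without a second pass.  Note that, as in the script, the pause comes before the until test, so it also runs once when several tests fail together.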

Sometimes, little adjustments like this can really make a difference.

For more examples, see Developer's Notes.

 
Home | FAQ | News | Intro | Samples | Tutorial | Reference | Software
Developer's Notes | Licensing | Authors | Pikt-Users | Pikt-Workers | Related Projects | Site Index | Privacy Policy | Contact Us
Page best viewed at 1024x768 or greater.   Page last updated 2018-01-02.   This site is PIKT® powered.
Copyright © 1998-2018 Robert Osterlund. All rights reserved.