NIS Malfunctions
[posted 2000/05/03]
Our mail situation is quieting down, thankfully. We still see an "imapd storm" every couple of days, but the Pikt script described last week is keeping these storms manageable.
One new problem, though, is with our NIS service. Our NIS master, a Sparc 10, managed just fine when the passwd map was ~4,000 lines, but since we added 32,000+ accounts, that system has been huffing and puffing to build the maps. Sometimes the maps get corrupted, sometimes they don't get pushed to the slave servers properly, and sometimes something else goes wrong that we haven't pinned down yet.
One fix for this NIS malfunction is to move NIS master service to a new machine. That's in the cards, but not until interim week, a month and a half from now. For now, we just can't risk further, possibly worse, problems by messing with our basic NIS setup.
Here is one of those Pikt alarms I've been thinking about for quite a while but didn't get around to implementing until the latest crises forced me to:
NISChkEmergency

        init
                status active
                level emergency
                task "Report NIS service malfunctions"

        begin
                set $state = "+"        // initialize

                pause #random(60)       // up to a 1-minute pause so we don't
                                        // hit the NIS servers all at once
                                        // (although alert timing drift may
                                        // already have solved this problem)

                // add more cases as warranted

                // this is a required test, because nextuid should be the
                // last entry in our NIS passwd file
                set $c = $command("=ypmatch nextuid passwd 2>&1")
                if $c !~ "^nextuid:"
                        output mail $c
                        set $state = "-"
                fi

                set $c = $command("=ypmatch brahms passwd 2>&1")
                if $c !~ "^brahms:"
                        output mail $c
                        set $state = "-"
                fi

                set $c = $command("=ypmatch 508 passwd.byuid 2>&1")
                if $c !~ "^brahms:"
                        output mail $c
                        set $state = "-"
                fi

                set $c = $command("=ypmatch johannes.brahms aliases 2>&1")
                if $c !~ "^brahms@"
                        output mail $c
                        set $state = "-"
                fi

                set $c = $command("=ypmatch hamburg hosts 2>&1")
                if $c !~ "^111.222"
                        output mail $c
                        set $state = "-"
                fi

#if nisserver
                if $command("=ypcat passwd | =tail -10 | =wc -l") !~ "10"
#else
                if $command("=ypcat passwd | =head -20 | =wc -l") !~ "20"
#endif
                        // get first line (err msg) only
                        set $c = $command("=ypcat passwd 2>&1")
                        output mail $c
                        set $state = "-"
                fi

# if nismaster
                do #split($command("=wc -l /etc/NIS/passwd"))
                set #lcurr = #val($1)
                do #split($command("=wc -l /etc/NIS/passwd.nightly.backup"))
                set #lback = #val($1)
                if #lcurr < #lback - 10         // allow for limited truncation
                        output mail "NIS passwd file may be too small!
                                     Current line count is $text(#lcurr),
                                     was $text(#lback)
                                     in passwd.nightly.backup."
                        set $state = "-"
                fi
# endif

        end
                if $state eq "-"
                        set $server = $command("=ypwhich 2>&1")
                        output mail "NIS ON $upper($server) IS SICK/DOWN!"
                fi

# ifdef page
                // page just once per downage
                if $state eq "-"
                   && ( ! #defined(%state)
                        || $state ne %state
                      )
                        // exec wait "echo 'NIS on $server is sick/down' |
                        //            =mailx -s 'NIS on $server is sick/down'
                        //            pagebrahms\@egbdf"
                        exec wait "echo 'NIS on $server is sick/down' | =mailx -s
                                   'NIS on $server is sick/down' =pagesysadmins"
                endif
# endifdef
This should be self-explanatory, except possibly for =pagesysadmins, which is a macro defined in macros.cfg as
pagesysadmins pagedonizetti\@egbdf pagebrahms\@egbdf pageliszt\@egbdf
and "pagebrahms\@egbdf", in turn, is an email alias resolving to my pager number.
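The "page just once per downage" test is the one part of the script that may not be obvious at a glance. It compares this run's $state with %state, which appears to hold the value left over from the previous run, and pages only when NIS has just gone from healthy to sick (or when there is no previous value at all). For readers who don't use Pikt, here is a minimal Python sketch of that logic. The function name should_page and the simulated run history are illustrative assumptions, not part of the alarm itself.

```python
def should_page(current_state, previous_state=None):
    """Return True only on the run where NIS first shows up as sick.

    States are "+" (healthy) or "-" (sick), as in the Pikt script.
    previous_state is None when there is no earlier run to compare
    against, mirroring the "! #defined(%state)" test.
    """
    if current_state != "-":
        return False
    return previous_state is None or previous_state != current_state


# Hypothetical history of five consecutive runs:
# healthy, two failures in a row, recovery, then a new failure.
history = ["+", "-", "-", "+", "-"]

previous = None
pages = []
for state in history:
    pages.append(should_page(state, previous))
    previous = state

print(pages)  # [False, True, False, False, True]
```

The second failure in a row does not page again, so a long outage produces one page, while the email report still goes out on every failing run.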
I've thought of ways we could, if the bound-to NIS server is sick/down, auto-force a rebinding to an alternative server. (We can't do that easily for reasons I won't go into here.) As with most Pikt scripts, we'll revise and tweak this one over time.
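Since the script invites us to "add more cases as warranted," one way to keep future tweaks cheap is to hold the lookup tests in a table instead of repeating the same five lines per test. The following is a rough Python sketch of that idea, not a replacement for the Pikt alarm. The keys, maps, and patterns are copied from the script; the names CHECKS, ypmatch, and failed_checks are made up for illustration, and taking only the first output line is an assumption based on the script's own "get first line" comment.

```python
import re
import subprocess

# (key, map, pattern the first output line must match), from the script.
# The dot in the hosts pattern is escaped here so it matches a literal ".".
CHECKS = [
    ("nextuid", "passwd", r"^nextuid:"),
    ("brahms", "passwd", r"^brahms:"),
    ("508", "passwd.byuid", r"^brahms:"),
    ("johannes.brahms", "aliases", r"^brahms@"),
    ("hamburg", "hosts", r"^111\.222"),
]


def ypmatch(key, mapname):
    """Run `ypmatch key mapname`; return the first line of stdout+stderr."""
    proc = subprocess.run(
        ["ypmatch", key, mapname],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as the script's 2>&1
        text=True,
    )
    lines = proc.stdout.splitlines()
    return lines[0] if lines else ""


def failed_checks(lookup=ypmatch, checks=CHECKS):
    """Return (key, mapname, output) for each lookup that fails its pattern."""
    failures = []
    for key, mapname, pattern in checks:
        output = lookup(key, mapname)
        if not re.search(pattern, output):
            failures.append((key, mapname, output))
    return failures
```

A caller would treat any non-empty result from failed_checks() as the equivalent of $state = "-" and mail the collected output lines. Passing the lookup function in as a parameter means the table can be exercised with canned answers on a machine that has no NIS at all.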
Got anything interesting/new you would like to share?
For more examples, see Developer's Notes.