Checking File Status
[posted 2001/01/29]
Early last week, we faced a near disaster when our 40,000-line NIS passwd file got trashed. Worse, this happened sometime overnight, and our automated file backup procedures (including one handled by PIKT) overwrote the backup files with the trashed version. So we were down to just a single good backup copy, the passwd.bak that our acctmgr program makes for us.
Potentially worse still, we subsequently discovered that the NIS directory on our NIS master machine was not being backed up to tape. (More on that later.)
Losing that one last backup file (passwd.bak) would have been The Apocalypse, no kidding!
Trashed NIS passwd files have happened more than once in recent months. FileStatChk, the check file status alarm script, reported the problem, and we usually had several backups to do restores with. But not this time! We were down to just one final backup of any type! (Actually, we did have other backups, but they were two months old or older--still a disaster if we had to resort to them.)
This inspired me to write a new Pikt script, actually an extension to FileStatChk, the portion enclosed within the '#if nismaster ... #endif' below.
First the script, then commentary.
FileStatChkUrgent

        init
                status active
                level urgent
                task "Detect unusual system file size changes"
                input proc "=cat =sysfiles_obj | =awk '{print $1}' | =xargs =lld"
                =lldata
                keys $name

        begin
                set #pctdiff = 10%      // the percentage difference beyond
                                        // which we signal a potential problem
                                        // (and, for passwd, under which we
                                        // restore a backup passwd)
                set #noeditwait = 60    // for each loop, the number of
                                        // seconds we pause while waiting
                                        // for the =nisdir/noedit file to
                                        // disappear

        rule    // reset for every file
                set #deviated = #false()

        rule
                set #filsize = #size    // necessary, because we potentially
                                        // reset the fixed file's #filsize
                                        // below, and PIKT prevents us from
                                        // resetting the dat value #size
                                        // (specified in =lldata)

        rule
                if    =deviated(filsize, #pctdiff)
                   && ( $name !~ "/etc/mnttab" )
                        output mail "the size of $name has changed drastically, was $text(%filsize) bytes, is now $text(#filsize) bytes"
                        set #deviated = #true()
                endif

#if nismaster

        // let's redo all this in a more general fashion with macros someday

        rule    // fix NIS passwd database
                if #deviated && $name eq "=nispasswd"
                        // order here is important; give priority to:
                        //
                        //   passwd.bak
                        //   passwd.piktbak (if any)
                        //   passwd.nightly.backup
                        //   passwd.daily.backup
                        //   "precious" passwd    // created nightly
                        //   passwd.weekly.backup
                        //   passwd.monthly.backup
                        //
                        // a more sophisticated approach would be to poll
                        // the #fileage() of each backup file, then select
                        // the most recent of acceptable size
                        set #b = #split($bkp, "=nispasswd" . ".bak " .
                                              "=nispasswd" . ".piktbak " .
                                              "=nispasswd" . ".nightly.backup " .
                                              "=nisbakdir/passwd.daily.bak " .
                                              "=preciousdir/passwd.gsbnis " .
                                              "=nisbakdir/passwd.weekly.bak " .
                                              "=nisbakdir/passwd.monthly.bak", " ")
                        for #i=1 #i<=#b #i+=1
                                if -e $bkp[#i]
                                        =set_fa($bkp[#i])
                                        if #fa <= 1     // or maybe 0?
                                                if    #defined(%filsize)
                                                   && %filsize != 0
                                                   && #abs((#filesize($bkp[#i]) - %filsize)/%filsize) < #pctdiff
#ifdef doexec
                                                        while -e "=nisdir/noedit"
                                                                pause #noeditwait
                                                        endwhile
                                                        // it's possible that
                                                        // a just concluded
                                                        // edit operation fixed
                                                        // the problem so check
                                                        // again before restoring
                                                        // from backup
                                                        set #filsize = #filesize($name)
                                                        if =deviated(filsize, #pctdiff)
                                                                =execwait "echo 'PIKT $alarm()' >> =nisdir/noedit"
                                                                =execwait "=cp -p $name $name" . ".piktbad"
                                                                =execwait "=cp -p $bkp[#i] $name"
                                                                =execwait "=acctmgrbindir/nismake.pl"
                                                                set #filsize = #filesize($name)
                                                                output mail "fixed and remade =nispasswd, is now $text(#filsize) bytes"
                                                                =execwait "=rm =nisdir/noedit"
                                                        endif
                                                        set #deviated = #false()
#endifdef
                                                        break
                                                endif
                                        endif
                                endif
                        endfor
                endif

        rule    // if NIS passwd database still corrupted, report and
                // possibly page
                if #deviated && $name eq "=nispasswd"
                        output mail $upper("NIS passwd database on =nismaster is bad")
# ifdef page
                        =execwait "echo 'NIS passwd database on =nismaster is bad' | =mailx -s 'NIS passwd database on =nismaster is bad' =pagesysadmins"
# endifdef
                endif

#endif  // nismaster
The 'input proc' statement above produces ordinary 'ls -l' output for the essential system files listed in the SysFiles.obj file. =lldata specifies the usual 'll' ('ls -l') data fields.
The basic idea here is that if a file's size has changed by 10% or more (up or down) since the last time the script was run, we report the "problem" (it might not be a real problem).
In the case of the NIS passwd file, since the consequences of its corruption are so dire, we go beyond reporting to actually fixing. (We might extend this to other essential system files, but covering passwd is good enough for now. When we do so, we would rewrite much of the above code in macro form for ease of reuse.)
Please read the // comments in the script for useful explanation. I'll provide some additional commentary here.
The =deviated() macro I introduced in the 1.12.0 sample configs. It goes like this:
deviated(V, P)  // a logical condition that is true if a value has increased
                // or decreased by a certain percentage or more
                // (V) is the variable (without type qualifier, e.g., size)
                // (P) is the percentage (e.g., 20%)
        (    #defined(%(V))
          && (    (    %(V) != 0
                    && (#(V) - %(V))/%(V) >=  (P) )
               || (    %(V) != 0
                    && (#(V) - %(V))/%(V) <= -(P) )
             )
        )
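To make the arithmetic concrete, here is a rough Python sketch of the same test. It is not PIKT code: the function name, the use of None for "no history value yet," and the representation of 10% as the fraction 0.10 are all assumptions made for illustration.

```python
def deviated(current, previous, pct):
    """Return True if `current` has risen or fallen by `pct` (a fraction) or more.

    Mirrors the guards in the macro: with no previous value (None), or a
    previous value of zero, nothing is reported.
    """
    if previous is None or previous == 0:
        return False
    change = (current - previous) / previous  # signed relative change
    return change >= pct or change <= -pct
```

For example, a file that shrinks from 2,000,000 bytes to 1,000 bytes has a relative change of about -0.9995, so deviated(1000, 2000000, 0.10) is true, while a change from 2,000,000 to 2,050,000 bytes (+2.5%) is not flagged.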
On the NIS master (the '#if nismaster' section), if the current file is the NIS passwd file and its size was found to have deviated by the 10% threshold or more, we take action.
We have quite a few on-disk backups now for NIS passwd:
passwd.bak
        created by each acctmgr action (e.g., changing a password, adding or deleting an account, etc.)

passwd.piktbak
        not yet implemented; for future use

passwd.nightly.backup
        created by the crontab entry:

        30 4 * * * cp -p /etc/NIS/passwd /etc/NIS/passwd.nightly.backup

"precious" passwd
        created by the Pikt SysFilesBackup script (included in the 1.12.x configs_samples/alarms.cfg and discussed in an earlier posting to this list)

passwd.daily.backup
passwd.weekly.backup
passwd.monthly.backup
        a new backup set created in response to this most recent crisis; based on the following new Perl script:
#ifndef generic
# if nismaster

nisbak.pl       // back up essential NIS files on a periodic basis

        path "=nisbakdir/nisbak.pl" mode 750 uid 0 gid 1

        #!=perl

        # nisbak.pl -- back up essential NIS files on a periodic basis

        use Getopt::Std ;

        getopts('hdwm') ;
        $HELP    = $opt_h ;
        $DAILY   = $opt_d ;
        $WEEKLY  = $opt_w ;
        $MONTHLY = $opt_m ;

        die "Usage: nisbak.pl [-h] [-d|-w|-m]\n"
            if $HELP || $#ARGV >= 0 || (! $DAILY && ! $WEEKLY && ! $MONTHLY);

        open(LS, "=ls -1F =nisdir |") ;
        while (<LS>) {
            chomp ;
            next if (((/\.+/) && ! (/^auto\.home$/)) || (/[@~0-9\-\/]/) || (/^.$/) || (/^\s*$/)) ;
            $file = $_ ;
            if    ($DAILY)   { $period = "daily" ;   }
            elsif ($WEEKLY)  { $period = "weekly" ;  }
            elsif ($MONTHLY) { $period = "monthly" ; }
            system("=cp -p =nisdir/$file =nisbakdir/$file.$period.bak") ;
        }
        close(LS) ;

# endif  // nismaster
#endifdef  // generic
This script is run from cron via:
30 22 * * * [ -x /usr/local/etc/NIS/nisbak.pl ] && /usr/local/etc/NIS/nisbak.pl -d
30 22 * * 1 [ -x /usr/local/etc/NIS/nisbak.pl ] && /usr/local/etc/NIS/nisbak.pl -w
30 22 1 * * [ -x /usr/local/etc/NIS/nisbak.pl ] && /usr/local/etc/NIS/nisbak.pl -m
In the 'set #b = #split()' statement, I order these passwd backup files in the order given in the comment. This is the preferred restore order--from passwd.bak down to finally the most recent monthly backup.
The '=set_fa($bkp[#i])' statement is based on a revised version of the =set_fa macro:
set_fa(F)       // set age of file in days; this presupposes that (F) exists,
                // unless (F) is not supplied (invoked as: =set_fa()) so
                // $mon, $date & $time are already known for the current file
        if "(F)" eq ""  // as with: =set_fa()
                set #fa = #fileage($mon,$date,$time)
        else            // as with: =set_fa("=passwd")
                if ! #defined($mon)  set $mon  = $nil()  fi
                if ! #defined($date) set $date = $nil()  fi
                if ! #defined($time) set $time = $nil()  fi
                do #split($command("=lld (F)"))
                set #fa = #fileage($[6],$[7],$[8])
        endif
In this instance, we want to set #fa, the fileage (in days), of the current backup file, $bkp[#i]. This usage conflicts with the earlier version of =set_fa that took no arguments:
set_fa          // set age of file in days
        set #fa = #fileage($mon,$date,$time)
The earlier form just assumed that $mon, $date, and $time were set because of an earlier =lldata specification, say. That need not necessarily be so. And even though it is true here, we want to set the #fa for the $bkp[#i] file, not for the file $name--the file for the current input line.
In accordance with the revised =set_fa() macro, I had to go into alarms.cfg and change every existing
=set_fa
to
=set_fa()
Note how the revised =set_fa() macro demonstrates how you can write macros to work both with arguments and without.
The
if ! #defined($mon) set $mon = $nil() fi ...
is needed in case you use =set_fa() in a script where $mon, etc. are not specified. Without it, PIKT would complain about the unknown $mon, etc. in the line
set #fa = #fileage($mon,$date,$time)
(Don't worry too much if you don't follow this last bit of explanation. Just use the macro, or steal these techniques in your own configuration work.)
The restore algorithm says to cycle through each of the restore candidates, one by one, until we either do a successful restore or run out of candidates.
If the current restore candidate exists, if its fileage is no more than one day (#fa <= 1), and if it differs by less than 10% from the last recorded "good" filesize (recorded in the alert .hst file), we attempt to restore it. Note how the '#abs((#filesize($bkp[#i]) - %filsize)/%filsize) < #pctdiff' condition prevents us from restoring a trashed backup. We keep going down the list of restore candidates until we find one that is both recently made and of the "right" size.
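The selection logic just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the PIKT implementation: the function name pick_backup and its parameters are hypothetical, and the age is computed as fractional days from the file's modification time, whereas the script relies on PIKT's #fileage().

```python
import os
import time


def pick_backup(candidates, good_size, pct=0.10, max_age_days=1, now=None):
    """Return the first candidate that exists, is recent, and is near good_size.

    `candidates` is assumed to be listed in preferred restore order.
    Returns None if no candidate qualifies.
    """
    if not good_size:  # no recorded "good" size (None or 0): nothing to compare
        return None
    now = time.time() if now is None else now
    for path in candidates:
        if not os.path.exists(path):
            continue  # e.g. passwd.piktbak, which may not exist
        st = os.stat(path)
        age_days = (now - st.st_mtime) / 86400
        if age_days > max_age_days:
            continue  # too old to be an acceptable restore source
        if abs(st.st_size - good_size) / good_size < pct:
            return path  # right size: not a trashed backup
    return None
```

The size test is what keeps a trashed backup from being restored: a backup that was overwritten with the bad file fails the comparison against the recorded good size and is skipped.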
What's with the 'while -e "=nisdir/noedit"' loop? Some other process might be rewriting the passwd file at that moment (maybe even correcting it). Or one of the sysadmins might be fixing the situation and, having followed recommended procedure, has put the =nisdir/noedit file in place. In other words, the 'noedit' file serves as a (weak) form of lockfile.
So, we wait until the 'noedit' file disappears, either because the sysadmin rm's it, or--yes--because we have another Pikt script specially set up for this purpose. (See the AcctmgrChkUrgent script in the 1.12.1 configs_samples.)
Under ordinary circumstances, there would be no noedit file, or there would be one present only briefly because of an ongoing acctmgr operation. acctmgr removes the noedit file when it's done with the current operation.
Coming out of the while loop, we have to check the filesize of NIS passwd again to see if something or someone else--the sysadmin?--has corrected it.
Only if the NIS passwd file is still found to deviate from the history value (and presumed "correct") filesize is a sequence of =execwait statements launched to do the actual restore. After the restore, we force a remake of the NIS passwd map using the separate nismake.pl program, report the restoration, and in any case set the #deviated flag to #false(), which indicates problem solved.
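As a rough Python sketch of this wait, re-check, and restore sequence (again an illustration, not the PIKT code): the function name, its parameters, and the remake callback standing in for nismake.pl are all hypothetical. Unlike the script, the sketch removes the lock file in a finally clause, which is a design choice of the sketch rather than something the original does.

```python
import os
import shutil
import time


def restore_if_still_bad(target, backup, good_size, lockfile,
                         pct=0.10, wait_seconds=60, remake=None):
    """Wait for the lock file to clear, re-check the target, restore if needed.

    Returns True if a restore was performed, False if the target had already
    been fixed. `good_size` is assumed to be nonzero.
    """
    while os.path.exists(lockfile):  # someone else is editing; wait
        time.sleep(wait_seconds)

    # An edit that just finished may have fixed the problem, so check again.
    size = os.path.getsize(target)
    if abs(size - good_size) / good_size < pct:
        return False

    # Weak lock: another process could still slip in between the check above
    # and the creation of the lock file here.
    with open(lockfile, "a") as fh:
        fh.write("restore in progress\n")
    try:
        shutil.copy2(target, target + ".piktbad")  # keep the bad copy for study
        shutil.copy2(backup, target)               # copy2 keeps timestamps, like cp -p
        if remake is not None:
            remake()                               # stand-in for rebuilding the NIS maps
    finally:
        os.remove(lockfile)
    return True
```

Saving the bad file before overwriting it costs little and leaves evidence for working out afterward how the file got trashed.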
In the final rule, if for some reason the file restoration didn't take place, we report this as e-mail and give it emphasis in the alert message using the $upper() function. If paging is active, we page the sysadmins, too. (We were paged last Monday, but it was ~4:00 AM. Ugh!)
We could embellish this script, adding all sorts of bells and whistles, if we really want to. For example, we could poll all the restore candidates first, then figure out which is "best" because closest in size to the presumed "good" size (the value recorded in the .hst file) and most recent. Having identified the "best" restore candidate, we could also overwrite the trashed backups, restoring everything to the best of health.
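One way the "poll everything first" idea might look, sketched in Python rather than PIKT. The ranking rule here follows the comment in the script (most recent backup of acceptable size, with closeness in size used only as a tie-breaker); that ordering, along with the function name best_backup, is an assumption, and other rankings are equally defensible.

```python
import os


def best_backup(candidates, good_size, pct=0.10):
    """Return the most recent candidate whose size is within pct of good_size.

    Ties on modification time are broken by closeness to good_size.
    Returns None if no candidate is acceptable.
    """
    if not good_size:  # no recorded "good" size (None or 0)
        return None
    acceptable = []
    for path in candidates:
        if not os.path.exists(path):
            continue
        st = os.stat(path)
        diff = abs(st.st_size - good_size) / good_size
        if diff < pct:
            # Sort key: newest first, then smallest size difference.
            acceptable.append((st.st_mtime, -diff, path))
    if not acceptable:
        return None
    return max(acceptable)[2]
```

With the best candidate in hand, the trashed backups (those that failed the size test) could then be overwritten from it, as suggested above.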
There are still other complications I could mention here. I expect to revise and polish this FileStatChk script some more over time.
But this is enough for now. And, now, we can sleep a little more soundly than before.
For more examples, see Developer's Notes.