File Backups
[posted 2001/01/29]
A while ago, I mentioned that during last week's NIS passwd near-crisis, we discovered that crucial areas of our NIS master server were not being backed up to tape. In fact, since we put this new machine into operation in mid-December, it hadn't been backed up at all!
(Backup operator error. It's a long story. No, I'm not the backup operator!)
This episode galvanized me to write a script I've been wanting to write for quite some time:
DumpDatesChkWarning

    init
        status active
        level warning
        task "Report backup problems as revealed by /etc/dumpdates"
        input proc "=dfk | =grep '^/'"
        =dfdata

    begin
        set #maxgnrl = 14       // beyond which we signal general
                                // backup failure
#ifndef debug
        set #maxfull = 14       // beyond which we signal full
                                // backup failure
#elsedef
        set #maxfull = -1
#endifdef
#ifndef debug
        set #maxincr = 3        // beyond which we signal incr
                                // backup failure
#elsedef
        set #maxincr = -1
#endifdef
        set #slackfull = 7      // added days beyond which we output
                                // in ALL CAPS
        set #slackincr = 7      // added days beyond which we output
                                // in ALL CAPS

        if ! -e "=dumpdates"
            output mail $upper("=dumpdates not found!")
            quit
        fi
        =set_fa(=dumpdates)
        if #fa > #maxgnrl
            output mail $upper("=dumpdates is more than $text(#maxgnrl) days out of date:")
            output mail $command("=ll =dumpdates")
            output mail "GENERAL BACKUP FAILURE!"
        fi
        if #fopen(DUMPDATES, "=dumpdates", "r") == #err()
            output mail $upper("\#fopen() error for =dumpdates!")
            quit
        fi
        while #read(DUMPDATES) > 0
            if #parse($rdlin, "^([[:graph:]]+)[[:space:]]+([[:digit:]]) (.+)$") != 3
                output mail "malformed =dumpdates line: $rdlin"
                continue
            else
                set $dsk =
#if solaris
                    $substitute($1, "/rdsk/", "/dsk/")
#elsif sunos
                    $substitute($1, "/rsd", "/sd")
#else
                    $1
#endif
                set $lvl = $2
                set $dat = $3
                =set_lineage($rdlin)
                set #daysold = #max(#int(#lineage/=secs_in_day), 0)
                if $lvl eq "0"
                    if #defined(#full[$dsk])
                        set #full[$dsk] = #min(#full[$dsk], #daysold)
                    else
                        set #full[$dsk] = #daysold
                    fi
                else
                    if #defined(#incr[$dsk])
                        set #incr[$dsk] = #min(#incr[$dsk], #daysold)
                    else
                        set #incr[$dsk] = #daysold
                    fi
                fi
            fi
        endwhile
        do #fclose(DUMPDATES)
#ifdef debug
        foreach #keys($f, #full)
            output mail "$text(#full[$f]) $f"
        endforeach
        foreach #keys($f, #incr)
            output mail "$text(#incr[$f]) $f"
        endforeach
        output =newline
#endifdef

    rule    // skip special file systems
        if    (    $mount =~~ "^(/proc|/swap.?)$"
                || $mount =~~ "^(/cdrom|/home/)"
                || $mount =~~ "^(/var/mail)$"   // we don't backup mail
              )
           && $mount !~~ "^/home/egbdf"
            next
        fi

//#if ! ???
    rule    // skip /tmp (except possibly for some machines specified
            // in the preceding #if statement)
        if $mount =~ "^/tmp$"
            next
        fi
//#endif

    rule
        set $mnt[$fsname] = $mount

    end
#ifdef debug
        foreach #keys ($f, $mnt)
            output "$f $mnt[$f]"
        endforeach
        output =newline
#endifdef
        foreach #keys($f, $mnt)
            if ! #defined(#full[$f])
                output mail $upper("no record of any full backup for $mnt[$f] ($f)")
            else
                if #full[$f] > #maxfull + #slackfull
                    output mail $upper("last recorded full backup is $text(#full[$f]) days old for $mnt[$f] ($f)")
                elseif #full[$f] > #maxfull
                    output mail "last recorded full backup is $text(#full[$f]) days old for $mnt[$f] ($f)"
                fi
            fi
            if ! #defined(#incr[$f])
                output mail $upper("no record of any incr backup for $mnt[$f] ($f)")
            else
                if #incr[$f] > #maxincr + #slackincr
                    output mail $upper("last recorded incr backup is $text(#incr[$f]) days old for $mnt[$f] ($f)")
                elseif #incr[$f] > #maxincr
                    output mail "last recorded incr backup is $text(#incr[$f]) days old for $mnt[$f] ($f)"
                fi
            fi
        endforeach
        foreach #keys($f, #full)
            if ! #defined($mnt[$f])
                output mail "orphaned filesystem in =dumpdates: $f"
            fi
        endforeach
        foreach #keys($f, #incr)
            if ! #defined($mnt[$f])
                output mail "orphaned filesystem in =dumpdates: $f"
            fi
        endforeach
The 'input proc' statement yields script input like:
/dev/dsk/c0t0d0s0     246463   73514  148303    34%    /
/dev/dsk/c0t0d0s6    1269558  816915  401861    68%    /usr
/proc                      0       0       0     0%    /proc
/dev/dsk/c0t0d0s3     492422   98217  344963    23%    /var
/dev/dsk/c0t0d0s4     492422  370390   72790    84%    /opt
/dev/dsk/c0t0d0s5    4672134 4517361  154773    97%    /export/home
/dev/dsk/c1t3d0s0    4156462 2883696 1064943    74%    /pub/perf_disk_1
/dev/dsk/c1t4d0s0    4156462 3462278  486361    88%    /pub/perf_disk_2
In the 'begin' section, we set some script parameters.
Next, we report if the dumpdates file is missing, is way out of date, or can't be opened for reading.
We then start reading in the dumpdates file, line by line. For Solaris and SunOS (at least), we have to tweak the device files, as dumpdates lists character (raw) device files, while the df display lists block device files.
Here are some sample dumpdates entries:
/dev/rdsk/c0t0d0s4 4 Tue Jan 23 21:05:29 2001
/dev/rdsk/c1t2d0s0 4 Tue Jan 23 21:05:47 2001
/dev/rdsk/c1t3d0s0 4 Tue Jan 23 21:47:05 2001
/dev/rdsk/c1t9d0s0 0 Wed Jan 17 22:04:39 2001
/dev/rdsk/c1t8d0s0 0 Wed Jan 17 22:04:41 2001
/dev/rdsk/c1t3d0s0 0 Wed Jan 17 22:07:33 2001
/dev/rdsk/c1t4d0s0 0 Wed Jan 17 22:28:50 2001
/dev/rdsk/c1t4d0s0 6 Sat Jan 27 21:22:27 2001
/dev/rdsk/c1t5d0s0 6 Sat Jan 27 21:57:41 2001
/dev/rdsk/c1t8d0s0 6 Sat Jan 27 22:20:54 2001
/dev/rdsk/c1t9d0s0 6 Sat Jan 27 23:20:28 2001
Note how it's '/dev/dsk' in the df display but '/dev/rdsk' in the dumpdates file. (Different conventions are followed for different OSes. Add your own $substitute() cases as necessary.)
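For readers who don't speak PIKT script, the device-name tweak can be sketched in Python. This is only an illustration: the function name raw_to_block and the os_name strings are made up here, and it assumes that replacing the first occurrence of the substring is all that is needed.

```python
def raw_to_block(device: str, os_name: str) -> str:
    """Map a raw (character) device path, as found in dumpdates,
    to the block device path that df reports.

    os_name is a hypothetical label ("solaris", "sunos", or anything else).
    """
    if os_name == "solaris":
        # /dev/rdsk/c0t0d0s4 -> /dev/dsk/c0t0d0s4
        return device.replace("/rdsk/", "/dsk/", 1)
    if os_name == "sunos":
        # /dev/rsd0a -> /dev/sd0a
        return device.replace("/rsd", "/sd", 1)
    # Other OSes: leave the name alone until a case is added for them.
    return device
```

Adding support for another OS then amounts to adding one more branch, just like adding another $substitute() case in the script.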
The =set_lineage() macro sets the age of the current input line, #lineage, in seconds from the present time. For this, I had to add another case to the configs_samples =set_lineage() macro:
    // for dumpdates output and others; look for date/time
    // stamp like "Feb 25 16:19:30 2000" anywhere within a line
    elseif #parse("(L)","(=months)[[:space:]]+([[:digit:]]+)[[:space:]]+
                  ([[:digit:]]+):([[:digit:]]+):([[:digit:]]+)
                  [[:space:]]([[:digit:]]+)") == 6
        set #lineage = =nowdst - (#datevalue(#val($6), #monthnumber($1), #val($2))
                                  + #timevalue(#val($3), #val($4), #val($5)))
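The same idea (find a "Mon DD HH:MM:SS YYYY" stamp anywhere in a line and turn it into an age in seconds) looks roughly like this in Python. It is a sketch, not a translation of the macro: line_age_seconds and the other names are hypothetical, the caller passes in the current time, and daylight-saving adjustments of the kind =nowdst suggests are ignored.

```python
import re
from datetime import datetime

MONTH_NAMES = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()
MONTH_NUMBER = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Month name, day, hh:mm:ss, year, e.g. "Jan 23 21:05:29 2001".
STAMP = re.compile(
    r"(" + "|".join(MONTH_NAMES) + r")\s+(\d+)\s+(\d+):(\d+):(\d+)\s(\d+)"
)

def line_age_seconds(line, now):
    """Return the age of the line's timestamp in seconds relative to
    `now` (a naive datetime), or None if no timestamp is found."""
    match = STAMP.search(line)
    if match is None:
        return None
    month, day, hour, minute, second, year = match.groups()
    stamp = datetime(int(year), MONTH_NUMBER[month], int(day),
                     int(hour), int(minute), int(second))
    return int((now - stamp).total_seconds())
```

Each additional date format would need its own pattern, which is the same maintenance burden the macro approach leaves in the user's hands.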
I have thought of writing a standard PIKT #lineage() function, but the current example is the best argument against doing that. Who knows what screwy date and time formats we might encounter in the future? My current =set_lineage() macro specification considers five different formats. How many more will there be? By doing =set_lineage() as a PIKT macro, we make it easier for the ordinary user to tweak than if he or she had to venture into the PIKT source code (shudder the thought!) to make the desired extensions.
After recasting the #lineage in terms of days, we enter an 'if ... fi' construct that fills two associative arrays with information from the dumpdates file--one showing the "age" (in days) of the most recent full (dump level 0) backup, and the other showing the "age" of the most recent incr (incremental, dump level > 0) backup.
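As a rough Python analogue of that bookkeeping, assume the dumpdates lines have already been reduced to (device, level, age in seconds) tuples. The function name newest_backups and that tuple layout are assumptions made for this sketch only.

```python
SECS_IN_DAY = 86400

def newest_backups(entries):
    """Build two dicts keyed by device: age in days of the most recent
    full (level 0) backup and of the most recent incremental backup.

    entries: iterable of (device, level, age_seconds) tuples.
    """
    full, incr = {}, {}
    for device, level, age_seconds in entries:
        # Whole days, never negative (guards against timestamps in the future).
        days_old = max(age_seconds // SECS_IN_DAY, 0)
        table = full if level == 0 else incr
        # Keep the smallest age seen, i.e. the most recent backup.
        table[device] = min(table.get(device, days_old), days_old)
    return full, incr
```

A device that appears only at level 0 ends up in the first dict but not the second, which is exactly what lets the later checks report "no record of any incr backup".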
With reading and storing the dumpdates data out of the way, we enter the rules sections to begin considering the df output.
We skip several special file systems that we don't back up because it makes no sense to do so or because it's against policy. (We don't back up user mail files. Don't ask!)
In the next rule, we store the current file system in the $mnt[] array for later recall.
(We throw in a couple '#ifdef debug ... #endifdef' sections for testing purposes.)
The 'end' section is where we do some correlations and report any problems.
For each of the mounted file systems reported in the df display, we inspect its most recent full and incremental backup ages. If the age of the last recorded full or incremental backup exceeds the #maxfull or #maxincr threshold, we report that in the alert e-mail. If the age exceeds #maxfull or #maxincr by an additional slack factor--that is, the file backups are getting to be uncomfortably out of date--we SCREAM e-mail using the $upper() function.
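The threshold-plus-slack decision can be summarized in a few lines of Python. This is a sketch under assumptions: backup_message is a hypothetical helper, and days_old is None when no backup of that kind was ever recorded.

```python
def backup_message(kind, days_old, max_days, slack_days, mount, device):
    """Return the alert line for one file system, or None if it is fine.

    kind is "full" or "incr"; days_old is None when dumpdates has no
    record of that kind of backup for the device.
    """
    if days_old is None:
        return f"no record of any {kind} backup for {mount} ({device})".upper()
    text = (f"last recorded {kind} backup is {days_old} days old "
            f"for {mount} ({device})")
    if days_old > max_days + slack_days:
        return text.upper()   # well past the threshold: shout
    if days_old > max_days:
        return text           # past the threshold: ordinary warning
    return None               # recent enough: say nothing
```

With the script's settings (3 days plus 7 days of slack for incrementals), a 5-day-old incremental produces a lowercase line and an 11-day-old one produces the ALL CAPS version.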
Finally, in the last two foreach loops, we report possibly orphaned file systems--those recorded in dumpdates but appearing nowhere in the df display. These typically represent disks long ago retired. It behooves us to clean these out of the dumpdates files, as they no longer serve any useful purpose.
We added this to the Warning alerts set that runs once overnight. Here is a typical alert message:
PIKT ALERT
Sat Jan 27 02:32:04 2001
moscow

WARNING:
    DumpDatesChkWarning
        Report backup problems as revealed by /etc/dumpdates

        LAST RECORDED INCR BACKUP IS 255 DAYS OLD FOR /PUB/ALUM_DISK_1 (/DEV/DSK/C0T1D0S6)
        NO RECORD OF ANY FULL BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
        NO RECORD OF ANY INCR BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
        orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30
        orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30
Alas, we found more than a few backup gaps like those reported on the moscow system. These gaps are crucial for us to know about.
In developing this, I first tested it using the special Test alert:
#ifdef test

//# if piktmaster
//# if milan

Test

    timing      =piktnever
//  timing      7,37 * * * *
//  drift       5
    mailcmd     "=mailx -s 'PIKT Alert on =pikthostname: Test' pikt-test"
    alarms
        //ProcCountsChk
        //CronLogChkUrgent
        //AliasesChkWarning
        //RemoveOrphanedPrintFilesNotice
        //FileStatChkUrgent
        DumpDatesChkWarning

//# endif

#endifdef  // test
Before, I wasn't wrapping this within an '#ifdef test ... #endifdef'. If I left it uncommented, I was always either installing the Test alert in my 'piktc -iv +A all' commands, or I was seeing messages complaining about missing Test.alt in the log files. To avoid this, I was commenting and uncommenting the Test section in alerts.cfg. Now, by means of the '#ifdef test ... #endifdef', I can more conveniently activate this at the command line by, e.g.,
# piktc -iv +D test +A Test +H ...
to install, or
# piktc -tv +D test +A Test +H ...
to delete it (and all attendant .log and .hst files, if any) when I am through with the testing.
Since the 'test' #define is set by default to FALSE in defines.cfg, piktc simply disregards the Test alert in most ordinary commands (those where I don't explicitly specify '+D test').
To test this new DumpDatesChk script, I installed everywhere:
# piktc -iv +D test +A Test -H downsys
Then, I ran this test script on all systems with
# piktc -x +C "hostname; echo; =pikt +A Test; echo; echo" -H downsys 2>&1 | tee DumpDatesChk.out
Here is some sample output:
nismaster

/ETC/DUMPDATES IS MORE THAN 14 DAYS OUT OF DATE:
-rw-rw-r-- 1 root sys 0 Dec 18 12:15 /etc/dumpdates
GENERAL BACKUP FAILURE!
NO RECORD OF ANY FULL BACKUP FOR / (/DEV/DSK/C0T0D0S0)
NO RECORD OF ANY INCR BACKUP FOR / (/DEV/DSK/C0T0D0S0)
NO RECORD OF ANY FULL BACKUP FOR /USR (/DEV/DSK/C0T0D0S4)
NO RECORD OF ANY INCR BACKUP FOR /USR (/DEV/DSK/C0T0D0S4)
NO RECORD OF ANY FULL BACKUP FOR /VAR (/DEV/DSK/C0T0D0S3)
NO RECORD OF ANY INCR BACKUP FOR /VAR (/DEV/DSK/C0T0D0S3)

moscow

LAST RECORDED INCR BACKUP IS 256 DAYS OLD FOR /PUB/ALUM_DISK_1 (/DEV/DSK/C0T1D0S6)
NO RECORD OF ANY FULL BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
NO RECORD OF ANY INCR BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30
orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30

utrecht

leiden

...
For any given system, as in the last two systems listed, no output indicates no problem. Most systems reported no problems, or only trivial problems. Still, the new DumpDatesChk script reported far too many problems for comfort!
Once I had fixed any bugs (easy to spot when all systems report at once) and e-mailed the full report (the 'piktc -x +C' output) to the backup operator and others, I deleted everywhere with
# piktc -tv +D test +A Test -H downsys
and, after registering DumpDatesChk in the Warning section of alerts.cfg, installed everywhere with
# piktc -iv +A Warning -H downsys
This problem--monitoring the state of file backups--is so important that I will be doing more in this area in the days ahead. But we now have, in the words of the backup operator, a "very nice tool" for reporting these potentially serious--job and organization threatening--backup failures.
You probably have a backup system that uses something other than Unix dump. (It just so happens that both of our backup systems--Amanda and Budtool--use dump, or at least we have configured them that way.) If so, you can't use this DumpDatesChk script directly. But I hope I've inspired you to think about how you might apply something like this to your own situation, and also to think about how disastrous things might be if you aren't closely monitoring the extent of your own backup coverage.
[posted 2001/01/30]
Returning briefly to the DumpDatesChk script described yesterday, we were seeing repeat orphan messages like
orphaned filesystem in /etc/dumpdates: /dev/dsk/c0t0d0s5
orphaned filesystem in /etc/dumpdates: /dev/dsk/c0t0d0s5
I modified the final foreach to read:
foreach #keys($f, #incr)
    if ! #defined($mnt[$f])
        if ! #defined(#full[$f])    // to squelch
                                    // repeats from
                                    // prev foreach
            output mail "orphaned filesystem in =dumpdates: $f"
        fi
    fi
endforeach
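Another way to think about the same fix, sketched in Python rather than PIKT: take the union of the devices seen at any dump level, then subtract the mounted ones, so each orphan can only come out once. The name orphaned_devices and the dict arguments are assumptions for illustration.

```python
def orphaned_devices(full, incr, mounted):
    """Return a sorted list of devices recorded in dumpdates (at any
    level) that do not appear among the mounted file systems.

    full, incr: dicts keyed by device; mounted: dict of device -> mount point.
    """
    # A set union reports each device once, even if it has both
    # full and incremental entries.
    return sorted((set(full) | set(incr)) - set(mounted))
```

The script's nested-if version keeps two loops and reports from the second only what the first could not have seen, which has the same effect.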
And so it goes: always refining, always improving the configuration.
For more examples, see Developer's Notes.