File Backups

[posted 2001/01/29]

Earlier, I mentioned that during last week's NIS passwd near-crisis, we discovered that crucial areas of our NIS master server were not being backed up to tape.  In fact, since we put this new machine into operation in mid-December, it hadn't been backed up at all!

(Backup operator error.  It's a long story.  No, I'm not the backup operator!)

This episode galvanized me into writing a script I've been wanting to write for quite some time:

DumpDatesChkWarning

        init
                status active
                level warning
                task "Report backup problems as revealed by /etc/dumpdates"
                input proc "=dfk | =grep '^/'"
                =dfdata

        begin
                set #maxgnrl = 14       // beyond which we signal general
                                        // backup failure

#ifndef debug
                set #maxfull = 14       // beyond which we signal full
                                        // backup failure
#elsedef
                set #maxfull = -1
#endifdef

#ifndef debug
                set #maxincr = 3        // beyond which we signal incr
                                        // backup failure
#elsedef
                set #maxincr = -1
#endifdef

                set #slackfull = 7      // added days beyond which we output
                                        // in ALL CAPS

                set #slackincr = 7      // added days beyond which we output
                                        // in ALL CAPS

                if ! -e "=dumpdates"
                        output mail $upper("=dumpdates not found!")
                        quit
                fi

                =set_fa(=dumpdates)
                if #fa > #maxgnrl
                        output mail $upper("=dumpdates is more
                                            than $text(#maxgnrl)
                                            days out of date:")
                        output mail $command("=ll =dumpdates")
                        output mail "GENERAL BACKUP FAILURE!"
                fi

                if #fopen(DUMPDATES, "=dumpdates", "r") == #err()
                        output mail $upper("\#fopen() error for =dumpdates!")
                        quit
                fi
                while #read(DUMPDATES) > 0
                        if #parse($rdlin,
                                  "^([[:graph:]]+)[[:space:]]+([[:digit:]])
                                  (.+)$") != 3
                                output mail "malformed =dumpdates line: $rdlin"
                                continue
                        else
                                set $dsk = 
#if solaris
                                           $substitute($1, "/rdsk/", "/dsk/")
#elsif sunos
                                           $substitute($1, "/rsd", "/sd")
#else
                                           $1
#endif
                                set $lvl = $2
                                set $dat = $3
                                =set_lineage($rdlin)
                                set #daysold = #max(#int(#lineage/=secs_in_day),
                                                    0)
                                if $lvl eq "0"
                                        if #defined(#full[$dsk])
                                                set #full[$dsk] =
                                                    #min(#full[$dsk],
                                                    #daysold)
                                        else
                                                set #full[$dsk] = #daysold
                                        fi
                                else
                                        if #defined(#incr[$dsk])
                                                set #incr[$dsk] =
                                                    #min(#incr[$dsk],
                                                    #daysold)
                                        else
                                                set #incr[$dsk] = #daysold
                                        fi
                                fi
                        fi
                endwhile
                do #fclose(DUMPDATES)
#ifdef debug
                foreach #keys($f, #full)
                        output mail "$text(#full[$f]) $f"
                endforeach
                foreach #keys($f, #incr)
                        output mail "$text(#incr[$f]) $f"
                endforeach
                output =newline
#endifdef

        rule    // skip special file systems
                if    (    $mount =~~ "^(/proc|/swap.?)$"
                        || $mount =~~ "^(/cdrom|/home/)"
                        || $mount =~~ "^(/var/mail)$"   // we don't backup mail
                      )
                   &&      $mount !~~ "^/home/egbdf"
                        next
                fi

//#if ! ???
        rule    // skip /tmp (except possibly for some machines specified
                // in the preceding #if statement)
                if $mount =~ "^/tmp$"
                        next
                fi
//#endif

        rule
                set $mnt[$fsname] = $mount

        end

#ifdef debug
                foreach #keys ($f, $mnt)
                        output "$f $mnt[$f]"
                endforeach
                output =newline
#endifdef

                foreach #keys($f, $mnt)
                        if ! #defined(#full[$f])
                                output mail $upper("no record of any full
                                                    backup for
                                                    $mnt[$f] ($f)")
                        else
                                if #full[$f] > #maxfull + #slackfull
                                        output mail $upper("last recorded
                                                            full backup
                                                            is $text(#full[$f])
                                                            days old for
                                                            $mnt[$f] ($f)")
                                elseif #full[$f] > #maxfull
                                        output mail "last recorded full backup
                                                     is $text(#full[$f])
                                                     days old for
                                                     $mnt[$f] ($f)"
                                fi
                        fi
                        if ! #defined(#incr[$f])
                                output mail $upper("no record of any
                                                    incr backup
                                                    for $mnt[$f] ($f)")
                        else
                                if #incr[$f] > #maxincr + #slackincr
                                        output mail $upper("last recorded
                                                            incr backup
                                                            is $text(#incr[$f])
                                                            days old for
                                                            $mnt[$f] ($f)")
                                elseif #incr[$f] > #maxincr
                                        output mail "last recorded incr backup is
                                                     $text(#incr[$f])
                                                     days old for $mnt[$f] ($f)"
                                fi
                        fi
                endforeach

                foreach #keys($f, #full)
                        if ! #defined($mnt[$f])
                                output mail "orphaned filesystem in
                                             =dumpdates: $f"
                        fi
                endforeach

                foreach #keys($f, #incr)
                        if ! #defined($mnt[$f])
                                output mail "orphaned filesystem in
                                             =dumpdates: $f"
                        fi
                endforeach

The 'input proc' statement yields script input like:

/dev/dsk/c0t0d0s0     246463   73514  148303    34%    /
/dev/dsk/c0t0d0s6    1269558  816915  401861    68%    /usr
/proc                      0       0       0     0%    /proc
/dev/dsk/c0t0d0s3     492422   98217  344963    23%    /var
/dev/dsk/c0t0d0s4     492422  370390   72790    84%    /opt
/dev/dsk/c0t0d0s5    4672134 4517361  154773    97%    /export/home
/dev/dsk/c1t3d0s0    4156462 2883696 1064943    74%    /pub/perf_disk_1
/dev/dsk/c1t4d0s0    4156462 3462278  486361    88%    /pub/perf_disk_2

In the 'begin' section, we set some script parameters.

Next, we report whether the dumpdates file is missing, is way out of date, or cannot be opened for reading.
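For readers who want the same first-pass checks outside PIKT, here is a minimal Python sketch. It assumes that the script's =set_fa() macro measures a file's age in whole days from its modification time; that is an assumption, and the function name dumpdates_problems is hypothetical.

```python
import os
import time

SECS_IN_DAY = 86400


def dumpdates_problems(path="/etc/dumpdates", max_general_days=14, now=None):
    """Return warning lines for a missing, stale, or unreadable dumpdates file."""
    if now is None:
        now = time.time()
    if not os.path.exists(path):
        return [f"{path} not found!".upper()]
    problems = []
    # Age in whole days since the last modification (assumed to match
    # what =set_fa() computes in the PIKT script).
    age_days = int((now - os.stat(path).st_mtime) // SECS_IN_DAY)
    if age_days > max_general_days:
        problems.append(
            f"{path} is more than {max_general_days} days out of date:".upper())
        problems.append("GENERAL BACKUP FAILURE!")
    try:
        with open(path):
            pass  # we only care whether the file can be opened
    except OSError:
        problems.append(f"cannot open {path} for reading!".upper())
    return problems
```

An empty list means the file exists, is fresh enough, and is readable; the per-file-system checks come later.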

We then read in the dumpdates file line by line.  For Solaris and SunOS (at least), we have to tweak the device file names, because dumpdates lists character (raw) device files while the df display lists block device files.

Here are some sample dumpdates entries:

/dev/rdsk/c0t0d0s4               4 Tue Jan 23 21:05:29 2001
/dev/rdsk/c1t2d0s0               4 Tue Jan 23 21:05:47 2001
/dev/rdsk/c1t3d0s0               4 Tue Jan 23 21:47:05 2001
/dev/rdsk/c1t9d0s0               0 Wed Jan 17 22:04:39 2001
/dev/rdsk/c1t8d0s0               0 Wed Jan 17 22:04:41 2001
/dev/rdsk/c1t3d0s0               0 Wed Jan 17 22:07:33 2001
/dev/rdsk/c1t4d0s0               0 Wed Jan 17 22:28:50 2001
/dev/rdsk/c1t4d0s0               6 Sat Jan 27 21:22:27 2001
/dev/rdsk/c1t5d0s0               6 Sat Jan 27 21:57:41 2001
/dev/rdsk/c1t8d0s0               6 Sat Jan 27 22:20:54 2001
/dev/rdsk/c1t9d0s0               6 Sat Jan 27 23:20:28 2001

Note how it's '/dev/dsk' in the df display but '/dev/rdsk' in the dumpdates file.  (Different conventions are followed for different OSes.  Add your own $substitute() cases as necessary.)
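A minimal Python sketch of the parse-and-translate step, assuming one dumpdates record per line in the format shown above. The helper names are hypothetical, and unlike the PIKT pattern this regex also drops the whitespace before the date field.

```python
import re

# Device, a one-digit dump level, then the date text (like the script's
# #parse() pattern, but also skipping the whitespace before the date).
DUMPDATES_LINE = re.compile(r"^(\S+)\s+(\d)\s+(.+)$")


def raw_to_block(device, os_name):
    """Map a raw device name from dumpdates to the block name df shows."""
    if os_name == "solaris":
        return device.replace("/rdsk/", "/dsk/")
    if os_name == "sunos":
        return device.replace("/rsd", "/sd")
    return device  # other OSes: add your own cases here


def parse_dumpdates_line(line, os_name="solaris"):
    """Return (block_device, level, date_text), or None if malformed."""
    m = DUMPDATES_LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    device, level, date_text = m.groups()
    return raw_to_block(device, os_name), int(level), date_text
```

Returning None for a malformed line lets the caller report it and move on, much as the script's 'continue' does.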

The =set_lineage() macro sets #lineage to the age of the current input line, in seconds before the present time.  For this, I had to add another case to the configs_samples =set_lineage() macro:

            // for dumpdates output and others; look for date/time
            // stamp like "Feb 25 16:19:30 2000" anywhere within a line
            elseif #parse("(L)","(=months)[[:space:]]+([[:digit:]]+)[[:space:]]+
                                   ([[:digit:]]+):([[:digit:]]+):([[:digit:]]+)
                                   [[:space:]]([[:digit:]]+)") == 6
                    set #lineage = =nowdst - (#datevalue(#val($6),
                                              #monthnumber($1),
                                              #val($2)) + #timevalue(#val($3),
                                              #val($4), #val($5)))

I have considered writing a standard PIKT #lineage() function, but the current example is the best argument against doing that.  Who knows what screwy date and time formats we might encounter in the future?  My current =set_lineage() macro specification handles five different formats.  How many more will there be?  Implementing =set_lineage() as a PIKT macro makes it far easier for the ordinary user to extend than venturing into the PIKT source code (perish the thought!).
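To illustrate why an easily extended table of formats is handy, here is a minimal Python sketch of a #lineage-style calculation. It is not the =set_lineage() macro: the format table, the function name, and the second (ISO-style) format are assumptions for illustration, and time zones and daylight saving time (which the macro's =nowdst presumably handles) are ignored.

```python
import re
from datetime import datetime

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Hypothetical format table, tried in order. The first entry matches
# stamps like "Jan 23 21:05:29 2001" anywhere in a line (the dumpdates
# case); the second, ISO-style entry is an illustrative extra format.
LINEAGE_FORMATS = [
    (re.compile(r"(?:" + MONTHS + r")\s+\d+\s+\d+:\d+:\d+\s\d{4}"),
     "%b %d %H:%M:%S %Y"),
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),
     "%Y-%m-%d %H:%M:%S"),
]


def line_age_seconds(line, now):
    """Seconds from the first recognized timestamp in line to now, or None.

    Naive local times only: time zones and DST are ignored.
    """
    for pattern, fmt in LINEAGE_FORMATS:
        m = pattern.search(line)
        if m:
            # Collapse runs of spaces ("Jan  5") before calling strptime.
            stamp = datetime.strptime(" ".join(m.group(0).split()), fmt)
            return (now - stamp).total_seconds()
    return None
```

Supporting a new format means appending one (regex, format) pair, which is the same kind of local tweak the macro approach allows.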

After converting #lineage to days, we enter an 'if ... fi' construct that fills two associative arrays with information from the dumpdates file: one holding, for each file system, the age (in days) of the most recent full (dump level 0) backup, and the other holding the age of the most recent incremental (dump level > 0) backup.
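A minimal Python sketch of this bookkeeping, assuming each dumpdates record has already been reduced to a (device, level, age in seconds) tuple; record_ages is a hypothetical name.

```python
SECS_IN_DAY = 86400


def record_ages(entries):
    """Build {device: days} maps for the newest full and incremental dumps.

    entries: iterable of (device, level, age_seconds) tuples. Level 0 is a
    full dump; any higher level counts as incremental. Ages are truncated
    to whole days and clamped at zero, and for each device the smallest
    age (the most recent dump) wins, as with the script's #min() calls.
    """
    full, incr = {}, {}
    for device, level, age_seconds in entries:
        days = max(int(age_seconds / SECS_IN_DAY), 0)
        target = full if level == 0 else incr
        target[device] = min(target.get(device, days), days)
    return full, incr
```

Keeping only the minimum per device means older dumpdates lines for the same disk never mask a newer backup.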

With the dumpdates data read and stored, we enter the rule sections to begin considering the df output.

We skip several special file systems that we don't back up because it makes no sense to do so or because it's against policy.  (We don't back up user mail files.  Don't ask!)

In the next rule, we store the current file system in the $mnt[] array for later recall.

(We throw in a couple of '#ifdef debug ... #endifdef' sections for testing purposes.)

In the 'end' section, we correlate the dumpdates and df data and report any problems.

For each mounted file system in the df display, we inspect the ages of its most recent full and incremental backups.  If either age exceeds the #maxfull or #maxincr threshold, we report it in the alert e-mail.  If the age exceeds #maxfull or #maxincr by more than an additional slack factor (#slackfull or #slackincr), that is, the backups are getting uncomfortably out of date, we SCREAM in the e-mail using the $upper() function.  A file system with no recorded full or incremental backup at all is likewise reported in upper case.
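A minimal Python sketch of the threshold logic, handling one backup kind at a time (the script interleaves the full and incr reports per file system). The function name and argument layout are hypothetical; for incremental backups the script's defaults correspond to max_days = 3 and slack_days = 7.

```python
def age_messages(kind, ages, mnt, max_days, slack_days):
    """Report lines for one backup kind ("full" or "incr").

    No record at all, or an age beyond max_days + slack_days, is shouted
    in upper case; an age beyond max_days alone is reported normally.
    """
    msgs = []
    for dev, mount in mnt.items():
        if dev not in ages:
            msgs.append(f"no record of any {kind} backup for {mount} ({dev})".upper())
            continue
        text = f"last recorded {kind} backup is {ages[dev]} days old for {mount} ({dev})"
        if ages[dev] > max_days + slack_days:
            msgs.append(text.upper())
        elif ages[dev] > max_days:
            msgs.append(text)
    return msgs
```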

Finally, in the last two foreach loops, we report possibly orphaned file systems: those recorded in dumpdates but appearing nowhere in the df display.  These typically represent disks retired long ago.  It behooves us to clean them out of the dumpdates files, as they no longer serve any useful purpose.
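A minimal Python sketch of the orphan check as a set difference. Because it takes the union of the full and incr device sets first, each orphan is reported once even when a device has both kinds of entries; the function name is hypothetical.

```python
def orphaned(full, incr, mnt):
    """Devices named in dumpdates but absent from the df display, sorted."""
    return sorted((set(full) | set(incr)) - set(mnt))
```

Each returned device would then become one "orphaned filesystem in /etc/dumpdates" line.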

We added this to the Warning alert set, which runs once overnight.  Here is a typical alert message:

                                PIKT ALERT
                         Sat Jan 27 02:32:04 2001
                                  moscow

WARNING:
    DumpDatesChkWarning
        Report backup problems as revealed by /etc/dumpdates

        LAST RECORDED INCR BACKUP IS 255 DAYS OLD FOR /PUB/ALUM_DISK_1
          (/DEV/DSK/C0T1D0S6)
        NO RECORD OF ANY FULL BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
        NO RECORD OF ANY INCR BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
        orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30
        orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30

Alas, we found more than a few backup gaps like those reported for the moscow system.  These are exactly the gaps we need to know about.

In developing this, I first tested it using the special Test alert:

#ifdef test

//#  if piktmaster
//#  if milan

Test
        timing          =piktnever
//      timing          7,37 * * * *
//      drift           5
        mailcmd         "=mailx -s 'PIKT Alert on =pikthostname: Test' pikt-test"
        alarms
                        //ProcCountsChk
                        //CronLogChkUrgent
                        //AliasesChkWarning
                        //RemoveOrphanedPrintFilesNotice
                        //FileStatChkUrgent
                        DumpDatesChkWarning
//#  endif

#endifdef  // test

Previously, I didn't wrap this in an '#ifdef test ... #endifdef'.  If I left the Test section uncommented, I was either always installing the Test alert with my 'piktc -iv +A all' commands, or seeing log file messages complaining about a missing Test.alt.  To avoid this, I kept commenting and uncommenting the Test section in alerts.cfg.  Now, thanks to the '#ifdef test ... #endifdef', I can activate it more conveniently at the command line, e.g.,

# piktc -iv +D test +A Test +H ...

to install, or

# piktc -tv +D test +A Test +H ...

to delete it (and all attendant .log and .hst files, if any) when I am through with the testing.

Since the 'test' #define is set to FALSE by default in defines.cfg, ordinary piktc commands (where I don't explicitly specify '+D test') simply disregard the Test alert.

To test the new DumpDatesChk script, I installed the Test alert everywhere with

# piktc -iv +D test +A Test -H downsys

Then, I ran this test script on all systems with

# piktc -x +C "hostname; echo; =pikt +A Test; echo; echo" -H downsys 2>&1 |
    tee DumpDatesChk.out

Here is some sample output:

nismaster

/ETC/DUMPDATES IS MORE THAN 14 DAYS OUT OF DATE:
-rw-rw-r--   1 root     sys            0 Dec 18 12:15 /etc/dumpdates
GENERAL BACKUP FAILURE!
NO RECORD OF ANY FULL BACKUP FOR / (/DEV/DSK/C0T0D0S0)
NO RECORD OF ANY INCR BACKUP FOR / (/DEV/DSK/C0T0D0S0)
NO RECORD OF ANY FULL BACKUP FOR /USR (/DEV/DSK/C0T0D0S4)
NO RECORD OF ANY INCR BACKUP FOR /USR (/DEV/DSK/C0T0D0S4)
NO RECORD OF ANY FULL BACKUP FOR /VAR (/DEV/DSK/C0T0D0S3)
NO RECORD OF ANY INCR BACKUP FOR /VAR (/DEV/DSK/C0T0D0S3)


moscow

LAST RECORDED INCR BACKUP IS 256 DAYS OLD FOR /PUB/ALUM_DISK_1 (/DEV/DSK/C0T1D0S6)
NO RECORD OF ANY FULL BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
NO RECORD OF ANY INCR BACKUP FOR /OPT/MAILMAN_DISK_1 (/DEV/DSK/C2T5D0S5)
orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30
orphaned filesystem in /etc/dumpdates: /dev/md/dsk/d30


utrecht


leiden


...

For any given system, no output (as with the last two systems listed) means no problems.  Most systems reported no problems or only trivial ones.  Still, the new DumpDatesChk script reported far too many problems for comfort!

Once I had fixed any bugs (easy to spot when all systems report at once) and e-mailed the full report (the 'piktc -x +C' output) to the backup operator and others, I deleted the Test alert everywhere with

# piktc -tv +D test +A Test -H downsys

and, after registering DumpDatesChk in the Warning section of alerts.cfg, installed the Warning alerts everywhere with

# piktc -iv +A Warning -H downsys

This problem, monitoring the state of file backups, is so important that I will be doing more in this area in the days ahead.  But we now have, in the words of the backup operator, a "very nice tool" for reporting these potentially serious (job- and organization-threatening) backup failures.

You probably have a backup system that uses something other than Unix dump.  (It just so happens that both of our backup systems, Amanda and Budtool, use dump, or at least we have configured them that way.)  If so, you can't use this DumpDatesChk script directly.  But I hope I've inspired you to think about how you might apply something like it to your own situation, and also about how disastrous things might be if you aren't closely monitoring the extent of your own backup coverage.

[posted 2001/01/30]

Returning briefly to the DumpDatesChk script described yesterday: we were seeing duplicate orphan messages like

        orphaned filesystem in /etc/dumpdates: /dev/dsk/c0t0d0s5
        orphaned filesystem in /etc/dumpdates: /dev/dsk/c0t0d0s5

I modified the final foreach to read:

                foreach #keys($f, #incr)
                        if ! #defined($mnt[$f])
                                if ! #defined(#full[$f])        // to squelch
                                                                // repeats from
                                                                // prev foreach
                                        output mail "orphaned filesystem in
                                                     =dumpdates: $f"
                                fi
                        fi
                endforeach

And so it goes: always refining, always improving the configuration.

For more examples, see Developer's Notes.

 
Page last updated 2010-04-15.
PIKT® is a registered trademark of the University of Chicago.   Copyright © 1998-2010 Robert Osterlund. All rights reserved.