Process Time Limits
Mark Cheverton writes:
One of our machines had a bit of a problem today and was taken offline. Another machine NFS-mounted partitions of the first machine and so started to have some problems of its own. Unfortunately, PIKT didn't pick this up, for the following reason. My LoadAvg check (the high load alerted us to the problem) was part of the Emergency alert, along with an alarm that checked free disk. This alarm was running df, which was hanging on the NFS mount, highlighting the need for me to use df -l in my alarm scripts. More generally, I was a bit worried to see that alarms aren't run in parallel, so one bad one in an alert set means the others don't get run. More importantly, I don't see any way to set a timeout on my 'input proc' at the PIKT level. Can I suggest this as a feature request?
And I respond:
We've encountered that sort of problem before, too.
In order to set process time limits, I suggest you use the maxtime.exp script provided in the configs/programs directory:
///////////////////////////////////////////////////////////////////////////////
//
// maxtime_programs.cfg
//
///////////////////////////////////////////////////////////////////////////////

maxtime.exp     // set a process time limit

        #!=expect

        set timeout [lindex $argv 0]
        eval spawn -noecho [lrange $argv 1 end]
        expect

///////////////////////////////////////////////////////////////////////////////
Here is a sample use from the 1.15.0 configs_samples/alarms/reboot_alarms.cfg:
if $command("=maxtime $timeout =rsh $host hostname") !~ $host
Here's a trivial example of using maxtime.exp:
vienna# /pikt/lib/programs/maxtime.exp 1 yes
y
y
y
y
y
...
vienna#                 [program died after 1 second]
Of course, you need to have expect installed on your system.
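The three working lines of maxtime.exp set Expect's timeout from the first argument, spawn the remaining arguments as a command, and then wait with a bare expect until the command's output ends or the timeout expires. Where Expect is not available, a similar effect can be sketched in plain shell. This is only a rough stand-in and not part of PIKT: the function name maxtime and its watchdog approach are my own assumptions, and unlike Expect it does not run the command on a pseudo-terminal.

```shell
#!/bin/sh
# maxtime: hypothetical shell counterpart to maxtime.exp (not part of PIKT).
# usage: maxtime SECONDS COMMAND [ARG...]
maxtime() {
    limit=$1
    shift

    "$@" &                          # start the command in the background
    cmd_pid=$!

    # Watchdog: once the limit has passed, send the command a SIGTERM.
    # Its output is discarded so it never holds a caller's pipe open.
    (
        sleep "$limit"
        kill "$cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    dog_pid=$!

    # Returns when the command exits on its own or is killed by the watchdog.
    status=0
    wait "$cmd_pid" 2>/dev/null || status=$?

    # The command is gone either way, so the watchdog is no longer needed.
    kill "$dog_pid" 2>/dev/null || true
    return "$status"
}
```

With this defined, maxtime 10 df -k prints the df output if df finishes within ten seconds. If the limit is reached first, the command is killed and the function returns a status above 128. A command that ignores SIGTERM would survive this sketch.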
In the 1.15.0 configs_samples, I make extensive use of the =dflinput macro. =dflinput is defined as:
///////////////////////////////////////////////////////////////////////////////

dflinput        // df display for local disks
                input proc "=dfl | =behead(1)"

///////////////////////////////////////////////////////////////////////////////
You could incorporate maxtime.exp with:
///////////////////////////////////////////////////////////////////////////////

dflinput(T)     // df display for local disks
                // timeout after (T) seconds
                input proc "=maxtime (T) =dfl | =behead(1)"

///////////////////////////////////////////////////////////////////////////////
(At times we have seen 'df -k -F ufs' procs hang, i.e., even when accessing local disks only. So applying maxtime.exp to local disk reads has its uses, too.)
You could adjust your other =df*input macros in the same way. Or, if you are not using the =df*input macros, just prepend "=maxtime <some number>" to the command in the appropriate 'input proc' statements.
In fact, you could even define this all-encompassing macro:
///////////////////////////////////////////////////////////////////////////////

inputproc(P, T) // do an 'input proc' of proc (P)
                // timeout after (T) seconds if necessary
                input proc "=maxtime (T) (P)"

///////////////////////////////////////////////////////////////////////////////
You could then redefine the =dfinput macro as
///////////////////////////////////////////////////////////////////////////////

dfinput(T)      // df display for all disks
                // timeout after (T) seconds
                =inputproc(=dfk | =behead(1), (T))

///////////////////////////////////////////////////////////////////////////////
You might use this as =dfinput(10), which sets a ten-second limit.
I defined =inputproc() as described, then substituted =dfinput(10) in one of my alarm scripts. Rather than install and clobber the existing client (slave) script, I just diff'ed my test configuration against the client copy:
vienna# piktc -fv +A EMERGENCY +H madrid

processing madrid...
fetching file(s)...
EMERGENCY.alt fetched
diffing file(s)...
diff -r /pikt/lib/configs/staging/EMERGENCY.alt /pikt/lib/configs/diffing/EMERGENCY.alt
462c462
< input proc "/pikt/lib/programs/maxtime.exp 10 /usr/bin/df -k | /usr/bin/sed '1,1d'"
---
> input proc "/usr/bin/df -k | /usr/bin/sed '1,1d'"
It seems to work!
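The diff also shows where the time limit lands after macro expansion: maxtime.exp wraps only df, while sed sits outside the limit and simply reads whatever df managed to print. The sketch below imitates that arrangement in plain shell. It assumes GNU coreutils timeout as a stand-in for maxtime.exp, and the hung_df script is a hypothetical substitute for a df that prints some output and then hangs.

```shell
#!/bin/sh
# Hypothetical stand-in for a hanging df: print a header line and one data
# line, then block for 30 seconds.
hung_df='printf "%s\n" "Filesystem 1K-blocks" "/dev/sda1 1024"; exec sleep 30'

# Only the first command in the pipeline is under the time limit.
# sed reads the pipe and finishes as soon as the limited command is killed.
limited_df() {
    timeout 1 sh -c "$hung_df" | sed '1,1d'
}

limited_df    # prints "/dev/sda1 1024" after about one second
```

In this shell sketch the pipeline's exit status is that of sed, so the status alone does not reveal that the first command timed out. The lines printed before the hang still come through.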
You could get in the habit of using =inputproc() everywhere, with a timeout suited to each command, instead of the simple
input proc "..."
Note that in the =inputproc() invocation, the proc macro argument should not be enclosed in quotes, for example
=inputproc(/usr/sbin/dptutil -L raid | =behead(2), 10)
(That was just a quick example pulled from our configs. I'm not suggesting that putting a timeout on the dptutil command is necessary or even desirable.)
Using Expect and maxtime.exp with PIKT is in the best Unix tradition of combining well-tested smaller tools.
So rather than add this new feature to the pikt interpreter, I prefer to rely on Expect, maxtime.exp (or some variant), and PIKT macros where necessary.
As for alarm parallelization, that's under your control, too. Just split alarms into smaller and smaller alert groups as necessary. Each of the new, smaller alert groups will then run under its own independent pikt process.
Parallelizing by default has its problems. I prefer the current scheme, which allows the user to parallelize as needed.
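The effect of splitting can be pictured with ordinary shell processes. This is an analogy only, not PIKT configuration: the three check functions are hypothetical stand-ins for alarms, and background jobs stand in for the independent pikt processes.

```shell
#!/bin/sh
# Hypothetical stand-ins for three alarms. hung_check imitates a df that is
# stuck for a while on a dead NFS mount.
hung_check() { sleep 2; echo "disk: finally answered"; }
load_check() { echo "load: ok"; }
proc_check() { echo "procs: ok"; }

# One alert group: the checks run one after another, so load_check and
# proc_check cannot start until hung_check returns.
run_grouped() {
    hung_check
    load_check
    proc_check
}

# Split into separate groups: each check runs in its own process, so the
# quick ones report right away and only hung_check is delayed.
run_split() {
    hung_check &
    load_check &
    proc_check &
    wait    # collect all three before returning
}
```

Calling run_grouped prints the disk line first, because everything queues behind the slow check. Calling run_split prints the disk line last, after the two quick checks have already reported.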
For more examples, see Developer's Notes.