High Load Averages
In this example, we report perilously high load averages.
The LoadAverage script might send an alert message like the following:
                                PIKT ALERT
                         Tue Feb 26 16:54:09 2002
                                 murmansk

URGENT:
    LoadAverage
        Report perilously high system load averages

        uptime - 19:40:10 up 120 days, 4:46, 1 user, load average: 10.75, 6.35, 4.22

        top - 19:40:13 up 120 days, 4:46, 1 user, load average: 10.75, 6.35, 4.22
        Tasks: 91 total, 2 running, 89 sleeping, 0 stopped, 0 zombie
        Cpu(s): 0.0% us, 0.4% sy, 0.0% ni, 96.9% id, 2.6% wa, 0.0% hi, 0.1% si
        Mem: 2058228k total, 2048120k used, 10108k free, 104816k buffers
        Swap: 4016168k total, 240k used, 4015928k free, 1756208k cached

          PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        22516 root      15   0  8032  748  616 R   36  0.0   0:30.33 tar
          177 root      15   0     0    0    0 D    2  0.0  31:32.46 kswapd0
         3714 root      10  -5     0    0    0 D    2  0.0  42:24.87 kjournald
            1 root      16   0  3632  596  508 S    0  0.0   0:03.18 init
        ...
The script follows.
LoadAverage

    init
        status =piktstatus
        level =piktlevel
        task "Report perilously high system load averages"
        input proc "=uptime"
        dat $ky  $-3    // invariant key, "average:",
                        // as in "load average:"
        dat $a1  $-2
        dat $a5  $-1
        dat $a15 $
        keys $ky

    begin
#if london | copenhagen
        if    $alert() =~ "EMERGENCY"
            set #lalim = 15.0
        elsif $alert() =~ "Urgent"
            set #lalim = 10.0
#else
        if    #pid("dd") > 0
            set #lalim = 15.0
        elsif $alert() =~ "EMERGENCY"
            set #lalim = 10.0
        elsif $alert() =~ "Urgent"
            set #lalim = 5.0
#endif
        else    // if $alert() =~ "LoadAverages"
            set #lalim = 1.0
        fi

    rule    // dispose of trailing comma, and set value
        set #la1 = #value($chop($a1,1))

#ifdef debug
    rule
        output "\#la1 is $text(#la1), \#lalim is $text(#lalim)"
#endifdef

    rule    // if exceeds threshold
        if #la1 >= #lalim
            // always report if manual LoadAverages script
            if $alert() eq "LoadAverages"
                output $trim($inline)
            else
                // unless load avg is rising (is at least
                // one level higher than before)
#if missioncritical
                =hourly(if #trunc(#la1) > #trunc(%la1)
                            output mail "uptime - $trim($inline)"
                            output mail =newline =toptop(1000)
                        fi, )
#else
                =every_four_hours(if #trunc(#la1) > #trunc(%la1)
                            output mail "uptime - $trim($inline)"
                            output mail =newline =toptop(1000)
                        fi, )
#endif
            fi
        fi

    rule    // only log load averages for Urgent alerts
        if $alert() eq "Urgent"
            =output_alarm_log($inline)
        fi
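
For readers unfamiliar with PIKT syntax, the script's core logic can be sketched in Python. This is only an illustration, not part of PIKT: the function names (parse_load_averages, load_limit, should_report) are hypothetical, the thresholds copy the #else branch of the script, and previous_la1 stands in for %la1, which is assumed here to hold the value of #la1 from the previous run. The sketch uses a plain substring test where the script uses =~, and it leaves out the =hourly and =every_four_hours macros.

```python
import math


def parse_load_averages(uptime_line):
    """Return (key, la1, la5, la15) from one line of uptime output."""
    fields = uptime_line.split()
    # Count fields from the end of the line, as the dat statements do:
    # $-3 is "average:", $-2 the 1-minute figure, $ the 15-minute figure.
    key = fields[-4]
    la1 = float(fields[-3].rstrip(","))   # drop the trailing comma
    la5 = float(fields[-2].rstrip(","))
    la15 = float(fields[-1])
    return key, la1, la5, la15


def load_limit(alert, dd_running=False):
    """Pick a threshold the way the script's #else branch does."""
    if dd_running:
        return 15.0
    if "EMERGENCY" in alert:
        return 10.0
    if "Urgent" in alert:
        return 5.0
    return 1.0   # manual run, e.g. the LoadAverages alert


def should_report(la1, limit, previous_la1, manual=False):
    """Report only above the limit, and (unless manual) only when rising."""
    if la1 < limit:
        return False
    if manual:
        return True
    # "Rising" means the whole-number part is higher than last time.
    return math.trunc(la1) > math.trunc(previous_la1)


line = "19:40:10 up 120 days, 4:46, 1 user, load average: 10.75, 6.35, 4.22"
key, la1, la5, la15 = parse_load_averages(line)
limit = load_limit("Urgent")
print(key, la1, limit, should_report(la1, limit, previous_la1=9.8))
# prints: average: 10.75 5.0 True
```

Comparing truncated values means a load that hovers between 10.2 and 10.9 triggers only one report, while a jump from 10 to 11 triggers another.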
This is just one example program. You could add rules, or write new scripts, to do things such as: report and possibly kill runaway processes; report unusually high per-user process counts; report and possibly kill forbidden processes; report extremely high numbers of zombie and defunct processes; or log special process accounting data.
For more examples, see Samples.