High Load Averages
In this example, we report perilously high load averages.
The LoadAverage script might send an alert message like the following:

                                PIKT ALERT
                         Tue Feb 26 16:54:09 2002
                                 murmansk

URGENT:
    LoadAverage
        Report perilously high system load averages

        uptime - 19:40:10 up 120 days,  4:46,  1 user,  load average: 10.75, 6.35, 4.22

        top - 19:40:13 up 120 days,  4:46,  1 user,  load average: 10.75, 6.35, 4.22
        Tasks:  91 total,   2 running,  89 sleeping,   0 stopped,   0 zombie
        Cpu(s):  0.0% us,  0.4% sy,  0.0% ni, 96.9% id,  2.6% wa,  0.0% hi,  0.1% si
        Mem:   2058228k total,  2048120k used,    10108k free,   104816k buffers
        Swap:  4016168k total,      240k used,  4015928k free,  1756208k cached

          PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        22516 root      15   0  8032  748  616 R   36  0.0   0:30.33 tar
          177 root      15   0     0    0    0 D    2  0.0  31:32.46 kswapd0
         3714 root      10  -5     0    0    0 D    2  0.0  42:24.87 kjournald
            1 root      16   0  3632  596  508 S    0  0.0   0:03.18 init
        ...

The script follows.

LoadAverage

        init
                status =piktstatus
                level =piktlevel
                task "Report perilously high system load averages"
                input proc "=uptime"
                dat $ky   $-3   // invariant key, "average:",
                                // as in "load average:"
                dat $a1   $-2
                dat $a5   $-1
                dat $a15  $
                keys $ky

        begin
#if london | copenhagen
                if    $alert() =~ "EMERGENCY"
                        set #lalim = 15.0
                elsif $alert() =~ "Urgent"
                        set #lalim = 10.0
#else
                if    #pid("dd") > 0
                        set #lalim = 15.0
                elsif $alert() =~ "EMERGENCY"
                        set #lalim = 10.0
                elsif $alert() =~ "Urgent"
                        set #lalim = 5.0
#endif
                else  // if $alert() =~ "LoadAverages"
                        set #lalim = 1.0
                fi

        rule    // dispose of trailing comma, and set value
                set #la1 = #value($chop($a1,1))

#ifdef debug
        rule
                output "\#la1 is $text(#la1), \#lalim is $text(#lalim)"
#endifdef

        rule    // if exceeds threshold
                if #la1 >= #lalim
                        // always report if manual LoadAverages script
                        if $alert() eq "LoadAverages"
                                output $trim($inline)
                        else
                                // unless load avg is rising (is at least
                                // one level higher than before)
#if missioncritical
                                =hourly(if #trunc(#la1) > #trunc(%la1)
                                                output mail "uptime - $trim($inline)"
                                                output mail =newline
                                                =toptop(1000)
                                        fi, )
#else
                                =every_four_hours(if #trunc(#la1) > #trunc(%la1)
                                                output mail "uptime - $trim($inline)"
                                                output mail =newline
                                                =toptop(1000)
                                        fi, )
#endif
                        fi
                fi

        rule    // only log load averages for Urgent alerts
                if $alert() eq "Urgent"
                        =output_alarm_log($inlin)
                fi

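The decision logic in the script can be easier to follow outside PIKT syntax. The following is a minimal Python sketch, not PIKT code. It assumes that %la1 holds the value #la1 had on the previous run, as the script's "higher than before" comment suggests. The function names one_minute_load and should_report are hypothetical, chosen only for this illustration. The sketch leaves out the per-host thresholds and the =hourly and =every_four_hours wrappers.

```python
import math


def one_minute_load(uptime_line):
    """Return the 1-minute load average from a line of uptime output."""
    fields = uptime_line.split()
    # Counting from the end: 15-min, 5-min, 1-min, then "average:".
    # The script names the same fields $, $-1, $-2 and $-3.
    if len(fields) < 4 or fields[-4] != "average:":
        raise ValueError("not an uptime line: %r" % uptime_line)
    # Drop the trailing comma, as $chop($a1,1) does in the script.
    return float(fields[-3].rstrip(","))


def should_report(load_now, load_before, limit, manual=False):
    """Decide whether to report, following the script's threshold rule."""
    if load_now < limit:
        return False      # below the threshold: stay silent
    if manual:
        return True       # a manual run always reports
    # Scheduled runs report only when the load has risen by at least
    # one whole level since the previous run.
    return math.trunc(load_now) > math.trunc(load_before)


line = "19:40:10 up 120 days,  4:46,  1 user,  load average: 10.75, 6.35, 4.22"
load = one_minute_load(line)
print(load, should_report(load, 9.8, 10.0))
```

With a threshold of 10.0, a rise from 9.8 to 10.75 crosses a whole level and is reported, while a rise from 10.2 to 10.75 is not, because both values truncate to 10.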
This is just one example program. You could add rules, or write new scripts, to do other things, for example: report and possibly kill runaway processes, report unusually high per-user process counts, report and possibly kill forbidden processes, report extremely high numbers of zombie and defunct processes, log special process accounting data, and so on.
For more examples, see Samples.