Runaway Process Macro
The runaway_process macro in runaway_process_alarms_macros.cfg is a script macro that reports runaway processes, that is, processes using an excessive percentage of either CPU or MEM (memory).
runaway_process(S, s, F, LC, LU, LE, B)

        init
                status =piktstatus
                level =piktlevel
                task "Report runaway processes using a high % of the (S)"
                input proc "=psls | =trim(160) | =awk '\$(F)>=(LC)'"
                filter "=egrep '^[A-Za-z0-9\-]+[ ]+[0-9]+ ' | =zapinterpreters"
                dat #cpu 3
                dat #mem 4
                dat $proc 11
                keys $proc

        begin
                set #doheader = #true()
                doexec wait "=top -b -n1 -d1 2>/dev/null > =hstdir/log/top." . $alarm()
                doexec wait "=psall > =hstdir/log/ps." . $alarm()

#ifdef debug
        rule
                output mail "$text(#cpu,1) $text(#mem,1) $basename($proc)"
#endifdef

        rule    // permanent bypasses
                if $proc =~~ "=nonesuch"
                        next
                fi

        rule    // special bypasses
                if $proc =~~ "(B)"
                        next
                fi

        rule
                if $alert() =~ "RED|EMERGENCY"
                        if #(s) < (LE)
                                next
                        fi
                fi

        rule
                if $alert() =~ "Urgent"
                        if #(s) < (LU)
                                next
                        fi
                fi

        rule
#if missioncritical
                if $alert() =~ "RED|EMERGENCY"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 60)
                elsif $alert() =~ "Urgent"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 120)
                else
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 240)
                fi
#else
                if $alert() =~ "RED|EMERGENCY"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 120)
                elsif $alert() =~ "Urgent"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 240)
                else
                        =daily(if #doheader
                                   output mail $command("=psall | =hea\d -n 1")
                                   set #doheader = #false()
                               fi
                               output mail $inlin, )
                fi
#endif

        end
                if ! #doheader
                        output mail =newline
                        =outputfile(mail, "=hstdir/log/top." . $alarm())
                        output mail =newline
                        =outputfile(mail, "=hstdir/log/ps." . $alarm())
                fi
                quit
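To make the macro's input and filter lines easier to follow, here is a minimal Python sketch of the same row selection: keep only lines shaped like a process row (user name, then numeric PID), then keep rows whose chosen column meets the threshold. It assumes ps-style rows with at least 11 whitespace-separated columns, matching 'dat $proc 11'. The =psls, =trim, and =zapinterpreters macros are not shown on this page and are left out. The runaway_rows() helper and the "root" sample row are hypothetical.

```python
import re

# Same pattern as the macro's egrep filter: a user name, spaces, then a numeric PID.
ROW = re.compile(r"^[A-Za-z0-9\-]+[ ]+[0-9]+ ")

def runaway_rows(lines, field, threshold):
    """Yield (value, command) for process rows whose 1-based field is >= threshold."""
    for line in lines:
        if not ROW.search(line):
            continue  # drops the header line and anything not shaped like a process row
        cols = line.split()
        if len(cols) < 11:
            continue  # assumption: skip rows too short to carry a command column
        value = float(cols[field - 1])
        if value >= threshold:
            yield value, cols[10]  # column 11 is the command, as in 'dat $proc 11'

SAMPLE = [
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND",
    "kik 27978 7.7 51.0 559692 528512 ? S 08:04 34:53 /usr/local/acme/bin/prodtool",
    "root 1 0.0 0.0 1468 472 ? S Jan01 0:01 init",  # made-up row for illustration
]

# Field 4 is %MEM; 40.0 plays the role of the (LC) argument.
print(list(runaway_rows(SAMPLE, 4, 40.0)))
```

With field 3 (%CPU) and a threshold of 90.0, the same sample yields no rows, since no process in it is CPU-bound.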
You might invoke the =runaway_process() macro in your alarms.cfg file thusly:
///////////////////////////////////////////////////////////////////////////////

RunawayCPUProcs

        =runaway_process(CPU, cpu, 3, 90.0, 99.0, 100.0, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

RunawayMEMProcs

        =runaway_process(MEM, mem, 4, 40.0, 50.0, 60.0, =nonesuch)

///////////////////////////////////////////////////////////////////////////////
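As a rough illustration of what these two invocations do to the macro body, the Python sketch below substitutes arguments for the (F) and (LC) placeholders in a simplified copy of the awk filter. It only mimics textual substitution and is not PIKT's actual macro processor. The expand() helper is hypothetical, and the template omits the =awk macro prefix and the backslash escape used in the real definition.

```python
import re

def expand(template, params):
    """Replace each (NAME) placeholder with its argument; unknown names are left alone."""
    def substitute(match):
        name = match.group(1)
        return str(params.get(name, match.group(0)))
    return re.sub(r"\(([A-Za-z]+)\)", substitute, template)

# Simplified form of the macro's threshold test on the input line.
AWK_TEMPLATE = "awk '$(F)>=(LC)'"

cpu_filter = expand(AWK_TEMPLATE, {"F": 3, "LC": 90.0})  # RunawayCPUProcs arguments
mem_filter = expand(AWK_TEMPLATE, {"F": 4, "LC": 40.0})  # RunawayMEMProcs arguments

print(cpu_filter)
print(mem_filter)
```

Under these assumptions, RunawayCPUProcs tests column 3 against 90.0 and RunawayMEMProcs tests column 4 against 40.0.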
Note that we can invoke either of these scripts in various alert groups throughout our alerts.cfg, and the script will adapt to its alert level (by means of, for example, 'if $alert() =~ "Urgent" ... fi').
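The level-dependent thresholds can be sketched in Python as follows. This is only an illustration of the macro's rules as read from its definition, not PIKT code. should_report() is a hypothetical helper, "Warning" is a made-up name for any alert that matches neither pattern, and the matching is assumed to be case-sensitive, as written in the macro.

```python
import re

def should_report(alert, value, lc, lu, le):
    """Return True if a process at `value` percent would be reported under `alert`."""
    if value < lc:
        return False  # never passes the (LC) threshold on the input line
    if re.search(r"RED|EMERGENCY", alert) and value < le:
        return False  # emergency-level alerts require at least (LE)
    if re.search(r"Urgent", alert) and value < lu:
        return False  # Urgent alerts require at least (LU)
    return True

# Thresholds from the RunawayMEMProcs invocation.
MEM = dict(lc=40.0, lu=50.0, le=60.0)

for alert in ("Warning", "Urgent", "EMERGENCY"):
    print(alert, should_report(alert, 51.0, **MEM))
```

A process at 51.0% MEM is therefore reported in the hypothetical Warning group and in Urgent, but not in EMERGENCY, where it would need to reach 60.0.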
In an Urgent group, output from the RunawayMEMProcs script might look like this:
                                URGENT:
                            RunawayMEMProcs

        Report runaway processes using a high % of the MEM

USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
kik      27978  7.7 51.0 559692 528512 ?        S    08:04  34:53 /usr/local/acme/bin/prodtool

top - 15:32:07 up 30 days,  6:51,  2 users,  load average: 0.27, 0.16, 0.10
Tasks: 104 total,   2 running, 102 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.4% us,  0.2% sy,  0.0% ni, 98.0% id,  0.4% wa,  0.0% hi,  0.0% si
Mem:   1034932k total,   984324k used,    50608k free,     5604k buffers
Swap:  4016168k total,    28252k used,  3987916k free,   119952k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28044 kik       16   0  271m 168m  23m R   95 16.7  42:51.45 firefox-bin
 7940 root      15   0  187m  52m 4260 S    6  5.2 151:00.56 X
    1 root      16   0  1468  472  448 S    0  0.0   0:01.92 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.11 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.01 ksoftirqd/0
[...]
27978 kik       17   0  546m 516m  17m S    0 51.1  34:53.39 prodtool
[...]
Note how, in addition to reporting the runaway process, we also report top and ps output to give context to the runaway.
For more examples, see Samples.