Runaway Process Macro
The runaway_process_alarms_macros.cfg file provides a script macro that reports runaway processes: processes using an excessive percentage of either CPU or MEM (memory).
runaway_process(S, s, F, LC, LU, LE, B)
        init
                status =piktstatus
                level =piktlevel
                task "Report runaway processes using a high % of the (S)"
                input proc "=psls | =trim(160) | =awk '\$(F)>=(LC)'"
                filter "=egrep '^[A-Za-z0-9\-]+[ ]+[0-9]+ ' | =zapinterpreters"
                dat #cpu 3
                dat #mem 4
                dat $proc 11
                keys $proc
        begin
                set #doheader = #true()
                doexec wait "=top -b -n1 -d1 2>/dev/null > =hstdir/log/top." . $alarm()
                doexec wait "=psall > =hstdir/log/ps." . $alarm()
#ifdef debug
        rule
                output mail "$text(#cpu,1) $text(#mem,1) $basename($proc)"
#endifdef
        rule // permanent bypasses
                if $proc =~~ "=nonesuch"
                        next
                fi
        rule // special bypasses
                if $proc =~~ "(B)"
                        next
                fi
        rule
                if $alert() =~ "RED|EMERGENCY"
                        if #(s) < (LE)
                                next
                        fi
                fi
        rule
                if $alert() =~ "Urgent"
                        if #(s) < (LU)
                                next
                        fi
                fi
        rule
#if missioncritical
                if $alert() =~ "RED|EMERGENCY"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 60)
                elsif $alert() =~ "Urgent"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 120)
                else
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 240)
                fi
#else
                if $alert() =~ "RED|EMERGENCY"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 120)
                elsif $alert() =~ "Urgent"
                        =periodically(if #doheader
                                          output mail $command("=psall\ | =head -n 1")
                                          set #doheader = #false()
                                      fi
                                      output mail $inlin, , 240)
                else
                        =daily(if #doheader
                                   output mail $command("=psall\ | =head -n 1")
                                   set #doheader = #false()
                               fi
                               output mail $inlin, )
                fi
#endif
        end
                if ! #doheader
                        output mail =newline
                        =outputfile(mail, "=hstdir/log/top." . $alarm())
                        output mail =newline
                        =outputfile(mail, "=hstdir/log/ps." . $alarm())
                fi
                quit
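Reading the macro body, the seven parameters appear to play the roles summarized below. These annotations are inferred from how each parameter is used in the listing above, not taken from a PIKT reference, so treat them as a reader's sketch:

```
// runaway_process(S, s, F, LC, LU, LE, B)
//
//   S    label inserted into the task description (CPU or MEM)
//   s    name of the numeric variable tested in the rules (cpu or mem)
//   F    awk field number of the =psls output tested by the input filter
//        (3 for %CPU, 4 for %MEM, matching the dat declarations)
//   LC   lowest threshold; lines below it are dropped by the input filter
//        and never reach the rules, whatever the alert level
//   LU   threshold applied when $alert() matches "Urgent"
//   LE   threshold applied when $alert() matches "RED|EMERGENCY"
//   B    regular expression matched against $proc; matching processes
//        are skipped (the sample invocations pass =nonesuch)
```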
You might invoke the =runaway_process() macro in your alarms.cfg file as follows:
///////////////////////////////////////////////////////////////////////////////
RunawayCPUProcs
        =runaway_process(CPU, cpu, 3, 90.0, 99.0, 100.0, =nonesuch)
///////////////////////////////////////////////////////////////////////////////
RunawayMEMProcs
        =runaway_process(MEM, mem, 4, 40.0, 50.0, 60.0, =nonesuch)
///////////////////////////////////////////////////////////////////////////////
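The last argument is matched against $proc in the "special bypasses" rule, so a site could presumably exempt known heavy processes by passing a pattern in place of =nonesuch. A hypothetical sketch follows; the script name and the pattern are invented for illustration and do not come from the PIKT samples:

```
RunawayCPUProcsExceptBackups
        =runaway_process(CPU, cpu, 3, 90.0, 99.0, 100.0, backup|rsync)
```

Because commas separate the macro arguments, it seems safest to keep such a pattern free of commas.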
Note that we can invoke either of these scripts in various alert groups throughout our alerts.cfg, and the script will adapt to its alert level (by means of, for example, 'if $alert() =~ "Urgent" ... fi').
In an Urgent group, output from the RunawayMEMProcs script might look like this:
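To make the adaptation concrete, here is roughly what the threshold logic of RunawayMEMProcs looks like once its arguments are substituted (s = mem, F = 4, LC = 40.0, LU = 50.0, LE = 60.0). This is a hand expansion that assumes plain textual substitution of the macro parameters; it is not output from PIKT itself:

```
                input proc "=psls | =trim(160) | =awk '\$4>=40.0'"
        [...]
        rule
                if $alert() =~ "RED|EMERGENCY"
                        if #mem < 60.0
                                next
                        fi
                fi
        rule
                if $alert() =~ "Urgent"
                        if #mem < 50.0
                                next
                        fi
                fi
```

Under that reading, a process at 45% MEM passes the input filter but is reported only in alert groups that match neither pattern. An Urgent group reports it from 50% upward, and a RED or EMERGENCY group from 60% upward.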
URGENT:
RunawayMEMProcs
Report runaway processes using a high % of the MEM
USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
kik      27978  7.7 51.0 559692 528512 ?        S    08:04  34:53 /usr/local/acme/bin/prodtool
top - 15:32:07 up 30 days, 6:51, 2 users, load average: 0.27, 0.16, 0.10
Tasks: 104 total, 2 running, 102 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.4% us, 0.2% sy, 0.0% ni, 98.0% id, 0.4% wa, 0.0% hi, 0.0% si
Mem: 1034932k total, 984324k used, 50608k free, 5604k buffers
Swap: 4016168k total, 28252k used, 3987916k free, 119952k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28044 kik       16   0  271m 168m  23m R   95 16.7  42:51.45 firefox-bin
 7940 root      15   0  187m  52m 4260 S    6  5.2 151:00.56 X
    1 root      16   0  1468  472  448 S    0  0.0   0:01.92 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.11 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.01 ksoftirqd/0
[...]
27978 kik       17   0  546m 516m  17m S    0 51.1  34:53.39 prodtool
[...]
Note also that, in addition to reporting the runaway process, we report top and ps output to give the runaway some context.
For more examples, see Samples.