Process Counts Example
Case Study 4: ProcCountsChk
Recently, we faced a crisis: a bug in the current version of our Web-based e-mail client causes imapd, under occasional and mysterious circumstances, to spawn new instances of itself every second or so. For a handful of users, we are seeing occasional "imapd storms" with per-user imapd counts reaching into the dozens, hundreds, and sometimes even thousands! At about the same time, but for different reasons, we began seeing "ypserv storms". Not only do these storms risk losing user mail files, they also imperil the entire system. Listing 7 is a Pikt script we have put into operation to deal with these sorts of problems.
Listing 7: ProcCountsChk
ProcCountsChk

        init
                status active
                level emergency
                task "Report unusually high counts of per-user procs."
                // note: a defunct process might show an empty comm field
                // below, so we pipe the ps output through the awk filter
                input proc "=ps -eo user,comm | =behead(1) | =awk 'NF==2' | \
                            =sort | =uniq -c"
                dat #count 1
                dat $user  2
                dat $proc  3

        begin
                // read in process and threshold data from objects file
                if #fopen(PROCCOUNTS, "=proccounts_obj", "r") != #err()
                        while #read(PROCCOUNTS) > 0
                                if #split($rdlin) == 5
                                        set #lgcnt[$1] = #val($2)   // log thresholds
                                        set #alcnt[$1] = #val($3)   // alert thresholds
                                        set #pgcnt[$1] = #val($4)   // page thresholds
                                        set #klcnt[$1] = #val($5)   // kill thresholds
                                // else send an error message?
                                fi
                        endwhile
                        do #fclose(PROCCOUNTS)
                else
                        output mail "Can't open =proccounts_obj for reading!"
                        quit
                fi

        rule
                foreach #keys($pr, #lgcnt)
                        if $proc =~~ "$pr$"     // '=~~', not 'eq', so that '\\*'
                                                // works as a default
                                if #lgcnt[$pr] && #count >= #lgcnt[$pr]
                                        // for gathering diagnostic stats
                                        output log "=proccounts_log" $inline
                                fi
                                if #alcnt[$pr] && #count >= #alcnt[$pr]
                                        output mail $inline
                                        if $proc eq "imapd"     // special case
                                                =archive_mail_file($user, #true())
                                        fi
                                fi
                                if #pgcnt[$pr] && #count >= #pgcnt[$pr]
                                        exec wait "echo '=pikthostname: $inlin' | \
                                                   =mailx -s '=pikthostname: $inlin' \
                                                   =pagesysadmins"
                                        pause 5
                                fi
                                if #klcnt[$pr] && #count >= #klcnt[$pr]
                                        =kill_user_proc($proc, $user, #true())
                                fi
                                next    // next input line
                        fi
                endforeach
The input proc statement yields input like
         34 root /usr/lib/sendmail
        404 chico imapd
          1 zeppo imapd
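To make the shape of that input concrete, here is a minimal Python sketch of what the pipeline appears to compute. It is not Pikt code. The function name count_user_procs and the sample text are hypothetical, and the sketch assumes that =behead(1) drops the ps header line and that the awk filter discards lines lacking a comm field.

```python
from collections import Counter

def count_user_procs(ps_output: str) -> list[tuple[int, str, str]]:
    """Return sorted (count, user, comm) tuples from `ps -eo user,comm` text."""
    lines = ps_output.splitlines()[1:]      # assumed effect of =behead(1): drop header
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:                # like awk 'NF==2': skip empty comm fields
            pairs.append((fields[0], fields[1]))
    counts = Counter(pairs)                 # like sort | uniq -c
    return [(n, user, comm) for (user, comm), n in sorted(counts.items())]

# Hypothetical ps output; the last line mimics a defunct process with no comm.
sample = """USER COMMAND
chico imapd
root /usr/lib/sendmail
chico imapd
zeppo imapd
ghost
"""
rows = count_user_procs(sample)
for n, user, comm in rows:
    print(n, user, comm)
```

Each resulting tuple corresponds to one input line of the script, with the three fields playing the roles of #count, $user, and $proc.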
In the begin section, we read data in from the ProcCounts.obj file (see Listing 8).
Listing 8: ProcCounts
ProcCounts

        // 0 signifies take no action; 1 signifies always take action

        // proc         log     alert   page    kill

# if moscow
        imapd           10      100     1000    100
# endif

# if mailserver
        sendmail        50      100     200     200
# else
        sendmail        5       10      20      40
# endif

# if nisserver
        ypserv          2       3       3       0
# endif
        crack           1       1       1       1
        sniffit         1       1       1       1
        // ...
        // wild card should be last in ProcCounts list
        \\*             10      20      40      0
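The begin section's read loop can be approximated in Python as follows. This is only a sketch, not Pikt code: load_thresholds is a hypothetical name, the sample rows are copied from Listing 8 for a host that is assumed to match moscow but not mailserver, and the sketch assumes the host-specific directives have already been resolved before the file is read.

```python
def load_thresholds(text: str) -> dict[str, dict[str, int]]:
    """Parse 'proc log alert page kill' rows into one table per action."""
    tables: dict[str, dict[str, int]] = {"log": {}, "alert": {}, "page": {}, "kill": {}}
    for line in text.splitlines():
        line = line.split("//", 1)[0]       # ignore comment text
        fields = line.split()
        if len(fields) != 5:                # mirrors: if #split($rdlin) == 5
            continue
        proc, *values = fields
        # dicts keep insertion order, so values pair up as log, alert, page, kill
        for action, value in zip(tables, values):
            tables[action][proc] = int(value)
    return tables

# Rows as they might look after the per-host directives are resolved (assumption).
sample = """
// proc   log alert page kill
imapd     10  100   1000 100
sendmail  5   10    20   40
crack     1   1     1    1
"""
tables = load_thresholds(sample)
print(tables["kill"])
```

The four tables play the roles of #lgcnt[], #alcnt[], #pgcnt[], and #klcnt[]. Like the Pikt script, the sketch silently skips lines that do not split into exactly five fields.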
In the script's only rule, we check, for each threshold that is non-zero, whether the actual per-user process count meets or exceeds the threshold we read in the begin section.
Instead of 'foreach #keys($pr, #lgcnt)', we could have used 'for $pr in #keys(#lgcnt)'. These accomplish the same purpose but with somewhat different syntax. Variety of expression and keyword synonyms are typical of Pikt. Did you notice the use of 'if ... endif' in Case Studies 2 and 3 as opposed to 'if ... fi' in the current case study? Another example: elif, elsif, and elseif are synonymous.
If the #lgcnt[] threshold is non-zero and the process count meets or exceeds it, we log some diagnostic statistics for post-mortem analysis. If the process count meets or exceeds a non-zero #alcnt[] threshold, we send alert mail reporting that fact. In the case of imapd only, we also back up the user's mail file by means of the =archive_mail_file() macro (not shown).
If #count meets or exceeds a non-zero #pgcnt[] threshold, we send a short alert message to =pagesysadmins, a macro that resolves to the sysadmins' pager numbers.
Finally, if #count meets or exceeds a non-zero #klcnt[] threshold, we kill off the user processes by means of the =kill_user_proc() macro (see Listing 9).
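The decision logic of the rule can be summarized in a short Python sketch. It is an analogue, not Pikt code, and it rests on several assumptions: actions_for is a hypothetical name, the wild-card entry is written as the Python regular expression ".*", Pikt's '=~~' is treated as an unanchored regular-expression match, and the keys are assumed to be visited in file order so that the wild card is tried last.

```python
import re

def actions_for(count: int, proc: str, tables: dict[str, dict[str, int]]) -> list[str]:
    """Return the actions triggered for one (count, proc) input line."""
    for pattern in tables["log"]:                # keys in file order, wild card last
        if not re.search(pattern + "$", proc):   # analogous to: $proc =~~ "$pr$"
            continue
        triggered = []
        for action in ("log", "alert", "page", "kill"):
            limit = tables[action][pattern]
            if limit and count >= limit:         # a threshold of 0 disables the action
                triggered.append(action)
        return triggered                         # first match wins, like 'next'
    return []

# Thresholds taken from Listing 8; ".*" stands in for the wild-card row (assumption).
tables = {
    "log":   {"imapd": 10,   "ypserv": 2, "crack": 1, ".*": 10},
    "alert": {"imapd": 100,  "ypserv": 3, "crack": 1, ".*": 20},
    "page":  {"imapd": 1000, "ypserv": 3, "crack": 1, ".*": 40},
    "kill":  {"imapd": 100,  "ypserv": 0, "crack": 1, ".*": 0},
}
print(actions_for(404, "imapd", tables))
print(actions_for(34, "/usr/lib/sendmail", tables))
```

With these thresholds, a count of 404 imapd processes triggers logging, alert mail, and a kill, but no page, because the page threshold of 1000 has not been reached. A process with no entry of its own, such as /usr/lib/sendmail here, falls through to the wild-card thresholds.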
Listing 9: kill_user_proc()
kill_user_proc(P, U, M)
        // kill off all instances of a given process for a given user
        // (P) is the process name (e.g., $proc, or "imapd")
        // (U) is the user (e.g., $user, or "root")
        // (M) is whether or not to output mail (e.g., #true())
        set #killcount = 1      // initialize
        while #killcount > 0
                set #killcount = 0
                do #popen(KILL, "=ps -eo pid,user,comm", "r")
                while #read(KILL) > 0
                        if #split($readline) != 3
                                cont
                        fi
                        if $2 eq (U) && $3 eq (P)
#ifdef debug
                                output log "=proccounts_log" "$1, $2, $3"
                                output log "=proccounts_log" "(P), (U), $text((M))"
#endifdef
                                exec wait "=kill -9 $1"
                                set #killcount += 1
                        fi
                endwhile
                do #pclose(KILL)
        endwhile
        if (M)
                output mail "killed all (U) (P) processes"
        fi
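The outer loop in Listing 9 matters because a storming process may respawn while it is being killed, so the macro keeps taking fresh process snapshots until a full pass finds nothing to kill. The Python sketch below illustrates that control flow only. It is not Pikt code; the list_procs and kill callables and the fake process table are hypothetical stand-ins so the example runs without touching real processes.

```python
from typing import Callable, Iterable

ProcRow = tuple[int, str, str]                  # (pid, user, comm)

def kill_user_proc(proc: str, user: str,
                   list_procs: Callable[[], Iterable[ProcRow]],
                   kill: Callable[[int], None]) -> int:
    """Kill matching processes repeatedly until a full pass finds none."""
    total = 0
    killed = 1                                  # force the first pass
    while killed > 0:
        killed = 0
        for pid, owner, comm in list_procs():   # fresh snapshot on each pass
            if owner == user and comm == proc:
                kill(pid)
                killed += 1
        total += killed
    return total

# Fake process table in which one extra imapd appears after the first kill.
table = {101: ("chico", "imapd"), 102: ("chico", "imapd"), 103: ("zeppo", "imapd")}
respawns = [(104, ("chico", "imapd"))]

def list_procs() -> list[ProcRow]:
    return [(pid, u, c) for pid, (u, c) in table.items()]

def kill(pid: int) -> None:
    del table[pid]
    if respawns:                                # simulate a respawn during the storm
        new_pid, row = respawns.pop()
        table[new_pid] = row

total = kill_user_proc("imapd", "chico", list_procs, kill)
print(total, table)
```

In this simulation the first pass kills two processes, the second pass catches the respawned one, and the third pass finds nothing and ends the loop. A real version might parse ps output in list_procs and call os.kill(pid, signal.SIGKILL) in kill.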
Here is a sample alert message:
                               PIKT ALERT
                        Tue Apr 25 00:17:02 2000
                                 moscow

        URGENT:
            ProcCountsChk
                Report unusually high counts \
                of per-user procs.

                404 chico imapd

                saved user mail file as /var/mail/arc/chico.956639822

                killed all chico imapd \
                processes
(We still don't have an understanding of these problems, much less fixes, but at least we are not losing any more user e-mail, and our mail server is coping.)
Before leaving ProcCountsChk, note that by setting all count thresholds to 1 across the board, we can guard against users running "dangerous" or "forbidden" programs such as Crack or Sniffit.