Process Counts Example

Case Study 4: ProcCountsChk

Recently, we faced a crisis: a bug in the current version of our Web-based e-mail client causes imapd, under occasional and mysterious circumstances, to spawn new instances of itself every second or so.  For a handful of users, we are seeing occasional "imapd storms", with per-user imapd counts reaching into the dozens, hundreds, and sometimes even thousands! At about the same time, but for different reasons, we began seeing "ypserv storms".  Not only do these storms risk losing user mail files, they also imperil the entire system.  Listing 7 is a Pikt script we have put into operation to deal with these sorts of problems.
 


Listing 7: ProcCountsChk

ProcCountsChk

    init
        status active
        level emergency
        task "Report unusually high counts of per-user procs."
        // note: a defunct process might show an empty comm field
        // below, so we pipe the ps output through the awk filter
        input proc "=ps -eo user,comm | =behead(1) | =awk 'NF==2' | \
                    =sort | =uniq -c"
        dat #count 1
        dat $user  2
        dat $proc  3

    begin   // read in process and threshold data from objects file
        if #fopen(PROCCOUNTS, "=proccounts_obj", "r") != #err()
            while #read(PROCCOUNTS) > 0
                if #split($rdlin) == 5
                   set #lgcnt[$1] = #val($2)    // log   thresholds
                   set #alcnt[$1] = #val($3)    // alert thresholds
                   set #pgcnt[$1] = #val($4)    // page  thresholds
                   set #klcnt[$1] = #val($5)    // kill  thresholds
             // else send an error message?
                fi
            endwhile
            do #fclose(PROCCOUNTS)
        else
            output mail "Can't open =proccounts_obj for reading!"
            quit
        fi

    rule
        foreach #keys($pr, #lgcnt)
            if $proc =~~ "$pr$"   // '=~~', not 'eq', so that '\\*'
                                  // works as a default
                if #lgcnt[$pr] && #count >= #lgcnt[$pr]
                    // for gathering diagnostic stats
                    output log "=proccounts_log" $inline
                fi
                if #alcnt[$pr] && #count >= #alcnt[$pr]
                    output mail $inline
                    if $proc eq "imapd"  // special case
                        =archive_mail_file($user, #true())
                    fi
                fi
                if #pgcnt[$pr] && #count >= #pgcnt[$pr]
                    exec wait "echo '=pikthostname: $inlin' | \
                               =mailx -s '=pikthostname: $inlin' \
                               =pagesysadmins"
                    pause 5
                fi
                if #klcnt[$pr] && #count >= #klcnt[$pr]
                    =kill_user_proc($proc, $user, #true())
                fi
                next    // next input line
            fi
        endforeach

The input proc statement yields input like

   34  root /usr/lib/sendmail
  404  chico imapd
    1  zeppo imapd
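
For readers who do not have Pikt at hand, the counting step can be approximated in Python.  This is only a sketch of what the ps/awk/sort/uniq pipeline computes, not part of the Pikt script; the function name count_user_procs and the sample text are made up for illustration.

```python
from collections import Counter

def count_user_procs(ps_output: str) -> Counter:
    """Count (user, command) pairs in `ps -eo user,comm` style text."""
    counts = Counter()
    lines = ps_output.splitlines()[1:]   # drop the header line, like =behead(1)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:             # like awk 'NF==2': skip rows with an empty comm
            continue
        user, comm = fields
        counts[(user, comm)] += 1        # like sort | uniq -c
    return counts

# Hypothetical ps output; the "harpo" row mimics a defunct process with no comm.
sample = """USER COMMAND
chico imapd
chico imapd
zeppo imapd
harpo
root /usr/lib/sendmail
"""

for (user, comm), n in sorted(count_user_procs(sample).items()):
    print(n, user, comm)
```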

In the begin section, we read data in from the ProcCounts.obj file (see Listing 8).
 


Listing 8: ProcCounts

ProcCounts

// 0 signifies take no action; 1 signifies always take action

//      proc            log         alert       page        kill

#    if moscow
        imapd           10          100         1000        100
#    endif
#    if mailserver
        sendmail        50          100         200         200
#    else
        sendmail        5           10          20          40
#    endif
#    if nisserver
        ypserv          2           3           3           0
#    endif
        crack           1           1           1           1
        sniffit         1           1           1           1
//      ...
// wild card should be last in ProcCounts list
        \\*             10          20          40          0
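
What the begin section does with these rows can be sketched in Python.  The sketch assumes that the file installed on a given host holds only the rows that apply to that host, with the #if/#endif directives and comments already resolved; parse_thresholds and the sample text are illustrative names, not part of Pikt.

```python
def parse_thresholds(text: str) -> dict:
    """Map each process pattern to its log/alert/page/kill thresholds."""
    thresholds = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:                  # same guard as '#split($rdlin) == 5'
            continue
        name, *numbers = fields
        if not all(n.isdigit() for n in numbers):
            continue                          # our own choice: skip non-numeric rows
        log, alert, page, kill = (int(n) for n in numbers)
        thresholds[name] = {"log": log, "alert": alert, "page": page, "kill": kill}
    return thresholds                         # dicts keep insertion order

# Rows as they might appear on a host that is neither a mail nor a NIS server.
installed = r"""
sendmail        5           10          20          40
crack           1           1           1           1
bogus           1           2
\\*             10          20          40          0
"""

table = parse_thresholds(installed)
for name, limits in table.items():
    print(name, limits)
```

Because the dictionary keeps rows in file order, the wild-card row stays last, as the comment in Listing 8 requires.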

In the script's only rule, we check whether the actual per-user process count meets or exceeds each threshold read in the begin section, provided that threshold is non-zero.
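
The matching and threshold tests in the rule can be sketched in Python as follows.  The threshold values are the non-mail-server rows of Listing 8; the ".*" key is our stand-in for the wild-card row (an assumption about how that row behaves, not Pikt syntax), and actions_for is an illustrative name.

```python
import re

# Abbreviated thresholds; ".*" stands in for the wild-card row (assumption).
THRESHOLDS = {
    "sendmail": {"log": 5, "alert": 10, "page": 20, "kill": 40},
    "crack": {"log": 1, "alert": 1, "page": 1, "kill": 1},
    ".*": {"log": 10, "alert": 20, "page": 40, "kill": 0},
}

def actions_for(proc: str, count: int, thresholds: dict) -> list:
    """Return the actions triggered for one input line."""
    for pattern, limits in thresholds.items():
        # End-anchored regex match, like '$proc =~~ "$pr$"', so that
        # "sendmail" also matches "/usr/lib/sendmail".
        if re.search(pattern + "$", proc):
            # A zero threshold disables the action; otherwise the count
            # must meet or exceed the threshold.
            triggered = [name for name, limit in limits.items()
                         if limit and count >= limit]
            return triggered   # first match wins, like 'next' in the rule
    return []

print(actions_for("/usr/lib/sendmail", 34, THRESHOLDS))
print(actions_for("vi", 50, THRESHOLDS))
```

Only the first matching row is consulted, which is why the wild-card row has to come last.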

Instead of 'foreach #keys($pr, #lgcnt)', we could have used 'for $pr in #keys(#lgcnt)'.  The two forms accomplish the same purpose with somewhat different syntax.  Variety of expression and keyword synonyms are typical of Pikt.  Did you notice the use of 'if ... endif' in Case Studies 2 and 3 as opposed to 'if ... fi' in the current case study? As another example, elif, elsif, and elseif are synonyms.

If the #lgcnt[] threshold is non-zero and the process count meets or exceeds it, we log the input line for post-mortem analysis.  If the process count meets or exceeds #alcnt[], we send alert mail reporting that fact.  In the case of imapd only, we also back up the user's mail file by means of the =archive_mail_file() macro (not shown).

If #count meets or exceeds #pgcnt[], we send a short alert message to =pagesysadmins, a macro that resolves to the sysadmins' pager numbers.

Finally, if #count meets or exceeds #klcnt[], we kill off the user's processes by means of the =kill_user_proc() macro (see Listing 9).
 


Listing 9: kill_user_proc()

kill_user_proc(P, U, M)
    // kill off all instances of a given process for a given user
    // (P) is the process name (e.g., $proc, or "imapd")
    // (U) is the user (e.g., $user, or "root")
    // (M) is whether or not to output mail (e.g., #true())
    set #killcount = 1    // initialize
    while #killcount > 0
        set #killcount = 0
        do #popen(KILL, "=ps -eo pid,user,comm", "r")
        while #read(KILL) > 0
            if #split($readline) != 3
                cont
            fi
            if    $2 eq (U)
               && $3 eq (P)
#ifdef debug
                output log "=proccounts_log" "$1, $2, $3"
                output log "=proccounts_log" "(P), (U), $text((M))"
#endifdef
                exec wait "=kill -9 $1"
                set #killcount += 1
            fi
        endwhile
        do #pclose(KILL)
    endwhile
    if (M)
        output mail "killed all (U) (P) processes"
    fi
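
The macro keeps rescanning the process list until a pass finds nothing left to kill, which matters when processes respawn during the sweep.  A Python sketch of that loop follows; the listing and killing steps are passed in as functions so the logic can be tried against a fake process table.  All names here are illustrative, and nothing is actually killed.

```python
def kill_user_procs(proc, user, list_procs, kill) -> int:
    """Repeat kill passes until one pass kills nothing; return total kills."""
    total = 0
    while True:
        killed = 0
        for pid, owner, comm in list_procs():   # like reading 'ps -eo pid,user,comm'
            if owner == user and comm == proc:
                kill(pid)
                killed += 1
        total += killed
        if killed == 0:                         # like leaving 'while #killcount > 0'
            return total

# Fake process table standing in for the real system.
table = {101: ("chico", "imapd"), 102: ("chico", "imapd"), 103: ("zeppo", "imapd")}
respawned = []

def list_procs():
    # Return a snapshot so the table can change while we iterate.
    return [(pid, u, c) for pid, (u, c) in table.items()]

def kill(pid):
    table.pop(pid, None)
    if not respawned:                           # simulate one respawn mid-sweep
        table[200] = ("chico", "imapd")
        respawned.append(200)

total = kill_user_procs("imapd", "chico", list_procs, kill)
print(total, table)
```

A single pass would have missed the respawned process; the outer loop catches it on the second pass and stops after the third finds nothing.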

Here is a sample alert message:

              PIKT ALERT
       Tue Apr 25 00:17:02 2000
                moscow

URGENT:
  ProcCountsChk
    Report unusually high counts \
      of per-user procs.

    404  chico imapd
    saved user mail file as
      /var/mail/arc/chico.956639822
    killed all chico imapd \
      processes

(We still don't have an understanding of these problems, much less fixes, but at least we are not losing any more user e-mail, and our mail server is coping.)

Before leaving ProcCountsChk, note that by setting all four count thresholds to 1 for a given program, as Listing 8 does for crack and sniffit, we can guard against users running "dangerous" or "forbidden" programs such as Crack or Sniffit.

Page best viewed at 1024x768 or greater.   Page last updated 2018-01-02.   This site is PIKT® powered.
Copyright © 1998-2018 Robert Osterlund. All rights reserved.