Scanning Multiple Log Files

(NOTE: Some of the techniques shown or described on this page--marked in purple--require new features in the latest official PIKT 1.19.0 release (pikt-current.tar.gz) that are unavailable in any previous version.)

Unix and Linux log files usually follow a common format, and system log file scanning tends toward sameness: Scan a log file, ignore some things, report other things, fix a few things, maybe page special things, etc. Rather than write many Pikt scripts for many system log files, where the differences between such scripts are minor or even trivial, it is better to write generalized log scanning macros using techniques described earlier.

Metalog is a replacement for the standard syslogd and klogd. Metalog operates much like syslog, but with this important difference: "Logged messages can be dispatched according to their facility, urgency, program name and/or Perl-compatible regular expressions." In the default metalog setup, metalog writes system messages to the directories critical, kernel, crond, sshd, pwdfail, and everything, among others. In each such directory, system messages are logged to a file called, appropriately enough, "current". (Metalog also automatically rotates the "current" file(s) to dated backups within each directory.)

The important point to note here is that with metalog, we don't have just one syslog-like file (for example, /var/log/syslog or /var/log/messages) to monitor; we have several, all named "current", and all located in several different directories. Equally important, the several "current" files are all very much alike. We should be able to write an all-purpose PIKT macro to scan the several different metalog "current" files.

And indeed we can. The =syslog_scan() macro following is a general log scanning script to monitor and report entries of special interest in each of the metalog "current" files:

syslog_scan(F, f, BYPASS1, LOGONLY, PAGE, BYPASS2)

        init
                status =piktstatus
                level =piktlevel
                task "Scan the system (f) log for noteworthy entries"
                input logfile "/var/log/(f)/current"

        begin   // run the load average checkpoint, but for low-priority
                // log file scripts only (i.e., always do the "critical"
                // and "kernel" log file scans)
                if $alarm() !~~ "critical|kernel"
                        =checkpoint(=lalim)
                fi

        rule    // automatic bypasses
                if    $inlin =~~ "==> /var/log/.+/current <=="
                   || $inlin =~~ "last (message|output) repeated"
                   || $inlin =~~ "metalog.+died with signal"
                   || $inlin =~ "^[[:space:]]*$"
                        next
                endif

        rule    // special alarm-specific bypasses, first bypass
                if $inlin =~~ "(BYPASS1)"
                        next
                endif

        rule    // make any desired substitutions here before assigning to $il
                set $il = $inlin

        rule    // reverse resolve any ipaddrs
                =resolveipaddr($il)

        rule    // log anything not bypassed
                if $il =~~ "(LOGONLY)"
                        =output_alarm_log($il)
                        next
                endif

        rule    // for flagged stuff, report and log (and possibly page)
                if (    $il =~~ "=redflags"
                     || $il =~~ "=yellowflags"
                   )
                        output mail $il
                        =output_alarm_log($il)
                        if $il =~~ "(PAGE)"
                                =hourly(set $pagemsg = $il
                                        =page($pagemsg\, =pageaddr, =allhours(#now())), )
                        endif
                        next
                endif

        rule    // report and log any root-related stuff
                if $alarm() =~~ "critical|kernel"
                        if $il =~ "root"
                                output mail $il
                                =output_alarm_log($il)
                                next
                        endif
                // but for cron, just bypass root stuff
                elsif $alarm() =~~ "cron"
                        if $il =~ "root"
                                next
                        endif
                endif
//#ifndef paranoid
//      rule    // bypass everything else
//              next
//#endifdef

        rule    // log anything not bypassed
                if $alarm() =~~ "critical|kernel"
                        =output_alarm_log($il)
                endif

        rule    // special alarm-specific bypasses, second bypass
                if $il =~~ "(BYPASS2)"
                        next
                endif

        rule    // report anything not bypassed
                if $alarm() =~~ "critical"
                        output mail $il
                endif

        rule    // report anything not bypassed, if in verbose mode
                if $alarm() =~~ "kernel"
                        =outputmail $il
                endif

In the init section, 'input logfile' has PIKT only show us new log entries--new since the previous script run. ('input file' would have PIKT show us the entire file each time we run the script.)

In the script's begin section, we use the =checkpoint() macro to check--but only for low-priority scripts--the system load average. If the load average is too high, we abort the script (and try again next time). (We could also check the $alert() and $level() for this purpose.)

In the script's first rule, we ignore some metalog clutter, where "next" says to proceed to the next input line (that is, the next "current" log file entry).

In the second rule, we bypass (ignore) more clutter specific to each particular "current" file. If we invoke this script macro with, for example:

        =syslog_scan(Kernel, kernel, =syslogkernelbypasses, ...)

then any log file entry that pattern matches the =syslogkernelbypasses macro, we ignore. We define =syslogkernelbypasses in macros.cfg as:

syslogkernelbypasses    root@|root=/dev/|exception support|exception polling|
                        mounted root|obsolete setsockopt|assume root bridge|
                        reset.+speed usb|set_dentry_child_flags|
                        pcie_portdrv_probe->dev.+ has invalid irq|
                        write protect|too many iterations.+nv_nic_irq|
                        analog subsections not ready

In the third rule, we set a new variable, $il, to the current input line, $inlin. This is because in the next rule, we intend to modify the current input, and $inlin is fixed (by design).

In the fourth rule, we reverse resolve IP addresses, if any, in the current input line using the macros:

addr2host(A)    // for the given ipaddr (A), return the associated host (as
                // resolved by nslookup)
                // (A) is the ipaddr (e.g., 123.111.222.111)
                $trim($command("=nslookup -sil (A) 2>/dev/null | =grep -i 'name =' |
                                =awk '{print \$NF}'"))

resolveipaddr(L)
                if (L) =~ "(.+)[[:space:]]([[:digit:]]{1,3}\\.[[:digit:]]{1,3}\\.
                                           [[:digit:]]{1,3}\\.[[:digit:]]{1,3})(.*)"
                        set (L) = "$1 $2 [" . =addr2host($2) . "]" . $3
                endif

There is a companion macro, =host2addr(), defined as:

host2addr(H)    // for the given host (H), return the associated addr (as
                // resolved by nslookup)
                // (H) is the host (e.g., hamburg.uppity.edu)
                $trim($command("=nslookup -sil (H) 2>/dev/null | =grep '^Address:' |
                                =tail -n +2 | =cut -d: -f2"))

The idea here is that if the current log entry reports an IP address, let's have PIKT do an nslookup and report a more meaningful domain name for us instead.

In the fifth rule, for certain entries, we log them only (and do not report them). For example, we might log entries referring to sysadmin activity by means of

        =syslog_scan(PwdFail, pwdfail, ..., =sysadmins, ...)

where =sysadmins is a list of system administrator accounts, defined in macros.cfg as, for example,

sysadmins       tom|dick|harry|bob|carol|ted|alice

In the sixth rule, we look for entries of special interest, i.e., those that pattern match the =redflags and =yellowflags macros. We report any such entries as e-mail, and also =output_alarm_log() them.

In the sixth rule's inner if-endif, if some entries pattern match the "(PAGE)" argument (for example, "break-in attempt|kernel panic"), we send a page using the macros

page(M, R, H)   // send a page message (M) to recipients (pager phone alias)
                // (R) but only during hours (H)
                // sample use:  =page($host is sick/down, =pagesysadmins, =offhours(#now()))
                if (H)
                        doexec wait "echo '(M)' | =mailx -a 'From: piktadmin' -s '(M)' (R)"
                fi

...

#ifdef test
pageaddr        brahms\
#elsedef
pageaddr        emergency\
#endifdef

...

offhours(H)     // between 10 PM and 6 AM
                ((H) >= 22 || (H) < 6)

allhours(H)     // any time of the day or night
                #true()

In the seventh rule, for any surviving entry (i.e., has not already been bypassed or dealt with then disposed of by means of an earlier 'next' statement), we e-mail and log any root-related critical and kernel metalog entries, but for crond metalog entries, we ignore the root stuff.

In the eighth rule, and if we are not in paranoid mode, we just ignore any surviving entries by means of an unqualified 'next' statement. Because of the unqualified 'next', any follow-up rules would be ignored. But, we have decided here to comment out this rule. It will remain in place here if in future we ever need it.

In the ninth rule, we =output_alarm_log() any surviving critical or kernel log file entries.

In the tenth rule, we apply a second bypass filter, if any.

In the eleventh rule, for critical log file entries, we report them as e-mail.

In the twelfth rule, for kernel entries, we report e-mail them, but only if we are in verbose mode. We do this by way of the macro trick:

// the verbose define controls whether certain routine messages get emailed
// or thrown away; in earlier versions of PIKT, this conditionality was
// handled in this way in alarms.cfg:
//
// #ifdef verbose
//              output mail "truncated $inlin"
// #endifdef
//
// with the macros below, we can now achieve the same effect by replacing
// the above three lines with just this one line:
//
//              =outputmail "truncated $inlin"

#ifdef verbose
outputmail      output mail
#elsedef
outputmail      output log "/dev/null"
#endifdef

// if verbose is not defined (is set to FALSE), the message is logged to
// /dev/null, that is, thrown away

=syslog_scan() is admittedly a rather complicated script, but if it's generality we want--and we do, to avoid having to write many different log scanning scripts for each particular log file--we have to add rules to deal with any special circumstance or contingency.

With the =syslog_scan() macro defined, it is now quite easy to specify, in alarms.cfg (or perhaps the #include file logs_system_alarms.cfg), several different versions of the script customized for each metalog file:

///////////////////////////////////////////////////////////////////////////////

SyslogCriticalScan

        =syslog_scan(Critical, critical, =nonesuch, =nonesuch, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

SyslogKernelScan

#if systemssys
        =syslog_scan(Kernel, kernel, segfault|=syslogkernelbypasses,
                     =nonesuch, =nonesuch, =nonesuch)
#else
        =syslog_scan(Kernel, kernel, =syslogkernelbypasses, =nonesuch, =nonesuch, =nonesuch)
#endif

///////////////////////////////////////////////////////////////////////////////

SyslogCrondScan

        =syslog_scan(Crond, crond, download, =nonesuch, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

SyslogEverythingScan
        =syslog_scan(Everything, everything, =syslogeverythingbypasses,
                     =nonesuch, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

// this alarm is not particularly useful, as mainly it reports simple
// login failures by legitimate system owners (from flubbed password
// entry, which happens all the time)

SyslogPwdfailScan

        =syslog_scan(PwdFail, pwdfail, =sysadmins, =nonesuch, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

SyslogSshdScan

        =syslog_scan(Sshd, sshd, =nonesuch, =nonesuch, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

In the SyslogEverythingScan invocation, we reference the macro

syslogeverythingbypasses        postfix|=sysadmins|download|obsolete setsockopt|
                                using 160 bit message hash|session closed|
                                reset low speed usb device|soft reset|soft_reset|
                                shutting down cleanly|failed to start vnc sessions|
                                gconf server is not in use|
                                shutting down for system reboot|write protect

What about the =nonesuch macro? =nonesuch should be defined in a manner that no log entry should ever pattern match it, that is, there should be "no such thing". We have defined it in macros.cfg as

nonesuch        nonesuch

but in the unlikely event the word "nonesuch" were ever to start appearing in our log files, we could redefine =nonesuch to something else, for example

nonesuch        rumpelstiltskin

The advantage of referencing a macro, =nonesuch, rather than a hard-coded string, "nonesuch", is that if we ever need to change this pattern match, we only need do it in one place--in the =nonesuch definition in macros.cfg. (Otherwise, we would have to hunt down and change the "nonesuch" string in many different places throughout our PIKT configuration.)

Years ago, before PIKT's macro capabilities were so fully explored and exploited, it was a huge chore to scan the many different PIKT log files, including Emergency.log, Urgent.log, Critical.log, ..., piktc.log, piktd.log, piktc_svc.log, ..., not to mention any special script-specific log files. The chore was in having to develop and maintain many different complicated and lengthy log scanning scripts, all of which did essentially the same thing: report a common set of problems (for example, any PIKT "ERROR" log entry).

Now, with the generalized script macro technique, we can define a =logs_pikt_alarms_macros.cfg, similar to the above =syslog_scan, and invoke it in alarms.cfg (or, better, the #include file logs_pikt_alarms.cfg) by means of:

///////////////////////////////////////////////////////////////////////////////

#if piktmaster
PIKTREDLogScan

        =piktlog_scan(RED, =nonesuch, =nonesuch)
#endif

///////////////////////////////////////////////////////////////////////////////

PIKTEmergencyLogScan

        =piktlog_scan(Emergency, runaway...procs, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

PIKTSysAdminsUrgentLogScan

        =piktlog_scan(SysAdminsUrgent, dmesgrf: no such file, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

PIKTCodersUrgentLogScan

        =piktlog_scan(CodersUrgent, dmesgrf: no such file, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

PIKTSysAdminsCriticalLogScan

        =piktlog_scan(SysAdminsCritical, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

PIKTSysAdminsWarningLogScan

        =piktlog_scan(SysAdminsWarning, nightly-updates.log, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

...

///////////////////////////////////////////////////////////////////////////////

PIKTpiktcLogScan

        =piktlog_scan(piktc, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

PIKTpiktc_svcLogScan

        =piktlog_scan(piktc_svc, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

PIKTpiktdLogScan

        =piktlog_scan(piktd, =nonesuch, =nonesuch)

///////////////////////////////////////////////////////////////////////////////

...

///////////////////////////////////////////////////////////////////////////////

So, we now have one generalized log scanning script macro, and several dozen short and sweet macro references as shown above. What used to be hundreds and hundreds and even thousands of lines of log scanning code, most of it essentially identical, has now been replaced by a hundred or so lines of =piktlog_scan() definition, and several dozen two-line alarm stanzas as shown above. Thousands of lengthy, duplicated, tedious-to-maintain configuration lines boiled down to just a couple hundred (all with little or no duplication). Hooray for PIKT macros!

1st page