Scanning a Log File

(NOTE:  Some of the techniques shown or described on this page--marked in purple--require new features in the latest official PIKT 1.19.0 release (pikt-current.tar.gz) that are unavailable in any previous version.)

PIKT is commonly used as a log file analyzer.  You can employ PIKT to routinely scan and analyze log files and report items of special concern or interest.  Under extraordinary circumstances, you can even program PIKT to take action, such as shut down a network service or lock a user account.

For example, DmesgScan is a Pikt script to scan system dmesg (kernel ring buffer) output.  Let's put it in a separate #include file, log_system_alarms.cfg.  In the top-level macros.cfg config file, we would then reference the #include file this way:

///////////////////////////////////////////////////////////////////////////////
//
// alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////

...

#include <alarms/logs_system_alarms.cfg>

...

///////////////////////////////////////////////////////////////////////////////
Here is the DmesgScan script:
///////////////////////////////////////////////////////////////////////////////
//
// logs_system_alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////

DmesgScan

        init
                status =piktstatus
                level =piktlevel
                task "Scan the dmesg output for troublesome or noteworthy entries"
                input proc "if [ -e =hstdir/log/dmesg=piktalert=_.bak ];
                            then =diff =hstdir/log/dmesg=piktalert=_.bak
                                       =hstdir/log/dmesg=piktalert |
                                 =egrep '>' | =cut -b 3- 2>/dev/null;
                            else =cat =hstdir/log/dmesg=piktalert 2>/dev/null;
                            fi"

        begin
                if $level() !~~ "emerg|alert"
                        =checkpoint(=lalim)
                fi

                doexec wait "=dmesg | =uniq > =hstdir/log/dmesg=piktalert"

        rule
                if    $level() =~~ "emerg|alert"        // alert == urgent
                   && $inlin =~~ "=redflags"
                        leave
                elsif $level() =~~ "critical"
                   && $inlin =~~ "=yellowflags"
                        leave
                elsif $alarm() eq "ScanDmesg"
                   && $inlin =~~ "=redflags|=yellowflags"
                        leave
                else
                        next
                fi

        rule    // bypass this junk
                if    $inlin =~~ "^[[:space:]]*[[:digit:]]+[[:space:]]*$"
                   || $inlin =~~ "^[[:space:]]*not found!"
#if rotterdam
                   || $inlin =~~ "dev fd0"
#endif
                        next
                fi

        rule    // log anything not bypassed
                =output_alarm_log($inlin)

        rule    // ignore scheduled reboots
                if =reboot_period(#daynumber(), #hour())
                        next
                fi

        rule    // bypasses
                if $inlin =~~ "pcie_portdrv_probe->.+has invalid irq|
                               exception support|exception polling|
                               obsolete setsockopt|reset.+speed usb|
                               failed to allocate mem resource|
                               scsi0: aen: warning|set_dentry_child_flags|
                               write protect|handling phase mismatch|
                               too many iterations.+nv_nic_irq|
                               analog subsections not ready"
                        next
                fi

        rule    // report the good stuff
                output mail $inlin

        end     // make dmesg*.bak for diff'ing against next time around
                exec wait "=mv =hstdir/log/dmesg=piktalert
                               =hstdir/log/dmesg=piktalert=_.bak"

...
We actually invoke the DmesgScan script in three different contexts.

First, within the Urgent alert group.  Here is the Urgent stanza from alerts.cfg:

///////////////////////////////////////////////////////////////////////////////

Urgent                  // things that deserve nearly immediate attention

        timing          15,45 * * * * 1

        nicecmd         "=nice -n 19"
        mailcmd         "=mailx -a 'From: piktadmin'
                                -s 'PIKT Alert on =pikthostname: Urgent'
                                =sysadmins-urgent"

        status          active
        level           urgent

        alarms          ...
                        DmesgScan
                        ...

///////////////////////////////////////////////////////////////////////////////
Second, within the Critical alert group in alerts.cfg:
///////////////////////////////////////////////////////////////////////////////

Critical                // things that should be dealt with before too long,
                        // preferably by day's end

        timing          0,30 * * * * 1

        nicecmd         "=nice -n 19"
        mailcmd         "=mailx -a 'From: piktadmin'
                                -s 'PIKT Alert on =pikthostname: Critical'
                                =sysadmins-critical"

        status          active
        level           critical

        alarms          ...
                        DmesgScan
                        ...

///////////////////////////////////////////////////////////////////////////////
Returning to the DmesgScan script, and skipping ahead in the script for a moment, in the begin section, we first check--but for low-priority scripts only--the system load average
                if $level() !~~ "emerg|alert"
                        =checkpoint(=lalim)
                fi
using the =checkpoint() & =lalim macros:
lalim           // load average limit
#if missioncritical
        0.50
#else
#  ifdef debug
        0.10
#  elsedef
        1.00
#  endifdef
#endif

checkpoint(LA)
#ifdef debug
                output $chop($command("=uptime | =awk '{print \$10}'"))
#endifdef
                set #la = #val($chop($command("=uptime | =awk '{print \$10}'")))
                if #la >= (LA)
                        output log "=logdir/Checkpoint.log"
                                   "quit $alarm(), load average $text(#la, 2)"
                        quit
                fi
In other words, if the load average is unusually high (where "high" means > 0.5 on missioncritical systems--defined in systems.cfg--else > 1.0 everywhere else), we abort the script.

(You might consider adding the =checkpoint() macro to the begin section of many of your Pikt scripts, especially the CPU and system resources intensive ones.  There's no sense in PIKT making a bad situation worse!  Having said that, we also can say that with well crafted scripts properly scheduled, PIKT adds very little overhead to the system under normal operations.)

If the load average is not too high, hence we can proceed beyond the =checkpoint(), we capture dmesg output (squelching any duplicate entries by means of the =uniq filter) to a log file:

                doexec wait "=dmesg | =uniq > =hstdir/log/dmesg=piktalert"
Only after processing the begin section do we open the script input.  In the 'input proc' statement
                input proc "if [ -e =hstdir/log/dmesg=piktalert=_.bak ];
                            then =diff =hstdir/log/dmesg=piktalert=_.bak
                                       =hstdir/log/dmesg=piktalert |
                                 =egrep '>' | =cut -b 3- 2>/dev/null;
                            else =cat =hstdir/log/dmesg=piktalert 2>/dev/null;
                            fi"
we check for the existence of a dmesgUrgent.bak (or dmesgCritical.bak or ...) backup file.  If it exists, we diff it against the current dmesgUrgent (or ...) file created in the begin section.  We pipe the diff output to an egrep & cut filter to show just the new stuff (minus the line-beginning '> ').  Otherwise, if there is no backup file, we simply cat the current dmesgUrgent (or ...) file.  (We could use 'dmesg -c' to clear the ring buffer each time we run the script, allowing us to greatly simplify getting at the new stuff.  But we prefer to retain the dmesg contents for other purposes.  Hence all of these complexities.)

So, when all is said and done, the script input is just any new entries in the current dmesg ring buffer.

Entering the script input loop, we first check for entries of interest:

        rule
                if    $level() =~~ "emerg|alert"        // alert == urgent
                   && $inlin =~~ "=redflags"
                        leave
                elsif $level() =~~ "critical"
                   && $inlin =~~ "=yellowflags"
                        leave
                elsif $alarm() eq "ScanDmesg"
                   && $inlin =~~ "=redflags|=yellowflags"
                        leave
                else
                        next
                fi
The =redflags and =yellowflags macros, used here and widely elsewhere within our PIKT setup, are defined as
redflags        fail|error| err |die|down|seg.*fault|terminate|attack|abort

yellowflags     warn|no such|broken|invalid|attempt|time(d )*out|reset|corrupt|
                not responding| bad |going down|unauthorized|exception|halt|
                authentication|obsolete|not ready|not found|abnormal|correctable|
                unexpected|please enable|can.t find|unable to|no device found|
                too many|is bad|can.t read|corrected|does not setup|inconsistent|
                fixed|runaway|mismatch| is off|overdue|could not be|quirk detected|
                cannot allocate|spurious
In other words, the first rule says:
  • If this is a high priority alert, with a level of "emerg" (emergency) or "alert" (aka "urgent", in PIKT-speak), then only check for red-flagged, especially troublesome entries.  (So if the conditions are true, we 'leave' this rule and proceed to subsequent rules.  'next' would have us go on to the next input line, without proceeding to subsequent rules.)
  • Else if this is a not-so-high priority alert, with a level of "critical", only check for less-troubling, yellow-flagged entries.
  • Else if the alarm script's name is "ScanDmesg", a script to be run interactively from the command-line, check for both red-flagged and yellow-flagged entries.
  • Else for low-priority alerts, or for input lines that don't pattern match =redflag or =yellowflag, disregard the current dmesg entry, and move on to the next (in the script input stream).
By testing a script's $level() in this way, we can have it adapt to different circumstances.  Besides testing a script's $level(), we could also, for example, have a Pikt script behave differently depending on how it is invoked based on its name (testing the $alarm() value), its alert group (testing the $alert() value), and by still other means, such as the host it is running on (testing the $hostname()), the time of day (testing the #hour()) or day of week (testing the #daynumber()), and so on.

Of course, you can still use preprocessor #if and #ifdef directives and other controls to customize your scripts before run-time, at the time of installation.  All things considered, these are very powerful features that allow us to write fewer, adaptable, all-purpose scripts rather than many, many inflexible single-use scripts which are basically all alike except for a few details.  (Although sometimes simplicity in script design is better.  It all depends.)

In the DmesgScan script's second rule, we bypass some uninteresting junk entries.  For the rotterdam system, we also bypass any entry reporting that system's flakey floppy disk drive.

In the third rule, we log anything not bypassed so far in a special DmesgScan.log log file using the macro

output_alarm_log(M)     output log "=logdir/" . $alarm() . ".log" (M)
So, later on even if we don't e-mail certain entries because we've filtered them out, we still have them logged, just in case.

In the fourth rule

        rule    // ignore scheduled reboots
                if =reboot_period(#daynumber(), #hour())
                        next
                fi
on certain systems, which are automatically rebooted early in the morning each day, we ignore their reboot entries (much the same day after day) using the macro
reboot_period(D, H)     // (D) is currently unused
#if tabletserver
                        ((H) < 2)       // (after midnight &) before 2 am
#else
                        ((H) < 0)       // i.e., never
#endif
In the fifth rule, we bypass some oft-appearing nuisance entries.  (But remember:  we logged them in the previous rule.)

In the sixth and final rule, here is where we send the alert e-mail.  A typical message might look something like

                                PIKT ALERT
                         Sun Mar 18 21:15:05 2007
                                  milan

CRITICAL:
    DmesgScan
        Scan the dmesg output for troublesome or noteworthy entries

        usb 2-1.2: reset low speed USB device using ohci_hcd and address 18
        usb 2-1.2: reset low speed USB device using ohci_hcd and address 18
        hub 2-1:1.0: cannot reset port 3 (err = -19)
        hub 2-1:1.0: cannot reset port 3 (err = -19)
        hub 2-1:1.0: cannot reset port 3 (err = -19)
        usb 2-1.2: reset low speed USB device using ohci_hcd and address 21
        usb 2-1.2: reset low speed USB device using ohci_hcd and address 23
        ...
Finally, in the script end section, we mv the current dmesgUrgent (or ...) file to its backup version for diffing the next time around.

DmesgScan is another rather elaborate example script showcasing many of PIKT's advanced features.  Don't be intimidated by all the potential complexity!  Make your Pikt scripts as simple or as complex as you wish.  Use #include files, macros, #ifdef and #if directives, self-adapting scripts, and other sophisticated PIKT techniques depending on your needs and according to your style and taste.  Complexity and cleverness are cool, but there's also wisdom in the KISS Principle ("Keep It Simple, Stupid.").

Refer to the PIKT Samples for many, many other script examples, large and small, long and short, simple and complex.

prev page 1st page next page
 
Home | FAQ | News | Intro | Samples | Tutorial | Reference | Software
Developer's Notes | Licensing | Authors | Pikt-Users | Pikt-Workers | Related Projects | Site Index | Privacy Policy | Contact Us
Page best viewed at 1024x768 or greater.   Page last updated 2019-01-12.   This site is PIKT® powered.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.
Home FAQ News Intro Samples Tutorial Reference Software
PIKT Logo
PIKT Page Title
View sample
search terms
Pikt script