Scanning a Log File
(NOTE: Some of the techniques shown or described on this page--marked in purple--require new features in the latest official PIKT 1.19.0 release (pikt-current.tar.gz) that are unavailable in any previous version.)
PIKT is commonly used as a log file analyzer. You can employ PIKT to routinely scan and analyze log files and report items of special concern or interest. Under extraordinary circumstances, you can even program PIKT to take action, such as shut down a network service or lock a user account.
For example, DmesgScan is a Pikt script to scan system dmesg (kernel ring buffer) output. Let's put it in a separate #include file, log_system_alarms.cfg. In the top-level macros.cfg config file, we would then reference the #include file this way:
/////////////////////////////////////////////////////////////////////////////// // // alarms.cfg // /////////////////////////////////////////////////////////////////////////////// ... #include <alarms/logs_system_alarms.cfg> ... ///////////////////////////////////////////////////////////////////////////////Here is the DmesgScan script:
///////////////////////////////////////////////////////////////////////////////
//
// logs_system_alarms.cfg
//
///////////////////////////////////////////////////////////////////////////////
DmesgScan
init
status =piktstatus
level =piktlevel
task "Scan the dmesg output for troublesome or noteworthy entries"
input proc "if [ -e =hstdir/log/dmesg=piktalert=_.bak ];
then =diff =hstdir/log/dmesg=piktalert=_.bak
=hstdir/log/dmesg=piktalert |
=egrep '>' | =cut -b 3- 2>/dev/null;
else =cat =hstdir/log/dmesg=piktalert 2>/dev/null;
fi"
begin
if $level() !~~ "emerg|alert"
=checkpoint(=lalim)
fi
doexec wait "=dmesg | =uniq > =hstdir/log/dmesg=piktalert"
rule
if $level() =~~ "emerg|alert" // alert == urgent
&& $inlin =~~ "=redflags"
leave
elsif $level() =~~ "critical"
&& $inlin =~~ "=yellowflags"
leave
elsif $alarm() eq "ScanDmesg"
&& $inlin =~~ "=redflags|=yellowflags"
leave
else
next
fi
rule // bypass this junk
if $inlin =~~ "^[[:space:]]*[[:digit:]]+[[:space:]]*$"
|| $inlin =~~ "^[[:space:]]*not found!"
#if rotterdam
|| $inlin =~~ "dev fd0"
#endif
next
fi
rule // log anything not bypassed
=output_alarm_log($inlin)
rule // ignore scheduled reboots
if =reboot_period(#daynumber(), #hour())
next
fi
rule // bypasses
if $inlin =~~ "pcie_portdrv_probe->.+has invalid irq|
exception support|exception polling|
obsolete setsockopt|reset.+speed usb|
failed to allocate mem resource|
scsi0: aen: warning|set_dentry_child_flags|
write protect|handling phase mismatch|
too many iterations.+nv_nic_irq|
analog subsections not ready"
next
fi
rule // report the good stuff
output mail $inlin
end // make dmesg*.bak for diff'ing against next time around
exec wait "=mv =hstdir/log/dmesg=piktalert
=hstdir/log/dmesg=piktalert=_.bak"
...
We actually invoke the DmesgScan script in three different contexts.
First, within the Urgent alert group. Here is the Urgent stanza from alerts.cfg:
///////////////////////////////////////////////////////////////////////////////
Urgent // things that deserve nearly immediate attention
timing 15,45 * * * * 1
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Urgent'
=sysadmins-urgent"
status active
level urgent
alarms ...
DmesgScan
...
///////////////////////////////////////////////////////////////////////////////
Second, within the Critical alert group in alerts.cfg:
///////////////////////////////////////////////////////////////////////////////
Critical // things that should be dealt with before too long,
// preferably by day's end
timing 0,30 * * * * 1
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Critical'
=sysadmins-critical"
status active
level critical
alarms ...
DmesgScan
...
///////////////////////////////////////////////////////////////////////////////
Returning to the DmesgScan script, and skipping ahead in the script for a moment, in the begin section, we first check--but for low-priority scripts only--the system load average
if $level() !~~ "emerg|alert"
=checkpoint(=lalim)
fi
using the =checkpoint() & =lalim macros:
lalim // load average limit
#if missioncritical
0.50
#else
# ifdef debug
0.10
# elsedef
1.00
# endifdef
#endif
checkpoint(LA)
#ifdef debug
output $chop($command("=uptime | =awk '{print \$10}'"))
#endifdef
set #la = #val($chop($command("=uptime | =awk '{print \$10}'")))
if #la >= (LA)
output log "=logdir/Checkpoint.log"
"quit $alarm(), load average $text(#la, 2)"
quit
fi
In other words, if the load average is unusually high (where "high" means > 0.5 on missioncritical systems--defined in systems.cfg--else > 1.0 everywhere else), we abort the script.
(You might consider adding the =checkpoint() macro to the begin section of many of your Pikt scripts, especially the CPU and system resources intensive ones. There's no sense in PIKT making a bad situation worse! Having said that, we also can say that with well crafted scripts properly scheduled, PIKT adds very little overhead to the system under normal operations.)
If the load average is not too high, hence we can proceed beyond the =checkpoint(), we capture dmesg output (squelching any duplicate entries by means of the =uniq filter) to a log file:
doexec wait "=dmesg | =uniq > =hstdir/log/dmesg=piktalert"
Only after processing the begin section do we open the script input. In the 'input proc' statement
input proc "if [ -e =hstdir/log/dmesg=piktalert=_.bak ];
then =diff =hstdir/log/dmesg=piktalert=_.bak
=hstdir/log/dmesg=piktalert |
=egrep '>' | =cut -b 3- 2>/dev/null;
else =cat =hstdir/log/dmesg=piktalert 2>/dev/null;
fi"
we check for the existence of a dmesgUrgent.bak (or dmesgCritical.bak or ...) backup file. If it exists, we diff it against the current dmesgUrgent (or ...) file created in the begin section. We pipe the diff output to an egrep & cut filter to show just the new stuff (minus the line-beginning '> '). Otherwise, if there is no backup file, we simply cat the current dmesgUrgent (or ...) file. (We could use 'dmesg -c' to clear the ring buffer each time we run the script, allowing us to greatly simplify getting at the new stuff. But we prefer to retain the dmesg contents for other purposes. Hence all of these complexities.)
So, when all is said and done, the script input is just any new entries in the current dmesg ring buffer.
Entering the script input loop, we first check for entries of interest:
rule
if $level() =~~ "emerg|alert" // alert == urgent
&& $inlin =~~ "=redflags"
leave
elsif $level() =~~ "critical"
&& $inlin =~~ "=yellowflags"
leave
elsif $alarm() eq "ScanDmesg"
&& $inlin =~~ "=redflags|=yellowflags"
leave
else
next
fi
The =redflags and =yellowflags macros, used here and widely elsewhere within our PIKT setup, are defined as
redflags fail|error| err |die|down|seg.*fault|terminate|attack|abort
yellowflags warn|no such|broken|invalid|attempt|time(d )*out|reset|corrupt|
not responding| bad |going down|unauthorized|exception|halt|
authentication|obsolete|not ready|not found|abnormal|correctable|
unexpected|please enable|can.t find|unable to|no device found|
too many|is bad|can.t read|corrected|does not setup|inconsistent|
fixed|runaway|mismatch| is off|overdue|could not be|quirk detected|
cannot allocate|spurious
In other words, the first rule says:
- If this is a high priority alert, with a level of "emerg" (emergency) or "alert" (aka "urgent", in PIKT-speak), then only check for red-flagged, especially troublesome entries. (So if the conditions are true, we 'leave' this rule and proceed to subsequent rules. 'next' would have us go on to the next input line, without proceeding to subsequent rules.)
- Else if this is a not-so-high priority alert, with a level of "critical", only check for less-troubling, yellow-flagged entries.
- Else if the alarm script's name is "ScanDmesg", a script to be run interactively from the command-line, check for both red-flagged and yellow-flagged entries.
- Else for low-priority alerts, or for input lines that don't pattern match =redflag or =yellowflag, disregard the current dmesg entry, and move on to the next (in the script input stream).
Of course, you can still use preprocessor #if and #ifdef directives and other controls to customize your scripts before run-time, at the time of installation. All things considered, these are very powerful features that allow us to write fewer, adaptable, all-purpose scripts rather than many, many inflexible single-use scripts which are basically all alike except for a few details. (Although sometimes simplicity in script design is better. It all depends.)
In the DmesgScan script's second rule, we bypass some uninteresting junk entries. For the rotterdam system, we also bypass any entry reporting that system's flakey floppy disk drive.
In the third rule, we log anything not bypassed so far in a special DmesgScan.log log file using the macro
output_alarm_log(M) output log "=logdir/" . $alarm() . ".log" (M)So, later on even if we don't e-mail certain entries because we've filtered them out, we still have them logged, just in case.
In the fourth rule
rule // ignore scheduled reboots
if =reboot_period(#daynumber(), #hour())
next
fi
on certain systems, which are automatically rebooted early in the morning each day, we ignore their reboot entries (much the same day after day) using the macro
reboot_period(D, H) // (D) is currently unused
#if tabletserver
((H) < 2) // (after midnight &) before 2 am
#else
((H) < 0) // i.e., never
#endif
In the fifth rule, we bypass some oft-appearing nuisance entries. (But remember: we logged them in the previous rule.)
In the sixth and final rule, here is where we send the alert e-mail. A typical message might look something like
PIKT ALERT
Sun Mar 18 21:15:05 2007
milan
CRITICAL:
DmesgScan
Scan the dmesg output for troublesome or noteworthy entries
usb 2-1.2: reset low speed USB device using ohci_hcd and address 18
usb 2-1.2: reset low speed USB device using ohci_hcd and address 18
hub 2-1:1.0: cannot reset port 3 (err = -19)
hub 2-1:1.0: cannot reset port 3 (err = -19)
hub 2-1:1.0: cannot reset port 3 (err = -19)
usb 2-1.2: reset low speed USB device using ohci_hcd and address 21
usb 2-1.2: reset low speed USB device using ohci_hcd and address 23
...
Finally, in the script end section, we mv the current dmesgUrgent (or ...) file to its backup version for diffing the next time around.
DmesgScan is another rather elaborate example script showcasing many of PIKT's advanced features. Don't be intimidated by all the potential complexity! Make your Pikt scripts as simple or as complex as you wish. Use #include files, macros, #ifdef and #if directives, self-adapting scripts, and other sophisticated PIKT techniques depending on your needs and according to your style and taste. Complexity and cleverness are cool, but there's also wisdom in the KISS Principle ("Keep It Simple, Stupid.").
Refer to the PIKT Samples for many, many other script examples, large and small, long and short, simple and complex.
| | 1st page | next page |