Site-Wide System Scanning
(NOTE: Some of the techniques shown or described on this page--marked in purple--require new features in the latest official PIKT 1.19.0 release (pikt-current.tar.gz) that are unavailable in any previous version.)
Actually, in addition to the Urgent and Critical alert groups, we make use of DmesgScan in one other context within alerts.cfg:
///////////////////////////////////////////////////////////////////////////////
ScanDmesg
status active
level info
scripts DmesgScan
///////////////////////////////////////////////////////////////////////////////
We have similar brief stanzas in alerts.cfg, and a macro in macros.cfg referencing them all:
scripts
DownSystems
DownServers
DownClients
DownRpc
DownRpcServers
DownRpcClients
SysReboots
ScanDmesg
ScanSyslogCritical
ScanSyslogKernel
LoadAverages
Processes
Zombies
RunawayCPUProcs
RunawayMEMProcs
CPUUsage
...
section DownProd
section MissingAcmeProcesses
...
We also make a few additions and/or adjustments to scripts, for example to the =service_downage() script macro:
service_downage(S, I, T, A, M)
...
rule
if $alert() =~~ "client"
if " $missioncritical " =~ " $host "
set $state = %state
next
fi
fi
rule
if $alert() =~ "(A)"
output $host
output =newline
fi
...
rule
if (T)
if $alert() =~ "(A)"
output "$host's (S) services are down ((M))"
output =newline
else
...
fi
next
fi
Note that we have
- added a new macro argument, (A), to the =service_downage() script
- added a new rule to skip mission-critical systems when running client scripts, such as DownRpcClients
- added a new rule to output the $host and a newline, depending on the alert context
- added a new if block to report to screen (not send e-mail via 'output mail') if a host's services are down
#if piktmaster
RpcDown
=service_downage(RPC, =piktc -L +H pikt -H down sick, =rpcfail($host),
DownRpc|DownRpcServers|DownRpcClients,
rpcinfo -p failure)
#endif
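To see how the (A) argument gates the interactive output, the matching can be imitated outside of Pikt. The following is only a sketch: it uses grep -E as a stand-in for the Pikt =~ test and assumes that test is an unanchored regular-expression match. The function name alert_matches is hypothetical and is not part of PIKT.

```shell
#!/bin/sh
# Illustration only: grep -E stands in for the Pikt test
#     if $alert() =~ "(A)"
# alert_matches is a hypothetical helper, not a PIKT command.
alert_matches () {
    # $1 = name of the running alert, $2 = pattern passed as the (A) argument
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

# The pattern used in the RpcDown invocation above.
A='DownRpc|DownRpcServers|DownRpcClients'

for alert in DownRpcClients Urgent; do
    if alert_matches "$alert" "$A"; then
        echo "$alert: report to screen"
    else
        echo "$alert: take the else branch"
    fi
done
```

Under these assumptions, only the alerts named in the pattern take the report-to-screen branch. Any other alert, such as Urgent, takes the else branch.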
We install the listed scripts (with their revised script macro definitions and macro invocations) on the piktmaster and elsewhere using the piktc command
# piktc -ivU +A =scripts -H down sick
Now, at the piktmaster system, we can interactively issue commands to check on the health of our systems, for example:
A pikt command, run on the piktmaster, that polls all servers, reporting if any of them are down:
# pikt -U +A DownServers 2>&1 | tee /tmp/DownServers.out
A pikt command, run on the piktmaster, that polls all clients, reporting if the RPC services on any client system are down:
# pikt -U +A DownRpcClients 2>&1 | tee /tmp/DownRpcClients.out
(RPC services are essential to PIKT operations, so if RPC is down on any client system, the piktmaster can't communicate with that system, and we would want to know that.)
A piktc command, also run on the piktmaster, that runs a Pikt script remotely on any PIKT slave system, for example:
# piktc -xvU +C "/pikt/bin/pikt -U +A ScanDmesg" +H helsinki
We have many other single-purpose scanning scripts that look for problems on any system within our PIKT network.
In fact, we have aggregated all of these system check scripts into one uber-script, /pikt/lib/programs/scansys.sh:
#!/bin/bash
# Print a divider line, padded above and below with a blank line.
function divider () {
    echo ""
    echo "###############################################################################"
    echo ""
}
# Print a divider followed by the section title given in $1.
function section () {
    divider
    echo "$1"
    echo ""
}
# -s scans servers, -c scans clients; with no option, scan all systems.
if [ "$1" = "-s" ]; then
    SYS=server
elif [ "$1" = "-c" ]; then
    SYS=client
else
    SYS=pikt
fi
section DownSystems
if [ $SYS = "server" ]; then
    /pikt/bin/pikt -U +A DownServers 2>&1 | tee /tmp/DownSystems.out
elif [ $SYS = "client" ]; then
    /pikt/bin/pikt -U +A DownClients 2>&1 | tee /tmp/DownSystems.out
else
    /pikt/bin/pikt -U +A DownSystems 2>&1 | tee /tmp/DownSystems.out
fi
section DownRpc
if [ $SYS = "server" ]; then
    /pikt/bin/pikt -U +A DownRpcServers 2>&1 | tee /tmp/DownRpc.out
elif [ $SYS = "client" ]; then
    /pikt/bin/pikt -U +A DownRpcClients 2>&1 | tee /tmp/DownRpc.out
else
    /pikt/bin/pikt -U +A DownRpc 2>&1 | tee /tmp/DownRpc.out
fi
# The remaining checks run remotely, via piktc, on every system that is up.
section SysReboots
/pikt/bin/piktc -xU +C "hostname; /pikt/bin/pikt -U +A SysReboots; echo" \
    +H $SYS -H down sick 2>&1 | tee /tmp/SysReboots.out
section ScanDmesg
/pikt/bin/piktc -xU +C "hostname; /pikt/bin/pikt -U +A ScanDmesg; echo" \
    +H $SYS -H down sick 2>&1 | tee /tmp/ScanDmesg.out
...
if [ $SYS = "server" -o $SYS = "pikt" ]; then
    section DownProd
    /pikt/bin/pikt -U +A DownProd 2>&1 | tee /tmp/DownProd.out
    section MissingAcmeProcesses
    /pikt/bin/piktc -xU +C "hostname; /pikt/bin/pikt -U +A MissingAcmeProcesses; echo" \
        +H acmeserver -H down sick 2>&1 | tee /tmp/MissingAcmeProcesses.out
    ...
fi
divider
exit
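The three-way choice between the all-systems, server, and client alert repeats for each alert family in scansys.sh. One way to factor it out, shown here purely as a sketch, is a small helper that returns the alert name for the chosen system class. The function alert_for and its argument order are hypothetical and do not appear in the original script.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original scansys.sh.
# Usage: alert_for PREFIX DEFAULT SYS
#   PREFIX  - common stem of the server/client alert names (e.g. Down, DownRpc)
#   DEFAULT - alert to use when scanning all systems (e.g. DownSystems)
#   SYS     - server, client, or anything else for all systems
alert_for () {
    case "$3" in
        server) printf '%s\n' "${1}Servers" ;;
        client) printf '%s\n' "${1}Clients" ;;
        *)      printf '%s\n' "$2" ;;
    esac
}

# Show which alerts each system class would select.
for sys in server client pikt; do
    echo "$sys: $(alert_for Down DownSystems "$sys") $(alert_for DownRpc DownRpc "$sys")"
done
```

With such a helper, each if/elif/else block in the script could shrink to a single line, for example /pikt/bin/pikt -U +A "$(alert_for Down DownSystems "$SYS")" 2>&1 | tee /tmp/DownSystems.out.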
Now we can issue the command, in this case for server systems (specified in the PIKT systems.cfg file):
# /pikt/lib/programs/scansys.sh -s 2>&1 | tee /tmp/scansys.servers.out
or the command, in this case for client systems (defined in systems.cfg):
# /pikt/lib/programs/scansys.sh -c 2>&1 | tee /tmp/scansys.clients.out
or this command, without any program argument, for all systems:
# /pikt/lib/programs/scansys.sh 2>&1 | tee /tmp/scansys.clients.out
and get output like this:
###############################################################################
DownSystems
hamburg
salonika
valencia
oslo is down, or off the network (ping failure)
rome
...
###############################################################################
DownRPC
hamburg
salonika
valencia
oslo
oslo's RPC services are down (rpcinfo -p failure)
rome
...
###############################################################################
SysReboots
hamburg
salonika
reboot system boot 2.6.17-gentoo-r5 Thu May 10 10:31 (13:59)
valencia
oslo
rome
...
###############################################################################
ScanDmesg
hamburg
salonika
valencia
...
glasgow
usb 2-1.2: reset low speed USB device using ohci_hcd and address 7
usb 2-1.2: reset low speed USB device using ohci_hcd and address 9
usb 2-1.2: reset low speed USB device using ohci_hcd and address 11
hub 2-1:1.0: cannot reset port 3 (err = -110)
hub 2-1:1.0: cannot reset port 3 (err = -110)
...
###############################################################################
MissingAcmeProcesses
...
granada
liverpool
zurich
ltdapi
ltdsrv
acmesrv
...
###############################################################################
With all of this, we have a handy means of scanning all of our systems and all log files at will, whenever we want (such as when we first come into work each day, or leave work for the day, or especially when leaving for the weekend) or whenever there is a need to (such as when there are unspecified problems reported somewhere on our network).
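Because each run is also saved with tee, the report can be filtered afterwards to pull out just the trouble spots. The following is a minimal sketch, not part of PIKT: the function name report_down is hypothetical, and its two patterns are taken only from the sample messages shown above, so other scripts may need additional patterns.

```shell
#!/bin/sh
# Hypothetical helper, not part of PIKT: keep only the lines of a scansys
# report that announce a down host or down services. The patterns come from
# the sample messages above; other scripts may word their output differently.
report_down () {
    grep -E 'is down, or off the network|services are down' "$@"
}

# Demonstration on a few lines copied from the sample output.
printf '%s\n' \
    'hamburg' \
    'oslo is down, or off the network (ping failure)' \
    "oslo's RPC services are down (rpcinfo -p failure)" \
    'rome' | report_down
```

The same function accepts file names, so a saved report could be checked with report_down /tmp/scansys.servers.out.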
So, you can see that Pikt scripts are not just for automated problem reporting (and fixing) by way of piktd. We can also run them interactively, either singly, on any PIKT system, or en masse, on the piktmaster, as described above. We'll have more to say about using PIKT interactively at the command line in Enhancing the Command Line.