Site-Wide System Scanning
(NOTE: Some of the techniques shown or described on this page--marked in purple--require new features in the latest official PIKT 1.19.0 release (pikt-current.tar.gz) that are unavailable in any previous version.)
Actually, in addition to the Urgent and Critical alert groups, we make use of DmesgScan in one other context within alerts.cfg:
    ///////////////////////////////////////////////////////////////////////////////

    ScanDmesg

            status          active
            level           info

            scripts
                    DmesgScan

    ///////////////////////////////////////////////////////////////////////////////

We have similar brief stanzas in alerts.cfg, and a macro in macros.cfg referencing them all:
    scripts         DownSystems DownServers DownClients DownRpc DownRpcServers DownRpcClients SysReboots ScanDmesg ScanSyslogCritical ScanSyslogKernel LoadAverages Processes Zombies RunawayCPUProcs RunawayMEMProcs CPUUsage ... section DownProd section MissingAcmeProcesses ...

We also make a few additions and/or adjustments to scripts, for example to the =service_downage() script macro:
    service_downage(S, I, T, A, M)

            ...

            rule
                    if $alert() =~~ "client"
                            if " $missioncritical " =~ " $host "
                                    set $state = %state
                                    next
                            fi
                    fi

            rule
                    if $alert() =~ "(A)"
                            output $host
                            output =newline
                    fi

            ...

            rule
                    if (T)
                            if $alert() =~ "(A)"
                                    output "$host's (S) services are down ((M))"
                                    output =newline
                            else
                                    ...
                            fi
                            next
                    fi

Note that we have
- added a new macro argument, (A), to the =service_downage() script
- added a new rule to skip mission-critical systems when running client scripts, such as DownRpcClients
- added a new rule to output the $host and a newline, depending on the alert context
- added a new if block to report to screen (not send e-mail via 'output mail') if a host's services are down
For example, the RpcDown script now passes the relevant alert names as the fourth macro argument:

    #if piktmaster

    RpcDown

            =service_downage(RPC, =piktc -L +H pikt -H down sick, =rpcfail($host), DownRpc|DownRpcServers|DownRpcClients, rpcinfo -p failure)

    #endif

We install the listed scripts (with their revised script macro definitions and macro invocations) on the piktmaster and elsewhere using the piktc command:
    # piktc -ivU +A =scripts -H down sick

Now, at the piktmaster system, we can interactively issue commands to check on the health of our systems, for example:
A pikt command, run on the piktmaster, that polls all servers, reporting if any of them are down:
    # pikt -U +A DownServers 2>&1 | tee /tmp/DownServers.out

A pikt command, run on the piktmaster, that polls all clients, reporting if the RPC services on any client system are down:
    # pikt -U +A DownRpcClients 2>&1 | tee /tmp/DownRpcClients.out

(RPC services are essential to PIKT operations, so if RPC is down on any client system, the piktmaster can't communicate with that system, and we would want to know that.)
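For readers who want to see what such a check amounts to outside of PIKT, here is a rough stand-alone shell sketch. It is not how the DownRpcClients script or the =rpcfail() macro is implemented (neither is shown on this page). It assumes only that `rpcinfo -p HOST` exits with a non-zero status when the host's portmapper cannot be reached. The function name rpc_down_report and the RPC_PROBE override are hypothetical.

```bash
#!/bin/bash
# Hypothetical stand-alone check, NOT the PIKT implementation.
# Prints one line for every host whose RPC portmapper does not answer.
# RPC_PROBE is an assumed override hook (useful for testing); by default
# the probe is "rpcinfo -p HOST", judged by its exit status.
rpc_down_report () {
    for host in "$@"; do
        # RPC_PROBE is deliberately unquoted so "rpcinfo -p" splits into words.
        if ! ${RPC_PROBE:-rpcinfo -p} "$host" >/dev/null 2>&1; then
            echo "$host's RPC services are down (rpcinfo -p failure)"
        fi
    done
}
```

A call such as `rpc_down_report hamburg oslo rome` would print a line only for the hosts that fail the probe. The message wording is copied from the sample PIKT output so that the two are easy to compare.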
A piktc command, also run on the piktmaster, to run a Pikt script remotely on any PIKT slave system, for example:
    # piktc -xvU +C "/pikt/bin/pikt -U +A ScanDmesg" +H helsinki

We have many other single-purpose scanning scripts that look for problems on any system within our PIKT network.
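The remote invocation shown above follows a fixed pattern, so it can be wrapped in a small shell function. The following is only a sketch: the function name remote_scan and the DRY_RUN variable are hypothetical, and the piktc options are copied from the command above rather than taken from a full reading of the piktc reference.

```bash
#!/bin/bash
# Hypothetical helper: run one alert on one or more remote hosts via piktc.
# Usage: remote_scan ALERT HOST [HOST ...]
# With DRY_RUN=1 the command line is printed instead of executed.
remote_scan () {
    alert=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show the command, with the +C argument quoted as it would be typed.
        echo /pikt/bin/piktc -xvU +C "\"/pikt/bin/pikt -U +A $alert\"" +H "$@"
    else
        /pikt/bin/piktc -xvU +C "/pikt/bin/pikt -U +A $alert" +H "$@"
    fi
}
```

Setting `DRY_RUN=1` and calling `remote_scan ScanDmesg helsinki` prints the same piktc command line as the example above without contacting any host, which is a convenient way to check the arguments before running the scan for real.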
In fact, we have aggregated all of these system check scripts into one uber-script, /pikt/lib/programs/scansys.sh:
    #!/bin/bash

    # print a divider line, padded with blank lines
    function divider () {
        echo ""
        echo "###############################################################################"
        echo ""
    }

    # print a divider followed by the section name
    function section () {
        divider
        echo "$1"
        echo ""
    }

    # -s scans servers, -c scans clients, no argument scans all systems
    if [ "$1" = "-s" ]; then
        SYS=server
    elif [ "$1" = "-c" ]; then
        SYS=client
    else
        SYS=pikt
    fi

    section DownSystems
    if [ "$SYS" = "server" ]; then
        /pikt/bin/pikt -U +A DownServers 2>&1 | tee /tmp/DownSystems.out
    elif [ "$SYS" = "client" ]; then
        /pikt/bin/pikt -U +A DownClients 2>&1 | tee /tmp/DownSystems.out
    else
        /pikt/bin/pikt -U +A DownSystems 2>&1 | tee /tmp/DownSystems.out
    fi

    section DownRpc
    if [ "$SYS" = "server" ]; then
        /pikt/bin/pikt -U +A DownRpcServers 2>&1 | tee /tmp/DownRpc.out
    elif [ "$SYS" = "client" ]; then
        /pikt/bin/pikt -U +A DownRpcClients 2>&1 | tee /tmp/DownRpc.out
    else
        /pikt/bin/pikt -U +A DownRpc 2>&1 | tee /tmp/DownRpc.out
    fi

    section SysReboots
    /pikt/bin/piktc -xU +C "hostname; /pikt/bin/pikt -U +A SysReboots; echo" +H $SYS -H down sick 2>&1 | tee /tmp/SysReboots.out

    section ScanDmesg
    /pikt/bin/piktc -xU +C "hostname; /pikt/bin/pikt -U +A ScanDmesg; echo" +H $SYS -H down sick 2>&1 | tee /tmp/ScanDmesg.out

    ...

    # these checks apply only when servers are included in the scan
    if [ "$SYS" = "server" -o "$SYS" = "pikt" ]; then

        section DownProd
        /pikt/bin/pikt -U +A DownProd 2>&1 | tee /tmp/DownProd.out

        section MissingAcmeProcesses
        /pikt/bin/piktc -xU +C "hostname; /pikt/bin/pikt -U +A MissingAcmeProcesses; echo" +H acmeserver -H down sick 2>&1 | tee /tmp/MissingAcmeProcesses.out

        ...

    fi

    divider

    exit

Now we can issue the command, in this case for server systems (specified in the PIKT systems.cfg file):
    # /pikt/lib/programs/scansys.sh -s 2>&1 | tee /tmp/scansys.servers.out

or the command, in this case for client systems (defined in systems.cfg):
    # /pikt/lib/programs/scansys.sh -c 2>&1 | tee /tmp/scansys.clients.out

or this command, without any program argument, for all systems:
    # /pikt/lib/programs/scansys.sh 2>&1 | tee /tmp/scansys.all.out

and get output like this:
    ###############################################################################

    DownSystems

    hamburg
    salonika
    valencia
    oslo is down, or off the network (ping failure)
    rome
    ...

    ###############################################################################

    DownRPC

    hamburg
    salonika
    valencia
    oslo
    oslo's RPC services are down (rpcinfo -p failure)
    rome
    ...

    ###############################################################################

    SysReboots

    hamburg
    salonika
    reboot   system boot  2.6.17-gentoo-r5 Thu May 10 10:31 (13:59)
    valencia
    oslo
    rome
    ...

    ###############################################################################

    ScanDmesg

    hamburg
    salonika
    valencia
    ...
    glasgow
    usb 2-1.2: reset low speed USB device using ohci_hcd and address 7
    usb 2-1.2: reset low speed USB device using ohci_hcd and address 9
    usb 2-1.2: reset low speed USB device using ohci_hcd and address 11
    hub 2-1:1.0: cannot reset port 3 (err = -110)
    hub 2-1:1.0: cannot reset port 3 (err = -110)
    ...

    ###############################################################################

    MissingAcmeProcesses

    ...
    granada
    liverpool
    zurich
    ltdapi
    ltdsrv
    acmesrv
    ...

    ###############################################################################

With all of this, we have a handy means of scanning all of our systems and all log files at will: whenever we want (such as when we first come into work each day, when we leave for the day, or especially when we leave for the weekend) or whenever the need arises (such as when unspecified problems are reported somewhere on our network).
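Because each run is also saved with tee, the saved output can be condensed afterwards. The sketch below pulls out just the "down" reports and labels each with the section it came from. It is an illustration, not part of scansys.sh: the function name summarize_down is hypothetical, and it assumes the output layout shown above (a divider line of '#' characters, then the section name) and matches only the two message wordings that appear in the sample ("is down" and "are down").

```bash
#!/bin/bash
# Hypothetical helper: list every "down" report in saved scansys.sh output,
# prefixed by the name of the section in which it appeared.
# Usage: summarize_down FILE [FILE ...]   (reads standard input if no file)
summarize_down () {
    awk '
        # A divider line: the next non-blank line is a section name.
        /^#+$/ { want_name = 1; next }
        # First non-blank line after a divider: remember it as the section.
        want_name && NF { section = $0; want_name = 0; next }
        # Report lines using the wordings seen in the sample output.
        / (is|are) down/ { print section ": " $0 }
    ' "$@"
}
```

For example, `summarize_down /tmp/scansys.servers.out` would reduce a long server scan to one line per host reported as down. Other kinds of findings, such as the ScanDmesg kernel messages, would need their own patterns.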
So, you can see that Pikt scripts are not just for automated problem reporting (and fixing) by way of piktd. We can also run them interactively, either singly, on any PIKT system, or en masse, on the piktmaster, as described above. We'll have more to say about using PIKT interactively at the command line in Enhancing the Command Line.