alerts.cfg
In this sample alerts.cfg file, we specify the what, where, and when of alarm scripts--which alarm scripts to run, where and when to run them, at what priority ("nice" level), and where to send their output (typically e-mail).
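As a rough guide to how those elements map onto the file's directives, here is a minimal annotated stanza. It is a sketch assembled from pieces of the sample below, not a recommended configuration. The alert name Example is hypothetical, the indentation is added only for readability, and the trailing sixth timing field is copied from the sample as-is.

```
#if piktmaster | piktmistress		// where: only on the piktmaster or piktmistress
Example					// the alert (hypothetical name)
	timing		0 * * * * 0	// when: cron-style fields; here, on the hour
	nicecmd		"=nice -n 19"	// priority: the "nice" level
	mailcmd		"=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Example' =piktadmin"
					// output: e-mailed to =piktadmin
	status		active
	level		info
	alarms				// what: the alarm scripts to run
		HostnameCheck
		DoNothing		// placeholder
#endif
```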
In this example, we group alarms mainly by importance: Emergency, Urgent, Critical, Warning, and so on. We also segregate by staff: SysAdmins, Coders, and so on. You may do it this way, or you may group alarms by department (Marketing, Development, Finance), by personnel (AlertsTom, AlertsDick, AlertsHarry), by functionality (Security, Backups, Users, Patches), by timing (Morning, Evening, Overnight, Hourly, Daily), or by any combination of these. Do what makes sense in your situation.
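For instance, grouping by functionality rather than by importance might look like the following sketch. The Security alert name comes from the list above; its schedule, mail recipient, and choice of alarms are assumptions pieced together from directives and alarm names that appear elsewhere in this sample.

```
Security
	timing		30 21 * * * 10	// sun-sat
	nicecmd		"=nice -n 19"
	mailcmd		"=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Security' =sysadmins-warning"
	status		active
	level		warning
	alarms
#if ! nometalog
		SyslogPwdfailScan
		SyslogSshdScan
#endif
#if passwdserver
		PasswdFile
		Pwck
		Grpck
#endif
		DoNothing		// placeholder
```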
We also specify production and test versions of most alerts, so that there is, for example, both a SysAdminsCritical and a SysAdminsCriticalTest alert. See Script Development and Testing.
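The pattern behind those paired alerts, reduced to its skeleton, looks roughly like this. The sketch assumes that =pikttest expands to Test when pikttest is defined and to nothing otherwise (the macro's definition is not shown on this page), so that one stanza yields either the production alert or its Test twin. The alarms are taken from the Hourly alert in the sample, with the host conditions dropped for brevity.

```
Hourly=pikttest				// Hourly, or HourlyTest (assumed expansion)
	timing		0 * * * * 0
#ifndef pikttest
	status		active		// production version
#elsedef
	status		testing		// test version
#endifdef
	level		info
	alarms
#ifndef pikttest
		HostnameCheck		// alarms in production
#elsedef
		PIKTHeartbeat		// alarms still under test
#endifdef
		DoNothing		// placeholder
```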
In this alerts.cfg, several scripts are invoked more than once, in different contexts. LoadAverage, for example, runs within the EMERGENCY, SysAdminsUrgent, and CodersUrgent alerts, and also within the LoadAverages alert. Such scripts adapt their behavior to their context, as determined from their $alert(), $alarm(), and $level() values and by other means. See Scanning a Log File for a discussion of these techniques.
This example alerts.cfg applies to a business environment where the focus is on keeping production servers up and running smoothly, 24/7, and with as little downtime as possible. See also an older style alerts.cfg from a university environment, where the focus was on user account, process, and disk space management, and a decidedly different way of naming, arranging, and scheduling alerts was used.
This is a rather elaborate example alerts.cfg file, with many different alarm scripts and timing subtleties. Especially for smaller organizations, a typical alerts.cfg might be much simpler than this.
///////////////////////////////////////////////////////////////////////////////
//
// PIKT alerts.cfg -- grouping and scheduling alarm and program scripts
//
///////////////////////////////////////////////////////////////////////////////
//
// (please see the comments prefacing the sample macros.cfg about
// configuration file complexity and parse error debugging)
//
///////////////////////////////////////////////////////////////////////////////
//
// when ordering your alarms, put the most important at the head of the
// list so that they will appear at the top of any emailed alerts
//
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
#if piktmaster | piktmistress
RED=pikttest // Liebert AC/power infrastructure stuff
timing 0,5,10,15,20,25,30,35,40,45,50,55 * * * * 0 // every 5 mins,
// every day
nicecmd "=nice -n -10"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: RED' =pikt-emergency"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level emergency
alarms
#ifndef pikttest
# if piktmaster | piktmistress
PowerDown
ACDown
# endif
# if piktmaster | piktmistress
SNMPLiebert
# endif
DoNothing // placeholder
#elsedef // pikttest
DoNothing // placeholder
#endifdef // pikttest
#endif // piktmaster | piktmistress
///////////////////////////////////////////////////////////////////////////////
Every5Minutes=pikttest // things that need to run often, i.e.,
// every five minutes
// typically for logging purposes
timing 0,5,10,15,20,25,30,35,40,45,50,55 * * * * 0 // sun-sat,
// all hrs
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Every5Minutes' =piktadmin"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level info
alarms
#ifndef pikttest
# if piktmistress
SysDown
# endif
DoNothing // placeholder
#elsedef // pikttest
# if piktmaster | piktmistress
PIKTHeartbeat
# endif
# if client
RunawayCPUProcsStats
# endif
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
EMERGENCY=pikttest // things that require immediate attention
// general interest and truly emergency stuff only
#if missioncritical | piktmaster | piktmistress
timing
0,5,10,15,20,25,30,35,40,45,50,55 0-15 * * 1-5 1 // mon-fri, mrn&day
0,10,20,30,40,50 16-23 * * 1-5 1 // mon-fri, eve hrs
0 * * * 0,6 5 // sat-sun, all hrs
#else
timing
0,10,20,30,40,50 0-15 * * 1-5 1 // mon-fri, mrn&day
0 16-23 * * 1-5 5 // mon-fri, eve hrs
0 0,6,12,18 * * 0,6 5 // sat-sun, all hrs
#endif
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: EMERGENCY' =pikt-emergency"
lpcmd "=lp =piktprinter"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level emergency
alarms
#ifndef pikttest
# if missioncritical
DiskCap
# else
// full disks not emergency
# endif
# if missioncritical
LoadAverage
ProcessCounts
ZombieCounts
CPUUsage
# else
// any of these issues on non mission-critical
// systems is never considered an emergency
# endif
# if missioncritical
RunawayCPUProcs
RunawayMEMProcs
# else
// RunawayCPUProcs
// RunawayMEMProcs
# endif
DoNothing // placeholder
#elsedef // pikttest
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
Hourly=pikttest // things that need to run hourly
timing 0 * * * * 0 // sun-sat, all hrs
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Hourly' =piktadmin"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level info
alarms
#ifndef pikttest
# if piktmaster
HostnameCheck
# endif
DoNothing // placeholder
#elsedef // pikttest
#if ! piktmaster
PIKTHeartbeat
#endif
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
SysAdminsUrgent=pikttest // things that deserve nearly immediate attention
#if missioncritical
timing
15 0-6 * * 1-5 1 // mon-fri, mrn
15 7-15 * * 1-5 2 // mon-fri, day
15 16-23 * * 1-5 2 // mon-fri, eve hrs
15 */2 * * 0,6 2 // sat-sun, all hrs
#else
timing
15 0-15 * * 1-5 2 // mon-fri, mrn&day
15 22 * * 0-4 2 // sun-thu, eve hrs
#endif
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Urgent' =sysadmins-urgent"
lpcmd "=lp =piktprinter"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level urgent
alarms
#ifndef pikttest
# if piktmaster
SysDown
WinNetDown
# endif
DiskCap
LoadAverage
ProcessCounts
ZombieCounts
CPUUsage
RunawayCPUProcs
DmesgScan
# if ! nometalog
SyslogKernelScan
# endif
#elsedef // pikttest
ProcessSystemDead
# if server
MissingAcmeProcesses
# endif
# if piktmaster | piktmistress
PIKTREDLogScan // moved here from Debug
# endif
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
CodersUrgent=pikttest // things that deserve nearly immediate attention
#if missioncritical
timing
45 0-6 * * 1-5 1 // mon-fri, mrn
45 7-15 * * 1-5 2 // mon-fri, day
45 16-23 * * 1-5 2 // mon-fri, eve hrs
45 */2 * * 0,6 2 // sat-sun, all hrs
#else
timing
45 0-15 * * 1-5 2 // mon-fri, mrn&day
45 22 * * 0-4 2 // sun-thu, eve hrs
#endif
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Urgent' =coders-urgent"
lpcmd "=lp =piktprinter"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level urgent
alarms
#ifndef pikttest
//# if server
LoadAverage
ProcessCounts
ZombieCounts
CPUUsage
RunawayCPUProcs
DmesgScan
# if ! nometalog
SyslogKernelScan
# endif
//# endif
# if server
MissingAcmeProcesses
# endif
# if dbprimary
NdbdOutFileScan
# endif
DoNothing // placeholder
#elsedef // pikttest
ProcessSystemDead
# if server
ACMEProcessListChange
# endif
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
SysAdminsCritical=pikttest // things that should be dealt with before too long,
// preferably by day's end; (things reported here
// may not be especially "critical" but are so
// designated to conform with syslog's log levels)
#if missioncritical
timing
30 0-15 * * 1-5 1 // mon-fri, mrn&day
30 18,21 * * 1-5 2 // mon-fri, eve hrs
30 */6 * * 0,6 2 // sat-sun, all hrs
#else
# ifndef holiday
timing
30 0-15 * * 1-5 2 // mon-fri, mrn&day
30 21 * * 0-4 2 // sun-thu, eve hrs
# elsedef
timing =piktnever
# endifdef
#endif
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Critical' =sysadmins-critical"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level critical
alarms
#ifndef pikttest
# if piktmaster
RpcDown
SshDown
// HttpDown
# endif
# if ! down
SysReboot
# endif
DmesgScan
# if ! nometalog
SyslogCriticalScan
# endif
RunawayMEMProcs
#elsedef // pikttest
# if piktmaster
SmtpDown
# endif
// lphung // not particularly useful
// LpDisabled // not particularly useful
# if passwdserver
PasswdFile
# endif
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
CodersCritical=pikttest // things that should be dealt with before too long,
// preferably by day's end; (things reported here
// may not be especially "critical" but are so
// designated to conform with syslog's log levels)
#if missioncritical
timing
0 0-15 * * 1-5 1 // mon-fri, mrn&day
0 18,21 * * 1-5 2 // mon-fri, eve hrs
0 */6 * * 0,6 2 // sat-sun, all hrs
#else
# ifndef holiday
timing
0 0-15 * * 1-5 2 // mon-fri, mrn&day
0 21 * * 0-4 2 // sun-thu, eve hrs
# elsedef
timing =piktnever
# endifdef
#endif
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Critical' =coders-critical"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level critical
alarms
#ifndef pikttest
RunawayCPUProcs
RunawayMEMProcs
#elsedef // pikttest
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
SysAdminsWarning=pikttest // things that need attention, if not today, then
// eventually; after looking at warning alerts, we
// often just delete them at the end of the day,
// clearing the deck for the next day's warnings
#ifndef holiday
timing 30 21 * * * 10 // sun-sat
#elsedef
timing =piktnever
#endifdef
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Warning' =sysadmins-warning"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level warning
alarms
#ifndef pikttest
DiskCap
NightlyUpdatesLogScan
// DoNothing // placeholder
#elsedef // pikttest
# if ! nometalog
SyslogCrondScan
SyslogPwdfailScan
SyslogSshdScan
// SyslogEverythingScan // not worth the extra load
FixMetaLogfilePermissions
# endif
# if passwdserver
PasswdFile
Pwck
Grpck
# endif
DoNothing // placeholder
#endifdef
///////////////////////////////////////////////////////////////////////////////
CodersWarning=pikttest // things that need attention, if not today, then
// eventually; after looking at warning alerts, we
// often just delete them at the end of the day,
// clearing the deck for the next day's warnings
#ifndef holiday
timing 30 21 * * * 10 // sun-sat
#elsedef
timing =piktnever
#endifdef
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Warning' =coders-warning"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level warning
alarms
#ifndef pikttest
NightlyUpdatesLogScan
// DoNothing // placeholder
#elsedef // pikttest
DoNothing // placeholder
#endifdef
///////////////////////////////////////////////////////////////////////////////
CodersNotice=pikttest // things that deserve our attention but may or may
// not require action; if we are busy, we can usually
// safely ignore most or all notice alerts (simply
// delete them)
#ifndef holiday
timing 30 20 * * 0-4 10 // sun-thu
#elsedef
timing =piktnever
#endifdef
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Notice' =coders-notice"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level notice
alarms
#ifndef pikttest
ACMEDATFreshFiles
# if server & ! paris
ACMECrontabChange
# endif
// ACMEScriptChange
ACMEScriptModeCheck
#elsedef // pikttest
ACMEScriptChange
DoNothing // placeholder
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
PIKTAdminInfo // for informational purposes only, or for
// occasional housekeeping tasks; we usually just
// glance at info alerts or ignore them
// altogether (simply delete them)
#ifndef holiday
timing 30 22 * * 0-4 10 // sun-thu
#elsedef
timing =piktnever
#endifdef
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Info' =pikt-info"
#ifndef pikttest
status active
#elsedef
status testing
#endifdef
level info
alarms
#if piktmaster
SysUp
NewHosts
#endif
RootCrontabChange
TruncatePIKTLogFiles
FixPIKTLogfilePermissions
#if server
NonAcmeProcesses
#endif
///////////////////////////////////////////////////////////////////////////////
Debug // for PIKT self-monitoring; these deserve
// fairly close attention, especially on the
// piktmaster, where we not only run more often,
// we also cron it
#ifndef holiday
# if piktmaster | piktmistress
// run Debug, via either piktd or cron, every 3 hours,
// but only overnight (not during the day)
// crond runs Debug at alternating intervals like so:
// 55 5,23 * * * /usr/bin/nice -n -10 /pikt/bin/pikt ... +A Debug
timing 55 0,2,4 * * 1-5 5 // mon-fri
# else
timing 55 0,2,4 * * 1-5 5 // mon-fri
# endif
#elsedef
timing =piktnever
#endifdef
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Debug' =pikt-debug"
status active
level debug
alarms
// PiktUpdateLog
#if piktmaster | piktmistress
// it's important to run PiktDaemonProblems
// independently in two separate alerts, Critical
// and (our choice) Debug; if you just run it in one
// alert, and that alert hangs, then you miss this
// vital alarm; we recommend, too, that you run the
// Debug alert via cron; in addition to the above
// schedule, where 'pikt +A Debug' is invoked by
// piktd, we also have cron invoke Debug (from our
// root crontab):
//
// 55 1,3,5,7,9,11,13,15,17,19,21,23 * * *
// /usr/bin/nice -10 /pikt/bin/pikt ... +A Debug
//
// so, we run PiktDaemonProblems independently
// under three different schedules:
//
// 30 * * * * [in the Critical alert,
// invoked by piktd]
// 55 0-22/2 * * * [in the Debug alert,
// invoked by piktd]
// 55 1,3,5,7,9,11,13,15,17,19,21,23 * * *
// [in the Debug alert,
// invoked by cron]
//PiktDaemonProblems
#endif
//#if ! piktmaster
// PIKTHeartbeat
//#endif
PIKTStatusCheck
PersistentPiktRun
StalePIKTLockFiles
StalePIKTLogFiles
StalePIKTHstFiles
#if piktmaster | piktmistress
// PIKTREDLogScan // moved to SysAdminsUrgentTest
PIKTREDTestLogScan
#endif
PIKTEMERGENCYLogScan
PIKTEMERGENCYTestLogScan
PIKTEvery5MinutesLogScan
PIKTEvery5MinutesTestLogScan
PIKTHourlyLogScan
PIKTHourlyTestLogScan
PIKTSysAdminsUrgentLogScan
PIKTSysAdminsUrgentTestLogScan
PIKTCodersUrgentLogScan
PIKTCodersUrgentTestLogScan
PIKTSysAdminsCriticalLogScan
PIKTSysAdminsCriticalTestLogScan
PIKTCodersCriticalLogScan
PIKTCodersCriticalTestLogScan
PIKTSysAdminsWarningLogScan
PIKTSysAdminsWarningTestLogScan
PIKTCodersWarningLogScan
PIKTCodersWarningTestLogScan
PIKTCodersNoticeLogScan
PIKTPIKTAdminInfoLogScan
#if piktmaster | piktmistress
PIKTDownSystemsLogScan
#endif
#if piktmaster
PIKTCheckDatabaseLogScan
PIKTDownRpcLogScan
PIKTSysRebootsLogScan
PIKTScanDmesgLogScan
PIKTScanSyslogCriticalLogScan
PIKTMissingAcmeProcessesLogScan
PIKTScriptChangesLogScan
PIKTLoadAveragesLogScan
PIKTCPUUsageLogScan
#endif
PIKTpiktc_svcLogScan
PIKTpiktdLogScan
#if piktmaster
PIKTpiktcLogScan
#endif
///////////////////////////////////////////////////////////////////////////////
#if dbprimary
CheckDatabase
timing 45 0,20,22 * * * 1
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: CheckDatabase' =pikt-debug"
status active
level urgent
alarms
CheckDB
#endif
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
// diagnostic scripts, to be run in emergency situations, for example:
// piktc -x +C "hostname; /pikt/bin/pikt +A ScanDmesg; echo" +H server
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
PIKTStatus
status active
level debug
scripts PIKTStatusCheck
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
DownSystems
status active
level info
scripts SysDown
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
DownServers
status active
level info
scripts SysDown
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
DownClients
status active
level info
scripts SysDown
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
DownRpc
status active
level info
scripts RpcDown
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
DownRpcServers
status active
level info
scripts RpcDown
#endif
///////////////////////////////////////////////////////////////////////////////
#if piktmaster
DownRpcClients
status active
level info
scripts RpcDown
#endif
///////////////////////////////////////////////////////////////////////////////
SysReboots
status active
level info
scripts SysReboot
///////////////////////////////////////////////////////////////////////////////
ScanDmesg
status active
level info
scripts DmesgScan
///////////////////////////////////////////////////////////////////////////////
ScanSyslogCritical
status active
level info
scripts SyslogCriticalScan
///////////////////////////////////////////////////////////////////////////////
ScanSyslogKernel
status active
level info
scripts SyslogKernelScan
///////////////////////////////////////////////////////////////////////////////
// shut down 07/03/01, as it is potentially too resource-intensive
/*
# if server
ScanLogFiles
status active
level info
scripts LogFileScan
# endif // server
*/
///////////////////////////////////////////////////////////////////////////////
# if server
MissingAcmeProcesses
status active
level info
scripts MissingAcmeProcesses
# endif // server
///////////////////////////////////////////////////////////////////////////////
ScriptChanges
status active
level info
scripts ACMEScriptChange
///////////////////////////////////////////////////////////////////////////////
LoadAverages
status active
level info
scripts LoadAverage
///////////////////////////////////////////////////////////////////////////////
Processes
status active
level info
scripts ProcessCounts
///////////////////////////////////////////////////////////////////////////////
Zombies
status active
level info
scripts ZombieCounts
///////////////////////////////////////////////////////////////////////////////
CPUUsage
status active
level info
scripts CPUUsage
///////////////////////////////////////////////////////////////////////////////
RunawayCPUProcs
status active
level info
scripts RunawayCPUProcs
///////////////////////////////////////////////////////////////////////////////
RunawayMEMProcs
status active
level info
scripts RunawayMEMProcs
///////////////////////////////////////////////////////////////////////////////
DiskCap
status active
level info
scripts DiskCap
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
#ifdef pikttest
//# if piktmaster
Test // use this for testing newly developed alarm scripts;
// install with 'piktc -iv +D test +A Test +H ...'
// or maybe 'piktc -iv +D test debug verbose -D page doexec +A Test +H ...'
// after testing, remove all traces of
// the Test alert with 'piktc -tv +A Test +H ...'
timing 45 6 * * * 0
// timing */5 * * * * 0
// timing 5 * * * * 0
// timing =piktnever
// drift 30
nicecmd "=nice -n 19"
mailcmd "=mailx -a 'From: piktadmin'
-s 'PIKT Alert on =pikthostname: Test' =pikt-test"
lpcmd "=lp =piktprinter"
status testing
level debug
alarms
// SNMPLiebert
// ShoutTest
//#if piktmaster
// SysUp
// NewHosts
//#endif
DiskCap
DoNothing
//# endif // piktmaster
#endifdef // pikttest
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
(A note about page layout: In the interest of readability, we have added artificial line wraps in many examples. Even though displayed here broken up across several screen lines, in general quoted strings, preprocessor directives, macro definitions, .log & .conf entries, and so on should all be unbroken on a single line.)
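For example, the EMERGENCY alert's mailcmd, shown above wrapped across two lines, would be entered in the actual file as a single line (assuming the two halves are joined by one space):

```
mailcmd "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: EMERGENCY' =pikt-emergency"
```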
For more examples, see Samples.