alerts.cfg
In this sample alerts.cfg file, we specify the what, where, and when of alarm scripts--which alarm scripts to run, where and when to run them, at what priority ("nice" level), and where to send their output (typically e-mail).
In this example, we group alarms mainly by importance (Emergency, Urgent, Critical, Warning, and so on) and also segregate them by staff (SysAdmins, Coders). You may do it this way, and/or you may group alarms by department (for example, Marketing, Development, Finance), by personnel (for example, AlertsTom, AlertsDick, AlertsHarry), by functionality (for example, Security, Backups, Users, Patches), by timing (for example, Morning, Evening, Overnight, Hourly, Daily), and so on. Do what makes sense in your situation.
We also specify production and test versions of each alert, so there are, for example, both a SysAdminsCritical and a SysAdminsCriticalTest alert. See Script Development and Testing.
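The production/test pairing can be sketched with a single stanza, condensed from the sample further down. We assume here that the =pikttest macro expands to a "Test" suffix when pikttest is defined and to nothing otherwise; that is an inference from the alert names on this page, not something the page states.

```
CodersCritical=pikttest         // CodersCriticalTest in a test build
                                // (assumed macro behavior)

        timing          0 0-15 * * 1-5 2

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Critical' =coders-critical"

#ifndef pikttest
        status          active          // production version
#elsedef
        status          testing         // test version
#endifdef

        level           critical

        alarms

#ifndef pikttest
                RunawayCPUProcs         // proven scripts run in production
                RunawayMEMProcs
#elsedef  // pikttest
                DoNothing               // scripts still under test go here
#endifdef  // pikttest
```

Keeping both versions in one stanza means the schedule, mail settings, and level stay identical between production and test; only the status and the alarm list differ.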
In this alerts.cfg, several scripts are invoked more than once but in different contexts. LoadAverage, for example, appears within the EMERGENCY, SysAdminsUrgent, and CodersUrgent alerts, and also within the LoadAverages diagnostic alert. Within such scripts, we adapt their behavior to their context as determined from their $alert(), $alarm(), and $level() values and by still other means. See Scanning a Log File for a discussion of these techniques.
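As a condensed illustration, the excerpt below pulls together three of the places where LoadAverage appears in the sample. The stanzas are trimmed to the lines that differ (the nicecmd, mailcmd, status, and preprocessor lines are omitted), so treat it as a reading aid rather than an installable configuration.

```
EMERGENCY=pikttest
        timing          0,10,20,30,40,50 0-15 * * 1-5 1
        level           emergency
        alarms
                LoadAverage             // runs at the emergency level

SysAdminsUrgent=pikttest
        timing          15 0-15 * * 1-5 2
        level           urgent
        alarms
                LoadAverage             // same script, urgent level

LoadAverages                            // diagnostic alert, run on demand
        status          active
        level           info
        scripts
                LoadAverage             // same script, info level
```

The script itself is written once; the enclosing alert and its level are what change from one invocation to the next.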
This example alerts.cfg applies to a business environment where the focus is on keeping production servers up and running smoothly, 24/7, and with as little downtime as possible. See also an older style alerts.cfg from a university environment, where the focus was on user account, process, and disk space management, and a decidedly different way of naming, arranging, and scheduling alerts was used.
This is a rather elaborate example alerts.cfg file, with many different alarm scripts and timing subtleties. Especially for smaller organizations, a typical alerts.cfg might be much simpler than this.
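For comparison, a much smaller alerts.cfg might look something like the sketch below. It is hypothetical: the two alert names are invented for illustration, the script names and the =sysadmins-urgent and =sysadmins-warning macros are borrowed from the sample that follows and are assumed to be defined elsewhere, and the file has not been checked against a PIKT installation.

```
Urgent          // hypothetical: things needing prompt attention

        timing          15 * * * * 2

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Urgent' =sysadmins-urgent"

        status          active

        level           urgent

        alarms
                DiskCap
                LoadAverage

Warning         // hypothetical: things that can wait until tomorrow

        timing          30 21 * * * 10

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Warning' =sysadmins-warning"

        status          active

        level           warning

        alarms
                DmesgScan
```

Two alerts, one schedule each, and no preprocessor conditionals is often enough to start with; host classes, test versions, and holiday handling can be layered on later.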
///////////////////////////////////////////////////////////////////////////////
//
// PIKT alerts.cfg -- grouping and scheduling alarm and program scripts
//
///////////////////////////////////////////////////////////////////////////////
//
// (please see the comments prefacing the sample macros.cfg about
// configuration file complexity and parse error debugging)
//
///////////////////////////////////////////////////////////////////////////////
//
// when ordering your alarms, put the most important at the head of the
// list so that they will appear at the top of any emailed alerts
//
///////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////

#if piktmaster | piktmistress

RED=pikttest    // Liebert AC/power infrastructure stuff

        timing          0,5,10,15,20,25,30,35,40,45,50,55 * * * * 0     // every 5 mins,
                                                                        // every day

        nicecmd         "=nice -n -10"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: RED' =pikt-emergency"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           emergency

        alarms

#ifndef pikttest
# if piktmaster | piktmistress
                PowerDown
                ACDown
# endif
# if piktmaster | piktmistress
                SNMPLiebert
# endif
                DoNothing       // placeholder
#elsedef  // pikttest
                DoNothing       // placeholder
#endifdef  // pikttest

#endif  // piktmaster | piktmistress

///////////////////////////////////////////////////////////////////////////////

Every5Minutes=pikttest  // things that need to run often, i.e.,
                        // every five minutes
                        // typically for logging purposes

        timing          0,5,10,15,20,25,30,35,40,45,50,55 * * * * 0     // sun-sat,
                                                                        // all hrs

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Every5Minutes' =piktadmin"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           info

        alarms

#ifndef pikttest
# if piktmistress
                SysDown
# endif
                DoNothing       // placeholder
#elsedef  // pikttest
# if piktmaster | piktmistress
                PIKTHeartbeat
# endif
# if client
                RunawayCPUProcsStats
# endif
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

EMERGENCY=pikttest      // things that require immediate attention
                        // general interest and truly emergency stuff only

#if missioncritical | piktmaster | piktmistress
        timing          0,5,10,15,20,25,30,35,40,45,50,55 0-15 * * 1-5 1        // mon-fri, mrn&day
                        0,10,20,30,40,50 16-23 * * 1-5 1        // mon-fri, eve hrs
                        0 * * * 0,6 5                           // sat-sun, all hrs
#else
        timing          0,10,20,30,40,50 0-15 * * 1-5 1         // mon-fri, mrn hrs
                        0 16-23 * * 1-5 5                       // mon-fri, eve hrs
                        0 0,6,12,18 * * 0,6 5                   // sat-sun, all hrs
#endif

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: EMERGENCY' =pikt-emergency"

        lpcmd           "=lp =piktprinter"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           emergency

        alarms

#ifndef pikttest
# if missioncritical
                DiskCap
# else
                // full disks not emergency
# endif
# if missioncritical
                LoadAverage
                ProcessCounts
                ZombieCounts
                CPUUsage
# else
                // any of these issues on non mission-critical
                // systems is never considered an emergency
# endif
# if missioncritical
                RunawayCPUProcs
                RunawayMEMProcs
# else
//              RunawayCPUProcs
//              RunawayMEMProcs
# endif
                DoNothing       // placeholder
#elsedef  // pikttest
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

Hourly=pikttest         // things that need to run hourly

        timing          0 * * * * 0     // sun-sat, all hrs

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Hourly' =piktadmin"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           info

        alarms

#ifndef pikttest
# if piktmaster
                HostnameCheck
# endif
                DoNothing       // placeholder
#elsedef  // pikttest
#if ! piktmaster
                PIKTHeartbeat
#endif
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

SysAdminsUrgent=pikttest        // things that deserve nearly immediate attention

#if missioncritical
        timing          15 0-6 * * 1-5 1        // mon-fri, mrn
                        15 7-15 * * 1-5 2       // mon-fri, day
                        15 16-23 * * 1-5 2      // mon-fri, eve hrs
                        15 */2 * * 0,6 2        // sat-sun, all hrs
#else
        timing          15 0-15 * * 1-5 2       // mon-fri, mrn&day
                        15 22 * * 0-4 2         // sun-thu, eve hrs
#endif

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Urgent' =sysadmins-urgent"

        lpcmd           "=lp =piktprinter"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           urgent

        alarms

#ifndef pikttest
# if piktmaster
                SysDown
                WinNetDown
# endif
                DiskCap
                LoadAverage
                ProcessCounts
                ZombieCounts
                CPUUsage
                RunawayCPUProcs
                DmesgScan
# if ! nometalog
                SyslogKernelScan
# endif
#elsedef  // pikttest
                ProcessSystemDead
# if server
                MissingAcmeProcesses
# endif
# if piktmaster | piktmistress
                PIKTREDLogScan  // moved here from Debug
# endif
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

CodersUrgent=pikttest   // things that deserve nearly immediate attention

#if missioncritical
        timing          45 0-6 * * 1-5 1        // mon-fri, mrn
                        45 7-15 * * 1-5 2       // mon-fri, day
                        45 16-23 * * 1-5 2      // mon-fri, eve hrs
                        45 */2 * * 0,6 2        // sat-sun, all hrs
#else
        timing          45 0-15 * * 1-5 2       // mon-fri, mrn&day
                        45 22 * * 0-4 2         // sun-thu, eve hrs
#endif

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Urgent' =coders-urgent"

        lpcmd           "=lp =piktprinter"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           urgent

        alarms

#ifndef pikttest
//# if server
                LoadAverage
                ProcessCounts
                ZombieCounts
                CPUUsage
                RunawayCPUProcs
                DmesgScan
# if ! nometalog
                SyslogKernelScan
# endif
//# endif
# if server
                MissingAcmeProcesses
# endif
# if dbprimary
                NdbdOutFileScan
# endif
                DoNothing       // placeholder
#elsedef  // pikttest
                ProcessSystemDead
# if server
                ACMEProcessListChange
# endif
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

SysAdminsCritical=pikttest      // things that should be dealt with before too long,
                                // preferably by day's end; (things reported here
                                // may not be especially "critical" but are so
                                // designated to conform with syslog's log levels)

#if missioncritical
        timing          30 0-15 * * 1-5 1       // mon-fri, mrn&day
                        30 18,21 * * 1-5 2      // mon-fri, eve hrs
                        30 */6 * * 0,6 2        // sat-sun, all hrs
#else
# ifndef holiday
        timing          30 0-15 * * 1-5 2       // mon-fri, mrn&day
                        30 21 * * 0-4 2         // sun-thu, eve hrs
# elsedef
        timing          =piktnever
# endifdef
#endif

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Critical' =sysadmins-critical"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           critical

        alarms

#ifndef pikttest
# if piktmaster
                RpcDown
                SshDown
//              HttpDown
# endif
# if ! down
                SysReboot
# endif
                DmesgScan
# if ! nometalog
                SyslogCriticalScan
# endif
                RunawayMEMProcs
#elsedef  // pikttest
# if piktmaster
                SmtpDown
# endif
//              lphung          // not particularly useful
//              LpDisabled      // not particularly useful
# if passwdserver
                PasswdFile
# endif
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

CodersCritical=pikttest         // things that should be dealt with before too long,
                                // preferably by day's end; (things reported here
                                // may not be especially "critical" but are so
                                // designated to conform with syslog's log levels)

#if missioncritical
        timing          0 0-15 * * 1-5 1        // mon-fri, mrn&day
                        0 18,21 * * 1-5 2       // mon-fri, eve hrs
                        0 */6 * * 0,6 2         // sat-sun, all hrs
#else
# ifndef holiday
        timing          0 0-15 * * 1-5 2        // mon-fri, mrn&day
                        0 21 * * 0-4 2          // sun-thu, eve hrs
# elsedef
        timing          =piktnever
# endifdef
#endif

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Critical' =coders-critical"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           critical

        alarms

#ifndef pikttest
                RunawayCPUProcs
                RunawayMEMProcs
#elsedef  // pikttest
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

SysAdminsWarning=pikttest       // things that need attention, if not today, then
                                // eventually; after looking at warning alerts, we
                                // often just delete them at the end of the day,
                                // clearing the deck for the next day's warnings

#ifndef holiday
        timing          30 21 * * * 10  // sun-sat
#elsedef
        timing          =piktnever
#endifdef

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Warning' =sysadmins-warning"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           warning

        alarms

#ifndef pikttest
                DiskCap
                NightlyUpdatesLogScan
//              DoNothing       // placeholder
#elsedef  // pikttest
# if ! nometalog
                SyslogCrondScan
                SyslogPwdfailScan
                SyslogSshdScan
//              SyslogEverythingScan    // not worth the extra load
                FixMetaLogfilePermissions
# endif
# if passwdserver
                PasswdFile
                Pwck
                Grpck
# endif
                DoNothing       // placeholder
#endifdef

///////////////////////////////////////////////////////////////////////////////

CodersWarning=pikttest  // things that need attention, if not today, then
                        // eventually; after looking at warning alerts, we
                        // often just delete them at the end of the day,
                        // clearing the deck for the next day's warnings

#ifndef holiday
        timing          30 21 * * * 10  // sun-sat
#elsedef
        timing          =piktnever
#endifdef

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Warning' =coders-warning"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           warning

        alarms

#ifndef pikttest
                NightlyUpdatesLogScan
//              DoNothing       // placeholder
#elsedef  // pikttest
                DoNothing       // placeholder
#endifdef

///////////////////////////////////////////////////////////////////////////////

CodersNotice=pikttest   // things that deserve our attention but may or may
                        // not require action; if we are busy, we can usually
                        // safely ignore most or all notice alerts (simply
                        // delete them)

#ifndef holiday
        timing          30 20 * * 0-4 10        // sun-thu
#elsedef
        timing          =piktnever
#endifdef

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Notice' =coders-notice"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           notice

        alarms

#ifndef pikttest
                ACMEDATFreshFiles
# if server & ! paris
                ACMECrontabChange
# endif
//              ACMEScriptChange
                ACMEScriptModeCheck
#elsedef  // pikttest
                ACMEScriptChange
                DoNothing       // placeholder
#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

PIKTAdminInfo   // for informational purposes only, or for
                // occasional housekeeping tasks; we usually just
                // glance at info alerts or ignore them
                // altogether (simply delete them)

#ifndef holiday
        timing          30 22 * * 0-4 10        // sun-thu
#elsedef
        timing          =piktnever
#endifdef

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Info' =pikt-info"

#ifndef pikttest
        status          active
#elsedef
        status          testing
#endifdef

        level           info

        alarms

#if piktmaster
                SysUp
                NewHosts
#endif
                RootCrontabChange
                TruncatePIKTLogFiles
                FixPIKTLogfilePermissions
#if server
                NonAcmeProcesses
#endif

///////////////////////////////////////////////////////////////////////////////

Debug           // for PIKT self-monitoring; these deserve
                // fairly close attention, especially on the
                // piktmaster, where we not only run more often,
                // we also cron it

#ifndef holiday
# if piktmaster | piktmistress
        // run Debug, via either piktd or cron, every 3 hours,
        // but only overnight (not during the day)
        // crond runs Debug at alternating intervals like so:
        //   55 5,23 * * * /usr/bin/nice -n -10 /pikt/bin/pikt ... +A Debug
        timing          55 0,2,4 * * 1-5 5      // mon-fri
# else
        timing          55 0,2,4 * * 1-5 5      // mon-fri
# endif
#elsedef
        timing          =piktnever
#endifdef

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Debug' =pikt-debug"

        status          active

        level           debug

        alarms

//              PiktUpdateLog
#if piktmaster | piktmistress
                // it's important to run PiktDaemonProblems
                // independently in two separate alerts, Critical
                // and (our choice) Debug; if you just run it in one
                // alert, and that alert hangs, then you miss this
                // vital alarm; we recommend, too, that you run the
                // Debug alert via cron; in addition to the above
                // schedule, where 'pikt +A Debug' is invoked by
                // piktd, we also have cron invoke Debug (from our
                // root crontab):
                //
                //   55 1,3,5,7,9,11,13,15,17,19,21,23 * * *
                //     /usr/bin/nice -10 /pikt/bin/pikt ... +A Debug
                //
                // so, we run PiktDaemonProblems independently
                // under three different schedules:
                //
                //   30 * * * *         [in the Critical alert,
                //                       invoked by piktd]
                //   55 0-22/2 * * *    [in the Debug alert,
                //                       invoked by piktd]
                //   55 1,3,5,7,9,11,13,15,17,19,21,23 * * *
                //                      [in the Debug alert,
                //                       invoked by cron]
//PiktDaemonProblems
#endif
//#if ! piktmaster
//              PIKTHeartbeat
//#endif
                PIKTStatusCheck
                PersistentPiktRun
                StalePIKTLockFiles
                StalePIKTLogFiles
                StalePIKTHstFiles
#if piktmaster | piktmistress
//              PIKTREDLogScan  // moved to SysAdminsUrgentTest
                PIKTREDTestLogScan
#endif
                PIKTEMERGENCYLogScan
                PIKTEMERGENCYTestLogScan
                PIKTEvery5MinutesLogScan
                PIKTEvery5MinutesTestLogScan
                PIKTHourlyLogScan
                PIKTHourlyTestLogScan
                PIKTSysAdminsUrgentLogScan
                PIKTSysAdminsUrgentTestLogScan
                PIKTCodersUrgentLogScan
                PIKTCodersUrgentTestLogScan
                PIKTSysAdminsCriticalLogScan
                PIKTSysAdminsCriticalTestLogScan
                PIKTCodersCriticalLogScan
                PIKTCodersCriticalTestLogScan
                PIKTSysAdminsWarningLogScan
                PIKTSysAdminsWarningTestLogScan
                PIKTCodersWarningLogScan
                PIKTCodersWarningTestLogScan
                PIKTCodersNoticeLogScan
                PIKTPIKTAdminInfoLogScan
#if piktmaster | piktmistress
                PIKTDownSystemsLogScan
#endif
#if piktmaster
                PIKTCheckDatabaseLogScan
                PIKTDownRpcLogScan
                PIKTSysRebootsLogScan
                PIKTScanDmesgLogScan
                PIKTScanSyslogCriticalLogScan
                PIKTMissingAcmeProcessesLogScan
                PIKTScriptChangesLogScan
                PIKTLoadAveragesLogScan
                PIKTCPUUsageLogScan
#endif
                PIKTpiktc_svcLogScan
                PIKTpiktdLogScan
#if piktmaster
                PIKTpiktcLogScan
#endif

///////////////////////////////////////////////////////////////////////////////

#if dbprimary

CheckDatabase

        timing          45 0,20,22 * * * 1

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: CheckDatabase' =pikt-debug"

        status          active

        level           urgent

        alarms
                CheckDB

#endif

///////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////

// diagnostic scripts, to be run in emergency situations, for example:

//   piktc -x +C "hostname; /pikt/bin/pikt +A ScanDmesg; echo" +H server

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

PIKTStatus

        status          active
        level           debug
        scripts
                PIKTStatusCheck

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

DownSystems

        status          active
        level           info
        scripts
                SysDown

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

DownServers

        status          active
        level           info
        scripts
                SysDown

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

DownClients

        status          active
        level           info
        scripts
                SysDown

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

DownRpc

        status          active
        level           info
        scripts
                RpcDown

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

DownRpcServers

        status          active
        level           info
        scripts
                RpcDown

#endif

///////////////////////////////////////////////////////////////////////////////

#if piktmaster

DownRpcClients

        status          active
        level           info
        scripts
                RpcDown

#endif

///////////////////////////////////////////////////////////////////////////////

SysReboots

        status          active
        level           info
        scripts
                SysReboot

///////////////////////////////////////////////////////////////////////////////

ScanDmesg

        status          active
        level           info
        scripts
                DmesgScan

///////////////////////////////////////////////////////////////////////////////

ScanSyslogCritical

        status          active
        level           info
        scripts
                SyslogCriticalScan

///////////////////////////////////////////////////////////////////////////////

ScanSyslogKernel

        status          active
        level           info
        scripts
                SyslogKernelScan

///////////////////////////////////////////////////////////////////////////////

// shut down 07/03/01, as it is potentially too resource-intensive

/*

# if server

ScanLogFiles

        status          active
        level           info
        scripts
                LogFileScan

# endif  // server

*/

///////////////////////////////////////////////////////////////////////////////

# if server

MissingAcmeProcesses

        status          active
        level           info
        scripts
                MissingAcmeProcesses

# endif  // server

///////////////////////////////////////////////////////////////////////////////

ScriptChanges

        status          active
        level           info
        scripts
                ACMEScriptChange
///////////////////////////////////////////////////////////////////////////////

LoadAverages

        status          active
        level           info
        scripts
                LoadAverage

///////////////////////////////////////////////////////////////////////////////

Processes

        status          active
        level           info
        scripts
                ProcessCounts

///////////////////////////////////////////////////////////////////////////////

Zombies

        status          active
        level           info
        scripts
                ZombieCounts

///////////////////////////////////////////////////////////////////////////////

CPUUsage

        status          active
        level           info
        scripts
                CPUUsage

///////////////////////////////////////////////////////////////////////////////

RunawayCPUProcs

        status          active
        level           info
        scripts
                RunawayCPUProcs

///////////////////////////////////////////////////////////////////////////////

RunawayMEMProcs

        status          active
        level           info
        scripts
                RunawayMEMProcs

///////////////////////////////////////////////////////////////////////////////

DiskCap

        status          active
        level           info
        scripts
                DiskCap

///////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////

#ifdef pikttest

//# if piktmaster

Test    // use this for testing newly developed alarm scripts;
        // install with 'piktc -iv +D test +A Test +H ...'
        // or maybe 'piktc -iv +D test debug verbose -D page doexec +A Test +H ...'
        // after testing, remove all traces of
        // the Test alert with 'piktc -tv +A Test +H ...'

        timing          45 6 * * * 0
//      timing          */5 * * * * 0
//      timing          5 * * * * 0
//      timing          =piktnever

//      drift           30

        nicecmd         "=nice -n 19"

        mailcmd         "=mailx -a 'From: piktadmin' -s 'PIKT Alert on =pikthostname: Test' =pikt-test"

        lpcmd           "=lp =piktprinter"

        status          testing

        level           debug

        alarms

//              SNMPLiebert
//              ShoutTest
//#if piktmaster
//              SysUp
//              NewHosts
//#endif
                DiskCap
                DoNothing

//# endif  // piktmaster

#endifdef  // pikttest

///////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////
(A note about page layout: In the interest of readability, we have added artificial line wraps in many examples. Even though displayed here broken up across several screen lines, in general quoted strings, preprocessor directives, macro definitions, .log & .conf entries, and so on should all be unbroken on a single line.)
For more examples, see Samples.