Broken Link Checker (external links)
The CheckBrokenLinksExternal link checker is a PIKT script that checks the validity of external URL references within a collection of HTML documents. We were driven to develop our own broken link checker when our previous link checker failed us badly, reporting too many false negatives (missing genuine breaks) and too many false positives (flagging transient breaks that would go away in a day or two).
In the 'input proc' statement below, we use the Unix find command to find all .htm files within the web documents tree. (In this example, we are checking for broken links at the Early MusiChicago website. Substitute .html, a different document root directory, and other particulars as needed.) For every .htm file, we invoke the find_urls.pl script to find all URLs within the file. Filtered through the '=sort | =uniq' commands, the 'input proc' statement emits a list of all unique URLs found anywhere in the website's HTML page collection. (Since the documents are local--entirely accessible on disk--we avoid any complicated, recursive spidering of our own, directly accessible website.)
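Here is the 'input proc' statement as it appears in the full script below:

        input proc "=find =emcwebdir -name \\*.htm -exec /usr/local/bin/find_urls.pl {} \\; | =sort | =uniq"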
In the 'dat $url [1]' statement, we automatically assign each input line to the $url variable. With the 'keys $url' statement, we make possible past-value references (e.g., to %timebroken) later on in the script.
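These two statements appear in the script's 'init' section:

        dat $url [1]
        keys $url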
In the opening 'begin' block, we read in the HTTPStatusCodes.obj file (see http_status_codes_objects.cfg), since for any broken link, we want to see a reason (e.g., HTTP_NOT_FOUND) instead of just a cryptic status code number (e.g., 404). For this purpose, we use the =read_http_status_codes() PIKT macro:
read_http_status_codes(V)
    if #fopen(CODES, "=objdir/HTTPStatusCodes.obj", "r") != #err()
        while #read(CODES) > 0
            do #split($rdlin)
            set $(V)[$1] = $2
        endwhile
        do #fclose(CODES)
    else
        output mail "Can't open =objdir/HTTPStatusCodes.obj for reading!"
        quit
    fi
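The macro is invoked in the script's 'begin' block, loading the code-to-name mappings into the $st array for later use in the report output (as $st[$statcode]):

    begin
        =read_http_status_codes(st)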
Following a variable initialization rule, we do some bypasses. First, we bypass local URLs, since these are handled elsewhere by a different PIKT script (see CheckBrokenLinksInternal). Second, we skip any image or other special data files. (Include these if you wish.) Next, we skip some commercial URLs and other miscellaneous URL references that are problematic for one reason or another. (Again, include these if you wish.) Then we skip some other URLs to bulletin board pages, ftp: references, etc.
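For instance, the local-URL and image bypasses are expressed in the script as:

    rule    // bypass local urls, since we check these elsewhere
        if $url =~ "http:\/\/earlymusichicago.org"
            next
        fi
    rule    // bypass images and other special data files
        if $url =~ "images/|\\.jpg|\\.gif|\\.wav"
            next
        fi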
Finally, we check the current URL. The urlstat.pl helper script fetches the resource (web page, image file, etc.) at the given URL and reports the HTTP status code followed by the URL, for example:
403 http://www.ticketmaster.com
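In the script, this check is performed by the following rule, which captures urlstat.pl's output in $stat and splits off the status code:

    rule    // okay, now we investigate the url
        set $stat = $command("/usr/local/bin/urlstat.pl '$url' /tmp/urlstat.tmp")
        do #split($stat)
        set $statcode = $1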
If the $stat result is non-empty (has non-zero #length()), or if the /tmp/urlstat.tmp file is zero bytes, we enter the 'if ... fi' block. (urlstat.pl only reports anything for unsuccessful HTTP requests--or if nothing was retrieved at the URL location, despite what the HTTP status code might suggest.)
Remember that '%timebroken' is the #timebroken value for the given URL from the last time this CheckBrokenLinksExternal script was run--that is, the #now() time value recorded when the URL was first found broken, or 0 if it wasn't broken. If this URL is newly "broken" (%timebroken is <= 0, i.e., #timebroken for this URL was left at 0 last time; see the very first rule in this script, where we initialize the #timebroken value for the current URL), we set the present #timebroken to the #now() time value. Otherwise, we set the current #timebroken value to what it was last time.
If the URL is newly broken, we set its #daysbroken value to 1, else we compute the number of days that this URL has been broken with the statement
set #daysbroken = #int((#now() - #timebroken)/=secs_in_day)
(Note that =secs_in_day is a standard PIKT macro defined as '(60*60*24)'.)
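Within the final reporting rule of the script, this logic appears as:

    if %timebroken <= 0
        set #timebroken = #now()
        set #daysbroken = 1
    else
        set #timebroken = %timebroken
        set #daysbroken = #int((#now() - #timebroken)/(=secs_in_day))
    fi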
If this URL has been broken at least 3 days, we (a) report the URL; (b) report its status (both the status code and its identifier), along with how many days it's been broken; and (c) report all files containing this broken link (by means of a find and egrep combination).
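In the script, that reporting step reads:

    if #daysbroken >= 3
        output mail "$url"
        output mail " $statcode [$st[$statcode]], broken $text(#daysbroken) days"
        // report the files with this url
        do #popen(FIND, "=find =emcwebdir -name \\*.htm -exec =egrep -il '$url' {} \\;", "r")
        while #read(FIND) > 0
            output mail " $rdlin"
        endwhile
        do #pclose(FIND)
        output mail $newline()
    fi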
In the final rule, to be a good, polite Net citizen (so as not to hammer some target site with repeated URL checks in too quick succession), we pause for a second before proceeding to the next URL.
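That final rule is simply:

    rule    // be polite
        pause 1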
If you wish, you could report broken links immediately, as in
if #daysbroken >= 0
But sometimes a URL is offline only momentarily (for example, the web server is down for maintenance, there is a network breakdown somewhere, or some other transient glitch). Unless a URL has been broken for some minimum number of days (in our case, 3), we consider the "brokenness" to be a possible false positive. We don't want to remove a "broken" URL immediately every time our broken link checker (or anybody else's) reports a break.
We could go farther with this--reporting, for example, 404s (HTTP_NOT_FOUND) immediately, other types of breakage after 3 days, still others after 5 days, and so on. Complicate this script to suit your own situation as needed.
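One way to express this (an untested sketch, not part of the script below; the #threshold variable is our own invention) would be to compute a per-status-code threshold in the reporting rule, then test 'if #daysbroken >= #threshold' in place of the literal 'if #daysbroken >= 3':

    set #threshold = 3              // default: report after 3 days
    if $statcode eq "404"
        set #threshold = 0          // report HTTP_NOT_FOUND immediately
    fi
    if $statcode eq "500"
        set #threshold = 5          // give server errors 5 days to recover
    fi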
More possible refinements: Ignore certain error return codes entirely. Add rules to ignore special cases (for example, URLs with a status code of 200 but that return no content). And so on.
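For example, to ignore redirect status codes entirely, you might add a bypass rule (again, a hypothetical addition, not in the script below) right after $statcode is set:

    rule    // ignore redirects entirely
        if $statcode eq "301" || $statcode eq "302"
            next
        fi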
The complete CheckBrokenLinksExternal script follows.
#if emcwebsys

BrokenLinksExternalEMC
    init
        status =piktstatus
        level =piktlevel
        task "Check for broken external EMC links"
        input proc "=find =emcwebdir -name \\*.htm -exec /usr/local/bin/find_urls.pl {} \\; | =sort | =uniq"
        dat $url [1]
        keys $url
    begin
        =read_http_status_codes(st)
    rule    // initialize for current url
        set #timebroken = 0
    rule    // bypass local urls, since we check these elsewhere
        if $url =~ "http:\/\/earlymusichicago.org"
            next
        fi
    rule    // bypass images and other special data files
        if $url =~ "images/|\\.jpg|\\.gif|\\.wav"
            next
        fi
    rule    // bypass commercial urls and other misc that return a status
            // code of 200 but no data (result in an empty urlstat.tmp
            // file; see below); or that we want to disregard for some
            // other reason
        if $url =~ "www\\.amazon\\.com/exec/obidos" || $url =~ "www.google\\.com/images" || $url =~ "wikipedia\\.org/wiki" || $url =~ "www\\.sheetmusicplus\\.com/.+search\\.html" || $url =~ "webring\\.com"
            next
        fi
    rule    // bypass these internal pages
        if $url =~ "yabb_emc/YaBB.cgi?board="
            next
        fi
    rule    // ignore ftp: and others
        if $inlin !~ "^http:\/\/"
            next
        fi
    rule    // okay, now we investigate the url
        set $stat = $command("/usr/local/bin/urlstat.pl '$url' /tmp/urlstat.tmp")
        do #split($stat)
        set $statcode = $1
        // set $url = $2    // $url already set

    // exceptions follow

    rule    // status code 200
        if $statcode eq "200" && ($url =~~ "www\\.google\\.com")
            next
        fi
    rule    // status code 403
        if $statcode eq "403" && ( $url =~~ "www\\.ticketmaster\\.com" || $url =~~ "directory\\.google\\.com" )
            next
        fi
    rule    // status code 406
        if $statcode eq "406" && ( $url =~~ "www\\.bagpiper\\.com" || $url =~~ "www\\.bagpipeweb\\.com" )
            next
        fi
    rule    // status code 500
        if $statcode eq "500" && ($url =~~ "earlymusicchicago\\.org")
            next
        fi
    rule
        if #length($stat) || -z "/tmp/urlstat.tmp"
            if %timebroken <= 0
                set #timebroken = #now()
                set #daysbroken = 1
            else
                set #timebroken = %timebroken
                set #daysbroken = #int((#now() - #timebroken)/(=secs_in_day))
            fi
            if #daysbroken >= 3
                output mail "$url"
                output mail " $statcode [$st[$statcode]], broken $text(#daysbroken) days"
                // report the files with this url
                do #popen(FIND, "=find =emcwebdir -name \\*.htm -exec =egrep -il '$url' {} \\;", "r")
                while #read(FIND) > 0
                    output mail " $rdlin"
                endwhile
                do #pclose(FIND)
                output mail $newline()
            fi
        fi
    rule    // be polite
        pause 1

#endif  // emcwebsys
Here is some example script output:
http://www.Ticketmaster.com
    403 [HTTP_FORBIDDEN], broken 3 days
    /var/www/html/emc/events_december_2003.htm

http://www.avdgs.org.au/
    500 [HTTP_INTERNAL_SERVER_ERROR], broken 3 days
    /var/www/html/emc/instruments_strings.htm

    403 [HTTP_FORBIDDEN], broken 8 days
    /var/www/html/emc/composers_renaissance.htm

...
For more examples, see Samples.