Broken Link Checker (external links)
The CheckBrokenLinksExternal link checker is a PIKT script that checks the validity of external URL references within a collection of HTML documents. We were driven to develop our own broken link checker when our previous link checker failed us badly, reporting too many false negatives (missing genuine breaks) and too many false positives (flagging transient breaks that would go away in a day or two).
In the 'input proc' statement below, we use the Unix find command to find all .htm files within the web documents tree. (In this example, we are checking for broken links at the Early MusiChicago website. Substitute .html, a different document root directory, and other particulars as needed.) For every .htm file, we invoke the find_urls.pl script to find all URLs within the file. Filtered through the '=sort | =uniq' commands, the 'input proc' statement emits a list of all unique URLs found anywhere in the website's HTML page collection. (Since the documents are local--entirely accessible on disk--we avoid any complicated, recursive spidering of our own, directly accessible website.)
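Here is the 'input proc' statement as it appears in the full script below:

        input proc "=find =emcwebdir -name \\*.htm -exec /usr/local/bin/find_urls.pl {} \\; | =sort | =uniq"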
In the 'dat $url [1]' statement, we automatically assign each input line to the $url variable. With the 'keys $url' statement, we make possible past-value references (e.g., to %timebroken) later on in the script.
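These two statements appear in the script's 'init' section:

        dat $url [1]
        keys $url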
In the opening 'begin' block, we read in the HTTPStatusCodes.obj file (see http_status_codes_objects.cfg), since for any broken link, we want to see a reason (e.g., HTTP_NOT_FOUND) instead of just a cryptic status code number (e.g., 404). For this purpose, we use the =read_http_status_codes() PIKT macro:
read_http_status_codes(V)
    if #fopen(CODES, "=objdir/HTTPStatusCodes.obj", "r") != #err()
        while #read(CODES) > 0
            do #split($rdlin)
            set $(V)[$1] = $2
        endwhile
        do #fclose(CODES)
    else
        output mail "Can't open =objdir/HTTPStatusCodes.obj for reading!"
        quit
    fi
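The macro is invoked in the script's 'begin' block, loading the code-to-name mappings into the $st array for later use in the report output (as $st[$statcode]):

    begin
        =read_http_status_codes(st)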
Following a variable initialization rule, we do some bypasses. First, we bypass local URLs, since these are handled elsewhere by a different PIKT script (see CheckBrokenLinksInternal). Second, we skip any image or other special data files. (Include these if you wish.) Next, we skip some commercial URLs and other miscellaneous URL references that are problematic for one reason or another. (Again, include these if you wish.) Then we skip some other URLs to bulletin board pages, ftp: references, etc.
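For instance, the local-URL and image bypasses are expressed in the script as:

    rule    // bypass local urls, since we check these elsewhere
        if $url =~ "http:\/\/earlymusichicago.org"
            next
        fi
    rule    // bypass images and other special data files
        if $url =~ "images/|\\.jpg|\\.gif|\\.wav"
            next
        fi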
Finally, we check the current URL. The urlstat.pl helper script fetches the resource (web page, image file, etc.) at the given URL and reports the HTTP status code followed by the URL, for example:
403 http://www.ticketmaster.com
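In the script, this check is performed by the following rule, which captures urlstat.pl's output in $stat and splits off the status code:

    rule    // okay, now we investigate the url
        set $stat = $command("/usr/local/bin/urlstat.pl '$url' /tmp/urlstat.tmp")
        do #split($stat)
        set $statcode = $1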
If the $stat result is non-empty (has non-zero #length()), or if the /tmp/urlstat.tmp file is zero bytes, we enter the 'if ... fi' block. (urlstat.pl only reports anything for unsuccessful HTTP requests--or if nothing was retrieved at the URL location, despite what the HTTP status code might suggest.)
Remember that '%timebroken' is the #timebroken value for the given URL from the last time this CheckBrokenLinksExternal script was run--that is, the #now() time value recorded when the URL was first found broken, or 0 if it wasn't broken. If this URL is newly "broken" (%timebroken is <= 0, i.e., #timebroken for this URL was left at 0 last time; see the very first rule in this script, where we initialize the #timebroken value for the current URL), we set the present #timebroken to the #now() time value. Otherwise, we set the current #timebroken value to what it was last time.
If the URL is newly broken, we set its #daysbroken value to 1, else we compute the number of days that this URL has been broken with the statement
set #daysbroken = #int((#now() - #timebroken)/=secs_in_day)
(Note that =secs_in_day is a standard PIKT macro defined as '(60*60*24)'.)
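Within the final reporting rule of the script, this logic appears as:

    if %timebroken <= 0
        set #timebroken = #now()
        set #daysbroken = 1
    else
        set #timebroken = %timebroken
        set #daysbroken = #int((#now() - #timebroken)/(=secs_in_day))
    fi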
If this URL has been broken at least 3 days, we (a) report the URL; (b) report its status (both the status code and its identifier), along with how many days it's been broken; and (c) report all files containing this broken link (by means of a find and egrep combination).
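In the script, that reporting step reads:

    if #daysbroken >= 3
        output mail "$url"
        output mail " $statcode [$st[$statcode]], broken $text(#daysbroken) days"
        // report the files with this url
        do #popen(FIND, "=find =emcwebdir -name \\*.htm -exec =egrep -il '$url' {} \\;", "r")
        while #read(FIND) > 0
            output mail " $rdlin"
        endwhile
        do #pclose(FIND)
        output mail $newline()
    fi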
In the final rule, to be a good, polite Net citizen (so as not to hammer some target site with repeated URL checks in too quick succession), we pause for a second before proceeding to the next URL.
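That final rule is simply:

    rule    // be polite
        pause 1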
If you wish, you could report broken links immediately, as in
if #daysbroken >= 0
But sometimes a URL is offline only momentarily (for example, the web server is down for maintenance, there is a network breakdown somewhere, or some other transient glitch). Unless a URL has been broken for some minimum number of days (in our case, 3), we consider the "brokenness" to be a possible false positive. We don't want to remove a "broken" URL immediately every time our broken link checker (or anybody else's) reports a break.
We could go farther with this--reporting, for example, 404s (HTTP_NOT_FOUND) immediately, other types of breakage after 3 days, still others after 5 days, and so on. Complicate this script to suit your own situation as needed.
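One way to express this (an untested sketch, not part of the script below; the #threshold variable is our own invention) would be to compute a per-status-code threshold in the reporting rule, then test 'if #daysbroken >= #threshold' in place of the literal 'if #daysbroken >= 3':

    set #threshold = 3              // default: report after 3 days
    if $statcode eq "404"
        set #threshold = 0          // report HTTP_NOT_FOUND immediately
    fi
    if $statcode eq "500"
        set #threshold = 5          // give server errors 5 days to recover
    fi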
More possible refinements: Ignore certain error return codes entirely. Add rules to ignore special cases (for example, URLs with a status code of 200 but that return no content). And so on.
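For example, to ignore redirect status codes entirely, you might add a bypass rule (again, a hypothetical addition, not in the script below) right after $statcode is set:

    rule    // ignore redirects entirely
        if $statcode eq "301" || $statcode eq "302"
            next
        fi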
The complete CheckBrokenLinksExternal script follows.
#if emcwebsys

BrokenLinksExternalEMC
    init
        status =piktstatus
        level =piktlevel
        task "Check for broken external EMC links"
        input proc "=find =emcwebdir -name \\*.htm -exec /usr/local/bin/find_urls.pl {} \\; | =sort | =uniq"
        dat $url [1]
        keys $url
    begin
        =read_http_status_codes(st)
    rule    // initialize for current url
        set #timebroken = 0
    rule    // bypass local urls, since we check these elsewhere
        if $url =~ "http:\/\/earlymusichicago.org"
            next
        fi
    rule    // bypass images and other special data files
        if $url =~ "images/|\\.jpg|\\.gif|\\.wav"
            next
        fi
    rule    // bypass commercial urls and other misc that return a status
            // code of 200 but no data (result in an empty urlstat.tmp
            // file; see below); or that we want to disregard for some
            // other reason
        if $url =~ "www\\.amazon\\.com/exec/obidos" || $url =~ "www.google\\.com/images" || $url =~ "wikipedia\\.org/wiki" || $url =~ "www\\.sheetmusicplus\\.com/.+search\\.html" || $url =~ "webring\\.com"
            next
        fi
    rule    // bypass these internal pages
        if $url =~ "yabb_emc/YaBB.cgi?board="
            next
        fi
    rule    // ignore ftp: and others
        if $inlin !~ "^http:\/\/"
            next
        fi
    rule    // okay, now we investigate the url
        set $stat = $command("/usr/local/bin/urlstat.pl '$url' /tmp/urlstat.tmp")
        do #split($stat)
        set $statcode = $1
        // set $url = $2    // $url already set

    // exceptions follow

    rule    // status code 200
        if $statcode eq "200" && ($url =~~ "www\\.google\\.com")
            next
        fi
    rule    // status code 403
        if $statcode eq "403" && ( $url =~~ "www\\.ticketmaster\\.com" || $url =~~ "directory\\.google\\.com" )
            next
        fi
    rule    // status code 406
        if $statcode eq "406" && ( $url =~~ "www\\.bagpiper\\.com" || $url =~~ "www\\.bagpipeweb\\.com" )
            next
        fi
    rule    // status code 500
        if $statcode eq "500" && ($url =~~ "earlymusicchicago\\.org")
            next
        fi
    rule
        if #length($stat) || -z "/tmp/urlstat.tmp"
            if %timebroken <= 0
                set #timebroken = #now()
                set #daysbroken = 1
            else
                set #timebroken = %timebroken
                set #daysbroken = #int((#now() - #timebroken)/(=secs_in_day))
            fi
            if #daysbroken >= 3
                output mail "$url"
                output mail " $statcode [$st[$statcode]], broken $text(#daysbroken) days"
                // report the files with this url
                do #popen(FIND, "=find =emcwebdir -name \\*.htm -exec =egrep -il '$url' {} \\;", "r")
                while #read(FIND) > 0
                    output mail " $rdlin"
                endwhile
                do #pclose(FIND)
                output mail $newline()
            fi
        fi
    rule    // be polite
        pause 1

#endif  // emcwebsys
Here is some example script output:
http://www.Ticketmaster.com
    403 [HTTP_FORBIDDEN], broken 3 days
    /var/www/html/emc/events_december_2003.htm

http://www.avdgs.org.au/
    500 [HTTP_INTERNAL_SERVER_ERROR], broken 3 days
    /var/www/html/emc/instruments_strings.htm

    403 [HTTP_FORBIDDEN], broken 8 days
    /var/www/html/emc/composers_renaissance.htm

...
For more examples, see Samples.