Broken Link Checker (external links)

The CheckBrokenLinksExternal link checker is a Pikt script to check the validity of external URL references within a collection of HTML documents.  We were driven to develop our own broken links checker when our previous link checker failed us badly: it missed genuine breaks (false negatives) and reported transient breaks that would clear up in a day or two (false positives).

In the 'input proc' statement below, we use the Unix find command to find all .htm files within the web documents tree.  (In this example, we are checking for broken links at the Early MusiChicago website.  Substitute .html, a different document root directory, and other particulars as needed.)  For every .htm file, we invoke the find_urls.pl script to find all URLs within the file.  Filtered through the '=sort | =uniq' commands, the 'input proc' statement emits a de-duplicated list of all URLs found anywhere in the website's HTML page collection.  (Since the documents are local and entirely accessible on disk, we avoid any complicated recursive spidering of our own website.)
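For illustration, the pipeline's effect can be sketched in Python (a hypothetical helper, not part of PIKT; the href-only regular expression is a simplification of whatever find_urls.pl actually matches):

```python
import re
from pathlib import Path

URL_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

def collect_urls(docroot):
    """Walk the document tree and return a sorted, de-duplicated list
    of URLs found in .htm files -- the same stream the 'input proc'
    pipeline (find | find_urls.pl | sort | uniq) feeds to the script,
    one URL per input line."""
    urls = set()
    for page in Path(docroot).rglob("*.htm"):
        urls.update(URL_RE.findall(page.read_text(errors="ignore")))
    return sorted(urls)
```

A real find_urls.pl would also pick up src= attributes, unquoted URLs, and so on; the point is only the shape of the input stream.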

In the 'dat $url [1]' statement, we automatically assign each input line to the $url variable.  With the 'keys $url' statement, we make possible past-value references (e.g., to %timebroken) later on in the script.

In the opening 'begin' block, we read in the HTTPStatusCodes.obj file (see http_status_codes_objects.cfg), since for any broken link, we want to see a reason (e.g., HTTP_NOT_FOUND) instead of just a cryptic status code number (e.g., 404).  For this purpose, we use the =read_http_status_codes() PIKT macro:

read_http_status_codes(V)
		if #fopen(CODES, "=objdir/HTTPStatusCodes.obj", "r") != #err()
			while #read(CODES) > 0
				do #split($rdlin)
				set $(V)[$1] = $2
			endwhile
			do #fclose(CODES)
		else
			output mail "Can't open =objdir/HTTPStatusCodes.obj for reading!"
			quit
		fi
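The macro's parsing logic can be sketched in Python (the two-column 'code identifier' file layout, e.g. '404 HTTP_NOT_FOUND', is inferred from how the script later indexes $st[$statcode]):

```python
def read_http_status_codes(path):
    """Build a status-code lookup table from an HTTPStatusCodes.obj-style
    file of whitespace-separated 'code identifier' pairs, mirroring the
    macro's while #read() / #split() loop."""
    codes = {}
    # The macro mails an alert and quits if the file can't be opened;
    # here an IOError simply propagates.
    with open(path) as f:
        for line in f:
            fields = line.split()      # do #split($rdlin)
            if len(fields) >= 2:
                codes[fields[0]] = fields[1]   # set $(V)[$1] = $2
    return codes
```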

Following a variable initialization rule, we do some bypasses.  First, we bypass local URLs, since these are handled elsewhere by a different Pikt script (see CheckBrokenLinksInternal).  Second, we skip any image or other special data files.  (Include these if you wish.)  Next, we skip some commercial URLs and other miscellaneous URL references that are problematic for one reason or another.  (Again, include these if you wish.)  Then we skip some other URLs to bulletin board pages, ftp: references, etc.

At last, we check the current URL.  urlstat.pl fetches the resource (web page, image file, etc.) at the given URL and, for failed requests, reports the HTTP status code, for example:

403 http://www.ticketmaster.com

If the $stat result is non-empty (has non-zero #length()), or if the /tmp/urlstat.tmp file is zero bytes, we enter the 'if ... fi' block.  (urlstat.pl only reports anything for unsuccessful HTTP requests, or when nothing was retrieved at the URL location, despite what the HTTP status code might suggest.)
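The status-line split and the "is it broken?" test can be sketched in Python (hypothetical helpers; urlstat.pl's exact output format is assumed from the '403 http://www.ticketmaster.com' example above):

```python
def parse_status(statline):
    """Split a urlstat.pl-style failure line, e.g.
    '403 http://www.ticketmaster.com', into (code, url) --
    the do #split($stat) / set $statcode = $1 step."""
    fields = statline.split()
    return (fields[0], fields[1]) if len(fields) >= 2 else ("", "")

def is_broken(statline, tmpfile_size):
    """Mirror the 'if #length($stat) || -z /tmp/urlstat.tmp' test:
    the rule fires when the checker printed anything (it reports only
    failures), or when the saved response file is zero bytes."""
    return len(statline) > 0 or tmpfile_size == 0
```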

Remember that '%timebroken' is the #now() time value recorded for the given URL the last time this CheckBrokenLinksExternal script was run.  If this URL is newly "broken" (%timebroken <= 0, i.e., #timebroken for this URL was initialized to 0 on the previous run; see the very first rule in this script, where we initialize the #timebroken value for the current URL), we set the present #timebroken to the #now() time value.  Otherwise, we carry forward the #timebroken value from the previous run.

If the URL is newly broken, we set its #daysbroken value to 1; otherwise, we compute the number of days this URL has been broken with the statement

                                set #daysbroken = #int((#now() - #timebroken)/=secs_in_day)

(Note that =secs_in_day is a standard PIKT macro defined as '(60*60*24)'.)
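In Python terms, the whole #timebroken / #daysbroken bookkeeping amounts to the following sketch (prev_timebroken stands in for the %timebroken past value that PIKT's 'keys' mechanism carries between runs):

```python
import time

SECS_IN_DAY = 60 * 60 * 24   # the =secs_in_day macro: (60*60*24)

def update_broken(prev_timebroken, now=None):
    """Return (timebroken, daysbroken) for the current run.  A newly
    broken URL (previous value <= 0) is stamped with the current time
    and counted as broken 1 day; otherwise the original first-broken
    timestamp is kept and the elapsed days computed from it."""
    now = time.time() if now is None else now
    if prev_timebroken <= 0:
        return now, 1
    return prev_timebroken, int((now - prev_timebroken) / SECS_IN_DAY)
```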

If this URL has been broken at least 3 days, we (a) report the URL, (b) report its status (both status code and identifier) along with how many days it has been broken, and (c) list all files containing this broken link (by means of the find and egrep combination).
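The file-listing step ('find ... -exec egrep -il $url {} \;') can be sketched in Python; note this simplification does a case-insensitive substring search, whereas egrep would treat the URL (dots and all) as a regular expression:

```python
from pathlib import Path

def files_with_url(docroot, url):
    """List every .htm file under docroot that mentions the broken URL,
    case-insensitively, filenames only -- the reporting step's job."""
    hits = []
    for page in sorted(Path(docroot).rglob("*.htm")):
        if url.lower() in page.read_text(errors="ignore").lower():
            hits.append(str(page))
    return hits
```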

In the final rule, to be a good, polite Net citizen (so as not to hammer some target site with repeated URL checks in too quick succession), we pause a second before proceeding to the next URL.

If you wish, you could report broken links immediately, as in

                        if #daysbroken >= 0

But sometimes a URL is off-line momentarily (for example, the web server is down for maintenance, there is a network breakdown somewhere, or some other transient glitch).  Unless a URL has remained broken for some minimum period (in our case, 3 days), we consider this "brokenness" to be a possible false positive (or would that be a false negative?).  We don't want to remove a "broken" URL immediately every time our broken link checker (or anybody else's) reports a "break".

We could go further with this: report 404s (HTTP_NOT_FOUND) immediately, other types of breaks after 3 days, still others after 5 days, and so on.  Complicate this script to suit your own situation as needed.
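Such tiering might be sketched in Python as follows (the threshold table and the code choices here are hypothetical, just to show the shape of the idea):

```python
# Hypothetical per-code reporting thresholds: 404s immediately,
# most breaks after 3 days, server errors only after 5 days.
THRESHOLD_DAYS = {"404": 0, "500": 5}
DEFAULT_THRESHOLD = 3

def should_report(statcode, daysbroken):
    """Report a break once it has persisted past its code's threshold."""
    return daysbroken >= THRESHOLD_DAYS.get(statcode, DEFAULT_THRESHOLD)
```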

More possible refinements:  Ignore certain error return codes entirely.  Add rules to ignore special cases (for example, URLs with a status code of 200 but that return no content).  And so on.

The complete CheckBrokenLinksExternal script follows.

#if emcwebsys

BrokenLinksExternalEMC

	init
		status =piktstatus
		level =piktlevel
		task "Check for broken external EMC links"
		input proc "=find =emcwebdir
                            -name \\*.htm
                            -exec /usr/local/bin/find_urls.pl
                            {} \\; | =sort | =uniq"
		dat $url [1]
		keys $url

	begin
		=read_http_status_codes(st)

	rule	// initialize for current url
		set #timebroken = 0

	rule	// bypass local urls, since we check these elsewhere
		if $url =~ "http:\/\/earlymusichicago.org"
			next
		fi

	rule	// bypass images and other special data files
		if $url =~ "images/|\\.jpg|\\.gif|\\.wav"
			next
		fi

	rule	// bypass commercial urls and other misc that return a status
		// code of 200 but no data (result in an empty urlstat.tmp
		// file; see below); or that we want to disregard for some
		// other reason
		if    $url =~ "www\\.amazon\\.com/exec/obidos"
		   || $url =~ "www.google\\.com/images"
		   || $url =~ "wikipedia\\.org/wiki"
		   || $url =~ "www\\.sheetmusicplus\\.com/.+search\\.html"
		   || $url =~ "webring\\.com"
			next
		fi

	rule	// bypass these internal pages
		if $url =~ "yabb_emc/YaBB.cgi?board="
			next
		fi

	rule	// ignore ftp: and others
		if $inlin !~ "^http:\/\/"
			next
		fi

	rule	// okay, now we investigate the url
		set $stat = $command("/usr/local/bin/urlstat.pl '$url' /tmp/urlstat.tmp")
		do #split($stat)
		set $statcode = $1
		// set $url = $2	// $url already set

	// exceptions follow

	rule	// status code 200
		if    $statcode eq "200"
		   && ($url =~~ "www\\.google\\.com")
			next
		fi

	rule	// status code 403
		if    $statcode eq "403"
		   && (   $url =~~ "www\\.ticketmaster\\.com"
		       || $url =~~ "directory\\.google\\.com"
		      )
			next
		fi

	rule	// status code 406
		if    $statcode eq "406"
		   && (   $url =~~ "www\\.bagpiper\\.com"
		       || $url =~~ "www\\.bagpipeweb\\.com"
		      )
			next
		fi

	rule	// status code 500
		if    $statcode eq "500"
		   && ($url =~~ "earlymusicchicago\\.org")
			next
		fi

	rule
		if    #length($stat)
		   || -z "/tmp/urlstat.tmp"
			if %timebroken <= 0
				set #timebroken = #now()
				set #daysbroken = 1
			else
				set #timebroken = %timebroken
				set #daysbroken = #int((#now() - #timebroken)/(=secs_in_day))
			fi
			if #daysbroken >= 3
				output mail "$url"
				output mail "  $statcode [$st[$statcode]], broken $text(#daysbroken) days"
				// report the files with this url
				do #popen(FIND, "=find =emcwebdir -name \\*.htm
                                                 -exec =egrep -il '$url' {} \\;", "r")
				while #read(FIND) > 0
					output mail "  $rdlin"
				endwhile
				do #pclose(FIND)
				output mail $newline()
			fi
		fi

	rule	// be polite
		pause 1

#endif  // emcwebsys

Here is some example script output:

http://www.Ticketmaster.com
  403 [HTTP_FORBIDDEN], broken 3 days
  /var/www/html/emc/events_december_2003.htm

http://www.avdgs.org.au/
  500 [HTTP_INTERNAL_SERVER_ERROR], broken 3 days
  /var/www/html/emc/instruments_strings.htm

http://www.google.com/Top/Arts/People/C/Campion,_Thomas/
  403 [HTTP_FORBIDDEN], broken 8 days
  /var/www/html/emc/composers_renaissance.htm

...

For more examples, see Samples.

Page last updated 2019-01-12.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.