Broken Link Checker (internal links)
The =broken_links_internal link checker is a Pikt script macro to check the validity of internal (same-site) URL references within a collection of HTML documents.
In the 'input proc' statement below, we use the Unix find command to find all .htm and .html files within the web documents tree.
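For readers less familiar with find, the file-selection step can be sketched in Python. This is only an illustration and not part of PIKT; the function name find_html_files is hypothetical.

```python
import os

def find_html_files(root):
    """Yield the paths of all .htm and .html files under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # walk subdirectories in a stable order
        for name in sorted(filenames):
            if name.endswith((".htm", ".html")):
                yield os.path.join(dirpath, name)
```

Calling list(find_html_files("/var/www/html/emc")) would then give the same set of files that the input proc feeds to the rules, one path per input line.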
For every .htm or .html file, we invoke the find_urls.pl script to find all URLs within the file. (Since the documents are local and entirely accessible on disk, we have no need for a complicated, recursive spidering scheme to crawl our own website.)
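The find_urls.pl script itself is not shown here, so its exact behavior is unknown. As a rough stand-in, the following Python sketch collects href and src attribute values using the standard html.parser module. The names LinkCollector and find_urls are hypothetical, and the real script may extract a different set of references.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href and src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

def find_urls(html_text):
    """Return every href/src value found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls
```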
We bypass ftp: and other references. (Include these if you wish.) We also bypass all "foreign" (off-site) URLs, since we check these using a different Pikt script (see CheckBrokenLinksExternal).
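The filtering decision can be sketched as follows. This is an assumption-laden illustration, not the macro's logic: the macro matches the site name against a regular expression, whereas this hypothetical is_internal helper compares hostnames with urllib.parse.

```python
from urllib.parse import urlsplit

def is_internal(url, website):
    """Return True if url is relative or points at website over http/https."""
    parts = urlsplit(url)
    if parts.scheme and parts.scheme not in ("http", "https"):
        return False                 # ftp:, mailto:, and other schemes
    if not parts.netloc:
        return True                  # no host given, so the reference is same-site
    host = parts.hostname or ""
    # Accept the bare domain and any subdomain such as www.
    return host == website or host.endswith("." + website)
```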
In the http: pattern match, we determine the $site name and file $relpath (relative path). Next, we have to determine the file absolute path, according to several categories (for example, references to pages in the parent directory, cgi-bin files, relative path references, absolute path references, and so on). Note that we simply ignore the complication of named anchors. If the file is there, good enough (even if the named anchor within that file is missing or misnamed).
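To make the path categories concrete, here is a simplified Python sketch of the mapping. It follows the same order of cases as the macro but omits the site-specific YaBB special case; the function name to_abspath and the server_root parameter are hypothetical.

```python
import re

def to_abspath(relpath, webdir, server_root="/var/www"):
    """Map a same-site URL path to a file path on disk."""
    parent = re.search(r"\.\./(.+)", relpath)
    if parent:                                # reference through the parent directory
        abspath = webdir + "/" + parent.group(1)
    elif relpath.startswith("/cgi-bin/"):     # cgi-bin lives under the server root
        abspath = server_root + relpath
    elif relpath.startswith(server_root):     # already an absolute file path
        abspath = relpath
    elif relpath.startswith("/"):             # site-absolute path
        abspath = webdir + relpath
    else:                                     # plain relative path
        abspath = webdir + "/" + relpath
    # Drop a named anchor: page.htm#section becomes page.htm
    anchored = re.match(r"^(.+\.html?)#.+$", abspath)
    return anchored.group(1) if anchored else abspath
```

As in the macro, every relative reference is resolved against webdir itself rather than against the directory of the referring file, which is adequate only when the site's pages sit at one level.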
Finally, we use a simple '-e' file test to check that the file exists. If it does not, we report the broken URL (incorrect file reference).
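The final check amounts to very little code. A minimal Python sketch, using a hypothetical report_missing helper and the same message format as the macro's mail output:

```python
import os

def report_missing(filename, abspaths):
    """Return one report line for each path that does not exist on disk."""
    return [
        f"in {filename}: {path} not found"
        for path in abspaths
        if not os.path.exists(path)
    ]
```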
#if emcwebsys

broken_links_internal(site, website, webdir)

        init
                status =piktstatus
                level =piktlevel
                task "Check for broken internal links"
                // no explicit -print here: with "-name A -o -name B -print",
                // find prints only the files matching B, because the implied
                // "and" binds more tightly than -o
                input proc "=find =httpd_doc_root/(site) -name \\*.htm -o -name \\*.html"

        rule
                set $filename = $inlin

        rule    // bypass /tmp and other pages
                if $filename =~~ "(/tmp/)"
                        next
                fi

        rule
                if #popen(URLS, "/usr/local/bin/find_urls.pl $filename", "r") != #err()
                        while #read(URLS) > 0
#ifdef debug
                                output $rdlin
#endifdef
                                // ignore ftp: and others
                                if $rdlin !~ "^http:\/\/"
                                        next
                                fi
                                // ignore off-site URLs, since we check these elsewhere
                                if $rdlin =~~ "http.*:\/\/([^/]+)(.+)"
                                        set $site = $1
                                        set $relpath = $2
                                        if $site !~~ "(website)"
                                                cont
                                        fi
                                else
                                        set $relpath = $rdlin
                                fi
#ifdef debug
                                output $relpath
#endifdef
                                // set the absolute path
                                if $relpath =~~ "(\\.\\./)(.+)"
                                        set $abspath = "(webdir)" . "/$2"
                                elsif $relpath =~~ "^/cgi-bin/"
                                        set $abspath = "/var/www" . $relpath
                                        if $abspath =~~ "^(/var/www/cgi-bin/yabb_emc/YaBB\\.cgi).*"
                                                set $abspath = $1
                                        fi
                                elsif $relpath =~~ "^/var/www"
                                        set $abspath = $relpath
                                elsif $left($relpath,1) eq "/"
                                        set $abspath = "(webdir)" . "$relpath"
                                else
                                        set $abspath = "(webdir)" . "/$relpath"
                                fi
                                // ignore named anchors
                                if    $abspath =~~ "^(.+\\.htm)#.+$"
                                   || $abspath =~~ "^(.+\\.html)#.+$"
                                        set $abspath = $1
                                fi
#ifdef debug
                                output $abspath
#endifdef
                                if ! -e $abspath
                                        output mail "in $filename: $abspath not found"
                                fi
                        endwhile
                        do #pclose(URLS)
                else
                        output mail "Can't open '/usr/local/bin/find_urls.pl $filename' for processing!"
                        quit
                fi

#endif  // emcwebsys
For both the PIKT and Early MusiChicago websites, we invoke the =broken_links_internal() macro from within alarms.cfg (or one of its #include files):
///////////////////////////////////////////////////////////////////////////////

BrokenLinksInternalPIKT

        =broken_links_internal(pikt, pikt\\\\.org, =httpd_doc_root)

///////////////////////////////////////////////////////////////////////////////

BrokenLinksInternalEMC

        =broken_links_internal(emc, earlymusichicago\\\\.org, =httpd_doc_root/emc)

///////////////////////////////////////////////////////////////////////////////
Here is some example BrokenLinksInternalEMC script output:
in /var/www/html/emc/events_january_2004.htm: /var/www/html/emc/soloists_instrumental_keyboard.htm not found
in /var/www/html/emc/events_november_2004.htm: /var/www/html/emc/soloists_instrumental_strings.htm not found
...
For more examples, see Samples.