Broken Link Checker (internal links)
The =broken_links_internal macro is a Pikt script macro that checks the validity of internal (same-site) URL references within a collection of HTML documents.
In the 'input proc' statement below, we use the Unix find command to find all .htm and .html files within the web documents tree.
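For readers less familiar with find, the file-discovery step can be sketched in Python. This is an illustrative equivalent only, not part of the Pikt macro; the function name find_html_files and its doc_root argument are assumptions made for the example.

```python
import os

def find_html_files(doc_root):
    """Yield the path of every .htm or .html file under doc_root."""
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in sorted(filenames):
            # str.endswith accepts a tuple of suffixes
            if name.endswith((".htm", ".html")):
                yield os.path.join(dirpath, name)
```

Each yielded path plays the role of one input line ($inlin) in the macro.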
For every .htm or .html file, we invoke the find_urls.pl script to find all URLs within the file. (Since the documents are local and fully accessible on disk, we have no need for a complicated, recursive spidering of our own website.)
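The find_urls.pl script itself is not shown here. As a rough idea of what such a helper might do, the following Python sketch collects href and src attribute values from an HTML document. The names UrlCollector and find_urls are hypothetical, and the real Perl script may extract URLs differently.

```python
from html.parser import HTMLParser

class UrlCollector(HTMLParser):
    """Collect href and src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

def find_urls(html_text):
    """Return all href/src values found in html_text, in document order."""
    collector = UrlCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls
```

Printing one URL per line would give output of the kind the macro reads through its URLS process handle.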
We bypass ftp: and other non-http: references. (Include these if you wish.) We also bypass all "foreign" (off-site) URLs, since we check those with a different Pikt script (see CheckBrokenLinksExternal).
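The filtering rule can be sketched in Python as follows. The function name is_internal is hypothetical, and website_pattern stands in for the macro's (website) regular expression argument. Treating scheme-less relative references as internal is an assumption of this sketch, not something the macro guarantees.

```python
import re

def is_internal(url, website_pattern):
    """Return True if url should be checked as an internal link."""
    scheme = re.match(r"^([a-z][a-z0-9+.-]*):", url, re.IGNORECASE)
    if scheme is None:
        return True    # relative reference (assumption: same site)
    if scheme.group(1).lower() not in ("http", "https"):
        return False   # ftp:, mailto:, and so on
    host = re.match(r"^https?://([^/]+)", url, re.IGNORECASE)
    if host is None:
        return False   # malformed http: reference
    # on-site only if the host name matches the site's pattern
    return re.search(website_pattern, host.group(1), re.IGNORECASE) is not None
```

For example, is_internal("http://pikt.org/index.html", r"pikt\.org") is true, while an ftp: URL on the same host is bypassed.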
In the http: pattern match, we determine the $site name and file $relpath (relative path). Next, we have to determine the file absolute path, according to several categories (for example, references to pages in the parent directory, cgi-bin files, relative path references, absolute path references, and so on). Note that we simply ignore the complication of named anchors. If the file is there, good enough (even if the named anchor within that file is missing or misnamed).
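The path categories above can be summarized in a short Python sketch. It follows the same order of tests as the macro but is not a translation of it: the function name to_abspath and the constant CGI_ROOT are hypothetical, and the site-specific YaBB.cgi special case is left out.

```python
import re

CGI_ROOT = "/var/www"  # parent of cgi-bin in the macro; adjust for your layout

def to_abspath(relpath, webdir):
    """Map a URL path to a filesystem path, then drop any named anchor."""
    parent = re.search(r"\.\./(.+)", relpath)
    if parent:                               # reference via the parent directory
        abspath = webdir + "/" + parent.group(1)
    elif relpath.startswith("/cgi-bin/"):    # cgi-bin file
        abspath = CGI_ROOT + relpath
    elif relpath.startswith("/var/www"):     # already a filesystem path
        abspath = relpath
    elif relpath.startswith("/"):            # absolute path reference
        abspath = webdir + relpath
    else:                                    # relative path reference
        abspath = webdir + "/" + relpath
    # ignore named anchors: page.htm#section becomes page.htm
    return re.sub(r"(\.html?)#.+$", r"\1", abspath, flags=re.IGNORECASE)
```

As in the macro, a relative reference is resolved against webdir rather than against the directory of the referring file, so this only suits sites whose pages live directly in webdir.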
Finally, we use a simple '-e' file test to check that the file exists. If it does not, we report the broken URL (an incorrect file reference) by mail.
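The final test is equally simple in other languages. In this Python sketch, os.path.exists plays the role of '-e'; the function name report_broken is hypothetical, and the message format copies the one used by the macro.

```python
import os

def report_broken(filename, abspaths):
    """Return one report line for each referenced path that does not exist."""
    return [
        f"in {filename}: {path} not found"
        for path in abspaths
        if not os.path.exists(path)
    ]
```

An empty list means every referenced file was found.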
#if emcwebsys

broken_links_internal(site, website, webdir)

        init
                status =piktstatus
                level =piktlevel
                task "Check for broken internal links"
                input proc "=find =httpd_doc_root/(site)
                            -name \\*.htm -print -o -name \\*.html -print"

        rule
                set $filename = $inlin

        rule    // bypass /tmp and other pages
                if $filename =~~ "(/tmp/)"
                        next
                fi

        rule
                if #popen(URLS, "/usr/local/bin/find_urls.pl $filename", "r") != #err()
                        while #read(URLS) > 0
#ifdef debug
                                output $rdlin
#endifdef
                                // ignore ftp: and others
                                if $rdlin !~ "^http:\/\/"
                                        next
                                fi
                                // ignore off-site URLs, since we check these elsewhere
                                if $rdlin =~~ "http.*:\/\/([^/]+)(.+)"
                                        set $site = $1
                                        set $relpath = $2
                                        if $site !~~ "(website)"
                                                cont
                                        fi
                                else
                                        set $relpath = $rdlin
                                fi
#ifdef debug
                                output $relpath
#endifdef
                                // set the absolute path
                                if $relpath =~~ "(\\.\\./)(.+)"
                                        set $abspath = "(webdir)" . "/$2"
                                elsif $relpath =~~ "^/cgi-bin/"
                                        set $abspath = "/var/www" . $relpath
                                        if $abspath =~~ "^(/var/www/cgi-bin/yabb_emc/YaBB\\.cgi).*"
                                                set $abspath = $1
                                        fi
                                elsif $relpath =~~ "^/var/www"
                                        set $abspath = $relpath
                                elsif $left($relpath,1) eq "/"
                                        set $abspath = "(webdir)" . "$relpath"
                                else
                                        set $abspath = "(webdir)" . "/$relpath"
                                fi
                                // ignore named anchors
                                if $abspath =~~ "^(.+\\.htm)#.+$"
                                   || $abspath =~~ "^(.+\\.html)#.+$"
                                        set $abspath = $1
                                fi
#ifdef debug
                                output $abspath
#endifdef
                                if ! -e $abspath
                                        output mail "in $filename: $abspath not found"
                                fi
                        endwhile
                        do #pclose(URLS)
                else
                        output mail "Can't open '/usr/local/bin/find_urls.pl $filename'
                                     for processing!"
                        quit
                fi

#endif  // emcwebsys
For both the PIKT and Early MusiChicago websites, we invoke the =broken_links_internal() macro from within alarms.cfg (or one of its #include files):
///////////////////////////////////////////////////////////////////////////////

BrokenLinksInternalPIKT

        =broken_links_internal(pikt, pikt\\\\.org, =httpd_doc_root)

///////////////////////////////////////////////////////////////////////////////

BrokenLinksInternalEMC

        =broken_links_internal(emc, earlymusichicago\\\\.org, =httpd_doc_root/emc)

///////////////////////////////////////////////////////////////////////////////
Here is some example BrokenLinksInternalEMC script output:
in /var/www/html/emc/events_january_2004.htm:
/var/www/html/emc/soloists_instrumental_keyboard.htm not found
in /var/www/html/emc/events_november_2004.htm:
/var/www/html/emc/soloists_instrumental_strings.htm not found
...
For more examples, see Samples.