Broken Link Checker (internal links)

The =broken_links_internal link checker is a Pikt script macro that checks the validity of internal (same-site) URL references within a collection of HTML documents.

In the 'input proc' statement below, we use the Unix find command to locate all .htm and .html files within the web documents tree.
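For readers less familiar with find, the file-collection step can be sketched in Python. This is only an illustration, not part of the Pikt script: the function name collect_html_files is hypothetical, and the demo builds a throwaway directory tree instead of using the real =httpd_doc_root.

```python
import os
import tempfile

def collect_html_files(doc_root):
    """Return sorted paths of all .htm and .html files under doc_root."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            # Same case-sensitive suffix test as find's -name patterns.
            if name.endswith((".htm", ".html")):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

def demo():
    """Build a small temporary tree and list the HTML files found in it."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub"))
        for rel in ("index.html", os.path.join("sub", "page.htm"), "notes.txt"):
            with open(os.path.join(root, rel), "w") as fh:
                fh.write("")
        return [os.path.relpath(p, root) for p in collect_html_files(root)]
```

Calling demo() returns the two HTML files and leaves out notes.txt.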

For every .htm or .html file, we invoke the find_urls.pl script to find all URLs within the file.  (Since the documents are local and entirely accessible on disk, we avoid any complicated, recursive spidering of our own website.)
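The find_urls.pl script itself is not shown on this page. As a rough idea of what such a URL extractor might do, here is a minimal Python sketch. It assumes that URLs of interest appear in href and src attributes; the real script may well use different rules, and the names UrlCollector and find_urls are hypothetical.

```python
from html.parser import HTMLParser

class UrlCollector(HTMLParser):
    """Collect href and src attribute values in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

def find_urls(html_text):
    """Return every href/src value found in html_text (an assumed rule set)."""
    parser = UrlCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls
```

Each returned value corresponds to one line that the macro would read from the URLS pipe.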

We bypass ftp: and other references.  (Include these if you wish.)  We also bypass all "foreign" (off-site) URLs, since we check these using a different Pikt script (see CheckBrokenLinksExternal).
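The filtering just described can be sketched in Python as a small classifier. This is an illustration under assumptions, not the macro itself: the function name classify and its return labels are hypothetical, and site_pattern plays the role of the macro's website argument.

```python
import re

def classify(url, site_pattern):
    """Return 'skip', 'offsite', or 'internal' for one URL."""
    if not url.startswith("http://"):
        return "skip"  # ftp:, mailto:, and other references
    # Split the URL into host name and the remainder of the path.
    match = re.match(r"http://([^/]+)(.*)", url)
    host = match.group(1)
    # Host names are case-insensitive, so match accordingly.
    if re.search(site_pattern, host, re.IGNORECASE):
        return "internal"
    return "offsite"
```

Only URLs classified as 'internal' go on to the file existence test; the off-site ones are left to the external link checker.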

In the http: pattern match, we determine the $site name and the file's $relpath (relative path).  Next, we have to determine the file's absolute path, according to several categories (for example, references to pages in the parent directory, cgi-bin files, relative path references, absolute path references, and so on).  Note that we simply ignore the complication of named anchors.  If the file is there, that is good enough (even if the named anchor within that file is missing or misnamed).
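The path categories can be restated as a short Python function. This sketch follows the order of tests in the macro but is not a drop-in replacement: the name to_abspath is hypothetical, the /var/www prefix is simply copied from the macro, and the site-specific YaBB.cgi special case is left out.

```python
import re

CGI_ROOT = "/var/www"  # copied from the macro; adjust for your server

def to_abspath(relpath, webdir):
    """Map a URL path to a filesystem path, following the macro's categories."""
    parent = re.search(r"\.\./(.+)", relpath)
    if parent:                             # parent-directory reference
        abspath = webdir + "/" + parent.group(1)
    elif relpath.startswith("/cgi-bin/"):  # CGI scripts live outside webdir
        abspath = CGI_ROOT + relpath
    elif relpath.startswith("/var/www"):   # already a filesystem path
        abspath = relpath
    elif relpath.startswith("/"):          # site-absolute path
        abspath = webdir + relpath
    else:                                  # plain relative path
        abspath = webdir + "/" + relpath
    # Drop any named anchor; only the file's existence is checked.
    return re.sub(r"^(.+\.html?)#.+$", r"\1", abspath)
```

For example, to_abspath("/faq.html#top", "/var/www/html") yields "/var/www/html/faq.html", with the anchor stripped.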

Finally, we do a simple '-e' file test to check for the file's existence.  If the file is not found, we report the broken URL (incorrect file reference).
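The final check and report step might look like this in Python. Again this is a sketch under assumptions: report_missing is a hypothetical name, the report lines are returned rather than mailed, and the demo uses a temporary directory in place of a real web tree.

```python
import os
import tempfile

def report_missing(filename, abspaths):
    """Return one report line for each path that does not exist on disk."""
    return [f"in {filename}: {path} not found"
            for path in abspaths
            if not os.path.exists(path)]

def demo():
    """Check one existing and one missing file; True if only the latter is reported."""
    with tempfile.TemporaryDirectory() as root:
        present = os.path.join(root, "present.htm")
        missing = os.path.join(root, "missing.htm")
        open(present, "w").close()
        lines = report_missing("index.htm", [present, missing])
        return lines == [f"in index.htm: {missing} not found"]
```

The report line uses the same wording as the macro's 'output mail' statement.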

#if emcwebsys

broken_links_internal(site, website, webdir)

	init
		status =piktstatus
		level =piktlevel
		task "Check for broken internal links"
		input proc "=find =httpd_doc_root/(site)
                            -name \\*.htm -o -name \\*.html"

	rule
		set $filename = $inlin

	rule	// bypass /tmp and other pages
		if $filename =~~ "(/tmp/)"
			next
		fi

	rule
		if #popen(URLS, "/usr/local/bin/find_urls.pl $filename", "r") != #err()
			while #read(URLS) > 0
#ifdef debug
				output $rdlin
#endifdef
				// ignore ftp: and others
				if $rdlin !~ "^http:\/\/"
					cont
				fi
				// ignore off-site URLs, since we check these elsewhere
				if $rdlin =~~ "http.*:\/\/([^/]+)(.+)"
					set $site = $1
					set $relpath = $2
					if $site !~~ "(website)"
						cont
					fi
				else
					set $relpath = $rdlin
				fi
#ifdef debug
				output $relpath
#endifdef
				// set the absolute path
				if $relpath =~~ "(\\.\\./)(.+)"
					set $abspath = "(webdir)" . "/$2"
				elsif $relpath =~~ "^/cgi-bin/"
					set $abspath = "/var/www" . $relpath
					if $abspath =~~ "^(/var/www/cgi-bin/yabb_emc/YaBB\\.cgi).*"
						set $abspath = $1
					fi
				elsif $relpath =~~ "^/var/www"
					set $abspath = $relpath
				elsif $left($relpath,1) eq "/"
					set $abspath = "(webdir)" . "$relpath"
				else
					set $abspath = "(webdir)" . "/$relpath"
				fi
				// ignore named anchors
				if    $abspath =~~ "^(.+\\.htm)#.+$"
				   || $abspath =~~ "^(.+\\.html)#.+$"
					set $abspath = $1
				fi
#ifdef debug
				output $abspath
#endifdef
				if ! -e $abspath
					output mail "in $filename: $abspath not found"
				fi
			endwhile
			do #pclose(URLS)
		else
                        output mail "Can't open '/usr/local/bin/find_urls.pl $filename'
                                     for processing!"
			quit
		fi

#endif  // emcwebsys

For both the PIKT and Early MusiChicago websites, we invoke the =broken_links_internal() macro from within alarms.cfg (or one of its #include files):

///////////////////////////////////////////////////////////////////////////////
BrokenLinksInternalPIKT

	=broken_links_internal(pikt, pikt\\\\.org, =httpd_doc_root)

///////////////////////////////////////////////////////////////////////////////

BrokenLinksInternalEMC

	=broken_links_internal(emc, earlymusichicago\\\\.org, =httpd_doc_root/emc)

///////////////////////////////////////////////////////////////////////////////

Here is some example BrokenLinksInternalEMC script output:

                in /var/www/html/emc/events_january_2004.htm:
                  /var/www/html/emc/soloists_instrumental_keyboard.htm not found
                in /var/www/html/emc/events_november_2004.htm:
                  /var/www/html/emc/soloists_instrumental_strings.htm not found
                ...

For more examples, see Samples.

 
Page last updated 2019-01-12.
Copyright © 1998-2019 Robert Osterlund. All rights reserved.