Find URLs
find_urls.pl is a simple Perl script to find URLs within an HTML file.
For every line, we look for the anchor tag pattern <a href="...">. The quoted string following 'href=' is the URL. Since a line might contain more than one URL, we keep looking within that line after finding the first one: we reset $line to everything following the anchor tag, then search that remainder for another anchor tag. We stay in the inner while loop until every anchor tag and URL (if any) in the current line has been found, then move on to the next line. Note that the pattern matches only anchors in which href is the sole attribute; an anchor such as <a class="x" href="..."> is skipped, as is an anchor tag split across two lines.
The script follows.
#!/usr/bin/perl
# find_urls.pl -- find the urls within an html file

open(HTML, $ARGV[0]) or die "can't open $ARGV[0]: $!\n";
while (<HTML>) {
    $line = $_;
    # print each href URL, then rescan the text after that anchor tag
    while ($line =~ /<a\s+href\s*=\s*[\'\"]([^\'\"]+)[\'\"]\s*>(.*)$/i) {
        print "$1\n";
        $line = $2;
    }
}
close(HTML);
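To see the inner loop at work without an HTML file, the same pattern can be tried on a single string. The following is only a sketch: extract_urls() is a hypothetical helper that is not part of find_urls.pl, and the sample line is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_urls() is a hypothetical helper, not part of find_urls.pl.
# It applies the same pattern and the same "match, then rescan the
# remainder" loop to one line and returns the URLs it finds.
sub extract_urls {
    my ($line) = @_;
    my @urls;
    while ($line =~ /<a\s+href\s*=\s*['"]([^'"]+)['"]\s*>(.*)$/i) {
        push @urls, $1;   # the quoted string following href=
        $line = $2;       # keep only the text after the anchor tag
    }
    return @urls;
}

# Made-up sample line containing two anchors.
my $sample = q{<p><a href="http://example.com/a">A</a> and <A HREF='b.html'>B</a></p>};
print "$_\n" for extract_urls($sample);
```

With this sample line, the first pass captures http://example.com/a and leaves everything after that tag in $line; the second pass captures b.html (the /i modifier lets the pattern match the uppercase tag); the third pass finds no further anchor, so the loop ends.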
find_urls.pl is used by CheckBrokenLinksExternal, CheckBrokenLinksInternal, and other Pikt scripts.
For more examples, see Samples.