Find URLs

find_urls.pl is a simple Perl script to find URLs within an HTML file.

For every line, we look for the anchor tag pattern <a href="...">.  The quoted string following the 'href=' is the URL.  Since a line might have more than one URL, after finding the first URL, we need to keep looking within that line.  We do this by setting $line anew to everything following the anchor tag, then looking again for another anchor tag within the remainder of the line.  We stay within the inner while loop until all anchor tags and URLs (if any) in the current line are found, then move on to the next line.  Note that the pattern matches only when href is the sole attribute of the anchor tag and the whole tag sits on a single line.

The script follows.

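#!/usr/bin/perl

# find_urls.pl -- find the urls within an html file

use strict;
use warnings;

@ARGV == 1 or die "usage: find_urls.pl file.html\n";

# open the html file named on the command line
open(my $html, '<', $ARGV[0]) or die "find_urls.pl: can't open $ARGV[0]: $!\n";
while (my $line = <$html>) {
        # print each href value, then search the rest of the line
        while ($line =~ /<a\s+href\s*=\s*['"]([^'"]+)['"]\s*>(.*)$/i) {
                print "$1\n";
                $line = $2;
        }
}
close($html);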
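
To see the inner loop at work on a line with more than one anchor tag, here is a minimal sketch that applies the same pattern to a string in memory.  The subroutine name extract_urls and the sample line are made up for illustration; they are not part of find_urls.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_urls is a hypothetical helper, not part of find_urls.pl.
# It runs the same pattern and "remainder of the line" loop on one string
# and returns the URLs it finds, in order.
sub extract_urls {
        my ($line) = @_;
        my @urls;
        while ($line =~ /<a\s+href\s*=\s*['"]([^'"]+)['"]\s*>(.*)$/i) {
                push @urls, $1;   # the quoted href value
                $line = $2;       # keep searching after this anchor tag
        }
        return @urls;
}

# Made-up sample line containing two anchor tags.
my $sample = qq{<p><a href="http://example.com/a.html">A</a> and <A HREF='b.html'>B</a></p>\n};
print "$_\n" for extract_urls($sample);
```

Run as is, this should print http://example.com/a.html and then b.html, one per line.  The first pass captures the first URL and leaves everything after that anchor tag in $line; the second pass finds the remaining anchor tag, and the third pass fails to match, which ends the loop.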

find_urls.pl is used by CheckBrokenLinksExternal, CheckBrokenLinksInternal, and other Pikt scripts.

For more examples, see Samples.

Page best viewed at 1024x768 or greater.   Page last updated 2008-03-27.   This site is PIKT® powered.
PIKT® is a registered trademark of the University of Chicago.   Copyright © 1998-2008 Robert Osterlund. All rights reserved.