Report Recent Google Googlebot and Mediabot Visits
googlebot.pl is a simple Perl script that scans the Apache web server access logs and reports recent visits by the Google googlebot (search engine spider) and AdSense mediabot (ad server crawler). (The ad server agent identifies itself as "Mediapartners-Google".)
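As a minimal sketch of how the two crawlers can be told apart by their User-Agent strings, the following uses the same two patterns that googlebot.pl applies. The helper name classify_agent and the sample User-Agent strings are assumptions made for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# classify_agent() is a hypothetical helper (not in googlebot.pl).
# It applies the script's two agent patterns to one User-Agent string.
sub classify_agent {
    my ($agent) = @_;
    return 'mediabot'  if $agent =~ /mediapartners-google/i;
    return 'googlebot' if $agent =~ /(googlebot|google\.com\/bot\.html)/i;
    return 'other';
}

# Sample User-Agent strings, assumed for illustration only.
my @samples = (
    'Googlebot/2.1 (+http://www.google.com/bot.html)',
    'Mediapartners-Google/2.1',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
);

print classify_agent($_), "\n" for @samples;
```

The sketch checks the mediabot pattern first so that each string gets exactly one label; googlebot.pl itself tests the two patterns independently.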
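googlebot.pl might produce output like the following, where each line shows a page, the time of the most recent googlebot visit, and the time of the most recent mediabot visit: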
authors.html | 09/Sep/2005:01:45:05 | 20/Aug/2005:08:37:10
changes.html | 05/Sep/2005:19:00:56 | 07/Apr/2005:04:17:56
...
faq.html | 09/Sep/2005:01:46:59 | 08/Sep/2005:09:51:33
index.html | 08/Sep/2005:21:52:05 | 08/Sep/2005:10:28:15
intro/intro.html | 09/Sep/2005:01:47:26 | 06/Sep/2005:17:49:29
...
Here is the Perl script:
#!/usr/bin/perl

$site = $ARGV[0];

# inventory the pages
# (the parentheses make -print apply to both the .htm and .html matches)
open(PAGES, "/usr/bin/find /var/www/html/$site \\( -name \\*.htm -o -name \\*.html \\) -print |");
while (<PAGES>) {
    chomp;
    s/\.\///g;
    s/\/var\/www\/html\/\Q$site\E\///;
    $pages{$_}++;
}
close(PAGES);

# scan the access log(s), and for each accessed page, note the latest access
# date and time
open(LOG, "/bin/cat /var/log/httpd/access_log_$site.07* /var/log/httpd/access_log_$site |");
while (<LOG>) {
    # skip lines that do not match, so that stale $1, $3, $5 are never reused
    next unless /^.+\[(\d+\/\w+\/\d+:\d+:\d+:\d+).+\"(get|head)\s(\S+)\s.+\"\s+\d+\s+\S+\s?\"([^\"]+)\".+\"([^\"]+)\"$/i;
    $date  = $1;
    $page  = $3;
    $agent = $5;
    # normalize the requested URL to a page name relative to the site root
    if ($page =~ /\/pikt\/(.+)/) {
        $page = $1;
    }
    if (($page eq "/") || ($page eq "/pikt/")) {
        $page = "index.html";
    }
    if ($page =~ /^\/(.+)/) {
        $page = $1;
    }
    if ($agent =~ /(googlebot|google\.com\/bot\.html)/i) {
        $googlebot{$page} = $date;
    }
    if ($agent =~ /mediapartners-google/i) {
        $mediabot{$page} = $date;
    }
}
close(LOG);

# for all inventoried pages, report the date and time of last googlebot and
# mediabot access
foreach $page (sort keys %pages) {
    print "$page | $googlebot{$page} | $mediabot{$page}\n";
}

exit 0;
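To see how the log-matching expression and the page-name normalization work on a single entry, here is a small sketch that can be run without any log files. The log line is made up and assumes the Apache "combined" log format; parse_line is a hypothetical helper, and the pattern is the script's expression written on the assumption that a single whitespace character separates the URL from the protocol field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up access-log line in Apache "combined" format (illustration only).
my $line = '66.249.66.1 - - [09/Sep/2005:01:45:05 -0400] '
         . '"GET /pikt/authors.html HTTP/1.1" 200 5120 "-" '
         . '"Googlebot/2.1 (+http://www.google.com/bot.html)"';

# parse_line() is a hypothetical helper (not in googlebot.pl).
# It returns (date, page, agent), or an empty list if the line does not match.
sub parse_line {
    my ($entry) = @_;
    return unless $entry =~
        /^.+\[(\d+\/\w+\/\d+:\d+:\d+:\d+).+\"(get|head)\s(\S+)\s.+\"\s+\d+\s+\S+\s?\"([^\"]+)\".+\"([^\"]+)\"$/i;
    my ($date, $page, $agent) = ($1, $3, $5);

    # Same page-name normalization steps as googlebot.pl.
    if ($page =~ /\/pikt\/(.+)/)               { $page = $1; }
    if (($page eq "/") || ($page eq "/pikt/")) { $page = "index.html"; }
    if ($page =~ /^\/(.+)/)                    { $page = $1; }

    return ($date, $page, $agent);
}

my ($date, $page, $agent) = parse_line($line);
print "$page | $date\n";    # prints: authors.html | 09/Sep/2005:01:45:05
```

Returning an empty list for non-matching lines keeps values captured from an earlier line from being attributed to a later one.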
googlebot.pl is called by the GooglebotVisitPIKT and GoogleMediabotVisitPIKT Pikt scripts.
For more examples, see Samples.