debuggable

 
 

Crawl Google, they do the same to you ; )

Posted on 10/6/08 by Felix Geisendörfer

Hey folks,

Marc Grabanski just had the great idea of using Google to help with migrating your site to a new domain or URL scheme. Just get a list of all the pages Google has indexed from your site, then use that list as the basis for checking whether your migration worked. This is very convenient because you do not have to know all of your own URLs, and you only get the relevant ones (pages that are not in Google are unlikely to get much traffic).

So here is some quick CakePHP code for crawling Google instead of being crawled by them:

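class GoogleIndexShell extends Shell {
  function main() {
    App::import('HttpSocket');

    // The site to look up is the first command line argument
    if (empty($this->args)) {
      $this->out('Usage: ./cake google_index <domain>');
      return;
    }
    list($site) = $this->args;
    $Socket = new HttpSocket();
    $links = array();

    // Page through the results, $num at a time
    $start = 0;
    $num = 100;
    do {
      $r = $Socket->get('http://www.google.com/search', array(
        'hl' => 'en',
        'as_sitesearch' => $site,
        'num' => $num,
        'filter' => 0,
        'start' => $start,
      ));
      // Result links are the anchors carrying class "l"
      if (!preg_match_all('/href="([^"]+)" class="?l"?/is', $r, $matches)) {
        if ($start == 0) {
          die($this->out('Error: Could not parse google results'));
        }
        // An empty follow-up page just means we ran out of results
        break;
      }
      $links = array_merge($links, $matches[1]);
      $start = $start + $num;
      // A page with fewer than $num links is the last one
    } while (count($matches[1]) >= $num);

    $links = array_unique($links);
    $this->out(sprintf('-> Found %d links on google:', count($links)));
    $this->hr();
    $this->out(join("\n", $links));
  }
}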
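
If you want to see what the regular expression in the shell picks up without hitting Google, here is a minimal sketch that runs the same pattern against a made-up HTML fragment. The markup and the example.com URLs are assumptions for illustration only, not actual Google output:

```php
<?php
// Hypothetical sample markup, loosely modeled on what the pattern expects.
$html = '<a href="http://example.com/" class=l>Home</a>'
      . '<a href="http://example.com/about" class="l">About</a>'
      . '<a href="http://example.com/cached">Cached</a>';

// Same pattern as in the shell: capture the href of anchors carrying class "l".
preg_match_all('/href="([^"]+)" class="?l"?/is', $html, $matches);

// $matches[1] holds the captured URLs; the third link has no class "l"
// right after its href, so it is skipped.
$links = $matches[1];
print_r($links);
```

Only the first two links should end up in $links, which is also why the shell breaks as soon as the class name or attribute order in the markup changes.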

Usage is as simple as running:

./cake google_index debuggable.com

This should produce output like this:

Welcome to CakePHP v1.2.0.7125 beta Console
---------------------------------------------------------------
App : app
Path: /Users/felix/dev/www/php5/debuggable/app
---------------------------------------------------------------
-> Found 293 links on google:
---------------------------------------------------------------
http://debuggable.com/
http://debuggable.com/contracting
http://debuggable.com/contact
http://debuggable.com/workshops
http://debuggable.com/open-source/fixtures-shell
http://debuggable.com/open-source/google-analytics-api
http://debuggable.com/posts/thinking-what:480f4dd5-5f1c-4d37-99b0-4768cbdd56cb
http://debuggable.com/posts/jquerycamp07:480f4dd6-8d40-44e1-8551-4a58cbdd56cb
...
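
With that list in hand, the actual migration check could look something like the rough sketch below. It is not part of the shell above, and it makes a few assumptions: the function names (mapToNewHost, statusCode, findBrokenUrls) and the $newHost parameter are made up for this example, the new site is assumed to keep the same paths under a different host, and plain http is hardcoded.

```php
<?php
// Rewrite an old URL to the new host, keeping path and query string.
// Assumption: only the host changes during the migration.
function mapToNewHost($url, $newHost) {
  $parts = parse_url($url);
  $path = isset($parts['path']) ? $parts['path'] : '/';
  $query = isset($parts['query']) ? '?' . $parts['query'] : '';
  return 'http://' . $newHost . $path . $query;
}

// Pull the numeric status code out of a status line such as "HTTP/1.1 200 OK".
function statusCode($statusLine) {
  if (preg_match('/^HTTP\/\S+ (\d{3})/', $statusLine, $m)) {
    return (int)$m[1];
  }
  return null;
}

// Return the mapped URLs that fail to respond or answer with a 4xx/5xx code.
function findBrokenUrls($links, $newHost) {
  $broken = array();
  foreach ($links as $link) {
    $target = mapToNewHost($link, $newHost);
    // get_headers() follows redirects; the first status line belongs to the
    // first response, so a 301 here counts as "handled".
    $headers = @get_headers($target);
    $code = $headers ? statusCode($headers[0]) : null;
    if ($code === null || $code >= 400) {
      $broken[$target] = $code;
    }
  }
  return $broken;
}
```

Feeding the links printed by the shell into findBrokenUrls($links, 'new.example.com') would give you the URLs that still need a redirect or a fix. If your URL scheme changes as well, mapToNewHost is the place to put your own mapping.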

Oh, and if you want to see more shell sample code, also check out our FixtureShell and the blog post for it.

-- Felix Geisendörfer aka the_undefined

PS: Please note that this is a quick hack, and any non-trivial change in the markup Google uses will break it. This is only meant for temporary usage.

 

Kim Biesbjerg said on Jun 10, 2008:

Cool! Just as I'm about to finish a cake website that is going to replace an old one! Any tips for redirecting these urls to the new ones? I usually do it in htaccess.

Khaled said on Jun 10, 2008:

Great post ... I didn't expect it to be as simple as that ... thanks Felix

Felix Geisendörfer said on Jun 11, 2008:

Kim: mod_rewrite is a good choice if you don't have a lot of URLs (< 1000?). For everything else I would catch CakePHP's error404 using an AppError handler and then check a table called legacy_urls for the correct mapping.

Marc said on Jun 11, 2008:

This worked beautifully by the way. Thanks a lot.

Kim Biesbjerg said on Jun 13, 2008:

Felix: Right, might actually make a useful plugin for my CMS. Fill the database with indexed urls and have a user interface where you can map the indexed url to the corresponding content on the new site. Great! Thanks for the code snippet - Really enjoy your posts!

Jean said on Jul 01, 2008:

Cool indeed.
