Monday, 6 December 2010

Chrome Crawler - A web-crawler written in JavaScript



EDIT: I now have a newer, better version of this called "Recursive"

Depending on your level of geekness you may or may not enjoy this one.

I proudly present Chrome Crawler, my latest Google Chrome extension:



The idea is simple really. You give it a URL and it goes off and finds all the links on that page, follows them to more pages, gets all the links on those pages and follows them too, and so on and so on.
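For the curious, the loop described above can be sketched roughly like this. This is a minimal illustration and not the extension's actual source: `extractLinks`, `crawl`, `fetchPage` and `maxPages` are all names made up for the example, and `fetchPage` stands in for whatever async function downloads a page's HTML.

```javascript
// Pull href values out of raw HTML and resolve them against the page URL.
// A regex is a simplification; a real extension could parse the DOM instead.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl);
      url.hash = ''; // treat page#a and page#b as the same page
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        links.push(url.href);
      }
    } catch (e) {
      // Ignore hrefs that cannot be resolved to a URL.
    }
  }
  return links;
}

// Breadth-first crawl. fetchPage(url) is a hypothetical async function that
// resolves to the page's HTML; maxPages stops the crawl from running forever.
async function crawl(startUrl, fetchPage, maxPages = 50) {
  const start = new URL(startUrl).href;
  const queue = [start];
  const seen = new Set(queue); // every URL ever queued, so nothing is visited twice
  const visited = [];
  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    let html;
    try {
      html = await fetchPage(url);
    } catch (e) {
      continue; // skip pages that fail to load
    }
    visited.push(url);
    for (const link of extractLinks(html, url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}
```

The `seen` set is what keeps two pages that link to each other from sending the crawler round in circles, and the page limit is there because "and so on and so on" would otherwise never stop on a big site.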

Along the way it checks each page to see if there are any 'interesting' files linked there. If it finds an interesting link it flags it for you so you can check it out.
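One simple way to decide what counts as 'interesting' is to look at the file extension on each link. The sketch below assumes exactly that; the extension list and the `isInteresting` and `flagInteresting` names are invented for the example and are not taken from the extension itself.

```javascript
// Hypothetical list of file extensions to flag; not the extension's real defaults.
const INTERESTING_EXTENSIONS = ['pdf', 'zip', 'mp3'];

// True if the link's path ends in one of the given file extensions.
function isInteresting(link, extensions = INTERESTING_EXTENSIONS) {
  let pathname;
  try {
    pathname = new URL(link).pathname; // drops any ?query and #fragment
  } catch (e) {
    return false; // not an absolute URL
  }
  const fileName = pathname.split('/').pop();
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return false; // no file name or no extension
  return extensions.includes(fileName.slice(dot + 1).toLowerCase());
}

// Keep only the links worth flagging.
function flagInteresting(links, extensions = INTERESTING_EXTENSIONS) {
  return links.filter((link) => isInteresting(link, extensions));
}
```

Passing the extension list in as a parameter means it could come straight from a user setting rather than being hard-coded.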

There's an options page that lets you customise the way it all works:



If you are still confused check out the video below:



So why did I make this? Well to be frank, I made it mostly "just 'cause I can"!

Also, having learned from my last Chrome extension project, PostToTumblr, I realised the Chrome API allows you to do some things that you wouldn't normally be allowed to do on a website (namely cross-origin XHR), and I wanted to do something to take advantage of it.
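To give a flavour of what that means in practice, here is a rough sketch of fetching a page with XMLHttpRequest. On an ordinary website this request would be blocked for other domains; it is only expected to work cross-origin from an extension page, and only assuming the extension's manifest requests permission for the target hosts. `fetchText` and its `XhrCtor` parameter are names made up for this example.

```javascript
// Fetch a page's text with XMLHttpRequest. XhrCtor is injectable so the
// function can be exercised outside a browser; it defaults to the global.
function fetchText(url, XhrCtor = globalThis.XMLHttpRequest) {
  return new Promise((resolve, reject) => {
    const xhr = new XhrCtor();
    xhr.open('GET', url, true); // true = asynchronous
    xhr.onload = () => {
      if (xhr.status >= 200 && xhr.status < 300) {
        resolve(xhr.responseText);
      } else {
        reject(new Error('HTTP ' + xhr.status + ' for ' + url));
      }
    };
    xhr.onerror = () => reject(new Error('Network error for ' + url));
    xhr.send();
  });
}
```

Wrapping the request in a Promise is just a convenience for the caller; the cross-origin part comes entirely from the extension's permissions, not from anything in this function.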

It didn't take me long to knock out this project: one lazy Saturday for the majority of the code, and today for a quick fix or two, writing this post and making the video. As such I expect there to be many bugs and problems, so if you encounter one drop me an email (my address is on the options page).

Oh, and finally, I wouldn't try using this on a Google page as you will likely end up seeing this quite often:



Anyways you can grab it over on the Chrome extensions gallery here. If you enjoy it please leave me a review / comment, much love!

13 comments:

  1. [...] started the second version of my Chrome Crawler extension a little while back. I have been using the language HaXe to develop it in. It’s a [...]

  2. This is some nice code to make custom crawlers with, thanks so much for sharing it.
    Love the way you handled the settings. I didn't have much experience with get and set but I'll be using them in the future.

    One thing that might be a nice addition is an option to block images, as it downloads the images when it makes the page and that's really unnecessary. Unfortunately the only way I know how to do that (content settings in options doesn't work) is to block images in the whole browser; I couldn't figure out a way to target just the stuff from the page being crawled... If you ever find a way I'd love to hear it.
    Here's how I block all images in Chrome from being downloaded (I put this in a separate extension)...

    chrome.webRequest.onBeforeRequest.addListener(
      function(info) {
        return {cancel: true};
      },
      // filters
      {
        urls: [
          "",
        ],
        types: ["image"]
      },
      // extraInfoSpec
      ["blocking"]);

  3. urggh... the code didn't come out right (stripping tags, I guess).
    The URLs entry should be left sharp bracket (no idea what they're really called ;)), all_urls, right sharp bracket, inside the quotes, i.e. "<all_urls>".

  4. hello sir, this extension is really awesome... i really liked it...

    i need some help from you.
    could you please contact me when you come online... or give me your mail id.

    my id : kkrajdurai@gmail.com

  5. Dumbest reason to implement a new scraper I have heard of. Basically: "I know how to be a criminal, so I thought I would be one."

  6. not sure what you mean there? criminal?

  7. Like the plugin! Think she needs a little polish but gets the job done!

  8. Hi - this is great but how can I crawl for pages with specific content/phrases?
    Yours thickly,
    Matt

  9. Hello,

    Can you fork it on github please? I would like to add features.

    Thanks !

  10. I am working on a similar project, a Chrome extension that needs to do exactly what your extension does: get all outbound links from the current website. Unfortunately, I have not been able to implement it or understand the algorithm or strategy behind making a recursive call to scan all links one by one. I am decent at JavaScript and it would be great if you could share some tips with me! :)

  11. Hi mate,
    You should search my site for "Recursive". It was a follow-up project I did that works on the same principle, with full source included and plenty of info!
    Mike

  12. It was wonderful to read about chrome crawler which is web crawler written in JavaScript. It was nice of you to share the options of the chrome crawler with image, as it was easy to understand through the image that you have shared.

  13. I absolutely love your site.. Excellent colors & theme.
    Did you create this website yourself? Please
    reply back as I'm hoping to create my own personal blog and would like to learn where you got this from or exactly what the theme is named.
    Kudos!

    Have a look at my weblog ... 超歓迎 トレンチコート 上品
