mikecann.co.uk: Chrome Crawler - A web-crawler written in JavaScript

Monday, 6 December 2010

Chrome Crawler - A web-crawler written in JavaScript

EDIT: I now have a newer, better version of this called "Recursive"

Depending on your level of geekness you may or may not enjoy this one.

I proudly present Chrome Crawler, my latest Google Chrome extension:

The idea is simple really. You give it a URL and it goes off and finds all the links on that page, follows them to more pages, gathers the links on those pages, follows them too, and so on.
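The loop described above can be sketched as a breadth-first crawl with a depth limit. This is only an illustration, not the extension's actual source: `extractLinks`, `crawl`, `fetchPage` and `maxDepth` are hypothetical names, and `fetchPage` is assumed to be any function that resolves to a page's HTML as a string (in an extension it could wrap a cross-origin XHR).

```javascript
// Hypothetical sketch, not the extension's real code.

// Pull href values out of anchor tags and resolve them against the page URL.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl);
      if (url.protocol === "http:" || url.protocol === "https:") {
        url.hash = ""; // treat page#a and page#b as the same page
        links.push(url.href);
      }
    } catch (e) {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Fetch the start page, then every newly found link, level by level.
// Pages up to `maxDepth` links away from the start are fetched; links found
// on the last level are recorded but not followed.
// `fetchPage(url)` is an assumed helper returning a Promise of HTML text.
async function crawl(startUrl, fetchPage, maxDepth) {
  const visited = new Set([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      let html;
      try {
        html = await fetchPage(url);
      } catch (e) {
        continue; // skip pages that fail to load
      }
      for (const link of extractLinks(html, url)) {
        if (!visited.has(link)) {
          visited.add(link); // the set stops us crawling a page twice
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}
```

The `visited` set is what keeps the crawl from looping forever when pages link back to each other, and the depth limit keeps it from wandering across the whole web.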

Along the way it checks each page for links to 'interesting' files. If it finds one it flags it for you so you can check it out.
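One simple way to decide whether a link is 'interesting' is to look at its file extension. The sketch below assumes that approach; the extension list and the names `INTERESTING_EXTENSIONS`, `isInteresting` and `flagInteresting` are made up for illustration and are not taken from the extension itself.

```javascript
// Hypothetical default list, assumed for this example only.
const INTERESTING_EXTENSIONS = ["pdf", "zip", "mp3"];

// Return true when the link's path ends in one of the given extensions.
function isInteresting(link, extensions = INTERESTING_EXTENSIONS) {
  let pathname;
  try {
    pathname = new URL(link).pathname; // drops any query string or fragment
  } catch (e) {
    return false; // not an absolute URL we can parse
  }
  const lastDot = pathname.lastIndexOf(".");
  if (lastDot === -1) return false;
  const ext = pathname.slice(lastDot + 1).toLowerCase();
  return extensions.includes(ext);
}

// Keep only the links that should be flagged for the user.
function flagInteresting(links, extensions) {
  return links.filter((link) => isInteresting(link, extensions));
}
```

Passing a different `extensions` array to `flagInteresting` stands in for whatever the user has configured.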

There's an options page that lets you customise the way it all works:

If you are still confused check out the video below:

So why did I make this? Well to be frank, I made it mostly "just 'cause I can"!

Also, having learned from my last Chrome extension project, PostToTumblr, I realised the Chrome API lets you do some things that you wouldn't normally be allowed to do on a website (namely cross-origin XHR), and I wanted to do something to take advantage of it.

It didn't take me long to knock out this project: one lazy Saturday for the majority of the code, and today for a quick fix or two, writing this post and making the video. As such I expect there to be many bugs and problems, so if you encounter one drop me an email (my address is on the options page).

Oh, finally, I wouldn't try using this on a Google page as you will likely end up seeing this quite often:

Anyways you can grab it over on the Chrome extensions gallery here. If you enjoy it please leave me a review / comment, much love!

13 comments:

MikeCann.co.uk » Blog Archive » Chrome Crawler, HaXe, Three.js, WebGL and 2D Sprites - 12 June 2011 at 11:13
[...] started the second version of my Chrome Crawler extension a little while back. I have been using the language HaXe to develop it in. It’s a [...]

PAEz - 16 January 2012 at 00:02
This is some nice code to make custom crawlers with, thanks so much for sharing it.
Love the way you handled the settings. I didn't have much experience with get and set, but I'll be using them in the future.

One thing that might be a nice addition is an option to block images, as it downloads the images when it builds the page and that's really unnecessary. Unfortunately the only way I know how to do that (content settings in options doesn't work) is to block images in the whole browser; I couldn't figure out a way to target just the page being crawled. If you ever find a way I'd love to hear it.
Here's how I block all images in Chrome from being downloaded (I put this in a separate extension):

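chrome.webRequest.onBeforeRequest.addListener(
  // Cancel every request that matches the filter below.
  function (info) {
    return { cancel: true };
  },
  // filters: image requests to any URL
  {
    urls: ["<all_urls>"],
    types: ["image"]
  },
  // extraInfoSpec: "blocking" lets the listener cancel the request
  ["blocking"]
);

PAEz - 16 January 2012 at 00:06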
urggh...the code didn't come out right (stripping tags I guess).
The urls entry should be a left angle bracket (no idea what they're really called ;)), then all_urls, then a right angle bracket, all inside the quotes: "<all_urls>"

Raja - 9 June 2012 at 00:39
Hello sir, this extension is really awesome... I really liked it...

I need some help from you.
Could you please contact me when you come online... or give me your mail id.

my id : kkrajdurai@gmail.com

Angry - 23 March 2013 at 21:10
Dumbest reason to implement a new scraper I have heard of. Basically: "I know how to be a criminal, so I thought I would be one."

mikecann - 24 March 2013 at 03:06
Not sure what you mean there? Criminal?

Chaz - 2 April 2013 at 12:05
Like the plugin! Think she needs a little polish but gets the job done!

Matt - 19 August 2013 at 08:41
Hi - this is great but how can I crawl for pages with specific content/phrases?
Yours thickly,
Matt

Aurélien - 19 August 2013 at 11:45
Hello,

Can you fork it on GitHub please? I would like to add features.

Thanks!

Shivin Saxena - 29 September 2013 at 21:02
I am working on a similar Chrome extension project which needs to do exactly what your extension does: get all outbound links from the current website. Unfortunately, I have not been able to implement it or understand the algorithm or strategy behind making a recursive call to scan all links one by one. I am decent at JavaScript and it would be great if you could share some tips with me! :)

mikecann - 29 September 2013 at 22:53
Hi mate, you should search my site for "Recursive". It was a follow-up project I did that works on the same principle, full source included with plenty of info! Mike

windows 8 upgrade - 6 November 2013 at 04:23
It was wonderful to read about Chrome Crawler, which is a web crawler written in JavaScript. It was nice of you to share the options of Chrome Crawler with an image, as it was easy to understand through the image that you have shared.

超歓迎トレンチコート上品 - 10 December 2013 at 19:57
I absolutely love your site. Excellent colors & theme.
Did you create this website yourself? Please reply back as I'm hoping to create my own personal blog and would like to learn where you got this from or exactly what the theme is named.
Kudos!

Have a look at my weblog ... 超歓迎トレンチコート上品