Monday 6 December 2010
Chrome Crawler - A web-crawler written in JavaScript
EDIT: I now have a newer, better version of this called "Recursive"
Depending on your level of geekness you may or may not enjoy this one.
I proudly present Chrome Crawler, my latest Google Chrome extension:
The idea is simple really. You give it a URL and it goes off and finds all the links on that page, follows them to more pages, gets all the links on those pages, follows them too, and so on.
Along the way it checks each page to see if there are any 'interesting' files linked there. If it finds an interesting link, it flags it for you so you can check it out.
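The crawl described above can be sketched roughly as follows. This is a minimal illustration and not the extension's actual source: the names `extractLinks`, `isInteresting` and `crawl`, the regex-based link extraction, the `maxPages` limit and the default file extensions are all assumptions made for the example.

```javascript
// Hypothetical helper: pull href values out of raw HTML and resolve them
// against the page URL. A regex is a simplification; a DOM parser is sturdier.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl);
      if (url.protocol === "http:" || url.protocol === "https:") {
        url.hash = ""; // Treat page.html#a and page.html#b as the same page.
        links.push(url.href);
      }
    } catch (e) {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Hypothetical rule for "interesting": the path ends in a listed extension.
function isInteresting(url, extensions) {
  const path = new URL(url).pathname.toLowerCase();
  return extensions.some((ext) => path.endsWith("." + ext));
}

// Breadth-first crawl. fetchHtml is supplied by the caller and must map a
// URL to a Promise of that page's HTML.
async function crawl(startUrl, fetchHtml, options) {
  const { maxPages = 50, extensions = ["pdf", "zip", "mp3"] } = options || {};
  const visited = new Set();
  const interesting = new Set();
  const queue = [new URL(startUrl).href];

  while (queue.length > 0 && visited.size < maxPages) {
    const pageUrl = queue.shift();
    if (visited.has(pageUrl)) continue;
    visited.add(pageUrl);

    let html;
    try {
      html = await fetchHtml(pageUrl);
    } catch (e) {
      continue; // Skip pages that fail to load.
    }

    for (const link of extractLinks(html, pageUrl)) {
      if (isInteresting(link, extensions)) {
        interesting.add(link); // Flag the file, but do not crawl into it.
      } else if (!visited.has(link)) {
        queue.push(link);
      }
    }
  }
  return { visited: [...visited], interesting: [...interesting] };
}
```

Where `fetch` is available, a call might look like `crawl("http://example.com/", (url) => fetch(url).then((r) => r.text()), { maxPages: 20 })`. A queue plus a visited set is used here instead of literal recursion so that pages linking back to each other cannot cause an endless loop.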
There's an options page that lets you customise the way it all works:
If you are still confused check out the video below:
So why did I make this? Well to be frank, I made it mostly "just 'cause I can"!
Also, having learned from my last Chrome extension project, PostToTumblr, I realised the Chrome API allows you to do some things that you wouldn't normally be allowed to do on a website (namely cross-origin XHR), and I wanted to do something to take advantage of it.
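As a rough illustration of the cross-origin XHR mentioned above, here is a small Promise wrapper around XMLHttpRequest. It is a sketch under assumptions, not code from the extension: the name `fetchHtml` and the optional `XhrClass` parameter are made up for the example, and it assumes the extension's manifest.json lists the target hosts (for example "http://*/*" and "https://*/*") under "permissions", which is what lets an extension page make requests that an ordinary web page could not.

```javascript
// Hypothetical helper: load a page's HTML with XMLHttpRequest.
// From an extension page, cross-origin requests only succeed for hosts
// declared under "permissions" in manifest.json (an assumption of this sketch).
function fetchHtml(url, XhrClass) {
  // XhrClass is optional and only exists so the function can be stubbed.
  const Xhr = XhrClass || XMLHttpRequest;
  return new Promise((resolve, reject) => {
    const xhr = new Xhr();
    xhr.open("GET", url, true); // true = asynchronous
    xhr.onload = () => {
      if (xhr.status >= 200 && xhr.status < 300) {
        resolve(xhr.responseText);
      } else {
        reject(new Error("HTTP " + xhr.status + " for " + url));
      }
    };
    xhr.onerror = () => reject(new Error("Network error for " + url));
    xhr.send();
  });
}
```

On a normal website the same request to another domain would be stopped by the same-origin policy unless the remote server opted in, which is why this only becomes interesting inside an extension.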
It didn't take me long to knock out this project: one lazy Saturday for the majority of the code, and today for a quick fix or two, to write this post and to make the video. As such I expect there to be many bugs and problems, so if you encounter one drop me an email (my address can be found on the options page).
Oh, and finally, I wouldn't try using this on a Google page as you will likely end up seeing this quite often:
Anyways, you can grab it over on the Chrome extensions gallery here. If you enjoy it please leave me a review / comment, much love!
[...] started the second version of my Chrome Crawler extension a little while back. I have been using the language HaXe to develop it in. It’s a [...]
This is some nice code to make custom crawlers with, thanks so much for sharing it.
Love the way you handled the settings. I didn't have much experience with get and set, but I'll be using them in the future.
One thing that might be a nice addition is an option to block images, as it downloads the images when it makes the page and that's really unnecessary. Unfortunately the only way I know how to do that (content settings in options doesn't work) is to block images in the whole browser; I couldn't figure out a way to target just the stuff from the page being crawled. If you ever find a way I'd love to hear it.
Here's how I block all images in Chrome from being downloaded (I put this in a separate extension):
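chrome.webRequest.onBeforeRequest.addListener(
  // Cancel every matching request before it is sent
  function(info) {
    return {cancel: true};
  },
  // filters: image requests on any URL
  {
    urls: ["<all_urls>"],
    types: ["image"]
  },
  // extraInfoSpec: "blocking" lets the listener cancel the request
  ["blocking"]);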
urggh...the code didn't come out right when I first posted it (stripping tags, I guess)
the urls entry should be <all_urls>, angle brackets included, inside the quotes
Hello sir, this extension is really awesome... I really liked it.
I need some help from you.
Could you please contact me when you come online, or give me your mail ID?
my id : kkrajdurai@gmail.com
Dumbest reason to implement a new scraper I have heard of. Basically: "I know how to be a criminal, so I thought I would be one."
not sure what you mean there? criminal?
Like the plugin! Think she needs a little polish but gets the job done!
Hi - this is great but how can I crawl for pages with specific content/phrases?
Yours thickly,
Matt
Hello,
Can you fork it on GitHub please? I would like to add features.
Thanks !
I am working on a similar project on Chrome extensions which needs to do exactly what your extension does: get all outbound links from the current website. Unfortunately, I have not been able to implement it or understand the algorithm or strategy behind making a recursive call to scan all links one by one. I am decent at JavaScript and it would be great if you could share some tips with me! :)
Hi mate, you should search my site for "Recursive". It was a follow-up project I did that works on the same principle, full source included with plenty of info!
Mike
It was wonderful to read about chrome crawler which is web crawler written in JavaScript. It was nice of you to share the options of the chrome crawler with image, as it was easy to understand through the image that you have shared.
I absolutely love your site.. Excellent colors & theme.
Did you create this website yourself? Please reply back as I'm hoping to create my own personal blog and would like to learn where you got this from or exactly what the theme is named.
Kudos!