If you got past the title of this post, chances are you’ve executed a command something like this:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
This crawls the Web site and creates a local copy of the content so it can be served much like the original remote material.
A search engine depends on being able to see the content in order to index it. If the back end of the system is going to change, then it won't be possible to make an archive of the populated pages. A cache will be of no use if it doesn't contain the material it's meant to be caching.
Many years ago AaronSw wrote Bake, Don't Fry, in which he argued for the advantages of having a static, Baked, filesystem-served site over having the pages Fried, i.e. dynamically generated from a database.
For the pedants I'll note that 'static' isn't an entirely accurate term: the filesystem is acting as a database for the pages. But filesystems are typically a layer or two further down the software stack than e.g. a MySQL DB, and files are so familiar that they might as well be written in stone.
A Web crawler typically maintains a list of URLs over which to operate, and contains a few core functional units that are used in sequence over each page:
- page getter – retrieve a representation of a URL in the list from the Web
- link extractor – parse/scrape a given page, pull out the URLs, add them to the list
- URL filter – typically only pages within a particular domain or path will be required
- URL-to-filename translator
- page saver – dump the page, with converted links, to the local filesystem
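The units above can be sketched as a minimal breadth-first crawl loop. This is my own illustrative sketch, not Clonio's actual code: the `fetch` and `save` functions are injected so any page getter (wget-style or Selenium-based) could be plugged in, and the link extractor here is a deliberately naive regex:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import re

def crawl(start_url, fetch, save, allowed_netloc):
    """Minimal crawl: get each page, extract links, filter, save, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)                                   # page getter
        save(url, html)                                     # page saver
        for href in re.findall(r'href="([^"]+)"', html):    # link extractor (naive)
            absolute = urljoin(url, href)
            # URL filter: stay within the wanted domain, skip already-seen pages
            if urlparse(absolute).netloc == allowed_netloc and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

A real crawler would need robust HTML parsing and politeness (delays, robots.txt), but the shape of the loop is the same.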
Incidentally, I’ve written quite a few crawlers in a variety of languages over the years (if you’re involved in Web coding, an archetypal pattern for learning a language probably starts something like: “Hello World!”, TODO List Manager, Web Crawler, Blog Engine…). So I can confidently say the trickiest part is the matching and translating of the URLs: it can get messy.
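To illustrate the messiness, here is a hedged sketch of a URL-to-filename translator, loosely in the spirit of wget's --adjust-extension behaviour. The exact rules below are my assumptions for illustration, not wget's or Clonio's actual logic:

```python
from urllib.parse import urlparse

def url_to_filename(url):
    """Map a URL to a local file path, adding index.html for directory-like URLs."""
    parts = urlparse(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"                # directory-style URL gets an index file
    elif "." not in path.rsplit("/", 1)[-1]:
        path += ".html"                     # extensionless page gets .html appended
    if parts.query:
        path += "?" + parts.query           # keep queries distinct; may need escaping
    return parts.netloc + path
```

Even this toy version has edge cases (query strings, fragments, characters illegal in filenames), which is exactly where the mess creeps in.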
I had a look around at the various options available, and the one that looked easiest was Selenium. From the site:
What is Selenium?
Selenium automates browsers. That’s it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.
It’s a sizeable framework which can run on Windows, Linux, and OS X. It has a DSL and bindings for a host of different languages. It even has an IDE (a Firefox plugin). Apache 2 license.
But only a small (though significant) part of the framework is actually needed here. The work involved is minimal. I’ve been using a lot of Python recently so that’s what I went for – most other popular languages are supported. I did find a snippet that came close to what I needed (alas, I lost the link). The key bits of code are just:
from selenium import webdriver
...
driver = webdriver.Firefox()
...
# retrieve the page
driver.get(page)
...
# extract the links from the DOM
elements = driver.find_elements_by_xpath("//a")
...
content = driver.page_source
...
driver.quit()
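Fleshing out the extraction step slightly: this is my own sketch rather than the snippet I found, but with Selenium the element's `get_attribute("href")` gives the resolved link target, so the extractor can be a one-liner wrapped in a function that takes any driver:

```python
def extract_links(driver):
    """Pull the link targets out of the page currently loaded in the driver."""
    elements = driver.find_elements_by_xpath("//a")
    # skip anchors with no href (e.g. named anchors)
    return [e.get_attribute("href") for e in elements if e.get_attribute("href")]
```

Taking the driver as a parameter also makes the function easy to exercise with a stub in tests, without launching a real browser.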
It is necessary to put a suitable driver on the system path. I used geckodriver, and as I already had Firefox installed, for me it was a simple matter of copying the driver file to /usr/local/bin.
I wasted a few hours with the script: the links were getting garbled, but I couldn’t see why. I tried loads of things, even starting a blog post in the hope that it would clear my thoughts… At some point I deleted one special-case link from the site I was crawling, and soon after noticed it was still showing up in the automated Firefox. D’oh! Firefox was caching.
Unfortunately there’s no way to turn this off programmatically (yet?), but it is straightforward to achieve by creating a custom Firefox profile.
First locate a usable starter profile. If you cd to ~/.mozilla/firefox/ and open profiles.ini there will be lines like:
[Profile0]
Name=default
IsRelative=1
Path=70abdmdv.default
Default=1
That’s the only profile on this machine (I generally use Chrome), so hopefully most of the settings I want will be the defaults. So I copy the whole directory:
cp -r 70abdmdv.default profile.Selenium
And edit the new profile:
Ew! It looks like I lost a bit of this write-up (or got bored).
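Presumably what went missing was the actual profile edit that disables caching. As a plausible reconstruction (my assumption, not the recovered original): add a user.js to the new profile directory turning the cache preferences off. These pref names are standard Firefox ones, though the exact set the original used is lost:

```
// profile.Selenium/user.js -- disable caching so Selenium always sees fresh pages
user_pref("browser.cache.disk.enable", false);
user_pref("browser.cache.memory.enable", false);
user_pref("browser.cache.offline.enable", false);
user_pref("network.http.use-cache", false);
```

Point Selenium at the profile when constructing the driver, e.g. webdriver.Firefox(webdriver.FirefoxProfile("/path/to/profile.Selenium")).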
Anyhow, the good news is I’ve got some code basically working, so I gave it a name and popped it in a GitHub repo: Clonio.
It’s written in Python (2.*). Right now the configuration is just a few lines at the top of the file, which should be self-explanatory.
As and when I have time, I’ll tidy it up, put together some proper docs (and maybe port it to Node as well).