Downloading top X sites’ data with ZombieJS / Stoyan's phpied.com

Update: Easier way to get top X URLs: http://httparchive.org/urls.php, thanks @souders

Update: found and commented out an offending try{}catch(e){throw e;} in zombie.js (q.js, line 126); now the script doesn't die with a fatal error as often

Say you want to experiment or do some research with what's out there on the web. You need data, real data from the web's state of the art.

In the past I've scripted IE with the help of HTTPWatch's API and exported data from Fiddler. I've also fetched stuff from HTTPArchive. It doesn't have everything, but you can still manage to refetch the missing pieces.

Today I played with something called Zombie.js and thought I should share.

Task: fetch all HTML for Alexa top 1000 sites

Alexa

Buried in an FAQ is a link to download the top 1 million sites according to Alexa. Fetch, unzip, parse csv for the first 1000:

$ curl http://s3.amazonaws.com/alexa-static/top-1m.csv.zip > top1m.zip
$ unzip top1m.zip
$ head -1000 top-1m.csv | awk 'BEGIN { FS = "," } ; { print $2 }' > top1000.txt
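
If you'd rather stay in Node for this step, here's a minimal sketch of the same extraction in JavaScript. The helper name topDomains is made up for illustration, and it assumes every line looks like rank,domain, as in the Alexa file.

```javascript
// Hypothetical helper (not part of any library): return the first `n`
// domains from Alexa-style CSV text, one "rank,domain" pair per line.
function topDomains(csvText, n) {
  return csvText
    .split('\n')
    .map(function (line) { return line.trim(); })        // also drops stray \r
    .filter(function (line) { return line.length > 0; }) // skip blank lines
    .slice(0, n)
    .map(function (line) {
      // keep everything after the first comma (assumes two fields per line)
      return line.slice(line.indexOf(',') + 1);
    });
}
```

Something like write('top1000.txt', topDomains(read('top-1m.csv', 'utf8'), 1000).join('\n')) should then give roughly the same list as the awk one-liner.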

Zombie.js

Zombie is a headless browser written in JS. It's not WebKit or any other real engine, but for the purposes of fetching stuff it's perfectly fine. Is it any better than curl? Yup, it executes JavaScript, for one. It also provides a DOM API and sizzling selectors to hunt for stuff on the page.

$ npm install zombie

Taking Zombie for a spin

You can just fiddle with it in the Node console:

$ node

> var Zombie = require("zombie");
undefined
> var browser = new Zombie();
undefined
> browser.visit("http://phpied.com/", function() {console.log(browser.html())})
[object Object]
> <html><head>
<meta charset="UTF-8" />....

The html() method returns generated HTML after JS has had a chance to run which is pretty cool (there may be problems with document.write though, but, hell, document-stinking-write!?).

You can also get a part of the HTML using a CSS selector, e.g. browser.html('head') or browser.html('#sidebar'). More APIs...

The script

var Zombie = require("zombie");
var read = require('fs').readFileSync;
var write = require('fs').writeFileSync;
 
read('top1000.txt', 'utf8').trim().split('\n').forEach(function(url, idx) {
  var browser = new Zombie();
  browser.visit('http://www.' + url, function () {
    if (!browser.success) {
      return;
    }
    write('fetched/' + (idx + 1) + '.html', browser.html());
 
  });
});

Next challenge: fetch all CSS and strip JS

The API provides browser.resources which is an array of JS resources and really useful - HTTP request/responses/headers and everything.

Since I'll need all this HTML and CSS for a visual test later on, I don't want the page to change from one run to the next, e.g. to include different ads. So let's strip all JavaScript. While at it, also get rid of other possibly changing content, like iframes, embeds, objects. Finally all images should be spacer gifs, so there will be no surprises there either.

The bad

After some experimentation, I think I can conclude that Zombie.js is not (yet) suitable for this kind of "in the wild" downloading of sites. It's a jungle out there.

It could be that my script is not good enough, but I couldn't find a way to catch errors gracefully (wrapping browser.visit() in a try-catch didn't help).
All JS errors show up, some cause the script to hang
The script hangs on 404s.
Sometimes it just exits.
Relative URLs (e.g. to fetch a CSS file) don't seem to work.
I couldn't get the CSS resources included (probably a bug in the version I'm using) so had to hack the defaults to enable CSS.
I had to kill the script when it hung and restart it pretty often; that's why I write a file with the index of the next top X site to fetch.
All in all I had something like 25% errors fetching the HTML and CSS.
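
For the hangs in particular, one possible guard is a timeout around each fetch, so that the callback always fires one way or the other. This is only a sketch under assumptions: withTimeout is a hypothetical helper, not part of Zombie, and it is untested against Zombie itself. It also can't rescue a process that exits on its own or gets stuck in a synchronous loop.

```javascript
// Hypothetical helper: run `task(done)` and guarantee that `callback`
// fires exactly once, either with the task's outcome or a timeout error.
function withTimeout(task, ms, callback) {
  var finished = false;

  var timer = setTimeout(function () {
    if (finished) return;
    finished = true;
    callback(new Error('timed out after ' + ms + 'ms'));
  }, ms);

  function done(err, result) {
    if (finished) return; // ignore late or repeated calls
    finished = true;
    clearTimeout(timer);
    callback(err || null, result);
  }

  try {
    task(done);
  } catch (e) {
    done(e); // only catches synchronous throws
  }
}
```

The idea would be to start the browser.visit() call inside task, call done from its callback, and move on to the next URL from callback whether or not an error came back.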

The good

Loading a page in a fast headless browser with DOM access - priceless!

And it worked! (Well, in 75% of the sites out there)

I'm sure in a more controlled environment with error-free JS and no 404s, it will behave way better.

The script

var Zombie = require("zombie");
var read = require('fs').readFileSync;
var write = require('fs').writeFileSync;
 
var urls = read('top1000.txt', 'utf8').trim().split('\n');
 
// where are we in the list of URLs
// (idx.txt must already exist; seed it with 0 before the first run)
var idx = parseInt(read('idx.txt', 'utf8'), 10);
console.log(idx);
 
// go do it
download(idx);
 
function download(idx) {
  // remember the place in the likely scenario that
  // the script crashes
  // (newer Node versions reject a bare number here, so pass a string)
  write('idx.txt', String(idx + 1));
 
  var url = urls[idx];
 
  if (!url) {
    // we're done!
    console.log('yo!');
    process.exit();
  }
 
  var browser = new Zombie();
 
  browser.visit('http://www.' + url, function () {
    if (!browser.success) {
      console.log(idx + " [SKIP]");
      return download(idx + 1); // keep going instead of stopping the chain
    }
 
    var map = {}; // need to match link hrefs to resource indices
    browser.resources.forEach(function(r, i) {
      map[r.request.url] = i;
    });
 
    // collect all CSS, external and inline
    var css = [];
    var sss = 'link[rel=stylesheet], style'; // Select me Some Styles
    [].slice.call(browser.querySelectorAll(sss)).forEach(function(e) {
      if (e.nodeName.toLowerCase() === 'style') {
        css.push('/**** inline ****/');
        css.push(e.textContent);
      } else {
        var i = map[e.href];
        // compare against undefined so that resource index 0 isn't skipped
        if (i !== undefined && browser.resources[i].response) {
          css.push('/**** ' + e.href + ' ****/');
          css.push(browser.resources[i].response.body);
        }
      }
    });
 
    // remove style and these nodes that may cause the UI to change
    // from one run to the next
    var stripem = 'style, iframe, object, embed, link, script';
    [].slice.call(browser.querySelectorAll(stripem)).forEach(function(node) {
      if (node.parentNode) {
        node.parentNode.removeChild(node);
      }
    });
 
    // replace every image with a spacer gif so images can't change either
    [].slice.call(browser.querySelectorAll('img')).forEach(function(node) {
      node.src = 'spacer.gif';
    });
 
    // placeholder, probably useless
    browser.body.appendChild(
      browser.document.createComment('specialcommenthere'))
 
    // we got the stuffs!
    var html = browser.html();
    css = css.join('\n');
 
    if (html && css) { // do we? ... got? ... the stuffs?
      write('fetched/' + (idx + 1) + '.html', html);
      write('fetched/' + (idx + 1) + '.css', css);
      console.log(idx + " [OK]");
    } else {
      console.log(idx + " [SKIP]");
    }
 
    download(++idx); // next!
 
  });
 
}
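
The map[e.href] lookup above only finds a stylesheet when the href matches the request URL character for character, so relative hrefs may be one reason CSS goes missing. A possible workaround is to resolve each href against the page URL first. The sketch below uses the URL class built into newer Node versions (it's not part of Zombie), and resolveHref is a made-up name.

```javascript
// Hypothetical helper: turn a possibly relative href into an absolute URL,
// using the page's own URL as the base. Returns null if it can't be parsed.
function resolveHref(href, pageUrl) {
  try {
    return new URL(href, pageUrl).href;
  } catch (e) {
    return null;
  }
}
```

In the script, that would mean looking up map[resolveHref(e.href, 'http://www.' + url)] instead of map[e.href], assuming the resource URLs are stored in absolute form and the page wasn't redirected somewhere else.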
