Simple Sharding Logic

May 6th, 2011. Tagged: performance

Sharding is the technique where you split static components across multiple domains so the browser can fetch more components in parallel. (Which browser? How many components per hostname? Ask browserscope.org.)

So you have, say, image-a.png and image-b.png. You want to serve image-a.png from domain1.example.org and image-b.png from domain2.example.org.

The thing is, you don't want to randomize which image goes to which host from one page view to the next. Otherwise you'll end up like cnn.com, where the same image (a 1x1 gif) is served from three different domains (in the same page view, even), causing three HTTP requests.

Here's a simple solution: use modulus to split components into buckets based on their URL (or path name).

Billy "Zoompf" Hoffman mentioned this as a simple strategy - if you want to split to two hostnames, take the length of the image path. If it divides by two - serve from host 2. Otherwise - host 1.

Same simple logic can apply for splitting into as many hostname buckets as you want. Need three buckets? % 3. Four buckets? % 4.

You can even do it per browser, for example if it's IE6, split to more buckets.
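
As a rough sketch of the per-browser idea, the number of buckets could be picked from the user agent string. The helper name numBucketsFor and the counts (4 for IE6, 2 for everything else) are assumptions made up for illustration, not measured values; browserscope.org is the place to check real per-hostname limits.

```javascript
// Hypothetical helper: decide how many hostnames to shard across.
// The counts below are illustrative assumptions, not recommendations.
function numBucketsFor(userAgent) {
  var isIE6 = /MSIE 6\./.test(userAgent);
  return isIE6 ? 4 : 2;
}

console.log(numBucketsFor('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')); // 4
```

The result would then be passed as the number of buckets to whatever bucketing function you use.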

Here's the SSL (Simple Sharding Logic 🙂 ) expressed in JavaScript:

function getBucket(url, numbuckets) {
  var number = url.length,
       group = number % numbuckets;
  return group;
}
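
To turn the bucket number into an actual URL, one option is to keep a list of hostnames and index into it. This is only a sketch: the hosts array and the shardUrl name are assumptions, and getBucket is repeated so the snippet runs on its own.

```javascript
// Same logic as getBucket() above, repeated to keep the snippet self-contained.
function getBucket(url, numbuckets) {
  return url.length % numbuckets;
}

// Hypothetical hostnames, one per bucket.
var hosts = ['domain1.example.org', 'domain2.example.org'];

// Build a full URL for a path; the same path always maps to the same host.
function shardUrl(path) {
  return 'http://' + hosts[getBucket(path, hosts.length)] + path;
}

console.log(shardUrl('/images/top.png'));  // http://domain2.example.org/images/top.png
console.log(shardUrl('/images/logo.png')); // http://domain1.example.org/images/logo.png
```

Which bucket index corresponds to which hostname is arbitrary; what matters is that the mapping never changes between page views.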

If you think most of your paths will have the same length, e.g. /images/top.png, /images/nav.png, /images/bot.png, you can use some other number, e.g. the character code of the middle letter in the path.

var number = url.charCodeAt(Math.floor(url.length / 2));

Or the code of the letter at 1/4 of the path + the one in the middle + the one at 3/4 of the path. You get the point: all you need is a number that can be produced from the file path (or content or anything) and won't change from one page view to the next. You need stable file-to-hostname resolutions above all.
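
The three-position variant could look something like this. The names getNumber and getBucketByChars are made up for this sketch, and the empty-string guard is an added assumption so the function never returns NaN.

```javascript
// Hypothetical variant: add up the character codes found at
// 1/4, 1/2 and 3/4 of the path.
function getNumber(url) {
  var len = url.length;
  if (len === 0) {
    return 0; // charCodeAt() on an empty string would give NaN
  }
  return url.charCodeAt(Math.floor(len / 4)) +
         url.charCodeAt(Math.floor(len / 2)) +
         url.charCodeAt(Math.floor(3 * len / 4));
}

function getBucketByChars(url, numbuckets) {
  return getNumber(url) % numbuckets;
}

console.log(getBucketByChars('/images/top.png', 3));
```

Sampled positions can still hold the same characters in paths that differ only in a few letters, so it's worth checking the spread on your real URLs before settling on a formula.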

You can also pass an array of components and get back an array of arrays with the components grouped into buckets. Here's what I tried on cnn.com, and it worked beautifully:

function toBuckets(stuff, numbuckets) {
  var url, group,
      buckets,
      cache = {}; // url -> bucket index, used to skip duplicates
  numbuckets = parseInt(numbuckets, 10);
  buckets = new Array(numbuckets);
  for (var i = 0, max = stuff.length; i < max; i++) {
    url = stuff[i].src;

    // already bucketed this URL, skip it
    if (typeof cache[url] === 'number') {
      continue;
    }
    group = getBucket(url, numbuckets);
    if (!buckets[group]) {
      buckets[group] = [];
    }
    buckets[group].push(url);
    cache[url] = group;
  }
  return buckets;
}
 
console.log(toBuckets(document.images, 3));

This gave me (on cnn.com) a nice list of three buckets and the URLs in each:

[
  ["http://i2.cdn.turner.co...seals.02.cnn%5B1%5D.jpg", 
   "http://i2.cdn.turner.co...ves/tzvids.osama.gi.jpg", 
   "http://i.cdn.turner.com.../misc/advertisement.gif", 5 more...], 
  ["http://i.cdn.turner.com...der/hat/arrow_black.png",
   "http://i2.cdn.turner.co...ds.military.dogs.gi.jpg",
   "http://i2.cdn.turner.co...bl.house.cnn.120x68.jpg", 8 more...], 
  ["http://i.cdn.turner.com...element/img/3.0/1px.gif",
   "http://i.cdn.turner.com...bal/header/hdr-main.gif", 
   "http://i.cdn.turner.com...al/header/nav-arrow.gif", 9 more...]
]

Not too bad for such simplicity.

As you can see, there's a cache object that takes care of any duplicates, so I don't end up with the same component repeated.

The drawback, of course, is that it's unlikely that you'll get exactly the same number of components in each bucket. But from what I tried on a few sites, that's not a big issue. The benefit of having a stable and simple (no databases, cookies, local storage, etc.) resolution of file path to a hostname is a win.
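
If you want to see how uneven the split is for your own URLs, a quick count per bucket is enough. The bucketSizes helper and the sample paths are made up for illustration; the bucketing rule is the same length-based modulus used in getBucket.

```javascript
// Hypothetical helper: count how many URLs land in each bucket.
function bucketSizes(urls, numbuckets) {
  var sizes = [];
  for (var b = 0; b < numbuckets; b++) {
    sizes.push(0);
  }
  for (var i = 0; i < urls.length; i++) {
    sizes[urls[i].length % numbuckets]++; // same rule as getBucket()
  }
  return sizes;
}

console.log(bucketSizes(['/a.png', '/bb.png', '/ccc.png', '/dddd.png'], 3)); // [2, 1, 1]
```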


Hit me up on Twitter: @stoyanstefanov