Simple Sharding Logic

Sharding is the technique where you split static components across multiple domains so the browser can fetch more components in parallel? (Which browser? How many components per hostname? Ask browserscope.org)

So you have, say, image-a.png and image-b.png. You want to serve image-a from domain1.example.org and image-b.png from domain2.example.org

The thing is you don't want to randomize which image goes to which host in different page views. Otherwise you'll do like cnn.com where the same image (a 1x1 gif) is served from three different domains (in the same pageview even) causing three HTTP requests.

Here's a simple solution: use modulus to split components into buckets based on their URL (or path name).

Billy "Zoompf" Hoffman mentioned this as a simple strategy - if you want to split to two hostnames, take the length of the image path. If it divides by two - serve from host 2. Otherwise - host 1.

Same simple logic can apply for splitting into as many hostname buckets as you want. Need three buckets? % 3. Four buckets? % 4.

You can even do it per browser, for example if it's IE6, split to more buckets.

Here's the SSL (Simple Sharding Logic :) ) expressed in JavaScript:

function getBucket(url, numbuckets) {
  var number = url.length,
       group = number % numbuckets;
  return group;
}

If you think most of your paths will have same length, e.g. /images/top.png, /images/nav.png, /images/bot.png you can use some other number, e.g. the character code of the middle letter in the path.

var number = url.charCodeAt(parseInt(url.length/2));

Or the code of the letter in 1/4 of the path + the one in the middle + the on in 3/4 of the path. You get the point - all you need is a number that can be produced from the file path (or content or anything) and won't change from one page view to the next. You need stable file-to-hostname resolutions above all.

You can also pass an array of components and get a multi-array of the components, grouped into buckets. Here's what I tried on cnn.com and worked beautifully:

function toBuckets(stuff, numbuckets) {
  var numbuckets = parseInt(numbuckets, 10),
      url, group,
      buckets = Array(numbuckets),
      cache = {};
  for (var i = 0, max = stuff.length; i < max; i++) {
    url = stuff[i].src;

    if (typeof cache[url] === 'number') {
      continue;
    }
    group = getBucket(url, numbuckets);
    if (!buckets[group]) {
      buckets[group] = [];
    }
    buckets[group].push(url);
    cache[url] = group;
  }
  return buckets;
}

console.log(toBuckets(document.images, 3));

This gave me (on cnn.com) a nice list of three buckets and the URLs in each:

[
  ["http://i2.cdn.turner.co...seals.02.cnn%5B1%5D.jpg",
   "http://i2.cdn.turner.co...ves/tzvids.osama.gi.jpg",
   "http://i.cdn.turner.com.../misc/advertisement.gif", 5 more...],
  ["http://i.cdn.turner.com...der/hat/arrow_black.png",
   "http://i2.cdn.turner.co...ds.military.dogs.gi.jpg",
   "http://i2.cdn.turner.co...bl.house.cnn.120x68.jpg", 8 more...],
  ["http://i.cdn.turner.com...element/img/3.0/1px.gif",
   "http://i.cdn.turner.com...bal/header/hdr-main.gif",
   "http://i.cdn.turner.com...al/header/nav-arrow.gif", 9 more...]
]

Not too bad for such simplicity.

As you see, I have a cache object, so it takes care of any duplicates so I don't end up with the same component repeating.

The drawback of course is that it's unlikely that you'll get exactly the same number of components in each bucket. But from what I tried on a few sites, that's not a big issue. The benefit of having a stable and simple (no databases, cookies, local storage, etc) resolution of file path to a hostname is a win.

This entry was posted on Friday, May 6th, 2011 and is filed under performance. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.


Get notification for future posts: follow me on Twitter or subscribe to my RSS feed

11 Responses to “Simple Sharding Logic”

  1. Morgan Cheng Says:

    Even if we have multiple CDN domains for our static files, each domain should have a complete file set. Is that right?

    For example, we have “i1.cdn.com” and “i2.cdn.com”. http://i1.cdn.com/image.jpg and http://i2.cdn.com/image.jpg should represents same resource, since we might change bucketing strategy.

  2. pixelastic Says:

    Another method would be to do the modulo on the first letter of the md5 hash of the filepath.

  3. Adrian Yee Says:

    I wrote an article about sharding algorithms for GTmetrix:

    http://gtmetrix.com/parallelize-downloads-across-hostnames-implementation-guide.html

    Using the 32-bit integer value of the md5 hash of the path modulo with the number of shards usually distributes the files amongst the shards fairly evenly.

    You could simplify it even more to just the first letter of the md5 sum as pixelastic mentions, but I believe the distribution might not be as even (IANAM [mathematician] ;) )

  4. Billy Hoffman Says:

    Thanks for the great article Stoyan!

    Another method we implement with clients is to sum all the ASCII codes of the URL together and mod against that. Tends to give you a more even distribution when slotting into more than 2 buckets. This is also faster than MD5ing the string.

  5. Steve Webster Says:

    I tried a number of techniques to get the best even split across domains whilst ensuring that the same file was always hashed to the same domain. In the end, this proved to be both efficient and fairly evenly spread:

    $hosts = Array(‘host1…’, ‘host2…’, ‘host3…’);
    $num_hosts = count($hosts);

    $host = $hosts[ crc16($filename) % $num_hosts ];

    (Actually, the implementation is in Perl, but you get the idea).

  6. Adrian Yee Says:

    While there are a ton of different hashing algorithms, I think it’s best you actually write some code to see how evenly the algorithm you choose actually distributes the resources amongst the shards for your own site. Just grab a list of all the resources on the your page(s) and run it through. I found that for some sites, a simple crc would work great, but be extremely bad in other cases. In my testing, I found that md5 seemed fairly balanced in most cases.

    Also remember to actually check the effect it has on the loading in the various browser waterfalls.

  7. Sergey Chernyshev Says:

    @Billy, thanks for the tip! I was looking into using md5 / crc32, but adding ASCII codes will probably work as good.

  8. Aaron Peters Says:

    Make sure you test the impact on load time and user experience.
    Even distribution is not a goal in itself, the goal is to maximize the user experience.

    It can make sense to have a few specific files be served from the same domain, and others may simply be evenly distributed.

    Watch the waterfalls, watch how the page renders, etc etc.

    Tweak until you get it right.
    Maybe that is with even distribution, maybe not.

  9. Ryan Says:

    Why are you trying to coin a term under the acronym “SSL”. We don’t need more ambiguous acronyms, especially not for bullshit like this. PHP-using idiot!

  10. fwolf Says:

    @ Ryan:

    Just shut the fuck up, bullshit-using moron ;)

    cu, w0lf.

  11. Steve J Says:

    To change the actual domain name, something like this will work (Jquery version) that I’m sure someone can improve upon:

    function getBucket(url, numbuckets) {
    var number = url.length,
    group = number % numbuckets;
    return group;
    }

    function toBuckets(stuff, numbuckets) {
    var numbuckets = parseInt(numbuckets, 10),
    url, group,
    buckets = Array(numbuckets),
    cache = {};
    for (var i = 0, max = stuff.length; i < max; i++) {
    url = $('img').filter(function(index){

    return index == i;
    }).attr('src');

    if (typeof cache[url] === 'number') {
    continue;
    }
    group = getBucket(url, numbuckets);
    if (!buckets[group]) {
    buckets[group] = [];
    }

    path = "http://images"+group+".mydomain.com"+url;

    buckets[group].push(path);
    cache[path] = group;
    }
    return buckets;
    }

    console.log(toBuckets(document.images, 4));

Leave a Reply