Digging into the HTTP archive

December 28th, 2012. Tagged: performance

Update: Second part

One way to do web performance research is to dig into what's out there. It's a tradition dating back to Steve Souders and his HPWS book, where he looked at the top 10 Alexa sites for proof that best practices are or aren't followed. This involves loading each page and inspecting the body or the response headers. Pros: real sites. Cons: manual labor, small sample.

I've done some image and CSS optimization research, grabbing data in any way that looked easy: using the Yahoo image search API to get URLs, using Fiddler to monitor and export the traffic while a script loaded a bazillion sites in IE, or using HTTPWatch. Pros: big sample. Cons: you reinvent the wheel and use different sampling criteria every time.

Today we have httparchive.org which makes performance research so much easier. It already has a bunch of data you can export and dive into immediately. It's also a common starting point so two people can examine the same data independently and compare or reproduce each other's results.

Let's see how to get started with the HTTP archive's data.

Setup

(Assuming Mac OS X, but the differences with other OSes are negligible)

1. Install MySQL
2. Your mysql binary will be in /usr/local/mysql/bin/mysql. Feel free to create an alias. The username is root with no password. This is, of course, terribly insecure, but for a local machine with no important data it's probably tolerable. Connect:

$ /usr/local/mysql/bin/mysql -u root

You'll see some text and a friendly cursor:

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
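Tip: to avoid typing the full path every time, you can add an alias to your ~/.bash_profile. The alias name here is just a suggestion:

```shell
# shorthand for the bundled MySQL client, logging in as root
alias mysql='/usr/local/mysql/bin/mysql -u root'
```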


3. Create your new shiny database:

mysql> create database httparchive;
Query OK, 1 row affected (0.00 sec)

Look into the new DB: it's empty, no tables or data, as expected:

mysql> \u httparchive
Database changed
mysql> show tables;
Empty set (0.00 sec)

4. Quit mysql for now:

mysql> quit;

5. Go to the archive and fetch the database schema:

$ curl http://httparchive.org/downloads/httparchive_schema.sql > ~/Downloads/schema.sql

While you're there, get the latest DB dump. That would be the link that says IE. Today it says Dec 15 and is 2.5GB, so be prepared. Save it and unzip it as, say, ~/Downloads/dump.sql

6. Recreate the DB tables:

$ /usr/local/mysql/bin/mysql -u root httparchive < ~/Downloads/schema.sql

7. Import the data (takes a while):

$ /usr/local/mysql/bin/mysql -u root httparchive < ~/Downloads/dump.sql
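As an aside, if you'd rather not keep the ~30GB uncompressed dump sitting on disk, you can pipe the decompression straight into mysql instead. A sketch, assuming the download was saved compressed as ~/Downloads/dump.sql.gz (the last two lines just demo the same pipe on a tiny file, with cat standing in for mysql):

```shell
# Stream the compressed dump into mysql without unzipping it to disk first:
#   gzip -d -c ~/Downloads/dump.sql.gz | /usr/local/mysql/bin/mysql -u root httparchive
# Tiny demo of the same pipe, with cat standing in for mysql:
printf 'SELECT 1;\n' | gzip > /tmp/demo.sql.gz
gzip -d -c /tmp/demo.sql.gz | cat   # prints: SELECT 1;
```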

8. Log back into mysql and look around:

$ /usr/local/mysql/bin/mysql -u root httparchive

[yadda, yadda...]

mysql> show tables;
| Tables_in_httparchive |
| pages                 |
| pagesmobile           |
| requests              |
| requestsmobile        |
| stats                 |
5 rows in set (0.00 sec)
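Before running anything heavy, it helps to know how much data you're dealing with. A couple of quick counts (these take a while on the full dump, and your numbers will depend on which crawl you imported):

```sql
-- how big is this crawl?
select count(*) from pages;
select count(*) from requests;
```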


What's in the requests table, I couldn't help but wonder (damn you, SATC).

mysql> describe requests;
| Field                  | Type             | Null | Key | Default | Extra          |
| requestid              | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| pageid                 | int(10) unsigned | NO   | MUL | NULL    |                |
| startedDateTime        | int(10) unsigned | YES  | MUL | NULL    |                |
| time                   | int(10) unsigned | YES  |     | NULL    |                |
| method                 | varchar(32)      | YES  |     | NULL    |                |
| url                    | text             | YES  |     | NULL    |                |
| urlShort               | varchar(255)     | YES  |     | NULL    |                |
| redirectUrl            | text             | YES  |     | NULL    |                |
| firstReq               | tinyint(1)       | NO   |     | NULL    |                |
| firstHtml              | tinyint(1)       | NO   |     | NULL    |                |
| reqHttpVersion         | varchar(32)      | YES  |     | NULL    |                |
| reqHeadersSize         | int(10) unsigned | YES  |     | NULL    |                |
| reqBodySize            | int(10) unsigned | YES  |     | NULL    |                |
| reqCookieLen           | int(10) unsigned | NO   |     | NULL    |                |
| reqOtherHeaders        | text             | YES  |     | NULL    |                |
| status                 | int(10) unsigned | YES  |     | NULL    |                |
| respHttpVersion        | varchar(32)      | YES  |     | NULL    |                |
| respHeadersSize        | int(10) unsigned | YES  |     | NULL    |                |
| respBodySize           | int(10) unsigned | YES  |     | NULL    |                |
| respSize               | int(10) unsigned | YES  |     | NULL    |                |
| respCookieLen          | int(10) unsigned | NO   |     | NULL    |                |
| mimeType               | varchar(255)     | YES  |     | NULL    |                |

Hm, I wonder what the common mime types are these days. Limiting to 10000 or more occurrences of the same mime type, because there's a lot of garbage out there. If you've never looked into real web data, you'd be surprised how much misconfiguration is going on. It's a small miracle the web even works.

9. Most common mime types:

select count(requestid) as ct, mimeType 
  from requests 
  group by mimeType 
  having ct > 10000 
  order by ct desc;
| ct      | mimeType                      |
| 7448471 | image/jpeg                    |
| 4640536 | image/gif                     |
| 4293966 | image/png                     |
| 2843749 | text/html                     |
| 1837887 | application/x-javascript      |
| 1713899 | text/javascript               |
| 1455097 | text/css                      |
| 1093004 | application/javascript        |
|  619605 |                               |
|  343018 | application/x-shockwave-flash |
|  188799 | image/x-icon                  |
|  169928 | text/plain                    |
|   70226 | text/xml                      |
|   50439 | font/eot                      |
|   45416 | application/xml               |
|   41052 | application/octet-stream      |
|   38618 | application/json              |
|   30201 | text/x-cross-domain-policy    |
|   25248 | image/vnd.microsoft.icon      |
|   20513 | image/jpg                     |
|   12854 | application/vnd.ms-fontobject |
|   11788 | image/pjpeg                   |
22 rows in set (2 min 25.18 sec)
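Two and a half minutes is a while. If you plan to keep slicing by mimeType, it may be worth paying the one-time cost of an index (my addition, not part of the official schema; building it will itself take a few minutes and some extra disk):

```sql
create index mimeidx on requests (mimeType);
```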

So the web is mostly made of JPEGs. There are still more GIFs than PNGs, despite all best efforts. Although, OTOH (assuming these are comparable datasets), PNG is definitely gaining compared to this picture from two and a half years ago. Anyway.
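Speaking of misconfiguration, the table above shows image/jpg and image/pjpeg alongside the correct image/jpeg. If you want a truer JPEG count, a sketch of a query that lumps the variants together:

```sql
select count(requestid) as ct
  from requests
  where mimeType in ('image/jpeg', 'image/jpg', 'image/pjpeg');
```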

It's your turn!

So this is how easy it is to get started with the HTTP Archive. What experiment would you run with this data?


30 Responses

  2. Stoyan, I’ve been going through your steps, altering your command line instructions a bit, as my mysql install is located within MAMP. Step 5 from the command line didn’t work for me, but I got around this by using the “Import” tab in phpMyAdmin.

    However, I am stuck at step 7. I’ve tried importing dump.sql using your command while logged out of mysql, in addition to trying it while logged in just in case that was the issue, and I don’t see a way to solve this using phpMyAdmin either. I would love to be able to complete this tutorial so thanks for any advice you might have. Sorry for any shortcomings in my terminology as I’m relatively new to working with mysql.

  3. Wow, Stoyan – thanks for writing this up! A top priority for this year is getting more people to use the HTTP Archive data. I’m lacking on documentation and tutorials so this post is really really valuable.

    My current project with HA is rewriting the DB schema. The downside is people who are using the DB now will see changes. The upside is the DB will be smaller and more compartmentalized. For example, it’ll be possible to NOT download/restore the “requests” table making it take minutes instead of hours to restore a single crawl.

    I encourage people who are interested to subscribe to the HTTP Archive mailing list to discuss issues and hear about changes.

  4. Nice post Stoyan thanks for the example of using having.

You should not encourage people to unzip the httparchive.org datasets. Keeping them compressed not only requires less disk space (current datasets are 2.5 GB compressed, but about 30 GB uncompressed and nearly 40 GB in a database), it is also faster to pipe the decompression straight into the database:
    gzip -d -c archive.gz | mysql5 -uroot httparchive
