Digging into the HTTP archive

Update: Second part

One way to do web performance research is to dig into what's out there. It's a tradition dating back from Steve Souders and his HPWS where he was looking at the top 10 Alexa sites for proof that best practices are or aren't followed. This involves loading each pages and inspecting the body or the response headers. Pros: real sites. Cons: manual labor, small sample.

I've done some image and CSS optimization research grabbing data in any way that looks easy: using the Yahoo image search API to get URLs or using Fiddler to monitor and export the traffic and loading a bazillion sites in IE with a script. Or using HTTPWatch. Pros: big sample. Cons: reinvent the wheel and use a different sampling criteria every time.

Today we have httparchive.org which makes performance research so much easier. It already has a bunch of data you can export and dive into immediately. It's also a common starting point so two people can examine the same data independently and compare or reproduce each other's results.

Let's see how to get started with the HTTP archive's data.

speak it

(Assuming MacOS, but the differences with other OS are negligible)

1. Install MySQL
2. Your mysql binary will be in /usr/local/mysql/bin/mysql. Feel free to create an alias. Your username is root and no password. This is of course terribly insecure but for a local machine with no important data, it's probably tolerable. Connect:

$ /usr/local/mysql/bin/mysql -u root

You'll see some text and a friendly cursor:

...
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

3. Create your new shiny database:

mysql> create database httparchive;
Query OK, 1 row affected (0.00 sec)

Look into the new DB, it's empty, no tables or data, as expected:

mysql> \u httparchive
Database changed
mysql> show tables;
Empty set (0.00 sec)

4. Quit mysql for now:

mysql> quit;
Bye

5. Go the archive and fetch the database schema.

$ curl http://httparchive.org/downloads/httparchive_schema.sql > ~/Downloads/schema.sql

While you're there get the latest DB dump. That would be the link that says IE. Today it says Dec 15 and is 2.5GB. So be prepared. Save it and unzip it as, say, ~/Downloads/dump.sql

6. Recreate the DB tables:

$ /usr/local/mysql/bin/mysql -u root httparchive < ~/Downloads/schema.sql

7. Import the data (takes a while):

$ /usr/local/mysql/bin/mysql -u root httparchive < ~/Downloads/dump.sql

8. Log back into mysql and look around:

$ /usr/local/mysql/bin/mysql -u root httparchive;

[yadda, yadda...]

mysql> show tables;
+-----------------------+
| Tables_in_httparchive |
+-----------------------+
| pages                 |
| pagesmobile           |
| requests              |
| requestsmobile        |
| stats                 |
+-----------------------+
5 rows in set (0.00 sec)

Dataaaa!

What's in the requests table I couldn't help but wonder (damn you, SATC)

mysql> describe requests;
+------------------------+------------------+------+-----+---------+----------------+
| Field                  | Type             | Null | Key | Default | Extra          |
+------------------------+------------------+------+-----+---------+----------------+
| requestid              | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| pageid                 | int(10) unsigned | NO   | MUL | NULL    |                |
| startedDateTime        | int(10) unsigned | YES  | MUL | NULL    |                |
| time                   | int(10) unsigned | YES  |     | NULL    |                |
| method                 | varchar(32)      | YES  |     | NULL    |                |
| url                    | text             | YES  |     | NULL    |                |
| urlShort               | varchar(255)     | YES  |     | NULL    |                |
| redirectUrl            | text             | YES  |     | NULL    |                |
| firstReq               | tinyint(1)       | NO   |     | NULL    |                |
| firstHtml              | tinyint(1)       | NO   |     | NULL    |                |
| reqHttpVersion         | varchar(32)      | YES  |     | NULL    |                |
| reqHeadersSize         | int(10) unsigned | YES  |     | NULL    |                |
| reqBodySize            | int(10) unsigned | YES  |     | NULL    |                |
| reqCookieLen           | int(10) unsigned | NO   |     | NULL    |                |
| reqOtherHeaders        | text             | YES  |     | NULL    |                |
| status                 | int(10) unsigned | YES  |     | NULL    |                |
| respHttpVersion        | varchar(32)      | YES  |     | NULL    |                |
| respHeadersSize        | int(10) unsigned | YES  |     | NULL    |                |
| respBodySize           | int(10) unsigned | YES  |     | NULL    |                |
| respSize               | int(10) unsigned | YES  |     | NULL    |                |
| respCookieLen          | int(10) unsigned | NO   |     | NULL    |                |
| mimeType               | varchar(255)     | YES  |     | NULL    |                |
.....

Hm, I wonder what are common mime types these days. Limiting to 10000 or more occurrences of the same mime type, because there's a lot of garbage out there. If you've never looked into real web data, you'd surprised how much misconfiguration is going on. It's a small miracle the web even works.

9. Most common mime types:

select count(requestid) as ct, mimeType 
  from requests 
  group by mimeType 
  having ct > 10000 
  order by ct desc;
+---------+-------------------------------+
| ct      | mimeType                      |
+---------+-------------------------------+
| 7448471 | image/jpeg                    |
| 4640536 | image/gif                     |
| 4293966 | image/png                     |
| 2843749 | text/html                     |
| 1837887 | application/x-javascript      |
| 1713899 | text/javascript               |
| 1455097 | text/css                      |
| 1093004 | application/javascript        |
|  619605 |                               |
|  343018 | application/x-shockwave-flash |
|  188799 | image/x-icon                  |
|  169928 | text/plain                    |
|   70226 | text/xml                      |
|   50439 | font/eot                      |
|   45416 | application/xml               |
|   41052 | application/octet-stream      |
|   38618 | application/json              |
|   30201 | text/x-cross-domain-policy    |
|   25248 | image/vnd.microsoft.icon      |
|   20513 | image/jpg                     |
|   12854 | application/vnd.ms-fontobject |
|   11788 | image/pjpeg                   |
+---------+-------------------------------+
22 rows in set (2 min 25.18 sec)

So the web is mostly made of JPEGs. GIFs are still more than PNGs despite all best efforts. Although OTOH (assuming these are comparable datasets), PNG is definitely gaining compared to this picture from two and a half years ago. Anyway.

It's you time!

So this is how easy it is to get started with the HTTPArchive. What experiment would you run with this data?

This entry was posted on Friday, December 28th, 2012 and is filed under performance. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.


Get notification for future posts: follow me on Twitter or subscribe to my RSS feed

6 Responses to “Digging into the HTTP archive”

  1. Digging into the HTTP archive #2 / Stoyan's phpied.com Says:

    [...] Stoyan's blog about (x)html, ajax, bookmarklets, browsers, css, firebug, javascript, json, mdb2, mysql, pear, performance, php, phpbb, tools, yslow, yui, writing, music,… life and everything. « Digging into the HTTP archive [...]

  2. Ben Says:

    Stoyan, I’ve been going through your steps, altering your command line instructions a bit, as my mysql install is located within MAMP. Step 5 from the command line didn’t work for me, but I got around this by using the “Import” tab in phpMyAdmin.

    However, I am stuck at step 7. I’ve tried importing dump.sql using your command while logged out of mysql, in addition to trying it while logged in just in case that was the issue, and I don’t see a way to solve this using phpMyAdmin either. I would love to be able to complete this tutorial so thanks for any advice you might have. Sorry for any shortcomings in my terminology as I’m relatively new to working with mysql.

  3. Steve Souders Says:

    Wow, Stoyan – thanks for writing this up! A top priority for this year is getting more people to use the HTTP Archive data. I’m lacking on documentation and tutorials so this post is really really valuable.

    My current project with HA is rewriting the DB schema. The downside is people who are using the DB now will see changes. The upside is the DB will be smaller and more compartmentalized. For example, it’ll be possible to NOT download/restore the “requests” table making it take minutes instead of hours to restore a single crawl.

    I encourage people who are interested to subscribe to the HTTP Archive mailing list to discuss issues and hear about changes.

  4. Charlie Says:

    Nice post Stoyan thanks for the example of using having.

    You should not encourage people to unzip the httparchive.org datasets. Not only does it require less diskspace (current datasets are 2.5 GB compressed but about 30 GB uncompressed and nearly 40 GB in a database) but it is also faster to pipe the decompression straight into the database:
    gzip -d -c archive.gz | mysql5 -uroot httparchive

  5. Community News: Predictions for 2013, a Look Back at 2012 & More | New Relic blog Says:

    [...] * Stoyan Stefanov digs into the HTTP archive. [...]

  6. Setup your own HTTP Archive to track and query your site trends | Be better and faster Says:

    [...] current data. For example, do what Stoyan Stevanov did by asking yourself some questions: “Hm, I wonder what are common mime types these days”. Once you’ve setup the database, you can easily query anything you [...]

Leave a Reply