Digging into the HTTP archive #2
Continuing from earlier tonight, let's see how you can use the HTTP archive as a starting point and continue examining the Internet at large.
Task: figure out what % of the JPEGs out there on the web today are progressive vs baseline. Ann Robson has an article for the perfplanet calendar later tonight with all the juicy details.
Problemo: there's no such information in HTTPArchive. However, there's a table called requests with a list of URLs, as you can see in the previous post.
Solution: Get a list of 1000 random JPEGs (mimeType='image/jpeg'), download them all and run ImageMagick's identify to figure out the percentage.
How?
You have a copy of the DB as described in the previous post. Now connect to mysql (assuming you have an alias by now):
$ mysql -u root httparchive
Now just for kicks, let's get one jpeg:
mysql> select requestid, url, mimeType from requests \
where mimeType = 'image/jpeg' limit 1;
+-----------+--------------------------------------------+------------+
| requestid | url                                        | mimeType   |
+-----------+--------------------------------------------+------------+
| 404421629 | http://www.studymode.com/education-blog....| image/jpeg |
+-----------+--------------------------------------------+------------+
1 row in set (0.01 sec)
Looks promising.
Now let's pick 1000 random image URLs and, at the same time, dump them into a file. For convenience, let's make this file a shell script so it's easy to run: its contents will be one curl command per line. Let's use mysql to do all the string concatenation.
Testing with one image:
mysql> select concat('curl -o ', requestid, '.jpg "', url, '"') from requests\
where mimeType = 'image/jpeg' limit 1;
+-----------------------------------------------------------+
| concat('curl -o ', requestid, '.jpg "', url, '"')         |
+-----------------------------------------------------------+
| curl -o 404421629.jpg "http://www.studymode.com/educ..." |
+-----------------------------------------------------------+
1 row in set (0.00 sec)
All looks good. I'm using the requestid as file name, so the experiment is always reproducible.
mysql> SELECT concat('curl -o ', requestid, '.jpg "', url, '"') INTO OUTFILE '/tmp/jpegs.sh' LINES TERMINATED BY '\n' FROM requests WHERE mimeType = 'image/jpeg' ORDER by rand() LIMIT 1000;
Query OK, 1000 rows affected (2 min 25.04 sec)
Lo and behold, about two and a half minutes later, we have generated a shell script in /tmp/jpegs.sh that looks like:
curl -o 422877532.jpg "http://www.friendster.dk/file/pic/user/SellDiablo_60.jpg"
curl -o 406113210.jpg "http://profile.ak.fbcdn.net/hprofile-ak-ash4/370543_100004326543130_454577697_q.jpg"
curl -o 423577106.jpg "http://www.moreliainvita.com/Banner_index/Cantinelas.jpg"
curl -o 429625174.jpg "http://newnews.ca/apics/92964906IMG_9424--1.jpg"
....
Now, nothing left to do but run this script and download a bunch of images:
$ mkdir /tmp/jpegs
$ cd /tmp/jpegs
$ sh ../jpegs.sh
curl output flashes by and some minutes later you have almost 1000 images, mostly NSFW. Not 1000 because of timeouts, unreachable hosts, etc.
$ ls | wc -l
983
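If you rerun this, it may be worth giving curl explicit time limits so a hung host doesn't stall the whole script. Here's a minimal sketch of a helper that prints one such command line. The name curl_line and the 5 and 30 second limits are my assumptions, not something from the original run:

```shell
# Hypothetical helper: print one curl command for a request id and a URL.
# --connect-timeout and --max-time are standard curl options; the values
# 5 and 30 (seconds) are arbitrary picks for illustration.
curl_line() {
  printf 'curl -sS --connect-timeout 5 --max-time 30 -o %s.jpg "%s"\n' "$1" "$2"
}

curl_line 422877532 "http://www.friendster.dk/file/pic/user/SellDiablo_60.jpg"
```

The same flags could just as well be baked into the concat() string in the query, so the generated script carries them from the start.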
Now back to the original task: how many baseline and how many progressive JPEGs:
$ identify -verbose *.jpg | grep "Interlace: None" | wc -l
XXX
$ identify -verbose *.jpg | grep "Interlace: JPEG" | wc -l
YYY
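Running identify -verbose twice over the same files does the slow part twice. As a sketch, the two counts (and their percentages) can be tallied in a single pass by reading the "Interlace:" lines once. The function name tally_interlace is made up for this example, and the percentages are relative to the files that reported an Interlace value at all:

```shell
# Sketch: count "Interlace:" values from `identify -verbose` output on stdin.
# tally_interlace is a made-up name, not an ImageMagick command.
tally_interlace() {
  awk '
    $1 == "Interlace:" {
      if ($2 == "None") baseline++
      else if ($2 == "JPEG") progressive++
      else other++
    }
    END {
      total = baseline + progressive + other
      if (total == 0) { print "no Interlace lines found"; exit }
      printf "baseline=%d (%.1f%%) progressive=%d (%.1f%%) other=%d\n",
        baseline, 100 * baseline / total,
        progressive, 100 * progressive / total, other
    }'
}

# Usage (run it in the directory with the downloads):
#   identify -verbose *.jpg 2>/dev/null | tally_interlace
```

Files that identify cannot read produce no "Interlace:" line, so they simply don't show up in any of the three counts.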
For the actual values of XXX and YYY, check Ann's post later tonight.
Also, it turns out that 983 - XXX - YYY = 26, because some of the downloaded files were not really images, but 404 pages and other non-image files.
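One rough way to spot those impostors without ImageMagick is to look at the first two bytes of each file: JPEG data starts with FF D8. This is only a sketch of that check. The helper name is_jpeg is an assumption of mine, and it relies on head -c, which is widely available but not strictly POSIX:

```shell
# Hypothetical helper: succeed if the file starts with the JPEG
# start-of-image marker (bytes FF D8), fail otherwise.
is_jpeg() {
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d '[:space:]')" = "ffd8" ]
}

# Usage (run it in the directory with the downloads):
#   for f in *.jpg; do is_jpeg "$f" || echo "not a JPEG: $f"; done
```

This only checks the file signature, so a truncated or corrupt JPEG would still pass. It's a quick filter, not a replacement for identify.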
This entry was posted on Friday, December 28th, 2012 and is filed under performance.

December 29th, 2012 at 12:24 am
ImageMagick does not distinguish baseline from sequential extended?