Digging into the HTTP archive #2

December 28th, 2012. Tagged: performance

Continuing from earlier tonight, let's see how you can use the HTTP archive as a starting point and continue examining the Internet at large.

Task: figure out what % of the JPEGs out there on the web today are progressive vs baseline. Ann Robson has an article for the perfplanet calendar later tonight with all the juicy details.

Problemo: there's no such information in the HTTP Archive. However, there's the requests table with a list of URLs, as you can see in the previous post.

Solution: Get a list of 1000 random JPEGs (mimeType = 'image/jpeg'), download them all and run ImageMagick's identify to figure out the percentage.

How?

You have a copy of the DB as described in the previous post. Now connect to mysql (assuming you have an alias by now):

$ mysql -u root httparchive

Now just for kicks, let's get one jpeg:

mysql> select requestid, url, mimeType from requests \
    where mimeType = 'image/jpeg' limit 1;
+-----------+--------------------------------------------+------------+
| requestid | url                                        | mimeType   |
+-----------+--------------------------------------------+------------+
| 404421629 | http://www.studymode.com/education-blog....| image/jpeg |
+-----------+--------------------------------------------+------------+
1 row in set (0.01 sec)

Looks promising.

Now let's fetch 1000 random image URLs and, at the same time, dump them into a file. For convenience, let's make this file a shell script so it's easy to run, with one curl command per line. Let's use MySQL to do all the string concatenation.

Testing with one image:

mysql> select concat('curl -o ', requestid, '.jpg "', url, '"') from requests\
    where mimeType = 'image/jpeg' limit 1;
+-----------------------------------------------------------+
| concat('curl -o ', requestid, '.jpg "', url, '"')         |
+-----------------------------------------------------------+
| curl -o 404421629.jpg "http://www.studymode.com/educ..."  |
+-----------------------------------------------------------+
1 row in set (0.00 sec)

All looks good. I'm using the requestid as the file name, so the experiment is always reproducible.

mysql>
 SELECT concat('curl -o ', requestid, '.jpg "', url, '"') 
  INTO OUTFILE '/tmp/jpegs.sh' 
  LINES TERMINATED BY '\n' FROM requests
  WHERE mimeType = 'image/jpeg'
  ORDER BY rand()
  LIMIT 1000;
Query OK, 1000 rows affected (2 min 25.04 sec)

Lo and behold, three minutes later, we have generated a shell script in /tmp/jpegs.sh that looks like:

curl -o 422877532.jpg "http://www.friendster.dk/file/pic/user/SellDiablo_60.jpg"
curl -o 406113210.jpg "http://profile.ak.fbcdn.net/hprofile-ak-ash4/370543_100004326543130_454577697_q.jpg"
curl -o 423577106.jpg "http://www.moreliainvita.com/Banner_index/Cantinelas.jpg"
curl -o 429625174.jpg "http://newnews.ca/apics/92964906IMG_9424--1.jpg"
....
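
Side note: INTO OUTFILE needs the FILE privilege and, depending on the server's secure_file_priv setting, MySQL may refuse to write to /tmp. If that's the case on your setup, a rough equivalent (a sketch, not what I ran) is to let the mysql client print the rows and redirect them from the shell:

# -N skips the column header; the shell redirect does what INTO OUTFILE did
$ mysql -u root -N httparchive -e "
    SELECT concat('curl -o ', requestid, '.jpg \"', url, '\"')
    FROM requests WHERE mimeType = 'image/jpeg'
    ORDER BY rand() LIMIT 1000" > /tmp/jpegs.sh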

Now, nothing left to do but run this script and download a bunch of images:

$ mkdir /tmp/jpegs
$ cd /tmp/jpegs
$ sh ../jpegs.sh

The curl output flashes by and, some minutes later, you have almost 1000 images, mostly NSFW. Not quite 1000, because of timeouts, unreachable hosts, etc.

$ ls | wc -l
     983
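
If you plan to rerun the download, one tweak I didn't bother with: tell curl to give up on slow hosts and not to save error responses (which also helps with the 404 pages mentioned below). A quick way, untested in this run, is to patch the generated script in place:

# add a 15-second timeout and -f (fail on HTTP errors) to every curl command;
# the original script is kept as jpegs.sh.bak
$ sed -i.bak 's/^curl -o /curl -sf --max-time 15 -o /' /tmp/jpegs.sh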

Now back to the original task: how many of these JPEGs are baseline and how many are progressive:

$ identify -verbose *.jpg | grep "Interlace: None" | wc -l
     XXX
$ identify -verbose *.jpg | grep "Interlace: JPEG" | wc -l
     YYY
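
The two passes above do the job; a single-pass variant of the same idea (my shortcut, not what produced the numbers for Ann) tallies whatever Interlace value each file reports and silently drops anything identify can't parse:

# print the Interlace line for every file, then count the distinct values;
# non-images produce no match and simply fall out of the tally
for f in *.jpg; do
  identify -verbose "$f" 2>/dev/null | grep -m1 "Interlace:"
done | sort | uniq -c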

For the actual values of XXX and YYY, check Ann's post later tonight 🙂

It also turns out that 983 - XXX - YYY = 26, because some of the downloaded files were not really images, but 404 pages and other non-image responses.
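
A quick way to spot the impostors (assuming the standard file utility is available) is to ask file what each download really is and filter out the real JPEGs:

# list the downloads whose content is not actually JPEG data
$ file *.jpg | grep -v "JPEG image data"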

