Archive for the 'HTTP' Category

Automating HTTPWatch with PHP #3

Monday, March 7th, 2011

The first part is here, the second is here. This third post is more about PHP and COM than about HTTPWatch or monitoring web performance, so feel free to skip it if the title misled you :) Keep reading if you want to use and improve/update my HTTPWatch class in the future.

The problem

After running an HTTPWatch-ed browser via a script, I wanted an easy way to dump all the data collected. Since the PHP-HTTPWatch bridge goes through a COM interface, all the objects returned by HTTPWatch are Variants and not ready for introspection with the usual PHP functions, like get_object_vars() for example.

The solution

Interestingly, it turns out there exists a function called com_print_typeinfo(). You give it a COM object (it works with those Variants HTTPWatch gives you) and it prints out source code for a PHP class defining this COM object. So the hack here is to capture that output, evaluate the source code with eval() (oh, the horror!) and then inspect the resulting class with get_class_vars(). Luckily there's no naming collision between the PHP built-in classes and those defined by HTTPWatch.

// the input is a class name 
// and an object of that class
$class = "Entry";
$object = $http->watch->Log->Entries->Item(0);
 
// buffer output
ob_start();
// print out class definition
// derived from an object
com_print_typeinfo($object);
$typeinfo = ob_get_contents();
ob_end_clean();
 
// evaluate the generated PHP source
eval($typeinfo);
 
// get the properties
$properties = get_class_vars($class);
 
// Hooray!

In order for this to work, as you can see, we need to know the class names and we need access to an example object of each class. This is where I needed to study the HTTPWatch API and find a suitable example page that will generate enough objects to derive the API from.

The free HTTPWatch version

The free version has restrictions where you don't have access to all properties. I wanted my class to be able to do the best job possible with or without the presence of restrictions. That's why I first load google.com which is unrestricted and then an image on my blog, which is restricted.

From the first page I derive the complete API (that I'm interested in) and then I use the derived API to study the second URL request. I access each property in a try-catch; the access blows up when a property is restricted, so I write that property to a second array of API properties, $paidproperties.
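Here's a minimal sketch of that try-catch probing, just to show the idea. It is not the code from dumpapi.php: FakeEntry is a stand-in that throws when a "restricted" property is read (so the sketch runs without Windows or HTTPWatch), and the helper name split_properties() and the property names are made up for illustration.

```php
<?php
// Stand-in for a COM object from the free edition (an assumption for this
// sketch): reading a property that is not available throws an exception.
class FakeEntry
{
    private $data = ['URL' => 'http://example.org/', 'Result' => '200'];

    public function __get($name)
    {
        if (!array_key_exists($name, $this->data)) {
            throw new Exception("$name is restricted");
        }
        return $this->data[$name];
    }
}

// Hypothetical helper: try to read every property name and sort the names
// into "readable" and "restricted" lists.
function split_properties($object, array $names): array
{
    $free = [];
    $paid = [];
    foreach ($names as $name) {
        try {
            $value = $object->$name; // throws when the property is restricted
            $free[] = $name;
        } catch (Exception $e) {
            $paid[] = $name;
        }
    }
    return [$free, $paid];
}

list($properties, $paidproperties) = split_properties(
    new FakeEntry(),
    ['URL', 'Result', 'Content'] // illustrative names only
);
print_r($paidproperties);
```

With a real COM object the failed read should surface as a com_exception, which extends Exception, so the same catch block would apply.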

Source: the end

In the end you run the script dumpapi.php. It uses my HTTPWatch class to derive the API for the HTTPWatch class itself. How meta! You write the result of the run to an API file and then this file is included by the class. Nice and clean after a messy hack :)

Run:

$ php dumpapi.php > HTTPWatchAPI.php

(This is a one-off operation, no need to run it at all if you don't need to change anything in the class)

Then, before you instantiate the class, you go:

// point to the API dump
HTTPWatch::$apipath = "HTTPWatchAPI.php";
 
// the usual
$http = new HTTPWatch();

The default for that API dump is a file named HTTPWatchAPI.php in the same directory, so you can skip that first line unless you have a valid reason to store the API in a different location.

 

Automating HTTPWatch with PHP #2

Monday, March 7th, 2011

In part 1 I demonstrated how you can use PHP to script and automate HTTPWatch. And how you can get data back, either reading the API docs or using a quick HAR hack to get a lot of data in one go.

Now I want to share a little class I wrote to make all that a little easier.

The code is here on GitHub.

Basic usage

Open IE, navigate, close:

$http = new HTTPWatch();
$http->go('http://phpied.com/');
$http->done();

To do the same in Firefox, just pass "ff" to the constructor:

$http = new HTTPWatch('ff');
$http->go('http://phpied.com/');
$http->done();

The constructor accepts a second param with options, like empty cache, hide browser (IE only), etc., largely underused for the time being.

Handle to the HTTPWatch plugin

After $http = new HTTPWatch(); a watch property will be added to $http. This is the HTTPWatch instance which gives you access to all its APIs, so you can do e.g.

$summary = $http->watch->Log->Entries->Summary;

Data out

My main motivation behind this class, other than a simpler API, has been to provide the ability to just dump all the data that HTTPWatch collects in a quick print_r(). That has been a challenge with the COM PHP bridge, but I found a hack around it. In any event, most of the HTTPWatch API I've exported to a second PHP file - the HTTPWatchAPI.php script. (This is an auto-generated file, created by another script, but let's leave that out for now.)

So after you've navigated to a page you have two convenient methods to grab a bunch of data from HTTPWatch. The first is:

$http->getSummary();

This gives you summary stats for the HTTP observation session. The second is:

$http->getEntries();

It gives you details about every HTTPWatch log entry - be it cached or an actual HTTP request.

Here's an example of what getSummary() can give you. Here's how this file was generated:

$http = new HTTPWatch();
$http->go("http://google.com");
print_r($http->getSummary());
$http->done();

And here's some output print_r()-ed from getEntries(). Here's the code that produced it:

$http = new HTTPWatch();
$http->go("http://google.com");
print_r($http->getEntries());
$http->done();

If you look carefully at the dump, you may notice something like [Stream] => [BYTESTREAM]. Most of the time you don't need the raw HTTP streams (gzipped, chunked, etc), but you can get them if you want by setting:

$http->skipStreams = false;

Here's the same google.com example, this time including the raw streams. And the code:

$http = new HTTPWatch();
$http->go("http://google.com");
$http->skipStreams = false;
print_r($http->getEntries());
$http->done();

Free vs. paid

One pain with HTTPWatch is that the free version has restrictions. The summary for example doesn't include TimingSummaries and WarningSummaries properties. The entry log has almost nothing - no headers or content or streams. My class handles that by giving you as much as it can. If you're using the free version, it will return the limited data for the restricted URLs, but still the full data for those URLs that HTTPWatch's demo version allows - the top Alexa sites.

So here's a dump of visiting http://givepngachance.com with my free HTTPWatch edition.

The data has restricted information about the givepngachance.com URL but full data related to the embedded youtube.com resource.

The code:

$http = new HTTPWatch();
$http->go("http://givepngachance.com");
print_r($http->getEntries());
$http->done();

Again with the video

If you've read part 1, you've probably seen the video, but here's the link again (try the HD version). This is a screencapture of loading FF and IE using my new class. The code that produced it is:

$ie = new HTTPWatch();
$ie->go('http://google.com/');
$sum = $ie->getSummary();
$ff = new HTTPWatch('ff');
$ff->go('http://google.com/');
$sumff = $ff->getSummary();
 
echo "\nRun 1 ";
echo $ie->watch->Log->BrowserName, ' ';
echo $ie->watch->Log->BrowserVersion;
echo "\nSent: ", $sum['BytesSent'], "; Received: ", $sum['BytesReceived'];
 
echo "\nRun 2 ";
echo $ff->watch->Log->BrowserName, ' ';
echo $ff->watch->Log->BrowserVersion;
echo "\nSent: ", $sumff['BytesSent'], "; Received: ", $sumff['BytesReceived'];
 
$ie->done();
$ff->done();

As you can see I'm accessing the HTTPWatch plugin object ($http->watch) directly to get the browser version and name. I didn't think this was worth wrapping in a more convenient API the way I did with getSummary() and getEntries().

The result of this is:

$ php examples.php

Run 1 Internet Explorer 6.0.2900.5512
Sent: 7102; Received: 89188
Run 2 Firefox 3.5.6
Sent: 6388; Received: 166473 

If you're wondering why FF gets twice the bytes, it's because google.com in IE6 is very basic - no search-as-you-type so much less JavaScript and one less sprite.

That's all folks

Enjoy, fork and keep an eye on what's up with the HTTP traffic. What goes through the tubes is too important not to be observed and monitored :)

 

Automating HTTPWatch with PHP

Saturday, March 5th, 2011

HTTPWatch is a nice tool to inspect HTTP traffic in an easy and convenient way and it works in both IE and FF now. Drawback - Windows-only and paid. But the free version is good enough for many tasks.

HTTPWatch can be automated and scripted, which is pretty cool for a number of monitoring-like tasks. Their site and help section list C# and Ruby+Watir examples. So I was curious - what about PHP (and no Watir)?

In general with PHP you can open/close/navigate IE using COM (whatever that is) which is nice, but you can't do that with Firefox as it doesn't expose a COM interface. But HTTPWatch fills the gap. K, let's see an example.

Prerequisites

OS: Windows
Install: IE, Firefox, HTTPWatch, php (command-line is fine, no need for Apache, MySQL, etc)

Getting started

Create a file, say C:\http.php, open command prompt and go:

cd \
C:\>php http.php

Now all that's left is to put something worth executing in http.php :)

Instantiating HTTPWatch

$controller = new COM("HttpWatch.Controller");
if(!method_exists($controller, 'IE')) {
  throw new Exception('failed to enable HTTPWatch');
}

Opening a new Firefox window:

$plugin = $controller->Firefox->New();

BTW, it's the same for IE:

$plugin = $controller->IE->New();

Disabling any filters (filters defined in HTTPWatch that is)

$plugin->Log->EnableFilter(false);

Clear HTTPWatch's log (the list of requests), clear the browser cache and start recording traffic:

// clear log and cache
$plugin->Clear();
$plugin->ClearCache();
 
// start
$plugin->Record();

Navigate to a URL and wait for it to complete - that means wait a bit after the onload event:

// browse
$plugin->GotoUrl('http://google.com');
$controller->Wait($plugin, -1);

Stop monitoring traffic and quit the browser:

$plugin->Stop();
$plugin->CloseBrowser();

This is nice, we opened the browser, visited a URL and closed. Now we can even get some meaningful data out of the whole experience.

$plugin->Log->Entries is an object that has a list of all requests. It also has a property Summary. So we can see how many bytes we sent and how many received as a result of this visit to google.com

$sum = $plugin->Log->Entries->Summary;
echo "in: {$sum->BytesReceived}, out: {$sum->BytesSent}";

Note: oh, you need to get your data before closing the browser, otherwise the Log object gets destroyed it seems

So the result:

C:\>php http.php
in: 89185, out: 7102

Yeah!

This may look like nothing, but is pretty impressive in and of itself. At least I know I was happy the first time it worked. Because, you see, any monitoring that doesn't use a real browser is kinda smelly, isn't that right? Plus this is awesome for performance tests, research and experiments. You can create page A and page B and go out for a walk. Meanwhile your script can load the pages 200 times in the two browsers (at least, because you can have FF+IE[678]), with empty and full cache... and you come back for the results! Tired of all the walking, not of hitting REFRESH.
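A rough sketch of what such a repeat-and-average script could look like. Everything here is illustrative: average_traffic() is a made-up helper and the $fake loader is a stand-in that returns canned numbers so the sketch runs anywhere. On Windows the loader would wrap the GotoUrl() / Wait() / Summary steps from above and return the real byte counts.

```php
<?php
// Hypothetical helper: call $load $runs times and average the byte counts.
// $load must return ['BytesSent' => int, 'BytesReceived' => int].
function average_traffic(callable $load, int $runs): array
{
    if ($runs < 1) {
        throw new InvalidArgumentException('need at least one run');
    }
    $sent = 0;
    $received = 0;
    for ($i = 0; $i < $runs; $i++) {
        $sum = $load();
        $sent += $sum['BytesSent'];
        $received += $sum['BytesReceived'];
    }
    return [
        'BytesSent'     => $sent / $runs,
        'BytesReceived' => $received / $runs,
    ];
}

// Stand-in loader with canned numbers (an assumption for this sketch);
// swap in a closure that drives the browser through HTTPWatch.
$fake = function (): array {
    return ['BytesSent' => 7102, 'BytesReceived' => 89185];
};

print_r(average_traffic($fake, 3));
```

Passing the loader in as a callable keeps the averaging separate from the browser automation, so the same loop can serve IE, Firefox, empty cache or full cache.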

Below you can see (HD!) video of a script that opens IE and FF, loads Google and then gives you the bytes in/out in the two browsers. This example uses a PHP class I created and will talk about later, but you can still see the idea.

A better experience in IE

One thing I don't like is that HTTPWatch won't let you control the browser very well. Two features I'm looking for: being able to see HTTPWatch's log while running (for testing) and then being able to completely hide the window (for "production"). Luckily IE lets you do that and HTTPWatch lets you "attach" to an already running IE instance.

So. We open IE with its own COM interface:

$browser = new COM("InternetExplorer.Application");
if(!method_exists($browser, 'Navigate')) {
  throw new Exception('didn\'t create IE obj');
}
$browser->Visible = true;

As you can see - not very different. But there's Visible, which can be false if you so like. This way you can still work on something while tests are running in the background without windows popping up all the time.

Also if you open HTTPWatch manually and close the browser, then the next time (in your scripted runs) HTTPWatch will stay open and you can check what's up.

So, connecting HTTPWatch with the IE instance means instantiating HTTPWatch as before and passing the IE object to Attach() method (was New() before).

// watch this!
$controller = new COM("HttpWatch.Controller");
if(!method_exists($controller, 'IE')) {
  throw new Exception('failed to enable HTTPWatch');
}
 
// enable plugin
$plugin = $controller->IE->Attach($browser);

The rest is all the same.

There's more

The most interesting part is getting data back from HTTPWatch. Dunno about you, but I love just dumping whatever structure I have with print_r() or var_dump() and then deciding what I want from it and how to go about getting it.

That doesn't happen here because these COM objects are Variants and you can't just dump'em. You have to read the API docs. That sucks. So I did a hack (next post) and also read the APIs ("Stoyan: reading the APIs so you don't have to!") to enable just dumping the httpwatch's log.

Meanwhile...

HAR

HTTPWatch can write you a HAR file with the log. Not everything is in there, but it's still a lot and it's easy. HAR is JSON so you json_decode() it and voila - a log!

// sys_get_temp_dir() works on Windows, where /tmp usually doesn't exist
$filename = tempnam(sys_get_temp_dir(), 'watchmenowimgoindown');
$plugin->Log->ExportHAR($filename);
$json = file_get_contents($filename);
 
print_r(json_decode($json));

If you're curious as to what that prints - here it is.

Want to see a HAR (from another run)? Here it is.

So here you go - much data can be extracted and dumped for inspection from the HAR output. For the full httpwatch data, there's the API.
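For instance, here's a small sketch that boils a HAR log down to one row per request. The function name summarize_har() and the sample JSON are made up for this example. The field names (log.entries, request.url, response.status, response.bodySize, time) come from the HAR format itself, but I haven't checked which of them HTTPWatch fills in for every entry.

```php
<?php
// Hypothetical helper: turn a HAR document (JSON string) into a flat list
// with the URL, status, body size and total time of each entry.
function summarize_har(string $json): array
{
    $har = json_decode($json, true);
    if (!isset($har['log']['entries'])) {
        throw new InvalidArgumentException('not a HAR log');
    }
    $rows = [];
    foreach ($har['log']['entries'] as $entry) {
        $rows[] = [
            'url'    => $entry['request']['url'],
            'status' => $entry['response']['status'],
            'bytes'  => $entry['response']['bodySize'],
            'ms'     => $entry['time'],
        ];
    }
    return $rows;
}

// Tiny hand-written sample, not real HTTPWatch output.
$sample = '{"log":{"entries":[{"time":120,'
    . '"request":{"url":"http://example.org/"},'
    . '"response":{"status":200,"bodySize":5120}}]}}';

print_r(summarize_har($sample));
```

In the script above you would pass in the $json string read from the exported file instead of $sample.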

(to be continued...)

 

Inline MHTML+Data URIs

Sunday, October 3rd, 2010

MHTML and Data URIs in the same CSS file is totally doable and gives us nice support for IE6+ and all modern browsers. But the question is - what about inline styles. In other words can you have a single-request web application which bundles together markup, inline styles, inline scripts, inline images? With data URIs - yes, clearly. But MHTML?

I remember hacker extraordinaire Hedger Wang coming up with a test page, which proved it's doable. Problems with the test are that a/ I can't find the page anymore, his domain has expired b/ there was some funky IE7/Vista stuff (probably now solvable) in there which even included an undesired redirect c/ was complex - the whole HTML becomes a multipart document, if I remember correctly there was something that required html served as text/plain....

So I tried something simple - shove an MHTML doc inside an inline style comment. It so totally worked! Including IE6 and IE8 in IE7 mode on Windows 7 (which in my experience behaves as badly as IE7 proper on Vista)

Here's the test page. Look ma', no extra HTTP requests :)

So it's a simple HTML doc:

<!doctype html>
<html>
  <head>
    <title>Look Ma' No HTTP requests</title>
    <style type="text/css">
 
/* magic here */
 
    </style>
  </head>
  <body>
    <h1>MHTML + Data:URIs inline in a <code>style</code> element</h1>
    <p class="image1">hello<br>hello</p>
    <p class="image2">bonjour<br>bonjour</p>
  </body>
</html>

And the magic is two parts: the MHTML doc inside a CSS comment and the actual CSS which uses data URIs for normal browsers and refers to the MHTML parts in IE6,7.

/*
Content-Type: multipart/related; boundary="_"
 
--_
Content-Location:locoloco
Content-Transfer-Encoding:base64
 
iVBORw0KGgoAAAAN ... [more crazyness]... QmCC
--_
Content-Location:polloloco
Content-Transfer-Encoding:base64
 
iVBORw0KGgoAAAANSUh ... [moarrr] ... ggg==
--_--
*/
.image1 {
  background-image: url("data:image/png;base64,iVBORw0KGgoAAAAN ... QmCC"); 
  *background-image: url(mhtml:http://phpied.com/files/mhtml/mhtml-html.html!locoloco); 
}
 
.image2 {
  background-image: url("data:image/png;base64,iVBORw0KGgoAAAANSUh ... ggg=="); 
  *background-image: url(mhtml:http://phpied.com/files/mhtml/mhtml-html.html!polloloco); 
}
 
body {
  font: bold 24px Arial;
}

How cool is that!

Please report any issues you might find in any browser/os combination
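Writing that style block by hand gets old quickly, so here's a sketch of generating it with PHP. This is not how the test page was built. The function name mhtml_inline_css() is made up, it assumes all images are PNGs, and it uses plain \n line endings, which may or may not keep every IE version happy.

```php
<?php
// Hypothetical helper: build the contents of the <style> element.
// $images maps a CSS class name to the raw bytes of a PNG;
// $pageUrl is the URL the HTML page itself is served from.
function mhtml_inline_css(array $images, string $pageUrl): string
{
    $parts = '';
    $rules = '';
    foreach ($images as $class => $bytes) {
        $b64 = base64_encode($bytes);
        // one MHTML part per image, named after the class
        $parts .= "--_\n"
            . "Content-Location:$class\n"
            . "Content-Transfer-Encoding:base64\n\n"
            . "$b64\n";
        // data URI for normal browsers, mhtml: reference for IE6,7
        $rules .= ".$class {\n"
            . "  background-image: url(\"data:image/png;base64,$b64\");\n"
            . "  *background-image: url(mhtml:$pageUrl!$class);\n"
            . "}\n";
    }
    return "/*\nContent-Type: multipart/related; boundary=\"_\"\n\n"
        . $parts . "--_--\n*/\n" . $rules;
}

// 'abc' stands in for real image bytes so the sketch runs without a file.
echo mhtml_inline_css(['image1' => 'abc'], 'http://example.org/page.html');
```

In real use you would pass file_get_contents() of each image. The base64 alphabet has no asterisk, so the encoded data can't close the CSS comment by accident.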

The obvious drawback is repeating the long base64'd image content twice, but it's solvable with either server-side sniffing or... one crazy hack, found on the Russian site habrahabr.ru. I should talk about it separately and help spread the word to the larger English-speaking audience, but for the impatient - click!

So there you go - MHTML inline in CSS inline in HTML or building single-request x-browser web apps :)

 

Progressive rendering via multiple flushes

Monday, December 21st, 2009

2010 update:
Lo, the Web Performance Advent Calendar hath moved

10/2011 update: You can also read the web page with Romanian translation (by Web Geek Science)

Dec 21 This post is part of the 2009 performance advent calendar experiment. Stay tuned for the articles to come.

Perceived page loading time is just as important as the real loading time. And when it comes to user perception, visible indication of progress is always good. The user gets feedback that something is going on (and in the right direction) and feels much better.

Using multiple content flushes allows you to improve both the real and the perceived performance. Let's see how.

The head flush()

While your server is busy stitching the HTML from different sources - web services, database, etc - the browser (and hence the user) just sits and waits. Why don't we make it work and start downloading components we know will be absolutely needed, such as the logo, the sprite, css, javascript... While the server is busy, you can send a part of the HTML, for example the whole <head> of the document. In there you can put the references to external components such as the CSS, which then the browser can head-start downloading while waiting for the whole HTML response.

<html>
<head>
  <title>the page</title>
  <link rel=".." href="my.css" />
  <script src="my.js"></script>
</head>
<?php flush(); ?>
<body>
  ...

Doing something like this will result in shorter waterfalls, because more downloads can happen in parallel. In the waterfall below the page is not yet completed at 0.4 seconds, yet the browser has already requested more components.

One step further - multiple flushes

While having the browser busy is good and the whole page loads faster, can we do better? How about letting the user see something while the server is still busy? Remember - show something "in a blink of an eye". And how about doing the flushing several times, hence rendering the page in stages? This will help show usable partial versions of the page without waiting on a potentially long-loading page or waiting for some blocking JavaScript to load.

Here's an example - Google search results.

The header part of that page (chunk #1) doesn't need any complicated logic. True, the page title and pre-filling the input box are dynamic parts, but this is just a simple echo of the user input, nothing that requires complex work. So out goes the header. Notice that the number of search results is not visible yet. In this chunk there's the logo, so the sprite is downloaded. If this page was using external CSS, it would be included in the head too.

Then, the search results, the meat of the page. Out it goes, as a static HTML chunk #2.

So far the page is done, but not quite yet. There's some progressive enhancement of the page which requires JavaScript. And JavaScript blocks. So include it in the footer as chunk #3.

The page is usable even without the JavaScript and without the footer. The user mostly cares about the results, so chunk #2 is what matters most. Chunk #3 can even get lost in transfer. As for chunk #1 - it's just to give feedback that "hey, we're working on your query". The first chunk actually tricks the user into believing that the query is already done. "Heck", concludes the user seeing that the page is already coming up, "that was FAST" :)
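In PHP, a three-chunk page like that could be sketched roughly as below. This is only an illustration of the idea, not how Google does it: send_chunk() and fetch_results() are made-up names, the usleep() call stands in for the slow backend work, and enhance.js is a placeholder file name.

```php
<?php
// Hypothetical helper: print one piece of the page and push it out now.
function send_chunk(string $html): void
{
    echo $html;
    if (ob_get_level() > 0) {
        ob_flush(); // empty PHP's top output buffer, if one is active
    }
    flush();        // ask the web server to send what it has so far
}

// Stand-in for the slow part (search backend, database, web services).
function fetch_results(string $query): array
{
    usleep(200000); // pretend this takes a while
    return ["Result one for $query", "Result two for $query"];
}

$query = 'flush'; // in a real page this would come from the request

// chunk #1: the header, nothing but a simple echo of the user input
send_chunk('<html><head><title>' . htmlspecialchars($query)
    . "</title></head><body><h1>Search</h1>\n");

// chunk #2: the results, the meat of the page
$items = '';
foreach (fetch_results($query) as $result) {
    $items .= '<li>' . htmlspecialchars($result) . "</li>\n";
}
send_chunk("<ol>\n$items</ol>\n");

// chunk #3: the footer with the blocking JavaScript
send_chunk("<script src=\"enhance.js\"></script></body></html>\n");
```

The header goes out before fetch_results() is even called, which is the whole point: the browser has something to render while the server is still busy.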

Boring details - HTTP 1.1 chunked encoding

So how does this work actually, how come the HTML is served in parts?

The answer is - HTTP 1.1 chunked encoding. A normal HTTP response looks like:

Headers...
[One empty line]
<html><body>response...

A chunked response would be like:

Headers...
Transfer-Encoding: chunked
More headers...

size of chunk #1
<html><body>...chunk #1...

size of chunk #2
...chunk #2...

size of chunk #3
...chunk #3 </body></html>

0 (meaning "the end!")

The chunk sizes are given in hexadecimal. Here's an example response (from the Wikipedia article):

HTTP/1.1 200 OK
Content-Type: text/plain
Transfer-Encoding: chunked

25
This is the data in the first chunk

1C
and this is the second one

0
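To make the format concrete, here's a minimal sketch of splitting a chunked body back into its chunks. The function name split_chunks() is made up for this example. It expects the raw body (everything after the headers) with CRLF line endings, and it ignores any trailers after the last chunk.

```php
<?php
// Hypothetical helper: split a raw chunked body into its chunks.
function split_chunks(string $body): array
{
    $chunks = [];
    $pos = 0;
    while (true) {
        $eol = strpos($body, "\r\n", $pos);
        if ($eol === false) {
            throw new InvalidArgumentException('missing chunk size line');
        }
        // the size line is a hex number, optionally followed by ";extension"
        $sizeLine = substr($body, $pos, $eol - $pos);
        $hex = trim(explode(';', $sizeLine, 2)[0]);
        if ($hex === '' || !ctype_xdigit($hex)) {
            throw new InvalidArgumentException('bad chunk size');
        }
        $size = (int) hexdec($hex);
        if ($size === 0) {
            break; // 0 means "the end!"
        }
        $start = $eol + 2;
        if ($start + $size > strlen($body)) {
            throw new InvalidArgumentException('truncated chunk');
        }
        $chunks[] = substr($body, $start, $size);
        $pos = $start + $size + 2; // skip the CRLF that follows the data
    }
    return $chunks;
}

// The example response body from above, with explicit CRLFs.
$body = "25\r\nThis is the data in the first chunk\r\n\r\n"
      . "1C\r\nand this is the second one\r\n\r\n"
      . "0\r\n\r\n";

print_r(split_chunks($body));
```

Note that in this example the sizes (0x25 = 37 and 0x1C = 28) count the line break at the end of each sentence as part of the chunk data, which is why there seems to be an empty line after each chunk.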

Chunking/flushing strategies

There are basically two paths you can take when it comes to chunking:

Application-level chunking
when your web app knows when to flush, based on some logic. The Google search above is an example of application-level flushing - header, body, footer are parts of the page, known to the web application
Server-level chunking
when your application doesn't worry about how the content is delivered, but leaves this task to the server. The server can choose some strategy for flushing, for example once every 4K of output. Google search does this when the user agent doesn't support gzip - it flushes out every 4k. Bing.com does it similarly - flushing about every 1k (sometimes 2K, sometimes less) regardless of the user agent's support for gzip. Interestingly enough bing's first chunk is often non-readable characters - just something that tells the browser - hey, I'm alive, here's your first byte

Amazon is an interesting example of doing a mix of both strategies - it looks like sometimes it's server level (e.g. in the middle of an html tag) but sometimes it looks like the chunk contains (or wraps up) a page section. Amazon is also a good example of focusing on what's important on a page (Why is the user here? What do they want? What do we want them to do?) and making sure it's rendered first.

The areas I've marked in this screenshot do not correspond to chunks exactly, but they show how the page renders progressively, using a combination of chunked response and source order:

  1. #1 - header. Every page has one. Get done with it.
  2. #2 - buy now. This is what we want the user to do.
  3. #3 - image. The user wants to see what they're buying. Probably also helps Amazon minimize merchandise returns :)
  4. #4 - title/price. Kind of important too.

The rest of the page - reviews, comments, also buy... all this can wait, it's all secondary. Most of it is way below the fold anyway.

Tools: none

Unfortunately, as far as I know, there's no tool that offers visibility into those chunks - what the contents of each one are and what it looks like.

Fiddler lets you see the encoded chunked response, but that doesn't help too much. At least it gives you an idea that the response was chunked - you can see under Inspectors/Transformer that there is a "Chunked Transfer-Encoding" checkbox.

Also in Fiddler under the next tab - Headers, you can see the chunked encoding header. And also Fiddler helpfully tells you that the response has been encoded (the yellow message on top)

HTTPWatch lets you see the incomprehensible response, but it also tells you the number of chunks. Note that the number includes the last 0 in the response, so when it says 4 chunks it actually means 3.

I also tried to fill the void in the tools department by attempting a Firefox extension. Unfortunately it didn't work, I couldn't find an API exposed to extensions that would give me access to the raw encoded response. Looks like it would be possible as an extension to HTTPWatch or Fiddler though - both offer extensibility, both show the raw response.

For my own consumption I did a PHP script to request the page and give me the chunks ungzipped. It's very primitive but you can give it a shot here. Test with Yahoo! Search for example.

Works with gzip!

A common concern is whether chunked encoding works together with gzipping the response. Yes, it does. In this presentation Steve Souders sheds some light (PPT, see slide #66) on how to address common issues and also gives flush() equivalents in languages other than PHP.

There are a number of things that can get in the way of successfully implementing chunked encoding+gzip, and things to try, including:

  • call ob_flush() in addition to flush() and be careful if you have several output buffers started, you may need to iterate and flush all of them
  • some browsers may require some minimal amount of content before starting to parse, IE needs at least 256 bytes
  • you may need a newer version of Apache
  • DeflateBufferSize in Apache may be set too high for your chunk size
  • check the user-contributed comments in the php.net manual for flush() for helpful advice and ideas
  • there's ob_implicit_flush() setting which may flush for you instead of you doing flush() every time
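As an illustration of the first two points in the list, here's a sketch that flushes every open output buffer and pads the first chunk to 256 bytes. The helper name flush_all_buffers() is made up, and closing the buffers (rather than just emptying them) is a choice made for this sketch, not the only way to do it.

```php
<?php
// Hypothetical helper: close and flush every PHP output buffer that is
// currently open, then ask the server to send the bytes.
// Returns the number of buffer levels it tried to close.
function flush_all_buffers(): int
{
    $levels = ob_get_level();
    for ($i = 0; $i < $levels; $i++) {
        // ob_end_flush() passes the buffer's contents one level down and
        // closes it; @ hides the notice for a buffer that can't be closed
        @ob_end_flush();
    }
    flush();
    return $levels;
}

// Two nested buffers, as you might get from a framework plus your own code.
ob_start();
ob_start();

// Pad the first chunk with spaces up to 256 bytes, the minimum the list
// above mentions for IE.
echo str_pad("<html><head><title>the page</title></head>\n", 256);

$closed = flush_all_buffers();
```

The loop counts the levels up front instead of looping while ob_get_level() > 0, so it can't spin forever on a buffer that refuses to close.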

Do it!

It may be tricky to implement multiple (or even single) flushing, but it's well worth it. There are some server setup hurdles when it comes to gzip, but once you figure it out, you only do it once. As a reward you get faster loading times, plus progressive rendering, so your page not only has a faster time-to-onload but feels that way too.