Archive for the 'ydn' Category

Collecting web data with a faster, free server

Tuesday, December 8th, 2009

Dec 8 This article is part of the 2009 performance advent calendar experiment. This is also the first ever guest post to this blog.
Please welcome the world-famous Christian Heilmann! And stay tuned for the next articles.

Christian HeilmannChris Heilmann is a self confessed data junkie and worked for over 12 years as a professional web developer. Having published several books on JavaScript, Accessibility and web development using web services he right now works as a developer evangelist for the Yahoo Developer Network. He blogs at http://wait-till-i.com/, has all his talks and videos at http://icant.co.uk/ and can be found on Twitter as @codepo8.

 

RSS is a wonderful format to get information from all kind of different sources. It is dead easy to provide, has a predictable (albeit limited) format and is very easy to use. The problem of course is that with the amount of different feeds used in one page its performance goes down.

The reason is the classic HTTP request issue - the more you negotiate, find and pull the slower your page renders. Therefore you need to try to shorten the time the calls happen.

Say you want to pull the following five RSS feeds and display them:

  • http://code.flickr.com/blog/feed/rss/
  • http://feeds.delicious.com/v2/rss/codepo8?count=15
  • http://www.stevesouders.com/blog/feed/rss
  • http://www.yqlblog.net/blog/feed/
  • http://www.quirksmode.org/blog/index.xml

The least effective way of doing that is pulling and displaying them one after the other:

Retrieving five RSS feeds using curl takes about 11 seconds.

$oldtime = microtime(true);
$url = 'http://code.flickr.com/blog/feed/rss/';
$content[] = get($url);
$url = 'http://feeds.delicious.com/v2/rss/codepo8?count=15';
$content[] = get($url);
$url = 'http://www.stevesouders.com/blog/feed/rss';
$content[] = get($url);
$url = 'http://www.yqlblog.net/blog/feed/';
$content[] = get($url);
$url = 'http://www.quirksmode.org/blog/index.xml';
$content[] = get($url);
display($content);
echo '<p>Time spent: <strong>' . (microtime(true)-$oldtime) .'</strong></p>';
function get($url){
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $output = curl_exec($ch);
  curl_close($ch);
  return $output;
}
function display($data){
  foreach($data as $d){
    $obj = simplexml_load_string($d);
    echo '<div><h2><a href="'.$obj->channel->link.'">'.
          $obj->channel->title.'</a></h2>';
    echo '<ul>';
    foreach($obj->channel->item as $i){
      echo '<li><a href="'.$i->link.'">'.$i->title.'</a></li>';
    }
    echo '</ul></div>';
  }
}

Using Stoyan's Multi Curl function you already realise quite an increase in speed.

Retrieving five RSS feeds with multicurl takes about 2.8 seconds.

$data = array(
  'http://code.flickr.com/blog/feed/rss/',
  'http://feeds.delicious.com/v2/rss/codepo8?count=15',
  'http://www.stevesouders.com/blog/feed/rss',
  'http://www.yqlblog.net/blog/feed/',
  'http://www.quirksmode.org/blog/index.xml'

);
$r = multiRequest($data);
display($r);

function multiRequest($data, $options = array()) {
  // array of curl handles
  $curly = array();
  // data to be returned
  $result = array();
  // multi handle
  $mh = curl_multi_init();
  // loop through $data and create curl handles
  // then add them to the multi-handle
  foreach ($data as $id => $d) {
    $curly[$id] = curl_init();
    $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
    curl_setopt($curly[$id], CURLOPT_URL,            $url);
    curl_setopt($curly[$id], CURLOPT_HEADER,         0);
    curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
    // post?
    if (is_array($d)) {
      if (!empty($d['post'])) {
        curl_setopt($curly[$id], CURLOPT_POST,       1);
        curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
      }
    }
    // extra options?
    if (!empty($options)) {
      curl_setopt_array($curly[$id], $options);
    }
    curl_multi_add_handle($mh, $curly[$id]);
  }
  // execute the handles
  $running = null;
  do {
    curl_multi_exec($mh, $running);
  } while($running > 0);
  // get content and remove handles
  foreach($curly as $id => $c) {
    $result[$id] = curl_multi_getcontent($c);
    curl_multi_remove_handle($mh, $c);
  }
  // all done
  curl_multi_close($mh);
  return $result;
}
function display($data){
  foreach($data as $d){
    $obj = simplexml_load_string($d);
    echo '<div><h2><a href="'.$obj->channel->link.'">'.
          $obj->channel->title.'</a></h2>';
    echo '<ul>';
    foreach($obj->channel->item as $i){
      echo '<li><a href="'.$i->link.'">'.$i->title.'</a></li>';
    }
    echo '</ul></div>';
  }
}

However, there are still two things that are annoying:

  • You pull far more data than you really need
  • You do all the request from your server

Yahoo Pipes has been used for that kind of task for quite a while, but the issue was that it is a visual interface and therefore hard to maintain. The server was also not the best performing out there.

The good news is that there is a new(er) kid on the block in Yahoo Land called YQL running on a massively fast server farm and with one purpose: making it easier to use web services, mix them and get only the data back that you want.

YQL in itself is a web service and you post queries to it that access other web services in a SQL style syntax. Normally you'd get RSS feeds using the RSS table, like so:

select * from rss where url="http://feeds.delicious.com/v2/rss/codepo8?count=15"

See the single RSS in the YQL console (you need a Yahoo account to log in). You can also see the single RSS retrieval output.

The issue with this is that it only retrieves the items of the RSS feed and not the title, which we need for the headings. Therefore we need to use the XML table:

select * from xml where url="http://feeds.delicious.com/v2/rss/codepo8?count=15"

See the single XML in the YQL console (you need a Yahoo account to log in). You can also see the single XML retrieval output.

This gives us the same data the normal cURL calls give us. The cool thing about YQL is though that you can filter the data you get back to the bare minimum. In our case, this means replacing the * with the title and the link of the feed and of the items:

select channel.title,channel.link,channel.item.title,channel.item.link 
  from xml 
  where url="http://feeds.delicious.com/v2/rss/codepo8?count=15"

See the filtered RSS in the YQL console (you need a Yahoo account to log in). You can also see the filtered RSS retrieval output.

In order to use this with all of our RSS feeds, we can use the in() command:

select channel.title,channel.link,channel.item.title,channel.item.link
    from xml where url in(
      'http://code.flickr.com/blog/feed/rss/',
      'http://feeds.delicious.com/v2/rss/codepo8?count=15',
      'http://www.stevesouders.com/blog/feed/rss',
      'http://www.yqlblog.net/blog/feed/',
      'http://www.quirksmode.org/blog/index.xml'
    )

Check the aggregation in the console or the aggregation output.

This leaves all the hard work to the Yahoo Server farm. YQL pulls all the RSS feeds, adds one after the other and then gives it back to us as XML. We could simply use the generated URL from the console, but it is much more versatile to assemble the query in PHP:

$data = array(
  'http://code.flickr.com/blog/feed/rss/',
  'http://feeds.delicious.com/v2/rss/codepo8?count=15',
  'http://www.stevesouders.com/blog/feed/rss',
  'http://www.yqlblog.net/blog/feed/',
  'http://www.quirksmode.org/blog/index.xml'
);
$url ='http://query.yahooapis.com/v1/public/yql?q=';
$query = "select channel.title,channel.link,channel.item.title,channel.item.link from xml where url in('".implode("','",$data)."')";
$url.=urlencode($query).'&format=xml';
$content = get($url);
display($content);
function get($url){
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  $output = curl_exec($ch);
  curl_close($ch);
  return $output;
}
function display($data){
  $data = simplexml_load_string($data);
  $sets = $data->results->rss;
  $all = sizeof($sets);
  for($i=0;$i<$all;$i++){
    $r = $sets[$i];
    $title = $r->channel->title.'';
    if($title != $oldtitle){
      echo '<div><h2><a href="'.($r->channel->link.'').'">'.
           ($r->channel->title.'').'</a></h2><ul>';
    }
      echo '<li><a href="'.($r->channel->item->link.'').'">'.
           ($r->channel->item->title.'').'</a></li>';
    if($title != $sets[$i+1]->channel->title.''){
      echo '</ul></div>';
    }
    $oldtitle = $r->channel->title.'';
  };
}

As you can see the loop to display the different RSS feeds a bit clunky and we could use an open YQL table to move the whole conversion to a server-side JavaScript. However, as it is the performance of this way of retrieving the RSS feeds beats all the others hands-down already:

Retrieving five RSS feeds using YQL takes about half a second

You can try it yourself, get the demo code from GitHub and run it on your own server to see the magic of YQL.

 

Yahoo Music API

Thursday, August 7th, 2008

This was meant to be a longer posts with examples and such, but Jim Bumgardner said it and coded it better than I could :) He's been with Y!Music way longer than me and has done way cooler stuff.

As a front-end engineer for Yahoo! Music, I've always thought it would great if the web services we use to create the Y! Music pages were available to developers outside of Yahoo!, and, as of today, due to the herculean labors of our web services team, they are!

Full blog post and code examples are on the YDN (Yahoo! Developer Network) blog and the Music API docs are there too.

Could you come up with an idea of a mashup using music data like artists, songs, similarities? The APIs are here.

 

My online footprint lately

Wednesday, July 23rd, 2008

This is a sort of a catch-up post for listing what I've been up to lately.

  • YUI Blog just published my first article, I'm so proud. It's about loading JavaScript in non-blocking fashion, because JavaScripts, they, you know, like, block downloads. Luckily, there's an easy fix - DOM includes, which I've previously discussed, discussed and discussed.
  • SitePoint published an update to my older article that introduces AJAX, ok, Ajax, by creating a command-line-like interface with PHP on the server side. The updated article features improved code, jQuery example, YUI example, JSON discussion and example. Check it out, bookmark and recommend to your friends that keep asking you "What's this AJAX (they are new, don't know it's now spelled "Ajax") thing? Do you know of a good article?"
  • YDN (developer.yahoo.com) published a video presentation of me and my lovely teammate Nicole Sullivan where we talk about some new and cool front-end performance techniques. So if you wandered how I look and are eager to hear my fabulous Balkan peninsula accent, give it a shot. The talk is called "After YSlow 'A'" and is targeted at those of you who have reached performance nirvana, but are still hungry for more. We talk about preloading components, post-loading, javascript, images, using flush() in PHP to send first byte early on and other fun stuff.
  • Last, not least, I decided to try and find some time to update my JavaScript patterns site. Unfortunately I got sidetracked (yep, I'm easily distracted by shiny objects) and played with a not-so-javascript pattern. The post I published (includes a pretty lame screencast! and) demonstrates how you can use animated background position to indicate loading progress.

Whew, c'est tout pout ce moment, expect a lot more now that the JavaScript book is out of the way. Ah, yep, if you feel like it, join me on Facebook, I created a JS book page.

 

Simultaneuos HTTP requests in PHP with cURL

Tuesday, February 19th, 2008

The basic idea of a Web 2.0-style "mashup" is that you consume data from several services, often from different providers and combine them in interesting ways. This means you often need to do more than one HTTP request to a service or services. In PHP if you use something like file_get_contents() this means all the requests will be synchronous: a new one is fired only after the previous has completed. If you need to make three HTTP requests and each call takes a second, your app is delayed at least three seconds.

Solution

An improvement of course is to cache responses as much as possible, but at one point or another you still need to make those requests.

Using the curl_multi* family of cURL functions you can make those requests simultaneously. This way your app is as slow as the slowest request, as opposed to the sum of all requests. And that's something.

A function

Here's a little function I coded that will allow you do multi requests.

<?php

function multiRequest($data, $options = array()) {

  // array of curl handles
  $curly = array();
  // data to be returned
  $result = array();

  // multi handle
  $mh = curl_multi_init();

  // loop through $data and create curl handles
  // then add them to the multi-handle
  foreach ($data as $id => $d) {

    $curly[$id] = curl_init();

    $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
    curl_setopt($curly[$id], CURLOPT_URL,            $url);
    curl_setopt($curly[$id], CURLOPT_HEADER,         0);
    curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);

    // post?
    if (is_array($d)) {
      if (!empty($d['post'])) {
        curl_setopt($curly[$id], CURLOPT_POST,       1);
        curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
      }
    }

    // extra options?
    if (!empty($options)) {
      curl_setopt_array($curly[$id], $options);
    }

    curl_multi_add_handle($mh, $curly[$id]);
  }

  // execute the handles
  $running = null;
  do {
    curl_multi_exec($mh, $running);
  } while($running > 0);

  // get content and remove handles
  foreach($curly as $id => $c) {
    $result[$id] = curl_multi_getcontent($c);
    curl_multi_remove_handle($mh, $c);
  }

  // all done
  curl_multi_close($mh);

  return $result;
}

?>

Consuming

The function accepts an array of URLs to hit and optionally an array of cURL options if you need to pass any. The first array can be a simple indexed array or URLs or it can be an array of arrays where the second has a key named "url". If you use the second way and you also have a key called "post", the function will do a post request.

The function returns an array of responses as strings. The keys in the result array match the keys in the input.

A GET example

Let's say you want to use some Yahoo search web services (consult YDN) to create a music artist band-o-pedia kind of mashup. Here's how you can search audio, video and images at the same time:

<?php

$data = array(
  'http://search.yahooapis.com/VideoSearchService/V1/videoSearch?appid=YahooDemo&query=Pearl+Jam&output=json',
  'http://search.yahooapis.com/ImageSearchService/V1/imageSearch?appid=YahooDemo&query=Pearl+Jam&output=json',
  'http://search.yahooapis.com/AudioSearchService/V1/artistSearch?appid=YahooDemo&artist=Pearl+Jam&output=json'
);
$r = multiRequest($data);

echo '<pre>';
print_r($r);

?>

This will print something like:

Array
(
    [0] => {"ResultSet":{"totalResultsAvailable":"633","totalResultsReturned":...
    [1] => {"ResultSet":{"totalResultsAvailable":"105342","totalResultsReturned":...
    [2] => {"ResultSet":{"totalResultsAvailable":10,"totalResultsReturned":...
)

A POST example

There's an interesting Yahoo search service called term extraction which analyses content. It accepts POST requests. Here's how to consume this service with the function above, making two simultaneous requests:

<?php
$data = array(array(),array());

$data[0]['url']  = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction';
$data[0]['post'] = array();
$data[0]['post']['appid']   = 'YahooDemo';
$data[0]['post']['output']  = 'php';
$data[0]['post']['context'] = 'Now I lay me down to sleep,
                               I pray the Lord my soul to keep;
                               And if I die before I wake,
                               I pray the Lord my soul to take.';

$data[1]['url']  = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction';
$data[1]['post'] = array();
$data[1]['post']['appid']   = 'YahooDemo';
$data[1]['post']['output']  = 'php';
$data[1]['post']['context'] = 'Now I lay me down to sleep,
                               I pray the funk will make me freak;
                               If I should die before I waked,
                               Allow me Lord to rock out naked.';

$r = multiRequest($data);

print_r($r);
?>

And the result:

Array
(
    [0] => a:1:{s:9:"ResultSet";a:1:{s:6:"Result";s:5:"sleep";}}
    [1] => a:1:{s:9:"ResultSet";a:1:{s:6:"Result";a:3:{i:0;s:5:"freak";i:1;s:5:"sleep";i:2;s:4:"funk";}}}
)