Sunday, May 5, 2013

Programmatic Web-browsing with http.Navigator

Did you know there's an HTTP response header called "Link"?

> HEAD / HTTP/1.1
> Host: pfraze.blogspot.com

< HTTP/1.1 200 OK
< Link: <http://pfraze.blogspot.com/2013>; rel="collection"; title="2013"

It describes resources which have some relationship to the current one, primarily through the "href" (the URL), the "rel" attribute (a keyword describing the relationship), and the "title" attribute. Grimwire's local.http.Navigator object makes use of it by searching links by their rel and title:

local.http.navigator('http://pfraze.blogspot.com')
    .collection('2013') // find a {rel=collection, title=2013} link
    .collection('05') // find a {rel=collection, title=05} link
    .item('programmatic-web-browsing-with-http-navigator')
    // ^ find a {rel=item, title=programmatic...} link
    .getJson();
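
To make the href/rel/title structure concrete, here's a rough sketch of turning a Link header value into plain objects. The parseLinkHeader function is a hypothetical helper written for this post, not part of Grimwire's API, and it only covers the simple cases shown here (no commas or semicolons inside quoted values).

```javascript
// Minimal Link header parser (a sketch, not Grimwire's implementation).
// Assumes each link looks like: <href>; key="value"; key=value
function parseLinkHeader(header) {
  const links = [];
  // Capture the href between <...>, then the run of ;key=value parameters.
  const linkPattern = /<([^>]*)>((?:\s*;\s*[^;,]+)*)/g;
  let match;
  while ((match = linkPattern.exec(header)) !== null) {
    const link = { href: match[1] };
    for (const param of match[2].split(';')) {
      const eq = param.indexOf('=');
      if (eq === -1) continue; // skip the empty segment before the first ';'
      const key = param.slice(0, eq).trim().toLowerCase();
      const value = param.slice(eq + 1).trim().replace(/^"|"$/g, '');
      link[key] = value;
    }
    links.push(link);
  }
  return links;
}

const example = parseLinkHeader(
  '<http://pfraze.blogspot.com/2013>; rel="collection"; title="2013"'
);
console.log(example);
```

Each resulting object has an "href" plus whatever attributes the header carried, which is the shape the navigator searches by rel and title.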

Each navigation is lazy: nothing is sent until the program asks the navigator to make a request, at which point each pending navigation is resolved with a HEAD request. In the example above, "getJson()" would trigger this traffic:

> HEAD / HTTP/1.1
< HTTP/1.1 200 OK
< Link: <http://pfraze.blogspot.com/2013>; rel="collection"; title="2013"

> HEAD /2013 HTTP/1.1
< HTTP/1.1 200 OK
< Link: <http://pfraze.blogspot.com/2013/05>; rel="collection"; title="05"

> HEAD /2013/05 HTTP/1.1
< HTTP/1.1 200 OK
< Link: <http://pfraze.blogspot.com/2013/05/programmatic-web-browsing-with-http-navigator>; rel="item"; title="programmatic-web-browsing-with-http-navigator"

> GET /2013/05/programmatic-web-browsing-with-http-navigator HTTP/1.1
...

There's one round-trip per navigation, but since Grimwire runs Web servers within the browser, much of that traffic will have negligible latency.
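
The lazy behavior can be sketched roughly like this. It is only an illustration of the idea, not Grimwire's code: the LazyNavigator class, its follow() and resolve() methods, and the head callback (which stands in for a HEAD request and returns already-parsed links) are all made up for this example.

```javascript
// Hypothetical sketch of lazy navigation; not Grimwire's implementation.
class LazyNavigator {
  constructor(url, head, steps = []) {
    this.url = url;     // starting URL
    this.head = head;   // assumed: async (url) => array of {href, rel, title}
    this.steps = steps; // queued {rel, title} navigations, not yet resolved
  }

  // Queue a navigation. No traffic happens here.
  follow(rel, title) {
    return new LazyNavigator(this.url, this.head, [...this.steps, { rel, title }]);
  }

  // Resolve every queued navigation, one HEAD round-trip per step.
  async resolve() {
    let url = this.url;
    for (const { rel, title } of this.steps) {
      const links = await this.head(url);
      const link = links.find((l) => l.rel === rel && l.title === title);
      if (!link) {
        // Stand-in for rejecting the response promise with a 404.
        throw new Error(`404: no ${rel} link titled "${title}" at ${url}`);
      }
      url = link.href;
    }
    return url; // the URL the final GET would be sent to
  }
}
```

Calling follow() only records the step; the HEAD requests happen inside resolve(), which is the point where a real navigator would go on to send the GET.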

To avoid enumerating every possible link, the navigator also uses URI Templates:

Link: <http://pfraze.blogspot.com/{title}>; rel="collection"

This would result in the exact same behavior as above. Grimwire first tries to find a link that matches both rel and title. If it doesn't find that, but does find a link with a matching rel-type and no title attribute, it'll use that. This allows you to spec out your URI scheme generally, rather than enumerating every reachable URI.
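
That lookup order (exact rel and title match first, then a rel match with no title) could be sketched as follows. findLink is a hypothetical name, and the link objects are assumed to be plain {href, rel, title} records.

```javascript
// Sketch of the two-pass link lookup described above (hypothetical helper).
function findLink(links, rel, title) {
  // A rel attribute may hold several space-separated relation types.
  const hasRel = (link) => (link.rel || '').split(/\s+/).includes(rel);

  // First choice: a link matching both rel and title.
  const exact = links.find((l) => hasRel(l) && l.title === title);
  if (exact) return exact;

  // Fallback: a link with a matching rel and no title attribute,
  // typically a URI Template such as http://pfraze.blogspot.com/{title}.
  const generic = links.find((l) => hasRel(l) && l.title === undefined);
  return generic || null; // null means no match was found
}

const links = [
  { href: 'http://pfraze.blogspot.com/2013', rel: 'collection', title: '2013' },
  { href: 'http://pfraze.blogspot.com/{title}', rel: 'collection' },
];
console.log(findLink(links, 'collection', '2013').href);
console.log(findLink(links, 'collection', '2012').href);
```

The first lookup returns the explicitly titled link; the second falls through to the untitled template link.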

More extensive templates, with more variables, can also be used:

local.http.navigator('http://pfraze.blogspot.com')
    .collection('posts', { tags:'grimwire' })
    .getJson()
// > HEAD http://pfraze.blogspot.com
// < Link: <http://pfraze.blogspot.com/{title}{?tags}>; rel="collection"
// > GET http://pfraze.blogspot.com/posts?tags=grimwire
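
For a feel of how the template gets filled in, here's a small sketch that handles just the two forms used in this post, {name} and {?name,...}. expandTemplate is a hypothetical helper covering a small subset of the URI Template spec, and its use of encodeURIComponent only approximates the spec's encoding rules.

```javascript
// Expands {var} and {?a,b} expressions only (a subset of URI Templates).
function expandTemplate(template, vars) {
  return template.replace(/\{(\??)([^}]+)\}/g, (_, isQuery, names) => {
    // Variables with no value are dropped from the expansion.
    const defined = names.split(',').filter((n) => vars[n] !== undefined);
    if (isQuery) {
      if (defined.length === 0) return '';
      return '?' + defined
        .map((n) => `${n}=${encodeURIComponent(vars[n])}`)
        .join('&');
    }
    return defined.map((n) => encodeURIComponent(vars[n])).join(',');
  });
}

console.log(
  expandTemplate('http://pfraze.blogspot.com/{title}{?tags}', {
    title: 'posts',
    tags: 'grimwire',
  })
);
// prints "http://pfraze.blogspot.com/posts?tags=grimwire"
```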

If any navigation fails to find a matching link, the navigator rejects the response promise with a 404.

Advantages:
  • No manual URI construction.
  • Less out-of-band client knowledge: because clients follow links, the server can change URIs at any time, possibly routing to other hosts or new URI schemes.
  • By default, a 404 will not be retried, reducing traffic in a failure condition.

Tips:
  • Generally, only link to items which are "adjacent" in the resource hierarchy (/a should link to / and /a/b, but not /a/b/c). A subsequent navigation can reach anything farther away, and links that are one hop away are easier to describe.
  • A Grimwire Worker server can describe the Link header as an array of objects. The attributes map directly to the standard Link definition, but also include the "href" key, which specifies the URL. For instance, [{href:"http://pfraze.blogspot.com/{title}", rel:"collection"}]
  • More documentation can be found at grimwire.com.
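
As an illustration of how the array-of-objects form from the tips lines up with the header syntax, here's a sketch that serializes such an array into a Link header value. formatLinkHeader is a hypothetical helper, not a Grimwire function, and it assumes attribute values contain no double quotes.

```javascript
// Serialize [{href, rel, title, ...}] into a Link header value (a sketch).
function formatLinkHeader(links) {
  return links
    .map(({ href, ...attrs }) => {
      // Every key other than href becomes a quoted parameter.
      const params = Object.entries(attrs).map(([k, v]) => `${k}="${v}"`);
      return [`<${href}>`, ...params].join('; ');
    })
    .join(', ');
}

console.log(
  formatLinkHeader([{ href: 'http://pfraze.blogspot.com/{title}', rel: 'collection' }])
);
// prints '<http://pfraze.blogspot.com/{title}>; rel="collection"'
```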
