The value of APIs that can be crawled


Recently, there was an interesting article on ReadWriteWeb questioning the long-term effect of the proliferation of public APIs, versus merely offering crawlable data. On one hand, the article argued, APIs offer the publisher a great deal of control and provide access to real-time information. On the other hand, if data is accessible only through an API, then it is not available to spiders and crawlers and thus won’t show up in search results. In effect, the public loses out, since less data can be searched for.

However, I think this takes a somewhat limited view of what an API can be. In fact, if the API is designed properly, then the API itself can be discovered via a crawler, along with all the information the API provider chooses to make public. This is particularly true for APIs where ‘discoverability’ is a fundamental design concept. For example, in the open source RESTx project – a fast and simple way to create RESTful web services – a RESTful, documented and fully discoverable API is created automatically. This API consists of links, exposed resources, human- and machine-readable documentation, and descriptions of parameters, all extracted from information contained in the code and in the resources that have been created on the server. Everything can be discovered merely by following links.
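To make the idea concrete, here is a sketch of what such a self-describing resource document might look like. This is purely illustrative: the field names (`_links`, `params`, `desc`) and the example resource are my assumptions, not RESTx’s actual wire format.

```python
import json

# A hypothetical, minimal self-describing resource document of the kind a
# discoverable REST API could return. Every field name here is illustrative.
resource_doc = {
    "name": "TwitterTimeline",
    "desc": "Returns the public timeline for a given user.",
    "params": {
        "account": {
            "type": "string",
            "required": True,
            "desc": "The account whose timeline to fetch",
        },
    },
    "_links": {
        "self": "/resource/TwitterTimeline",
        "doc": "/resource/TwitterTimeline/doc",
    },
}

# Because the description is plain, linkable JSON reachable via GET, it is
# readable by humans, by API clients, and by crawlers alike.
print(json.dumps(resource_doc, indent=2))
```

The key point is the `_links` section: each document tells a client (or a spider) where to go next, so no out-of-band documentation is needed to explore the API.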

As a result, the API itself can be crawled. Retrieving data from a RESTful API means sending an HTTP GET request to a particular link. Since that is exactly what spiders do, we get the best of both worlds: on one hand, the immediacy and control of an API; on the other hand, all exposed resources of a RESTx server can be found simply by searching for them. This means that no separate repository of services and resources needs to be maintained, which is particularly interesting for systems like RESTx, where new RESTful resources can be created quickly, even by end users.
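The crawl itself needs nothing beyond GET plus link extraction. The sketch below simulates a tiny linked API as an in-memory dict standing in for HTTP GET responses; the URL layout is a hypothetical example, not RESTx’s actual one, and a real spider would of course issue network requests instead.

```python
from collections import deque

# An in-memory stand-in for a small RESTful server: each URL maps to a
# document that lists further links. The paths are illustrative only.
SITE = {
    "/": {"links": ["/resource", "/code"]},
    "/resource": {"links": ["/resource/TwitterTimeline"]},
    "/resource/TwitterTimeline": {"links": [], "desc": "A user's timeline"},
    "/code": {"links": []},
}

def get(url):
    """Stand-in for an HTTP GET request to the server."""
    return SITE[url]

def crawl(start):
    """Discover every resource reachable by following links from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        queue.extend(get(url)["links"])
    return seen

# Starting from the root, a spider finds every exposed resource.
print(sorted(crawl("/")))
```

Because discovery is just breadth-first link following over GET, any ordinary search-engine spider performs exactly this traversal without knowing anything API-specific.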

At the same time, the data itself, which is returned when a RESTful resource is accessed, remains available to crawlers and spiders if you choose to make it public. Thus, the data does not get locked away behind an API, as the RWW article feared. Instead, both the API and the data are out in the open, discoverable and searchable, unless configured otherwise.

Another strength of a RESTful API is that data can be returned in different representations. The RWW article mentions that with APIs one is limited to the content format chosen by the provider. However, a good RESTful API usually offers the ability to return content in more than one format; RESTx, for example, returns a different format based on the client’s request.
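The standard HTTP mechanism behind this is content negotiation on the Accept header: one URI, several representations. Here is a minimal sketch of the idea; the dispatch on Accept is the general HTTP technique, while the two renderers are illustrative assumptions, not RESTx’s actual implementation.

```python
import json

def render(data, accept="application/json"):
    """Return `data` in the representation the client asked for,
    based on a simplified reading of the Accept header."""
    if "text/html" in accept:
        # Browsers get a human-readable HTML list.
        rows = "".join(f"<li>{k}: {v}</li>" for k, v in data.items())
        return f"<ul>{rows}</ul>"
    # API clients and crawlers get JSON by default.
    return json.dumps(data)

data = {"account": "alice", "followers": 42}
print(render(data, accept="text/html"))
print(render(data, accept="application/json"))
```

The same resource thus serves a browser, a script, and a spider, each in the format it can best consume.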

In conclusion, I would say that RESTful APIs, such as the ones produced by RESTx, are in fact an answer to the concerns raised in the article.
