Crawling and The Programmable Web

Today, applications increasingly depend on a rich ecosystem of APIs. Thousands of different services are variously tethered together to form new software offerings and enhance existing ones. The idea of a programmable web is finally coming true.

While this is no small achievement, I am nonetheless beginning to question the long-term effects of an API-centric worldview, a sort of blind faith in the almighty API, which has at best a difficult relationship with open data and big data concepts.

How do we access data today?

There are two core ways to access data today – via a publisher or via a crawler. Each has a different role.

Publishers have data and choose to make it publicly available through an API so that developers can easily design products powered by a given service.

Crawlers, on the other hand, proactively go out and grab data themselves, scraping web pages for whatever it is you’re looking for. That data can then be used to build products and to inform better product and marketing decisions.

There is something of a third option: data aggregators like Factual, Infochimps and Hoovers. I’m not going to treat them much as part of this post because they gain access to data like the rest of us – via APIs and/or crawlers. They facilitate the distribution of that data as part of their core business (most often using a marketplace concept or subscription), but the input mechanisms are no different.

And there is potentially even a fourth option – human curation of the kind that Factual and WolframAlpha and CrowdFlower employ to acquire new data altogether. But all of these providers offer API access to their data, so I’m still going to bucket them as such.

APIs, at least as we think of them today, have many disadvantages. And before you grab your shovels and organize a mob to come after me, please understand that I’m not calling for the discontinuation of APIs.

At 80legs, we ourselves offer a popular API, which takes a particularly hybrid approach – providing programmatic access to the data acquired via crawls.

What I really want is a natural stratification based on who is good at what, essentially. Right now, we’re asking APIs to do too much.

APIs are great for the real-time web, for example – they’re great for staying up to speed, whether that means trending search data or retweet velocity. APIs are great for enhancing functionality – whether that’s a Klout score or geolocation. APIs are great for integrating certain pieces of non-strategic infrastructure like invite codes (Prefinery) if you’re a startup in beta, or Freshbooks, if you’re an accountant. They’re also great for app-level integrations, like adding Facebook accounts to Tweetdeck, or sucking down content from Netflix.

But at a higher level, as all applications and services become more and more data-driven, it’s important to understand the differences between these different methods for extracting data, regardless of where you net out philosophically.

This is a debate that needs to take place.

Control, Control, Control

Control and flexibility are the two most important elements to look at when it comes to the difference between an API and a crawl. I also spend some time at the end of this post talking about security and privacy, because I think there are big impacts there for APIs and crawlers alike.

Cost might be a fourth facet to look at, but that’s grounds for a different post because pricing varies so widely.

Let’s start with control.

When using an API, publishers – companies like Amazon and LinkedIn, for example – control the entire process. Publishers provide you with an API account, which allows you a certain number of calls, or requests for data, per day. They also determine what kinds of content are made available, and in what context.
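To make the daily call allowance concrete, here is a minimal sketch of how a client might keep track of it on its own side. The class name DailyQuota and the limit of two calls are hypothetical, and real publishers enforce their quotas server-side in ways that vary from one API to the next.

```python
from datetime import date


class DailyQuota:
    """Tracks how many API calls remain for the current day (illustrative only)."""

    def __init__(self, calls_per_day):
        self.calls_per_day = calls_per_day
        self._day = None   # the day the current count belongs to
        self._used = 0     # calls already spent on that day

    def try_acquire(self, today=None):
        """Count one call and return True if budget remains, else return False."""
        today = today or date.today()
        if today != self._day:
            # A new day has started, so the allowance resets.
            self._day = today
            self._used = 0
        if self._used >= self.calls_per_day:
            return False
        self._used += 1
        return True


# Hypothetical limit of two calls per day.
quota = DailyQuota(calls_per_day=2)
day_one = date(2010, 8, 6)
print([quota.try_acquire(day_one) for _ in range(3)])  # [True, True, False]
```

Once the allowance is spent, the only options are to wait for the reset or to negotiate a higher tier with the publisher.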

Publishers offer an API for many reasons. It’s financially in their best interest to have products built on top of their data, both to increase developer loyalty and to create a kind of dependency on their content. It’s also useful as a way to accurately measure server usage and overall engagement, even if there’s no money involved.

APIs can go down and become unavailable, they can go from free to paid, and their publishers can be acquired by larger companies that make all manner of changes. There’s a lot of uncertainty in APIs, and many devs have learned this fact the hard way. Think back to Gnip rethinking their entire business model due to the relicensing of certain APIs.

But like moths, we so often head right back to the flame.

Crawlers act very differently. They allow much more control over the data acquisition process. This has many advantages.

For starters, the format in which content is delivered can be a lifesaver if it matches what you need, or cost hours of additional work if it doesn’t.

APIs supply content in one format – the format chosen by the publisher.

Say you need an XML file but the company only delivers JSON through their API. You’re either stuck or left spending hours re-formatting.
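For a sense of what that re-formatting involves, here is a minimal sketch using only Python’s standard library. It assumes a flat JSON object whose keys happen to be valid XML tag names; the function name and the sample record are hypothetical, and nested or irregular data would need considerably more handling than this.

```python
import json
import xml.etree.ElementTree as ET


def json_record_to_xml(json_text, root_tag="record"):
    """Convert a flat JSON object into a one-level XML string.

    Assumes every key is a valid XML tag name and every value is a scalar.
    """
    data = json.loads(json_text)
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical API response.
print(json_record_to_xml('{"title": "Example", "price": 9.99}', root_tag="product"))
# <product><title>Example</title><price>9.99</price></product>
```

The work multiplies quickly once the publisher’s JSON nests objects, repeats fields, or uses keys that are not legal XML names.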

Crawlers let the choice-driven developer have his cake and eat it too. Formats are just another choice to make beforehand, instead of a hindrance.

Granted, standardization can be great in some cases – for example with sites like MySpace where each profile is customized and therefore rendered in HTML differently. MySpace APIs format the content to make it uniform, meaning that what was once difficult to work with as a developer (i.e. large discrepancies in the data), is now standardized and simple to use.

But the “one size fits all” mentality fails more often than you might think, especially once you step outside of the web’s largest sites – one size fits all rarely fits anyone well.

And it’s not just format – crawling offers much more control when it comes to time and timing, scope, and cost, too.

Flexibility and Availability

Data access choices are an important component of building any web product, especially when it comes to flexibility and availability. Specs change, needs change — heck, markets change. Especially if you’re a lean startup, out early + iterate often is a way of life.

APIs only deliver content from the publisher’s site. You’re locked into a single interface’s content sources and structure, without flexibility by definition, which can be very limiting. You’re left with acquiring stand-alone datasets to supplement your evolving needs, or mashing up with another API to fill in holes.

Now, the very best API providers are great at adapting to developers’ needs and evolving alongside them. Companies like Yolink, for whom their API is their bread and butter, are particularly responsive. But too often an API is left unattended, having been a mere box to check, instead of a strategic commitment.

Unimaginative APIs can also limit use cases unwittingly, because some of the furthest-flung (if more promising) applications just aren’t supported in the calls or code. There’s a huge difference between an API that wants to be heightened and explored, and an API whose scope, if anything, constrains original thinking.

Crawlers on the other hand aren’t specific to any one site’s data, meaning that they can access content from any number of sources and compile it in one place, mixing and matching, comparing and contrasting to your heart’s delight.
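As a rough illustration of compiling content from several sources in one place, the sketch below pulls links out of HTML from two different sites and merges them into a single de-duplicated list. The page contents and the example.com addresses are made up, and the fetching step is deliberately left out; a real crawler would also need politeness rules, error handling, and a download layer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def collect_links(pages):
    """pages maps page URL -> HTML text; returns sorted, unique absolute links."""
    found = set()
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        found.update(parser.links)
    return sorted(found)


# Hypothetical pages from two unrelated sites.
pages = {
    "http://a.example.com/": '<a href="/x">x</a>',
    "http://b.example.com/dir/": '<a href="y.html">y</a><a href="http://a.example.com/x">x again</a>',
}
print(collect_links(pages))
# ['http://a.example.com/x', 'http://b.example.com/dir/y.html']
```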

Crawls can be more open-ended and investigative as well, whereas an API is more about putting a square peg in a square hole. APIs also don’t offer competitive advantage – everyone has access to the same stuff. A clever crawl can help build a moat.

Finally, crawlers can reach far beyond the capabilities of an API. Millions of pieces of data are publicly available on the web, and only a very small percentage of them are available via an API. At a certain point it’s purely an issue of volume. Much of the web is instantly crawlable, and the amount of data available freely on the web is growing more quickly than the number of APIs by an order of magnitude. The caveat – you just have to know where to look.

The Elephant in the Room — Security and Privacy

Let’s talk about privacy and data, because how the world evolves in this respect could have huge implications for APIs and crawlers alike.

As the recent Facebook data privacy concerns highlight, the security of people’s data is a high priority, regardless of how it may or may not be acquired or sold. Further, users expect publishers to protect their data aggressively (whether they do is another matter).

And this is a PR/perception issue as much as it is anything else.

Users worry that their data might get into the hands of people who will use it for malicious purposes, whether via an API or a crawler. I would argue that this is not always the case, because responsible crawling companies, at least, have strict licensing agreements with their clients to ensure data is used lawfully.

But, the reality is that publishers are increasingly incentivized because of public policy issues to constrain API access. And the world’s biggest crawler, Google, is starting to look evil, with the ominous question “what exactly does Google know about me?” popping up at family dinners around the country.

Some are even arguing that Facebook is bound to be federally regulated sooner or later because of its profligacy when it comes to data, and that would certainly have broad impacts.

APIs are not inherently more or less secure than crawlers, but in the current climate, especially with regard to privacy, we can expect companies large and small to make less and less data open and available (something that the linked data community has been ruing as well).

Security right now is a big X factor that is going to take some time to play out.

The nice thing about crawlers (depending on your perspective) is that they are harder to control, at least for now. But it is a reasonable thing to say that data responsibility and privacy issues are going to shape and reshape this conversation big time.


Today’s web is full of data that, if kept within an API-driven paradigm, suffers from less creative use, less flexibility, and less control (from a developer standpoint).

An endlessly crawlable web was in many ways what Tim Berners-Lee and the W3C intended for the web all along. Content creators like publishers and social networks can create sites as they’ve always done, while data aggregators can access data in whatever format they like.

In fact, in an older but still applicable interview, Berners-Lee talks about why an open, linked data web is by far preferable to APIs for data access.

There is a foundational, DNA-level need to share data. Without openness, you lose the full value and impede any future innovation in the process.

APIs absolutely have benefits – but only when we are not beholden to them – when we can use them rationally, strategically, and carefully. And when data isn’t at the crux of your site, service or application.

“We have an open API” is an overused phrase, especially as APIs are not by definition open or closed.

If you need certain attributes, like real-time/speed, certain capabilities, or certain pieces of infrastructure, there are thousands of amazing APIs out there. But if your business runs on data – crawling is the only way to go.

5 Responses to “Crawling and The Programmable Web”

  1. Greg Perry, August 6, 2010 at 9:18 pm

    A very insightful post, eloquent and to the point.

    Your approach is interesting.

    How do you navigate the minefield of licensing on the part of the content owner, for example if Facebook has a specific prohibition on the part of third party use (and reuse) of their userbase’s content, how do you legally make that available through your service and API? Do you just adhere to robots.txt for acquired content, or do you have strategic partnerships in place with the larger content and social networking service providers that gives you the ability to redistribute their content?

    Warm regards and best of luck to you.

  2. Shion Deysarkar, August 7, 2010 at 10:24 am

    You bring up some good points. The issues around publicly-available content are murky, but there have been some recent developments that help clarify them.

    A judge has recently ruled that a website TOS is essentially not a binding legal document, which I feel is a pretty fair statement to make. Websites make content publicly available, but try to restrict access to that content through a document that is never agreed upon.

    Our high-level opinion is that public content means just that.

