I should have gotten around to this post about a week ago, but we’ve been running around doing real work since our launch. Anyway, a while back, Marshall Kirkpatrick wrote a post entitled “Ten Useful Examples of the Real-Time Web in Action” on ReadWriteWeb. In it, he outlines several benefits that real-time web technologies can provide. At #1 is “Real-Time Push to Replace Web Crawling”, where he references PubSubHubbub co-creator Brad Fitzpatrick wondering about something that certainly interests us:
…real-time push technologies could someday replace the need for most of the web crawling his employer Google does to maintain its index. If all webpages were PubSubHubbub enabled, for example, they could simply tell a Hub about any changes they had published and Google could find out via that Hub. There would be no reason for Google to poll websites for changes over and over again.
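For context, the publisher side of that flow is tiny. Here's a rough sketch of a PubSubHubbub-style "publish" ping, assuming a hypothetical feed URL and Google's public hub (both are just placeholder examples, not anything we run):

```python
import urllib.parse
import urllib.request

# Placeholder hub -- substitute whichever hub your feed declares.
HUB_URL = "https://pubsubhubbub.appspot.com/"

def build_publish_ping(topic_url):
    """Build the form-encoded body of a 'publish' ping, which tells
    the hub that topic_url has new content available."""
    return urllib.parse.urlencode({
        "hub.mode": "publish",
        "hub.url": topic_url,
    })

def notify_hub(topic_url, hub_url=HUB_URL):
    """POST the ping to the hub. The hub then re-fetches the feed and
    pushes the update out to subscribers, so nobody has to poll."""
    req = urllib.request.Request(
        hub_url,
        data=build_publish_ping(topic_url).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Note that the ping only says "something changed at this URL" — the subscriber still gets whatever the feed chooses to expose, which matters for my second point below.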
Although this idea is compelling, I don’t think real-time push is likely to replace crawling. Here’s why:
- Real-time push is only useful for (surprisingly enough) real-time content, which is a small percentage of web content and always will be: at any given moment, the content already accumulated on the web dwarfs whatever is newly published. So unless you’ve been receiving pushes since “time 0”, you won’t be getting all the content you might want.
- Real-time push lets a site push out only snippets of content, which means you’ll still have to crawl if you want the rest. Put another way, sometimes the guy making the request wants control over the response to that request. Imagine that ;)
- This idea depends on all sites using real-time push, which I personally feel is highly unlikely to happen. Just ask the semantic web guys how many webmasters use RDF markup.
The above 3 points are general rebuttals to the idea that real-time push will be pervasive. But there’s also a specific reason why 80legs would maintain an advantage over real-time push: our distributed architecture would still provide performance and cost advantages when it comes to accessing and processing web content. Simply put, we can throw more bandwidth and compute power at accessing and processing web content than someone could on their own with a centralized data center.
Let me finish off by saying that I do think real-time push is a really cool technology. For things like pulling status updates, news, etc., it can be really useful. But I think the vast majority of the web will always need to be crawled, for many different purposes that real-time push can’t serve.