Archive Page 2

Python API Released

The 80legs Python API is now available for use.  To learn how to access and use it, visit the 80legs Python API documentation.


New feature: 80app packs!

We’ve just deployed a new version of 80legs that adds an exciting new feature: 80app Packs!

Plus and Premium subscribers will now have access to a growing set of useful, pre-built 80apps.  The following 80apps are currently available or will be available soon:


Plus Plan:

  • Return Page Content
  • Regex Text Matcher
  • Regex Source Matcher
  • Image Resizer

Premium Plan:

  • All Plus 80apps
  • Social Network Scrapers
  • E-commerce Site Scrapers

80legs users will be able to select these apps and get the information they want from crawls with zero programming.  Everything will be pre-built and ready to go.  We want to make things as easy as possible for our users.
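To make the idea concrete, a "Regex Text Matcher"-style app boils down to running a user-supplied pattern over each crawled page's text. Here is a minimal, hypothetical sketch in Python (not the actual 80app code — the function name and interface are illustrative):

```python
import re

def regex_text_match(page_text, pattern):
    """Return all non-overlapping matches of a user-supplied pattern
    in a crawled page's text -- the essence of a regex-matching app."""
    return re.findall(pattern, page_text)

# e.g. pull email-like strings out of a page's text
matches = regex_text_match("Contact: a@b.com and c@d.org", r"[\w.]+@[\w.]+")
```

In the real service, the pattern would come from the user's job configuration and run against every document in the crawl, with no code written by the user.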

We plan to keep adding more 80apps to the Plus and Premium Plans.  If you have an idea for an 80app you’d like to see, just let us know!

Our predictions for 2010

I put up a post on Silicon Angle with my opinions on some potential trends for 2010.  While I’m no Nostradamus, what I’ve posted there is based on what we’ve been seeing through our work on 80legs and my own experience getting more involved in the national tech startup culture.  Take a look and let us know what you think!

80legs Subscription Plans and Free Web-Crawling

We have just updated 80legs with some exciting new changes.  Starting today, the 80legs service will be divided into 3 tiers: Basic, Plus and Premium.  Since launch, we’ve noticed that our customer base falls into 3 major groups – light, medium and heavy users.  Each plan targets one of these groups and is designed to fulfill its specific needs.

Here are details on each plan:

Basic Plan:

  • Free to use
  • Normal crawling speed (up to 1 request/second/domain)
  • Access to 80legs Web Portal
  • 1 job running at a time
  • Up to 100K crawled pages per job
  • Low priority in 80legs job queue
  • No recurring jobs allowed

Plus Plan:

  • $99/month + crawling fees
  • Fast crawling speed (up to 5 requests/second/domain)
  • Access to 80legs Web Portal and API
  • Up to 3 jobs running at a time
  • Up to 1M crawled pages per job
  • Normal priority in 80legs job queue
  • Recurring jobs allowed

Premium Plan:

  • $299/month + crawling fees
  • Ultra-fast crawling speed (up to 10 requests/second/domain)
  • Access to 80legs Web Portal and API
  • Up to 5 jobs running at a time
  • Up to 10M crawled pages per job
  • Preferred priority in 80legs job queue
  • Recurring jobs allowed
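The per-domain speed caps above (1, 5, or 10 requests/second/domain) are a standard politeness constraint in web crawling. A minimal sketch of how such a per-domain limiter can work, assuming a single-threaded crawler (hypothetical illustration, not 80legs' actual implementation):

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Enforce a maximum request rate per domain (e.g. 1, 5, or 10 req/s)."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self.last_request = defaultdict(float)  # domain -> last request time

    def wait(self, domain):
        """Sleep just long enough to respect this domain's rate limit,
        then record the request time."""
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()
```

A Plus Plan crawl would behave like `DomainRateLimiter(5)`: call `wait(domain)` before each fetch, and requests to different domains proceed without throttling each other.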

Existing users can sign up for a plan by going to the new Subscription section in the 80legs Web Portal, where there are complete details and instructions on signing up for a plan.

We’re really excited about these changes.  The Basic Plan now enables completely free web-crawling, which until today was unheard of.  The Plus and Premium Plans give heavier users the ability to set up and run more intensive crawls.

If any of our users have questions about the changes, please contact us or submit a ticket.  We’re always happy to hear from you!

Defrag Experience

This past week I was in Denver attending Defrag 2009, which is something of the uber-tech geek con and bills itself as:

…focused on the tools and technologies that accelerate the “aha” moment, and is a gathering place for the growing community of implementers, users, and thinkers that are building the next wave of software innovation.

It was a unique experience, to say the least.  We were actually unsure about attending when we first heard about Defrag.  Eric Norlin had contacted me several months ago about us being a sponsor.  Since “big data” is one of the themes at Defrag, he justifiably figured we would fit right in.  Unfortunately, with DEMO looming, we were unsure Defrag would be worth taking a chunk out of our budget.  I initially declined, but Eric was persistent and contacted me again after DEMO.  Of course, I was even more cautious about committing now that all the money for DEMO had actually been spent!  But after Eric offered me the opportunity to speak, I decided we’d go for it.

Let me first say that deciding to attend Defrag was definitely the right move.  The caliber of the audience was the highest of any conference I’ve seen.  Each person who came by the booth was plugged-in, technical and business-savvy.  We managed to generate a good number of promising leads, which was impressive considering there were only about 350 people in attendance.  From a pure business perspective, closing just 2 or 3 of these leads would make the conference worth it for us.

We had some great one-on-one conversations with folks there, including talks with the guys at Infochimps, Robert Scoble, and Bill from Factual (previously of Y! BOSS).  We also gave some folks a sneak peek at what we’re working on with Language Computer.  Without providing too much detail, we’re building a service called Extractiv, which will let people turn any part of the web into highly structured, semantic data.

The one downvote I’d give Defrag is that the talks didn’t always live up to the “tech” billing I expected.  In many cases, on-stage discussion converged on social media, Twitter, etc.  While those are important new developments, many of the speakers focused on how to create the right UI or visualize social content.  My personal opinion is that UI and visualization are not the hard problems in these spaces.  Rather, converting that content into meaningful and actionable data is.

Oh, and here’s my little presentation!

I think I ruffled some feathers by actually suggesting something could be better than the cloud in some cases (god forbid!). :)

Overall it was a great time, and I look forward to attending next year!

Web-Scale Apps Challenge

We just launched the 80legs Web-Scale Apps Challenge over at ChallengePost!  We’re challenging anyone and everyone to make the coolest apps for crawling and processing web content.  The top 3 entries will win some pretty sweet prizes, like a Kindle, original mint-condition Atari, and more.

We issued this challenge in anticipation of our App Store launch, which will happen the week of November 16th.  The 80legs App Store will allow our users to buy and run 80Apps created by third-party developers.  Our users will get to run custom code without having to do their own development work, and developers get a way to monetize cool web content processing technologies.

More details on the App Store to come!  For now, check out the challenge over at ChallengePost!

Most of the web isn’t real-time

I should have gotten around to this post about a week ago, but we’ve been running around doing real work since our launch.  Anyway, a while back, Marshall Kirkpatrick wrote a post entitled “Ten Useful Examples of the Real-Time Web in Action” on ReadWriteWeb.  In it, he outlines several benefits that real-time web technologies can provide.  At #1 is “Real-Time Push to Replace Web Crawling”, where he references PubSubHubbub co-creator Brad Fitzpatrick wondering about something that certainly interests us:

…real-time push technologies could someday replace the need for most of the web crawling his employer Google does to maintain its index. If all webpages were PubSubHubbub enabled, for example, they could simply tell a Hub about any changes they had published and Google could find out via that Hub. There would be no reason for Google to poll websites for changes over and over again.

Although this idea is certainly very compelling, I don’t think it’s very likely that real-time push can replace crawling.  Here’s why:

  1. Real-time push is only useful for (surprisingly enough) real-time content, which is a small % of web content, and always will be (just do some simple induction to figure out why).  So unless you’ve been receiving pushes since “time 0”, you won’t be getting all the content you might want.
  2. Real-time push allows the site to only provide snippets of content, which means you’ll have to crawl if you want more.  Put another way, sometimes the guy making the request wants control over the response of that request.  Imagine that ;)
  3. This idea depends on all sites using real-time push, which I personally feel is highly unlikely to happen.  Just ask the semantic web guys how many webmasters use RDF markup.
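To illustrate point 1: a crawler that polls has to re-fetch a page just to learn whether anything changed, whereas push delivers that fact directly. A toy sketch of poll-based change detection by content hashing (hypothetical illustration; real crawlers also lean on HTTP conditional GETs with `ETag`/`Last-Modified` to cheapen the check):

```python
import hashlib

def poll_for_change(page_body, state):
    """Compare this fetch's content hash against the previous poll.
    The bandwidth spent downloading page_body is spent whether or not
    the page changed -- that's the cost push-based delivery avoids."""
    digest = hashlib.sha256(page_body).hexdigest()
    changed = state.get("last_digest") != digest
    state["last_digest"] = digest
    return changed
```

Run this on every poll cycle and most iterations return `False` — each one a full download that push would have made unnecessary, but also each one a chance to pick up content the site never bothered to push.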

The above 3 points are general rebuttals to the idea that real-time push will be pervasive.  There’s also a specific reason why 80legs would maintain an advantage over real-time push: our distributed architecture provides performance and cost advantages when accessing and processing web content.  Simply put, we can throw more bandwidth and compute power at web content than someone could on their own with a centralized data center.

Let me finish off by saying that I do think real-time push is a really cool technology.  For things like pulling status updates, news, etc., it can be really useful.  But I think the vast majority of the web will always need to be crawled, for many different purposes that real-time push can’t provide.
