Archive for the 'Analysis' Category

Reddit Ad by the Numbers for B2B Services

We’ve started experimenting with various advertising strategies here at 80legs.  So far we’ve mostly focused on Google AdWords, but we’re now looking at other channels as well.

We just wrapped up our first experiment with Reddit Ads.  I recently read both Gabriel Weinberg’s and Jason Wilk’s posts on using Reddit Ads.  Here’s a summary of the results they got with their ads:

DuckDuckGo (Gabriel):

  • Duration: 13 days
  • Cost: $650 ($50 per day)
  • Impressions: 1,288,378 (282,732 uniques)
  • Clicks: 20,700 (18,420 uniques)
  • CTR: 1.61% (6.49% unique)

Whiteyboard (Jason):

  • Duration: 2 days
  • Cost: $700 ($350 per day)
  • Impressions: 299,784 (63,000 uniques)
  • Clicks: 4,226 (4,197 uniques)
  • CTR: 1.41% (6.68% unique)

Here are the numbers for 80legs (note: the ad ran only in the Technology section):

  • Duration: 6 days
  • Cost: $120 ($20 per day)
  • Impressions: 402,537 (86,174 uniques)
  • Clicks: 1,327 (1,276 uniques)
  • CTR: 0.33% (1.48% unique)
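
As a quick sanity check on the figures above, CTR is just clicks divided by impressions, and dividing cost by clicks gives a cost per click. The sketch below recomputes both from the quoted totals. Cost per click is a derived figure that none of the three posts report, and the `Campaign` class and its field names are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    name: str
    cost_usd: float
    impressions: int
    clicks: int

    @property
    def ctr(self) -> float:
        """Click-through rate as a fraction: clicks / impressions."""
        return self.clicks / self.impressions

    @property
    def cost_per_click(self) -> float:
        """Derived metric (not reported in the posts): cost / clicks."""
        return self.cost_usd / self.clicks


# Totals copied from the three campaign summaries above.
campaigns = [
    Campaign("DuckDuckGo", 650, 1_288_378, 20_700),
    Campaign("Whiteyboard", 700, 299_784, 4_226),
    Campaign("80legs", 120, 402_537, 1_327),
]

for c in campaigns:
    print(f"{c.name}: CTR {c.ctr:.2%}, cost per click ${c.cost_per_click:.3f}")
```

On these totals, 80legs paid roughly nine cents per click, against about three cents for DuckDuckGo and about seventeen cents for Whiteyboard.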

And here’s some additional data from Google Analytics:

  • Pages/Visit: 2.41
  • Avg. Time on Site: 01:11
  • Bounce Rate: 50.25%

While our numbers seem quite pitiful compared to DDG and Whiteyboard, I’m not exactly disappointed.  Both DDG and Whiteyboard are consumer products/services.  We’re a B2B service, which means our target market is much smaller in terms of number of individuals.  There are going to be far fewer people interested in web crawling than in using a search engine or a whiteboard.

The most important factor is ROI.  Our ad cost $120.  At least one or two people contacted us to express interest in purchasing plans or custom services.  Our plus plan is just $99/month, so I’m fairly confident the ad was a “win” for us.

We also had 116 signups during the course of the Reddit ad.  A typical 6-day period will have 80 – 100 signups.  So that’s also good, but not astoundingly awesome.
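
To make the ROI reasoning concrete, here is a rough back-of-the-envelope sketch using only the numbers in this post. It assumes, generously, that every signup above the usual 6-day baseline is attributable to the ad, which the data can't actually confirm. The function names are just for illustration.

```python
AD_COST_USD = 120                # total ad spend, from the post
SIGNUPS_DURING_AD = 116          # signups during the 6-day run, from the post
BASELINE_RANGE = (80, 100)       # typical 6-day signups, from the post
PLUS_PLAN_USD_PER_MONTH = 99     # plus plan price, from the post


def incremental_signups(observed: int, baseline: int) -> int:
    """Signups above baseline (assumed, not proven, to come from the ad)."""
    return max(observed - baseline, 0)


def cost_per_incremental_signup(cost: float, observed: int, baseline: int) -> float:
    """Ad cost divided by signups above baseline; infinite if there is no lift."""
    extra = incremental_signups(observed, baseline)
    return float("inf") if extra == 0 else cost / extra


low_baseline, high_baseline = BASELINE_RANGE
best_case = cost_per_incremental_signup(AD_COST_USD, SIGNUPS_DURING_AD, low_baseline)
worst_case = cost_per_incremental_signup(AD_COST_USD, SIGNUPS_DURING_AD, high_baseline)

# Months of a single plus-plan subscription needed to cover the ad spend.
months_to_break_even = AD_COST_USD / PLUS_PLAN_USD_PER_MONTH

print(f"Cost per extra signup: ${best_case:.2f} to ${worst_case:.2f}")
print(f"Months of one plus plan to break even: {months_to_break_even:.2f}")
```

Under that assumption, the lift is 16 to 36 signups, or between about $3.33 and $7.50 per extra signup, and a single plus-plan customer covers the ad in a little over a month.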

Overall, I would say the ad worked well for us.  We’re going to try some variations on the ad (targeting specific crawl packages, etc.) and see if that works better.

B2C ads are going to perform better than B2B ads on pretty much any ad channel.  With a positive ROI, the ad was worth the expense and at the very least warrants additional experimentation.


Crawling and The Programmable Web

Today, applications increasingly depend on a rich ecosystem of APIs. Thousands of different services are variously tethered together to form new software offerings and enhance existing ones. The idea of a programmable web is finally coming true.

While that is no trivial development, I am nonetheless beginning to question the long-term effects of an API-centric worldview, a sort of blind faith in the almighty API, which has at best a difficult relationship with open data and big data concepts.

How do we access data today?

There are two core ways to access data today – via a publisher or via a crawler. Each has a different role.

Publishers have data and choose to make it publicly available through an API so that developers can easily design products powered by a given service.

Crawlers on the other hand are used to proactively go out and grab data by yourself — scraping web pages for whatever it is you’re looking for, data that can then be used to build products, and inform better product and marketing decisions.

There is something of a third option: data aggregators like Factual, Infochimps and Hoovers. I’m not going to treat them much as part of this post because they gain access to data like the rest of us – via APIs and/or crawlers. They facilitate the distribution of that data as part of their core business (most often using a marketplace concept or subscription), but the input mechanisms are no different.

And there is potentially even a fourth option – human curation of the kind that Factual and WolframAlpha and CrowdFlower employ to acquire new data altogether. But all of these providers offer API access to their data, so I’m still going to bucket them as such.

APIs, at least as we think of them today, have many disadvantages. And before you grab your shovels and organize a mob to come after me, please understand that I’m not calling for the discontinuation of APIs.

At 80legs, we ourselves offer a popular API, which takes a particularly hybrid approach – providing programmatic access to the data acquired via crawls.

What I really want is a natural stratification based on who is good at what, essentially. Right now, we’re asking APIs to do too much.

APIs are great for the real-time web, for example – they’re great for staying up to speed, whether that means trending search data or retweet velocity. APIs are great for enhancing functionality – whether that’s a Klout score or geolocation. APIs are great for integrating certain pieces of non-strategic infrastructure like invite codes (Prefinery) if you’re a startup in beta, or Freshbooks, if you’re an accountant. They’re also great for app-level integrations, like adding Facebook accounts to Tweetdeck, or sucking down content from Netflix.

But at a higher level, as all applications and services become more and more data-driven, it’s important to understand the differences between these different methods for extracting data, regardless of where you net out philosophically.

This is a debate that needs to take place.

Control, Control, Control

Control and flexibility are the two most important elements to look at when it comes to the difference between an API and a crawl. I also spend some time at the end of this post talking about security and privacy, because I think there are big impacts there for APIs and crawlers alike.

Cost might be a fourth facet to look at, but that’s grounds for a different post because pricing varies so widely.

Let’s start with control.

When using an API, publishers – companies like Amazon and LinkedIn, for example – control the entire process. Publishers provide you with an API account, which allows you a certain amount of calls, or requests for data per day. They also determine what kinds of content are made available, and in what context.

Publishers offer an API for many reasons. It’s financially in their best interest to have products built on top of their data to increase developer loyalty and form a kind of API-dependency to their content. It’s also useful as a way to accurately measure server usage and overall engagement, even if there’s no money involved.

APIs can go down and become unavailable, they can go from free to paid, and their publishers can be acquired by larger companies that make all manner of changes. There’s a lot of uncertainty in APIs, and many devs have learned this fact the hard way. Think back to Gnip rethinking their entire business model due to the relicensing of certain APIs.

But like moths, we so often head right back to the flame.

Crawlers act very differently. They allow much more control over the data acquisition process. This has many advantages.

For starters, the format in which content is delivered can be a lifesaver if formatted properly, or prompt hours of additional work if not.

APIs supply content in one format – the format chosen by the publisher.

Say you need an XML file but the company only delivers JSON through their API. You’re either stuck or left spending hours re-formatting.

Crawlers let the choice-driven developer have his cake and eat it too. Formats are just another choice to make beforehand, instead of a hindrance.
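
When you run the crawl yourself, the output format is simply a parameter of your own code. The sketch below is a minimal illustration of that idea using Python's standard library. The `export` helper, the record fields, and the tag names are all hypothetical, and it assumes each record's keys are valid XML element names.

```python
import json
import xml.etree.ElementTree as ET


def to_json(records: list) -> str:
    """Serialize crawled records as a JSON array."""
    return json.dumps(records)


def to_xml(records: list, root_tag: str = "records", item_tag: str = "record") -> str:
    """Serialize crawled records as XML, one child element per field."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            # Assumes each key is a valid XML element name.
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# The format is chosen up front by whoever runs the crawl.
SERIALIZERS = {"json": to_json, "xml": to_xml}


def export(records: list, fmt: str) -> str:
    """Write the same records in whichever supported format was requested."""
    return SERIALIZERS[fmt](records)


# Hypothetical record, as a crawler might extract it from a page.
records = [{"url": "http://example.com/", "title": "Example Domain"}]
print(export(records, "xml"))
print(export(records, "json"))
```

Adding another format means adding another entry to `SERIALIZERS`, rather than waiting on a publisher to support it.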

Granted, standardization can be great in some cases – for example with sites like MySpace where each profile is customized and therefore rendered in HTML differently. MySpace APIs format the content to make it uniform, meaning that what was once difficult to work with as a developer (i.e. large discrepancies in the data), is now standardized and simple to use.

But the “one size fits all” mentality fails more often than you might think, especially once you step outside of the web’s largest sites – one size fits all rarely fits anyone well.

And it’s not just format – crawling offers much more control when it comes to time and timing, scope, and cost, too.

Flexibility and Availability

Data access choices are an important component of building any web product, especially when it comes to flexibility and availability. Specs change, needs change — heck, markets change. Especially if you’re a lean startup, out early + iterate often is a way of life.

APIs only deliver content from the publisher’s site. You’re locked into a single interface’s content sources and structure, without flexibility by definition, which can be very limiting. You’re left with acquiring stand-alone datasets to supplement your evolving needs, or mashing up with another API to fill in holes.

Now, the very best API providers are great at adapting to developers’ needs and evolving alongside them. Companies like Yolink, for whom the API is their bread and butter, are particularly responsive. But too often an API is left unattended, having been a mere box to check, instead of a strategic commitment.

Unimaginative APIs can also limit use cases unwittingly, because some of the furthest-flung (if more promising) applications just aren’t supported in the calls or code. There’s a huge difference between an API that wants to be heightened and explored, and an API whose scope, if anything, constrains original thinking.

Crawlers on the other hand aren’t specific to any one site’s data, meaning that they can access content from any number of sources and compile it in one place, mixing and matching, comparing and contrasting to your heart’s delight.

Crawls can be more open-ended and investigative as well, whereas an API is more about putting a square peg in a square hole. APIs also don’t offer competitive advantage – everyone has access to the same stuff. A clever crawl can help build a moat.

Finally, crawlers can reach far beyond the capabilities of an API. Millions of pieces of data are publicly available on the web, and only a very small percentage of it is available via an API. At a certain point it’s purely an issue of volume. Much of the web is instantly crawl-able, and the amount of data available freely on the web is growing more quickly than the number of APIs by an order of magnitude. The caveat – you just have to know where to look.

The Elephant in the Room — Security and Privacy

Let’s talk about privacy and data, because how the world evolves in this respect could have huge implications for APIs and crawlers alike.

As the recent Facebook data privacy concerns highlight, the security of people’s data is a high priority, regardless of how it may or may not be acquired or sold. Further, users expect publishers to protect their data aggressively (whether they do is another matter).

And this is a PR/perception issue as much as it is anything else.

Users worry that their data might get into the hands of people who will use it for malicious purposes, whether via an API or a crawler.  I would argue that this is not always the case, because responsible crawling companies, at least, have strict licensing agreements with their clients to ensure data is used lawfully.

But, the reality is that publishers are increasingly incentivized because of public policy issues to constrain API access. And the world’s biggest crawler, Google, is starting to look evil, with the ominous question “what exactly does Google know about me?” popping up at family dinners around the country.

Some are even arguing that Facebook is bound to be federally regulated sooner or later because of its profligacy when it comes to data, and that would certainly have broad impacts.

APIs are not inherently more or less secure than crawlers, but in the current climate, especially with regard to privacy, we can expect companies large and small to make less and less data open and available (something that the linked data community has been ruing as well).

Security right now is a big X factor that is going to take some time to play out.

The nice thing about crawlers (depending on your perspective) is that they are harder to control, at least for now. But it is a reasonable thing to say that data responsibility and privacy issues are going to shape and reshape this conversation big time.


Today’s web is full of data that if kept within an API-driven paradigm suffers from less creative use, less flexibility, and less control (from a developer standpoint).

An endlessly crawl-able web was in many ways what Tim Berners-Lee and the W3C intended for the web all along. Content creators like publishers and social networks can create sites as they’ve always done, while data aggregators can access data in whatever format they like.

In fact, in an older but still applicable interview, Berners-Lee talks about why an open, linked data web is far preferable to APIs for data access.

There is a foundational, DNA-level need to share data. Without openness, you lose the full value and impede any future innovation in the process.

APIs absolutely have benefits – but only when we are not beholden to them – when we can use them rationally, strategically, and carefully. And when data isn’t at the crux of your site, service or application.

“We have an open API” is an overused phrase, especially as APIs are not by definition open or closed.

If you need certain attributes, like real-time/speed, certain capabilities, or certain pieces of infrastructure, there are thousands of amazing APIs out there. But if your business runs on data – crawling is the only way to go.

Most of the web isn’t real-time

I should have gotten around to this post about a week ago, but we’ve been running around doing real work since our launch.  Anyway, a while back, Marshall Kirkpatrick wrote a post entitled “Ten Useful Examples of the Real-Time Web in Action” on ReadWriteWeb.  In it, he outlines several benefits that real-time web technologies can provide.  At #1 is “Real-Time Push to Replace Web Crawling”, where he references PubSubHubbub co-creator Brad Fitzpatrick wondering about something that certainly interests us:

…real-time push technologies could someday replace the need for most of the web crawling his employer Google does to maintain its index. If all webpages were PubSubHubbub enabled, for example, they could simply tell a Hub about any changes they had published and Google could find out via that Hub. There would be no reason for Google to poll websites for changes over and over again.

Although this idea is certainly very compelling, I don’t think it’s very likely that real-time push can replace crawling.  Here’s why:

  1. Real-time push is only useful for (surprisingly enough) real-time content, which is a small % of web content, and always will be (just do some simple induction to figure out why).  So unless you’ve been receiving pushes since “time 0”, you won’t be getting all the content you might want.
  2. Real-time push allows the site to only provide snippets of content, which means you’ll have to crawl if you want more.  Put another way, sometimes the guy making the request wants control over the response of that request.  Imagine that ;)
  3. This idea depends on all sites using real-time push, which I personally feel is highly unlikely to happen.  Just ask the semantic web guys how many webmasters use RDF markup.
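
The “time 0” point in item 1 can be sketched with a toy model. This is not the PubSubHubbub protocol, just a simplified illustration in which a subscriber only hears about pages published at or after the moment it subscribed. The `Page` class, the timestamps, and the URLs are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    published_at: int  # arbitrary time units


def received_by_push(pages: list, subscribed_at: int) -> list:
    """Pages a push subscriber hears about: those published once it is subscribed."""
    return [p for p in pages if p.published_at >= subscribed_at]


def needs_crawl(pages: list, subscribed_at: int) -> list:
    """Pages published before the subscription began; push never delivers these."""
    return [p for p in pages if p.published_at < subscribed_at]


# Hypothetical site with three pages published at different times.
pages = [
    Page("http://example.com/a", 1),
    Page("http://example.com/b", 5),
    Page("http://example.com/c", 9),
]

pushed = received_by_push(pages, subscribed_at=6)
backlog = needs_crawl(pages, subscribed_at=6)

print("Delivered by push:", [p.url for p in pushed])
print("Still needs a crawl:", [p.url for p in backlog])
```

In this model the backlog only shrinks to zero if the subscription starts before the first page is published, which is the “time 0” condition.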

The above 3 points are general rebuttals to the idea that real-time push will be pervasive.  There’s also a specific reason why 80legs would still maintain an advantage over real-time push: our distributed architecture would still provide performance and cost advantages when it comes to accessing and processing web content.  Simply put, we can throw more bandwidth and compute power at fetching and processing web content than someone could on their own with a centralized data center.

Let me finish off by saying that I do think real-time push is a really cool technology.  For things like pulling status updates, news, etc., it can be really useful.  But I think the vast majority of the web will always need to be crawled, for many different purposes that real-time push can’t provide.

Comparing 80legs to Yahoo! BOSS

Yahoo! recently announced a new pricing scheme for their BOSS platform, so we thought it would be a good idea to provide a comparison between 80legs and BOSS.

Web-Scale Development vs. Re-packaged Yahoo! Search

The biggest difference between 80legs and BOSS is that 80legs is a platform for developing your own web-scale applications while BOSS is an API for retrieving search results from Yahoo!.  In other words, with 80legs you can easily build any kind of web-scale app that accesses the entire Internet.  With BOSS, you are ultimately re-packaging search results from Yahoo!.

Query Types

BOSS lets you make 4 types of queries:

  • Spelling
  • Web
  • News
  • Image

Each of these query types is logically the same type: keyword matching on text content.  The difference between the four is the result type you get with each one.  80legs has no limitations on query types.  With our service, you can do any of the following:

  • Keyword matching on text content (includes all 4 BOSS query ‘types’)
  • Visual matching on images (e.g., Is Image A similar to Image B?)
  • Programmatic queries (e.g., On which pages does the word ‘Obama’ appear 4 times?)
  • And any other query type you conceive

Because 80legs is an application development platform, you can create your own code to create any query type you want.
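
To give a feel for what a programmatic query like the ‘Obama’ example involves, here is a minimal sketch in plain Python. It is not the 80legs API. It runs over a hypothetical in-memory dictionary of page text, and it assumes “appear 4 times” means exactly 4 case-sensitive whole-word matches.

```python
import re


def count_word(text: str, word: str) -> int:
    """Count case-sensitive whole-word occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))


def pages_with_exact_count(pages: dict, word: str, n: int) -> list:
    """Return the URLs of pages where `word` appears exactly `n` times."""
    return [url for url, text in pages.items() if count_word(text, word) == n]


# Hypothetical crawled pages, keyed by URL.
pages = {
    "http://example.com/a": "Obama spoke. Obama waved. Obama left. Obama returned.",
    "http://example.com/b": "Obama spoke once.",
    "http://example.com/c": "No mention here.",
}

print(pages_with_exact_count(pages, "Obama", 4))
```

A keyword index can tell you which pages mention a word, but a count-based condition like this needs code that runs against the page content itself.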


Within some of the BOSS query types listed above, you can pass in a limited set of filter options to narrow down the result set your query returns.  For example, with web queries, you can choose from a set of 6 file types.  When filtering with 80legs, you pass in regular expressions instead of pre-defined options.  This gives the developer infinitely more freedom when it comes to filtering result sets.
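
As a rough illustration of regular-expression filtering versus a fixed menu of options, the sketch below filters a list of URLs with arbitrary patterns. It is plain Python, not the actual 80legs interface, and the `filter_urls` helper and sample URLs are hypothetical.

```python
import re


def filter_urls(urls: list, pattern: str) -> list:
    """Keep only the URLs that match the given regular expression."""
    compiled = re.compile(pattern)
    return [u for u in urls if compiled.search(u)]


# Hypothetical crawl results.
urls = [
    "http://example.com/report.pdf",
    "http://example.com/index.html",
    "http://example.com/archive/2009/data.csv",
    "http://example.com/archive/2010/data.csv",
]

# Filter by file type, much as a pre-defined file-type option would.
pdf_or_csv = filter_urls(urls, r"\.(pdf|csv)$")

# Filter by something a fixed list of options is unlikely to cover.
archive_2009 = filter_urls(urls, r"/archive/2009/")

print(pdf_or_csv)
print(archive_2009)
```

The first pattern reproduces a file-type filter, while the second selects on URL structure, which a pre-defined list of options is unlikely to cover.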


Here’s the pricing table for BOSS:


Each unit costs $0.10.  This table is a bit opaque, but with a little math we can break it down as follows (MRR = million results returned):

  • $0.10 per MRR: off-peak use
  • $3.00 per MRR: 1,000 results/query, on-peak use
  • $10.00 per MRR: 100 results/query, on-peak use
  • $12.00 per MRR: 50 results/query, on-peak use
  • $30.00 per MRR: 10 results/query, on-peak use

The cost to use 80legs is more straightforward (MPC = million pages crawled):

  • $2.00 per MPC: for crawling/accessing content
  • $0.03 per CPU-hr: for computing/analysis performed on content
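
To see how the two schemes play out, here is a small sketch that plugs the per-unit rates listed above into a hypothetical workload. The volumes (10 million pages or results, 50 CPU-hours) are made up for illustration, and the function names are not part of either service.

```python
# Rates taken from the breakdowns above.
EIGHTY_LEGS_USD_PER_MILLION_PAGES = 2.00
EIGHTY_LEGS_USD_PER_CPU_HOUR = 0.03

BOSS_OFF_PEAK_USD_PER_MILLION_RESULTS = 0.10
# On-peak rate per million results, keyed by results per query.
BOSS_ON_PEAK_USD_PER_MILLION_RESULTS = {1000: 3.00, 100: 10.00, 50: 12.00, 10: 30.00}


def eighty_legs_cost(pages_crawled: int, cpu_hours: float) -> float:
    """Crawl cost plus compute cost at the listed 80legs rates."""
    crawl = EIGHTY_LEGS_USD_PER_MILLION_PAGES * pages_crawled / 1_000_000
    compute = EIGHTY_LEGS_USD_PER_CPU_HOUR * cpu_hours
    return crawl + compute


def boss_cost(results_returned: int, results_per_query=None) -> float:
    """BOSS cost; results_per_query=None means off-peak use."""
    if results_per_query is None:
        rate = BOSS_OFF_PEAK_USD_PER_MILLION_RESULTS
    else:
        rate = BOSS_ON_PEAK_USD_PER_MILLION_RESULTS[results_per_query]
    return rate * results_returned / 1_000_000


# Hypothetical workload: 10 million pages or results, 50 CPU-hours of analysis.
print(f"80legs:           ${eighty_legs_cost(10_000_000, 50):.2f}")
print(f"BOSS on-peak/100: ${boss_cost(10_000_000, 100):.2f}")
print(f"BOSS off-peak:    ${boss_cost(10_000_000):.2f}")
```

For this made-up workload the totals are $21.50 for 80legs, $100.00 for BOSS on-peak at 100 results per query, and $1.00 for BOSS off-peak. A page crawled and a result returned are not the same unit, so treat these as orders of magnitude only.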

Now, this is admittedly a bit of an apples-to-oranges comparison (hopefully we’ve impressed upon you that 80legs is a different animal and has way more features), but it gives you some sense of the difference in pricing.  Companies interested in serious web-scale development could potentially save a lot by going with BOSS during off-peak hours, but I wonder if they would be trying BOSS at all due to the limitations we mentioned above.  (Also, it’s not clear what constitutes ‘off-peak’ at this point.)  Smaller users will be paying less on a per-unit basis.  Again, this is an apples-oranges scenario, so comparing the two pricing schemes is a bit odd, but we like to be thorough :).


80legs and BOSS are two very different things.  80legs is a platform for making any kind of web-scale application.  BOSS is a way to query Yahoo!.  80legs allows much more functionality and enables a much wider variety of services and products looking to do interesting things with Internet data.