This blog has moved. Please go to http://blog.80legs.com and update your bookmarks.
Tags: big data, sxsw, SXSWi, web crawling
80legs is teaming up with InfoChimps, host of last year’s inaugural Data Cluster at SXSWi. Our panel, “Data Nerds, Is Big Data Crushing the Web?”, questions the future of big data and its impact on the future of tech. Here’s an excerpt from our proposal:
Web data is growing at a record pace – and data junkies will soon rule the tech world. 50 million tweets per day. 1.2 million photos served per second. 50 million websites added annually. The question is, how are we expected to build the next generation of technological innovations on top of this ever-growing Everest of data? To be honest, it can be daunting. In this panel, we’ll discuss how big data on the web changes the game for everyone. Is Hadoop good enough to manage this data explosion? Is massive web crawling dead? Is it even feasible to make such vast amounts of data open to everyone, and how do people even tap into it? Should the average Joe even care?
We’re excited to see several other panel proposals that also address the issue of making sense out of ever-more-massive amounts of data. While you’re voting us up, give these folks some thumbs as well!
The collateral that is presently available is largely from the social media giants that tout solutions built using 10,000 node clusters that process petabytes of data a day. The reality? The average person just cannot relate or intuitively draw parallels to their own business problems. While Big Data solutions are worthwhile far before you reach petabyte scale data, just getting started can be a challenge in itself.
With probabilistic computing, you can interpret and act on all kinds of data using statistical inference – starting with some background assumptions, you can propose possible configurations of the world that explain how that data came about. You can use probabilistic computing to trace effects back to their probable causes. For instance, what do web surfing and purchases tell us about the consumers? How can site usage patterns inform user interface design? And what are the best ways to target ads and offers at specific users?
Visualizing social data teaches us about people’s behavior, cultural norms, relationships and much more. The panelists are interactive visualization gurus from groups who are all trying to make sense of data – Stamen Design, IBM Research, Microsoft, New York Times and Google.
This session presents notes from the road gathered over the last 4+ years while building Scout Labs (by Lithium Technologies). It includes discovery and acquisition of data, and the amount available. We also cover the general messiness and lack of structure of the data, and challenges in building systems to analyze it.
Tags: ip protection, monotype imaging
As millions of web pages are created every day, IP protection is an ever-growing concern for content creators. While most folks associate IP protection with things like music and movies, these are not the only types of content that need to be protected. Monotype Imaging uses IP protection services to track the usage of font types across the web.
In order to assist its IP protection services, Monotype uses 80legs to run incredibly large scans of the web. These scans crawl across tens of thousands of popular domains and identify the location of fonts on the web pages of these domains. 80legs uses a proprietary algorithm, provided by Monotype and converted to an 80app, to check these files and extract metadata from them. Using this information, Monotype can essentially run a gigantic data collection survey of how and where particular fonts are used on the web.
The web crawl run by 80legs processes 80 million URLs in about 2 days and updates its findings on a monthly basis, though it could update more frequently if necessary. This kind of powerful web crawling enables Monotype to stay up to date and gives them unsurpassed competitive and customer intelligence.
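For a rough sense of scale, those figures work out to a few hundred URLs per second. Here is a back-of-the-envelope sketch; it assumes "about 2 days" means exactly 48 hours of continuous crawling, which is a simplification on our part rather than a measured number.

```python
# Back-of-the-envelope crawl throughput.
# Assumption: "about 2 days" is treated as exactly 48 hours of nonstop crawling.
URLS_PER_CRAWL = 80_000_000
CRAWL_SECONDS = 2 * 24 * 60 * 60  # 172,800 seconds


def urls_per_second(total_urls: int, seconds: int) -> float:
    """Average sustained fetch rate needed to finish the crawl in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_urls / seconds


rate = urls_per_second(URLS_PER_CRAWL, CRAWL_SECONDS)
print(f"{rate:.0f} URLs/second on average")
```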
For more information on Monotype Imaging, be sure to check out their website. If you’re interested in similar services from 80legs or would like to be featured in a future newsletter, please contact us.
Tags: advertising, reddit
We’ve started experimenting with various advertising strategies here at 80legs. So far we’ve mostly focused on Google AdWords, but we’re now looking at other channels as well.
- Duration: 13 days
- Cost: $650 ($50 per day)
- Impressions: 1,288,378 (282,732 uniques)
- Clicks: 20,700 (18,420 uniques)
- CTR: 1.61% (6.49% unique)
- Duration: 2 days
- Cost: $700 ($350 per day)
- Impressions: 299,784 (63,000 uniques)
- Clicks: 4,226 (4,197 uniques)
- CTR: 1.41% (6.68% unique)
- Duration: 6 days
- Cost: $120 ($20 per day)
- Impressions: 402,537 (86,174 uniques)
- Clicks: 1,327 (1,276 uniques)
- CTR: 0.33% (1.48% unique)
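If you want to run the same comparison on your own campaigns, the overall CTR is just clicks divided by impressions. The sketch below uses the totals from the three lists above; the names `campaign_1` to `campaign_3` are placeholders that simply follow the order of the lists, and the cost-per-click column is an extra we derive here, not a figure reported by the ad platform.

```python
# Click-through rate (CTR) and cost per click from raw campaign totals.
# Labels are placeholders that follow the order of the lists in the post.


def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a percentage."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100.0 * clicks / impressions


campaigns = {
    "campaign_1": {"cost": 650, "impressions": 1_288_378, "clicks": 20_700},
    "campaign_2": {"cost": 700, "impressions": 299_784, "clicks": 4_226},
    "campaign_3": {"cost": 120, "impressions": 402_537, "clicks": 1_327},
}

for name, c in campaigns.items():
    rate = ctr(c["clicks"], c["impressions"])
    cost_per_click = c["cost"] / c["clicks"]
    print(f"{name}: CTR {rate:.2f}%, cost per click ${cost_per_click:.3f}")
```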
And here’s some additional data from Google Analytics:
- Pages/Visit: 2.41
- Avg. Time on Site: 01:11
- Bounce Rate: 50.25%
While our numbers seem quite pitiful compared to DDG and Whiteyboard, I’m not exactly disappointed. Both DDG and Whiteyboard are consumer products/services. We’re a B2B service, which means our target market is much smaller in terms of # of individuals. There are going to be far fewer people interested in web crawling than using a search engine or a whiteboard.
The most important factor is ROI. Our ad cost $120. At least one or two people have contacted us expressing interest in purchasing plans or custom services. Our plus plan is just $99/month, so I’m fairly confident the ad was a “win” for us.
We also had 116 signups during the course of the Reddit ad. A typical 6-day period will have 80 – 100 signups. So that’s also good, but not astoundingly awesome.
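To put a number on "good, but not astoundingly awesome," here is a quick sketch of the lift over baseline and what each extra signup cost. It assumes the typical 80 to 100 signups is what we would have seen without the ad, which we can't know for certain, so treat the output as a range rather than a measurement.

```python
# Incremental signups attributable to the ad, and cost per incremental signup.
# Assumption: the "typical" 80-100 signups per 6 days is the no-ad baseline.
AD_COST = 120            # dollars, from the post
SIGNUPS_DURING_AD = 116  # from the post
BASELINE_RANGE = (80, 100)


def incremental_signups(observed: int, baseline: int) -> int:
    """Signups above the baseline; never negative."""
    return max(observed - baseline, 0)


def cost_per_incremental_signup(cost: float, observed: int, baseline: int) -> float:
    """Ad spend divided by the signups gained over the baseline."""
    extra = incremental_signups(observed, baseline)
    if extra == 0:
        raise ValueError("no lift over baseline, cost per signup is undefined")
    return cost / extra


for baseline in BASELINE_RANGE:
    extra = incremental_signups(SIGNUPS_DURING_AD, baseline)
    cost_each = cost_per_incremental_signup(AD_COST, SIGNUPS_DURING_AD, baseline)
    print(f"baseline {baseline}: +{extra} signups, ${cost_each:.2f} per extra signup")
```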
Overall, I would say the ad worked well for us. We’re going to try some variations on the ad (targeting specific crawl packages, etc.) and see if that works better.
B2C ads are going to perform better than B2B ads on pretty much any ad channel. With a positive ROI, the ad was worth the expense and at the very least warrants additional experimentation.
Tags: api, programmable web, web crawling
Today, applications increasingly depend on a rich ecosystem of APIs. Thousands of different services are variously tethered together to form new software offerings and enhance existing ones. The idea of a programmable web is finally coming true.
While this is not trivial, I am nonetheless beginning to question the long-term effects of an API-centric worldview, a sort of blind faith in the almighty API, which has at best a difficult relationship with open data and big data concepts.
How do we access data today?
There are two core ways to access data today – via a publisher or via a crawler. Each has a different role.
Publishers have data and choose to make it publicly available through an API so that developers can easily design products powered by a given service.
Crawlers on the other hand are used to proactively go out and grab data by yourself — scraping web pages for whatever it is you’re looking for, data that can then be used to build products, and inform better product and marketing decisions.
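To make "scraping web pages" a bit more concrete, here is a minimal sketch of the most basic crawler step: pulling the links out of a page so you know where to go next. It uses only the Python standard library and a hard-coded HTML string instead of a live fetch. The `LinkExtractor` class and `extract_links` function are names invented for this illustration; this is not how 80legs itself is implemented.

```python
# Minimal link extraction, the building block of any crawler.
# Illustration only: names are hypothetical and no network requests are made.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
print(extract_links(sample, "http://example.com/"))
```

A real crawler adds fetching, politeness (robots.txt, rate limits), and de-duplication on top of this, but the core loop is the same: fetch a page, extract what you need, follow the links.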
There is something of a third option: data aggregators like Factual, Infochimps and Hoovers. I’m not going to treat them much as part of this post because they gain access to data like the rest of us – via APIs and/or crawlers. They facilitate the distribution of that data as part of their core business (most often using a marketplace concept or subscription), but the input mechanisms are no different.
And there is potentially even a fourth option – human curation of the kind that Factual and WolframAlpha and CrowdFlower employ to acquire new data altogether. But all of these providers offer API access to their data, so I’m still going to bucket them as such.
APIs, at least as we think of them today, have many disadvantages. And before you grab your shovels and organize a mob to come after me, please understand that I’m not calling for the discontinuation of APIs.
At 80legs, we ourselves offer a popular API, which takes a particularly hybrid approach – providing programmatic access to the data acquired via crawls.
What I really want is a natural stratification based on who is good at what, essentially. Right now, we’re asking APIs to do too much.
APIs are great for the real-time web, for example – they’re great for staying up to speed, whether that means trending search data or retweet velocity. APIs are great for enhancing functionality – whether that’s a Klout score or geolocation. APIs are great for integrating certain pieces of non-strategic infrastructure like invite codes (Prefinery) if you’re a startup in beta, or Freshbooks, if you’re an accountant. They’re also great for app-level integrations, like adding Facebook accounts to Tweetdeck, or sucking down content from Netflix.
But at a higher level, as all applications and services become more and more data-driven, it’s important to understand the differences between these different methods for extracting data, regardless of where you net out philosophically.
This is a debate that needs to take place.
Control, Control, Control
Control and flexibility are the two most important elements to look at when it comes to the difference between an API and a crawl. I also spend some time at the end of this post talking about security and privacy, because I think there are big impacts there for APIs and crawlers alike.
Cost might be a fourth facet to look at, but that’s grounds for a different post because pricing varies so widely.
Let’s start with control.
When using an API, publishers – companies like Amazon and LinkedIn, for example – control the entire process. Publishers provide you with an API account, which allows you a certain amount of calls, or requests for data per day. They also determine what kinds of content are made available, and in what context.
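In practice, that daily allowance means your own code ends up budgeting calls. Here is a minimal sketch of a client-side counter for a per-day limit. The `DailyQuota` class is hypothetical and not tied to any particular publisher's API; real publishers enforce their limits server-side and each one reports them differently.

```python
# Client-side bookkeeping for a publisher-imposed daily call limit.
# Hypothetical helper for illustration; not any specific API's mechanism.
from datetime import date


class DailyQuota:
    """Tracks API calls against a per-day limit and resets on a new day."""

    def __init__(self, limit: int):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_consume(self, today: date) -> bool:
        """Return True and count the call if budget remains for `today`."""
        if today != self.day:
            # First call of a new day: start counting from zero again.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


quota = DailyQuota(limit=2)
day_one = date(2010, 7, 1)
print([quota.try_consume(day_one) for _ in range(3)])
```

Passing the date in explicitly keeps the sketch easy to test; production code would typically use `date.today()`.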
Publishers offer an API for many reasons. It’s financially in their best interest to have products built on top of their data to increase developer loyalty and form a kind of API-dependency to their content. It’s also useful as a way to accurately measure server usage and overall engagement, even if there’s no money involved.
APIs can go down and become unavailable, they can go from free to paid, and their publishers can be acquired by larger companies that make all manner of changes. There’s a lot of uncertainty in APIs, and many devs have learned this fact the hard way. Think back to Gnip rethinking their entire business model due to the relicensing of certain APIs.
But like moths, we so often head right back to the flame.
Crawlers act very differently. They allow much more control over the data acquisition process. This has many advantages.
For starters, the format in which content is delivered can be a lifesaver if formatted properly, or prompt hours of additional work if not.
APIs supply content in one format – the format chosen by the publisher.
Say you need an XML file but the company only delivers JSON through their API. You’re either stuck or left spending hours re-formatting.
Crawlers let the choice-driven developer have his cake and eat it too. Formats are just another choice to make beforehand, instead of a hindrance.
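For a sense of what that re-formatting involves, here is a sketch of the simplest possible case: turning a small JSON payload into XML with the Python standard library. The payload and the `to_xml` helper are made up for this example, and it assumes every JSON key is already a valid XML tag name. Real API responses with attributes, namespaces, or inconsistent structure are where the hours go.

```python
# Simplest-case JSON -> XML conversion.
# Assumptions: sample payload is invented; JSON keys are valid XML tag names;
# list items are wrapped in a generic <item> element.
import json
import xml.etree.ElementTree as ET


def to_xml(tag: str, value) -> ET.Element:
    """Recursively build an XML element from a decoded JSON value."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(key, child))
    elif isinstance(value, list):
        for item in value:
            elem.append(to_xml("item", item))
    elif value is not None:
        elem.text = str(value)
    return elem


payload = json.loads('{"name": "Widget", "tags": ["a", "b"], "price": 9.5}')
xml_text = ET.tostring(to_xml("product", payload), encoding="unicode")
print(xml_text)
```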
Granted, standardization can be great in some cases – for example with sites like MySpace where each profile is customized and therefore rendered in HTML differently. MySpace APIs format the content to make it uniform, meaning that what was once difficult to work with as a developer (i.e. large discrepancies in the data), is now standardized and simple to use.
But the “one size fits all” mentality fails more often than you might think, especially once you step outside of the web’s largest sites – one size fits all rarely fits anyone well.
And it’s not just format – crawling offers much more control when it comes to time and timing, scope, and cost, too.
Flexibility and Availability
Data access choices are an important component of building any web product, especially when it comes to flexibility and availability. Specs change, needs change — heck, markets change. Especially if you’re a lean startup, out early + iterate often is a way of life.
APIs only deliver content from the publisher’s site. You’re locked into a single interface’s content sources and structure, without flexibility by definition, which can be very limiting. You’re left with acquiring stand-alone datasets to supplement your evolving needs, or mashing up with another API to fill in holes.
Now, the very best API providers are great at adapting to developers’ needs and evolving alongside them. Companies like Yolink, for whom their API is their bread and butter, are particularly responsive. But too often an API is left unattended, having been a mere box to check, instead of a strategic commitment.
Unimaginative APIs can also limit use cases unwittingly, because some of the furthest-flung (if more promising) applications just aren’t supported in the calls or code. There’s a huge difference between an API that wants to be heightened and explored, and an API whose scope, if anything, constrains original thinking.
Crawlers on the other hand aren’t specific to any one site’s data, meaning that they can access content from any number of sources and compile it in one place, mixing and matching, comparing and contrasting to your heart’s delight.
Crawls can be more open-ended and investigative as well, whereas an API is more about putting a square peg in a square hole. APIs also don’t offer competitive advantage – everyone has access to the same stuff. A clever crawl can help build a moat.
Finally, crawlers can reach far beyond the capabilities of an API. Millions of pieces of data are publicly available on the web, and only a very small percentage of it is available via an API. At a certain point it’s purely an issue of volume. Much of the web is instantly crawl-able, and the amount of data available freely on the web is growing more quickly than the number of APIs by an order of magnitude. The caveat – you just have to know where to look.
The Elephant in the Room — Security and Privacy
Let’s talk about privacy and data, because how the world evolves in this respect could have huge implications for APIs and crawlers alike.
As the recent Facebook data privacy concerns highlight, the security of people’s data is a high priority, regardless of how it may or may not be acquired or sold. Further, users expect publishers to protect their data aggressively (whether they do is another matter).
And this is a PR/perception issue as much as it is anything else.
Users worry that their data might get into the hands of people who will use it for malicious purposes, whether via an API or a crawler. I would argue that this is not always the case, because responsible crawling companies, at least, have strict licensing agreements with their clients to ensure data is used lawfully.
But, the reality is that publishers are increasingly incentivized because of public policy issues to constrain API access. And the world’s biggest crawler, Google, is starting to look evil, with the ominous question “what exactly does Google know about me?” popping up at family dinners around the country.
Some are even arguing that Facebook is bound to be federally regulated sooner or later because of its profligacy when it comes to data, and that would certainly have broad impacts.
APIs are not inherently more or less secure than crawlers, but in the current climate, especially with regards to privacy, we can expect companies large and small to make less and less data open and available (something that the linked data community has been ruing as well).
Security right now is a big X factor that is going to take some time to play out.
The nice thing about crawlers (depending on your perspective) is that they are harder to control, at least for now. But it is a reasonable thing to say that data responsibility and privacy issues are going to shape and reshape this conversation big time.
Today’s web is full of data that if kept within an API-driven paradigm suffers from less creative use, less flexibility, and less control (from a developer standpoint).
An endlessly crawl-able web was in many ways what Tim Berners-Lee and the W3C intended for the web all along. Content creators like publishers and social networks can create sites as they’ve always done, while data aggregators can access data in whatever format they like.
In fact, in an older but still applicable interview, Berners-Lee talks about why an open, linked data web is by far preferable to APIs for data access.
There is a foundational, DNA-level need to share data. Without openness, you lose the full value and impede any future innovation in the process.
APIs absolutely have benefits – but only when we are not beholden to them – when we can use them rationally, strategically, and carefully. And when data isn’t at the crux of your site, service or application.
“We have an open API” is an overused phrase, especially as APIs are not by definition open or closed.
If you need certain attributes, like real-time/speed, certain capabilities, or certain pieces of infrastructure, there are thousands of amazing APIs out there. But if your business runs on data – crawling is the only way to go.
Tags: crawl packages, open data
We’re excited to announce a new service at 80legs: Crawl Packages.
What crawl packages are:
Crawl packages are pre-configured crawls that you can access and run in just a few clicks.
For a specific website or group of websites, we’ve designed and set up an 80legs crawl, along with custom data extractors, to crawl that site and extract all the interesting information from it. These are crawls you could have set up yourself, but we’ve gone ahead and done all the work for you.
Types of crawl packages available:
We’re currently offering crawl packages for social networks, retail/shopping sites and business directories. We’ll be expanding our offerings to include other websites as well. Initial plans include crawling blogs (and their comments), semantic annotation feeds of various websites, and so on.
Results & Pricing:
Most crawl packages will cost $350 per month and produce 10 – 20 million records per month. The type of records produced depends on the crawl package. Social network packages produce publicly-available profiles, retail packages produce product listings, etc.
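If you think in terms of unit cost, that pricing comes out to somewhere between $17.50 and $35 per million records. A quick sketch of the arithmetic, assuming the typical $350 package and the 10 to 20 million record range quoted above (individual packages may differ):

```python
# Unit cost of a crawl package, using the typical figures from the post.
# Assumption: $350/month and 10-20 million records; actual packages may vary.
MONTHLY_PRICE = 350  # dollars
RECORDS_RANGE = (10_000_000, 20_000_000)


def cost_per_million(price: float, records: int) -> float:
    """Dollars paid per one million records produced."""
    if records <= 0:
        raise ValueError("records must be positive")
    return price / (records / 1_000_000)


for records in RECORDS_RANGE:
    unit_cost = cost_per_million(MONTHLY_PRICE, records)
    print(f"{records:,} records/month: ${unit_cost:.2f} per million")
```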
We realize that the availability of crawl packages will raise some concerns over what data should be crawled and shouldn’t. We only crawl publicly-available Web data. We don’t crawl private data and have no interest in that.
What we are interested in is what our users can do with Web data that is more accessible. Since our launch, we’ve seen many startups come to us asking for large amounts of Web data so that they can create additional value on top of that data. They want to do interesting things like provide new insight into how people connect with one another, create CPIs of online product inventory, and more. We want to make that possible, and crawl packages are a step in that direction.
Lately I’ve been interested in doing some odd and quirky things around the office. As I was thinking about what I could do about this, it struck me that one of the folks on our team, Jenn, is really awesome at baking. So I asked her if she’d be interested in a Cake of the Month. Each month, she’d make some funky cake that we’d all enjoy. It’s just a little thing, but something to make the work day a little more fun. Anyway, here’s what she came up with for the very first Cake of the Month!
I’d say it’s a great start to a recurring tradition.