StreamHacker Weotta be Hacking

4 Oct 2010

The Perils of Web Crawling

I've written a lot of web crawlers, and these are some of the problems I've encountered in the process. Unfortunately, most of these problems have no solution, other than quitting in disgust. Instead of looking for a way out when you're already deep in it, take this article as a warning of what to expect before you start down the perilous path of making a web spider. Now to be clear, I'm referring to deep web crawlers coded for a specific site, not a generic link harvester or image downloader. Generic crawlers probably have their own set of problems on top of everything mentioned below.

Most web pages suck

I have yet to encounter a website whose HTML didn't generate at least 3 WTFs.

The only valid measure of code quality: WTFs/minute

Seriously, I now have a deep respect for web browsers that are able to render the mess of HTML that exists out there. If only browsers were better at memory management, I might actually like them again (see unified theory of browser suckage).

Inconsistent information & navigation

Two pages on the same site, with the same URL pattern and the same kind of information, may still not be the same. One will have additional information you didn't expect, breaking all your assumptions about what HTML is where. Another may be missing essential information you expected all similar pages to have, again throwing your HTML parsing assumptions out the window.

Making things worse, the same kind of page in different sections of a site may be completely different. This generally happens on larger sites where different sections have different teams managing them. The worst is when the differences are non-obvious, so that you think the pages are the same, until you realize hours later that the underlying HTML structure is completely different, and CSS has been fooling you all along, making them look identical when they're not.
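If you want your parser to survive those surprises, it helps to treat every field as optional. Here's a rough sketch using Python's stdlib `html.parser` — the field names and the `price` class attribute are made up for illustration — that returns `None` for anything missing instead of blowing up:

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Pulls a few expected fields out of a page, tolerating their absence.
    The fields here ("title" from <h1>, "price" from class="price") are
    hypothetical examples, not from any particular site."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": None, "price": None}
        self._current = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and self.fields["title"] is None:
            self._current = "title"
        elif attrs.get("class") == "price" and self.fields["price"] is None:
            self._current = "price"

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

def extract_fields(html):
    """Parse a page and return whatever fields were found; missing ones stay None."""
    parser = FieldExtractor()
    parser.feed(html)
    return parser.fields
```

The point is that downstream code checks for `None` rather than assuming every similar-looking page has every field — which, per the above, it won't.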

90% of writing a crawler is debugging

Your crawler will break. It's not a matter of if, but when. Sometimes it breaks immediately after you write it, because it can't deal with the malformed HTML garbage you're trying to parse. Sometimes it happens randomly, weeks or months later, when the site makes a minor change to its HTML that sends your crawler into a chaotic death spiral. Whatever the case, it will happen; it may not be your fault, but you'll have to deal with it (I recommend cursing profusely). If you can't handle random sessions of angry debugging because some stranger you've never met but want to punch in the face made a change that broke your crawler, quit now.

You can't make their connection any faster

If your crawler is slow, it's probably not your fault. Chances are, your incoming connection is much faster and less congested than their outgoing connection. The internet is a series of tubes, and while their tube may be a lot larger than yours, it's filled with other people's crap, slowing down your completely legitimate web crawl to, well, a crawl. The only solution is to request more pages in parallel, which leads directly to the next problem...
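For the curious, the parallel-requests approach can be sketched in a few lines with Python's `concurrent.futures` — `crawl_parallel` and the `fetch` callable here are illustrative names, not part of any real crawler:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_parallel(urls, fetch, max_workers=8):
    """Fetch many pages concurrently, so the server's slowness overlaps
    instead of adding up. `fetch` is any callable mapping url -> page text."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # one bad page shouldn't kill the whole crawl
                results[url] = None
    return results
```

Pass in whatever fetch function you like — e.g. `lambda url: urllib.request.urlopen(url, timeout=10).read()` — and tune `max_workers` to taste (and to the target site's tolerance, which brings us to the next problem).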

Sites will ban you

Many websites don't want to be crawled, at least not by you. They can't really stop you, but they'll try. In the end, they only win if you let them, but it can be quite annoying to deal with a site that doesn't want you crawling it. There are a number of techniques for dealing with this:

  • route your crawls thru Tor, if you don't mind crawling slower than a Pentium 1 with a 2400 bps modem (if you're too young to know what that means, think Facebook or MySpace, but 100 times slower, and you're crying because you're peeling a dozen onions while you wait for the page to load)
  • use anonymous proxies, which of course are always completely reliable, and never randomly go offline for no particular reason
  • slow down your crawl to the point a one armed blind man could do it faster
  • ignore robots.txt - what right do these sites have to tell you what you can & can't crawl?!? This is America, isn't it!?!
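Sarcasm aside, if you'd rather stay on a site's good side than get banned, the standard library makes it easy to respect robots.txt and throttle yourself. A rough sketch — the function name, user agent, and delay value are all illustrative:

```python
import time
import urllib.robotparser

def polite_fetcher(robots_txt, user_agent="my-crawler", delay=1.0):
    """Return a check function that honors robots.txt rules and enforces
    a minimum delay between requests. Values here are placeholders."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    last_request = [0.0]  # mutable cell so the closure can update it

    def allowed(url):
        if not rp.can_fetch(user_agent, url):
            return False  # robots.txt says hands off
        # sleep just long enough to keep `delay` seconds between requests
        wait = delay - (time.monotonic() - last_request[0])
        if wait > 0:
            time.sleep(wait)
        last_request[0] = time.monotonic()
        return True

    return allowed
```

Slowing down and obeying robots.txt won't make your crawl fast, but it makes it a lot less likely you'll have to reach for the other tricks above.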

Conclusion

Don't write a web crawler. Seriously, it's the most annoying programming work in the world. Of course, chances are that if you're actually considering writing a crawler, it's because there's no other option. In that case, good luck, and stock up on liquor.


  • http://twitter.com/bgimpert Ben Gimpert

    I’ve had success using a text-based browser like Lynx as an intermediate crawling step. Take the awful, idiosyncratic HTML and dump it with Lynx, and then use the “formatted” plain text as input into your data-mining system proper. Alas, this does not help with sites not wanting to be spidered.

  • http://streamhacker.com/ Jacob Perkins

    Thanks for the tip Ben, seems like a good idea if you’re after the plaintext of a webpage and want to get rid of all the junk.

  • http://twitter.com/saidimu saidimu apale

    It is quite a surprise that all these years of writing web crawlers and painstakingly curating XPath (or other similar) expressions haven't led to a community resource of parsing expressions (or even better, working and updated crawlers).

    Everyone, it seems, is seriously intent on re-re-re-re-inventing the wheel. Perhaps that is why it is such a fool’s errand.

    Marco Arment of Instapaper.com [http://blog.instapaper.com/post/730281947] blogged about starting such a (mis)adventure, but I haven’t heard much about it since.

  • http://streamhacker.com/ Jacob Perkins

    That kind of resource would be nice, but I suspect a lot of people don’t want sites to know who’s crawling them & how. If the xpaths were posted online, many sites would deliberately change their html to break them. And crawling is a bit of a legal grey area, which discourages publishing xpaths and/or crawlers. If everyone adopted microformats, things could be easier, but that’ll take a while.
