I've written a lot of web crawlers, and these are some of the problems I've encountered in the process. Unfortunately, most of these problems have no solution, other than quitting in disgust. Instead of looking for a way out when you're already deep in it, take this article as a warning of what to expect before you start down the perilous path of making a web spider. Now to be clear, I'm referring to deep web crawlers coded for a specific site, not a generic link harvester or image downloader. Generic crawlers probably have their own set of problems on top of everything mentioned below.
Most web pages suck
I have yet to encounter a website whose HTML didn't generate at least 3 WTFs.
Seriously, I now have a deep respect for web browsers that are able to render the mess of HTML that exists out there. If only browsers were better at memory management, I might actually like them again (see unified theory of browser suckage).
Inconsistent information & navigation
Two pages on the same site, with the same URL pattern and the same kind of information, may still not be the same. One will have additional information that you didn't expect, breaking all your assumptions about what HTML is where. Another may be missing essential information you expected all similar pages to have, again throwing your HTML parsing assumptions out the window.
Making things worse, the same kind of page on different sections of a site may be completely different. This generally occurs on larger sites where different sections have different teams managing them. The worst is when the differences are non-obvious, so that you think they're the same, until you realize hours later that the underlying HTML structure is completely different, and CSS had been fooling you, making it look the same when it's not.
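One way to cope is to assume nothing and complain loudly when a page doesn't match. Here's a minimal sketch using only the Python standard library. The field names (title, price, rating), the idea that fields are marked by a class attribute, and the MissingField exception are all made up for illustration; a real site will need its own rules, and probably a more forgiving parser.

```python
from html.parser import HTMLParser


class MissingField(Exception):
    """Raised when a page lacks a field we assumed every page would have."""


class FieldCollector(HTMLParser):
    """Collect text from elements whose class attribute names a wanted field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}      # field name -> list of text chunks
        self._current = None  # field being read, until the next closing tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current = name
                break

    def handle_endtag(self, tag):
        # Crude on purpose: any closing tag ends the current field.
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.fields.setdefault(self._current, []).append(text)


def extract(html, required, optional=()):
    """Return the fields found; raise MissingField if a required one is absent."""
    collector = FieldCollector(set(required) | set(optional))
    collector.feed(html)
    collector.close()
    fields = {name: " ".join(chunks) for name, chunks in collector.fields.items()}
    missing = [name for name in required if not fields.get(name)]
    if missing:
        raise MissingField("page is missing: " + ", ".join(missing))
    return fields


page = '<div><span class="title">Widget</span><span class="price">$5</span></div>'
print(extract(page, required=["title", "price"], optional=["rating"]))
```

Unexpected extra elements are ignored, optional fields are simply absent from the result, and a missing required field stops you right away instead of hours later.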
90% of writing a crawler is debugging
Your crawler will break. It's not a matter of if but when. Sometimes it breaks immediately after you write it, because it can't deal with the malformed HTML garbage that you're trying to parse. Sometimes it happens randomly, weeks or months later, when the site makes a minor change to their HTML that sends your crawler into a chaotic death spiral. Whatever the case, it'll happen, it may not be your fault, but you'll have to deal with it (I recommend cursing profusely). If you can't deal with random sessions of angry debugging because some stranger you don't know but want to punch in the face made a change that broke your crawler, quit now.
You can't make their connection any faster
If your crawler is slow, it's probably not your fault. Chances are, your incoming connection is much faster and less congested than their outgoing connection. The internet is a series of tubes, and while their tube may be a lot larger than yours, it's filled with other people's crap, slowing down your completely legitimate web crawl to, well, a crawl. The only solution is to request more pages in parallel, which leads directly to the next problem...
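For what it's worth, requesting pages in parallel can look something like the sketch below, using a thread pool from the Python standard library. The function names, the worker count of 8, and the 30 second timeout are arbitrary choices of mine, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one page; a slow server only blocks the worker that asked."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_parallel(urls, fetch_page=fetch, workers=8):
    """Fetch urls concurrently and return (pages, errors), both keyed by URL."""
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_page, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as exc:
                # One dead page should not take the whole crawl down with it.
                errors[url] = exc
    return pages, errors
```

Passing in fetch_page makes it easy to swap the downloader for a fake one while you debug, which you will be doing a lot.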
Sites will ban you
Many web sites don't want to be crawled, at least not by you. They can't really stop you, but they'll try. In the end, they only win if you let them, but it can be quite annoying to deal with a site that doesn't want you to crawl it. There are a number of techniques for dealing with this:
- route your crawls through Tor, if you don't mind crawling slower than a Pentium 1 with a 2.4 kbps modem (if you're too young to know what that means, think Facebook or MySpace, but 100 times slower, and you're crying because you're peeling a dozen onions while you wait for the page to load)
- use anonymous proxies, which of course are always completely reliable, and never randomly go offline for no particular reason
- slow down your crawl to the point a one armed blind man could do it faster
- ignore robots.txt - what right do these sites have to tell you what you can & can't crawl?!? This is America, isn't it!?!
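If you go the slow-down route, a per-host delay is about the simplest thing that works. Below is a rough sketch in Python; the Throttle class and the allowed helper are hypothetical names of mine, and the delay value is whatever you can stomach. The robots.txt check uses the standard library's urllib.robotparser, in case you decide to find out what a site actually asks for before ignoring it.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class Throttle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self._last = {}  # host -> time of the previous request

    def wait(self, url):
        """Block until at least `delay` seconds have passed for this URL's host."""
        host = urlsplit(url).netloc
        last = self._last.get(host)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self._last[host] = self.clock()


def allowed(robots_lines, user_agent, url):
    """Report whether the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

Call throttle.wait(url) right before each request. Different hosts don't slow each other down, so a multi-site crawl only crawls as slowly as it has to.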
Don't write a web crawler. Seriously, it's the most annoying programming work in the world. Of course, chances are that if you're actually considering writing a crawler, it's because there's no other option. In that case, good luck, and stock up on liquor.