Tag Archives: html

The Perils of Web Crawling

I’ve written a lot of web crawlers, and these are some of the problems I’ve encountered in the process. Unfortunately, most of these problems have no solution, other than quitting in disgust. Instead of looking for a way out when you’re already deep in it, take this article as a warning of what to expect before you start down the perilous path of making a web spider. Now to be clear, I’m referring to deep web crawlers coded for a specific site, not a generic link harvester or image downloader. Generic crawlers probably have their own set of problems on top of everything mentioned below.

Most web pages suck

I have yet to encounter a website whose HTML didn’t generate at least 3 WTFs.

The only valid measure of code quality: WTFs/minute

Seriously, I now have a deep respect for web browsers that are able to render the mess of HTML that exists out there. If only browsers were better at memory management, I might actually like them again (see unified theory of browser suckage).

Inconsistent information & navigation

Two pages on the same site, with the same URL pattern and the same kind of information, may still not be the same. One will have additional information that you didn’t expect, breaking all your assumptions about what HTML is where. Another may be missing essential information you expected all similar pages to have, again throwing your HTML parsing assumptions out the window.

Making things worse, the same kind of page on different sections of a site may be completely different. This generally occurs on larger sites where different sections have different teams managing them. The worst is when the differences are non-obvious, so that you think they’re the same, until you realize hours later that the underlying HTML structure is completely different, and CSS had been fooling you, making it look the same when it’s not.

90% of writing a crawler is debugging

Your crawler will break. It’s not a matter of if but when. Sometimes it breaks immediately after you write it, because it can’t deal with the malformed HTML garbage that you’re trying to parse. Sometimes it happens randomly, weeks or months later, when the site makes a minor change to their HTML that sends your crawler into a chaotic death spiral. Whatever the case, it’ll happen; it may not be your fault, but you’ll have to deal with it (I recommend cursing profusely). If you can’t deal with random sessions of angry debugging because some stranger you don’t know but want to punch in the face made a change that broke your crawler, quit now.

You can’t make their connection any faster

If your crawler is slow, it’s probably not your fault. Chances are, your incoming connection is much faster and less congested than their outgoing connection. The internet is a series of tubes, and while their tube may be a lot larger than yours, it’s filled with other people’s crap, slowing down your completely legitimate web crawl to, well, a crawl. The only solution is to request more pages in parallel, which leads directly to the next problem…

Sites will ban you

Many web sites don’t want to be crawled, at least not by you. They can’t really stop you, but they’ll try. In the end, they only win if you let them, but it can be quite annoying to deal with a site that doesn’t want you to crawl them. There are a number of techniques for dealing with this:

  • route your crawls thru TOR, if you don’t mind crawling slower than a pentium 1 with a 2400 bps modem (if you’re too young to know what that means, think facebook or myspace, but 100 times slower, and you’re crying because you’re peeling a dozen onions while you wait for the page to load)
  • use anonymous proxies, which of course are always completely reliable, and never randomly go offline for no particular reason
  • slow down your crawl to the point a one-armed blind man could do it faster
  • ignore robots.txt – what right do these sites have to tell you what you can & can’t crawl?!? This is America, isn’t it!?!

Conclusion

Don’t write a web crawler. Seriously, it’s the most annoying programming work in the world. Of course, chances are that if you’re actually considering writing a crawler, it’s because there’s no other option. In that case, good luck, and stock up on liquor.

Design Patterns in Javascript – Model-View-Controller

Model-View-Controller, or MVC for short, is a software architecture pattern for user interfaces that creates a distinction between the display (or view), the data (or model), and the interaction (or controller). In this article, I’m going to focus specifically on how MVC applies to the client side of web applications. What I mean is that MVC in a web based UI translates to:

  • Model – the HTML, which holds the content
  • View – the CSS, which controls how that content is presented
  • Controller – the Javascript, which drives the interaction

This separation of concerns means that as long as the model stays consistent, you can

  1. Create multiple views for the same model
  2. Develop the controller and views independently

For a great example of 1, check out CSS Zen Garden. Here, the same HTML (the model) can be seen with many different views.

Now check out the jQuery UI demos for a clear example of both 1 and 2. Again, the same HTML has many different views with the various CSS themes, and you can also see jQuery UI controller code in action. The Javascript manipulates the model, affecting what you see and how you see it, independently of which theme (or view) you choose.

What this all means is that you can create semantically correct HTML optimized for search engines, CSS so your page looks good for humans, and Javascript to make your page interactive. In other words, content can be separated from presentation, which can be separated from the interaction behavior. The CSS and Javascript can be developed (mostly) independently and only be loosely coupled to the HTML.

Taking this a step further, it’s possible to create generic views and controllers using CSS and Javascript that work on a small generic model. jQuery ThickBox is an example of this, since it can be used on any page where you have <a class='thickbox' ...>...</a>. As long as your HTML supports the generic model, you can reuse the CSS and Javascript across multiple pages.