Interested in learning how Google, Bing, or Yahoo work? Wondering what it takes to crawl the web, and what a simple web crawler looks like? Here's a simple web crawler in under 50 lines of Python 3 code! (The full source with comments is at the bottom of this article.)
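Before getting to the real thing, here is a rough sketch of the basic idea behind any crawler: visit a page, collect its links, and queue them up for later visits. This is only an illustration, not the crawler from this article. The names `FAKE_WEB`, `AnchorCollector`, and `crawl` are made up for the example, and the "web" is a small dictionary of HTML strings so the sketch runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" (an assumption for this sketch) so that
# nothing is fetched over the network.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": "<p>no links here</p>",
}


class AnchorCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links like "/a" into absolute URLs
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, max_pages=10):
    """Visit pages breadth-first, returning URLs in the order visited."""
    to_visit = deque([start_url])
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited or url not in FAKE_WEB:
            continue  # skip repeats and pages we cannot "fetch"
        visited.append(url)
        collector = AnchorCollector(url)
        collector.feed(FAKE_WEB[url])
        to_visit.extend(collector.links)
    return visited


if __name__ == "__main__":
    for page in crawl("http://example.test/"):
        print(page)
```

The `max_pages` limit is there because a real crawl could otherwise go on more or less forever. The sketch reads pages from a dictionary; a crawler on the real web would fetch each page over HTTP instead.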
Let's look at the code in more detail!
The following code should work on Python 3.x. It was written and tested with Python 3.2.2 in September 2011. Go ahead and copy and paste it into your Python IDE, then run it or modify it.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse
# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser, which is why it is passed into the definition