
How to make a web crawler in Python (under 50 lines of code)?

Interested in learning how Google, Bing, or Yahoo work? Wondering what it takes to crawl the web, and what a simple web crawler looks like? Here's a simple web crawler in under 50 lines of Python (version 3) code! (The full source with comments is at the bottom of this article.)

Let's look at the code in more detail!

The following code should be fully functional for Python 3.x. It was written and tested with Python 3.2.2 in September 2011. Go ahead and copy+paste this into your Python IDE and run it or modify it.

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser, which is why HTMLParser is passed into the
# class definition below.
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has,
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.fixithere.net is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.fixithere.net/sky-customer-service/
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links = self.links + [newUrl]
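
Before moving on, it may help to see what parse.urljoin does on its own. Here is a tiny, standalone sketch; the URLs are just illustrative examples, not anything the crawler depends on:

from urllib import parse

base = "http://www.fixithere.net/"
# A relative link is resolved against the base URL of the page it was found on:
print(parse.urljoin(base, "sky-customer-service/"))
# -> http://www.fixithere.net/sky-customer-service/
# An already-absolute link is left untouched:
print(parse.urljoin(base, "http://www.example.com/page.html"))
# -> http://www.example.com/page.html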

    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or PDFs, for example).
        # Many servers send 'text/html; charset=UTF-8', so we check for
        # 'text/html' rather than requiring an exact match.
        if 'text/html' in (response.getheader('Content-Type') or ''):
            htmlBytes = response.read()
            # Note that feed() handles strings well, but not bytes
            # (a change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []

# And finally here is our spider. It takes in a URL, a word to find,
# and the number of pages to search through before giving up.
def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string.
    # Our getLinks function returns the web page
    # (this is useful for searching for the word)
    # and a list of links from that web page
    # (this is useful for deciding where to go next).
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited + 1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
            # Add the links found on this page to the end of our collection
            # of pages to visit:
            pagesToVisit = pagesToVisit + links
            print(" **Success!**")
        except Exception:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")

Magic!