Python Web Crawling for Emails
Introduction and Legal
In this post I'll show you how to create a Python web crawler. This post goes along with another blog post of mine (@scosta/921e3afc9018) if you're interested in finding out why I'm posting this. You should also read and understand the laws surrounding web crawling and not use this script to do anything that might be construed as illegal.
Okay, now that that's out of the way, let's get started!
Step 1: The Framework
I always start my scripts with a docstring, the class definition(s) and then stub out the functions.
'''
A web crawler for extracting email addresses from web pages.

Takes a string of URLs and requests each page, checks to see if we've
found any emails and prints each email it finds.
'''


class Crawler(object):

    def __init__(self, urls):
        '''
        @urls: a string containing the (comma separated) URLs to crawl.
        '''
        self.urls = urls.split(',')

    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then parse it and
        print the emails we find.
        '''
        pass

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        pass

    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        pass
Some important things to note:
* The request and process functions are decorated with Python's @staticmethod decorator as they don't need access to anything that self provides.
* We're splitting the URLs we pass in on ',' so that we can pass in multiple URLs from the command line (there's a quick illustration after this list).
* Module and function docstrings are PEP8 compliant.
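Just to make the comma splitting concrete, here's what it does (the URLs are made-up placeholders):

    # Quick illustration of the comma splitting used in __init__:
    urls = 'https://www.example.com/,https://www.example.org/'
    print urls.split(',')
    # ['https://www.example.com/', 'https://www.example.org/']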
Step 2: crawl
CoderWall forces scrolling in code blocks after a short number of lines, so I'm going to do this function by function. I may have to hop on their Assembly project and request (or maybe build) that feature!
With this function, we're just going to pass each URL off to request and then process the data with process.
    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then parse it and
        print the emails we find.
        '''
        for url in self.urls:
            data = self.request(url)
            for email in self.process(data):
                print email
It's not very good form to print the emails inside this function - it would be better to return (or yield) them to main and then let it decide what to do with them, but we'll leave it for now. Maybe I'll do a more "functional" V2 of this post later!
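If you're curious, here's a minimal sketch of what that generator-style crawl could look like (my own variation; the rest of this post sticks with print):

    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then yield the
        emails we find instead of printing them.
        '''
        for url in self.urls:
            data = self.request(url)
            for email in self.process(data):
                yield email

main would then loop over crawler.crawl() and decide whether to print, save or dedupe the results.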
Step 3: request
This function will request the page and then return the page's body. It uses the "urllib2" library so be sure to add that to the start of your script (take a look at the final version of the script at the bottom of this post to see what I mean).
    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        response = urllib2.urlopen(url)
        return response.read()
It's a pretty simple function: it just requests the page, reads the response and returns that to the crawl function.
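One thing to keep in mind: urlopen raises an exception (for example urllib2.URLError) if the page can't be fetched. If you want the crawler to keep going rather than crash, a defensive variation might look like this (the error handling is my addition, not part of the original script):

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents, or an empty string if
        the request fails.
        '''
        try:
            response = urllib2.urlopen(url)
            return response.read()
        except urllib2.URLError:
            return ''

Returning an empty string means crawl and process keep working unchanged - findall on an empty string just finds nothing.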
Step 4: process
This function is responsible for searching the page data for email addresses. It does this using a regular expression (or regex). This is a super simple implementation that will miss a lot of emails (due to the regex), but I wanted to keep it straightforward so that readers who have never used regexes can understand it as well.
    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        for email in re.findall(r'(\w+@\w+\.com)', data):
            yield email
This function requires the "re" library so be sure to add that with "urllib2".
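If you want to catch more addresses (other top-level domains, dots or plus signs before the '@'), you could swap in a slightly broader pattern. This is my own tweak and it's still far from covering every valid email:

    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        # Broader (but still imperfect) pattern: allows dots, dashes and
        # plus signs in the local part and any 2+ letter TLD, not just '.com'.
        for email in re.findall(r'[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}', data):
            yield email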
Step 5: main
I really like using Python's "argparse" module for my scripts, regardless of how simple they are. We'll use it here to allow the user to enter a URL or URLs from the command line, and then we'll create a new Crawler and see if we can find some emails.
def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument(
        '--urls', dest='urls', required=True,
        help='A comma separated string of URLs.')
    parsed_args = argparser.parse_args()
    crawler = Crawler(parsed_args.urls)
    crawler.crawl()


if __name__ == '__main__':
    sys.exit(main())
This new code requires the "argparse" and "sys" libraries so be sure to add them with the other imports at the top of your script!
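If you'd rather skip the command line (say, from another script or an interactive session), you can also use the class directly; a quick example, with placeholder URLs:

    # Using the Crawler directly, assuming the class above is defined or
    # imported; the URLs here are just placeholders.
    crawler = Crawler('https://www.example.com/,https://www.example.org/')
    crawler.crawl()  # prints any emails it finds on those pages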
Wrapping Up
Okay, here's your script!
'''
A web crawler for extracting email addresses from web pages.

Takes a string of URLs and requests each page, checks to see if we've
found any emails and prints each email it finds.
'''

import argparse
import re
import sys
import urllib2


class Crawler(object):

    def __init__(self, urls):
        '''
        @urls: a string containing the (comma separated) URLs to crawl.
        '''
        self.urls = urls.split(',')

    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then parse it and
        print the emails we find.
        '''
        for url in self.urls:
            data = self.request(url)
            for email in self.process(data):
                print email

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        response = urllib2.urlopen(url)
        return response.read()

    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        for email in re.findall(r'(\w+@\w+\.com)', data):
            yield email


def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument(
        '--urls', dest='urls', required=True,
        help='A comma separated string of URLs.')
    parsed_args = argparser.parse_args()
    crawler = Crawler(parsed_args.urls)
    crawler.crawl()


if __name__ == '__main__':
    sys.exit(main())
There you have it! Test it from the command line using python crawler.py --urls https://www.example.com/.
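To crawl more than one page at once, pass a comma separated list (no spaces between the URLs); both URLs below are just placeholders:

    python crawler.py --urls https://www.example.com/,https://www.example.org/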
Note that the "http://" or "https://" prefix is required because request will fail without it. I'll leave it up to you to fix that bug!
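If you want a head start, one simple option (my own tweak, not part of the original script) is to prepend a scheme when it's missing:

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        # Prepend a scheme if the caller left it off so urlopen doesn't fail.
        if not url.startswith(('http://', 'https://')):
            url = 'http://' + url
        response = urllib2.urlopen(url)
        return response.read()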
Let me know what you think and what you'd like to see next in the comments!
Written by Saul Costa