Python Web Crawling for Emails
Introduction and Legal
In this post I'll show you how to create a Python web crawler. This post goes along with another blog post of mine (@scosta/921e3afc9018) if you're interested in finding out why I'm posting this. You should also read and understand the laws surrounding web crawling and not use this script to do anything that might be construed as illegal.
Okay, now that that's out of the way, let's get started!
Step 1: The Framework
I always start my scripts with a docstring, the class definition(s) and then stub out the functions.
'''
A web crawler for extracting email addresses from web pages.

Takes a string of URLs and requests each page, checks to see if we've
found any emails and prints each email it finds.
'''


class Crawler(object):

    def __init__(self, urls):
        '''
        @urls: a string containing the (comma separated) URLs to crawl.
        '''
        self.urls = urls.split(',')

    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then parse it and
        print the emails we find.
        '''
        pass

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        pass

    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        pass
Some important things to note:
* The request and process functions are decorated with Python's @staticmethod decorator as they don't need access to anything that self provides.
* We're splitting the URLs we pass in on ',' so that we can pass in multiple URLs from the command line (there's a quick illustration after this list).
* Module and function docstrings are PEP8 compliant.
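Just to make the comma splitting concrete, here's what it does (the URLs are made-up placeholders):

    # Quick illustration of the comma splitting used in __init__:
    urls = 'https://www.example.com/,https://www.example.org/'
    print urls.split(',')
    # ['https://www.example.com/', 'https://www.example.org/']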
Step 2: crawl
CoderWall forces scrolling in code blocks after a short number of lines, so I'm going to do this function by function. I may have to hop on their Assembly project and request (or maybe build) that feature!
With this function, we're just going to pass each URL off to request and then process the data with process.
    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then parse it and
        print the emails we find.
        '''
        for url in self.urls:
            data = self.request(url)
            for email in self.process(data):
                print email
It's not very good form to print the emails inside this function - it would be better to return (or yield) them to main and then let it decide what to do with them, but we'll leave it for now. Maybe I'll do a more "functional" V2 of this post later!
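If you're curious, here's a minimal sketch of what that generator-style crawl could look like (my own variation; the rest of this post sticks with print):

    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then yield the
        emails we find instead of printing them.
        '''
        for url in self.urls:
            data = self.request(url)
            for email in self.process(data):
                yield email

main would then loop over crawler.crawl() and decide whether to print, save or dedupe the results.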
Step 3: request
This function will request the page and then return the page's body. It uses the "urllib2" library so be sure to add that to the start of your script (take a look at the final version of the script at the bottom of this post to see what I mean).
    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        response = urllib2.urlopen(url)
        return response.read()
It's a pretty simple function: it just requests the page, reads the response and returns that to the crawl function.
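One thing to keep in mind: urlopen raises an exception (for example urllib2.URLError) if the page can't be fetched. If you want the crawler to keep going rather than crash, a defensive variation might look like this (the error handling is my addition, not part of the original script):

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents, or an empty string if
        the request fails.
        '''
        try:
            response = urllib2.urlopen(url)
            return response.read()
        except urllib2.URLError:
            return ''

Returning an empty string means crawl and process keep working unchanged - findall on an empty string just finds nothing.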
Step 4: process
This function is responsible for searching the page data for email addresses. It does this using a regular expression (or regex). This is a super simple implementation that will miss a lot of emails (due to the regex), but I wanted to keep it straightforward so that readers who have never used regexes can understand it as well.
    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        for email in re.findall(r'(\w+@\w+\.com)', data):
            yield email
This function requires the "re" library so be sure to add that with "urllib2".
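If you want to catch more addresses (other top-level domains, dots or plus signs before the '@'), you could swap in a slightly broader pattern. This is my own tweak and it's still far from covering every valid email:

    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        # Broader (but still imperfect) pattern: allows dots, dashes and
        # plus signs in the local part and any 2+ letter TLD, not just '.com'.
        for email in re.findall(r'[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}', data):
            yield email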
Step 5: main
I really like using Python's "argparse" module for my scripts, regardless of how simple they are. We'll use it here to allow the user to enter a URL or URLs from the command line, and then we'll create a new Crawler and see if we can find some emails.
def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument(
        '--urls', dest='urls', required=True,
        help='A comma separated string of URLs.')
    parsed_args = argparser.parse_args()
    crawler = Crawler(parsed_args.urls)
    crawler.crawl()


if __name__ == '__main__':
    sys.exit(main())
This new code requires the "argparse" and "sys" libraries so be sure to add them with the other imports at the top of your script!
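If you'd rather skip the command line (say, from another script or an interactive session), you can also use the class directly; a quick example, with placeholder URLs:

    # Using the Crawler directly, assuming the class above is defined or
    # imported; the URLs here are just placeholders.
    crawler = Crawler('https://www.example.com/,https://www.example.org/')
    crawler.crawl()  # prints any emails it finds on those pages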
Wrapping Up
Okay, here's your script!
'''
A web crawler for extracting email addresses from web pages.

Takes a string of URLs and requests each page, checks to see if we've
found any emails and prints each email it finds.
'''

import argparse
import re
import sys
import urllib2


class Crawler(object):

    def __init__(self, urls):
        '''
        @urls: a string containing the (comma separated) URLs to crawl.
        '''
        self.urls = urls.split(',')

    def crawl(self):
        '''
        Iterate the list of URLs and request each page, then parse it and
        print the emails we find.
        '''
        for url in self.urls:
            data = self.request(url)
            for email in self.process(data):
                print email

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        response = urllib2.urlopen(url)
        return response.read()

    @staticmethod
    def process(data):
        '''
        Process @data and yield the emails we find in it.
        '''
        for email in re.findall(r'(\w+@\w+\.com)', data):
            yield email


def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument(
        '--urls', dest='urls', required=True,
        help='A comma separated string of URLs.')
    parsed_args = argparser.parse_args()
    crawler = Crawler(parsed_args.urls)
    crawler.crawl()


if __name__ == '__main__':
    sys.exit(main())
There you have it! Test it from the command line using python crawler.py --urls https://www.example.com/.
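To crawl more than one page at once, pass a comma separated list (no spaces between the URLs); both URLs below are just placeholders:

    python crawler.py --urls https://www.example.com/,https://www.example.org/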
Note that the "http://" or "https://" prefix is required because request will fail without it. I'll leave it up to you to fix that bug!
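If you want a head start, one simple option (my own tweak, not part of the original script) is to prepend a scheme when it's missing:

    @staticmethod
    def request(url):
        '''
        Request @url and return the page contents.
        '''
        # Prepend a scheme if the caller left it off so urlopen doesn't fail.
        if not url.startswith(('http://', 'https://')):
            url = 'http://' + url
        response = urllib2.urlopen(url)
        return response.read()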
Let me know what you think and what you'd like to see next in the comments!
Written by Saul Costa