Last Updated: September 09, 2019 · 9.275K · dimitative

Crawling all URIs with wget and grep

Sooner or later you'll reach the point where you want a list of all URIs of your website, e.g. for creating a sitemap for search engine optimization, for validation purposes (W3C Validator), or just to get a quick overview to tidy up your site. For all these cases, this simple little one-liner using wget and grep may be a great helper:

wget --no-verbose --recursive --spider --force-html --level=DEPTH_LEVEL --no-directories --reject=jpg,jpeg,png,gif YOUR_DOMAIN 2>&1 | sort | uniq | grep -oe 'www[^ ]*'

This results in a list of all URIs up to the DEPTH_LEVEL you set (e.g. 5); images are rejected and HTML files are crawled. You can save the output into a single file by appending > result.txt to the command. You can also modify the matching pattern, for example replacing www with http://, to get more suitable results.
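For instance, a minimal variant (using example.com as a placeholder domain and a depth of 5) that matches on http:// instead of www and writes the result to a file could look like this:

wget --no-verbose --recursive --spider --force-html --level=5 --no-directories --reject=jpg,jpeg,png,gif example.com 2>&1 | sort | uniq | grep -oe 'http://[^ ]*' > result.txt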

The command does not save any data or content from the website; it simply 'spiders' the structure and does not create any directories.

Take a look here for building regular expressions: http://rubular.com

To put the URIs to use, you could combine this with a local installation of the W3C Validator API: http://validator.w3.org/docs/api.html
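As a rough sketch, assuming a local validator instance answering under http://localhost/w3c-validator/ (adjust the path to your installation) and a result.txt produced as above, you could feed each URI to the validator's check endpoint like this:

# sketch only: the localhost path and result.txt are assumptions
while read uri; do
  curl -s -G --data-urlencode "uri=$uri" "http://localhost/w3c-validator/check?output=soap12"
done < result.txt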

4 Responses

What if I want to get all URLs from mydomain.com, but only URLs starting with links_, for example mydomain.com/links_sony.php?

over 1 year ago

Thank you for your comment. Such filtering can be done with a regular expression in the -oe option at the end. Have you tried that?
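For example, a sketch restricting the match to that prefix (with mydomain.com taken from your question as a placeholder) could look like this:

wget --no-verbose --recursive --spider --force-html --level=DEPTH_LEVEL --no-directories --reject=jpg,jpeg,png,gif mydomain.com 2>&1 | sort | uniq | grep -oe 'mydomain\.com/links_[^ ]*'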

over 1 year ago

OK, I solved it... thanks.
I'd just like to split result.txt every 20 records... is that possible?

over 1 year ago

The command gives you a complete list of URIs. You could post-process the list to split it into chunks of the size you need; I am sure you will find suitable tools or libraries for that. Please let me know if you manage it.
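As a sketch, the standard split utility can already do this (the chunk_ prefix is just an arbitrary choice; it produces files chunk_aa, chunk_ab, ... with 20 lines each):

split -l 20 result.txt chunk_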

over 1 year ago