Crawling all URIs with wget and grep
After a while you'll reach the point where you want a list of all URIs on your website, for example to create a sitemap for search engine optimization, to compile a list for validation purposes (W3C Validator), or simply to get a quick overview before tidying up the site. For all these cases, this simple little script using wget and grep can be a great helper:
wget --no-verbose --recursive --spider --force-html --level=DEPTH_LEVEL --no-directories --reject=jpg,jpeg,png,gif YOUR_DOMAIN 2>&1 | grep -oe 'www[^ ]*' | sort | uniq
The result is a list of all URIs down to the DEPTH_LEVEL you set (e.g. 5); images are skipped and HTML files are crawled. You can save the output to a file by appending > result.txt to the command. You can also modify the matching pattern, for example replacing www with http://, to get more suitable results.
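For example, a variation of the command (using example.com as a placeholder domain and a depth of 5) that matches full http:// URIs and writes the result straight to a file could look like this:

wget --no-verbose --recursive --spider --force-html --level=5 --no-directories --reject=jpg,jpeg,png,gif example.com 2>&1 | grep -oe 'http://[^ ]*' | sort | uniq > result.txt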
The script does not save any data or content from the website. It simply 'spiders' the structure and does not create any directories.
Take a look here for building regular expressions: http://rubular.com
One possible way to use the URIs is to feed them to a local installation of the W3C Validator via its API: http://validator.w3.org/docs/api.html
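As a rough sketch of that combination (assuming result.txt contains one full URI per line, and substituting the host of your own validator installation for validator.w3.org), each URI could be sent to the validator's check endpoint with curl:

while read -r uri; do
  # -G turns the --data-urlencode parameters into a GET query string
  curl -s -G "http://validator.w3.org/check" --data-urlencode "uri=$uri" --data-urlencode "output=soap12" -o "report_$(echo "$uri" | tr '/:' '__').xml"
  sleep 1   # be polite to the validator
done < result.txt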
Written by Dimitri
4 Responses
If I want to get all URLs from mydomain.com, but only URLs that start with links_,
for example mydomain.com/links_sony.php?
Thank you for your comment. You can do that with a regular expression in the grep pattern (the -oe option). Have you tried that?
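For example, something along these lines (mydomain.com and the depth are placeholders) should keep only the links_ pages:

wget --no-verbose --recursive --spider --force-html --level=5 --no-directories --reject=jpg,jpeg,png,gif mydomain.com 2>&1 | grep -oe 'mydomain\.com/links_[^ ]*' | sort | uniq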
OK, I solved it... thanks.
One more thing: I'd like to split result.txt every 20 records. Is that possible?
The command only gives you the complete list of URIs; splitting it into chunks is a matter of post-processing. There are standard tools and libraries for that. Please let me know if you manage it.
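For example, the standard split utility can cut the saved list into pieces of 20 lines each:

split -l 20 result.txt result_part_

This produces files named result_part_aa, result_part_ab, and so on, each holding at most 20 URIs.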