Crawling all URIs with wget and grep
After a while you'll reach the point where you want a list of all URIs on your website, for example to create a sitemap for search engine optimization, to compile a list for validation purposes (W3C Validator), or simply to get a quick overview before tidying up the site. For all these cases, this simple little script using wget and grep can be a great helper:
wget --no-verbose --recursive --spider --force-html --level=DEPTH_LEVEL --no-directories --reject=jpg,jpeg,png,gif YOUR_DOMAIN 2>&1 | grep -oe 'www[^ ]*' | sort | uniq
The result is a list of all URIs down to the DEPTH_LEVEL you set (e.g. 5); images are skipped and HTML files are crawled. You can save the output to a file by appending > result.txt to the command. You can also modify the matching pattern, for example replacing www with http://, to get more suitable results.
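For example, a variation of the command (using example.com as a placeholder domain and a depth of 5) that matches full http:// URIs and writes the result straight to a file could look like this:

wget --no-verbose --recursive --spider --force-html --level=5 --no-directories --reject=jpg,jpeg,png,gif example.com 2>&1 | grep -oe 'http://[^ ]*' | sort | uniq > result.txt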
The script does not save any data or content from the website. It simply 'spiders' the structure and does not create any directories.
Take a look here for building regular expressions: http://rubular.com
One possible way to use the URIs is to feed them to a local installation of the W3C Validator via its API: http://validator.w3.org/docs/api.html
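As a rough sketch of that combination (assuming result.txt contains one full URI per line, and substituting the host of your own validator installation for validator.w3.org), each URI could be sent to the validator's check endpoint with curl:

while read -r uri; do
  # -G turns the --data-urlencode parameters into a GET query string
  curl -s -G "http://validator.w3.org/check" --data-urlencode "uri=$uri" --data-urlencode "output=soap12" -o "report_$(echo "$uri" | tr '/:' '__').xml"
  sleep 1   # be polite to the validator
done < result.txt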
Written by Dimitri
4 Responses
If I want to get all URLs from mydomain.com, but only URLs that start with links_,
for example mydomain.com/links_sony.php?
Thank you for your comment. You can do that with a regular expression in the grep pattern (the -oe option). Have you tried that?
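For example, something along these lines (mydomain.com and the depth are placeholders) should keep only the links_ pages:

wget --no-verbose --recursive --spider --force-html --level=5 --no-directories --reject=jpg,jpeg,png,gif mydomain.com 2>&1 | grep -oe 'mydomain\.com/links_[^ ]*' | sort | uniq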
OK, I solved it... thanks.
One more thing: I'd like to split result.txt every 20 records. Is that possible?
The command only gives you the complete list of URIs; splitting it into chunks is a matter of post-processing. There are standard tools and libraries for that. Please let me know if you manage it.
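For example, the standard split utility can cut the saved list into pieces of 20 lines each:

split -l 20 result.txt result_part_

This produces files named result_part_aa, result_part_ab, and so on, each holding at most 20 URIs.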