Last Updated: February 25, 2016 · creaktive

Handle sitemap.xml in CLI

I don't always look inside a domain's sitemap.xml, but when I do, I have to download all the listed URLs in a batch.

Opening the XML in the browser window and copying/pasting links one by one is simply unacceptable. Most of the time, this will suffice to quickly check the contents of a sitemap file while in a terminal session:

curl -sL http://sysd.org/sitemap.xml | grep -w loc

However, this approach relies on prettified XML (line breaks between URL nodes). And even then, the URLs stay wrapped in <loc>...</loc> tags, so the output can't be piped to curl (or wget) yet.
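One could keep wrestling with regular expressions: since grep -o emits each match on its own line, a sed pass can strip the tags afterwards. A fragile sketch, regex being no XML parser:

curl -sL http://sysd.org/sitemap.xml | grep -o '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'

This no longer depends on line breaks, but it will miss CDATA-wrapped URLs and leave entities like &amp; unresolved. A real parser is the saner tool.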

Enter Mojolicious, self-described as "a next generation web framework for the Perl programming language". It's perfect for writing web-enabled one-liners.
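If it isn't on your system yet, it installs with a single command (assuming you use cpanminus; plain cpan Mojolicious works too):

cpanm Mojolicious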

So, this is how I quickly handle sitemap.xml today:

perl -Mojo -le 'g($ARGV[0])->dom->find("loc")->each(sub{print shift->text})' http://sysd.org/sitemap.xml

This one-liner parses sitemap.xml as a DOM tree and prints one URL per line. You could pipe it to vim - to review the link list manually, or straight to wget -i - to fetch everything. With minor tweaks, it can handle sitemap indexes as well.
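For instance, fetching everything listed becomes a single pipeline (wget flags beyond -i are up to taste):

perl -Mojo -le 'g($ARGV[0])->dom->find("loc")->each(sub{print shift->text})' http://sysd.org/sitemap.xml | wget -i -

As for the index tweak: a sitemap index is just another XML file whose <loc> nodes point at child sitemaps rather than pages. A minimal sketch, assuming a single level of nesting (the sitemap_index.xml URL here is made up for illustration):

perl -Mojo -le 'for my $s (g($ARGV[0])->dom->find("sitemap > loc")->map("text")->each) { print for g($s)->dom->find("url > loc")->map("text")->each }' http://sysd.org/sitemap_index.xml

The sitemap > loc selector picks only the index entries, while url > loc picks the page URLs inside each child sitemap, so the same approach covers both layers.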