Handle sitemap.xml in CLI
Opening the XML in a browser window and copying/pasting links one by one is simply unacceptable. Most of the time, this will suffice to quickly check the contents of a sitemap file from a terminal session:
curl -sL http://sysd.org/sitemap.xml | grep -w loc
However, this approach relies on prettified XML (line breaks between URL nodes), and even then the URLs are wrapped in <loc>...</loc> tags. So this output can't be piped to curl (nor wget), yet.
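To see why, consider a minified sitemap with no line breaks at all (a hypothetical inline document, not the real sysd.org file). To grep, the whole file is one "matching line", so it prints everything back, tags included:

```shell
# Hypothetical minified sitemap, all on one line (not the real sysd.org file):
xml='<?xml version="1.0"?><urlset><url><loc>http://sysd.org/</loc></url><url><loc>http://sysd.org/about</loc></url></urlset>'
# grep -w matches the single line once and prints it whole, tags and all,
# instead of one clean URL per line:
printf '%s\n' "$xml" | grep -w loc
```

A proper XML parser sidesteps both problems at once, which is where the next tool comes in.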
Enter Mojolicious, self-described as a next generation web framework for the Perl programming language. It's perfect for writing web-enabled one-liners.
So, this is how I quickly handle sitemap.xml today:
perl -Mojo -le 'g($ARGV[0])->dom->find("loc")->each(sub{print shift->text})' http://sysd.org/sitemap.xml
This one-liner parses sitemap.xml as a DOM tree and outputs one URL per line. You could pipe it to vim - to review the link list manually, or even to wget -i - to fetch everything. With minor tweaks, it can handle sitemap indexes as well.