Last Updated: January 06, 2019
· skyzyx

HTML (with microformats, microdata) → Markdown (GitHub-Flavored Markdown, CommonMark)

I have a version of my bio that is written in HTML with lots of microformats and microdata embedded.

I wanted to produce a Markdown (CommonMark, really) version without doing the conversion by hand.

NOTE: For those who don't know, macOS is a blend of the XNU kernel and FreeBSD tools, while most Linux distributions ship the GNU flavor of those tools. The sed in the example command must be the GNU version, not the BSD version built into macOS. You can install GNU sed using Homebrew (brew install gnu-sed), which makes it available under the name gsed.
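
As a minimal sketch of that check (not part of the original command), the snippet below picks a sed binary and reports whether it looks like GNU sed. It assumes Homebrew's gnu-sed formula, which installs the binary as gsed; the function name pick_sed and the variable SED are made up for illustration.

```shell
# Print the name of the sed binary to use: prefer `gsed` (the name used by
# Homebrew's gnu-sed formula), otherwise fall back to plain `sed`.
pick_sed() {
  if command -v gsed >/dev/null 2>&1; then
    echo gsed
  else
    echo sed
  fi
}

SED="$(pick_sed)"

# GNU sed accepts --version; the BSD sed shipped with macOS rejects it.
# Treat this as a rough heuristic rather than a definitive test.
if "$SED" --version >/dev/null 2>&1; then
  echo "looks like GNU sed: $SED"
else
  echo "does not look like GNU sed: $SED"
fi
```

With SED set this way, the sed step of the pipeline can be written as "$SED" in place of a bare sed.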

cat author.html | sed -r "s/<\/?span([^>]*)>//g" | pandoc -r html -w gfm --columns 10000 | tee

What this does:

  1. Reads the author.html file to stdout
  2. Pipes the content into GNU sed (whose -r flag enables extended regular expressions) to strip out all opening and closing <span> tags along with their attributes, leaving the text inside them in place
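  3. Pipes that to a tool called Pandoc, which reads HTML (-r html) and writes GitHub-Flavored Markdown (-w gfm), which is now a superset of CommonMark; --columns 10000 sets a line width large enough that Pandoc does not hard-wrap paragraphs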
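  4. Pipes Pandoc's results into tee, which overwrites the contents of the file it is given (the file name is missing from the command above) and also echoes the results to stdout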
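
The steps above can be sketched as two small shell functions. This is an illustrative rewrite under stated assumptions, not the original command: the function names strip_spans and html_to_gfm are hypothetical, the output file name is whatever the caller passes in, the input file is redirected into sed rather than read with cat, and sed is invoked with -E, which selects extended regular expressions in both GNU sed and the BSD sed on macOS.

```shell
# Remove opening and closing <span> tags (and their attributes) from stdin,
# keeping the text between them. -E selects extended regular expressions.
strip_spans() {
  sed -E 's/<\/?span[^>]*>//g'
}

# Convert an HTML file to GitHub-Flavored Markdown with Pandoc.
#   $1: input HTML file
#   $2: output Markdown file (name chosen by the caller; an assumption here)
# tee overwrites "$2" and also echoes the Markdown to stdout.
html_to_gfm() {
  strip_spans < "$1" \
    | pandoc -r html -w gfm --columns 10000 \
    | tee "$2"
}
```

A call might look like html_to_gfm author.html author.md, where author.md is a made-up output name. The span-stripping step can be tried on its own without Pandoc installed, for example by piping a short HTML fragment into strip_spans.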