Last Updated: January 06, 2019
· skyzyx

HTML (with microformats, microdata) → Markdown (GitHub-Flavored Markdown, CommonMark)

I have a version of my bio that is written in HTML with lots of microformats and microdata embedded. https://ryanparman.com/about/#full-length

I wanted to produce a Markdown (CommonMark, really) version without having to do the conversion by hand. https://ryanparman.com/about/#markdown

NOTE: For those who don't know, macOS combines the XNU kernel with BSD userland tools (largely derived from FreeBSD), while most Linux distributions ship the GNU flavor of those tools. The example code calls sed, which should be the GNU version, not the BSD version built into macOS. You can install GNU sed using Homebrew (brew install gnu-sed), which makes it available as gsed.
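A minimal sketch for picking a suitable sed before running the pipeline. The variable names SED and sed_supports_r are illustrative and not part of the original snippet; the probe simply checks whether the chosen binary accepts -r with an extended regular expression.

```shell
# Install GNU sed on macOS first, if needed (not run here):
#   brew install gnu-sed

# Prefer gsed (the name Homebrew gives GNU sed); otherwise fall back to sed,
# which is already GNU sed on most Linux distributions.
if command -v gsed >/dev/null 2>&1; then
  SED=gsed
else
  SED=sed
fi

# Behavioral probe: does this sed accept -r with extended-regex syntax?
if printf 'ab\n' | "$SED" -r 's/(a)b?/\1/' >/dev/null 2>&1; then
  sed_supports_r=yes
else
  sed_supports_r=no
fi

echo "using $SED (supports -r: $sed_supports_r)"
```

If the probe reports "no", install GNU sed and substitute gsed for sed in the command.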

cat author.html | sed -r "s/<\/?span([^>]*)>//g" | pandoc -r html -w gfm --columns 10000 | tee author.md

What this does:

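  1. Writes the contents of the author.html file to stdout
  2. Pipes that content into GNU sed, where -r enables extended regular expressions, to strip every opening and closing <span> tag (attributes included) while keeping the text inside
  3. Pipes the result to a tool called Pandoc, which converts the HTML to GitHub-Flavored Markdown (which is now a superset of CommonMark); --columns 10000 effectively keeps Pandoc from hard-wrapping long lines
  4. Uses tee to overwrite the contents of author.md with Pandoc's results while also printing them to the terminal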
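The steps above can be tried on a small sample before touching a real file. This is a sketch under stated assumptions: the sample markup and the html_to_gfm function name are hypothetical, and it uses sed -E (accepted by both GNU and BSD sed) in place of -r. It also reads the file directly with sed instead of cat, and spells Pandoc's -r/-w flags as the equivalent -f/-t.

```shell
# Hypothetical sample input, standing in for author.html.
sample='<p><span class="p-name" itemprop="name">Ryan</span> writes <span>code</span>.</p>'

# Step 2 in isolation: remove opening and closing <span> tags, keep their text.
stripped=$(printf '%s\n' "$sample" | sed -E 's/<\/?span[^>]*>//g')
echo "$stripped"
# prints: <p>Ryan writes code.</p>

# Steps 1-4 wrapped in a function, with the input and output paths as arguments.
# Requires pandoc on the PATH; it is defined here but not called.
html_to_gfm() {
  sed -E 's/<\/?span[^>]*>//g' "$1" \
    | pandoc -f html -t gfm --columns 10000 \
    | tee "$2"
}

# Usage: html_to_gfm author.html author.md
```

Passing the paths as arguments keeps the function reusable for other HTML fragments, not only the author bio.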