Last Updated: February 25, 2016
·
4.617K
· 13k

Bash: regex greediness trick for complex enclosing patterns

Bash regex doesn't support greediness modifiers like PCRE's "?" (to be non-greedy).

So if anyone wants to match enclosing patterns (like "match everything inside quotes", or "match everything inside a HTML tag"), you have to rely on some workarounds.

A common trick for simple (single-char matching) patterns, like XML tags, is to match the left part and everything afterwards being greedy, then negate the right-side match, meaning "match from [left-hand] and everything that is not [right-hand]":

text="aaa<skipme>bbb<skipmetoo>ccc"
echo "${text//<*([^>])>/}"
# outputs "aaabbbccc"

But with more complex enclosing patterns, for example "[[match everything inside]]", which is pretty common for wiki markup. The trick above won't work since you rely on the character group negation that will match single chars. I tried playing with the negation operator "!()" but I failed miserably, so a workaround I found was to perform a 2-pass substitution. First I replace all "[[" and "]]" with a single, easily distinguishable, char then I replace it with the final desired string:

# Char 0x06 is the non-printable "ACK"
# You need to use as $var because the regexp
# won't accept "\x06" in the pattern
ch=`echo -e '\x06'`
text="aaa<<skipme>>bbb<<skipmetoo>>ccc"
tmp="${text//@(<<|>>)/$ch}"
echo "${tmp//${ch}*([^${ch}])${ch}}"
# "aaabbbccc"

A more detailed post on my blog:

http://monospaced-thoughts.com/2011/07/05/bash-regexp-trick/