Last Updated: March 08, 2016
· ryanartecona

RegEx to find <img/> tags missing alt attributes

A couple of weeks ago at work, I was helping a colleague write a git hook that ran some rudimentary tests on locally committed changes before they were accepted into the remote, so that invalid or malformed code would have to be corrected before making it into the central repo.

After a couple frustrating hours, I came up with a regex that matches only those tags that lack alt attributes:


<img(\s*(?!alt)([\w\-])+=([\"\'])[^\"\']+\3)*\s*\/?>


Notes on flexibility:

  • + the tag and its attributes may span multiple lines
  • + supports attributes whose names include hyphens (e.g. data-* attributes)
  • + tags can end with /> (self-closing) or just >
  • + attribute values can be in single or double quotes
  • - all of the tag's attributes must have quoted values (so <img src=me.jpg /> won't match, even though it has no alt)
  • - none of the tag's attribute values may contain quote characters (so <img src="<?php 'someString' ?>" /> won't match either)
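
To make these notes concrete, here is a minimal sketch that runs the pattern against one sample tag per bullet. It assumes Python's `re` module as the engine; the helper name `lacks_alt` and the sample tags are made up for illustration.

```python
import re

# The pattern from above, unchanged, written as a raw triple-quoted string
# so that neither the backslashes nor the quote characters need extra escaping.
IMG_WITHOUT_ALT = re.compile(
    r'''<img(\s*(?!alt)([\w\-])+=([\"\'])[^\"\']+\3)*\s*\/?>'''
)


def lacks_alt(tag):
    """Return True when the pattern matches somewhere in `tag` (hypothetical helper)."""
    return IMG_WITHOUT_ALT.search(tag) is not None


# One illustrative sample per note in the list above.
samples = [
    ('multi-line tag', '<img\n  src="a.png"\n  width="10"\n/>'),
    ('hyphenated attribute name', '<img src="a.png" data-id="7">'),
    ('self-closing tag', '<img src="a.png" />'),
    ('single-quoted value', "<img src='a.png'>"),
    ('tag that already has alt', '<img src="a.png" alt="A">'),
    ('unquoted value', '<img src=me.jpg />'),
    ('quote inside a value', '''<img src="<?php 'someString' ?>" />'''),
]

for label, tag in samples:
    print(f"{label}: {'match' if lacks_alt(tag) else 'no match'}")
```

The first four samples should be reported as matches and the last three as non-matches, the final two being the misses described in the last two bullets.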

Notes on usage:

  • this requires a Perl-compatible regular expression (PCRE) engine, which notably doesn't come standard with git grep; plain grep needs the -P flag
  • when searching source code, you should use ack instead of grep
  • in Python, your regex string literal should be prefixed with r (e.g. r'<regex>')
  • in JavaScript, use forward slashes to denote a regex literal, and use the g flag to match all occurrences instead of only the first (e.g. /<regex>/g)
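
As a rough sketch of the Python route, the snippet below reports every match in a block of text together with the line it starts on. The function name `find_missing_alt` is hypothetical, and `finditer` plays the role that the g flag plays in JavaScript.

```python
import re

# Raw string literal, as recommended above for Python.
IMG_WITHOUT_ALT = re.compile(
    r'''<img(\s*(?!alt)([\w\-])+=([\"\'])[^\"\']+\3)*\s*\/?>'''
)


def find_missing_alt(text):
    """Return (line_number, tag) pairs for every <img> tag the pattern matches."""
    hits = []
    for match in IMG_WITHOUT_ALT.finditer(text):
        # Count the newlines before the match to get a 1-based line number.
        line_number = text.count("\n", 0, match.start()) + 1
        hits.append((line_number, match.group(0)))
    return hits


page = '<p>\n<img src="a.png">\n<img src="b.png" alt="B">\n<img\n  src="c.png"\n/>\n'
for line_number, tag in find_missing_alt(page):
    print(f"line {line_number}: {tag!r}")
```

A hook script could feed each changed file's text through a function like this and exit non-zero when the list is non-empty; that wiring is left out here.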

If you find this useful, incomplete, or interesting in any way, drop me a tweet!

4 Responses

First of all, thank you. I've changed it to look for links/anchors that are missing a class attribute.

<a(\s*(?!class)([\w-])+=([\"\'])[^\"\']+\3)*\s*\/?>

Could it be extended to look for links/anchors missing a particular class such as 'standard_link'? The problem is further complicated by the fact that there could be multiple classes defined. I know regex shouldn't really be used with HTML, but I want to use it with the program grepWin to look through my PHP code to make sure I've added a class to all my links.

over 1 year ago ·

@u01jmg3 Glad you got some use out of it!

I'm sure it's possible for it to be extended in that way, but I'm not so sure I've got the chops to do it. One thing I learned from sculpting this thing is that trying to match anything beyond a well-defined text, date, or other numerical format can make for a pretty hairy regex pretty quickly. You may want to give it a shot yourself, but you would probably be better off writing a script that searches those files in that way for you with the help of a decent HTML parsing library.
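
For what it's worth, here is a minimal sketch of what such a script might look like. It assumes Python's standard-library `html.parser` stands in for the "decent HTML parsing library"; the names `LinkClassChecker` and `links_missing_class` are made up, and it hasn't been tried on files that mix PHP with markup.

```python
from html.parser import HTMLParser


class LinkClassChecker(HTMLParser):
    """Collect <a> start tags whose class list lacks a required class name."""

    def __init__(self, required_class):
        super().__init__()
        self.required_class = required_class
        self.offenders = []  # (line_number, raw start tag) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        class_value = dict(attrs).get("class") or ""
        # Splitting on whitespace copes with several classes on one link.
        if self.required_class not in class_value.split():
            line_number, _column = self.getpos()
            self.offenders.append((line_number, self.get_starttag_text()))


def links_missing_class(html_text, required_class="standard_link"):
    """Return (line_number, tag) pairs for links lacking `required_class`."""
    checker = LinkClassChecker(required_class)
    checker.feed(html_text)
    checker.close()
    return checker.offenders


sample = (
    '<a href="/x" class="btn standard_link">ok</a>\n'
    '<a href="/y" class="btn">missing</a>\n'
    '<a href="/z">missing</a>'
)
for line_number, tag in links_missing_class(sample):
    print(f"line {line_number}: {tag}")
```

Because the parser hands over attributes as name/value pairs, quoting style and attribute order stop mattering, which is exactly the part that makes the regex hairy.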

over 1 year ago ·

@ryanartecona
I ended up using XPath.

$html = file_get_contents($filename);
$dom = new DOMDocument();
@$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//a[not(img) and not(contains(concat(' ', @class, ' '), ' standard_link '))]");

if ($entries->length > 0) {
    // Open the per-file list only when there is something to report
    echo '<ul><li style="font-weight: bold;">' . $filename . '</li><ol>';

    foreach ($entries as $entry) {
        // Escape the offending tag so it is displayed as text, not rendered
        echo '<li>' . htmlspecialchars($dom->saveHTML($entry)) . '</li>';
    }

    // Close the list inside the same branch that opened it
    echo '</ol></ul>';
}
over 1 year ago ·

To remove any img tag that does not contain a src attribute:


$str = preg_replace('#<img\s((?!src=).)*/?>#Umi','',$str);

This is shorter and, I think, more reliable.
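
The same idea can be sketched in Python, assuming the intent is a pattern that consumes characters one at a time, and only while `src=` does not come next. The function name `strip_imgs_without_src` is hypothetical, and the lazy `*?` stands in for PHP's U modifier.

```python
import re

# Assumption: each character is consumed only if "src=" does not start at that position.
# Without re.DOTALL, "." stops at newlines, so tags spanning several lines are left alone.
IMG_WITHOUT_SRC = re.compile(r'<img\s((?!src=).)*?/?>', re.IGNORECASE)


def strip_imgs_without_src(html_text):
    """Remove every <img> tag in which the text "src=" never appears."""
    return IMG_WITHOUT_SRC.sub('', html_text)


print(strip_imgs_without_src('<p><img alt="x"><img src="a.png" alt="y"></p>'))
```

One caveat: a tag that only carries `data-src=` also contains the text `src=`, so it is kept.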
over 1 year ago ·