Last Updated: September 09, 2019
· alexanderbrevig

Wisdom: Don't use RegEx to parse HTML

This entertaining post explains why: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

tl;dr
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. - Upset StackOverflow User
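
To make the complaint concrete, here is a minimal Python sketch of what goes wrong. The `<div>` snippets and variable names are made up for illustration; the point is only that a regular expression has no notion of nesting, so neither the lazy nor the greedy variant of the same pattern pairs tags correctly.

```python
import re

# Hypothetical input: one <div> nested inside another.
nested = "<div>outer <div>inner</div> tail</div>"

# A lazy pattern stops at the first closing tag it finds,
# so the outer <div> is cut short in the middle of its content.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

# Hypothetical input: two sibling <div> elements.
siblings = "<div>a</div><div>b</div>"

# A greedy pattern runs on to the last closing tag instead,
# so it merges both siblings into a single "match".
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(lazy))
print(repr(greedy))
```

Neither result corresponds to the content of a real element, and making the pattern more elaborate only moves the failure to a different input.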

<h3>Use a real HTML Parser:</h3>

<ul>
<li>Ruby: <a href="http://nokogiri.org/">Nokogiri</a></li>
<li>JavaScript: <a href="http://jquery.com/">jQuery</a></li>
<li>PHP: <a href="http://docs.php.net/manual/en/domdocument.loadhtml.php">PHP5 DOMDocument</a></li>
<li>.NET (C#): <a href="http://htmlagilitypack.codeplex.com/">Html Agility Pack</a></li>
<li>VB6: <a href="http://www.codeguru.com/vb/vb_internet/html/article.php/c4815">MSHTML</a> (Used by IE)</li>
<li>Python: <a href="http://lxml.de/xpathxslt.html">lxml</a></li>
<li>Perl: <a href="http://search.cpan.org/~gaas/HTML-Parser-3.68/Parser.pm">HTML::Parser</a></li>
<li>Java: <a href="http://htmlcleaner.sourceforge.net/">HtmlCleaner</a></li>
</ul>
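
As a rough sketch of what using a parser looks like, the Python example below collects link targets from a snippet of markup. It uses the standard library's `html.parser` rather than lxml, purely so that it runs without installing anything; the names `LinkCollector`, `extract_links`, and `sample` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased, whatever the source markup used.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in the given markup."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample: uppercase tags, a '>' inside an attribute value,
# and a link that is commented out.
sample = """
<ul>
  <li><A HREF='http://nokogiri.org/'>Nokogiri</A></li>
  <li><a title="a > b" href="http://lxml.de/">lxml</a></li>
  <!-- <a href="http://example.com/commented-out">ignored</a> -->
</ul>
"""

print(extract_links(sample))
```

The parser takes care of the cases that tend to break hand-written patterns: tag case, quoting style, a `>` inside an attribute value, and markup hidden in a comment. Here it should report the two live links and skip the commented-out one.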