HTML-Parser & HTML-DOM

Monday, February 28^th, 2022 12:35 CST

I mentioned in Where Have the Years Gone? that Jeff and I have been working on a PHP HTML parser and an HTML DOM implementation and that I would like to write about it more; this is my attempt to do so. The impetus for these libraries was needing a better scraper for our RSS aggregator server, The Arsse. I had written an HTML5 parser back when the parser specification was new, and up until last week this website was generated using it. It was sloppy, in a single 20,000 line file, only supported UTF-8, and wasn’t updated for all the changes in the specification from when I first wrote it; it was simply not suitable for use in The Arsse.

The Arsse uses PicoFeed to parse feeds and scrape websites. PicoFeed was originally authored for use with Miniflux, and the library was abandoned when the developer decided to switch to Go from PHP — leaving us in a pickle. Thankfully, someone picked up the torch, and we’ve contributed to the project since then. However, we’d still prefer to write our own because of issues we’ve found along the way; we would also like to support JSON Feed, too, even though we have not entirely glowing opinions of the format ourselves. It borrows far too much from RSS and not enough from improvements brought forth by Atom — repeating 15+ year old mistakes in the process.

HTML-Parser

I began writing the new parser in 2017 based off of the WHATWG living standard instead of HTML5 while Jeff was still focused on The Arsse proper. I, unfortunately, became bored with it and moved onto something else; the process is mostly tedium. He decided it was time to work on it after beginning Lax, our in-progress feed parser mentioned above. His working on it got me interested in it again, and over time he became focused on the parser itself while I focused on the DOM. There were also a couple of branching projects that resulted from this, namely a set of internationalization tools that actually conform to WHATWG’s encoding standard (PHP’s intl and mbstring extensions don’t handle this correctly) and a mime sniffer library. Jeff wrote the entirety of these with my providing nothing but occasional input.

There are other PHP HTML parsers, most notably the widely used Masterminds HTML5 parser. Masterminds HTML5 parser isn’t very accurate and in some cases fails to parse perfectly valid documents at all. HTML-Parser conforms to the specification where it can. It is also extensively unit tested, including with html5lib’s own tests. Because of this it is also slower than Masterminds’ library. We believe this accuracy is more important — especially when we attempt to scrape websites that may or may not be well-formed at all. We need the result to be what a browser would parse.

Originally, the parser and an extension to PHP’s dom extension were included together, mostly existing to circumvent PHP DOM bugs when appending and when handling namespaced attributes. This, however, caused parsing to slow down a bit, and the more I added to the DOM to fill out missing features the slower it became. The decision was made to separate the two and bake the circumventions necessary for accurate parsing into the parser itself. This was a blessing in disguise which will become apparent later.

After an initial write and working out bugs when unit testing against html5lib’s tests we went through a period shaving off fractions of a second here and there optimizing it when parsing an extremely large document: WHATWG’s single page HTML specification. I think initially it was around 30 seconds on my computer. Today, it’s around 5.5 seconds. The official benchmarks listed in the README of HTML-Parser are from Jeff’s computer, one slightly slower than my own. We still have some more ideas for improvements which might shave a bit more off the top. However, we don’t want to sacrifice readability of the code; the code still needs to be maintained by humans. Well, Jeff might actually be a robot…

Initially, a conforming HTML serializer was part of the DOM part of the HTML-Parser library. I had written a fully functioning and unit tested serializer. After the two parts were separated into separate libraries, Jeff decided it should be part of the parser and wrote another one. I just finished writing my initial stab at a pretty printer for the serializer in HTML-DOM, so I migrated everything over to Jeff’s serializer when I was able to. HTML-DOM still serializes as it should, but it’s largely from HTML-Parser.

HTML-DOM

When initially writing the DOM classes they were simple extensions of PHP’s DOM using its DOMDocument::registerNodeClass method. As I dug deeper into the WHATWG DOM specification, I discovered that it was too difficult to follow the specification as I was running up against type errors in PHP’s XML-based DOM. The straw that broke the camel’s back was when the node passed to Document::adoptNode could not be passed by reference. Since the library wasn’t married to HTML-Parser anymore I was free to do whatever I needed without worry about how much it would affect parsing speed. My decision was to then wrap PHP DOM’s classes. I could then do whatever I wanted and let PHP’s DOM handle it internally. This benefitted me greatly as soon as I started running unit tests.

PHP’s DOM is at best a flimsy and buggy wrapper written to access a buggy and antiquated XML library that conforms to no specification whatsoever, new or old. It returns empty strings when it should return null in some circumstances. It has issues with namespaces, especially concerning the xmlns attribute. When inserting nodes any non-default (in PHP DOM’s case null is default) namespaced elements that are children of non-namespaced elements are prefixed with default. Same goes for attributes. Also due to what presumably is a memory management bug in the original xmllib the more namespaced elements there are DOM operations become exponentially slower. This leads us to use null internally while exposing the HTML namespace externally. In reality, there needs to be a new DOM extension for PHP, but that is beyond what I am capable of programming. Wrapping the classes allows these bugs to be circumvented at least.

It also allows templates to be implemented as specified. While templates work as specified in HTML-DOM, they were a colossal pain to implement because of the XML-based inner DOM. A lot of code is written in both HTML-DOM and HTML-Parser just for handling templates whether it is for parsing, insertion, and especially for serializing. In my personal opinion template elements are the most ill conceived thing in HTML at present. They were designed to be used within a modular loading system. One such system was specified and implemented in Blink, but some drama that I don’t quite remember the details of ensued; it never was implemented in anything else and was subsequently removed from Chrome. JavaScript modules are now supported in place of them. Template elements are treated differently when parsing and have different rules when manipulated with the DOM, and is an everpresent exception throughout the specification. Storing HTML and CSS in JavaScript is a constant source of consternation from old hats like me who during the web standards push in the late 1990’s and early aughts had separation drilled into us, but as soon as HTML imports were abandoned there has simply been no other alternative. It’s bonkers to have JavaScript append template elements to the document when the HTML and CSS for components can simply be stored in template strings in code. Having them already in the document is inefficient as well because they’re downloaded and therefore cached for every page; in JavaScript they’re downloaded once and cached once.

While developing this library I discovered another attempt to do something similar: PhpGt. Somewhere along the path of their development they came to the same conclusion that the built-in classes must be wrapped to get anything meaningful done. That’s where the similarities between the libraries end, though. It oddly wraps all PHP DOM classes such as DOMDocument, DOMElement, DOMText, etc. when only DOMDocument is necessary; handles templates incorrectly; has multitudes of factory classes that overcomplicates things further; and it also seems to follow Mozilla’s MDN documentation instead of the actual specification. This has led to widespread bugs and implementation errors because MDN is a JavaScript developer’s documentation, not an implementor’s documentation. Other differences are that PhpGt’s DOM implements individual element classes for everything while ours only presently supports what is barely necessary for the element creation algorithm. There are plans to support more in the future as time permits. Our library is also backed by a conforming parser while PhpGt’s isn’t. A full list of what HTML-DOM supports is available in its README along with any implementation details and deviations.

Both libraries outlined above are available on Packagist as mensbeam/html-parser and mensbeam/html-dom and may be installed through Composer.