I mentioned in Where Have the Years Gone? that Jeff and I have been working on a PHP HTML parser and an HTML DOM implementation and that I would like to write about it more; this is my attempt to do so. The impetus for these libraries was needing a better scraper for our RSS aggregator server, The Arsse. I had written an HTML5 parser back when the parser specification was new, and up until last week this website was generated using it. It was sloppy: it lived in a single 20,000-line file, supported only UTF-8, and was never updated for the changes made to the specification since I first wrote it. It was simply not suitable for use in The Arsse.
The Arsse uses PicoFeed to parse feeds and scrape websites. PicoFeed was originally authored for use with Miniflux, and the library was abandoned when the developer decided to switch from PHP to Go, leaving us in a pickle. Thankfully, someone picked up the torch, and we’ve contributed to the project since then. However, we’d still prefer to write our own because of issues we’ve found along the way; we would also like to support JSON Feed, even though our own opinions of the format are not entirely glowing. It borrows far too much from RSS and not enough from the improvements brought forth by Atom, repeating 15+ year old mistakes in the process.
I began writing the new parser in 2017 based off of the WHATWG living standard instead of HTML5 while Jeff was still focused on The Arsse proper. I, unfortunately, became bored with it and moved on to something else; the process is mostly tedium. Jeff decided it was time to work on it after beginning Lax, our in-progress feed parser mentioned above. His working on it got me interested in it again, and over time he became focused on the parser itself while I focused on the DOM. There were also a couple of branching projects that resulted from this, namely a set of internationalization tools that actually conform to WHATWG’s encoding standard (PHP’s intl and mbstring extensions don’t handle this correctly) and a MIME sniffer library. Jeff wrote the entirety of these with my providing nothing but occasional input.
There are other PHP HTML parsers, most notably the widely used Masterminds HTML5 parser. It isn’t very accurate and in some cases fails to parse perfectly valid documents at all. HTML-Parser conforms to the specification where it can. It is also extensively unit tested, including with html5lib’s own tests. The cost of following the specification this closely is that it is slower than Masterminds’ library. We believe accuracy is more important, especially when we attempt to scrape websites that may or may not be well-formed at all. We need the result to be what a browser would parse.
Originally, the parser and an extension to PHP’s dom extension were included together, mostly existing to circumvent PHP DOM bugs when appending and when handling namespaced attributes. This, however, caused parsing to slow down a bit, and the more I added to the DOM to fill out missing features the slower it became. The decision was made to separate the two and bake the circumventions necessary for accurate parsing into the parser itself. This was a blessing in disguise which will become apparent later.
After an initial write and working out bugs while unit testing against html5lib’s tests, we went through a period of shaving off fractions of a second here and there, optimizing for an extremely large document: WHATWG’s single page HTML specification. I think initially it took around 30 seconds to parse on my computer. Today, it’s around 5.5 seconds. The official benchmarks listed in the README of HTML-Parser are from Jeff’s computer, one slightly slower than my own. We still have some more ideas for improvements which might shave a bit more off the top. However, we don’t want to sacrifice readability of the code; the code still needs to be maintained by humans. Well, Jeff might actually be a robot…
Initially, a conforming HTML serializer was part of the DOM part of the HTML-Parser library. I had written a fully functioning and unit tested serializer. After the two parts were split into separate libraries, Jeff decided it should be part of the parser and wrote another one. I had just finished writing my initial stab at a pretty printer for the serializer in HTML-DOM, so I migrated everything over to Jeff’s serializer when I was able to. HTML-DOM still serializes as it should, but the work is largely done by HTML-Parser.
When initially writing the DOM classes they were simple extensions of PHP’s DOM using its DOMDocument::registerNodeClass method. As I dug deeper into the WHATWG DOM specification, I discovered that it was too difficult to follow the specification as I was running up against type errors in PHP’s XML-based DOM. The straw that broke the camel’s back was when the node passed to Document::adoptNode could not be passed by reference. Since the library wasn’t married to HTML-Parser anymore I was free to do whatever I needed without worrying about how much it would affect parsing speed. My decision was then to wrap PHP DOM’s classes. I could then do whatever I wanted and let PHP’s DOM handle it internally. This benefitted me greatly as soon as I started running unit tests.
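To make the difference between the two approaches concrete, here is a rough sketch using only PHP's built-in DOM. The class names ExtendedElement, WrappedElement, and WrappedDocument are made up for illustration and are not the classes HTML-DOM actually uses; the sketch assumes PHP 8.0 or later for constructor property promotion.

```php
<?php
declare(strict_types=1);

// Approach 1: extend PHP's DOMElement and register the subclass.
// ExtendedElement is a hypothetical name used only for this sketch.
class ExtendedElement extends DOMElement {
    // Adding new methods is easy, but every inherited method keeps
    // PHP DOM's own signature and return type.
    public function hasClassName(string $name): bool {
        $classes = preg_split('/\s+/', trim($this->getAttribute('class')));
        return in_array($name, $classes, true);
    }
}

$doc = new DOMDocument();
$doc->registerNodeClass(DOMElement::class, ExtendedElement::class);
$extended = $doc->createElement('p'); // created as an ExtendedElement
$extended->setAttribute('class', 'note warning');

// Approach 2: wrap the PHP DOM objects instead of extending them.
// The wrapper shares no ancestry with DOMElement, so its methods can have
// whatever signatures are needed while PHP's DOM does the work internally.
final class WrappedElement {
    public function __construct(private DOMElement $inner) {}

    public function localName(): string {
        return $this->inner->localName;
    }
}

final class WrappedDocument {
    private DOMDocument $inner;

    public function __construct() {
        $this->inner = new DOMDocument();
    }

    public function createElement(string $localName): WrappedElement {
        return new WrappedElement($this->inner->createElement($localName));
    }
}

$wrapped = (new WrappedDocument())->createElement('p');
```

The trade-off is that a wrapper has to delegate every method it wants to expose, which is more code and some overhead per call, but nothing in the public interface is constrained by what PHP's DOM declares.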
PHP’s DOM is at best a flimsy and buggy wrapper written to access a buggy and antiquated XML library that conforms to no specification whatsoever, new or old. It returns empty strings when it should return null in some circumstances. It has issues with namespaces, especially concerning the xmlns attribute. When inserting nodes, any non-default (in PHP DOM’s case null is default) namespaced elements that are children of non-namespaced elements are prefixed with default. The same goes for attributes. Also, presumably due to a memory management bug in the underlying libxml2, DOM operations become exponentially slower the more namespaced elements there are. This leads us to use null internally while exposing the HTML namespace externally. In reality, there needs to be a new DOM extension for PHP, but that is beyond what I am capable of programming. Wrapping the classes allows these bugs to be circumvented at least.
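Two of these workarounds are simple enough to sketch with PHP's built-in DOM alone. The wrapper class SketchElement and the constant HTML_NAMESPACE below are hypothetical names for illustration, not HTML-DOM's real code. The sketch shows getAttribute returning null for a missing attribute, as the WHATWG DOM specifies, where PHP's DOMElement::getAttribute returns an empty string, and shows a null internal namespace being reported as the HTML namespace.

```php
<?php
declare(strict_types=1);

// The HTML namespace URI as defined by the WHATWG specifications.
const HTML_NAMESPACE = 'http://www.w3.org/1999/xhtml';

// SketchElement is a hypothetical wrapper written for this example only.
final class SketchElement {
    public function __construct(private DOMElement $inner) {}

    // PHP DOM returns '' for a missing attribute; the WHATWG DOM returns null.
    public function getAttribute(string $name): ?string {
        return $this->inner->hasAttribute($name)
            ? $this->inner->getAttribute($name)
            : null;
    }

    // Elements are kept in the null namespace internally; outside callers
    // see the HTML namespace instead. Other namespaces pass through.
    public function namespaceURI(): string {
        return $this->inner->namespaceURI ?? HTML_NAMESPACE;
    }
}

$doc = new DOMDocument();

$div = $doc->createElement('div'); // no namespace internally
$div->setAttribute('title', '');   // present, but empty
$html = new SketchElement($div);

$svg = new SketchElement(
    $doc->createElementNS('http://www.w3.org/2000/svg', 'svg')
);
```

With this in place a caller can tell an attribute that is absent apart from one that is present but empty, which the inner DOMElement::getAttribute alone cannot express.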
While developing this library I discovered another attempt to do something similar: PhpGt. Somewhere along the path of their development they came to the same conclusion that the built-in classes must be wrapped to get anything meaningful done. That’s where the similarities between the libraries end, though. It oddly wraps all PHP DOM classes such as DOMText, etc. when only
Both libraries outlined above are available on Packagist as mensbeam/html-parser and mensbeam/html-dom and may be installed through Composer.