Markup.ml is a standards-compliant parser for HTML5 and XML that handles bad input.
In particular, the HTML5 specification contains over 100 pages of detailed rules for recovering from every kind of malformed HTML, and Markup.ml implements almost all of them. So, you just feed it bytes, and Markup.ml will make sense of them. You can optionally provide an error-reporting function, and use Markup.ml to lint or validate HTML against these rules.
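As a rough illustration of the error-reporting hook, here is a minimal sketch of a linter. The name `lint` is made up for this example; it assumes the `markup` library is installed and uses the `~report` argument of `parse_html` together with `Markup.Error.to_string`.

```ocaml
(* Sketch only: `lint` is a hypothetical helper, not part of Markup.ml.
   It collects every parse error as a human-readable string. *)
let lint html =
  let errors = ref [] in
  (* Called by the parser each time it recovers from malformed input. *)
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  (* The parser is lazy, so drain the signal stream to force a full parse. *)
  Markup.(string html |> parse_html ~report |> signals |> drain);
  List.rev !errors

let () =
  lint "<body><p><em>Markup.ml<p>rocks!" |> List.iter print_endline
```

The exact messages depend on the Markup.ml version, so none are shown here.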
The API is focused on ease of use. It is based around the familiar idiom of piping streams through stream transformers. For example, this is how you clean up malformed HTML:
let bad_html = "<body><p><em>Markup.ml<p>rocks!"

(* Parse the malformed input, then pretty-print the repaired HTML to stdout. *)
let () =
  Markup.(
    string bad_html |> parse_html |> signals |> pretty_print
    |> write_html |> to_channel stdout)
<body>
  <p>
    <em>Markup.ml</em>
  </p>
  <p>
    <em>rocks!</em>
  </p>
</body>
If you don’t want streams, there is a helper for easily loading a Markup.ml stream into a DOM (i.e., an AST). See the Markup.ml home page for more examples. Markup.ml also has pretty thorough reference documentation.
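To give an idea of what loading a stream into a DOM can look like, here is a minimal sketch using `Markup.tree`. The `node` type and the `to_dom` function are invented for this example, and the `markup` library is assumed to be installed.

```ocaml
(* Hypothetical DOM type for this sketch; Markup.ml lets you pick your own. *)
type node =
  | Text of string
  | Element of string * node list

(* Assemble the first top-level tree in the signal stream into a `node`.
   Returns None if the stream contains no tree at all. *)
let to_dom html =
  Markup.string html
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.tree
       ~text:(fun ss -> Text (String.concat "" ss))
       ~element:(fun (_ns, name) _attrs children -> Element (name, children))
```

Attributes and namespaces are dropped here to keep the sketch short; a real DOM type would probably keep them.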
In addition to the HTML5 parser, Markup.ml also includes a compliant XML parser with the same API, as well as compliant serializers for both HTML5 and XML.
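Since the XML side uses the same API, an XML round trip can be sketched in the same pipeline style. `round_trip_xml` is a made-up name for this example, and the `markup` library is assumed to be installed.

```ocaml
(* Sketch only: parse XML and serialize it straight back to a string. *)
let round_trip_xml xml =
  Markup.(string xml |> parse_xml |> signals |> write_xml |> to_string)

let () = print_endline (round_trip_xml "<a><b>x</b></a>")
```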
All of these are…
- Streaming: they parse partial input and emit signals while input is still being received.
- Lazy: input is not parsed unless you request parsing signals, so you can trivially stop parsing partway through a document.
- Non-blocking: can be used with Lwt, but a straightforward blocking, non-Lwt API is the default for simple usage.
- One-pass: the parsers don’t build up a DOM in memory or buffer much input, limiting memory consumption.
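The lazy, one-pass behavior can be sketched by pulling signals one at a time with `Markup.next` and stopping early. `first_element_name` is a hypothetical helper written for this example, assuming the `markup` library is installed.

```ocaml
(* Sketch only: return the local name of the first element the parser emits.
   Parsing stops as soon as that element is seen; the rest of the input is
   never parsed. *)
let first_element_name html =
  let stream = Markup.(string html |> parse_html |> signals) in
  let rec scan () =
    match Markup.next stream with
    | Some (`Start_element ((_ns, name), _attrs)) -> Some name
    | Some _ -> scan ()   (* skip doctype, text, comments, and so on *)
    | None -> None        (* end of input without any element *)
  in
  scan ()
```

The same loop works when the underlying stream is a channel or a file rather than a string, which is where stopping early actually saves work.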
The parsers detect common encodings automatically, and the user sees everything in UTF-8. The HTML5 parser also handles SVG and MathML.
Markup.ml has been out since January 2016, and underlies the web scraper Lambda Soup. This announcement was prompted by the recent release of Markup.ml 0.7.5, which is a good opportunity to introduce Markup.ml to the community here on Discourse.
Markup.ml 0.7.5 contains some bugfixes and performance improvements; see the changelog here. Thanks to the credited contributors, and to all users!