Markup.ml – best-effort, standards-compliant HTML5 and XML parser

Markup.ml is a standards-compliant parser for HTML5 and XML that handles bad input.

In particular, the HTML5 specification contains over 100 pages of detailed rules for recovering from every kind of malformed HTML, and Markup.ml implements almost all of them. So you just feed it bytes, and Markup.ml will make sense of them. You can optionally provide an error-reporting function, and use Markup.ml to lint or validate HTML against these rules.
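As a rough illustration of the error-reporting hook, here is a minimal sketch of a linter that collects every recoverable error as a string. It assumes the markup opam package is installed; the function name lint is made up for this example, and the exact messages depend on the library version.

```ocaml
(* Sketch only: assumes the `markup` library is available.
   `lint` is a hypothetical helper name, not part of Markup.ml. *)
let lint html =
  let errors = ref [] in
  (* The parser calls this function once per recoverable error. *)
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  Markup.string html
  |> Markup.parse_html ~report
  |> Markup.signals
  |> Markup.drain;  (* force the whole document through the parser *)
  List.rev !errors

let () =
  lint "<body><p><em>Markup.ml<p>rocks!"
  |> List.iter print_endline
```

Because parsing is lazy, the stream has to be consumed (here with drain) before any errors are reported.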

The API is focused on ease of use. It is based around the familiar idiom of piping streams through stream transformers. For example, this is how you clean up malformed HTML:

let bad_html = "<body><p><em>Markup.ml<p>rocks!"

Markup.(string bad_html |> parse_html |> signals
        |> pretty_print |> write_html |> to_channel stdout)

This prints:

<body>
  <p>
    <em>Markup.ml</em>
  </p>
  <p>
    <em>rocks!</em>
  </p>
</body>

If you don’t want streams, there is a helper for easily loading a Markup.ml stream into a DOM (i.e., an AST). See the Markup.ml home page for more examples. Markup.ml also has pretty thorough reference documentation.
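A minimal sketch of the DOM route, assuming the markup opam package is installed. The node type and the names to_tree and count_elements are invented for this example; only Markup.tree and the parsing functions come from the library.

```ocaml
(* Sketch only: `node`, `to_tree` and `count_elements` are hypothetical names. *)
type node =
  | Text of string
  | Element of string * node list

(* Assemble the signal stream into a tree; returns None on an empty stream. *)
let to_tree html =
  Markup.string html
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.tree
       ~text:(fun strings -> Text (String.concat "" strings))
       ~element:(fun (_namespace, local_name) _attributes children ->
         Element (local_name, children))

(* Count the elements in a tree, ignoring text nodes. *)
let rec count_elements = function
  | Text _ -> 0
  | Element (_, children) ->
    1 + List.fold_left (fun acc child -> acc + count_elements child) 0 children
```

Since the HTML5 parser fills in missing structure, the root of the result is the html element even when the input is only a fragment such as "<p>hi</p>".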

In addition to the HTML5 parser, Markup.ml also includes a compliant XML parser with the same API, as well as compliant serializers for both HTML5 and XML.
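To illustrate that the XML side uses the same pipeline shape, here is a small sketch assuming the markup opam package is installed. The helper names element_names and reserialize_xml are made up for this example.

```ocaml
(* Sketch only: `element_names` and `reserialize_xml` are hypothetical names. *)

(* List the local names of all start tags, in document order. *)
let element_names xml =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.filter_map (function
       | `Start_element ((_namespace, local_name), _attributes) ->
         Some local_name
       | _ -> None)
  |> Markup.to_list

(* Parse XML and write it back out through the XML serializer. *)
let reserialize_xml xml =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.write_xml
  |> Markup.to_string
```

Swapping parse_xml for parse_html and write_xml for write_html gives the HTML version of the same pipeline.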

All of these are…

  • Streaming: they parse partial input and emit signals while input is still being received.
  • Lazy: input is not parsed unless you request parsing signals, so you can trivially stop parsing partway through a document.
  • Non-blocking: can be used with Lwt, but a straightforward blocking, non-Lwt API is the default for simple usage.
  • One-pass: the parsers don’t build up a DOM in memory or buffer much input, limiting memory consumption.
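The lazy, streaming behaviour can be sketched by pulling signals one at a time and stopping early. This assumes the markup opam package is installed; has_element is a hypothetical helper written for this example.

```ocaml
(* Sketch only: `has_element` is a hypothetical helper name. *)
let has_element name html =
  let signals =
    Markup.string html |> Markup.parse_html |> Markup.signals
  in
  (* Each call to Markup.next parses just enough input to produce one signal. *)
  let rec search () =
    match Markup.next signals with
    | None -> false
    | Some (`Start_element ((_namespace, local_name), _attributes))
      when local_name = name -> true  (* stop here; the rest is never parsed *)
    | Some _ -> search ()
  in
  search ()
```

Once the function returns true, the remaining input is simply left unread, which is what makes stopping partway through a document cheap.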

The parsers detect common encodings automatically, and the user sees everything in UTF-8. The HTML5 parser also handles SVG and MathML.


Markup.ml has been out since January 2016, and underlies the web scraper Lambda Soup. This announcement was prompted by the recent release of Markup.ml 0.7.5, which is a good opportunity to introduce Markup.ml to the community here on Discourse :slight_smile:

Markup.ml 0.7.5 contains some bugfixes and performance improvements; see the changelog here. Thanks to the credited contributors, and to all users!

I’ve also just created two easy-ish issues on Markup.ml, and Markup.ml has a code overview in its CONTRIBUTING.md. If the issues seem interesting, leave a comment to claim them :slight_smile:



Wow, this is really cool. Thank you! Is there anything special we need to do, or be aware of, in order to use it with Jane Street’s Async instead of with Lwt?

Thanks :slight_smile:

Unfortunately, yes. Markup.ml currently doesn’t have an Async binding, though it’s designed to make creating one easy. In particular, it’s necessary to write a small module and pass it to a functor. This should result in a module with an API equivalent to Markup_lwt.
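To give a feel for what such a small module might look like, here is a self-contained sketch. Everything in it is an assumption made for illustration: the signature IO_SKETCH only approximates the kind of interface the functor expects (check the reference documentation for the real one), and Toy_promise is a stand-in for Async's deferred type, not Async itself.

```ocaml
(* Hypothetical signature: conversions between a promise type and
   continuation-passing style. Not copied from Markup.ml. *)
module type IO_SKETCH = sig
  type 'a t
  val return : 'a -> 'a t
  val of_cps : ((exn -> unit) -> ('a -> unit) -> unit) -> 'a t
  val to_cps : (unit -> 'a t) -> (exn -> unit) -> ('a -> unit) -> unit
end

(* Toy promise standing in for an Async deferred. *)
module Toy_promise : sig
  include IO_SKETCH
  val peek : 'a t -> ('a, exn) result option
end = struct
  type 'a state =
    | Pending of (('a, exn) result -> unit) list
    | Done of ('a, exn) result

  type 'a t = 'a state ref

  let return x = ref (Done (Ok x))

  (* Resolve at most once, then run the waiting callbacks in order. *)
  let resolve promise result =
    match !promise with
    | Done _ -> ()
    | Pending callbacks ->
      promise := Done result;
      List.iter (fun callback -> callback result) (List.rev callbacks)

  (* Wrap a CPS computation in a promise. *)
  let of_cps f =
    let promise = ref (Pending []) in
    f (fun exn -> resolve promise (Error exn))
      (fun value -> resolve promise (Ok value));
    promise

  (* Run a promise-returning function and hand its outcome to callbacks. *)
  let to_cps f throw k =
    match f () with
    | exception exn -> throw exn
    | promise ->
      let deliver = function Ok value -> k value | Error exn -> throw exn in
      (match !promise with
       | Done result -> deliver result
       | Pending callbacks -> promise := Pending (deliver :: callbacks))

  let peek promise =
    match !promise with Done result -> Some result | Pending _ -> None
end
```

An actual Async binding would implement the same conversions on top of Async's own deferred and ivar types and then pass the resulting module to the functor.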

I don’t have the requisite Async expertise to do a good job of this myself. I actually intended to learn Async better right after working on Markup, but then I got a bit involved with Lwt.