[ANN] cmarkit 0.1.0 – CommonMark parser and renderer for OCaml

Hello,

It’s my pleasure to announce the first release of the Cmarkit library.

Cmarkit implements the CommonMark specification. It provides:

  • A CommonMark parser for UTF-8 encoded documents. Link label resolution can be customized and a non-strict parsing mode can be activated to add: strikethrough, LaTeX math, footnotes, task items and tables.

  • An extensible abstract syntax tree for CommonMark documents with source location tracking and best-effort source layout preservation.

  • Abstract syntax tree mapper and folder abstractions for quick and concise tree transformations.

  • Extensible renderers for HTML, LaTeX and CommonMark with source layout preservation.
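
For a quick taste, here is a minimal sketch of the basic pipeline: parse a UTF-8 string and render it to HTML. The html_of_commonmark helper name is mine; double check the Doc.of_string and Cmarkit_html.of_doc entry points against the docs.

(* Minimal sketch: parse CommonMark and render it to HTML. *)
let html_of_commonmark ?(strict = true) md =
  let doc = Cmarkit.Doc.of_string ~strict md in   (* parse, strict CommonMark by default *)
  Cmarkit_html.of_doc ~safe:true doc              (* render; safe:true omits raw HTML *)

let () = print_string (html_of_commonmark "Hello *CommonMark*!")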

Cmarkit is distributed under the ISC license. It has no dependencies.

This first release benefited from a grant from the OCaml Software Foundation. Funding from my few but faithful donors is also paramount to getting these tedious bits out for release. Thank you all for your support.

Homepage: https://erratique.ch/software/cmarkit
Docs: https://erratique.ch/software/cmarkit/doc (or odig doc cmarkit)
Install: opam install cmarkit (once this PR is merged)

Best,

Daniel

31 Likes

Since someone is going to ask, here’s my biased comparison between cmarkit and omd, the only other OCaml CommonMark parser I’m aware of.

This is based on what I see here and on what I ran into while using the omd tool.

  1. cmarkit takes the whole document as input; omd can work line by line on input channels. Note however that in practice, due to how CommonMark parsing works, you need the document in memory anyway and must wait for the end of input to trigger inline parsing.
  2. cmarkit should conform to the CommonMark spec: all conformance tests pass. I don’t think omd does, since U+0000 does not seem to be replaced by U+FFFD and in general it seems to forgo UTF-8 decoding.
  3. cmarkit provides location tracking and source layout information in the AST; omd does not.
  4. cmarkit fails on 3/22 (2/22 in OCaml 5) of the cmark pathological tests. omd fails on 17/22 of them.
  5. Going from CommonMark to HTML on a large 12 MB Markdown file, cmarkit seems slightly faster (about 26%) than omd (even when tracking locations and layout it is still about 10% faster). But no scientific benchmark was performed, nor was particular attention paid to performance, nor is it likely to matter in practice (unless you are in charge of rendering all the READMEs of a code hosting platform).
  6. cmarkit has renderers to CommonMark (layout preserving) and LaTeX. omd does not, but it has one to sexp which cmarkit lacks.
  7. cmarkit renderers are extensible and partially redefinable. omd’s aren’t.
  8. cmarkit lets clients customize link label definition and resolution, which allows embedding data binding DSLs in the very flexible label syntax. omd has no such thing.
  9. cmarkit’s AST is extensible. omd’s is not.
  10. cmarkit has per-node extensible metadata. omd uses a polymorphic scheme.
  11. cmarkit has AST mappers and folders. omd has no such thing.
  12. cmarkit has no dependencies. omd depends on a bunch of other packages.
  13. cmarkit and omd support different syntax extensions. It is unclear which ones are supported by omd; for cmarkit, see the docs.
  14. cmarkit reuses the CommonMark spec vocabulary and the docs are fully hyperlinked into the specification to help you understand the terrible morass you are dealing with.
  15. cmarkit, the tool provided with the library, is a bit more featureful than the omd (or reference cmark) tool. Notably (with enough options specified :–) support is provided to output full HTML and LaTeX documents that are ready to read and render.

In general I’d say omd is fine if you are just interested in taking a CommonMark string to a default CommonMark rendering. If you are interested in making systems that integrate CommonMark as a medium that you process and play with you will be better off with cmarkit.

Finally, it should be noted that omd was started in darker times, when no CommonMark specification existed. Having spent a significant amount of time on cmarkit with a specification at hand, one can only appreciate the toughness of the initial effort.

19 Likes

This looks wonderful and I’m looking forward to using and extending it. Thank you for the excellent work!

2 Likes

This looks great! I wish it had been available a year ago when I was choosing between F# and OCaml for my Markdown LSP. Ooph, the temptation is strong! :slight_smile:

I’m curious about extensibility. There are a bunch of CommonMark extensions, like attributes, that people use quite often. Also, what if I want to add a new type of inline, e.g. HashTag of Text.t node? The docs seem to suggest that it’s not a supported use case, is it?

There is no plan to provide an extension mechanism at the parsing level.

Let me quote that in full:

  1. There is no plan to provide an extension mechanism at the parsing level. A lot can already be achieved by using reference resolvers, abusing code fences, post-processing the abstract syntax tree, or extending the renderers.

Regarding the question about adding a new type of inline:

You can add a case to the extensible inline type and process Text.t nodes to recognize them.

The approach has some limitations (notably, you arrive after escaping) but that’s what I used, for example, in the ocamlmark POC to provide a syntax for heading identifiers. See the code here.
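
For concreteness, a hypothetical sketch of that approach (the Hashtag constructor and its payload are illustrative, not part of cmarkit):

(* Hypothetical extension: a new case added to cmarkit's extensible
   inline type, assuming the 'a node = 'a * Meta.t pairing from the docs.
   The constructor name and payload are illustrative only. *)
type Cmarkit.Inline.t += Hashtag of string Cmarkit.node

You then post-process Text.t nodes to split hashtags out into that case and extend your renderer to output it.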

1 Like

OK, I see what you mean here. IIUC, post-processing would require rewriting/splitting nodes or modifying adjacent nodes if you still want a proper syntax tree afterwards. This is workable, but it also feels like it could be more work than just modifying the parser itself? This is not a criticism, I’m rather trying to understand the tradeoffs better.

Thanks for the pointer @dbuenzli!

If you manage to come up with and implement a sound design for an extension mechanism that does that while still keeping the weird corner cases of the existing constructs undisturbed, be my guest… Reading the spec discussion forum and a few CommonMark implementations, it seems no one has a good design for that. Most implementations seem to have a fixed built-in set of (potentially individually) activatable (and incompatible) extensions.

I have a few ideas about where hooks could be added, but once you start thinking about the delicate interplay with the other constructs it gets a bit unwieldy, and as far as I’m concerned I have already spent way too many days of my life on this; I wish you had done it rather than fleeing to F# last year ;–)

Also, personally, I think some of the extensions veer too far from Markdown’s philosophy, which aims at making the source publishable as is, without looking like it has been marked up. I’m mostly interested in using CommonMark with non-technical users and as a templating language (I’m surprised it’s not used more for that, since it has raw HTML built in). So I’m not that interested in those extensions that turn it more into a formal language.

That being said, depending on what you want to do, I really suggest you look into the label resolvers. Basically, by parsing with nested_links:true you can treat the link text syntax as a generalized span construct and the link label as your own DSL: the syntax is extremely lax, and remember that in shortcut reference links the text and the label coincide.

One example of this is in ocamlmark: I simply reserve all the link labels that start with ! to stand for the {!…} constructs of ocamldoc. The resolver is here, in which we recognize these constructs and store them in the metadata. After that, during the link node translation, we translate to the right ocamldoc construct when we hit these specially marked labels.
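
Schematically, such a resolver looks something like this. This is a simplified sketch, not the actual ocamlmark code; the `Ref context shape, Label.key and Label.default_resolver should be checked against the cmarkit docs:

(* Sketch: treat any reference label starting with '!' as our own DSL
   and keep it for later processing; defer everything else to the
   default resolver. Check names and semantics against the docs. *)
let dsl_resolver : Cmarkit.Label.resolver = function
| `Ref (_, ref, _) when String.starts_with ~prefix:"!" (Cmarkit.Label.key ref) ->
    Some ref (* accept the label; we recognize it during AST processing or rendering *)
| ctx -> Cmarkit.Label.default_resolver ctx

let doc_of_md md = Cmarkit.Doc.of_string ~nested_links:true ~resolver:dsl_resolver md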

Another, non-public example I have is support for abbreviation/superscript/subscript, without having to fiddle with Text.t node parsing, via a simple resolver, a mapper and a rendering extension. This syntax:

This is an unexpanded [PDF][!abbr] and square[2][^] or [CO][!abbr][2][_]

This is an inline expanded abbreviation: [PDF](!abbr "Portable Document Format")

This is an expanded abbreviation: [PDF]

[PDF]: !abbr "Portable Document Format"

Renders to

<p>This is an unexpanded <abbr>PDF</abbr> and square<sup>2</sup> or <abbr>CO</abbr><sub>2</sub></p>
<p>This is an inline expanded abbreviation: <abbr title="Portable Document Format">PDF</abbr></p>
<p>This is an expanded abbreviation: <abbr title="Portable Document Format">PDF</abbr></p>

Also note that it composes: you can have an abbr in a link or a link in an abbr (thanks to the non-standard nested_links:true parse). Adding support for [text][.class] or [text][#id] to classify and identify spans would be equally simple.

Other ideas are syntax for data binding and formatting in the style of PHP templating frameworks via shortcut links like [page.date|yyyy-mm-dd] or [page.author]. I think there’s a lot to tap into here; I hope the sample resolver can get you started on how to come up with this.

3 Likes

Thanks for a comprehensive reply @dbuenzli!

Heh, this is a tall order! When it comes to extensible Markdown parsers, I don’t think I could do better. My experience so far has been that when you start adding extensions to the base syntax, things get hairy very quickly. I currently use markdig, which is a powerhouse of Markdown parsing and has an open extension mechanism, but even then you have to be careful about how you arrange your parsing pipeline due to interactions between different parsers.

This looks really cool! Unfortunately, I’m kind of bound to the syntax of existing extensions because that’s what people are used to and expect a Markdown LSP to support.


Anyhow, congrats on the 0.1.0 release! I hope I’ll have a chance to try out cmarkit on a real project.

1 Like