[ANN] OTOML 0.9.0 — a compliant and flexible TOML parsing, manipulation, and pretty-printing library

I don’t really like to base a release announcement on bashing another project, but this whole project is motivated by my dissatisfaction with To.ml—the only TOML library for OCaml, so here we go. OTOML is a TOML library that you (hopefully) can use without writing long rants afterwards. :wink:

In short:

  • TOML 1.0-compliant (To.ml is not).
  • Good error reporting.
  • Makes it easy to look up nested values.
  • Bignum and calendar libraries are pluggable via functors.
  • Flexible pretty-printer with indentation.

OPAM: opam - otoml
GitHub: GitHub - dmbaturin/otoml: A TOML(1.0.0-compliant) parsing and manipulation library for OCaml

Now let’s get to details.

TOML is supposed to be human-friendly so that people can use it as a configuration file format. For that, both developer and end-user experience must be great. To.ml provides neither. I’ve been using To.ml in my projects for a long time, and

Standard compliance

TOML is neither minimal nor obvious really, it’s much larger than the commonly used subset and the spec is not consistent and not easy to read, but To.ml fails at rather well-known things, like dotted keys, arrays of tables and heterogeneous arrays.

OTOML passes all tests in the test suite, except the tests related to bignum support. Those tests fail because the default implementation maps integers and floats to the native 31/63-bit OCaml types. More on that later.

Error reporting

Let’s look at error reporting. To.ml’s response to any parse error is a generic error with just line and column numbers.

utop # Toml.Parser.from_string "foo = [" ;;
- : Toml.Parser.result =
`Error
  ("Error in <string> at line 1 at column 7 (position 7)",
   {Toml.Parser.source = "<string>"; line = 1; column = 7; position = 7})

Menhir offers excellent tools for error reporting, so I took time to make descriptive messages for many error conditions (there are generic “syntax error” messages still, but that’s better than nothing at all).

utop # Otoml.Parser.from_string_result "foo = [" ;;
- : (Otoml.t, string) result =
Error
 "Syntax error on line 1, character 8: Malformed array (missing closing square bracket?)\n"

utop # Otoml.Parser.from_string_result "foo = {bar " ;;
- : (Otoml.t, string) result =
Error
 "Syntax error on line 1, character 12: Key is followed by end of file or a malformed TOML construct.\n"

Looking up nested values

Nested sections are common in configs and should be easy to work with. This is how you do it in OTOML:

utop # let t = Otoml.Parser.from_string "[this.is.a.deeply.nested.table]
answer=42";;
val t : Otoml.t =
  Otoml.TomlTable
   [("this",
     Otoml.TomlTable...

utop # Otoml.find t Otoml.get_integer ["this"; "is"; "a"; "deeply"; "nested"; "table"; "answer"] ;;
- : int = 42

For comparison, this is how it was done in To.ml:

utop # let toml_data = Toml.Parser.(from_string "
[this.is.a.deeply.nested.table]
answer=42" |> unsafe);;
val toml_data : Types.table = <abstr>

utop # Toml.Lenses.(get toml_data (
  key "this" |-- table
  |-- key "is" |-- table
  |-- key "a" |-- table
  |-- key "deeply" |-- table
  |-- key "nested" |-- table
  |-- key "table" |-- table
  |-- key "answer"|-- int ));;
- : int option = Some 42

Extra dependencies

The TOML spec includes first-class RFC3339 dates, for better or worse. The irony is that most uses of TOML (and, indeed, most configuration files in the world) don’t need that, so it’s arguably a feature bloat—but if we set out to support TOML as it’s defined, that question is academic.

The practical implication is that if the standard library of a language doesn’t include a datetime type, a TOML library has to decide how to represent those values. To.ml makes ISO8601 a hard dependency, so if you don’t use dates, you end up with a useless dependency. And if you prefer another library (or need functionality no present in ISO8601), you end up with two libraries: one you chose to use, and one more forced on you.

Same goes for the arbitrary precision arithmetic. Most configs won’t need it, but the standard demands it, so something needs to be done.

Luckily, in the OCaml land we have functors, so it’s easy to make all these dependencies pluggable. So I made it a functor that takes three modules.

module Make (I : TomlInteger) (F : TomlFloat) (D : TomlDate) : 
  TomlImplementation with type toml_integer = I.t and type toml_float = F.t and type toml_date = D.t

This is how to use Zarith for big integers and keep the rest unchanged:

(* No signature ascription:
   `module BigInteger : Otoml.Base.TomlInteger` would make the type t abstract,
   which is inconvenient.
 *)
module BigInteger = struct
  type t = Z.t
  let of_string = Z.of_string
  let to_string = Z.to_string
  let of_boolean b = if b then Z.one else Z.zero
  let to_boolean n = (n <> Z.zero)
end

module MyToml = Otoml.Base.Make (BigInteger) (Otoml.Base.OCamlFloat) (Otoml.Base.StringDate)

Printing

To.ml’s printer can print TOML at you, that’s for certain. No indentation, nothing to help you navigate nested values.

utop # let toml_data = Toml.Parser.(from_string "[foo.bar]\nbaz=false\n [foo.quux]\n xyzzy = [1,2]" |> unsafe) |> Toml.Printer.string_of_table |> print_endline;;
[foo.bar]
baz = false
[foo.quux]
xyzzy = [1, 2]

We can do better:

utop # let t = Otoml.Parser.from_string "[foo.bar]\nbaz=false\n [foo.quux]\n xyzzy = [1,2]" |> Otoml.Printer.to_channel ~indent_width:4 ~collapse_tables:false stdout;;

[foo]

[foo.bar]
    baz = false

[foo.quux]
    xyzzy = [1, 2]
val t : unit = ()

utop # let t = Otoml.Parser.from_string "[foo.bar]\nbaz=false\n [foo.quux]\n xyzzy = [1,2]" |> Otoml.Printer.to_channel ~indent_width:4 ~collapse_tables:false ~indent_subtables:true stdout;;

[foo]

    [foo.bar]
        baz = false

    [foo.quux]
        xyzzy = [1, 2]
val t : unit = ()

Maintenance practices

Last but not least, good maintenance practices are also important, not just good code. To.ml is at 7.0.0 now. It has a CHANGES.md file, but I’m still to see the maintainers document what the breaking change is, who’s affected, and what they should do to make their code compatible.

For example, in 6.0.0 the breaking change was a rename from TomlLenses to Toml.Lenses. In an earlier release, I remember the opposite rename. Given the standard compatibility problems going unfixed for years, that’s like rearranging furniture when the roof is leaking.

I promise not to do that.

Conclusion

I hope this library will help make TOML a viable configuration file format for OCaml programs.

It’s just the first version of course, so there’s still room for improvement. For example, the lexer is especially ugly: due to TOML being highly context-sensitive, it involves massive amounts of lexer hacks for context tracking. Maybe ocamllex is a wrong tool for the job abd it should be replaced with something else (since I’m using Menhir’s incremental API anyway, it’s not tied to any lexer API).

The printer is also less tested than the parser, so there may be unhandled edge cases. It also has some cosmetic issues like newlines between parent and child tables.

Any feedback and patches are welcome!

20 Likes

I like the nested fields example, this kind of find function should become more common for structured data formats.

Maybe ocamllex is a wrong tool for the job abd it should be replaced with something else

Have a look at sedlex, which lets you write lexers as regular functions with PPX, and also handles Unicode.

Also, did you find Menhir’s error reporting features lacking in any way? For future reference.

this kind of find function should become more common for structured data formats

Ezjsonm also has a similar function, and it’s very handy.

Also, did you find Menhir’s error reporting features lacking in any way? For future reference.

Merging and updating .messages files is not as smooth as I hope it eventually will be, but it’s improving. Also, mapping token sequences to states in your mind isn’t always easy, but that’s a more fundamental issue.

Have a look at sedlex, which lets you write lexers as regular functions with PPX, and also handles Unicode.

I looked at it a few times and I couldn’d figure out whether it can make context tracking simpler (or indeed allow it in the same way as I did with ocamllex).

The problem is best demonstrated by this string: [true]. In the “top level context”, it’s a header of a table named “true”. However, in foo = [true], it’s an array with a single boolean item true.

Right now, the tracking required to take that into account is done with a mutable reference in the lexer that is updated from inside lexing rules. That is pure ugly ugliness, but “correct” beats “pretty” any time. The question is whether any tool can make it correct and pretty at the same time.

I’m afraid you’d still need a mutable reference for tracking context in sedlex, but at least you can match on the context before matching on the lexer buffer. Depending on the grammar, it may be a win in readability or not improve at all. Sadly there’s no state feedback from Menhir to the lexer.

I recently tried writing a recursive ascent-descent parser with sedlex and I was pleasantly surprised by how readable and flexible it was: no build steps, no lexer hacks (or lexer at all) and no messages file, but then you miss on the correctness checks and diagnostics you get from a generator, so I’m not convinced either way.

In a completely handwritten lexer I could make context tracking pure by making it a part of the state passed as an argument between calls.

Perhaps a hypothetical “sedlex2” could facilitate that by allowing the user to pass a state variable and a state_update function. Now I’m wondering if I can achieve the same by using sedlex’s primitive matchers and a hand-written driver…

I think you can do that fine with sexlex by hand, at least. Here is an example of a sexlex usage I did before:

But maybe I’m not quite getting what you’re saying.

module Token = struct
  type token =
    | Word   of string
    | Phrase of string
    | Spread
    | Op
  [@@deriving show]

  type t = token list [@@deriving show]
end

module Tb : sig
  type t

  val create : unit -> t

  val add : Token.token -> t -> t

  val build : t -> Token.t
end = struct
  type t = Token.t

  let create () = []

  let add v t = v :: t

  let build t = List.rev t
end

type err =
  [ `Premature_end
  | `Error
  ]
[@@deriving show]

open Token

let word = [%sedlex.regexp? Star (Sub (any, Chars " :.\"'"))]

let phrase_double = [%sedlex.regexp? Star (Sub (any, Chars "\"\\"))]

let phrase_single = [%sedlex.regexp? Star (Sub (any, Chars "'\\"))]

let rec token bldr buf =
  match%sedlex buf with
    | ' '  -> token bldr buf
    | '"'  -> phrase_double_quote bldr buf ""
    | '\'' -> phrase_single_quote bldr buf ""
    | eof  -> Ok (Tb.build bldr)
    | word ->
        let word = Sedlexing.Utf8.lexeme buf in
        post_word (Tb.add (Word word) bldr) buf
    | _    -> Error `Error

and phrase_double_quote bldr buf acc =
  match%sedlex buf with
    | '"'           ->
        let word = Sedlexing.Utf8.lexeme buf in
        let word = acc ^ CCString.sub word 0 (CCInt.max 0 (CCString.length word - 1)) in
        token (Tb.add (Word word) bldr) buf
    | "\\\""        ->
        let acc = acc ^ "\"" in
        phrase_double_quote bldr buf acc
    | phrase_double ->
        let acc = acc ^ Sedlexing.Utf8.lexeme buf in
        phrase_double_quote bldr buf acc
    | _             -> Error `Error

and phrase_single_quote bldr buf acc =
  match%sedlex buf with
    | '\''          ->
        let word = Sedlexing.Utf8.lexeme buf in
        let word = acc ^ CCString.sub word 0 (CCInt.max 0 (CCString.length word - 1)) in
        token (Tb.add (Word word) bldr) buf
    | "\\'"         ->
        let acc = acc ^ "'" in
        phrase_single_quote bldr buf acc
    | phrase_single ->
        let acc = acc ^ Sedlexing.Utf8.lexeme buf in
        phrase_single_quote bldr buf acc
    | _             -> Error `Error

and post_word bldr buf =
  match%sedlex buf with
    | ':'  -> token (Tb.add Op bldr) buf
    | ".." -> post_spread (Tb.add Spread bldr) buf
    | ' '  -> token bldr buf
    | eof  -> Ok (Tb.build bldr)
    | _    -> Error `Error

and post_spread bldr buf =
  match%sedlex buf with
    | ' '  -> token (Tb.add (Word "") bldr) buf
    | word ->
        let word = Sedlexing.Utf8.lexeme buf in
        token (Tb.add (Word word) bldr) buf
    | eof  -> token (Tb.add (Word "") bldr) buf
    | _    -> Error `Error

let tokenize s = token (Tb.create ()) (Sedlexing.Utf8.from_string s)

A nice trick I’ve used several time is fuzzing / property-based testing of the parser and printer together:

  1. pick a property-based testing library (typically QCheck or Crowbar)
  2. write a generator for the parsed values
  3. check that generating a value, then printing it, then parsing it again, you get the same value as a result

This will typically find bugs in either the printer, parser, or value generator.

9 Likes

Sedlex needs a bit of love I think. Among other things, it needs a slightly richer set of regexps and the ability to use a more conventional regexp syntax. Mechanisms to allow the lexer to be used in a more functional way might also be good.