String templating and internationalization packages for OCaml

[This is a post about two different-but-related things, so it is perhaps a
little disjointed, and somewhat stream-of-consciousness too.]

It’s about a templating idea (for Printf/Fmt-like formatting), and
then how to add internationalization on top of that.

I’ve been writing code that does a lot of formatted text
(specifically, in the context of better error-messages), and thinking
about how that could be made easier-to-do. I use Fmt a lot, b/c
even when not using Format operations, it’s very compact and the
combinator-based approach is efficient on the brain. But it seems
like (with a bit of a front-end) it could be better. I was thinking
about the following code:

Fmt.(str "File %s: mixed short/medium/long-form attributes in str_item:\n  short: %a\n  medium: %a\n  long: %a"
       filename
       (list string) used_short_form_attributes
       (list string) used_medium_form_attributes
       (list string) used_long_form_attributes)

and how it separates the format-specifier (%a), the formatter
((list string)) and the actual value being formatted
(used_short_form_attributes). And wondered: maybe it might be nicer
if it were

{%template|[id: mixed-attributes-in-str-item]
File $filename$:$%d:line$: mixed short/medium/long-form attributes in str_item:
  short: $list string:used_short_form_attributes$
  medium: $list string:used_medium_form_attributes$
  long: $list string:used_long_form_attributes$|}

The idea being, you specify each bit of ocaml value that’s going to be
formatted between $:

  • If it’s a string, that’s it – just the expression (implied %s as
    the format-specifier)

  • If it’s any other value that is not going to be processed by a
    formatter, you give the format-specifier and the expression,
    e.g. $%d:line$

  • and if it’s going to be processed by a formatter, %a is implicit,
    so you just provide the formatter and the expression, e.g. $list string:used_short_form_attributes$.

[I hope it’s obvious this is easy-to-parse. Obviously $ is a
special char, as is :.]
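
To make the "easy-to-parse" claim concrete, here is a minimal sketch of a parser for the three hole forms listed above. The names (hole, piece, classify, parse) are hypothetical, and it assumes no escaping: every $ delimits a hole, the first : inside a hole splits it, and unbalanced $ characters are not detected.

```ocaml
(* One hole between two '$' characters, classified by the three rules. *)
type hole =
  | Str of string                 (* $expr$          : implied %s *)
  | Spec of string * string       (* $%d:expr$       : explicit specifier *)
  | Formatter of string * string  (* $fmt args:expr$ : implied %a *)

type piece = Text of string | Hole of hole

(* Classify the text found between two '$' characters. *)
let classify (s : string) : hole =
  match String.index_opt s ':' with
  | None -> Str s
  | Some i ->
    let left = String.sub s 0 i in
    let right = String.sub s (i + 1) (String.length s - i - 1) in
    if String.length left > 0 && left.[0] = '%' then Spec (left, right)
    else Formatter (left, right)

(* Split on '$': chunks at even positions are literal text, chunks at
   odd positions are holes. Empty text chunks are dropped. *)
let parse (template : string) : piece list =
  String.split_on_char '$' template
  |> List.mapi (fun i chunk ->
         if i mod 2 = 0 then Text chunk else Hole (classify chunk))
  |> List.filter (fun p -> p <> Text "")
```

An expression that itself contains a colon (a type annotation, say) would be misclassified by this sketch, so a real PPX would need an escaping rule or a proper lexer.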

I haven’t written this, and heck, I don’t even know if it’s a good and
scalable idea. I searched the opam database and google, and found
nothing for this sort of application: there are templating languages
for HTML and such, but those are for much larger bits of text.

That’s the first idea.

And then, thinking about it, I realized that perhaps one could go
further and use the fact that a PPX rewriter is processing the
template, to hook into ocaml-gettext? I’ve never used it, but plan to
do so today to figure out how it works; if that doesn’t work, then
something like ocaml-gettext. That is to say, each template could have
an ID (as above, which isn’t actually printed) that gets used to index
into a PO file (or a PO-like file), e.g.

mixed-attributes-in-str-item:
  en_US: """
File $#1$:$#2$: mixed short/medium/long-form attributes in str_item:
  short: $#3$
  medium: $#4$
  long: $#5$
"""
  fr_FR: """
File $#1$:$#2$: attributs mixtes de forme courte/moyenne/longue dans str_item:
  longue: $list frenchquoted_string:#5$
  moyenne: $list frenchquoted_string:#4$
  courte: $list frenchquoted_string:#3$
"""

[please forgive my bad French, I used Google Translate]

(where frenchquoted_string is a Fmt formatter that surrounds its
argument with guillemets (“<<” and “>>”)). And then, the PPX rewriter
would take this PO file as an argument:

  • any message-id that didn’t appear in the file, would get added

  • any message-id whose template text was different from its en_US
    message in the file would get updated (only the en_US message in
    the PO file)

  • the PPX rewrite process would be given a language (e.g. fr_FR)
    and use that to select the message-texts from the PO file that it
    would use in the template. So if you selected fr_FR, the text
    in mixed-attributes-in-str-item would be replaced with the
    French text.
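
A minimal sketch of the three behaviours above, assuming the PO-like file has already been loaded into a map from message-id to per-language texts. Everything here (the catalog type, sync_en_us, select, and falling back to en_US when a language is missing) is an assumption for illustration; it is not ocaml-gettext's API.

```ocaml
module SMap = Map.Make (String)

(* message-id -> (language -> message text) *)
type catalog = string SMap.t SMap.t

(* Add the message-id if it is missing; otherwise overwrite only its
   en_US text with the text found in the source template. *)
let sync_en_us (id : string) (text : string) (c : catalog) : catalog =
  SMap.update id
    (fun entry ->
      let langs = Option.value entry ~default:SMap.empty in
      Some (SMap.add "en_US" text langs))
    c

(* Pick the text for the requested language. Falling back to en_US
   is an assumption of this sketch. *)
let select ~(lang : string) (id : string) (c : catalog) : string option =
  match SMap.find_opt id c with
  | None -> None
  | Some langs ->
    (match SMap.find_opt lang langs with
     | Some text -> Some text
     | None -> SMap.find_opt "en_US" langs)
```

Here sync_en_us would run once per template the rewriter encounters, and select would supply the text that replaces the template body for the chosen language.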

And in that replacement template, the order and formatters for each
expression could be changed:

  • #N (e.g. #2) would refer to the second expression in the
    original template, so you could reorder

  • $#N$ means to use the same formatting instructions as in the
    original template

  • $%d:#N$ means to substitute %d as the formatting instruction
    (perhaps for different justification)

  • $xxyy zz:#N$ means to use a different formatter
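
The substitution rules above could be sketched as follows: given the holes of the original template (numbered from 1) and the text of one hole from the translated template, produce the hole that the rewriter would actually emit. The hole type and the resolve function are hypothetical names, and there is no escaping or error reporting; an unknown index simply yields None.

```ocaml
type hole =
  | Str of string                 (* implied %s *)
  | Spec of string * string       (* explicit specifier, expression *)
  | Formatter of string * string  (* formatter, expression *)

(* The OCaml expression carried by an original hole. *)
let expr_of = function
  | Str e | Spec (_, e) | Formatter (_, e) -> e

(* Parse "#N" into N, or None if the text does not have that shape. *)
let index_of (s : string) : int option =
  let n = String.length s in
  if n >= 2 && s.[0] = '#' then int_of_string_opt (String.sub s 1 (n - 1))
  else None

(* Resolve one translated hole against the original holes. *)
let resolve (originals : hole list) (translated : string) : hole option =
  let lookup r =
    match index_of r with
    | Some n when n >= 1 -> List.nth_opt originals (n - 1)
    | _ -> None
  in
  match String.index_opt translated ':' with
  | None -> lookup translated  (* $#N$: keep the original formatting *)
  | Some i ->
    let left = String.sub translated 0 i in
    let right =
      String.sub translated (i + 1) (String.length translated - i - 1)
    in
    (match lookup right with
     | None -> None
     | Some orig ->
       let e = expr_of orig in
       if String.length left > 0 && left.[0] = '%'
       then Some (Spec (left, e))         (* $%d:#N$: new specifier *)
       else Some (Formatter (left, e)))   (* $fmt:#N$: new formatter *)
```

Because the result still refers to the original expression, reordering is just a matter of the translated template listing its holes in a different order.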

This pretty much requires that you do your internationalization at
build-time, since you’re replacing code, reordering arguments, etc.
I’m not sure if that’s a good idea or not, but it seems appealing. I
wrote earlier that I wondered if I could hook into ocaml-gettext, and
maybe this ability to reorder and change code/formatting breaks that,
but maybe it’d be worth limiting that, in order to be able to use
ocaml-gettext.

Anyway, OK, that’s the half-baked brain fart. I would really
appreciate any comments on this.

ETA: I see that ppx_pyformat comes close to what I want to do. I’ll have to look closely at it.


There are indeed multiple ways to re-implement format strings using ppxlib; I have a few examples on the front-end side at Octachron/ffmt on GitHub (format string experimentations). In particular, I would recommend using %t functions (aka Format.formatter -> unit) rather than %a as the basic building block in an interpolation setting.
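
For readers less familiar with the difference, here is a small sketch using only the standard Format module (not ffmt, whose API is not shown in this thread). The names pp_names, with_a and with_t are made up for the example.

```ocaml
(* A printer for a comma-separated list of strings. *)
let pp_names ppf names =
  Format.pp_print_list
    ~pp_sep:(fun ppf () -> Format.pp_print_string ppf ", ")
    Format.pp_print_string ppf names

(* %a style: each hole consumes two arguments, a printer and a value. *)
let with_a names = Format.asprintf "short: %a" pp_names names

(* %t style: each hole consumes a single argument of type
   Format.formatter -> unit, so every hole has the same type. *)
let with_t names =
  Format.asprintf "short: %t" (fun ppf -> pp_names ppf names)
```

Both functions build the same string; the point of the %t form is that an interpolation front-end only ever has to deal with one kind of hole.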

Concerning internationalization, I have an old prototype of an internationalized compiler that uses format strings directly. It combines ocaml/ocaml pull request #1523 (“Internationalization: a localization hook for compiler-libs”), which replaces uses of the Format.{k,f}printf functions with hooks, and Octachron/babilim (a localization plugin for OCaml), which replaces those hooks with functions that translate the format string at runtime in a well-typed way, while adding support for positional specifiers for the sake of translators.

Extracting the format strings to construct the translation database is done separately, using a compiler-libs AST iterator. Localizing at build-time sounds quite limiting to me. In particular, this means that translators cannot add a new language without support from developers.


For inspiration, you can also have a look at the ppx_string_interpolation library open-sourced by Bloomberg.

What’s hard to do with strings and the standard formatting machinery might be easier to do with a PPX.

There were several libraries I found (though I didn’t find @dbuenzli's); of them, only ppx_pyformat had support for format-specifiers for each value being interpolated.

For sure, a PPX rewriter is the way to go. I take @octachron's advice that compile-time internationalization is a hard sell, but I don’t see how to do two things:

(1) [a minor concern] allow change in the order of interpolated values

(2) [a major concern] allow change in the format-specifier. For instance, when changing between en_US and fr_FR, you’d want to change the way that dates and times are formatted. I don’t see how that happens without changing the format-conversion function.

But this doesn’t change the fact that, for internationalizations that do NOT change format-specifiers, it should be straightforward for translators to only add a new PO file. That PO file can be statically checked against the existing default PO file, and a passing static check should ensure that the new internationalization builds fine.

But really, I need to write the thing to see if this all works out.

This is implemented in babilim by making the translation layer use positional specifiers internally:
"This %d argument is a %s" is normalized to "This %d#0 argument is a %s#1", and the po files may change the order explicitly with "%s#1 is the %d#0 argument" (without any ppx).

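To illustrate the normalization step only, here is a rough sketch that works on plain strings. It is not babilim's implementation, which operates on typed format strings; the normalize function is hypothetical and assumes every conversion is a single character after % (no flags, widths or precisions), with %% left alone.

```ocaml
(* Append "#i" after each conversion, numbering from 0 in order of
   appearance. "%%" is copied through and not numbered. *)
let normalize (fmt : string) : string =
  let n = String.length fmt in
  let buf = Buffer.create (n + 8) in
  let rec go i count =
    if i >= n then ()
    else if fmt.[i] = '%' && i + 1 < n then begin
      if fmt.[i + 1] = '%' then begin
        Buffer.add_string buf "%%";
        go (i + 2) count
      end else begin
        Buffer.add_char buf '%';
        Buffer.add_char buf fmt.[i + 1];
        Buffer.add_string buf ("#" ^ string_of_int count);
        go (i + 2) (count + 1)
      end
    end else begin
      Buffer.add_char buf fmt.[i];
      go (i + 1) count
    end
  in
  go 0 0;
  Buffer.contents buf
```

With the positions made explicit like this, a translated string is free to mention them in any order.
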
With string interpolation, you can probably have a “translator name” and use that name as an index. It is even easier if the type of holes is uniform (which is one of the advantages of using %t holes exclusively), but it is doable in a heterogeneous setting.

As far as I can see, this comes automatically once you can translate format strings with reordering:

let date d = I18n.dprintf ~context:"date-ymd" "%d/%d/%d" d.year d.month d.day

Then in the en_US setting the translation of %d/%d/%d will be %d#1/%d#2/%d#0, and later uses of date can use date as a printer (or the interpolation conversion function) and get the updated format:

let today d = Format.dprintf "Today is %t" (date d)
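
A self-contained approximation of the same idea, using only the standard library (Format.dprintf needs OCaml 4.08 or later). The translation lookup is replaced here by an explicit locale argument, which is purely an assumption of this sketch; the locale type and its two cases are made up.

```ocaml
type date_record = { year : int; month : int; day : int }

(* Stand-in for the translation database: the locale picks the order
   in which the three fields are printed. *)
type locale = Ymd | En_us

(* A delayed printer, usable wherever a %t hole is expected. *)
let date (locale : locale) (d : date_record) : Format.formatter -> unit =
  match locale with
  | Ymd -> Format.dprintf "%d/%d/%d" d.year d.month d.day
  | En_us -> Format.dprintf "%d/%d/%d" d.month d.day d.year

(* The caller never sees the field order; it only plugs in the printer. *)
let today (locale : locale) (d : date_record) : string =
  Format.asprintf "Today is %t" (date locale d)
```

The caller's format string stays the same across locales; only the printer behind the %t hole changes.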

After a little thought, I realized that since I’m using Fmt as the actual formatting engine, I can nearly trivially slide in the minimal ocaml-gettext support needed to support i18n the way that package does it. And then, well, I can move on to doing i18n the way I imagine it. Or maybe deciding that I was wrong about that [grin].

Thanks for pushing on this a little: it made me think.

About internationalisation: we are very close to releasing Ocsigen-i18n 2.0, which removes all unnecessary dependencies, making it usable for any OCaml program (client or server).
