Unicode in OCaml source code?


#1

I’m curious what the status is of using Unicode characters in OCaml source code.

I’ve noticed that there was apparently some effort to deprecate Latin1 characters in source (though I might be mistaken!) so that eventually a utf8 encoding for source files could be assumed, but even after a bit of googling I’ve seen no mention of whether there’s an actual move towards utf8 source files and Unicode integration into the language.

I was curious as to whether there’s any plan in that direction that I failed to find with google. (I also noticed nothing about this in the roadmaps in Mantis.)

Apologies in advance if I’m rehashing stuff here that everyone already knows well or if the discussion is otherwise unwanted. If not: if source files were assumed to be in utf8, I can see a number of major places Unicode might show up:

  1. Unicode characters in comments, e.g. (* we multiply by π here *)
    This seems incredibly straightforward to accommodate as comments have no semantic meaning to the compiler, and utf8’s encoding assures that existing comment parsing code wouldn’t break. It might already work perfectly; my non-systematic tests seemed to work, though I know there are some documentation tools that parse comments and I have no idea how this would affect them.
  2. Unicode chars in string constants, i.e. "π". This seems pretty straightforward as, if the encoding of strings is assumed to be utf8, this doesn’t change the semantics of the language significantly. Of course, there’s a question as to whether there should be some type of Uchar.t strings available as well, and there’s also a question about whether there’s any harm in not distinguishing utf8 strings from plain ASCII strings in the type system. Some languages (Go, I think?) even guarantee that utf8 strings are validly encoded since some byte patterns aren’t valid utf8.
  3. Unicode “characters”. Clearly 'π' is meaningless because a char is an eight bit thing, but having constants of type Uchar.t using some syntax (say u'π' or who knows what) seems appropriate.
  4. Unicode characters in identifier names, e.g. let π = 3.14159 etc. This seems fairly straightforward as it doesn’t change the semantics of the language in any significant way, though there might be details to consider (such as applying one of the Unicode normalizations to identifiers so they compare identically, and how to deal with scripts without uppercase letters for constructor names).
  5. Unicode characters in operator names. There’s a wealth of useful Unicode operator symbols available (everything from ∉ to → to ÷), and it might be nice to use them, but given the current scheme in which there’s a very fixed set of symbols that can be used in operators and their associativity and precedence are set by the initial character, it might require significant new syntax to allow such things.
    (Why do I care at all? I find it irritating that in 2017, even though my editor lets me enter unicode easily, I’m stuck with expressing programs in the characters picked for ASCII over 50 years ago. That said, it also doesn’t cause any real harm as such, so there’s no urgency to fixing it.)
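To make points 1 to 3 concrete, here is a minimal sketch. It assumes the file is saved as utf8 and a compiler recent enough (4.06 or later, as far as I know) to have the \u{...} string escape and Buffer.add_utf_8_uchar. The names pi_literal, pi_escaped, pi_uchar and encode_utf8 are mine, purely for illustration.

```ocaml
(* we multiply by π here: non-ASCII bytes inside a comment are just skipped *)

(* A string literal holds whatever bytes the source file holds.
   In a utf8 file, π is the two bytes 0xCF 0x80. *)
let pi_literal = "π"

(* The same two bytes written with an escape, so the source stays ASCII. *)
let pi_escaped = "\u{3C0}"

(* There is no literal syntax for Uchar.t here, so build it from the
   code point U+03C0. *)
let pi_uchar : Uchar.t = Uchar.of_int 0x03C0

(* Encode a single Uchar.t as a utf8 string. *)
let encode_utf8 (u : Uchar.t) : string =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

let () =
  (* String.length counts bytes, not characters. *)
  Printf.printf "%s is %d bytes long\n" pi_literal (String.length pi_literal)
```

Note that the string type itself knows nothing about the encoding: String.length counts bytes, which is exactly the "utf8 versus plain ASCII" question raised in point 2.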

Anyway, I thought I’d ask about all this stuff rather than remain wondering.


#2

Just a comment to 4:

IIRC, OCaml distinguishes identifiers that start with upper case letters (for modules etc…) from those that start with lowercase letters (local var…). Unicode characters are either upper case, lower case, title case, or don’t have case, so we need to think about how to handle the latter two cases. Also, these definitions are table-driven, so incorporating such a distinction would be an implementation burden.

Or, we change the parsing rule so that using upper/lowercase is just a convention, not enforced by the parser. This seems the right direction IMHO.
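A rough sketch of why this matters, assuming a simplified ASCII-only version of the rule (the real lexer has more cases; ident_class and classify_ident are hypothetical names, not anything from the compiler):

```ocaml
(* Simplified model: the class of an identifier is decided by its first
   byte only, and only ASCII letters and underscore are recognised. *)
type ident_class =
  | Capitalized   (* constructors, modules, ... *)
  | Lowercase     (* values, type names, ... *)
  | Unclassified  (* first byte is not an ASCII letter or underscore *)

let classify_ident (s : string) : ident_class =
  if s = "" then Unclassified
  else
    match s.[0] with
    | 'A' .. 'Z' -> Capitalized
    | 'a' .. 'z' | '_' -> Lowercase
    | _ -> Unclassified

let () =
  (* The first byte of a utf8-encoded π is 0xCF, which this rule cannot
     classify without a Unicode case table. *)
  match classify_ident "\u{3C0}" with
  | Unclassified -> print_endline "pi: no case under the ASCII rule"
  | Capitalized | Lowercase -> print_endline "pi: classified"
```

Anything beyond ASCII lands in the third branch, so the language would need either case tables for the cased scripts or a relaxed rule for the uncased ones.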


#3

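Yah, I mentioned that issue in point 4 of my original post.

You suggested making the upper/lower case distinction a convention rather than something the parser enforces.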

Another option is just to relax the rule only for character sets in which there is no upper case, so it would be retained for Roman and Greek character sets etc., but would not apply if something was written in Chinese characters.


#4

For operators, one note:

Since having vastly more characters for operators available might mean that precedence, associativity, etc. might need declarations to maintain sanity, doing something better than the past might be nice.

I’ve always found the way that ML, Coq, Haskell etc. declare precedence to be unreasonable. You say something like “this operator is at level 5”, but there’s nothing natural in the human brain that says immediately to you what that means.

One thing that has always occurred to me is that it would be reasonable to declare something as having the same, or higher, or lower (or both) precedence than some other operator or operators, so that if you read the declaration, you say, “aha! + and - have lower precedence than × and ÷, which in turn have lower precedence than ~-!”, which seems far more natural than “hrm, this has precedence level 7, what does that mean?”

Strawman syntax might be something derived from ML and Haskell like

infixl (= ∈) ∉

which would mean "∉ has the same precedence as ∈"; one could also have multiple clauses like (> +; < √) to indicate a precedence greater than the + operator but below the √ operator.
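To show that relative declarations carry enough information to compare two operators, here is a toy model in plain OCaml. Nothing here is real syntax or a real compiler feature: the relation type, the decls table and binds_tighter are all made up for this sketch, and it does not check the declarations for contradictions.

```ocaml
(* One clause of a hypothetical relative precedence declaration. *)
type relation =
  | Same_as of string
  | Tighter_than of string
  | Looser_than of string

(* Hypothetical declarations: each operator is placed relative to others. *)
let decls : (string * relation list) list = [
  "-", [ Same_as "+" ];
  "×", [ Tighter_than "+" ];
  "÷", [ Same_as "×" ];
  "~-", [ Tighter_than "×" ];
  "∉", [ Same_as "∈" ];
]

(* Eq (a, b): same level.  Gt (a, b): a binds tighter than b. *)
type edge = Eq of string * string | Gt of string * string

let edges : edge list =
  List.concat
    (List.map
       (fun (op, rels) ->
         List.map
           (function
             | Same_as o -> Eq (op, o)
             | Tighter_than o -> Gt (op, o)
             | Looser_than o -> Gt (o, op))
           rels)
       decls)

(* Neighbours of a node; the flag records whether the step was strict. *)
let successors (node : string) : (string * bool) list =
  List.concat
    (List.map
       (function
         | Eq (x, y) when x = node -> [ (y, false) ]
         | Eq (x, y) when y = node -> [ (x, false) ]
         | Gt (x, y) when x = node -> [ (y, true) ]
         | _ -> [])
       edges)

(* a binds tighter than b if b is reachable from a through same-level
   and tighter-than steps, using at least one tighter-than step. *)
let binds_tighter (a : string) (b : string) : bool =
  let rec go visited = function
    | [] -> false
    | (n, strict) :: rest ->
      if n = b && strict then true
      else if List.mem (n, strict) visited then go visited rest
      else
        let next =
          List.map (fun (m, s) -> (m, strict || s)) (successors n)
        in
        go ((n, strict) :: visited) (next @ rest)
  in
  go [] [ (a, false) ]

let () =
  Printf.printf "÷ tighter than -: %b\n" (binds_tighter "÷" "-")
```

With these declarations, ÷ is found to bind tighter than - without either of them ever being given a number.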

Also, this would necessitate having some way of allowing modules to export fixity and precedence information (the fact that ML can’t do that to my knowledge is irritating.)

Again, none of this should be taken too seriously. It might be a really terrible idea.


#5

OCaml operator precedence is derived automatically from the first character of the operator’s symbol: https://caml.inria.fr/pub/docs/manual-caml-light/node4.9.html
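A small sketch of that rule. The operators +!, *! and **! are made up for illustration; each one just builds a string showing how the expression was grouped.

```ocaml
(* Three user-defined operators whose only job is to show grouping. *)
let ( +! ) a b = "(" ^ a ^ " +! " ^ b ^ ")"
let ( *! ) a b = "(" ^ a ^ " *! " ^ b ^ ")"
let ( **! ) a b = "(" ^ a ^ " **! " ^ b ^ ")"

(* An operator starting with an asterisk binds tighter than one starting
   with a plus sign, whatever characters follow. *)
let mixed = "a" +! "b" *! "c"

(* Operators starting with a plus sign associate to the left. *)
let left = "a" +! "b" +! "c"

(* Operators starting with two asterisks associate to the right. *)
let right = "a" **! "b" **! "c"

let () =
  List.iter print_endline [ mixed; left; right ]
```

No declaration is involved anywhere: the grouping follows entirely from the leading characters of each operator name.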


#6

Yes. I’m well aware, as I indicated above. However, if a few hundred new symbols are added to the operator repertoire, that approach would seem not to scale well. It is easy to remember precedence and associativity for a handful of prefix symbols, but not that easy to remember hundreds.

That’s why I’m suggesting that, if a large number of new symbols are added, a declaration syntax might be a good idea. On the other hand, maybe it isn’t a great idea. After all, traditionally, OCaml hasn’t done things that way.


#7

Indeed. Sorry, bad reading on my part. For now I can recommend using your editor or font ligatures to render compound operators in more aesthetically pleasing ways. This is backward-compatible but also looks nice. Fira Code and Hasklig are really good. Iosevka is my personal choice.


#8

I am a bit of a font snob, and I absolutely loathe code ligatures.

But my real question was a long term one. What is the intent of the OCaml development community about Unicode support in source code? My woolgathering about where it might show up was not really as interesting to me as learning what plans people might already have.


#9

If you really want random opinions on this, then I would say that I wouldn’t mind #2 and #3 in the language, as part of support for OCaml programs readily being able to manipulate text written in other languages. For #1, #4, and #5 and fancy symbols in code I would prefer legal and civil penalties and a public registry of those convicted of having engaged in such anti-social behavior.

I expect though that I’ll see fancy symbols in code for #1, #4 and #5, because it’s actually sufficient for a build system to have a translation step. OCaml’s own lack of support doesn’t much matter except as far as the translation step eventually becoming unnecessary. Of course, if everyone who would be interested in fancy symbols in code persists in framing their interests as including text written in other languages, they’ll persist in not seeing build system extensions as a path that is open to them, since it is obviously an absurd answer to the other issue.
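As a rough idea of what such a translation step could look like, here is a deliberately naive sketch. The symbol table and the names replace_all and translate are invented for this example, and it rewrites blindly, including inside string literals and comments, which a real preprocessor would have to avoid.

```ocaml
(* Replace every occurrence of [sub] in [s] with [by].  Matching on bytes
   is safe for utf8 because one encoded character never appears inside
   the encoding of another. *)
let replace_all ~(sub : string) ~(by : string) (s : string) : string =
  let n = String.length s and m = String.length sub in
  let b = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if m > 0 && i + m <= n && String.sub s i m = sub then begin
      Buffer.add_string b by;
      go (i + m)
    end else begin
      Buffer.add_char b s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents b

(* Hypothetical mapping from fancy symbols to existing ASCII tokens. *)
let table = [ "→", "->"; "≠", "<>"; "≤", "<="; "÷", "/" ]

(* Apply every substitution in the table, one after the other. *)
let translate (src : string) : string =
  List.fold_left (fun acc (sub, by) -> replace_all ~sub ~by acc) src table

let () = print_endline (translate "fun x → x ÷ 2")
```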

If you are more interested in an idea of what to expect from releases of ocaml, try searching for prior discussion on the mailing lists. That’s not a guarantee but it can inform your expectations.


#10

Totally off topic, but out of curiosity, what don’t you like about code ligatures?


#11

I tried searching for such things and had difficulty finding them, perhaps because google finds too many unrelated hits. This is why I asked.


#12

That’s a long discussion, but the summary is, a combination of “physically ugly” (I’m a typography geek), “violates expectations”, and “if I want special symbols a la Unicode, why not use Unicode?” — but I think this is too much of an aside and probably would be better discussed elsewhere.


#13

To my knowledge the most advanced work on using Unicode in OCaml source files is @whitequark’s ocaml-m17n project, which provides a Unicode-aware frontend to the OCaml compiler (if I remember correctly, the implementation technique is to act as a syntax preprocessor, but that is very transparent to users).

The idea to move the upstream compiler in that direction has been discussed a few times already (and (1) and (2) in the list are already supported), but while there is no strong opposition, there is a certain inertia and cold-feetness about doing things we would regret later. I think that wider usage of ocaml-m17n by interested people and feedback on use-cases could help move the discussion forward.

(There are two questions, (a) using unicode in source files and (b) language and library support for writing unicode-manipulating programs. I think that it’s best to separate them and keep this topic about (a), although points (2) and (3) of @perry’s list are also related to (b).)