I’m curious what the status is of using Unicode characters in OCaml source code.
I’ve noticed that there was apparently some effort to deprecate Latin1 characters in source (though I might be mistaken!) so that eventually a utf8 encoding for source files could be assumed, but even after a bit of googling I’ve seen no mention of whether there’s an actual move towards utf8 source files and Unicode integration into the language.
I was curious as to whether there’s any plan in that direction that I failed to find with google. (I also noticed nothing about this in the roadmaps in Mantis.)
Apologies in advance if I’m rehashing stuff here that everyone already knows well or if the discussion is otherwise unwanted. If not: if source files were assumed to be in utf8, I can see a number of major places Unicode might show up:
- Unicode characters in comments, e.g.
(* we multiply by π here *)
This seems incredibly straightforward to accommodate, as comments have no semantic meaning to the compiler, and utf8’s encoding ensures that existing comment-parsing code wouldn’t break. It might already work perfectly: my non-systematic tests seemed to work, though I know there are some documentation tools that parse comments and I have no idea how this would impact them.
- Unicode chars in string constants, i.e. "π". This seems pretty straightforward as, if the encoding of strings is assumed to be utf8, this doesn’t change the semantics of the language significantly. Of course, there’s a question as to whether there should be some type of Uchar.t strings available as well, and there’s also a question about whether there’s any harm in not distinguishing utf8 strings from plain ASCII strings in the type system. Some languages (Go, I think?) even guarantee that utf8 strings are validly encoded, since some byte patterns aren’t valid utf8.
- Unicode “characters”. Clearly 'π' is meaningless because a char is an eight bit thing, but having constants of type Uchar.t using some syntax (say u'π' or who knows what) seems appropriate.
- Unicode characters in identifier names, e.g. let π = 3.14159 etc. This seems fairly straightforward as it doesn’t change the semantics of the language in any significant way, though there might be details to consider (such as applying one of the Unicode normalizations to identifiers so they compare identically, and how to deal with scripts without uppercase letters for constructor names).
- Unicode characters in operator names. There’s a wealth of useful Unicode operator symbols available (everything from ∉ to → to ÷), and it might be nice to use them, but given the current scheme in which there’s a very fixed set of symbols that can be used in operators and their associativity and precedence are set by the initial character, it might require significant new syntax to allow such things.
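To make the string and “character” cases above concrete, here is a minimal sketch of what can be expressed with the standard library alone. It assumes OCaml 4.14 or later (for String.is_valid_utf_8; Buffer.add_utf_8_uchar needs 4.06) and is only meant as an illustration, not a proposal for how the language should handle any of this. The names pi_escaped, pi_uchar and encode are made up for the example.

```ocaml
(* Sketch only. Assumes OCaml >= 4.14. *)

(* A string is a sequence of bytes. The UTF-8 encoding of π (U+03C0)
   is the two bytes CF 80, written here with escapes so the example
   does not depend on how this file itself is encoded. *)
let pi_escaped = "\xCF\x80"

(* A Uchar.t holds a Unicode scalar value. There is no literal syntax
   for it in this sketch, so it is built from its code point. *)
let pi_uchar = Uchar.of_int 0x03C0

(* Encode a scalar value as UTF-8 bytes. *)
let encode (u : Uchar.t) : string =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

let () =
  (* The length is counted in bytes, not in characters. *)
  assert (String.length pi_escaped = 2);
  (* Encoding the scalar value gives back the same bytes. *)
  assert (encode pi_uchar = pi_escaped);
  (* The string type does not enforce validity: a lone lead byte is
     an ordinary string value, but it is not valid UTF-8. *)
  assert (String.is_valid_utf_8 pi_escaped);
  assert (not (String.is_valid_utf_8 "\xCF"))
```

The last assertion is the point of the “validly encoded” question: nothing in the string type stops an invalid byte sequence from being constructed, so any validity guarantee would have to come from a separate type or a check like this one.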
(Why do I care at all? I find it irritating that in 2017, even though my editor lets me enter unicode easily, I’m stuck with expressing programs in the characters picked for ASCII over 50 years ago. That said, it also doesn’t cause any real harm as such, so there’s no urgency to fixing it.)
Anyway, I thought I’d ask about all this stuff rather than remain wondering.