Hi @jhw, I am aware of the difference between graphemes (whose official name is grapheme clusters in the Unicode standard) and code points, and I literally meant code points (or more precisely scalar values). This is also why I avoided the word “character” because it is ultimately confusing. However, I appreciate your explanation, which might be helpful for people who are less familiar with the Unicode standard.
In any case, my defense is that many (useful!) grapheme clusters can be written out with only one code point, and that the confusion you mentioned can already happen now with string literals. That is, an OCaml programmer can already confuse themselves with strings containing similar-looking grapheme clusters.
To clarify, I also want to write just '\x288F' instead of Uchar.of_int 0x288F. @jhw I see your point about ~: (so that it became ~: 0x288F) but I still feel it’s not as elegant as it could be ('\x288F').
Let me bring up a concrete example showing the potential usefulness of the proposal:
Unicode has subscripts from ₀ (U+2080) to ₉ (U+2089) that are suitable for creating fresh variables (in proof assistants when shadowing could happen). When parsing the subscript numbers (in order to check whether the user is already using them), I wanted to write
match c with
| '₀' .. '₉' -> ... (* Unicode numeric subscripts *)
| ...
as a straightforward pattern matching instead of
let code = Uchar.to_int c in
if 0x2080 (* ₀ *) <= code && code <= 0x2089 (* ₉ *)
then ... (* Unicode numeric subscripts *)
else ...
I feel the code with pattern matching is obviously more readable; the other version is incomprehensible without comments. The fundamental issue is that there’s not even a way to write down “simple” code points such as 'π'; one must write Uchar.of_int 0x03C0 instead. Parsing seems to show the inconvenience best.
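For reference, here is a minimal runnable sketch of what the workaround looks like with today's standard library. The names `pi` and `subscript_digit` are hypothetical and chosen only for illustration; only `Uchar.of_int` and `Uchar.to_int` come from the standard library.

```ocaml
(* Hypothetical helpers for illustration; not part of any library. *)

(* U+03C0 GREEK SMALL LETTER PI, which currently cannot be written
   as a character literal of type Uchar.t. *)
let pi : Uchar.t = Uchar.of_int 0x03C0

(* Returns [Some n] when [c] is one of the subscript digits
   U+2080..U+2089, and [None] otherwise. *)
let subscript_digit (c : Uchar.t) : int option =
  let code = Uchar.to_int c in
  if 0x2080 (* ₀ *) <= code && code <= 0x2089 (* ₉ *)
  then Some (code - 0x2080)
  else None

let () =
  assert (subscript_digit (Uchar.of_int 0x2083) = Some 3); (* ₃ *)
  assert (subscript_digit pi = None)
```

Every code point has to be spelled as a hexadecimal number, so the comments carry all of the readability.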
It is true that Unicode itself is confusing and there are lots of traps and pitfalls, but we already allow programmers to write string literals such as "π". It is also true that people who are not familiar with Unicode might be shocked to learn that there is not even an upper limit on how many code points you can have in one grapheme cluster (!), but I believe a reasonable compiler error message for clusters with multiple code points should be enough. For example, the compiler can say '🏳️‍🌈' is not allowed because it consists of four code points: U+1F3F3 (the white flag), '\xFE0F' (variation selector-16), '\x200D' (zero-width joiner), and '🌈' (U+1F308). Perhaps it can even suggest a canonical composition if a programmer accidentally used more code points than they should. I would also argue that perhaps users of Uchar.t should have some basic ideas about Unicode.
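To give an idea of how cheap such a diagnostic would be, here is a rough sketch of the check written as ordinary library code. It assumes OCaml 4.14 or later (for `String.get_utf_8_uchar`), and `scalar_values` and `check_literal` are hypothetical names; this is not how the compiler's lexer is actually structured.

```ocaml
(* Requires OCaml 4.14+ for String.get_utf_8_uchar. *)

(* Decode a UTF-8 string into the list of its scalar values.
   Invalid byte sequences decode to U+FFFD, as the stdlib does. *)
let scalar_values (s : string) : Uchar.t list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

(* Hypothetical check on the contents of a character literal:
   accept exactly one scalar value, otherwise list what was found. *)
let check_literal (s : string) : (Uchar.t, string) result =
  match scalar_values s with
  | [ u ] -> Ok u
  | us ->
      let codes =
        List.map (fun u -> Printf.sprintf "U+%04X" (Uchar.to_int u)) us
      in
      Error
        (Printf.sprintf "literal consists of %d code points: %s"
           (List.length us) (String.concat ", " codes))

let () =
  (* π is a single scalar value, so it is accepted. *)
  assert (check_literal "\u{03C0}" = Ok (Uchar.of_int 0x03C0));
  (* The rainbow flag is a sequence of several scalar values. *)
  match check_literal "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}" with
  | Ok _ -> assert false
  | Error msg -> print_endline msg
```

Suggesting a canonical composition would need normalization data, which the standard library does not provide, so that part is left out of the sketch.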