Literals for Uchar.t (Unicode code points, more precisely Unicode scalar values)?

I’m really happy that the UTF-8/16 decoder is now part of the standard library! However, I kind of miss a way to write down Unicode code points (more precisely, Unicode scalar values) as literals. That is, I want to be able to write 'よ', '酷' and '😎' as literals of type Uchar.t. (The syntax doesn’t have to be exactly like these—especially if we want to make the parser’s job easier.)

The GitHub issue tracker suggested that I post here to see how people feel about this. :grinning:

3 Likes

(Note: edited for improved clarity. Thanks to @favonia and @dbuenzli for the reminder.)

I’m inclined to think the devil is in the details here.

The details: there is a significant difference between a glyph, i.e. a printable character, and a scalar value.

We’ve already got a not entirely bad way to specify a scalar value by number: Uchar.of_int inlines pretty nicely, and you can use let ( ~: ) = Uchar.of_int if you want a local abbreviation.
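For instance, a minimal sketch of that abbreviation (where subscript_zero is just an illustrative name):

let ( ~: ) = Uchar.of_int

(* Uchar.of_int rejects non-scalar values with Invalid_argument. *)
let subscript_zero = ~: 0x2080 (* U+2080, SUBSCRIPT ZERO *)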

What you seem to be asking for is a way to specify, in one go, a scalar value that represents an entire glyph, which frustratingly does not cover all the graphemes that Unicode can represent, nor is it always an unambiguous way to denote a specific glyph. Furthermore, even if the compiler checks that the glyph you entered between the delimiters is a single scalar value with the grapheme base property (and not a grapheme cluster comprising a sequence of scalar values), there is the not inconsiderable problem that many glyphs look very similar to others and can easily be confused. The problem is significantly larger than when the alphabet is constrained to the Latin-1 printable characters.

Shorter jhw: I’m not sure that a syntax for Unicode scalar value literals would be as useful as it might seem from just a cursory examination.

1 Like

Agree. It would only make sense if we got rid of the “char” in the type name completely and used proper naming: Let's Stop Ascribing Meaning to Code Points - In Pursuit of Laziness

Ideally, Uchar should be deprecated and removed in the long run.

Hi @jhw, I am aware of the difference between graphemes (whose official name is grapheme clusters in the Unicode standard) and code points, and I literally meant code points (or more precisely scalar values). :sweat_smile: This is also why I avoided the word “character” because it is ultimately confusing. However, I appreciate your explanation, which might be helpful for people who are less familiar with the Unicode standard.

In any case, my defense is that many (useful!) grapheme clusters can be written out with only one code point, and that the confusion you mentioned can already happen today with string literals. That is, an OCaml programmer can already confuse themselves with strings containing similar-looking grapheme clusters. To clarify, I also want to write just '\x288F' instead of Uchar.of_int 0x288F. @jhw I see your point about ~: (so that it becomes ~: 0x288F), but I still feel it’s not as elegant as it could be (\x288F).

Let me bring up a concrete example showing the potential usefulness of the proposal:

Unicode has subscript digits from ₀ (U+2080) to ₉ (U+2089) that are suitable for creating fresh variables (in proof assistants, when shadowing could happen). When parsing the subscript numbers (in order to check whether the user is already using them), I wanted to write

match c with
| '₀' .. '₉' -> ... (* Unicode numeric subscripts *)
| ...

as a straightforward pattern matching instead of

let code = Uchar.to_int c in
if 0x2080 (* ₀ *) <= code && code <= 0x2089 (* ₉ *)
then ... (* Unicode numeric subscripts *)
else ...

I feel the code with pattern matching is obviously more readable; the other is incomprehensible without comments. The fundamental issue is that there’s not even a way to write down “simple” code points such as 'π': one must write Uchar.of_int 0x03C0 instead. Parsing seems to show the inconvenience best.

It is true that Unicode itself is confusing and full of traps and pitfalls, but we already allow programmers to write string literals such as "π". It is also true that people who are not familiar with Unicode might be shocked to learn that there is not even an upper limit on how many code points a single grapheme cluster can contain (!), but I believe a reasonable compiler error message for clusters with multiple code points should be enough. For example, the compiler could say that '🏳️‍🌈' is not allowed because it consists of several code points: '🏳️', '\x200D', '🌈'. Perhaps it could even suggest a canonical composition if a programmer accidentally used more code points than they should. I would also argue that users of Uchar.t should probably have some basic ideas about Unicode. :wink:

1 Like

@XVilka Hah, I also do not like the name Uchar because “character” is indeed very confusing! :sweat_smile: However, I do not see why that implies we should not have a convenient syntax for code points (and I literally mean code points here). Did you mean that if Uchar was renamed to, say, UCodePoint, then you would be fine with it? Or is there something fundamentally wrong with the proposal? (Unfortunately, I feel it might be difficult to change the standard library.)

Yes, it’d be great to have a literal syntax for Uchar! Which, slight nitpick, represents Unicode scalar values, not code points; it skips over the surrogates.

I’ve written more than one if Uchar.is_char u then match Uchar.unsafe_to_char u with ..., since all my text parsers use Uchar.utf_decode, so I’m obviously in favour. It’d also be nice if they supported range syntax for pattern matching, although we could make do with when clauses if that proved too hard.
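To illustrate, a rough sketch of that idiom (assuming OCaml >= 4.14 for String.get_utf_8_uchar; classify is just an illustrative name):

let classify s i =
  let d = String.get_utf_8_uchar s i in
  let u = Uchar.utf_decode_uchar d in
  if Uchar.is_char u then
    (match Uchar.unsafe_to_char u with
     | '0' .. '9' -> `Ascii_digit
     | 'a' .. 'z' | 'A' .. 'Z' -> `Ascii_letter
     | _ -> `Other)
  else `Other (* scalar values above U+00FF all fall through to here *)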

@XVilka Pretty much every language has messed up the Unicode nomenclature, some for historical reasons and others even on purpose, so I don’t think naming is a good reason to deprecate this module in the standard library.

Off the top of my head:

  • C#, Java and JS use “char” for UTF-16 code units.
  • Rust uses “char” for scalar values.
  • Go uses “rune” for code points.
  • C# uses “Rune” for scalar values.
  • Swift uses “Character” for extended grapheme clusters, but it intentionally leaves unspecified which segmentation algorithm was used, default or custom-tailored, so they’re effectively nondescript substrings that only make sense in context.

So “Uchar” is passable, and it will even make sense once we actually have Unicode “character literals”, which is what we all call that syntax anyway, isn’t it? At least I haven’t seen anyone confuse it with “unsigned char”, which was my worry when it was added, but thankfully the (slim) docs make that clear immediately.

@jhw If you allow me to rant a bit: I don’t understand why people have recently fixated on treating grapheme clusters as a text primitive. The point of text segmentation is finding useful boundaries within a string, for tasks like selection, editing and layout; but knowing that a standalone string is a single cluster is not very useful, because without boundaries to some other text the only information you have is that you can’t segment further.

There’s not even a single definition of what a grapheme cluster is, and that’s by design. Extended grapheme clusters as defined in UAX #29 are just a default, meant to be localised and tailored for different use cases. A recent proposal to Unicode (“Setting expectations for grapheme clusters”) even suggests that we stop calling them “user-perceived characters” and emphasises that they really need to be tailored to make sense for any given purpose.

Blessing the default grapheme clusters in the syntax and standard library of a language could lead developers to use them without understanding their purpose and to build subtly broken software. The defaults also change between versions of Unicode, so you either pin your language to a fixed Unicode version or code could fail to compile on future releases.[1]


  1. But I’d love a report on both misuse and compatibility breaks from Swift usage, since they were bold enough to do it! ↩︎

1 Like

Does this require changing, or further specifying, the way OCaml source files are encoded, so that the compiler can recognize a sequence of (wlog) UTF-8 bytes in the file as a code point?

My impression is that the compiler currently assumes ISO 8859-1 where it has to make any decisions. For example, my ocaml will accept (with a warning) the program given by the OCaml string literal "let \223 = ()\n" - that is, where the identifier being bound is the ISO 8859-1 Eszett. The warning is “Alert deprecated: ISO-Latin1 characters in identifiers”. Doesn’t this have to change in order for you to write ‘よ’ (as bytes in whatever Unicode encoding) and have the compiler understand that you mean a certain Unicode code point?

(The syntax doesn’t have to be exactly like these—especially if we want to make the parser’s job easier.)

I don’t think the syntax can be exactly this because 'a' already has to be understood as a literal of type char, not of Uchar.t.

Re: pattern matching, my understanding is that range patterns (the .. syntax) are restricted to chars to prevent performance blowups in the compiler: they elaborate to a match on each char in the range, so you could casually construct really large patterns if forms like 0 .. 1_000_000 were allowed. I’m not sure what the threshold of problematic range size is, but the proposed pattern-matching syntax clearly raises the upper limit by a lot, so it’s something to think about.
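For comparison, a hedged sketch of what one can write today (is_ascii_digit and is_subscript_digit are just illustrative names):

(* Range patterns already exist for char... *)
let is_ascii_digit = function '0' .. '9' -> true | _ -> false

(* ...but for scalar values one currently falls back on an explicit guard. *)
let is_subscript_digit u =
  match Uchar.to_int u with
  | code when 0x2080 <= code && code <= 0x2089 -> true
  | _ -> false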

1 Like

No, you meant Unicode scalar values :-)

On the name Uchar. Since I’m the one who added it.

For many years I have tried to use the precise terminology that the standard defines and used the term Unicode scalar value, which is what people usually intend to say when they wrongly say Unicode code point.

It seems no one wants to use it[1], not even the very people who devise the standards themselves, nor the many text-based standards I read (which tend to be confused about what international text is). It’s a pity, since that leads to the great confusion that lives in most programmers’ minds about Unicode[2]. When people talk about Unicode scalar values they use either:

  1. Unicode characters, an undefined concept in the standard or
  2. Unicode code points, which is wrong, confusing and definitely not what you want to work with in your programming language.

I could have called it Uscalar, but I’m not sure the module would have made it in at that point, given how little understanding of international text exists out there. So I decided to take the undefined concept (1) and cast it into the important concept that you need to deal with in Unicode processing – more on that semantic recast in my minimal Unicode introduction here.

I firmly stand by this choice because, in the end, scalar values are the atoms of Unicode; they are its defining alphabet, so using Unicode character for that – to be distinguished, of course, from user-perceived character – seems a good fit. For example, it’s the members of this alphabet that your finite state automata on international text run on (at least conceptually).

So no, I think it would be extremely silly to deprecate Uchar, and even sillier to call it UCodePoint.


  1. A little bit like the more “correct” URI instead of URL. ↩︎

  2. Which, if that’s your case, you can always try to clear up for yourself by reading my minimal Unicode introduction ↩︎

6 Likes

You are absolutely right! I am going to fix it…

My impression from Modest support for Unicode letters in identifiers by xavierleroy · Pull Request #11736 · ocaml/ocaml · GitHub is that we will adopt UTF-8 and drop the ISO-Latin1 support in the end.

a .. b only means that if the value is between a and b, then it’s matched. The compiler does not have to enumerate everything in the range; for larger ranges it could also be done with two comparisons.

It does in the current implementation of pattern matching, though, which explodes the entire range, I believe to handle exhaustiveness checking more easily. There was a similar discussion about integer ranges: someone would have to write a different implementation of range patterns. Luckily, that can be done at a different time from introducing the literal syntax.

I’d like to echo a general point that has been mentioned by @debugnik and partially by @dbuenzli: grapheme clusters should never be primitive in a programming language. Instead, the only technically reliable things are Unicode code points, Unicode scalar values, and the standardized UTFs. The next candidate could be normalization forms, due to the stability promised by the Unicode Consortium. There is also some limited stability around case folding. Almost anything beyond these can be very problematic.

Relatedly, I think the current OCaml standard library already covers the basics (thank you @dbuenzli) and I can implement quite a few things which use Unicode without additional OCaml libraries (!). I am very happy with the current support in OCaml—it provides minimal but sufficient tools for most tasks. (Some other people mentioned Go in another thread but its support is technically worse in my opinion.) My proposal is to address one pain point which can arguably be done without dramatic changes.
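To make “sufficient tools” concrete, here is a hedged sketch of a plain-stdlib fold over the scalar values of a UTF-8 string (assuming OCaml >= 4.14; fold_utf_8 is a hypothetical helper name):

let fold_utf_8 f acc s =
  let rec go acc i =
    if i >= String.length s then acc
    else
      let d = String.get_utf_8_uchar s i in
      (* Malformed input decodes to U+FFFD and still consumes at least one byte. *)
      let acc = f acc (Uchar.utf_decode_uchar d) in
      go acc (i + Uchar.utf_decode_length d)
  in
  go acc 0

For example, fold_utf_8 (fun n _ -> n + 1) 0 s counts the scalar values in s.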

So far the main technical obstacle seems to be (large) range patterns. I’m happy to drop those for now. I think it’s still a huge improvement if we can write the following code:

if '₀' <= code && code <= '₉'
then ... (* Unicode numeric subscripts *)
else ...

There was one more point I forgot to respond to:

I believe it is still an option to allow it. We are already overloading "..." for string literals and format string literals. We could also overload 'c' and have the type checker heavily biased towards the old char. That said, to clarify, I am not insistent on this.
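To recall how that existing overloading looks in practice, a small hedged illustration (print_pair is just an illustrative name):

let s : string = "%d: %s" (* here the literal is a plain string *)
let print_pair = Printf.printf "%d: %s" (* here the same literal is typed as a format *)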

3 Likes

I think you make a good point about pattern matching here. The ( ~: ) abbreviation trick mentioned above only works in expressions, not patterns. I agree it might be nice if a pattern syntax for Unicode scalar values were to be invented. I’m just not sure I want to see them written with UTF-8 directly in source code.

One compromise might be to allow '\u{...}' to be recognized as a literal of type Uchar.t, in expressions and patterns, in a similar way to how that escape sequence works in string literals. You could put the UTF-8 into a comment if you wanted to document what the scalar value looks like.

Your example code would be:

match c with
| '\u{2080}'..'\u{2089}' -> ... (* numeric subscripts, i.e. '₀' .. '₉'  *)
| ...

I don’t think that looks too bad.

To summarize the discussion so far, there are three major levels of support:

  • Level 1: Support Uchar.t literals such as '\u{2080}'; in particular, it should be a pattern.
  • Level 2: In addition to Level 1, support Uchar.t literals such as '😎'.
  • Level 3: In addition to Level 2, support range patterns such as '₀' .. '₉'.

I can put aside Level 3 until someone implements range patterns for int. However, I still think there’s a huge difference between Level 1 and Level 2. In particular, I believe

let subzero = '₀'

is significantly more readable than

let subzero = '\u{2080}' (* ₀ *)

To see why, there’s no easy way for me to detect the bug in the following code:

let subzero = '\u{2090}' (* ₀ ... just kidding, the code is wrong! *)

I think one important reason for us to use high-level languages such as OCaml is to reduce our mental burden when attempting to write bug-free programs. The (lack of) consistency between comments and code is a well-known major source of bugs. The Level 1 support would invite many human errors that can be trivially detected with the Level 2 support, and for this reason I think Level 2 is essential if we want Uchar.t literals at all.

PS: I changed my previous notation '\x2080' to '\u{2080}' so that it is consistent with string literals, as suggested by @jhw.

PPS: I want to thank everyone for the discussion. Now I think I’m more or less ready to create a real proposal on GitHub…

1 Like

Sorry if I missed an earlier explanation, but don’t you need to use a lexical convention that is disjoint from “regular” (byte) characters?

Something like, say, u'😎' rather than '😎'? These literals would then always have type Uchar.t. The reason is that 'a' could be interpreted as both a char and a Uchar.t otherwise.

Hopefully this is not off topic.

1 Like

Indeed. Note that the syntax for Uchar.t literals was discussed relatively recently here upstream.

This cuts both ways. From my perspective, there is also this problem:

let zero = 'Ø' (* just kidding... that's LATIN CAPITAL LETTER O WITH STROKE *)

That said, I will admit that it’s kinda hinky that regular-string-char admits full-on UTF-8 encoded Emoji but regular-char does not.

If people agree with you that any printable UTF-8 encoded scalar value should be allowed in Uchar.t literals, then perhaps one not-so-cheesy way to do that is to allow exactly one UTF-8 encoded scalar value to appear inside the braces of \u{...} forms in place of the usual hexadecimal digits.

Ah, my summary did not make it clear, but I think there are two solutions (as mentioned in previous comments): (1) overload it (just like string literals and format string literals) or (2) use u'😎' as you suggested or another distinct notation.

I do have tiny concerns about whether u'\u{XXXX}' would look too ugly with its two u’s, but overall I don’t really care. I believe someone (not me) in the community will come up with a nice notation. My main goal is to avoid writing down the numeric scalar values.

@dbuenzli Wow, thank you for the pointer. I somehow failed to discover the thread!

@jhw Yeah, I don’t disagree with you that it can still be confusing. My point is that it’ll not make things worse than string literals and (I still believe) in some cases it’ll improve readability. I’m a bit confused by your suggestion '\u{😎}' though. Maybe you just want u'😎'?

I don’t want to curb your enthusiasm, but I think the first step for anything to happen would be for OCaml to decide that its sources are UTF-8 encoded.

It’s also likely that some of the stuff in the new UTS #55 should be taken into account at that point (I haven’t had the time to read it yet).

Finally, there’s the question of unwanted combination when your literal is a combining character: you don’t want it to combine with the literal syntax delimiters; see my note here.

1 Like

Yes, I’m now painfully aware of that. :neutral_face: However, it seems '\u{XXXX}' is not controversial? Though it might eventually force us to consider overloading 'c'. As usual, thanks for the very useful pointers.

If you ask me, it’s tricky. Surely it’s unambiguous.

But then you can’t do that on non-escaped literals (as @c-cube noticed) and I would posit that it’s not very usable and rather confusing to have different delimiting syntaxes for escaped and non-escaped literals.

These things happen in the long term though. So maybe it would be good to try to settle a moonshot design that upstream might agree with and then start putting the pieces in.

For example, if upstream is willing to break compatibility on u'…' and u"…" (or '…'u and "…"u, as @lpw25 seemed to suggest in the discussion I linked to[1]), then what can be done now is to implement a warning that triggers on such source code occurrences, which are allowed for now.

That way, in a few years when all the pieces are in place, it becomes easier to convince people to flip the switch.


  1. I’m personally not convinced about postfix modifiers for strings though. ↩︎