Literals for Uchar.t (Unicode code points, more precisely Unicode scalar values)?

So, I was thinking that the way to keep this unambiguous is for 'a' to be a literal of type char, and for '\u{0061}' to be the same character but with type Uchar.t. Basically, '\u{ and }' become the delimiters of a new kind of literal, and the Unicode escape sequence between the single quotes has the same meaning and syntax in both string literals and Unicode scalar value literals.

If you then allow my wacky idea of permitting printable (non-combining with the delimiters) Unicode scalar values between the braces in the Unicode escape sequence, then you could also write '\u{a}' to get a Uchar.t literal equivalent to Uchar.of_char 'a'. And you could also write string literals like "caff\u{è}" if you want to be really explicit about insisting that it’s a LATIN SMALL LETTER E WITH GRAVE and not a LATIN SMALL LETTER E followed by a COMBINING GRAVE ACCENT. (In case your source code gets normalized to NFD behind your back somehow, the compiler would be able to catch it.)
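To make the NFC/NFD concern concrete, here is a small sketch that uses only the \u{...} string escape OCaml already has (hex digits between the braces, not the proposed printable-character form). The names nfc and nfd are just for illustration.

```ocaml
(* Precomposed form (NFC): U+00E8 LATIN SMALL LETTER E WITH GRAVE. *)
let nfc = "caff\u{E8}"

(* Decomposed form (NFD): 'e' followed by U+0300 COMBINING GRAVE ACCENT. *)
let nfd = "caffe\u{300}"

let () =
  (* Both usually render identically, but the UTF-8 byte sequences differ. *)
  Printf.printf "nfc: %d bytes, nfd: %d bytes, equal: %b\n"
    (String.length nfc) (String.length nfd) (String.equal nfc nfd)
```

The two strings compare as different, which is exactly the kind of silent change a normalizing editor could introduce in unescaped source text.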

The compiler team is still in an entmoot about how to deal with the source text encoding problem, as @dbuenzli notes. Which is fine. But the current, somewhat weird rules that allow free-form bytes in string literals (but not in char literals) are maybe something that could be incrementally improved for consistency with char literals, if sufficient care is taken.

If we’re going to use escape sequences, then I think it would be cleaner to extend the integer literal syntax. We could add a suffix like u to indicate a Uchar.t type.

let nativeint = 0x0061n
let uchar = 0x0061u

Though that doesn’t really bring you much w.r.t. say:

match Uchar.to_int u with 
| 0x0061 -> …
| _ -> …

let uchar = Uchar.of_int 0x0061

If you are going to use escapes there’s no real need to add something if you ask me.
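For reference, a complete version of that approach that compiles with today's standard library might look like the sketch below. The helper name describe and the second match arm are made up for illustration.

```ocaml
(* U+0061 LATIN SMALL LETTER A, built from its integer code point. *)
let latin_a = Uchar.of_int 0x0061

(* [describe] is a hypothetical helper: it matches on the integer value
   because there is no Uchar.t literal to match on directly. *)
let describe (u : Uchar.t) : string =
  match Uchar.to_int u with
  | 0x0061 -> "latin small letter a"
  | 0x00E8 -> "latin small letter e with grave"
  | _ -> "something else"

let () = print_endline (describe latin_a)
```

Note that Uchar.of_int raises Invalid_argument for integers that are not Unicode scalar values, so the check happens at run time rather than in the lexer.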

I think 0x0061u would be an acceptable companion to 'a'u, if this syntax is chosen, since it’s 3 characters shorter than '\u{0061}' and avoids the inconsistency of omitting the u-suffix when escaped (unless we double up the u as in '\u{0061}'u which starts looking monstrous).

We are talking about OCaml. The obvious solution is to use an extra ., as in u.'😂'.

We already have an escape sequence for Unicode scalar values. It’s just that you can only use it in a string literal and not a character literal. I think that’s a (minor) blemish in the syntax, and I think my proposal (whatever its other drawbacks) at least removes that inconsistency.

Maybe the other proposals here allow for prettier syntax, but at the expense of consistency and at the cost of a steeper learning curve for newcomers. “Q: How do I write a specific Unicode scalar value into my program?” “A: Depends. Are you writing a string literal or a character literal? They’re different.” “Q: Why?” “A: We were trying to let you type three fewer keystrokes.”

I have a question for people who might be more familiar with the parsing and type-checking: Suppose we are aiming for a moonshot design. What could be the potential technical difficulties of overloading 'c' for both char and Uchar.t? I feel this is the last missing piece for me to put together a proposal on GitHub.

This is a completely separate feature request (type-directed disambiguation for literals). My advice is to not try to bundle multiple controversial feature requests in one request.

Thanks. I was bringing it up because the last few comments concern the concrete notation, or more precisely, where to put the extra u. The type-directed disambiguation can provide a solution if no consensus can be reached on where to add the u. I don’t see a way to cleanly decouple the issues.

To clarify, I don’t plan to provide just one solution in the proposal. Instead, I want to discuss all the proposed solutions and their possible issues. That will include the one using the type-directed disambiguation of literals.

[…] “A: We were trying to let you type three fewer keystrokes.”

It was I who mentioned the number three, but I made no mention of “keystrokes”. (On most keyboards it would be more keystrokes, but that does not matter IMO.) My main concern is readability and, yes, consistency, and the length and amount of visual noise have some effect on readability. However, I can see some issues with the 0x0061u notation (to a C programmer it looks like an unsigned integer, and we don’t have the analogous notation for char), so I’m not going to promote it any further.

If we don’t care about the length, and given @dbuenzli’s point that we can always use Uchar.to_int when we want to pattern match on many hex-encoded Unicode scalars, then his mention (but maybe not suggestion) of using doubled single quotes, as in ''a'', would at least be fairly noise-free, at least to my eyes. And then ''\u{0061}'' comes naturally, and could be a useful alternative to Uchar.to_int when the pattern isn’t too long, or when we want to mix the notations in the same pattern.

Parsing 'c' shouldn’t be a problem, and there is precedent for type-directed disambiguation in the format strings, though I’m not an expert on type checking. I think there is a disadvantage with this, though: while format strings are normally passed as an argument to a printf-like function, so that the type is resolved on the spot, the compiler would have to decide the type of a function like

let is_digit = function '0'..'9' -> true | _ -> false

to be char -> bool while a Uchar.t -> bool might have been intended. So I’m wondering whether a separate syntax would be better, to avoid mandatory type annotations in certain cases, which would need to be explained in introductory texts.
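To illustrate both halves of this point with code that compiles today, here is a rough sketch. It shows the existing format-string precedent, then the char version of is_digit next to what a Uchar.t version has to look like at the moment. The name is_digit_uchar is hypothetical.

```ocaml
(* Precedent: the same literal is typed as a string or as a format,
   depending on the type expected at that position. *)
let as_string : string = "%d items"
let as_format : (int -> string, unit, string) format = "%d items"
let rendered = Printf.sprintf as_format 3

(* Inferred as char -> bool, because '0' .. '9' are char literals. *)
let is_digit = function '0' .. '9' -> true | _ -> false

(* Without Uchar.t literals, the Uchar.t version goes through integers.
   With overloaded 'c' literals, the definition above would instead need
   an annotation to select this type. *)
let is_digit_uchar (u : Uchar.t) : bool =
  let n = Uchar.to_int u in
  0x0030 <= n && n <= 0x0039
```

In the format case the annotation (or the printf-like function's parameter type) drives the choice. A bare function like is_digit has no such context, which is where a default or an annotation would be needed.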

I made a feature request on GitHub based on the discussion here: Support unescaped Uchar.t literals · Issue #12696 · ocaml/ocaml · GitHub