Help me pick a bikeshed color, er, print syntax for Uchar.t


#1

Anyone want to help me figure out the color for a bikeshed?

Uchar.t currently lacks a default printer for ppx_deriving.show — indeed, it’s practically the only thing in the standard library that doesn’t have such a printer. I’d like to do a pull request to add one. Writing the code is easy, but figuring out what the default printed representation should look like is not obvious, because there’s no read syntax for Uchar.t in OCaml currently.

I’d like to pick something that might even be a good candidate for a read syntax someday, so thoughts on a good one are actively solicited. I don’t want what I pick to suck, inspire the addition of that syntax to the language itself, and then end up living for a long time even though it sucks.

So, any suggestions? One complexity is that Uchar.t can store things that don’t print out very well, like zero-width joiners, combining characters, etc.

See also the discussion at https://github.com/ocaml-ppx/ppx_deriving/issues/174


#2

One suggestion for an OCaml Uchar.t read syntax that’s evolved on the Discord channel:

let pi : Uchar.t = \u'π'

(for direct entry of Unicode chars in source)

and

let alsopi: Uchar.t = \u{3C0}

(for entry of chars by their hex codepoint.)

It’s gross, but finding something less gross seems hard…
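
For illustration, a printer that emits the hex form could look like the sketch below. The names pp_uchar and show_uchar are hypothetical, not anything ppx_deriving provides today, and the output format is only the candidate under discussion. It has the same shape as the \u{3C0} escape that OCaml already accepts inside string literals.

```ocaml
(* Hypothetical printer: renders a Uchar.t as \u{HEX}.
   The format is the candidate syntax from this thread, not an
   existing OCaml literal syntax for Uchar.t. *)
let pp_uchar (fmt : Format.formatter) (u : Uchar.t) : unit =
  Format.fprintf fmt "\\u{%X}" (Uchar.to_int u)

(* String-returning companion, in the style ppx_deriving.show uses. *)
let show_uchar (u : Uchar.t) : string =
  Format.asprintf "%a" pp_uchar u
```

Because only ASCII characters are emitted, this form stays legible for zero-width joiners and combining characters.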


#3

Printing out π requires determining the user’s current locale and using it to encode the codepoint in the matching encoding (you can’t just print UTF-8 and cross your fingers). How far you take that depends on whether or not you want to support fun stuff like EBCDIC platforms or people running on iso-8859 platforms.

On both latin1 and utf-8 platforms I guess you could print out the character itself, provided it’s printable ASCII (0x20 <= t <= 0x7e or whatever)
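
A rough sketch of that idea, assuming the printable range really is 0x20 to 0x7e and that everything else falls back to a hex escape. The function name and the fallback format are both assumptions made for the example.

```ocaml
(* Hypothetical: print printable ASCII as-is, escape everything else.
   The \u{HEX} fallback is just one candidate format. *)
let show_uchar_ascii (u : Uchar.t) : string =
  let cp = Uchar.to_int u in
  if cp >= 0x20 && cp <= 0x7E then String.make 1 (Char.chr cp)
  else Printf.sprintf "\\u{%X}" cp
```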

An alternative would be to just print the codepoints, preferably in a format that would permit re-entry somehow. I.e., I’d prefer 0x3C0 over \u{3C0} because I can copy-paste the former.
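
To make the copy-paste point concrete, here is a small round-trip sketch using only standard library calls. The helper names to_hex and of_hex are made up for the example.

```ocaml
(* Hypothetical helpers: print a codepoint as an OCaml integer
   literal and read it back. *)
let to_hex (u : Uchar.t) : string =
  Printf.sprintf "0x%X" (Uchar.to_int u)

let of_hex (s : string) : Uchar.t option =
  match int_of_string_opt s with
  | Some cp when Uchar.is_valid cp -> Some (Uchar.of_int cp)
  | _ -> None (* not an integer, or not a Unicode scalar value *)
```

The printed text can also be pasted straight into source as Uchar.of_int 0x3C0, which is valid OCaml today.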

Not an easy problem.


#4

That doesn’t really have much to do with this question, since we’re discussing read syntax.


#5

Ah, I had never encountered the term “read syntax” before, but I take it that means the of_string syntax?

In that case let pi : Uchar.t = \u'π' seems to suggest we should parse the user’s input according to their locale, or assume UTF-8?
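
If UTF-8 is assumed, reading a single character out of a string might look like the sketch below. It relies on String.get_utf_8_uchar, so it needs OCaml 4.14 or later, and the function name uchar_of_utf8 is made up for the example.

```ocaml
(* Hypothetical helper: decode a string holding exactly one
   UTF-8 encoded character. Assumes UTF-8, ignores the locale. *)
let uchar_of_utf8 (s : string) : Uchar.t option =
  if s = "" then None
  else
    let d = String.get_utf_8_uchar s 0 in
    if Uchar.utf_decode_is_valid d
       && Uchar.utf_decode_length d = String.length s
    then Some (Uchar.utf_decode_uchar d)
    else None (* malformed bytes, or more than one character *)
```

Under a different locale encoding the same bytes would have to be transcoded first, which this sketch does not attempt.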


#6

For source files, we will almost certainly be adopting Unicode eventually; pretty much all programming languages have.