Printing Uchar.t values

I was recently working with a small project that required decoding and printing UTF-8 characters, and I was surprised to find that there was no simple way to print a Uchar.t value. I’m curious if this is something that the stdlib should implement, or if I’m missing something obvious.

As far as I can tell, the only functions that can (indirectly) print a Uchar.t are in the Buffer module: add_utf_8_uchar, add_utf_16le_uchar, and add_utf_16be_uchar (since 4.14, the Bytes module has the corresponding set_utf_8_uchar, set_utf_16le_uchar, and set_utf_16be_uchar). Using these requires the user to create and manage a buffer, which adds some unnecessary ceremony if you aren’t already using one.
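For reference, the buffer ceremony described above might look roughly like this. It is only a sketch: the name uchar_to_string is made up for illustration and is not a stdlib function.

```ocaml
(* Hypothetical helper, not part of the stdlib:
   encode a single Uchar.t as a UTF-8 string via a temporary Buffer. *)
let uchar_to_string (u : Uchar.t) : string =
  (* A Unicode scalar value needs at most 4 bytes in UTF-8. *)
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

let () =
  let camel = Uchar.of_int 0x1F42B in
  print_endline ("Here is a UTF-8 camel: " ^ uchar_to_string camel)
```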

This seems like a deficiency because the core language already supports directly writing UTF-8 characters in strings:

let s = "Here is a UTF-8 camel: \u{1F42B}"

Therefore, I would expect to also have something like:

let camel_char = Uchar.of_int 0x1F42B

(* A hypothetical Uchar.to_string: *)
let s = "Here is a UTF-8 camel: " ^ (Uchar.to_string camel_char)

(* Or a hypothetical 'uc' format string flag: *)
let s = Printf.sprintf "Here is a UTF-8 camel: %uc" camel_char

The stdlib String module also includes functions to decode UTF-8 characters from strings, so it seems reasonable to me that we also have functions that can turn them directly back into strings.
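To illustrate the asymmetry, here is a rough sketch of a round trip, assuming OCaml 4.14 or later. Decoding uses String.get_utf_8_uchar from the stdlib, while encoding still has to go through a Buffer. The names uchars_of_utf_8 and utf_8_of_uchars are hypothetical.

```ocaml
(* Requires OCaml >= 4.14 for String.get_utf_8_uchar and Uchar.utf_decode_*. *)

(* Hypothetical helper: decode a UTF-8 string into its Unicode characters.
   Invalid byte sequences decode to Uchar.rep (U+FFFD). *)
let uchars_of_utf_8 (s : string) : Uchar.t list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

(* Hypothetical helper: the reverse direction, which needs a Buffer. *)
let utf_8_of_uchars (us : Uchar.t list) : string =
  let b = Buffer.create 16 in
  List.iter (Buffer.add_utf_8_uchar b) us;
  Buffer.contents b

let () =
  let s = "camel: \u{1F42B}" in
  assert (utf_8_of_uchars (uchars_of_utf_8 s) = s)
```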

Is this worth opening a feature request on GitHub?

(PS: Here’s the project, for whoever may be curious: johnridesabike/wave-function-collapse on GitHub, an OCaml implementation of the wave function collapse algorithm.)


Yup, right on. Here’s how I end up doing it:

https://git.sr.ht/~nobrowser/ocinco/tree/5-utf16/item/lib/Utils.ml#L18


I have been struggling with this for the past day; here is my WIP solution:
The relevant Unicode libraries, whose functionality has not entirely made its way into OCaml yet, are Uucp (properties), Uutf (encoding and decoding, which the Buffer and String modules partly cover as of 4.14), Fmt (printing), and Uuseg (for real string manipulation). These are all @dbuenzli's.
For the toplevel, I used #install_printer Fmt.Dump.uchar;;.
This prints the uchar as U+XXXX.
The standard-library equivalent, let dump ppf u = Format.fprintf ppf "U+%04X" (Uchar.to_int u);; followed by #install_printer dump;;, has the same functionality.
(But while I was still struggling/stumbling over a way to get the printer to work with only a dune utop command (deriving show?), nobrowser leaked a sweet set of Unicode functions…)


Building a string from small components (of varying size) in an efficient way requires using a dedicated datatype (like Buffer) directly. In other words, as soon as one moves beyond printing a single Uchar.t value, the buffer path is nearly unavoidable. For me, this is a sign that a Uchar.to_string function is more an efficiency trap for unsuspecting beginners than a really useful addition to the standard library.
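One way to read this concern, as a sketch only: with a per-character to_string, the natural beginner code concatenates with ^, which copies the accumulated string at every step, whereas a single Buffer appends in amortized constant time. The names uchar_to_string, concat_naive, and concat_buffer are hypothetical.

```ocaml
(* Hypothetical stand-in for the proposed Uchar.to_string. *)
let uchar_to_string (u : Uchar.t) : string =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

(* Naive: every ^ allocates a new string and copies everything built so far,
   so the total work grows quadratically with the number of characters. *)
let concat_naive (us : Uchar.t list) : string =
  List.fold_left (fun acc u -> acc ^ uchar_to_string u) "" us

(* Buffer: each character is appended in amortized constant time. *)
let concat_buffer (us : Uchar.t list) : string =
  let b = Buffer.create 64 in
  List.iter (Buffer.add_utf_8_uchar b) us;
  Buffer.contents b

let () =
  let us = List.init 1000 (fun _ -> Uchar.of_int 0x1F42B) in
  (* Both produce the same bytes; only the cost differs. *)
  assert (concat_naive us = concat_buffer us)
```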


I agree that to_string is probably not ideal, but what about the other possible functions? For the char type, we already have print_char, output_char, and the %c / %C printf format flags. I think that a lot of the time when people build strings from small components, using a function like sprintf is preferable over building a buffer anyway. It seems reasonable to me to include Uchar.t versions of those features.

A Format printing function for Uchar.t sounds reasonable and it covers all the use cases that you are describing.
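As a sketch of what such a printer could look like in user code today: the name pp_uchar is an assumption, not an existing stdlib function. It plugs into the %a conversion, so it works with Format.printf, Format.asprintf, and friends.

```ocaml
(* Hypothetical printer; pp_uchar is not a stdlib function.
   It writes the UTF-8 encoding of [u] to the formatter. *)
let pp_uchar (ppf : Format.formatter) (u : Uchar.t) : unit =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Format.pp_print_string ppf (Buffer.contents b)

let () =
  let camel = Uchar.of_int 0x1F42B in
  (* Print directly to stdout. *)
  Format.printf "Here is a UTF-8 camel: %a@." pp_uchar camel;
  (* Or build a string, sprintf-style. *)
  let s = Format.asprintf "Here is a UTF-8 camel: %a" pp_uchar camel in
  assert (s = "Here is a UTF-8 camel: \u{1F42B}")
```

Note that pp_print_string measures width in bytes, so a multi-byte character may throw off line breaking inside boxes; Format.pp_print_as, which takes an explicit width, is one possible way around that.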


Thanks for the feedback, everyone. I’ve gone ahead and submitted a feature-wish issue on GitHub.