I was recently working on a small project that required decoding and printing UTF-8 characters, and I was surprised to find that there is no simple way to print a Uchar.t value. I’m curious whether this is something the stdlib should implement, or whether I’m missing something obvious.
As far as I can tell, the only functions that can (indirectly) print a Uchar.t are the encoders in the Buffer module (add_utf_8_uchar, add_utf_16le_uchar, and add_utf_16be_uchar) and their set_utf_* counterparts in Bytes. Using these requires the user to create and manage a buffer, which adds some unnecessary ceremony if you aren’t already using one.
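For reference, the buffer ceremony described above might look like the following minimal sketch. The name uchar_to_string is hypothetical (it is not in the stdlib); only Buffer.create, Buffer.add_utf_8_uchar, and Buffer.contents are existing functions.

```ocaml
(* Hypothetical helper, not part of the stdlib: encode one Uchar.t as a
   UTF-8 string by going through a temporary Buffer. *)
let uchar_to_string (u : Uchar.t) : string =
  (* A Unicode scalar value takes at most 4 bytes in UTF-8. *)
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

let () =
  let camel = Uchar.of_int 0x1F42B in
  print_endline ("Here is a UTF-8 camel: " ^ uchar_to_string camel)
```

This allocates a fresh buffer per character, which is fine for a one-off print but wasteful inside a loop.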
This seems like a deficiency because the core language already supports directly writing UTF-8 characters in strings:
let s = "Here is a UTF-8 camel: \u{1F42B}"
Therefore, I would expect to also have something like:
let camel_char = Uchar.of_int 0x1F42B
(* A hypothetical Uchar.to_string: *)
let s = "Here is a UTF-8 camel: " ^ (Uchar.to_string camel_char)
(* Or a hypothetical 'uc' format string flag: *)
let s = Printf.sprintf "Here is a UTF-8 camel: %uc" camel_char
The stdlib String module also includes functions to decode UTF-8 characters from strings, so it seems reasonable to me that we also have functions that can turn them directly back into strings.
Is this worth opening a feature request on GitHub?
I have been struggling with this for the past day; here is my WIP solution:
The relevant Unicode libraries, which have not entirely made their way into OCaml yet, are Uucp (properties), Uutf (encode/decode; the 4.14 Buffer and String modules cover this), Fmt (for printing), and Uuseg (for real string manipulation). These are all by @dbuenzli.
For the toplevel, I used #install_printer Fmt.Dump.uchar;;.
This prints the uchar as U+XXXX.
The standard library equivalent, let dump ppf u = Format.fprintf ppf "U+%04X" (Uchar.to_int u);; followed by #install_printer dump;;, has the same functionality.
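If the goal is to see the character itself and not only its code point, a stdlib-only printer could emit both. This is a sketch, not what Fmt.Dump.uchar does; pp_uchar is a hypothetical name, and whether the glyph renders depends on the terminal.

```ocaml
(* Hypothetical printer: shows the code point as U+XXXX followed by the
   UTF-8 encoding of the character in parentheses. *)
let pp_uchar (ppf : Format.formatter) (u : Uchar.t) : unit =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Format.fprintf ppf "U+%04X (%s)" (Uchar.to_int u) (Buffer.contents b)

let () =
  Format.printf "%a@." pp_uchar (Uchar.of_int 0x1F42B)
```

Since pp_uchar has type Format.formatter -> Uchar.t -> unit, it should also be usable in the toplevel with #install_printer pp_uchar;;.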
(But while I was still stumbling over a way to get the printer to work with only a dune utop command (deriving show?), nobrowser leaked a sweet set of Unicode functions…)
Building a string efficiently from small components of varying size requires using a dedicated datatype (like Buffer) directly. In other words, as soon as one moves beyond printing a single Uchar.t value, the buffer path is nearly unavoidable. For me, this is a sign that a Uchar.to_string function would be more of an efficiency trap for unsuspecting beginners than a really useful addition to the standard library.
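To make the buffer path concrete, here is a rough sketch that decodes a UTF-8 string and re-encodes it through a single Buffer. The name map_uchars is hypothetical, and the code assumes OCaml 4.14 or later for String.get_utf_8_uchar and the Uchar.utf_decode_* accessors.

```ocaml
(* Sketch: rebuild a string code point by code point through one Buffer.
   [map_uchars] is a hypothetical name; requires OCaml >= 4.14. *)
let map_uchars (f : Uchar.t -> Uchar.t) (s : string) : string =
  let b = Buffer.create (String.length s) in
  let rec loop i =
    if i < String.length s then begin
      let d = String.get_utf_8_uchar s i in
      (* Invalid byte sequences decode to U+FFFD (Uchar.rep). *)
      Buffer.add_utf_8_uchar b (f (Uchar.utf_decode_uchar d));
      (* Advance by the number of bytes this decode consumed. *)
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents b

let () =
  (* Swap the two-humped camel for a dromedary. *)
  let swap u =
    if Uchar.to_int u = 0x1F42B then Uchar.of_int 0x1F42A else u
  in
  print_endline (map_uchars swap "Here is a UTF-8 camel: \u{1F42B}")
```

The whole string is built with one buffer allocation, whereas concatenating the result of a per-character to_string would allocate an intermediate string at every step.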
I agree that to_string is probably not ideal, but what about the other possible functions? For the char type, we already have print_char, output_char, and the %c / %C printf format flags. I think that a lot of the time when people build strings from small components, using a function like sprintf is preferable over building a buffer anyway. It seems reasonable to me to include Uchar.t versions of those features.
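Until something like that exists, these features can be approximated in user code. The sketch below is only an illustration: uchar_to_string, output_uchar, print_uchar, and utf8 are all hypothetical names, and the existing %a directive stands in for the proposed %uc flag.

```ocaml
(* All names defined below are hypothetical; none exist in the stdlib. *)
let uchar_to_string (u : Uchar.t) : string =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

(* Analogues of output_char and print_char, writing UTF-8. *)
let output_uchar (oc : out_channel) (u : Uchar.t) : unit =
  output_string oc (uchar_to_string u)

let print_uchar (u : Uchar.t) : unit = output_uchar stdout u

(* Converter for the %a directive of Printf.sprintf, which expects a
   function of type unit -> 'a -> string. *)
let utf8 () (u : Uchar.t) : string = uchar_to_string u

let () =
  let camel = Uchar.of_int 0x1F42B in
  print_uchar camel;
  print_newline ();
  print_endline (Printf.sprintf "Here is a UTF-8 camel: %a" utf8 camel)
```

The %a route composes with the rest of a format string, so no explicit buffer shows up at the call site, though each use still allocates one internally.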