OCaml standard library Unicode support

Continuing the discussion from Feedback / Help Wanted: Upcoming OCaml.org Cookbook Feature:

OK, let’s talk about an example that I came across recently: minttea/leaves/table.ml at b084ec7401c52167fae5087577133e52e3874899 · leostera/minttea · GitHub

Here we want to render a table in the CLI, padding and truncating the text in each cell to fit the column width. E.g., suppose one of the columns is 30 characters wide. We need to calculate the length of the text that will go inside each cell so we can pad it with one space character on each side (so it looks nice) and fit it inside.

Text: A 🤦‍♂️ walks into a bar, orders 🍻

Table:

| A 🤦‍♂️ walks into a bar,... |

Question: how would you calculate the correct length without grapheme clusters? EDIT: or using only the standard library?


That’s a great example, because you can’t actually measure the width of :man_facepalming: with grapheme clusters! :man_facepalming: is a single grapheme cluster, 3 scalars, but every terminal or editor worth its salt will print it as 2-wide!

Read up on the display rules for emoji, they’re supposed to be the same size as CJK ideographs. Traditionally, terminals have used the East_Asian_Width property (half-width or full-width) to determine whether a single scalar is 1-wide or 2-wide. Single-scalar emoji are defined as full-width, which is why they mostly work as is.[1]

But this became much harder once emoji ZWJ sequences were introduced: :man_facepalming: is a single grapheme cluster of two emojis connected with a ZWJ (zero-width-joiner). If, and only if, your terminal can actually render them as a single ligature (which depends on the renderer and font), you can wing it and pretend the cluster is 2-wide… but if I copy-paste it into the new Windows terminal right now, for example, it prints a 4-wide :person_facepalming::male_sign: pair!

So, to answer your question: As far as I know, there isn’t a VT sequence to query the terminal about the width of a Unicode sequence, so the only way to calculate the correct length of an emoji from a console app is to wing it, assume emoji are 2-wide and hope your user’s terminal is good enough.

And then I tell you that some grapheme clusters can be even wider, like the Basmala (“﷽”, a single Unicode scalar!), which VS Code renders “monospaced” as 3-wide. However, there isn’t much agreement on some of those; the Windows terminal somehow crams the basmala into a 2-wide cell. Update: I just double-checked and it’s even worse: my VS Code is actually rendering the basmala (and emoji!) as 2-wide, but using a much wider fallback font than mine, so the text isn’t even monospaced anymore.


  1. Although the East Asian Width annex warns:

    The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations.


The length will depend on your text rendering engine and font.


Interesting. I don’t know about you but on my Mac VSCode counts it as a single character.

Read up on the display rules for emoji

This says that emojis should typically have a square aspect ratio so I assume that means they should take up the width of a single character in a monospaced font.

And then I tell you that some grapheme clusters can be even wider,

Yeah, in Unicode every rule has a dozen exceptions.

But my main question was: how do you measure the length of a given string to truncate it to fit inside a given maximum width in a CLI app. I get that the real answer is ‘it depends’, but what is the best-effort answer that would be comparable to other modern languages, using the standard library? Or to put it another way, what functions would I use in the standard library to get a length of 1 for :man_facepalming:?

EDIT: ‘But Yawar, why are you insisting that the length of :man_facepalming: is 1?’ Please see The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) @ tonsky.me

IIRC, from @pqwy’s investigations a few years ago while developing notty, computing grapheme clusters for this doesn’t actually bring much.

Not in the standard library but the easiest path is to simply fold the result of Uucp.Break.tty_width_hint that he contributed to the project. Its extensive doc string is also worth reading.
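A minimal sketch of such a fold, assuming OCaml 4.14 or later for `String.get_utf_8_uchar`. The per-scalar hint is passed in as a parameter so the snippet runs with the standard library alone; with uucp installed, one would presumably pass `Uucp.Break.tty_width_hint` instead. The names `tty_width` and `toy_hint` are made up for illustration, and `toy_hint` is a deliberately naive stand-in, not real width data.

```ocaml
(* Sum a per-scalar column hint over a UTF-8 string (OCaml >= 4.14).
   [width_hint] is a parameter: pass Uucp.Break.tty_width_hint when uucp
   is available. Invalid bytes decode to U+FFFD and are measured as such. *)
let tty_width ~(width_hint : Uchar.t -> int) (s : string) : int =
  let rec go i acc =
    if i >= String.length s then acc
    else
      let d = String.get_utf_8_uchar s i in
      let w = width_hint (Uchar.utf_decode_uchar d) in
      (* Clamping negative hints to 0 columns is a choice of this sketch. *)
      go (i + Uchar.utf_decode_length d) (acc + max 0 w)
  in
  go 0 0

(* Hypothetical toy hint: ASCII is 1 column, everything else 2. *)
let toy_hint (u : Uchar.t) : int = if Uchar.to_int u < 0x80 then 1 else 2

let () =
  (* "a bar " followed by U+1F37B (clinking beer mugs) *)
  Printf.printf "%d\n" (tty_width ~width_hint:toy_hint "a bar \u{1F37B}")
```

Summing per-scalar hints this way knows nothing about ZWJ sequences, so it inherits all the caveats discussed in this thread.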

But basically, as long as you don’t have a way to query the terminal for the width of a string you want to render, there will always be cases that fail (think, for example, of programming fonts that ligature -> into a single glyph).

It seems a Text Terminal Group was created at the Unicode level last year and it is not dead. Maybe something could come out of that in the future.


You do not want a length of 1 for :man_facepalming:. Emoji are full-width characters, like CJK ideographs, and every terminal emulator I know of with Unicode support prints full-width characters as 2-wide cells:

[image]

So the best effort way to lay out :man_facepalming: is to actually parse emoji ZWJ sequences using the UTF-8 decoding API, assume they are 2-wide for metric, and give up all hope if any users come to you complaining about :person_facepalming::male_sign: breaking the layout.
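A rough sketch of that best-effort approach using only the standard library (OCaml 4.14 or later). Everything here is an assumption made for illustration: `is_wide` is a hand-picked set of code point ranges rather than the real East_Asian_Width data, and `width` simply treats whatever follows a ZWJ as part of the preceding 2-wide emoji.

```ocaml
(* Crude, hand-picked "full-width" ranges (CJK, Hangul, common emoji blocks).
   This is NOT the East_Asian_Width property, only an approximation. *)
let is_wide (c : int) : bool =
  (c >= 0x1100 && c <= 0x115F) || (c >= 0x2E80 && c <= 0xA4CF)
  || (c >= 0xAC00 && c <= 0xD7A3) || (c >= 0xF900 && c <= 0xFAFF)
  || (c >= 0xFF00 && c <= 0xFF60) || (c >= 0x1F300 && c <= 0x1FAFF)
  || (c >= 0x20000 && c <= 0x3FFFD)

(* Estimated number of terminal columns for a UTF-8 string. *)
let width (s : string) : int =
  (* [joined] is true when the previous scalar was a ZWJ. *)
  let rec go i joined acc =
    if i >= String.length s then acc
    else
      let d = String.get_utf_8_uchar s i in
      let c = Uchar.to_int (Uchar.utf_decode_uchar d) in
      let next = i + Uchar.utf_decode_length d in
      if c = 0x200D then go next true acc (* ZWJ: glue to the next scalar *)
      else if (c >= 0xFE00 && c <= 0xFE0F) || (c >= 0x1F3FB && c <= 0x1F3FF)
      then go next false acc (* variation selectors and skin-tone modifiers *)
      else if joined then go next false acc (* already counted in the sequence *)
      else go next false (acc + (if is_wide c then 2 else 1))
  in
  go 0 false 0

let () =
  (* "A <man facepalming> walks": U+1F926, ZWJ, U+2642, U+FE0F *)
  Printf.printf "%d\n" (width "A \u{1F926}\u{200D}\u{2642}\u{FE0F} walks")
```

On a terminal that renders the sequence as two separate glyphs, this estimate is off by two columns, which is exactly the failure mode described above.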

Note that I’m not aware of a single standard library for any programming language with an appropriate function for this, not even Swift, which is the poster child of (ab)using grapheme clusters; the closest things are functions like wcwidth, and those only work for single scalars, not ZWJ sequences. So I don’t understand how you hold this against OCaml’s stdlib; most terminal emulators have to implement this on their own for a reason.


OK, for terminals fair. But for humans, if you tell them that :man_facepalming: is 2 characters, they will look at you funny. See the Niki Tonsky blog post I linked earlier.

I don’t understand how you hold this against OCaml’s stdlib,

Just to be clear, I don’t hold anything against anything. I am just asking a question here.

I would actually prefer having “length” and “width” (with suitable names), where the “length” of every emoji is 1 and its “width” is something else.

But as there is only one such value, it should be 1, as the “width” only has a (questionable) meaning for fixed-width fonts anyway. Any such “width” is a property of the renderer (or, if it really needs to be, of the font), but certainly not of the Unicode string itself.


@yawaramin I’ll address first the article you linked earlier in an edit:

I’ve seen that article before and I dislike it because it doesn’t address which use cases you want the grapheme clusters for; it just barely mentions “character count in UI”; it reads like edutainment. We don’t compute string lengths and text boundaries out of intellectual curiosity, we use them for something. And that use case is going to drive your choice of metrics and tailorings.[1]

So back to your latest reply:

So what metric are you trying to show to your user? If it’s characters written, in which locale? The text segmentation annex also emphasizes:

This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments and, for the purpose of claiming conformance, document the tailoring in the form of a profile.

Edit: To be clear, grapheme clusters are still your best bet here, but you’d have to actually check what a character means to your users. Outdated example I’m familiar with: Until 1994, the digraphs “ch” and “ll” were considered their own letters in Spanish; I’m sure similar quirks still exist in many languages.

Or are we talking plain text? That’s barely even a format, so your editor is most likely counting Unicode scalar values (VS Code does, even Windows newlines are 2 characters), because otherwise you’d be exposed to a very leaky abstraction.
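For reference, a scalar count like that can be sketched with the standard library alone, assuming OCaml 4.14 or later for `String.get_utf_8_uchar`. The name `scalar_count` is made up here, and the facepalm emoji is written in its three-scalar spelling (without the optional trailing U+FE0F variation selector, which would add a fourth scalar).

```ocaml
(* Count Unicode scalar values in a UTF-8 string (OCaml >= 4.14).
   Each invalid byte sequence decodes to U+FFFD and counts as one scalar. *)
let scalar_count (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

let () =
  (* U+1F926 (person facepalming), U+200D (ZWJ), U+2642 (male sign) *)
  let facepalm = "\u{1F926}\u{200D}\u{2642}" in
  Printf.printf "bytes=%d scalars=%d\n"
    (String.length facepalm) (scalar_count facepalm);
  (* A Windows newline is two scalars, as noted above. *)
  Printf.printf "CRLF scalars=%d\n" (scalar_count "\r\n")
```

Neither number is 1: `String.length` counts bytes and the fold counts scalars, so a count of user-perceived characters needs some form of segmentation on top.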

That the word length means too many different things in computing is a naming problem, not a computing problem. Of course, what we need is to be specific on which length/width/whatever we mean for each use case. All of these metrics are useful for different use cases.


  1. The same goes for most Unicode annexes, really: Unicode isn’t a unifying solution to all text processing, it’s a unifying set of interoperable defaults to write text processing tools, as much as we wish otherwise.

Yes, and as I’ve said in the part you didn’t quote, the only sensible “length” that can be returned for a grapheme cluster by a library just looking at the “string” itself is 1 (as in “the number of backspaces needed to delete it”). Of course that doesn’t solve the OP’s problem, which actually is independent of grapheme clusters or Unicode and a (well, “the”) problem of font rendering.

But that’s only if we’re talking grapheme cluster length, in which case it’s tautological unless you’re doing some really funky tailoring. We shouldn’t assume any kind of metric as the default is what I’m saying.

Try actually copying :man_facepalming: into an editor and pressing backspace: on many (most?) platforms/editors it will actually take multiple backspaces for the entire cluster to disappear. There was a joke on social media about slowly killing a family by pressing backspace on a family emoji ZWJ sequence.


Yes, of course. I just wanted to emphasise the fact that “displayed width after rendering” does not have (much) to do with the length of the string, even if we are talking about grapheme clusters (or whatever approximation of a “logical character” is used).

That’s why I explicitly included the example of backspace :smiley:

Sure. Maybe that something is ‘I want to show a count of characters typed in this text box’. Maybe it’s something else. Why do I have to prove my use case for something that’s already widely agreed on? Do you think the count shown here will make sense to users?

Can you just say ‘the standard library doesn’t really support that’? That’s OK! It’s not a failing of the standard library, it’s just what it is.

EDIT: also,

it reads like edutainment.

This article tries to explain a complicated subject to developers in a simple way, and help us to understand more about Unicode and how to think about it. To put it down condescendingly as ‘edutainment’ is not really helpful. If you can do better, please go ahead!

First off, just to be clear, I’m not trying to be confrontational, I just feel like you’re expecting there to be a clear and correct solution to a rather fuzzy problem in a messy domain. And I happen to prefer the forum format to, say, chatting on Discord about it, even if it feels a bit colder sometimes.

Huh? Of course your use case is valid, when did I claim otherwise? I was addressing the article there, not you. I’m just pointing out there’s not a single definition of a character count, not even with grapheme clusters, if only because there’s not a single definition of grapheme clusters. I’m not saying you can’t or shouldn’t count characters, I’m saying you should be aware that you’re making a choice on how to do so, and the defaults don’t necessarily meet your requirements.

Of course, if you don’t want to think about it, the easy solution is to throw default grapheme clusters at it; at least the count will be more sensible than Google Docs’, what the heck…

They’re apparently trying to measure the visual width of emoji, without any parsing of ZWJs, and reporting that as its character count: Single-scalar emojis report two characters, :man_facepalming: reports 5 (2 * 2-wide plus a ZWJ) and :family_man_woman_boy: (emoji sequence, not the family scalar) reports 8 (3 * 2-wide plus 2 ZWJs). Weird.

It’s been a while since I read that article, so I may have judged it too harshly from memory: It does introduce multiple topics about Unicode, and its technical explanations are correct. But it does assume that there is a single worthwhile definition of grapheme clusters, that this definition of grapheme clusters is preferable to scalars as a unit for text processing, and that counting these clusters gives the only sensible definition of a string’s length (whatever that word means at this point). All of this without any backing rationale other than counts being human-friendly (a single use case) and a strawman about substrings.

This is not a unique viewpoint from the author, it’s a popular bandwagon lately that is already leading to some rather awkward APIs by missing the point of text segmentation.

Rather than failing to make my points clear again, I’ll share here the Unicode proposal “Setting expectations for grapheme clusters”, written by someone who has worked on and researched this more than I have, formally proposing to the committee to drop from the annex the assumption that grapheme clusters actually correspond to user-perceived characters, and to clarify that tailorings are generally needed. You’ll hopefully find this better argued than my posts, or at least less adversarial.

Interestingly, it also points out that CSS Text Module Level 3 expects user-agents to provide language-specific text segmentation tailorings depending on the content type and the task. It provides a few examples of grapheme-cluster tailorings, suggesting different tailorings for different purposes, such as line-breaking, letter-spacing, and vertical typesetting.

I don’t think I’ve currently got enough free time to give it a try, but hopefully in the future I’ll get to write a bunch about technical stuff I’ve learned, yes!


There is if you’re displaying a character count to a non-technical user who doesn’t care about the peculiarities of grapheme clusters. If you have a document like the one I showed, and you tell your user that :man_facepalming: is 5 characters, you will get a funny look. If you tell them it’s 1 character, you will get a nod of agreement. This is not that difficult :slight_smile:

Now, given this use case, how do I get the OCaml standard library to agree with me that this is 1 character? What specific series of function invocations gives me this count? I suspect the answer is ‘you can’t do that’, which is fine. I’m not passing a value judgment. Happy to be proven wrong.

Not deep into the emoji stuff, but I think something along these lines: using the OCaml standard library you can easily fold over the Uchar.t values from a string and recognize the grammar of emoji sequences with the help of the Uucp.Emoji module. Now these sequences will count as 1 “character”. With the caveats that 1) not all emoji sequences are permissible (see the linked document for details, e.g. the ZWJ ones are listed here) and 2) not all sequences may be supported by your font/text rendering engine and thus might not render as 1 “character”.

Also, now that I remember, at some point the grapheme cluster definition was changed in order not to break emoji sequences (see the rules GB11-GB13 here). So using grapheme clusters could help for your use case (with all the caveats remaining).
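To make the idea concrete, here is a stdlib-only approximation (OCaml 4.14 or later). It does not use `Uucp.Emoji` and it is not the UAX #29 algorithm: in place of the real emoji grammar it hard-codes three assumptions, namely that a ZWJ, the scalar right after a ZWJ, and variation selectors or skin-tone modifiers all stay attached to the previous “character”. The name `segments` is made up for this sketch.

```ocaml
(* Split a UTF-8 string into approximate "characters". Ignores combining
   marks, Hangul, regional indicators and everything else UAX #29 covers. *)
let segments (s : string) : string list =
  let is_extender c =
    c = 0x200D (* ZWJ *)
    || (c >= 0xFE00 && c <= 0xFE0F) (* variation selectors *)
    || (c >= 0x1F3FB && c <= 0x1F3FF) (* skin-tone modifiers *)
  in
  (* [start]: byte offset where the current segment began.
     [after_zwj]: whether the previous scalar was a ZWJ. *)
  let rec go i start after_zwj acc =
    if i >= String.length s then
      List.rev (if i > start then String.sub s start (i - start) :: acc else acc)
    else
      let d = String.get_utf_8_uchar s i in
      let c = Uchar.to_int (Uchar.utf_decode_uchar d) in
      let next = i + Uchar.utf_decode_length d in
      if i = start || is_extender c || after_zwj then
        go next start (c = 0x200D) acc (* extend the current segment *)
      else
        (* close the current segment and start a new one at [i] *)
        go next i (c = 0x200D) (String.sub s start (i - start) :: acc)
  in
  go 0 0 false []

let () =
  (* U+1F926, ZWJ, U+2642, U+FE0F: one "character" under these assumptions *)
  let facepalm = "\u{1F926}\u{200D}\u{2642}\u{FE0F}" in
  Printf.printf "%d\n" (List.length (segments facepalm))
```

Returning the segments rather than just a count also gives safe cut points for truncation, since a cell can then be shortened without splitting a ZWJ sequence in the middle.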

That is what I am currently doing. See the last code snippet in my message Feedback / Help Wanted: Upcoming OCaml.org Cookbook Feature - #17 by yawaramin

It seems a Text Terminal Group was created at the Unicode level last year and it is not dead. Maybe something could come out of that in the future.

If you have concrete proposals, please let me know, since I participate in that group’s discussions.

Well, I’m not that much into terminal interfaces, and I’m sure experts in this will have better ideas. But basically, if you had an escape sequence that would allow you to output a UTF-8 encoded character sequence and the terminal, rather than rendering it, would respond with the number of columns that this sequence would take up on screen (e.g. via a terminal attribute, the same way you can get the window size via TIOCGWINSZ), that would help all these people who are trying to align stuff in the terminal.