First off, just to be clear, I’m not trying to be confrontational; I just feel like you’re expecting a clear and correct solution to a rather fuzzy problem in a messy domain. And I happen to prefer the forum format to, say, chatting on Discord about it, even if it feels a bit colder sometimes.
Huh? Of course your use case is valid; when did I claim otherwise? I was addressing the article there, not you. I’m just pointing out that there’s no single definition of a character count, not even with grapheme clusters, if only because there’s no single definition of grapheme clusters. I’m not saying you can’t or shouldn’t count characters; I’m saying you should be aware that you’re making a choice about how to do so, and that the defaults don’t necessarily meet your requirements.
Of course, if you don’t want to think about it, the easy solution is to throw default grapheme clusters at it; at least the count will be more sensible than Google Docs’, what the heck…
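To put numbers on that, here’s a quick Rust sketch (using the unicode-segmentation crate for the default extended grapheme clusters); the point is just that one short string already has several perfectly defensible “lengths” depending on which unit you pick:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // MAN, ZWJ, WOMAN, ZWJ, BOY — the family emoji as a ZWJ sequence
    let s = "👨\u{200D}👩\u{200D}👦";

    println!("UTF-8 bytes:        {}", s.len());                   // 18
    println!("UTF-16 code units:  {}", s.encode_utf16().count());  // 8
    println!("Unicode scalars:    {}", s.chars().count());         // 5
    // 1 with the default extended clusters and an emoji-aware Unicode version
    println!("grapheme clusters:  {}", s.graphemes(true).count());
}
```

None of those four numbers is wrong; they just answer different questions.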
They’re apparently trying to measure the visual width of the emoji, without any parsing of ZWJs, and reporting that as the character count: a single-scalar emoji reports two characters, a two-emoji ZWJ sequence reports 5 (2 * 2-wide plus a ZWJ), and the family emoji written as a ZWJ sequence (not the single family scalar) reports 8 (3 * 2-wide plus 2 ZWJs). Weird.
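If I had to guess at the rule behind those numbers, it looks something like this toy sketch; `looks_like_wide_emoji` is a made-up helper just for illustration, and this is only my reading of the behaviour, not their actual code:

```rust
// Guessed rule: 2 per visually wide emoji scalar, 1 per ZWJ (and per anything
// else), with no attempt to treat a ZWJ sequence as a single unit.
fn guessed_docs_count(s: &str) -> usize {
    s.chars()
        .map(|c| match c {
            '\u{200D}' => 1, // the ZWJ itself gets counted
            c if looks_like_wide_emoji(c) => 2,
            _ => 1,
        })
        .sum()
}

/// Hypothetical helper covering only the common emoji blocks; a real
/// implementation would consult the Emoji_Presentation property instead.
fn looks_like_wide_emoji(c: char) -> bool {
    matches!(c, '\u{1F300}'..='\u{1FAFF}' | '\u{2600}'..='\u{27BF}')
}
```

With that rule, a single wide emoji comes out as 2, a two-emoji ZWJ sequence as 2 + 1 + 2 = 5, and the three-person family sequence as 2 + 1 + 2 + 1 + 2 = 8, which matches the counts above.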
It’s been a while since I read that article, so I may have judged it too harshly from memory: it does introduce multiple topics about Unicode, and its technical explanations are correct. But it does assume that there is a single worthwhile definition of grapheme clusters, that this definition of grapheme clusters is preferable to scalars as a unit for text processing, and that counting these clusters gives the only sensible definition of a string’s length (whatever that word means at this point). All of this without any backing rationale other than counts being human-friendly (a single use case) and a strawman about substrings.
This isn’t a viewpoint unique to the author; it’s a popular bandwagon lately, and one that’s already leading to some rather awkward APIs by missing the point of text segmentation.
Rather than failing to make my points clear again, I’ll share here the Unicode proposal “Setting expectations for grapheme clusters”, written by someone who has worked on and researched this more than I have. It formally proposes that the committee drop from the annex the assumption that grapheme clusters actually correspond to user-perceived characters, and clarify that tailorings are generally needed. You’ll hopefully find it better argued than my posts, or at least less adversarial.
Interestingly, it also points out that CSS Text Module Level 3 expects user agents to provide language-specific text segmentation tailorings depending on the content type and the task. It gives a few examples of grapheme-cluster tailorings and suggests that different purposes, such as line-breaking, letter-spacing, and vertical typesetting, call for different ones.
I don’t think I’ve currently got enough free time to give it a try, but hopefully in the future I’ll get to write a bunch about technical stuff I’ve learned, yes!