Feedback / Help Wanted: Upcoming OCaml.org Cookbook Feature

Maybe we should split this into a different thread, but:

They shouldn’t: UTF-8 names the encoding, not all of Unicode. I’m also noticing a big cargo cult around “grapheme clusters” as the smallest unit of Unicode text, when that’s very rarely what one wants and supporting them adds a lot of cross-cutting concerns.
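To make the distinction concrete, here is a minimal sketch using only the standard library, assuming OCaml 4.14 or later for `String.get_utf_8_uchar`. The helper name `utf_8_scalar_count` is mine, not something from the cookbook. It shows that handling UTF-8 gets you bytes and scalar values, and nothing more:

```ocaml
(* Count the Unicode scalar values in a UTF-8 encoded string.
   Assumes OCaml >= 4.14 (String.get_utf_8_uchar, Uchar.utf_decode_length).
   [utf_8_scalar_count] is a hypothetical helper name for illustration. *)
let utf_8_scalar_count (s : string) : int =
  let rec loop i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      (* An invalid sequence still consumes at least one byte and is
         counted here as one (replacement) scalar value. *)
      loop (i + Uchar.utf_decode_length d) (n + 1)
  in
  loop 0 0

let () =
  (* "e" followed by U+0301 COMBINING ACUTE ACCENT *)
  let s = "e\u{0301}" in
  Printf.printf "bytes: %d, scalar values: %d\n"
    (String.length s) (utf_8_scalar_count s)
  (* prints "bytes: 3, scalar values: 2" *)
```

Nothing in the decoder knows that a reader sees those two scalar values as a single character. That answer needs the Unicode property tables and a segmentation algorithm on top, which is a separate dependency with its own versioning.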

For example, what do you mean by “grapheme cluster”? The default extended boundaries? The Unicode annex (UAX #29) states that implementations “can and should” tailor the defaults (emphasis theirs). How do we add tailorings? How do we test that our tailorings are compatible with our word/sentence/line boundaries? Does the boundary API support random access, and if it does, can we limit the bounds to search for a safe start? And finally, which version of the Unicode property tables will your app support, considering they can carry breaking changes, and why should it depend on your version of OCaml/ICU/uucp (pick your poison)?

The purpose of the segmentation annex is to provide defaults, meant to be tailored, for finding the boundaries of sentences, words and grapheme clusters within a larger text for some given purpose. Grapheme clusters are not special compared to the other kinds of boundaries, and they aren’t meant to redefine the smallest unit of Unicode strings.
