@dbuenzli Ah indeed I got the “code point” terminology wrong, thanks for correcting me.
The problem is that the standard actually defines good and precise terminology but no one uses it, including the very people who define the standard. In any case I always suggest that people who are confused about all this read my minimal Unicode introduction.
Since it seems people are having fun discussing what a good Unicode text data structure would look like, I’d add (or likely repeat) my two cents.
First it should be stressed that for many programs just passing around UTF-8 encoded string values is entirely good enough, even more so since structural text processing (e.g. think of splitting on a comma) often happens on US-ASCII code points, which are represented by themselves in UTF-8 bytes.
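To make the comma example concrete, here is a minimal sketch using only the standard library. It assumes OCaml 4.14 or later (for String.is_valid_utf_8) and a UTF-8 encoded source file; the sample text is made up for illustration. It works because, in UTF-8, bytes below 0x80 never occur inside a multi-byte sequence, so a byte-level split on ',' cannot cut a scalar value in half.

```ocaml
(* Byte-level split on an US-ASCII separator. The input is plain UTF-8;
   no decoding is needed because ',' (0x2C) never appears inside the
   encoding of another scalar value. *)
let fields : string list = String.split_on_char ',' "café,naïve,日本"

(* Each field is still valid UTF-8 (String.is_valid_utf_8 is OCaml >= 4.14). *)
let all_valid : bool = List.for_all String.is_valid_utf_8 fields

let () = List.iter print_endline fields
```

Note that the indices involved here are byte indices, which is fine as long as you only ever cut at US-ASCII boundaries.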
For tasks that need more sophisticated Unicode processing, I think it would be nice to have in OCaml’s standard library a good and efficient all-round polymorphic immutable persistent vector 'a Pvec.t.
Then you can define Unicode text as being:
type utext = Uchar.t Pvec.t
Sure that’s not memory efficient but you only use that when you actually need to munge your UTF-8 strings for Unicode heavy processing. This indexes your Unicode data by Unicode scalar values.
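As a rough sketch of the conversion step, here is how one might decode a UTF-8 string into a sequence of scalar values with the standard library (OCaml 4.14 or later). Since Pvec does not exist in the standard library, a plain array stands in for it here; the name utext_of_utf_8 is made up for this example. Invalid byte sequences are replaced by U+FFFD, which is what Uchar.utf_decode_uchar returns for an invalid decode.

```ocaml
(* Assumption: a plain array stands in for the hypothetical Uchar.t Pvec.t. *)
type utext = Uchar.t array

(* Decode a UTF-8 string into Unicode scalar values. Malformed input
   decodes to U+FFFD (Uchar.rep) and decoding resumes after it. *)
let utext_of_utf_8 (s : string) : utext =
  let rec loop i acc =
    if i >= String.length s then Array.of_list (List.rev acc)
    else
      let d = String.get_utf_8_uchar s i in
      loop (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  loop 0 []

(* "hé!" is 4 bytes in UTF-8 but 3 scalar values. *)
let sample = "h\xC3\xA9!"
let t : utext = utext_of_utf_8 sample
```

Here t.(1) is U+00E9, whereas sample.[1] is just the first byte of its two-byte encoding. That is the difference between indexing by bytes and indexing by scalar values.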
The nice thing with that representation is that you can then easily apply standard Unicode algorithms like the segmentation ones to get towers of vectors for easy processing while keeping the cost of doing so explicit. So for example if you are interested in grapheme clusters then you do:
let text : utext = …
let graphs : utext Pvec.t = Utext.segments `Grapheme_cluster text
So your functions acting on grapheme clusters take utext Pvec.t and now your indexes correspond to grapheme clusters.
This all combines and composes nicely. You can first break into paragraphs:
let text : utext = …
let paragraphs : utext Pvec.t = Utext.paragraphs text
And then into paragraphs of grapheme clusters:
let gc_paragraphs : utext Pvec.t Pvec.t =
  Pvec.map (Utext.segments `Grapheme_cluster) paragraphs
Now your first level of indexing corresponds to paragraphs, the second one to grapheme clusters and the last one to scalar values.
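To illustrate the three levels of indexing, here is a toy sketch with plain arrays standing in for Pvec. Everything in it is an assumption made for the example: paragraphs only splits on U+000A, and crude_clusters merely attaches combining marks in U+0300..U+036F to the preceding scalar value. That is nowhere near the real Unicode segmentation algorithms (UAX #29); it only shows how the towers of vectors compose.

```ocaml
(* Assumption: plain arrays stand in for the hypothetical Pvec. *)
type utext = Uchar.t array

let utext_of_ints (l : int list) : utext =
  Array.of_list (List.map Uchar.of_int l)

(* Toy paragraph breaker: split at U+000A, dropping the separator. *)
let paragraphs (t : utext) : utext array =
  let flush cur acc = Array.of_list (List.rev cur) :: acc in
  let cur, acc =
    Array.fold_left
      (fun (cur, acc) u ->
        if Uchar.to_int u = 0x000A then ([], flush cur acc)
        else (u :: cur, acc))
      ([], []) t
  in
  Array.of_list (List.rev (flush cur acc))

(* Crude stand-in for grapheme cluster segmentation, NOT UAX #29:
   a combining mark in U+0300..U+036F sticks to what precedes it. *)
let is_combining (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  c >= 0x0300 && c <= 0x036F

let crude_clusters (t : utext) : utext array =
  let flush cur acc =
    if cur = [] then acc else Array.of_list (List.rev cur) :: acc
  in
  let cur, acc =
    Array.fold_left
      (fun (cur, acc) u ->
        if is_combining u && cur <> [] then (u :: cur, acc)
        else ([ u ], flush cur acc))
      ([], []) t
  in
  Array.of_list (List.rev (flush cur acc))

(* "e" + U+0301 (combining acute), line feed, "ab" *)
let text : utext = utext_of_ints [ 0x65; 0x301; 0x0A; 0x61; 0x62 ]

let gc_paragraphs : utext array array =
  Array.map crude_clusters (paragraphs text)
```

With this, gc_paragraphs.(0) is the first paragraph, gc_paragraphs.(0).(0) its first (crude) cluster, and gc_paragraphs.(0).(0).(1) the combining acute accent inside that cluster.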
That is the idea behind the design of utext, which I never got around to finishing (also for lack of an actual strong need).