What are the biggest reasons newcomers give up on OCaml?

@dbuenzli Ah indeed I got the “code point” terminology wrong, thanks for correcting me. :slight_smile:

The problem is that the standard actually defines good and precise terminology but no one uses it, including the people who define the standard themselves. In any case I always suggest that people who are confused about all this read my minimal Unicode introduction.

Since it seems people are having fun discussing what a good Unicode text data structure would look like, I’d add (or likely repeat) my two cents.

First it should be stressed that for many programs just passing around UTF-8 encoded string values is entirely good enough, all the more so since structural text operations (e.g. splitting on a comma) often happen on US-ASCII code points, which are represented by themselves in UTF-8 bytes.
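A minimal sketch of that point, using only the standard library (the sample string is made up for illustration): splitting on the comma byte never cuts a multi-byte UTF-8 sequence, because bytes below 0x80 only ever encode US-ASCII code points.

```ocaml
(* Split UTF-8 text on an ASCII delimiter without decoding anything.
   The byte 0x2C (',') cannot occur inside a multi-byte UTF-8 sequence,
   so every field is still well-formed UTF-8. *)
let fields = String.split_on_char ',' "caf\u{E9},na\u{EF}ve,\u{65E5}\u{672C}\u{8A9E}"

(* Print one field per line. *)
let () = List.iter print_endline fields
```

Checking the fields with `String.is_valid_utf_8` (OCaml 4.14 or later) should confirm that no sequence was broken.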

For tasks that need more sophisticated Unicode processing, I think it would be nice to have in OCaml’s standard library a good and efficient all-round polymorphic immutable persistent vector 'a Pvec.t.

Then you can define Unicode text as being:

type utext = Uchar.t Pvec.t

Sure, that’s not memory efficient, but you only use it when you actually need to munge your UTF-8 strings for Unicode-heavy processing. This indexes your Unicode data by Unicode scalar values.
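To make the indexing concrete, here is a rough sketch that decodes a UTF-8 string into scalar values. Since Pvec does not exist, a plain array stands in for it here, and `utext_of_utf_8` is a hypothetical name. It assumes OCaml 4.14 or later for `String.get_utf_8_uchar`.

```ocaml
(* Stand-in for the hypothetical [Uchar.t Pvec.t]: a plain array. *)
type utext = Uchar.t array

(* Decode a UTF-8 string into its Unicode scalar values. Malformed bytes
   decode to U+FFFD, which is what [Uchar.utf_decode_uchar] returns for an
   invalid decode. *)
let utext_of_utf_8 (s : string) : utext =
  let rec loop acc i =
    if i >= String.length s then Array.of_list (List.rev acc) else
    let d = String.get_utf_8_uchar s i in
    loop (Uchar.utf_decode_uchar d :: acc) (i + Uchar.utf_decode_length d)
  in
  loop [] 0

(* "n" followed by U+00E9: three bytes, but two scalar values. *)
let text = utext_of_utf_8 "n\u{E9}"
```

Index 1 of `text` is then U+00E9, whereas index 1 of the original string is only the first byte of its encoding.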

The nice thing with that representation is that you can then easily apply standard Unicode algorithms like the segmentation ones to get towers of vectors for easy processing while keeping the cost of doing so explicit. So for example if you are interested in grapheme clusters then you do:

let text : utext = …
let graphs : utext Pvec.t = Utext.segments `Grapheme_cluster text

So your functions acting on grapheme clusters take utext Pvec.t and now your indexes correspond to grapheme clusters.

This all combines and composes nicely. You can first break into paragraphs:

let text : utext = …
let paragraphs : utext Pvec.t = Utext.paragraphs text

And then into paragraphs of grapheme clusters:

let gc_paragraphs : utext Pvec.t Pvec.t =
  Pvec.map (Utext.segments `Grapheme_cluster) paragraphs

Now your first level of indexing corresponds to paragraphs, the second one to grapheme clusters and the last one to scalar values.
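The shape of such a tower can be sketched with plain arrays standing in for Pvec. Everything below is hypothetical: `split_on` is a toy segmenter that splits on a separator scalar value, not one of the Unicode segmentation algorithms. Line feeds play the role of paragraph breaks and spaces the role of the second-level boundaries, just to show how the index levels stack.

```ocaml
(* Arrays stand in for the hypothetical Pvec. *)
type utext = Uchar.t array

(* ASCII-only helper for the demo: each byte becomes one scalar value. *)
let utext_of_ascii (s : string) : utext =
  Array.init (String.length s) (fun i -> Uchar.of_char s.[i])

(* Toy segmenter: split on scalar values satisfying [is_sep], dropping them.
   This is NOT Unicode segmentation, only an illustration of the types. *)
let split_on (is_sep : Uchar.t -> bool) (t : utext) : utext array =
  let flush cur acc = Array.of_list (List.rev cur) :: acc in
  let step (cur, acc) u =
    if is_sep u then ([], flush cur acc) else (u :: cur, acc)
  in
  let cur, acc = Array.fold_left step ([], []) t in
  Array.of_list (List.rev (flush cur acc))

let is_char c u = Uchar.equal u (Uchar.of_char c)

let text = utext_of_ascii "ab cd\nef"

(* First level: "paragraphs". Second level: segments inside each one. *)
let paragraphs : utext array = split_on (is_char '\n') text
let tower : utext array array = Array.map (split_on (is_char ' ')) paragraphs
```

With this, `tower.(0).(1).(0)` reads as: first paragraph, second segment, first scalar value. The cost of building each level stays explicit in the calls that produce it.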

That is the idea behind the design of utext, which I never got around to finishing (also for lack of an actual strong need).
