"OCaml -- first impressions"

In what way are other Unicode implementations broken? Are you saying that C# and Java are broken?

I don’t think I said that they’re “broken.” I’m not even familiar with the Java and C# implementations. I do have some disappointments with the Unicode character database representation in ICU (because I’ve seen what happens when you decide that OpenCFLite is a good idea in resource-constrained embedded firmware applications), but I wouldn’t describe even that as “broken,” just “suboptimal.”

Daniel wrote:

But the reality is that in the set of languages out there that do have a type for Unicode strings in their standard library, very few of them have a non-broken one (the only ones I know for sure to have a non-broken one are Swift and Rust).

In what way are other Unicode implementations broken? Are you saying that C# and Java are broken?

Sorry, James. I was trying to reply to something Daniel wrote ages ago. This UI is completely alien to me!

Uh no, don’t jump to conclusions so fast… At most I’m saying their string data structure model is broken.

I don’t know what C# uses, but as far as Java is concerned, IIRC it represents Unicode text as arrays of UTF-16 encoded data with UTF-16 code unit indexing.

The latter is exactly the problem behind all these data structures: they don’t allow you to work with Unicode text (i.e. Unicode scalar values); they allow you to work with an encoding of Unicode text, which is brittle, hard to use and surprising for the programmer. It is brittle because it’s very easy to programmatically break the encoding, which results in invalid Unicode data. It’s hard to use and surprising because indexing maps neither to Unicode scalar values nor to grapheme clusters.

In other words, I’m saying this model is broken because it lacks abstraction. As a programmer you want to work with the data, not with an encoding of the data. The notion of encoding is only needed at the IO boundary of your programs.
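To make the brittleness concrete, here is a minimal stdlib-only sketch (it assumes OCaml >= 4.14 for `String.get_utf_8_uchar`, and the string is just an example). OCaml’s native strings are byte sequences, so the issue shows up with UTF-8 byte indices rather than UTF-16 code units, but the point is the same: indices address the encoding, not the text.

```ocaml
let s = "caf\xc3\xa9"            (* "café": 4 scalar values, 5 bytes *)

let () =
  assert (String.length s = 5);  (* byte count, not character count *)
  (* Naively taking "the first four characters" by byte index cuts the
     two-byte 'é' in half and yields invalid UTF-8. *)
  let broken = String.sub s 0 4 in
  (* Decoding at the cut point reports an invalid decode. *)
  let d = String.get_utf_8_uchar broken 3 in
  assert (not (Uchar.utf_decode_is_valid d))
```

Nothing in the type of the sliced string distinguishes it from valid UTF-8; that missing distinction is exactly the missing abstraction.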

To make an analogy, suppose I allowed you to work with arrays of 64-bit integers, but only the integers smaller than, say, (2^64)/2 can be indexed directly. If your integer is larger than this value, you need two array cells to represent it. Would you like to work and compute with such a data structure?
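A toy version of that analogy (the encoding and threshold below are made up purely for illustration): small values take one cell, larger values are split across a High/Low pair, and cell indexing stops matching value indexing, just like UTF-16 code units vs scalar values.

```ocaml
(* Toy model: small values take one cell, large values are split into a
   High/Low pair of cells (a "surrogate pair" of sorts). *)
type cell = Small of int | High of int | Low of int

let encode vs =
  List.concat_map
    (fun v ->
       if v < 1 lsl 30 then [ Small v ]
       else [ High (v lsr 30); Low (v land ((1 lsl 30) - 1)) ])
    vs

let cells = encode [ 1; (1 lsl 40) + 5; 2 ]

let () =
  (* Three logical values, but four cells: List.nth cells 2 is not the
     third value, it is the low half of the second one. *)
  assert (List.length cells = 4)
```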


@dbuenzli, you have put your finger on what annoys me about Unicode in almost every language I’ve worked with. What do you see as the optimal abstractions for daily work with Unicode data?

Also @perry, I think these models of Unicode strings based on encodings have basically confused an entire generation of programmers about what Unicode really is and how it works.

Regarding your question, the answer depends a bit on what you are doing. I think at least two useful representations are simple sequences of scalar values and sequences of grapheme clusters (what Swift does).

Grapheme cluster sequences are mainly useful for human interaction, for example when you need to align text visually or let the user interact with text (e.g. cursor movement). But then, according to some people on the Unicode mailing list, the notion of grapheme clusters as defined by UAX #29 is completely useless on certain scripts.

Sequences of scalar values are really for basic text processing (sometimes to be done under a given normal form assumption), e.g. search, sorting, etc.
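As a rough stdlib-only sketch of the scalar value representation (the `decode_utf_8` helper is a made-up name, and it needs OCaml >= 4.14): decode once at the boundary into a `Uchar.t array` and do the processing on scalar values.

```ocaml
(* Decode a UTF-8 string into a sequence of scalar values once, at the
   boundary, then work on Uchar.t values. Malformed bytes decode to
   U+FFFD via the stdlib's replacement behaviour. *)
let decode_utf_8 (s : string) : Uchar.t array =
  let rec loop i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      loop (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  Array.of_list (loop 0 [])

let () =
  (* "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT: one user-perceived
     character (grapheme cluster), but two scalar values. *)
  let u = decode_utf_8 "e\u{0301}" in
  assert (Array.length u = 2);
  (* Basic processing, e.g. searching for a scalar value, is now direct. *)
  assert (Array.exists (fun c -> Uchar.equal c (Uchar.of_int 0x0301)) u)
```

Getting actual grapheme cluster boundaries on top of this would need a UAX #29 segmenter (e.g. the Uuseg library).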

I have played a bit with a design based on polymorphic persistent vectors which you can consult here.

What I like about this proposal is the processing uniformity you get when you switch from a “Unicode string” whose type is Uchar.t Pvec.t to a decomposition of the string according to various units (grapheme clusters, paragraphs, lines), which becomes Uchar.t Pvec.t Pvec.t.

I hope to get back to it at some point to bring that to release so that people can play with it.
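Without knowing Pvec’s actual API, the shape of that uniformity can be sketched with plain arrays standing in for `Pvec.t` (the line segmentation below is a naive stand-in, not UAX #14/#29):

```ocaml
(* Arrays as a stand-in for Pvec.t, only to show the shape of the idea:
   a "string" is a Uchar.t array, and a decomposition into lines (or
   grapheme clusters, paragraphs, ...) is a Uchar.t array array. *)
type ustring = Uchar.t array
type segments = ustring array

(* Naive line segmentation on U+000A, standing in for a real
   segmentation-based decomposition. *)
let lines (u : ustring) : segments =
  let nl = Uchar.of_int 0x000A in
  let buf = ref [] and cur = ref [] in
  Array.iter
    (fun c ->
       if Uchar.equal c nl then (buf := List.rev !cur :: !buf; cur := [])
       else cur := c :: !cur)
    u;
  buf := List.rev !cur :: !buf;
  Array.of_list (List.rev_map Array.of_list !buf)

(* The same kind of traversal works at both levels of nesting. *)
let total_scalars (segs : segments) =
  Array.fold_left (fun acc s -> acc + Array.length s) 0 segs
```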
