"OCaml -- first impressions"

In what way are other Unicode implementations broken? Are you saying that C# and Java are broken?

I don’t think I said that they’re “broken.” I’m not even familiar with the Java and C# implementations. I do have some disappointments with the Unicode character database representation in ICU (because I’ve seen what happens when you decide that OpenCFLite is a good idea in resource-constrained embedded firmware applications), but I wouldn’t describe even that as “broken,” just “suboptimal.”

Daniel wrote:

But the reality is that in the set of languages out there that do have a type for Unicode strings in their standard library, very few of them have a non-broken one (the only ones I know for sure to have a non-broken one are Swift and Rust).

In what way are other Unicode implementations broken? Are you saying that C# and Java are broken?

Sorry, James. I was trying to reply to something Daniel wrote ages ago. This UI is completely alien to me!

Uh no, don’t jump to conclusions so fast… At most I’m saying their string data structure model is broken.

I don’t know what C# uses, but as far as Java is concerned, IIRC it represents Unicode text as arrays of UTF-16 encoded data with UTF-16 code unit indexing.

The latter is exactly the problem behind all these data structures: they don’t allow you to work with Unicode text (i.e. Unicode scalar values); they allow you to work with an encoding of Unicode text, which is brittle, hard to use and surprising for the programmer. It is brittle because it’s very easy to programmatically break the encoding, which results in invalid Unicode data. It’s hard to use and surprising because indexing maps neither to Unicode scalar values nor to grapheme clusters.

In other words, I’m saying this model is broken because it lacks abstraction. As a programmer you want to work with the data, not with an encoding of the data. The notion of encoding is only needed at the IO boundary of your programs.
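To make the brittleness concrete, here is a minimal stdlib-only sketch (it assumes OCaml >= 4.14 for `String.get_utf_8_uchar`, and the string is just an example). OCaml’s native strings are byte sequences, so the issue shows up with UTF-8 byte indices rather than UTF-16 code units, but the point is the same: indices address the encoding, not the text.

```ocaml
let s = "caf\xc3\xa9"            (* "café": 4 scalar values, 5 bytes *)

let () =
  assert (String.length s = 5);  (* byte count, not character count *)
  (* Naively taking "the first four characters" by byte index cuts the
     two-byte 'é' in half and yields invalid UTF-8. *)
  let broken = String.sub s 0 4 in
  (* Decoding at the cut point reports an invalid decode. *)
  let d = String.get_utf_8_uchar broken 3 in
  assert (not (Uchar.utf_decode_is_valid d))
```

Nothing in the type of the sliced string distinguishes it from valid UTF-8; that missing distinction is exactly the missing abstraction.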

To make an analogy, suppose I allowed you to work with arrays of 64-bit integers, but only the integers smaller than, say, (2^64)/2 can be indexed directly. If your integer is larger than this value, you need two array cells to represent it. Would you like to work and compute with such a data structure?
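A toy version of that analogy (the encoding and threshold below are made up purely for illustration): small values take one cell, larger values are split across a High/Low pair, and cell indexing stops matching value indexing, just like UTF-16 code units vs scalar values.

```ocaml
(* Toy model: small values take one cell, large values are split into a
   High/Low pair of cells (a "surrogate pair" of sorts). *)
type cell = Small of int | High of int | Low of int

let encode vs =
  List.concat_map
    (fun v ->
       if v < 1 lsl 30 then [ Small v ]
       else [ High (v lsr 30); Low (v land ((1 lsl 30) - 1)) ])
    vs

let cells = encode [ 1; (1 lsl 40) + 5; 2 ]

let () =
  (* Three logical values, but four cells: List.nth cells 2 is not the
     third value, it is the low half of the second one. *)
  assert (List.length cells = 4)
```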


@dbuenzli, you have put your finger on what annoys me about Unicode in almost every language I’ve worked with. What do you see as the optimal abstractions for daily work with Unicode data?

Also @perry, I think these models of Unicode strings based on encodings have basically confused an entire generation of programmers about what Unicode really is and how it works.

Regarding your question, the answer depends a bit on what you are doing. I think at least two useful representations are simple sequences of scalar values and sequences of grapheme clusters (what Swift does).

Grapheme cluster sequences are mainly useful for human interaction, for example when you need to align text visually or let the user interact with text (e.g. cursor movement). But then, according to some people on the Unicode mailing list, the notion of grapheme clusters as defined by UAX #29 is completely useless on certain scripts.

Sequences of scalar values are really for basic text processing (sometimes to be done under a given normal form assumption), e.g. search, sorting, etc.
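As a rough stdlib-only sketch of the scalar value representation (the `decode_utf_8` helper is a made-up name, and it needs OCaml >= 4.14): decode once at the boundary into a `Uchar.t array` and do the processing on scalar values.

```ocaml
(* Decode a UTF-8 string into a sequence of scalar values once, at the
   boundary, then work on Uchar.t values. Malformed bytes decode to
   U+FFFD via the stdlib's replacement behaviour. *)
let decode_utf_8 (s : string) : Uchar.t array =
  let rec loop i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      loop (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  Array.of_list (loop 0 [])

let () =
  (* "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT: one user-perceived
     character (grapheme cluster), but two scalar values. *)
  let u = decode_utf_8 "e\u{0301}" in
  assert (Array.length u = 2);
  (* Basic processing, e.g. searching for a scalar value, is now direct. *)
  assert (Array.exists (fun c -> Uchar.equal c (Uchar.of_int 0x0301)) u)
```

Getting actual grapheme cluster boundaries on top of this would need a UAX #29 segmenter (e.g. the Uuseg library).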

I have played a bit with a design based on polymorphic persistent vectors which you can consult here.

What I like about this proposal is the processing uniformity you get when you switch from a “Unicode string” whose type is Uchar.t Pvec.t to a decomposition of the string according to various units (grapheme clusters, paragraphs, lines), which becomes Uchar.t Pvec.t Pvec.t.

I hope to get back to it at some point to bring that to release so that people can play with it.
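Without knowing Pvec’s actual API, the shape of that uniformity can be sketched with plain arrays standing in for `Pvec.t` (the line segmentation below is a naive stand-in, not UAX #14/#29):

```ocaml
(* Arrays as a stand-in for Pvec.t, only to show the shape of the idea:
   a "string" is a Uchar.t array, and a decomposition into lines (or
   grapheme clusters, paragraphs, ...) is a Uchar.t array array. *)
type ustring = Uchar.t array
type segments = ustring array

(* Naive line segmentation on U+000A, standing in for a real
   segmentation-based decomposition. *)
let lines (u : ustring) : segments =
  let nl = Uchar.of_int 0x000A in
  let buf = ref [] and cur = ref [] in
  Array.iter
    (fun c ->
       if Uchar.equal c nl then (buf := List.rev !cur :: !buf; cur := [])
       else cur := c :: !cur)
    u;
  buf := List.rev !cur :: !buf;
  Array.of_list (List.rev_map Array.of_list !buf)

(* The same kind of traversal works at both levels of nesting. *)
let total_scalars (segs : segments) =
  Array.fold_left (fun acc s -> acc + Array.length s) 0 segs
```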
