Incidentally, I did try out Cosmopolitan / Esperanto as mentioned in the other thread. I might have done something wrong, but the executables it generates were not runnable for me! I’m certain there’s a way to make such things work, and I’m sorry if I am the only person who has had a problem, but the fact that “following the instructions to the letter” does not lead to a working result is evidence for the problem I am describing. This thread asks “What are the biggest reasons newcomers give up on OCaml?” and I am answering you. This is one of the biggest reasons newcomers give up. I don’t mean it to chastise the amazing OCaml team, who have done so much wonderful work for which I’m deeply grateful. But I think it is important to be very clear-headed about the distance between OCaml today (which is an excellent language that I love to use) and a language that would not turn away newcomers.
Please, make an issue about that
I agree that is a pain point. I also thought of the same solution as you using a ppx extension, but I imagine it will be quite a bit of work.
I started writing OCaml for work almost 2 years ago. I had written a small amount of OCaml code for assignments in a university course several years before, but I mostly started learning (or relearning) OCaml from the ground up around the time I started this work. Previously, I was mostly a C++ developer.
It took months for me to fully appreciate the benefit of type inference. After all, one usually needs to know the types. But now I love OCaml’s type inference. One revelation was switching from vim (with or without merlin) to VSCode, which shows all of the inferred types. The other thing is that even though early on I often had to add a large number of type annotations to track down the source of a type error, and made mistakes when I forgot what type something was supposed to be, it eventually seemed, on net, to save me cognitive effort while still helping me produce correct code. I also have the feeling that I often create more complex types in OCaml than I did in C++.
I think that is why it annoys me to have to modify mli files. I still want to create and test modules that have private types and functions as I go, rather than having to wait until the end to get the benefit of auto-generating the right mli file. It’s true that C++ prior to modules required separate header files, but it didn’t have significant type inference. The decision to make was public vs. private (or protected). Making a header file (or going the other way and starting a code file from some function prototypes and class declarations) was an exercise in copy and paste, which in retrospect was also bothersome. But consider C# (and I assume, also, F#), which has access specifiers, “public”, “private”, etc., but doesn’t require separate interface files. Being able to see the interface at a glance is probably beneficial, but it shouldn’t mean I have to rewrite it manually every time it changes. Or that I have to adjust my creative process so that I have the interface fixed in stone before I start writing any other code.
Part of my process is moving functions around modules and files. Since modules are generally much looser than classes, functions can belong in multiple places, and finding their correct ‘home’ is often an iterative process. Having to update the .mli every time places a higher burden on that, which is why I tend to only use .mli files when I absolutely need to, and at the end of a project.
I also tend to use Python’s convention for private functions, which works surprisingly well; I borrowed this usage in OCaml from @c-cube. Basically, any private function should start with an underscore. While it is just a convention, it turns out you don’t need much more than a convention for handling privacy of functions. This allows you to still access those functions for testing without getting involved in complex solutions.
This conflicts with the functionality of disabling “unused identifier” warnings.
True but it’s fine for function names.
I think there needs to be more editor support for editing mli files; it would solve the issue you have.
I tried to implement some of it, but it was way more complex than expected and didn’t really work that well; maybe I will try again at some point.
Seems like a good time to link to a nice post by @CraigFe on a trick to address the ml/mli synchronization issue until interfaces become more stable.
(back to the topic of Unicode, sorry)
First, about the visibility given to UTF-8-capable functions: I wasn’t only referring to the String documentation proper, but also to the larger standard library documentation, the manual, and more generally to available learning resources.
Now for specifics, restricting to the official documentation. On the byte/character front, the sentence from the String doc you’re quoting is rather clear indeed, but it is assigning a non-standard meaning to an otherwise-known word (“character”). Which is problematic because this non-standard meaning is used pervasively in places that are not hierarchically below the String module (like in the doc-string of Stdlib.input, or much earlier in the language manual). You have to already know that this definition exists, in this specific place in the documentation.
On the visibility of the UTF functions: the UTF codecs exist, but they are not alluded to from anywhere; not even in the header of String, in the paragraph about UTF-8 (so upon reading that paragraph I might be tempted to believe that, in the purest C/OCaml tradition, UTF-8 is allowed but no function is actually provided to deal with it). In Bytes there is not even a discussion of Unicode, and the UTF functions are buried below a pack of arcane unsafe stuff. Also, there is no UTF-capable input/output function.
Luckily some bits of my concerns with the documentation are quick to fix, so I turned the easy bits into a constructive PR. But I don’t have much time for a more in-depth rewrite.
I would have loved to contribute Unicode support to re (the regex library). I find this to be one of the significant missing pieces w.r.t. Unicode (because you can’t just implement it yourself). Unfortunately I expect it to be a (fun but) somewhat complex task, and my spare time is allotted to other things.
Perhaps Python has other issues I’m not aware of, but it seems to me that the specific issue you’re pointing out is that Unicode literals '\uXXXX' have the surprising semantics of allowing the creation of UTF-16 surrogates (i.e. invalid code points). I would expect the syntax '\uD800' to trigger an error. Isn’t that easy to fix?
(Unicode) string-indexing presupposes that you’re representing your unicode string as an array of unicode code-points, right? That’s wasteful of memory, isn’t it?
(Diverging from the initial topic more and more:) Not necessarily. As I said, Python’s str has constant-time indexing thanks to a fixed-width encoding, but it adjusts the character width depending on the contained data (see the PEP): either Latin-1, UCS-2 or UCS-4, i.e. 1 byte, 2 bytes or 4 bytes per code point. Since code points greater than U+FFFF are rare (essentially: rare/ancient CJKV ideograms, antique or endangered writing systems, or fancy emojis), you’d rarely (if ever) resort to UCS-4. Even if you do need the full Unicode range, and space is a concern, you may implement a denser packing than UCS-4 (because Unicode code points are in fact 21-bit integers, not 32-bit).

Also, one can imagine using a variable-width encoding such as UTF-16, but with each string you would maintain the set of (codepoint-wise) indexes where several “coding units” are used; so codepoint-wise indexing would be logarithmic-time in the worst case, and constant-time in the common case where you have no (or no more than a fixed number of) large characters. You can do the same with UTF-8 if you’re anticipating that most of your characters would be ASCII. If you’re serious about large portions of text you want to move around, copy, cut, concatenate, share… you would use ropes or something like that, so you would end up with that kind of indexing-by-searching-in-a-tree, anyway.
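To make the variable-width-plus-index idea concrete, here is a small Python sketch (the class name `IndexedUtf8` and the whole design are hypothetical, not any existing library): we precompute, for each multi-byte character in a UTF-8 string, its code-point index and the cumulative count of extra bytes before it. Indexing is then a binary search over the multi-byte characters only, so it is constant-time on pure-ASCII strings and logarithmic otherwise.

```python
import bisect

class IndexedUtf8:
    """UTF-8 string with O(log k) code-point indexing, where k is the
    number of multi-byte characters (O(1) when the string is pure ASCII)."""

    def __init__(self, s: str):
        self.data = s.encode("utf-8")
        self.cp_index = []  # code-point indices of multi-byte characters
        self.extra = []     # cumulative extra bytes up to and including each
        total = 0
        for i, ch in enumerate(s):
            n = len(ch.encode("utf-8"))
            if n > 1:
                total += n - 1
                self.cp_index.append(i)
                self.extra.append(total)

    def __getitem__(self, i: int) -> str:
        # Count multi-byte characters strictly before code point i,
        # then shift the byte offset by their accumulated extra bytes.
        k = bisect.bisect_left(self.cp_index, i)
        start = i + (self.extra[k - 1] if k else 0)
        # Decode one code point from its UTF-8 leading byte.
        first = self.data[start]
        n = 1 if first < 0x80 else 2 if first < 0xE0 else 3 if first < 0xF0 else 4
        return self.data[start:start + n].decode("utf-8")

t = IndexedUtf8("aéb€c")
print(t[1], t[3])  # é €
```

The auxiliary arrays cost space proportional to the number of non-ASCII characters only, which is the trade-off described above.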
but it is assigning a non-standard meaning to an otherwise-known word (“character”)
Sorry to interject but what do you think that the known word “character” means? Because I don’t know of any well-known notion of characters.
… which is why I carefully wrote “code points” everywhere in my previous message. Because, indeed, “character” has no obvious formal meaning, and will be understood by readers in a variety of informal, blurry ways. I believe “8-bit portions of encoded text” is not among the natural or common expectations around the word, though.
You can do the same with UTF-8 if you’re anticipating that most of your characters would be ASCII. If you’re serious about large portions of text you want to move around, copy, cut, concatenate, share… you would use ropes or something like that, so you would end up with that kind of indexing-by-searching-in-a-tree, anyway.
What you’re proposing is a new data-type, and you want to call it “string” – to push aside the already-existing datatype with that name. And you’re proposing quite a bit of structure and associated memory-overhead for the extra metadata. Yes, Python has that, but then again, Python is unsuitable for programs manipulating large, complex data-structures, due to its per-object overhead. But more important: you can already do everything you want to do and show that it’s a great idea! Why not just do it?
Re: “character” – I remember when Java arrived, and no, absolutely not did we think of “character” as automatically “unicode codepoint”. We thought of it as “byte”. I mean, by your argument, C/C++ should also change, no?
Look: I understand that OCaml’s unicode support might not be the best. But this is a fixable problem, and can be addressed without involving the core developers and core system. Do it, get it done, show the world that you’ve got a better solution. Otherwise, it feels like you’re asking somebody to change without proof that the change will be better.
After writing my response, I thought I should try again, because I pushed back pretty hard. What I really want to point out is that you’re proposing a much-more-complex “string” type (actually, you’re proposing several such), and this will have a significant performance impact on existing programs. As I’ve noted several times over the years, when I started working with Java in 1995, I implemented a “byte string tower” (BString/BStringBuffer, etc.) and did so for performance. And of course, a cursory analysis of the Java heap shows that you can also recover an absolutely insane amount of memory by doing so.
Systems-jocks have relied on the performance characteristics of the unadorned “string” for decades, and if you’re going to change that type, you’re going to need to demonstrate that it’s not deleterious. You could implement what you propose, sufficiently completely to demonstrate its utility and lack of bad side-effects. Then people could evaluate it.
But (speaking as a systems-jock) mimicking Java’s String isn’t a good idea: there needs to remain a byte-based core.
UTF-16 surrogates (i.e. invalid code points)
UTF-16 surrogates are valid code points. They are invalid scalar values. AFAIR python strings represent sequences of Unicode code points and that’s not a good model of text since it means that you embed the UTF-16 encoding space into your Unicode string and leads to the problems I linked to earlier.
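The code-point-versus-scalar-value distinction can be checked directly in any standard Python interpreter: a lone surrogate is a legal element of a str (so str really does model code points), but it cannot be encoded as well-formed UTF-8.

```python
s = "\ud800"  # a lone UTF-16 high surrogate: a valid code point,
              # but not a Unicode scalar value
print(len(s))  # 1: Python happily stores it in a str

try:
    s.encode("utf-8")  # surrogates have no well-formed UTF-8 encoding
except UnicodeEncodeError as e:
    print("not encodable:", e.reason)
```

This is precisely the embedding of the UTF-16 encoding space into the string type that causes the problems mentioned above.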
@Chet_Murthy I must say I’m very confused now, you mentioned re-implementing strings and being concerned about space consumption of it. (to be clear, I wasn’t proposing these ideas for a language-blessed string datatype that the entire world would use)
I mean, by your argument, C/C++ should also change, no?
And indeed the situation in C is very confused. Except C/C++ is much older, set in stone, so that’s not going to happen; but that’s beside the point.
Re: “character” – I remember when Java arrived, and no, absolutely not did we think of “character” as automatically “unicode codepoint”. We thought of it as “byte”.
Out of curiosity, around that time, did you happen to work in an English-speaking country?
@dbuenzli Ah indeed I got the “code point” terminology wrong, thanks for correcting me.
Except C/C++ is much older,
Um… citation needed…
Luckily some bits of my concerns with the documentation are quick to fix, so I turned the easy bits into a constructive PR.
Nice!
I would have loved to contribute Unicode support to re (the regex library). I find this to be one of the significant missing pieces w.r.t. Unicode (because you can’t just implement it yourself). Unfortunately I expect it to be a (fun but) somewhat complex task, and my spare time is allotted to other things.
Looks like there is an issue but no recent/active work. It does seem like an important one.
“a language-blessed string type”
But if you don’t want to change the meaning of string, then you can already do what you want today, right?
C/C++ is much older
Caml (the “heavy” implementation) dates to the 80s; Caml Light (the lineal ancestor of OCaml) came out in 1991. Caml Light predates Unicode, practically speaking (maybe somebody was coming up with a standard in 1991, but it was a dream in some standards committee’s eyes).
did you happen to work in an English-speaking country
I started programming in the 1980s, in the USA. But I spent 1991-94 in France (INRIA) and there also, “character” meant “byte”. It was only after Java’s prevalence that people started thinking of “char” as “short”.
As I said, Python’s str has constant-time indexing thanks to a fixed-width encoding, but it adjusts the character width depending on contained data (see the PEP); so either Latin-1, UCS-2 or UCS-4, i.e. 1 byte, 2 bytes or 4 bytes per code point.
But when you have Unicode combining characters, and things like Hangul composable forms, what use does indexing by Unicode code point actually have which would justify the complexity to which you refer? Iterating by whole grapheme cluster might possibly be useful, but indexing could not be constant-time; I believe the Julia language has such a thing, in its ‘graphemes’ function, but I don’t think Python provides that.
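A quick Python demonstration of why code-point indexing cuts through what a reader perceives as one character: a combining sequence is one grapheme but several code points, and plain indexing happily hands back the bare combining mark. (NFC normalization merges this particular pair, but not every grapheme has a precomposed form, so normalization does not fix indexing in general.)

```python
import unicodedata

s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one grapheme, two code points
print(len(s))      # 2: code-point indexing sees two elements
print(repr(s[1]))  # the bare combining mark on its own, rarely useful
print(len(unicodedata.normalize("NFC", s)))  # 1 after composition to 'é'
```

Note also that the standard library’s unicodedata module offers normalization but no UAX #29 grapheme segmentation, consistent with the point above.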
@dbuenzli Ah indeed I got the “code point” terminology wrong, thanks for correcting me.
The problem is that the standard actually defines good and precise terminology, but no one uses it, including the people who define the standard themselves. In any case, I always suggest that people who are confused about all this have a read of my minimal Unicode introduction.
Since it seems people are having fun discussing what a good Unicode text data structure would look like, I’d add (or likely repeat) my two cents.
First it should be stressed that for many programs just passing around UTF-8 encoded string values is entirely good enough, all the more so because structural text properties (e.g. think of splitting on a comma) often hinge on US-ASCII code points, which are represented by themselves in UTF-8 bytes.
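That property of UTF-8 is easy to verify: byte values below 0x80 never occur inside a multi-byte sequence, so splitting raw UTF-8 bytes on an ASCII delimiter can never cut a multi-byte character in half. A small Python check (the same reasoning applies to an OCaml string holding UTF-8):

```python
# Fields containing 2- and 3-byte UTF-8 sequences, separated by ASCII commas.
line = "café,naïve,日本".encode("utf-8")

# Splitting the *bytes* on b"," is safe: 0x2C cannot appear inside
# any UTF-8 continuation or leading byte of a multi-byte sequence.
fields = [f.decode("utf-8") for f in line.split(b",")]
print(fields)  # ['café', 'naïve', '日本']
```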
Regarding something for tasks that need more sophisticated Unicode processing, I think it would be nice to have in OCaml’s standard library a good and efficient all-round polymorphic immutable persistent vector 'a Pvec.t.
Then you can define Unicode text as being:
type utext = Uchar.t Pvec.t
Sure, that’s not memory efficient, but you only use it when you actually need to munge your UTF-8 strings for Unicode-heavy processing. This indexes your Unicode data by Unicode scalar values.
The nice thing with that representation is that you can then easily apply standard Unicode algorithms like the segmentation ones to get towers of vectors for easy processing while keeping the cost of doing so explicit. So for example if you are interested in grapheme clusters then you do:
let text : utext = …
let graphs : utext Pvec.t = Utext.segments `Grapheme_cluster text
So your functions acting on grapheme clusters take utext Pvec.t, and now your indexes correspond to grapheme clusters.
This all combines and composes nicely: you can first break into paragraphs:
let text : utext = …
let paragraphs : utext Pvec.t = Utext.paragraphs text
And then into paragraphs of grapheme clusters:
let gc_paragraphs : utext Pvec.t Pvec.t =
  Pvec.map (Utext.segments `Grapheme_cluster) paragraphs
Now your first level of indexing corresponds to paragraphs, the second one to grapheme clusters and the last one to scalar values.
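To illustrate the shape of that “tower of vectors” with something runnable, here is a toy Python analogue: plain lists stand in for Pvec.t, and naive splitting stands in for real Unicode segmentation (which would need the UAX #29 rules); only the nesting structure is the point.

```python
text = "ab cd\n\nef"

paragraphs = text.split("\n\n")                       # level 1: paragraphs
segments = [p.split(" ") for p in paragraphs]         # level 2: "segments"
scalars = [[list(s) for s in seg] for seg in segments]  # level 3: code points

# Indexing walks the tower: paragraph 1, segment 0, scalar 1.
print(scalars[1][0][1])  # f
```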
That is the idea behind the design of utext, which I never got around to finishing (also for lack of an actual strong need).