Incidentally, I did try out Cosmopolitan / Esperanto as mentioned in the other thread. I might have done something wrong, but the executables it generates were not runnable for me! I’m certain there’s a way to make such things work, and I’m sorry if I am the only person who has had a problem, but the fact that “following the instructions to the letter” does not lead to a working result is evidence for the problem I am describing. This thread asks “What are the biggest reasons newcomers give up on OCaml?” and I am answering you. This is one of the biggest reasons newcomers give up. I don’t mean it to chastise the amazing OCaml team, who have done so much wonderful work for which I’m deeply grateful. But I think it is important to be very clear-headed about the distance between OCaml today (which is an excellent language that I love to use) and a language that would not turn away newcomers.
Please, make an issue about that
I agree that is a pain point. I also thought of the same solution as you using a ppx extension, but I imagine it will be quite a bit of work.
I started writing OCaml for work almost 2 years ago. I had written a small amount of OCaml code for assignments in a university course several years before, but I mostly started learning (or relearning) OCaml from the ground up around the time I started this work. Previously, I was mostly a C++ developer.
It took months for me to fully appreciate the benefit of type inference. After all, one usually needs to know the types. But now I love OCaml’s type inference. One revelation was switching from vim (with or without merlin) to VSCode, which shows all of the inferred types. The other thing is that even though early on I often had to add a large number of type annotations to track down the source of a type error, and made mistakes when I forgot what type something was supposed to be, it eventually seemed, on net, to save me cognitive effort while still helping me produce correct code. I also have the feeling that I often create more complex types in OCaml than I did in C++.
I think that is why it annoys me to have to modify mli files. I still want to create and test modules that have private types and functions as I go, rather than having to wait until the end to get the benefit of auto-generating the right mli file. It’s true that C++ prior to modules required separate header files, but it didn’t have significant type inference. The decision to make was public vs. private (or protected). Making a header file (or going the other way and starting a code file from some function prototypes and class declarations) was an exercise in copy and paste, which in retrospect was also bothersome. But consider C# (and I assume, also, F#), which has access specifiers, “public”, “private”, etc., but doesn’t require separate interface files. Being able to see the interface at a glance is probably beneficial, but it shouldn’t mean I have to rewrite it manually every time it changes. Or that I have to adjust my creative process so that I have the interface fixed in stone before I start writing any other code.
Part of my process is moving functions around modules and files. Since modules are generally much looser than classes, functions can belong in multiple places, and finding their correct ‘home’ is often an iterative process. Having to update the .mli every time places a higher burden on that, which is why I tend to only use .mli files when I absolutely need to, and at the end of a project.
I also tend to use Python’s convention for private functions, which works surprisingly well; I borrowed this usage in OCaml from @c-cube. Basically, any private function should start with an underscore. While it is just a convention, it turns out you don’t need much more than a convention for handling privacy of functions. This allows you to still access those functions for testing without getting involved in complex solutions.
This conflicts with the functionality of disabling “unused identifier” warnings.
True but it’s fine for function names.
I think there needs to be more editor support for editing mli files; it would solve the issue you have.
I tried to implement some of it, but it was way more complex than expected and didn’t really work that well; maybe I will try again at some point.
Seems like a good time to link to a nice post by @CraigFe on a trick to address the ml/mli synchronization issue until interfaces become more stable.
(back to the topic of Unicode, sorry)
First, about the visibility given to UTF-8-capable functions: I wasn’t only referring to the String documentation proper, but also to the larger standard library documentation, the manual, and more generally to available learning resources.
Now for specifics, restricting to the official documentation. On the byte/character front, the sentence from the String doc you’re quoting is rather clear indeed, but it is assigning a non-standard meaning to an otherwise-known word (“character”). Which is problematic because this non-standard meaning is used pervasively in places that are not hierarchically below the String module (like in the doc-string of Stdlib.input, or much earlier in the language manual). You have to already know that this definition exists, in this specific place in the documentation.
On the visibility of the UTF functions: the UTF codecs exist, but they are not alluded to from anywhere; not even in the header of String, in the paragraph about UTF-8 (so upon reading that paragraph I might be tempted to believe that, in the purest C/OCaml tradition, UTF-8 is allowed but no function is actually provided to deal with it). In Bytes there is not even a discussion of Unicode, and the UTF functions are buried below a pack of arcane unsafe stuff. Also, there is no UTF-capable input/output function.
Luckily some bits of my concerns with the documentation are quick to fix, so I turned the easy bits into a constructive PR. But I don’t have much time for a more in-depth rewrite.
I would have loved to contribute Unicode support to re (the regex library). I find this to be one of the significant missing pieces w.r.t. Unicode (because you can’t just implement it yourself). Unfortunately I expect it to be a (fun but) somewhat complex task, and my spare time is allotted to other things.
Perhaps Python has other issues I’m not aware of, but it seems to me that the specific issue you’re pointing out is that Unicode literals '\uXXXX' have the surprising semantics of allowing the creation of UTF-16 surrogates (i.e. invalid code points). I would expect the syntax '\uD800' to trigger an error. Isn’t that easy to fix?
(Unicode) string-indexing presupposes that you’re representing your unicode string as an array of unicode code-points, right? That’s wasteful of memory, isn’t it?
(Diverging from the initial topic more and more:) Not necessarily. As I said, Python’s str has constant-time indexing thanks to a fixed-width encoding, but it adjusts the character width depending on the contained data (see the PEP): either Latin-1, UCS-2 or UCS-4, i.e. 1 byte, 2 bytes or 4 bytes per code point. Since code points greater than U+FFFF are rare (essentially: rare/ancient CJKV ideograms, antique or endangered writing systems, or fancy emojis), you’d rarely (if ever) resort to UCS-4. Even if you do need the full Unicode range, and space is a concern, you may implement a denser packing than UCS-4 (because Unicode code points are in fact 21-bit integers, not 32-bit).

Also, one can imagine using a variable-width encoding such as UTF-16, but with each string you would maintain the set of (codepoint-wise) indexes where several “coding units” are used; so codepoint-wise indexing would be logarithmic-time in the worst case, and constant-time in the common case where you have no (or no more than a fixed number of) large characters. You can do the same with UTF-8 if you’re anticipating that most of your characters would be ASCII. If you’re serious about large portions of text you want to move around, copy, cut, concatenate, share… you would use ropes or something like that, so you would end up with that kind of indexing-by-searching-in-a-tree, anyway.
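To make the variable-width-plus-index idea concrete, here is a small Python sketch (the class name `IndexedUtf8` and the whole design are hypothetical, not any existing library): we precompute, for each multi-byte character in a UTF-8 string, its code-point index and the cumulative count of extra bytes before it. Indexing is then a binary search over the multi-byte characters only, so it is constant-time on pure-ASCII strings and logarithmic otherwise.

```python
import bisect

class IndexedUtf8:
    """UTF-8 string with O(log k) code-point indexing, where k is the
    number of multi-byte characters (O(1) when the string is pure ASCII)."""

    def __init__(self, s: str):
        self.data = s.encode("utf-8")
        self.cp_index = []  # code-point indices of multi-byte characters
        self.extra = []     # cumulative extra bytes up to and including each
        total = 0
        for i, ch in enumerate(s):
            n = len(ch.encode("utf-8"))
            if n > 1:
                total += n - 1
                self.cp_index.append(i)
                self.extra.append(total)

    def __getitem__(self, i: int) -> str:
        # Count multi-byte characters strictly before code point i,
        # then shift the byte offset by their accumulated extra bytes.
        k = bisect.bisect_left(self.cp_index, i)
        start = i + (self.extra[k - 1] if k else 0)
        # Decode one code point from its UTF-8 leading byte.
        first = self.data[start]
        n = 1 if first < 0x80 else 2 if first < 0xE0 else 3 if first < 0xF0 else 4
        return self.data[start:start + n].decode("utf-8")

t = IndexedUtf8("aéb€c")
print(t[1], t[3])  # é €
```

The auxiliary arrays cost space proportional to the number of non-ASCII characters only, which is the trade-off described above.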
but it is assigning a non-standard meaning to an otherwise-known word (“character”)
Sorry to interject but what do you think that the known word “character” means? Because I don’t know of any well-known notion of characters.
… which is why I carefully wrote “code points” everywhere in my previous message. Because, indeed, “character” has no obvious formal meaning, and will be understood by readers in a variety of informal, blurry ways. I believe “8-bit portions of encoded text” is not among the natural or common expectations around the word, though.
You can do the same with UTF-8 if you’re anticipating that most of your characters would be ASCII. If you’re serious about large portions of text you want to move around, copy, cut, concatenate, share… you would use ropes or something like that, so you would end up with that kind of indexing-by-searching-in-a-tree, anyway.
What you’re proposing is a new data-type, and you want to call it “string” – to push aside the already-existing datatype with that name. And you’re proposing quite a bit of structure and associated memory-overhead for the extra metadata. Yes, Python has that, but then again, Python is unsuitable for programs manipulating large, complex data-structures, due to its per-object overhead. But more important: you can already do everything you want to do and show that it’s a great idea! Why not just do it?
Re: “character” – I remember when Java arrived, and no, absolutely not did we think of “character” as automatically “unicode codepoint”. We thought of it as “byte”. I mean, by your argument, C/C++ should also change, no?
Look: I understand that OCaml’s unicode support might not be the best. But this is a fixable problem, and can be addressed without involving the core developers and core system. Do it, get it done, show the world that you’ve got a better solution. Otherwise, it feels like you’re asking somebody to change without proof that the change will be better.
After writing my response, I thought I should try again, because I pushed back pretty hard. What I really want to point out is that you’re proposing a much-more-complex “string” type (actually, you’re proposing several such), and this will have a significant performance impact on existing programs. As I’ve noted several times over the years, when I started working with Java in 1995, I implemented a “byte string tower” (BString/BStringBuffer, etc.) and did so for performance. And of course, a cursory analysis of the Java heap shows that you can also recover an absolutely insane amount of memory by doing so.
Systems-jocks have relied on the performance characteristics of the unadorned “string” for decades, and if you’re going to change that type, you’re going to need to demonstrate that it’s not deleterious. You could implement what you propose, sufficiently completely to demonstrate its utility and lack of bad side-effects. Then people could evaluate it.
But (speaking as a systems-jock) mimicking Java’s String isn’t a good idea: there needs to remain a byte-based core.
UTF-16 surrogates (i.e. invalid code points)
UTF-16 surrogates are valid code points. They are invalid scalar values. AFAIR python strings represent sequences of Unicode code points and that’s not a good model of text since it means that you embed the UTF-16 encoding space into your Unicode string and leads to the problems I linked to earlier.
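The code-point-versus-scalar-value distinction can be checked directly in any standard Python interpreter: a lone surrogate is a legal element of a str (so str really does model code points), but it cannot be encoded as well-formed UTF-8.

```python
s = "\ud800"  # a lone UTF-16 high surrogate: a valid code point,
              # but not a Unicode scalar value
print(len(s))  # 1: Python happily stores it in a str

try:
    s.encode("utf-8")  # surrogates have no well-formed UTF-8 encoding
except UnicodeEncodeError as e:
    print("not encodable:", e.reason)
```

This is precisely the embedding of the UTF-16 encoding space into the string type that causes the problems mentioned above.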
@Chet_Murthy I must say I’m very confused now, you mentioned re-implementing strings and being concerned about space consumption of it. (to be clear, I wasn’t proposing these ideas for a language-blessed string datatype that the entire world would use)
I mean, by your argument, C/C++ should also change, no?
And indeed the situation in C is very confused. Except C/C++ is much older, set in stone, so that’s not going to happen; but that’s beside the point.
Re: “character” – I remember when Java arrived, and no, absolutely not did we think of “character” as automatically “unicode codepoint”. We thought of it as “byte”.
Out of curiosity, around that time, did you happen to work in an English-speaking country?
@dbuenzli Ah indeed I got the “code point” terminology wrong, thanks for correcting me.
Except C/C++ is much older,
Um… citation needed…
Luckily some bits of my concerns with the documentation are quick to fix, so I turned the easy bits into a constructive PR.
Nice!
I would have loved to contribute Unicode support to re (the regex library). I find this to be one of the significant missing pieces w.r.t. Unicode (because you can’t just implement it yourself). Unfortunately I expect it to be a (fun but) somewhat complex task, and my spare time is allotted to other things.
Looks like there is an issue but no recent/active work. It does seem like an important one.
“a language-blessed string type”
But if you don’t want to change the meaning of string, then you can already do what you want today, right?
C/C++ is much older
Caml (the “heavy” implementation) dates to the 80s; Caml Light (the lineal ancestor of OCaml) came out in 1991. Caml Light predates Unicode, practically speaking (maybe somebody was coming up with a standard in 1991, but it was a dream in some standards committee’s eyes).
did you happen to work in an English-speaking country
I started programming in the 1980s, in the USA. But I spent 1991-94 in France (INRIA) and there also, “character” meant “byte”. It was only after Java’s prevalence that people started thinking of “char” as “short”.
As I said, Python’s str has constant-time indexing thanks to a fixed-width encoding, but it adjusts the character width depending on contained data (see the PEP); so either Latin-1, UCS-2 or UCS-4, i.e. 1 byte, 2 bytes or 4 bytes per code point.
But when you have Unicode combining characters, and things like Hangul composable forms, what use does indexing by Unicode code point actually have which would justify the complexity to which you refer? Iterating by whole grapheme cluster might possibly be useful, but indexing could not be constant-time; I believe the Julia language has such a thing, in its ‘graphemes’ function, but I don’t think Python provides that.
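A quick Python demonstration of why code-point indexing cuts through what a reader perceives as one character: a combining sequence is one grapheme but several code points, and plain indexing happily hands back the bare combining mark. (NFC normalization merges this particular pair, but not every grapheme has a precomposed form, so normalization does not fix indexing in general.)

```python
import unicodedata

s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one grapheme, two code points
print(len(s))      # 2: code-point indexing sees two elements
print(repr(s[1]))  # the bare combining mark on its own, rarely useful
print(len(unicodedata.normalize("NFC", s)))  # 1 after composition to 'é'
```

Note also that the standard library’s unicodedata module offers normalization but no UAX #29 grapheme segmentation, consistent with the point above.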
@dbuenzli Ah indeed I got the “code point” terminology wrong, thanks for correcting me.
The problem is that the standard actually defines good and precise terminology, but no one uses it, including the people who define the standard themselves. In any case, I always suggest that people who are confused about all this have a read of my minimal Unicode introduction.
Since it seems people are having fun discussing what a good Unicode text data structure would look like, I’d add (or likely repeat) my two cents.
First it should be stressed that for many programs just passing around UTF-8 encoded string values is entirely good enough, all the more so because structural text properties (e.g. think of splitting on a comma) often hinge on US-ASCII code points, which are represented by themselves in UTF-8 bytes.
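That property of UTF-8 is easy to verify: byte values below 0x80 never occur inside a multi-byte sequence, so splitting raw UTF-8 bytes on an ASCII delimiter can never cut a multi-byte character in half. A small Python check (the same reasoning applies to an OCaml string holding UTF-8):

```python
# Fields containing 2- and 3-byte UTF-8 sequences, separated by ASCII commas.
line = "café,naïve,日本".encode("utf-8")

# Splitting the *bytes* on b"," is safe: 0x2C cannot appear inside
# any UTF-8 continuation or leading byte of a multi-byte sequence.
fields = [f.decode("utf-8") for f in line.split(b",")]
print(fields)  # ['café', 'naïve', '日本']
```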
Regarding something for tasks that need more sophisticated Unicode processing, I think it would be nice to have in OCaml’s standard library a good and efficient all-round polymorphic immutable persistent vector 'a Pvec.t.
Then you can define Unicode text as being:
type utext = Uchar.t Pvec.t
Sure, that’s not memory efficient, but you only use it when you actually need to munge your UTF-8 strings for Unicode-heavy processing. This indexes your Unicode data by Unicode scalar values.
The nice thing with that representation is that you can then easily apply standard Unicode algorithms like the segmentation ones to get towers of vectors for easy processing while keeping the cost of doing so explicit. So for example if you are interested in grapheme clusters then you do:
let text : utext = …
let graphs : utext Pvec.t = Utext.segments `Grapheme_cluster text
So your functions acting on grapheme clusters take utext Pvec.t, and now your indexes correspond to grapheme clusters.
This all combines and composes nicely: you can first break into paragraphs:
let text : utext = …
let paragraphs : utext Pvec.t = Utext.paragraphs text
And then into paragraphs of grapheme clusters:
let gc_paragraphs : utext Pvec.t Pvec.t =
  Pvec.map (Utext.segments `Grapheme_cluster) paragraphs
Now your first level of indexing corresponds to paragraphs, the second one to grapheme clusters and the last one to scalar values.
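To illustrate the shape of that “tower of vectors” with something runnable, here is a toy Python analogue: plain lists stand in for Pvec.t, and naive splitting stands in for real Unicode segmentation (which would need the UAX #29 rules); only the nesting structure is the point.

```python
text = "ab cd\n\nef"

paragraphs = text.split("\n\n")                       # level 1: paragraphs
segments = [p.split(" ") for p in paragraphs]         # level 2: "segments"
scalars = [[list(s) for s in seg] for seg in segments]  # level 3: code points

# Indexing walks the tower: paragraph 1, segment 0, scalar 1.
print(scalars[1][0][1])  # f
```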
That is the idea behind the design of utext, which I never got around to finishing (also for lack of an actual strong need).