Oh, two days of OCaml and I have already contributed something?
Can you then tell me which is the most complete, stable, high-quality, generally accepted library for dealing with Unicode in OCaml? Since there is no official support, and there are a few libraries, I need some guidance here …
That may not be a very good example: since s is presumably already in UTF-8, you can simply split on the second ' ' US-ASCII character…
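As a minimal sketch of that idea, using only the stdlib (the helper name after_second_space is made up for illustration): since UTF-8 never reuses US-ASCII bytes inside multi-byte sequences, searching for an ASCII space byte-wise is safe on a UTF-8 string.

```ocaml
(* Sketch: return what follows the second ' ' in s, or s itself if
   there is no second space. Byte-level search is safe here because
   UTF-8 multi-byte sequences never contain US-ASCII bytes. *)
let after_second_space s =
  match String.index_opt s ' ' with
  | None -> s
  | Some i ->
      (match String.index_from_opt s (i + 1) ' ' with
       | None -> s
       | Some j -> String.sub s (j + 1) (String.length s - j - 1))
```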
But in general, if you need to conditionalize on code points, you should decode your UTF-8 string via Uutf.String.fold_utf_8 and re-encode what you want into a Buffer.t value via Uutf.Buffer.add_utf_8 (or, if you are on OCaml >= 4.06.0, via Buffer.add_utf_8_uchar).
The sample code in the Uucp.Case module has a few examples, e.g. to lowercase or uppercase strings.
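Along those lines, here is a minimal sketch in the style of the Uucp.Case examples (the name uppercase_utf_8 is mine, not the library's): decode with Uutf.String.fold_utf_8, map each scalar value with Uucp.Case.Map.to_upper, re-encode with Uutf.Buffer.add_utf_8.

```ocaml
(* Sketch: full, language-independent uppercasing of a UTF-8 string.
   Malformed bytes are replaced by Uutf.u_rep (U+FFFD). *)
let uppercase_utf_8 s =
  let b = Buffer.create (String.length s) in
  let upper () _pos = function
  | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep
  | `Uchar u ->
      (match Uucp.Case.Map.to_upper u with
       | `Self -> Uutf.Buffer.add_utf_8 b u
       | `Uchars us -> List.iter (Uutf.Buffer.add_utf_8 b) us)
  in
  Uutf.String.fold_utf_8 upper () s;
  Buffer.contents b
```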
I was afraid you were going to answer like you did … yet I was hoping you were not … (oh my!)
Is that the easiest / most efficient way to take substrings of Unicode strings in OCaml?
Beyond Uutf, what other libraries do you recommend? For example, I am particularly interested in lexing and parsing Unicode source text.
A Uutf decoder works fine for this. It will do encoding guessing, newline normalization, and scalar value position tracking for you. Some people prefer to use lexer generators; I don't, but you can have a look at ulex if you are into that sort of thing.
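As a minimal sketch of such a lexer front end (the name dump_scalar_values is made up): create a decoder that guesses the encoding and normalizes newlines to U+000A, then pull scalar values and query the decoder for positions.

```ocaml
(* Sketch: decode a channel, guessing the encoding, normalizing all
   newline sequences to U+000A, and printing line.column positions. *)
let dump_scalar_values ic =
  let u_lf = Uchar.of_int 0x000A in
  let d = Uutf.decoder ~nln:(`Readline u_lf) (`Channel ic) in
  let rec loop () = match Uutf.decode d with
  | `Uchar u ->
      Printf.printf "%d.%d: U+%04X\n"
        (Uutf.decoder_line d) (Uutf.decoder_col d) (Uchar.to_int u);
      loop ()
  | `Malformed _ ->
      Printf.printf "%d.%d: malformed bytes\n"
        (Uutf.decoder_line d) (Uutf.decoder_col d);
      loop ()
  | `End -> ()
  | `Await -> assert false (* impossible on a `Channel source *)
  in
  loop ()
```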
No more or less than in other languages, I'm afraid. Just because another language has a Unicode string data structure does not mean it is not a broken one…
Looking at the documentation of this example, I was wondering if it should also mention that language-specific casing, such as the dotted i pair (i/İ) in Turkish, is also not handled, beyond the specific problem with final sigma.
Daniel's Unicode libraries are indeed very good, but they require some understanding of the topic (though they do provide a sufficient introduction to Unicode as well). Another option for dealing with Unicode is Camomile [1].
Thanks for the link @Freyr666. I am looking into Camomile also.
These packages in fact help a lot in understanding how Unicode works, e.g. the difference between what Unicode is and what its encodings are, something that many people using other languages, where Unicode comes supported by default, sometimes do not understand.
Nevertheless, having a layer above Uutf (or perhaps another lib) offering common string manipulation functions, like substring, trim, replace, etc., would be extremely useful, if not for everyone, then at least for beginners. Having to dig deep into the current libraries is unnecessarily hard, and a waste of time when in a hurry.
@dbuenzli This is why I said before that Unicode is painful in OCaml: not because OCaml does not support it out of the box, but because the libraries are somewhat low-level.
A good part of the problem is that all these notions are hard to define consistently while simultaneously abstracting over five thousand years of accumulated technical debt in human scripts. For instance, the notion of character is well-defined in alphabets without diacritics (e.g. Zulu, but not English); it already becomes more complex with diacritics (é, ñ); but for abugidas (like Hindi or Sanskrit संस्कृतम्) or hangeul (how many characters in 한글?) the very notion of character is much hazier. Similarly, segmenting a text into words is not that obvious in logographic scripts like Chinese or hieroglyphs (how many words in 中国?). Even segmenting on white space needs to take into account that there are currently 27 white space code points (for instance, French typography uses a narrow non-breaking space before : and ;).
The same kind of issue rears its head when replacing substrings, because the notion of equality between substrings becomes quite subtle over the full Unicode spectrum: if I am replacing I with K, what should I do with İ? Should I replace it with K̇ or with K?
if I am replacing I with K, what should I do with İ? Should I replace it with K̇ or with K?
I would say that replacing I with K should only replace I with K, not İ. I believe that equality must have the meaning of identical and never that of similar or equivalent.
Also, by character I mean a Unicode code point, irrespective of language, and by word a sequence of non-white-space characters, whether the white-space characters are very wide, wide, or narrow, as in French script.
I am not an expert in Unicode, but if Unicode has these concepts of white space and character, then all other concepts building on them should be manageable generically, or defined for each language / locale by the corresponding “official authorities” (or whoever is in charge or has that competence). A sketch of what I mean follows below.
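Here is a minimal sketch of that naive proposal (the name words_naive is made up; it uses Uucp.White.is_white_space to test for the White_Space property): words as maximal runs of non-white-space scalar values. As the replies below point out, this definition breaks down for logographic scripts.

```ocaml
(* Sketch: split a UTF-8 string into "words", i.e. maximal runs of
   scalar values lacking the Unicode White_Space property.
   Malformed bytes are replaced by U+FFFD. *)
let words_naive s =
  let b = Buffer.create 16 in
  let words = ref [] in
  let flush () =
    if Buffer.length b > 0 then begin
      words := Buffer.contents b :: !words;
      Buffer.clear b
    end
  in
  let step () _pos = function
  | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep
  | `Uchar u ->
      if Uucp.White.is_white_space u
      then flush ()
      else Uutf.Buffer.add_utf_8 b u
  in
  Uutf.String.fold_utf_8 step () s;
  flush ();
  List.rev !words
```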
Also, if others like Swift and Rust do it correctly, as someone said before, then any language or library can do it. (Hint: just copy from Rust or Swift into OCaml, or an OCaml library, no?)
You are mixing different things. Unicode is inherently a messy technology and there are no easy answers to the problems you mention. What I said about Swift and Rust is that they provide a good conceptual model behind their string data structure, not good answers to these questions.
Things are not so simple in Unicode, but you might want to look into Unicode text segmentation, which uuseg implements (without language-specific tailorings).
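For example, a minimal sketch with uuseg's Uuseg_string convenience module (the helper name graphemes is made up): folding over the grapheme clusters of a UTF-8 string under the default UAX #29 rules.

```ocaml
(* Sketch: collect the grapheme clusters of a UTF-8 string
   (default UAX #29 rules, no language-specific tailorings). *)
let graphemes s =
  List.rev
    (Uuseg_string.fold_utf_8 `Grapheme_cluster (fun acc g -> g :: acc) [] s)

let () =
  (* 한글 is two grapheme clusters, however many code points encode it. *)
  List.iter print_endline (graphemes "한글")
```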
But then İ can also be written as I + a combining dot above diacritic; if you use strict equality, only this encoding of İ will be replaced, becoming K̇.
The problem is that Unicode code points are quite removed from any intuitive definition of character in most languages: é can be one or two code points in Unicode (as a precomposed code point, or as e + a combining diacritical mark); same thing with 한, which can be written with one or three code points.
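A tiny illustration of that point, using only the stdlib: the two encodings of é render identically but are different strings.

```ocaml
(* Sketch: é as one scalar value (U+00E9) vs. two (U+0065 U+0301).
   Byte-wise string equality sees two different strings. *)
let precomposed = "\u{00E9}"   (* é, precomposed *)
let decomposed = "e\u{0301}"   (* e + combining acute accent *)

let () =
  Printf.printf "equal as strings? %b\n"
    (String.equal precomposed decomposed)  (* prints: false *)
```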
This does not work at all with logographic scripts; e.g. in Japanese, 春過ぎて transliterates to Haru sugite, which means “Spring has passed”.
I am not sure I would call more than 1.5 billion people an exception. But yes, writing a library for manipulating UTF-8-encoded human texts as if they were written in Zulu, or English without loan words, is definitely doable.