I want to find the first Unicode character in a string and then split the string on that character

I guess I can use a regex to split the string (though it seems like overkill), but I’m not really sure how to determine the first complete Unicode character in a UTF-8 string.

Example string from the data:

"ƒT01ƒULatnƒx13ƒaTsionut datitƒdhisṭoriah, raʿyon, ḥevrahƒhʿorekh: Dov Shṿartsƒlkerekh 3"

I need to determine that the first character actually is ƒ, and then split on every occurrence of ƒ.

Sorry if this is obvious. I haven’t been using OCaml very long. I’m using Base, if that makes a difference.

You need to specify what you mean by “character”; you probably mean an extended grapheme cluster. Similarly, you need to specify what it means for two characters to be the same. You probably mean equal under either NFC or NFKC normalization. Once those details are settled, you can use uuseg to obtain a sequence of extended grapheme clusters and do your splitting from there.

I mean the first codepoint.

But scalar values are not characters (and code points even less so). If you want something that at least partially matches the human expectation of what a graphical character is, you need extended grapheme clusters. Anyway, the answer doesn’t change between scalar values and extended grapheme clusters: use uuseg.

I know what code points and normalized forms are.

Where is uuseg defined?

It is a library.

Thanks so much! This is exactly what I was looking for!

In fact, if you just need code points, uutf is the more appropriate library (with uunf for normalization).
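
For the code-point case, here is a minimal sketch. It does not use uutf: it assumes OCaml 4.14 or later and relies on the standard library's `String.get_utf_8_uchar` instead, so that it has no dependencies. The function names (`first_code_point`, `split_on_substring`, `split_on_first_code_point`) are made up for illustration. It performs no normalization, so a separator is only recognised when it is encoded as exactly the same bytes as the first code point.

```ocaml
(* Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)

(* Return the UTF-8 bytes of the first code point of [s], or None if [s]
   is empty or starts with an invalid byte sequence. *)
let first_code_point (s : string) : string option =
  if String.length s = 0 then None
  else
    let d = String.get_utf_8_uchar s 0 in
    if Uchar.utf_decode_is_valid d
    then Some (String.sub s 0 (Uchar.utf_decode_length d))
    else None

(* Split [s] on every occurrence of the non-empty byte string [sep].
   Because UTF-8 is self-synchronizing, the bytes of one valid code point
   can only match at a code point boundary in valid UTF-8 input. *)
let split_on_substring ~(sep : string) (s : string) : string list =
  let n = String.length s and m = String.length sep in
  if m = 0 then invalid_arg "split_on_substring: empty separator";
  let rec go start i acc =
    if i + m > n then
      (* No room left for another separator: emit the final field. *)
      List.rev (String.sub s start (n - start) :: acc)
    else if String.sub s i m = sep then
      go (i + m) (i + m) (String.sub s start (i - start) :: acc)
    else go start (i + 1) acc
  in
  go 0 0 []

(* Use the first code point of [s] as the separator for the whole string.
   If there is no valid first code point, return [s] unsplit. *)
let split_on_first_code_point (s : string) : string list =
  match first_code_point s with
  | None -> [ s ]
  | Some sep -> split_on_substring ~sep s
```

Since the string starts with its own separator, the first element of the result is always the empty string; drop it with a pattern match on the result if it is not wanted. With Base, the hand-written `split_on_substring` could probably be replaced by Base's own substring-search functions.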