I want to find the first unicode character in a string and then split the string on that character

ninjaaron · March 16, 2020, 5:42pm

I guess I can use a regex to split the string (though it seems like overkill), but I’m not really sure how to determine the first complete Unicode character in a UTF-8 string.

Example string from the data:

"ƒT01ƒULatnƒx13ƒaTsionut datitƒdhisṭoriah, raʿyon, ḥevrahƒhʿorekh: Dov Shṿartsƒlkerekh 3"

I need to determine that the first character actually is ƒ, and then split on every occurrence of ƒ.

Sorry if this is obvious. I haven’t been using OCaml very long. I’m using Base, if that makes a difference.

octachron · March 16, 2020, 6:16pm

You need to specify what you mean by “character”; you probably mean an extended grapheme cluster. Similarly, you need to specify what it means for two characters to be the same; you probably mean equal under either NFC or NFKC normalization. Once those details are settled, you can use uuseg to obtain a sequence of extended grapheme clusters and do your splitting from there.

ninjaaron · March 16, 2020, 6:22pm

I mean the first codepoint.

octachron · March 16, 2020, 6:38pm

But scalar values are not characters (and code points even less so). If you want something that at least partially matches the human expectation of what a graphical character is, you need extended grapheme clusters. Anyway, the answer doesn’t change between scalar values and extended grapheme clusters: use uuseg.

ninjaaron · March 16, 2020, 7:18pm

I know what code points and normalized forms are.

Where is uuseg defined?

octachron · March 16, 2020, 7:19pm

It is a library.

ninjaaron · March 16, 2020, 7:21pm

Thanks so much! This is exactly what I was looking for!

octachron · March 16, 2020, 7:22pm

In fact, if you just need code points, the right library is uutf (with uunf for normalization).
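
For the plain code point case, here is a minimal sketch that uses only the standard library instead of uutf. It assumes OCaml 4.14 or later (for `String.get_utf_8_uchar`), does no normalization, and does not open Base, whose `String` module has different signatures. The function names are made up for illustration. It relies on the fact that in well-formed UTF-8 the bytes of one encoded scalar value can only match at scalar value boundaries, so splitting on the raw byte sequence is enough.

```ocaml
(* Sketch only: stdlib, OCaml >= 4.14, no normalization.
   All function names here are hypothetical helpers. *)

(* Bytes of the first UTF-8 encoded scalar value of [s], or [None]
   if [s] is empty or starts with a malformed sequence. *)
let first_scalar_bytes (s : string) : string option =
  if String.length s = 0 then None
  else
    let d = String.get_utf_8_uchar s 0 in
    if Uchar.utf_decode_is_valid d
    then Some (String.sub s 0 (Uchar.utf_decode_length d))
    else None

(* Split [s] on every occurrence of the non-empty byte string [sep]. *)
let split_on_substring ~(sep : string) (s : string) : string list =
  let n = String.length s and k = String.length sep in
  if k = 0 then invalid_arg "split_on_substring: empty separator";
  let rec go start i acc =
    if i + k > n then
      (* No room left for another separator: emit the tail. *)
      List.rev (String.sub s start (n - start) :: acc)
    else if String.equal (String.sub s i k) sep then
      go (i + k) (i + k) (String.sub s start (i - start) :: acc)
    else go start (i + 1) acc
  in
  go 0 0 []

(* Use the first scalar value of [s] as the delimiter for all of [s]. *)
let split_on_first_scalar (s : string) : string list =
  match first_scalar_bytes s with
  | None -> [ s ]
  | Some sep -> split_on_substring ~sep s
```

Because the string itself starts with the delimiter, the first field of the result is always the empty string; drop it with `List.tl` or a pattern match if it is not wanted. If the data may mix precomposed and decomposed forms of the delimiter, normalize first (for example with uunf) before splitting.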
