Feedback / Help Wanted: Upcoming OCaml.org Cookbook Feature

OK, I’ll bite…how do I calculate the length of this string using only the standard library?

let facepalm = "🤦‍♂️"

The standard library only talks about Unicode characters (scalar values). It says nothing about grapheme clusters. Let’s say I use String.get_utf_8_uchar to get a utf_decode, take the byte length of that decode, move the cursor forward by that many bytes in the source string, then repeat for each decoded Unicode character:

# let rec ulen ~off ~len str =
  let dec = String.get_utf_8_uchar str off in
  if Uchar.utf_decode_is_valid dec then
    ulen ~off:(off + Uchar.utf_decode_length dec) ~len:(succ len) str
  else len;;
val ulen : off:int -> len:int -> string -> int = <fun>
# let ulen str = ulen ~off:0 ~len:0 str;;
val ulen : string -> int = <fun>
# ulen "🤦‍♂️";;
Exception: Invalid_argument "index out of bounds".

Whoops! It’s not as simple as it might seem. I guess we need to handle the exception where we go past the end of the string?

# let rec ulen ~off ~len str =
  match String.get_utf_8_uchar str off with
  | dec ->
    if Uchar.utf_decode_is_valid dec then
      ulen ~off:(off + Uchar.utf_decode_length dec) ~len:(succ len) str
    else len
  | exception Invalid_argument _ -> len;;
val ulen : off:int -> len:int -> string -> int = <fun>
# let ulen str = ulen ~off:0 ~len:0 str;;
val ulen : string -> int = <fun>
# ulen "🤦‍♂️";;
- : int = 4
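
For what it's worth, here is a minimal sketch of the same loop that checks the offset against String.length instead of catching Invalid_argument. The name scalar_count is mine, and I've spelled the string with \u{...} escapes on the assumption that it is exactly U+1F926 U+200D U+2642 U+FE0F. Unlike the version above, it counts each invalid byte sequence as one character rather than stopping, which may or may not be what you want.

```ocaml
(* Count Unicode scalar values in a UTF-8 string, stdlib only (OCaml >= 4.14). *)
let scalar_count (s : string) : int =
  let n = String.length s in
  let rec go off count =
    if off >= n then count
    else
      let dec = String.get_utf_8_uchar s off in
      (* utf_decode_length is always at least 1, even for an invalid
         decode, so the loop always makes progress. *)
      go (off + Uchar.utf_decode_length dec) (count + 1)
  in
  go 0 0

let () =
  (* Assumed spelling of the facepalm string from the post. *)
  let facepalm = "\u{1F926}\u{200D}\u{2642}\u{FE0F}" in
  Printf.printf "%d\n" (scalar_count facepalm)
```

This also prints 4, so the bounds handling was never the real problem.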

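But that’s not correct either. The 4 is the number of Unicode scalar values in the string, not what a reader sees as one character. This string contains one grapheme cluster, and it should give me that length: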

# #require "uuseg.string";;
# Uuseg_string.fold_utf_8 `Grapheme_cluster (fun len _ -> len + 1) 0 "🤦‍♂️";;
- : int = 1
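
To see what the stdlib-only count is actually counting, here is a small sketch that lists the scalar values in the string. The helper name scalar_values is hypothetical, and the string is again spelled with escapes on the assumption that it is the emoji, a zero-width joiner, the male sign, and a variation selector.

```ocaml
(* List the scalar values of a UTF-8 string as "U+XXXX", stdlib only
   (OCaml >= 4.14). Invalid sequences show up as U+FFFD. *)
let scalar_values (s : string) : string list =
  let n = String.length s in
  let rec go off acc =
    if off >= n then List.rev acc
    else
      let dec = String.get_utf_8_uchar s off in
      let u = Uchar.utf_decode_uchar dec in
      go (off + Uchar.utf_decode_length dec)
        (Printf.sprintf "U+%04X" (Uchar.to_int u) :: acc)
  in
  go 0 []

let () =
  (* Assumed spelling of the facepalm string from the post. *)
  let facepalm = "\u{1F926}\u{200D}\u{2642}\u{FE0F}" in
  print_endline (String.concat " " (scalar_values facepalm))
```

That prints U+1F926 U+200D U+2642 U+FE0F: four scalar values that segment into a single grapheme cluster, which is the part the standard library has no function for.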