OK, I’ll bite…how do I calculate the length of this string using only the standard library?
let facepalm = "🤦♂️"
The standard library only talks about Unicode characters. It says nothing about grapheme clusters. Let’s say I want to use String.get_utf_8_uchar, get a utf_decode, get the length of the decode, move the cursor forward by that many bytes in the source string, then repeat for each decoded Unicode character:
# let rec ulen ~off ~len str =
    let dec = String.get_utf_8_uchar str off in
    if Uchar.utf_decode_is_valid dec then
      ulen ~off:(off + Uchar.utf_decode_length dec) ~len:(succ len) str
    else len;;
val ulen : off:int -> len:int -> string -> int = <fun>
# let ulen str = ulen ~off:0 ~len:0 str;;
val ulen : string -> int = <fun>
# ulen "🤦♂️";;
Exception: Invalid_argument "index out of bounds".
Whoops! It’s not as simple as it might seem. I guess we need to handle the exception where we go past the end of the string?
# let rec ulen ~off ~len str =
    match String.get_utf_8_uchar str off with
    | dec ->
      if Uchar.utf_decode_is_valid dec then
        ulen ~off:(off + Uchar.utf_decode_length dec) ~len:(succ len) str
      else len
    | exception Invalid_argument _ -> len;;
val ulen : off:int -> len:int -> string -> int = <fun>
# let ulen str = ulen ~off:0 ~len:0 str;;
val ulen : string -> int = <fun>
# ulen "🤦‍♂️";;
- : int = 4
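For what it’s worth, catching Invalid_argument isn’t the only way to stop at the end of the string. Here is a rough sketch that compares the offset against String.length instead. The name count_uchars is something I made up for illustration, and unlike the version above it does not stop at the first invalid byte sequence: it counts each invalid decode as one character, which is an assumption about what you’d want, not anything the standard library prescribes.

```ocaml
(* Count the Unicode scalar values in a UTF-8 string, stopping when the
   offset reaches the end of the string rather than catching an exception.
   [count_uchars] is a hypothetical name, not a standard library function. *)
let count_uchars str =
  let n = String.length str in
  let rec go off count =
    if off >= n then count
    else
      let dec = String.get_utf_8_uchar str off in
      (* utf_decode_length is always at least 1, even for an invalid
         decode, so the offset always moves forward. *)
      go (off + Uchar.utf_decode_length dec) (count + 1)
  in
  go 0 0
```

Writing the string with escapes, count_uchars "\u{1F926}\u{200D}\u{2642}\u{FE0F}" gives the same count of 4 as the exception-based version.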
But that’s not correct either. This string contains one grapheme cluster, and it should give me that length:
# #require "uuseg.string";;
# Uuseg_string.fold_utf_8 `Grapheme_cluster (fun len _ -> len + 1) 0 "🤦🏼♂️";;
- : int = 1
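To see where a count of 4 comes from, it helps to list what the decoder actually finds. This is a small sketch using only the standard library; code_points is a made-up helper name, and it reports Uchar.rep (U+FFFD) for any invalid byte sequence.

```ocaml
(* List the code points of a UTF-8 string as "U+XXXX" labels.
   [code_points] is a hypothetical helper, not part of the standard library. *)
let code_points str =
  let n = String.length str in
  let rec go off acc =
    if off >= n then List.rev acc
    else
      let dec = String.get_utf_8_uchar str off in
      (* utf_decode_uchar returns Uchar.rep when the decode is invalid. *)
      let u = Uchar.utf_decode_uchar dec in
      let label = Printf.sprintf "U+%04X" (Uchar.to_int u) in
      go (off + Uchar.utf_decode_length dec) (label :: acc)
  in
  go 0 []
```

For "\u{1F926}\u{200D}\u{2642}\u{FE0F}" this returns ["U+1F926"; "U+200D"; "U+2642"; "U+FE0F"]: the face palm, a zero-width joiner, the male sign, and a variation selector. Those are four scalar values that a renderer shows as one grapheme cluster. Adding a skin-tone modifier such as U+1F3FC makes it five scalar values, still one cluster. As far as I can tell, grouping them into clusters needs the Unicode segmentation rules, which the standard library doesn’t implement.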