The String module of the standard library has a function to decode a Unicode character
get_utf_8_uchar : t -> int -> Uchar.utf_decode
such that the call String.get_utf_8_uchar str pos decodes a UTF-8 character at position pos.
Whether a valid or an invalid Unicode character starts at position pos, I can use Uchar.utf_decode_length to find the length of the UTF-8 character and start the next decoding at pos + len.
However, it is unclear to me what happens if pos is not the start of a UTF-8 character, or if pos is a valid start but the following bytes are not correct. From reading and rereading the documentation I have found no way to determine the correct next start position.
Is there a way to find that out? Or expressed differently: How can I find out whether the utf8 encoding starting at pos is incorrect and what is the next starting position?
In that case the function returns a Uchar.utf_decode value v which is invalid (Uchar.utf_decode_is_valid v is false) and whose Uchar.utf_decode_length v corresponds to the length of the malformed sequence of bytes.
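To make this concrete, here is a minimal sketch that inspects a few decodes. It assumes OCaml 4.14 or later (where String.get_utf_8_uchar and Uchar.utf_decode exist); the helper name inspect and the sample strings are only for illustration.

```ocaml
(* Hypothetical helper: (is_valid, consumed_bytes) of the decode at [pos]. *)
let inspect s pos =
  let d = String.get_utf_8_uchar s pos in
  (Uchar.utf_decode_is_valid d, Uchar.utf_decode_length d)

(* A well-formed three-byte sequence: U+20AC EURO SIGN. *)
let valid = inspect "\xE2\x82\xAC" 0

(* Same lead byte, but the third byte is ASCII 'z' rather than a
   continuation byte, so the sequence is cut short. *)
let truncated = inspect "\xE2\x82z" 0

(* [pos] points at a continuation byte, i.e. into the middle of a character. *)
let mid_char = inspect "\xE2\x82\xAC" 1
```

With the stdlib implementation as I understand it, valid is (true, 3), truncated is (false, 2) and mid_char is (false, 1): the invalid decodes still report a strictly positive number of bytes to skip.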
First, the "correct next start position" is an ill-defined notion: if you see one invalid byte sequence in a string, it is quite possible that none of the string contents were UTF-8 encoded. Or you could be decoding a binary data format in which only some sections are UTF-8 encoded text.
However, if we assume that the input is a UTF-8 encoded string with a low rate of byte errors, the stdlib API is designed to replace invalid sequences of bytes with Uchar.rep if you use it in a straightforward manner:
let b = Bytes.of_string "серафими многоꙮчитїи"
(* Let's introduce an error on the second grapheme cluster `е` *)
let () = Bytes.set b 3 '\192'
let recoded =
  let buff = Buffer.create 30 in
  let rec decode_recode pos =
    if pos = Bytes.length b then () else
    let decode = Bytes.get_utf_8_uchar b pos in
    let char = Uchar.utf_decode_uchar decode in
    Buffer.add_utf_8_uchar buff char;
    decode_recode (pos + Uchar.utf_decode_length decode)
  in
  decode_recode 0;
  Buffer.contents buff
There is more than one answer to that question, see this link.
The simplest is to replace the invalid decode with Uchar.rep in your decoded data (which is what Uchar.utf_decode_uchar returns for an invalid decode) and continue with the advance suggested by the invalid decode (returned by Uchar.utf_decode_length).
Effectively this means that if you are not interested in surfacing errors beyond Uchar.rep replacement, your decoding loop can simply be:
let fold_utf_8 f acc s =
  let rec loop f acc s max i =
    if i > max then acc else
    let dec = String.get_utf_8_uchar s i in
    let u = Uchar.utf_decode_uchar dec in
    loop f (f u acc) s max (i + Uchar.utf_decode_length dec)
  in
  loop f acc s (String.length s - 1) 0
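If you do want to surface errors, one possible variation is to hand the raw decode to the folding function instead of the already-replaced character. The sketch below is only an illustration under that assumption; fold_utf_8_decodes and invalid_offsets are hypothetical names, not stdlib functions, and OCaml 4.14 or later is assumed.

```ocaml
(* Hypothetical variant of the fold: [f] receives the byte index and the
   raw decode, so it can tell a genuine U+FFFD from an error replacement. *)
let fold_utf_8_decodes f acc s =
  let rec loop acc i =
    if i >= String.length s then acc else
    let dec = String.get_utf_8_uchar s i in
    loop (f acc i dec) (i + Uchar.utf_decode_length dec)
  in
  loop acc 0

(* Byte offsets of every invalid decode, in increasing order. *)
let invalid_offsets s =
  let collect acc i dec =
    if Uchar.utf_decode_is_valid dec then acc else i :: acc
  in
  List.rev (fold_utf_8_decodes collect [] s)
```

For example, invalid_offsets "x\xC0y\xE2\x82z" should report the offsets 1 and 3, while a well-formed string yields the empty list.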
Thanks @octachron and @dbuenzli. From your answers I conclude that get_utf_8_uchar stops as soon as it finds an unexpected byte.
Because UTF-8 encoded code points are at most 4 bytes long, the function Uchar.utf_decode_length returns a value between 1 (ASCII character) and 4 (21-bit code point). It does not search for a synchronisation point in case of error.
Since the returned length is at least 1, advancing from pos to pos + len cannot result in an infinite loop.
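A quick way to check this reasoning is to record the advance at every step of the plain decoding loop. This is just a sketch for experimentation; advances is a hypothetical helper and OCaml 4.14 or later is assumed.

```ocaml
(* Hypothetical helper: the list of advances taken by the default loop. *)
let advances s =
  let rec loop acc i =
    if i >= String.length s then List.rev acc else
    let n = Uchar.utf_decode_length (String.get_utf_8_uchar s i) in
    (* [n] is strictly positive, so [i] always moves forward. *)
    loop (n :: acc) (i + n)
  in
  loop [] 0

(* One character each of 1, 2, 3 and 4 bytes: U+0061, U+00E9, U+20AC, U+1F600. *)
let well_formed = advances "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80"

(* Two bytes that can never start a valid sequence. *)
let garbage = advances "\xFF\x80"
```

Here well_formed should be [1; 2; 3; 4] and garbage should be [1; 1], so the loop terminates on both inputs.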
So the function does what I expected, except for the search for a synchronisation point, i.e. a search for the next leading byte or the end of the string.
Indeed, by default it does not. As mentioned in the link I referred to, there is more than one way to go about this.
I chose a scheme that lets the advance always fit in a few bits, even on 32-bit platforms, to support ergonomic best-effort decoding (see the code in my previous message) while keeping the decoding scheme allocation-free. Looking for a synchronisation point would need an arbitrarily large advance (up to Sys.max_string_length) and thus a decode structure that cannot fit in a machine integer.
If you want to look for a synchronisation point, then on an invalid decode at pos simply search forward for the first UTF-8 starter byte (0x00-7F, 0xC2-DF, 0xE0-EF, 0xF0-F4) after pos and start your next decode from there (don't forget to add a Uchar.rep to your data for the invalid decode, for security reasons).
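As a rough sketch of that resynchronising strategy (not something the stdlib provides; is_starter, next_sync, fold_utf_8_resync and code_points are hypothetical names, and OCaml 4.14 or later is assumed):

```ocaml
(* Starter bytes as listed above: 0x00-7F, 0xC2-DF, 0xE0-EF, 0xF0-F4. *)
let is_starter = function
  | '\x00' .. '\x7F' | '\xC2' .. '\xDF'
  | '\xE0' .. '\xEF' | '\xF0' .. '\xF4' -> true
  | _ -> false

(* First index strictly after [pos] that holds a starter byte,
   or [String.length s] if there is none. *)
let next_sync s pos =
  let len = String.length s in
  let rec go i = if i >= len || is_starter s.[i] then i else go (i + 1) in
  go (pos + 1)

(* Fold over the decoded characters; each run of bytes skipped while
   resynchronising contributes exactly one Uchar.rep. *)
let fold_utf_8_resync f acc s =
  let rec loop acc i =
    if i >= String.length s then acc else
    let dec = String.get_utf_8_uchar s i in
    if Uchar.utf_decode_is_valid dec then
      loop (f acc (Uchar.utf_decode_uchar dec))
        (i + Uchar.utf_decode_length dec)
    else
      loop (f acc Uchar.rep) (next_sync s i)
  in
  loop acc 0

(* Usage: collect the code points as integers. *)
let code_points s =
  List.rev (fold_utf_8_resync (fun acc u -> Uchar.to_int u :: acc) [] s)
```

With this sketch, code_points "\x80\x80\x80A" should give [0xFFFD; 0x41], whereas the default advance would emit one Uchar.rep per stray byte. Which behaviour is preferable depends on how your data tends to get corrupted.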