UTF8 decoding invalid UTF8 encodings

The module String from the standard library has a function to decode a Unicode character

   get_utf_8_uchar : t -> int -> Uchar.utf_decode

such that the call String.get_utf_8_uchar str pos decodes a UTF-8 character at position pos.

Whether the character starting at position pos is valid or invalid, I can use Uchar.utf_decode_length to find out the length of the UTF-8 character and start the next decoding at pos + len.

However, it is unclear to me what happens if pos is not the start of a UTF-8 character, or if pos is a valid start but the following bytes are not correct. From reading and rereading the documentation I have found no way to find out the correct next start position.

Is there a way to find that out? Or expressed differently: How can I find out whether the utf8 encoding starting at pos is incorrect and what is the next starting position?

Regards
Helmut

Then the function returns a Uchar.utf_decode value v which is invalid (for which Uchar.utf_decode_is_valid v is false) and with a Uchar.utf_decode_length v which corresponds to the length of the malformed sequence of bytes.
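As a small illustration of that check, here is a sketch that walks a string and reports the first invalid decode it meets. The name first_invalid is made up for this example and is not part of the stdlib; only String.get_utf_8_uchar and the Uchar.utf_decode_* functions are (OCaml 4.14 or later).

```ocaml
(* Hypothetical helper: returns [Some (pos, len)] for the first invalid
   decode in [s], where [len] is the advance reported by the stdlib for
   that decode, or [None] if the whole string decodes cleanly. *)
let first_invalid (s : string) : (int * int) option =
  let n = String.length s in
  let rec loop i =
    if i >= n then None else
    let d = String.get_utf_8_uchar s i in
    if Uchar.utf_decode_is_valid d
    then loop (i + Uchar.utf_decode_length d)
    else Some (i, Uchar.utf_decode_length d)
  in
  loop 0

let () =
  match first_invalid "ab\xC0cd" with
  | None -> print_endline "no invalid byte sequence"
  | Some (pos, len) -> Printf.printf "invalid decode at %d, advance %d\n" pos len
```

The byte 0xC0 can never appear in well-formed UTF-8, so the decode at index 2 is invalid here.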

First, the "correct next start position" is an ill-defined notion: if you see one invalid byte sequence in a string, it is quite possible that none of the string contents was UTF-8 encoded. Or you could be decoding a binary data format with only some sections being UTF-8 encoded text.

However, if we assume that the input was a UTF-8 encoded string with a low rate of byte errors, the stdlib API is designed to replace invalid sequences of bytes by Uchar.rep if you use it in a straightforward manner:

let b = Bytes.of_string "серафими многоꙮчитїи"
(* Let's introduce an error in the second character `е` by overwriting its second byte *)
let () = Bytes.set b 3 '\192'

let recoded =
  let buff = Buffer.create 30 in
  let rec decode_recode pos =
    if pos >= Bytes.length b then () else
    let decode = Bytes.get_utf_8_uchar b pos in
    (* On an invalid decode this is Uchar.rep *)
    let char = Uchar.utf_decode_uchar decode in
    Buffer.add_utf_8_uchar buff char;
    decode_recode (pos + Uchar.utf_decode_length decode)
  in
  decode_recode 0;
  Buffer.contents buff

yields

с��рафими многоꙮчитїи

for the recoded string as expected.


Decode at the position and check the result with Uchar.utf_decode_is_valid.

There is more than one answer to that question, see this link.

The simplest is to replace the invalid decode by Uchar.rep in your decoded data (which is what Uchar.utf_decode_uchar returns for an invalid decode) and continue with the advance suggested by the invalid decode (returned by Uchar.utf_decode_length).

Effectively this means that if you are not interested in surfacing errors beyond Uchar.rep replacement, your decoding loop can simply be:

let fold_utf_8 f acc s =
  let rec loop f acc s max i =
    if i > max then acc else
    let dec = String.get_utf_8_uchar s i in
    let u = Uchar.utf_decode_uchar dec in
    loop f (f u acc) s max (i + Uchar.utf_decode_length dec)
  in
  loop f acc s (String.length s - 1) 0

By doing so, your best-effort decode will decode according to the WHATWG Encoding Standard.


Thanks @octachron and @dbuenzli. From your answers I draw the conclusion that get_utf_8_uchar stops as soon as it finds an unexpected byte.

Because UTF-8 encoded code points are at most 4 bytes long, the function Uchar.utf_decode_length returns a value between 1 (ASCII character) and 4 (21-bit code point). It does not search for a synchronisation point in case of error.

Since the returned length is at least 1, advancing from pos to pos + len cannot result in an infinite loop.

So the function does what I expected, except for the search for a synchronisation point, i.e. a search for the next leading byte or the end of the string.
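To make the range of lengths concrete, here is a small sketch that decodes the first character of a few sample strings and prints validity and length. The helper decode_info and the sample strings are made up for this example; the bytes are written as escapes so that the source encoding does not matter.

```ocaml
(* Hypothetical helper: validity and advance of the decode at index 0. *)
let decode_info (s : string) : bool * int =
  let d = String.get_utf_8_uchar s 0 in
  (Uchar.utf_decode_is_valid d, Uchar.utf_decode_length d)

let samples = [
  "U+0061 (1 byte)",        "a";
  "U+00E9 (2 bytes)",       "\xC3\xA9";
  "U+20AC (3 bytes)",       "\xE2\x82\xAC";
  "U+1F600 (4 bytes)",      "\xF0\x9F\x98\x80";
  "lone continuation byte", "\x80";
  "truncated 2-byte seq",   "\xC3";
]

let () =
  List.iter
    (fun (label, s) ->
      let valid, len = decode_info s in
      Printf.printf "%-24s valid=%b length=%d\n" label valid len)
    samples
```

The four well-formed samples report lengths 1 to 4, and the two malformed ones report an invalid decode with a strictly positive length, so a loop that advances by that length always makes progress.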

Indeed, by default it does not. As mentioned in the link I referred to, there is more than one way to go about this.

I took a scheme that allowed me to always fit the advance in a few bits, even on 32-bit platforms, to support ergonomic best-effort decoding (see the code in my previous message) while keeping an allocation-free decoding scheme. Looking for a synchronisation point would need an arbitrarily large advance (that is, up to Sys.max_string_length) and thus a decode structure that cannot fit in a machine integer.

If you want to look for a synchronisation point, then on an invalid decode at pos simply search forward for the first UTF-8 starter byte (0x00-7F, 0xC2-DF, 0xE0-EF, 0xF0-F4) after pos and start your next decode from there (don't forget to add a Uchar.rep to your data for the invalid decode, for security reasons).
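A minimal sketch of such a resynchronising loop could look like the following. The names is_starter, next_starter and fold_utf_8_resync are invented for this example and are not stdlib functions; the starter ranges are the ones listed above.

```ocaml
(* Bytes that can begin a well-formed UTF-8 sequence. *)
let is_starter = function
  | '\x00' .. '\x7F' | '\xC2' .. '\xDF'
  | '\xE0' .. '\xEF' | '\xF0' .. '\xF4' -> true
  | _ -> false

(* Hypothetical helper: first index >= i that holds a starter byte,
   or String.length s if there is none. *)
let rec next_starter s i =
  if i >= String.length s || is_starter s.[i] then i
  else next_starter s (i + 1)

(* Like a plain best-effort fold, except that on an invalid decode it
   emits a single Uchar.rep and skips ahead to the next starter byte
   instead of advancing by Uchar.utf_decode_length. *)
let fold_utf_8_resync f acc s =
  let n = String.length s in
  let rec loop acc i =
    if i >= n then acc else
    let dec = String.get_utf_8_uchar s i in
    if Uchar.utf_decode_is_valid dec
    then loop (f (Uchar.utf_decode_uchar dec) acc)
           (i + Uchar.utf_decode_length dec)
    else loop (f Uchar.rep acc) (next_starter s (i + 1))
  in
  loop acc 0

(* Usage: collect the decoded code points as integers. *)
let code_points s =
  List.rev (fold_utf_8_resync (fun u acc -> Uchar.to_int u :: acc) [] s)
```

With this variant a run of stray continuation bytes collapses into one Uchar.rep, whereas the plain loop emits one per invalid decode. Which of the two is preferable depends on how you want errors to show up in the output.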