UTF8 decoding invalid UTF8 encodings

hbr · April 5, 2023, 12:31pm

The module String from the standard library has a function to decode a Unicode character

   get_utf_8_uchar: t -> int -> Uchar.utf_decode

such that the call String.get_utf_8_uchar str pos decodes a utf8 character at position pos.

Whether a valid or an invalid Unicode character starts at position pos, I can use Uchar.utf_decode_length to find out the length of the utf8 character and start the next decoding at pos + len.

However it is unclear to me what happens if pos is not a start point of a utf8 character or if pos is a valid start point but the following bytes are not correct. From reading and rereading the documentation I have found no way to find out the correct next start position.

Is there a way to find that out? Or expressed differently: How can I find out whether the utf8 encoding starting at pos is incorrect and what is the next starting position?

Regards
Helmut

octachron · April 5, 2023, 1:04pm

In that case the function returns a Uchar.utf_decode value v which is invalid (Uchar.utf_decode_is_valid v is false) and whose Uchar.utf_decode_length v is the length of the malformed sequence of bytes.
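As a small illustration of that behaviour, here is a sketch that decodes two deliberately malformed inputs and prints what the decode reports. The two byte strings are made up for the example, and OCaml 4.14 or later is assumed for the Uchar.utf_decode API.

```ocaml
(* Requires OCaml >= 4.14 for the Uchar.utf_decode API. *)

(* 0xD0 announces a 2-byte sequence, but 0xC0 is not a continuation byte. *)
let d1 = String.get_utf_8_uchar "\xD0\xC0a" 0

(* 0xE2 0x82 is a valid prefix of a 3-byte sequence, but 'a' cannot finish it. *)
let d2 = String.get_utf_8_uchar "\xE2\x82a" 0

let () =
  List.iter
    (fun d ->
      Printf.printf "valid=%b length=%d replaced=%b\n"
        (Uchar.utf_decode_is_valid d)
        (Uchar.utf_decode_length d)
        (Uchar.equal (Uchar.utf_decode_uchar d) Uchar.rep))
    [ d1; d2 ]
```

Both decodes should come back invalid, with lengths 1 and 2 respectively, and Uchar.utf_decode_uchar gives Uchar.rep in both cases.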

First, "correct next start position" is an ill-defined notion: if you see one invalid byte sequence in a string, it is quite possible that none of the string contents was utf-8 encoded. Or you could be decoding a binary data format in which only some sections are utf-8 encoded text.

However, if we assume that the input is a utf-8 encoded string with a low rate of byte errors, the stdlib API is designed to replace invalid byte sequences with Uchar.rep if you use it in a straightforward manner:

let b = Bytes.of_string "серафими многоꙮчитїи"
(* Let's introduce an error on the second grapheme cluster `е` *)
let () = Bytes.set b 3 '\192' 

let recoded =
  let buff = Buffer.create 30 in
  let rec decode_recode pos =
    if pos = Bytes.length b then () else
    let decode = Bytes.get_utf_8_uchar b pos in
    let char = Uchar.utf_decode_uchar decode in
    Buffer.add_utf_8_uchar buff char;
    decode_recode (pos + Uchar.utf_decode_length decode)
  in
  decode_recode 0;
  Buffer.contents buff

yields

с��рафими многоꙮчитїи

for the recoded string as expected.

dbuenzli · April 5, 2023, 1:16pm

Decode at the position and check the result with Uchar.utf_decode_is_valid.

There is more than one answer to that question, see this link.

The simplest is to replace the invalid decode by Uchar.rep in your decoded data (which is what Uchar.utf_decode_uchar returns for an invalid decode) and continue with the advance suggested by the invalid decode (returned by Uchar.utf_decode_length).

Effectively this means that if you are not interested in surfacing errors beyond Uchar.rep replacement your decoding loop can simply be:

let fold_utf_8 f acc s =
  let rec loop f acc s max i =
    if i > max then acc else
    let dec = String.get_utf_8_uchar s i in
    let u = Uchar.utf_decode_uchar dec in
    loop f (f u acc) s max (i + Uchar.utf_decode_length dec)
  in
  loop f acc s (String.length s - 1) 0

By doing so your best effort decode will decode according to the WHATWG encoding standard.
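If you do want to surface errors instead of only replacing them, the same loop can check Uchar.utf_decode_is_valid at each step. A minimal sketch, where invalid_positions is a hypothetical helper name (not part of the stdlib) that collects the byte offsets of invalid decodes:

```ocaml
(* Hypothetical helper, not a stdlib function: returns the byte offsets
   at which String.get_utf_8_uchar reports an invalid decode.
   Requires OCaml >= 4.14. *)
let invalid_positions s =
  let len = String.length s in
  let rec loop acc i =
    if i >= len then List.rev acc else
    let dec = String.get_utf_8_uchar s i in
    (* Remember the offset when the decode at [i] is invalid. *)
    let acc = if Uchar.utf_decode_is_valid dec then acc else i :: acc in
    (* The advance is at least 1, valid or not, so the loop terminates. *)
    loop acc (i + Uchar.utf_decode_length dec)
  in
  loop [] 0
```

For example, invalid_positions "a\xD0\xC0b" should give [1; 2]: the truncated two-byte sequence at offset 1 and the stray byte at offset 2 are reported separately.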

hbr · April 5, 2023, 1:35pm

Thanks @octachron and @dbuenzli. From your answers I draw the conclusion that get_utf_8_uchar stops as soon as it finds an unexpected byte.

Because utf8 encoded code points can be at most 4 bytes long, the function Uchar.utf_decode_length returns a value between 1 (ascii character) and 4 (21 bit code point). It does not search for a synchronisation point in case of error.

Since the returned length is at least 1, advancing from pos to pos + len cannot result in an infinite loop.

So the function does what I expected, except for the search for a synchronisation point, i.e. a search for the next leading byte or the end of the string.

dbuenzli · April 5, 2023, 1:51pm

Indeed, by default it does not. As mentioned in the link I referred to, there’s more than one way to go about this.

I took a scheme that allowed me to always fit the advance in a few bits even on 32-bit platforms to support ergonomic best-effort decoding (see code in my previous message) while keeping an allocation-free decoding scheme. Looking for a synchronisation point would need an arbitrarily large (that is, up to Sys.max_string_length) advance and thus a decode structure that cannot fit in a machine integer.

If you want to look for a synchronisation point, then on an invalid decode at pos simply search forward for the first UTF-8 starter byte (0x00-7F, 0xC2-DF, 0xE0-EF, 0xF0-F4) after pos and start your next decode from there (don’t forget to add a Uchar.rep to your data for the invalid decode, for security reasons).
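A rough sketch of that resynchronising variant, under the assumption that one Uchar.rep per invalid run is what you want. The names is_starter and fold_utf_8_resync are made up for this example, and the starter ranges are the ones listed above.

```ocaml
(* Sketch only; [is_starter] and [fold_utf_8_resync] are hypothetical names.
   Requires OCaml >= 4.14. *)

(* Starter bytes: 0x00-7F, 0xC2-DF, 0xE0-EF, 0xF0-F4. *)
let is_starter = function
  | '\x00' .. '\x7F' | '\xC2' .. '\xDF' | '\xE0' .. '\xEF' | '\xF0' .. '\xF4' ->
      true
  | _ -> false

let fold_utf_8_resync f acc s =
  let len = String.length s in
  (* First index >= i that holds a starter byte, or [len] if there is none. *)
  let rec next_starter i =
    if i >= len || is_starter s.[i] then i else next_starter (i + 1)
  in
  let rec loop acc i =
    if i >= len then acc else
    let dec = String.get_utf_8_uchar s i in
    if Uchar.utf_decode_is_valid dec then
      loop (f (Uchar.utf_decode_uchar dec) acc) (i + Uchar.utf_decode_length dec)
    else
      (* Emit a single Uchar.rep, then skip to the next starter after [i]. *)
      loop (f Uchar.rep acc) (next_starter (i + 1))
  in
  loop acc 0
```

On an input such as "a\xC0\x80\x80b" this folds over three scalar values (a, Uchar.rep, b), whereas advancing by Uchar.utf_decode_length alone would produce one Uchar.rep for each of the three bad bytes.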
