Decoding many Unicode strings with uutf

lindig · November 29, 2021, 10:13am

In a server application we need to decode incoming UTF8 strings and are using Uutf for this. We observe a lot of memory usage and notice that Uutf allocates large buffers where our strings are usually small. But maybe we are using it wrong. We are currently creating a new decoder for every incoming string:

let utf8_recode str =
  let out_encoding = `UTF_8 in
  let b = Buffer.create 1024 in
  let dst = `Buffer b in
  let src = `String str in
  let rec loop d e =
    match Uutf.decode d with
    | `Uchar _ as u ->
        ignore (Uutf.encode e u) ;
        loop d e
    | `End ->
        ignore (Uutf.encode e `End)
    | `Malformed _ ->
        ignore (Uutf.encode e (`Uchar Uutf.u_rep)) ;
        loop d e
    | `Await ->
        assert false
  in
  let d = Uutf.decoder src in
  let e = Uutf.encoder out_encoding dst in
  loop d e ; Buffer.contents b

But maybe this is wasteful and we should create a global decoder value once (let decoder = Uutf.decoder ~encoding:`UTF_8) and use that for every string. What is the correct usage pattern, and do others see problems with memory traffic as well?

dbuenzli · November 29, 2021, 10:33am

Uutf’s decoder and encoder abstractions do a lot of stuff for you that you are absolutely not using here.

Use Uutf.String.fold_utf_8. It doesn’t go through the decoder abstraction.
If you are on >= 4.06, directly use Stdlib.Buffer.add_utf_8_uchar.
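
As a small illustration of the second suggestion, here is a minimal sketch of encoding code points into a Buffer with Buffer.add_utf_8_uchar, without any Uutf encoder. It covers only the encoding half of utf8_recode; the helper name utf8_of_uchars is made up for this example and is not part of any library.

```ocaml
(* Sketch, assuming OCaml >= 4.06 (Buffer.add_utf_8_uchar and Uchar.rep). *)

(* Hypothetical helper: UTF-8 encode a list of code points into a string. *)
let utf8_of_uchars (us : Uchar.t list) : string =
  let b = Buffer.create 16 in
  List.iter (Buffer.add_utf_8_uchar b) us;
  Buffer.contents b

(* 'h', U+00E9 and the replacement character U+FFFD. *)
let example = utf8_of_uchars [Uchar.of_int 0x68; Uchar.of_int 0xE9; Uchar.rep]
```

The buffer is the only allocation here; there is no encoder state to create per string.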

And once you can afford 4.14, I suggest you ditch uutf for the Stdlib UTF decoders which do not allocate at all.

Incidentally, I just wrote such a loop using them (warning: the code is untested, and it also replaces U+0000 with Uchar.rep). It does not allocate a new string when there is no decoding error.

let cleanup_input s =
  let clean s dirty =
    let flush b max start i =
      if start <= max then Buffer.add_substring b s start (i - start);
    in
    let rec loop b s max start i =
      if i > max then (flush b max start i; Buffer.contents b) else
      match String.unsafe_get s i with
      | '\x01' .. '\x7F' (* US-ASCII *) -> loop b s max start (i + 1)
      | '\x00' ->
          let next = i + 1 in
          flush b max start i; Buffer.add_utf_8_uchar b Uchar.rep;
          loop b s max next next
      | _ ->
          let d = String.get_utf_8_uchar s i in
          match Uchar.utf_decode_is_valid d with
          | true -> loop b s max start (i + Uchar.utf_decode_length d)
          | false ->
              let next = i + Uchar.utf_decode_length d in
              flush b max start i; Buffer.add_utf_8_uchar b Uchar.rep;
              loop b s max next next
    in
    let b = Buffer.create (String.length s + 2 (* assume only one error *)) in
    let max = String.length s - 1 in
    flush b max 0 dirty; loop b s max dirty dirty
  in
  let rec check s max i =
    if i > max then s else
    match String.unsafe_get s i with
    | '\x01' .. '\x7F' (* US-ASCII *) -> check s max (i + 1)
    | '\x00' -> clean s i
    | _ ->
        let d = String.get_utf_8_uchar s i in
        if Uchar.utf_decode_is_valid d
        then check s max (i + Uchar.utf_decode_length d)
        else clean s i
  in
  check s (String.length s - 1) 0
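
For readers who want the "return the input untouched when it is already valid" behaviour in fewer lines, the following is a rough sketch under stated assumptions, not a replacement for the code above. It assumes OCaml >= 4.14, the name sanitize_utf_8 is made up for this example, it keeps U+0000 as is, and it rescans the whole string from the start when the input turns out to be invalid.

```ocaml
(* Sketch, assuming OCaml >= 4.14. Unlike cleanup_input above, this keeps
   U+0000 unchanged and decodes an invalid string again from the start. *)
let sanitize_utf_8 (s : string) : string =
  if String.is_valid_utf_8 s then s (* fast path: the input is returned as is *)
  else begin
    let len = String.length s in
    let b = Buffer.create (len + 2) in
    let rec loop i =
      if i >= len then Buffer.contents b
      else begin
        let d = String.get_utf_8_uchar s i in
        (* For an invalid decode, utf_decode_uchar yields Uchar.rep. *)
        Buffer.add_utf_8_uchar b (Uchar.utf_decode_uchar d);
        loop (i + Uchar.utf_decode_length d)
      end
    in
    loop 0
  end
```

The trade-off is a second pass over invalid inputs in exchange for a much shorter function; valid inputs still cost one scan and no copy.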

edwin · November 29, 2021, 12:05pm

Thanks for the quick response and the hint about the fold_utf_8 function and Buffer.add_utf_8_uchar. I found equivalents of those in Uutf and rewrote utf8_recode using them in this pull request: "utf8_recode: use Uutf.{Buffer.add_utf_8,String.fold_utf_8} instead of Uutf.{encoder,decoder}" by edwintorok (Pull Request #4586, xapi-project/xen-api on GitHub). I’ve credited this forum thread in the commit message.
Further optimizations are possible, as you have done (stopping at the first invalid UTF-8 character and recoding only the rest), but for now this code needs to stay compatible with OCaml 4.02.3 and 4.10, and I quite like how short utf8_recode has become. It should already reduce the high memory allocation.

dbuenzli · November 29, 2021, 12:40pm

All that should be handled by minor GC collections, but the folders are still wasteful since they allocate one value per decoded Uchar.t. The direct-style API that was upstreamed into the Stdlib does not do that; it relies on an abstract type and bit-fiddling.

I’m a bit embarrassed it took me so long to find that design :–) but glad I didn’t try to upstream earlier ones I had.
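
To make the point about the abstract decode type concrete, here is a small sketch (assuming OCaml >= 4.14) that inspects the result of String.get_utf_8_uchar through its accessors. The names describe and decode_is_immediate are made up for this example. The Obj-based check is diagnostic only: it relies on the current runtime representation of Uchar.utf_decode as an unboxed integer and should not appear in real code.

```ocaml
(* Sketch, assuming OCaml >= 4.14. *)

(* Hypothetical helper: validity, bytes consumed, and decoded code point
   of the UTF-8 decode starting at byte index [i] of [s]. *)
let describe (s : string) (i : int) : bool * int * int =
  let d = String.get_utf_8_uchar s i in
  ( Uchar.utf_decode_is_valid d,
    Uchar.utf_decode_length d,
    Uchar.to_int (Uchar.utf_decode_uchar d) )

(* Diagnostic only: checks that the decode result is an immediate value,
   i.e. not a heap-allocated block. Depends on the current representation. *)
let decode_is_immediate (s : string) : bool =
  Obj.is_int (Obj.repr (String.get_utf_8_uchar s 0))
```

For example, describe "\xC3\xA9" 0 reports a valid two-byte decode of U+00E9, while describe "\xFF" 0 reports an invalid one-byte decode whose character is Uchar.rep.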
