Substring of Unicode string including newlines in Windows

I have an application whose core is developed in OCaml and wrapped by js_of_ocaml to run in a browser. Recently I switched from ocamllex to sedlex to support Unicode, and I used Uuseg_string to write utility functions such as the following substring function.

let uni_sub (s : string) (pos : int) (len : int) =
  let (_, r) =
    Uuseg_string.fold_utf_8
      `Grapheme_cluster
      (fun (p, acc) ch ->
        if pos <= p && p < pos + len then (p + 1, acc ^ ch)
        else (p + 1, acc))
      (0, "")
      s
  in
  r

I just realized that the behavior differs between browsers on macOS and browsers on Windows.

In a browser on macOS, a newline in string input is \n. Let s be the string \n32 received by OCaml; uni_sub s 1 2 correctly returns 32.

However, in a browser on Windows, a newline in string input is \r\n. Let s be the string \r\n32 received by OCaml; uni_sub s 2 2 oddly returns 2 (rather than 32).

Does anyone know how to fix it here?

This is because \r\n is a single grapheme cluster (see UAX #29: Unicode Text Segmentation). You would get the expected answer if you called uni_sub s 1 2 with \r\n as well.

Orthogonal to that, note that repeatedly concatenating strings (e.g. acc ^ ch) is very inefficient, since each ^ allocates a fresh string and copies both operands. Instead you could use a Buffer.t to construct your final string.
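To illustrate the point with a stdlib-only sketch (the function names here are mine, not from the thread): repeated `^` is quadratic in the total length, while a `Buffer.t` grows amortised and does a single final copy.

```ocaml
(* Quadratic: each ^ allocates a fresh string and copies both operands. *)
let concat_naive (parts : string list) : string =
  List.fold_left ( ^ ) "" parts

(* Linear (amortised): the buffer grows in place, one copy in [contents]. *)
let concat_buffered (parts : string list) : string =
  let b = Buffer.create 16 in
  List.iter (Buffer.add_string b) parts;
  Buffer.contents b
```

Both produce the same string; the difference only shows in running time as the number of pieces grows.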

Cheers,
Nicolas


Thanks for your comment.

The pos input of uni_sub comes from position information collected by the parser in Menhir. It seems that Menhir does not consider \r\n as a single character.

Is there a standard or conventional way to deal with \r\n here?

It depends a lot on the needs of your code. One possibility is to do a preliminary pass on your input and rewrite every \r\n with \n.
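A minimal stdlib-only sketch of such a preliminary pass (the function name is mine):

```ocaml
(* Rewrite every \r\n into \n; a lone \r is left untouched. *)
let normalize_newlines (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    if s.[!i] = '\r' && !i + 1 < n && s.[!i + 1] = '\n' then begin
      Buffer.add_char b '\n';
      i := !i + 2
    end else begin
      Buffer.add_char b s.[!i];
      incr i
    end
  done;
  Buffer.contents b
```

This is safe to run on UTF-8 input byte by byte because the bytes 0x0D and 0x0A never occur inside a multi-byte UTF-8 sequence.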

Cheers,
Nicolas


I see… Thank you very much…

It seems that Menhir does not consider \r\n as a single character.

Menhir considers tokens from your lexer, which provides it with positions, and sedlex measures positions over Unicode scalars (same as most text editors), so \r\n spans two scalars. Since grapheme clusters often span multiple scalars, by feeding scalar positions to uni_sub you’re mixing two different units of measure. This will give you the wrong substrings for many other Unicode inputs, not just Windows newlines.

I suspect segmentation isn’t what you’re looking for. To deal with newlines (and other unwanted characters) one would usually skip them in the lexer itself and extract the newline-free substrings with Sedlexing.Utf8.lexeme or sublexeme to return as tokens; this usually takes repeated uses of match%sedlex for a single token.


“This will give you the wrong substrings for many other Unicode inputs, not just Windows newlines.” ==> That sounds scary.

In my lexer.ml, I do have the following code to skip newlines.

let rec token buf = 
  match%sedlex buf with
  ... ...
  | '\n' | "\r\n" -> token buf

However, it seems that the positions that the lexer sends to Menhir don’t ignore the existence of these skipped characters.

Additionally, I do need a substring function which takes a Unicode string, a position and a length as parameters. Because String.sub (and String.length) don’t use the same unit of measure as my lexer + parser for Unicode strings, I have to write my own substring function.

Does anyone have any idea?

The intended, encoding-independent way is to extract all substrings in the lexer with Sedlexing.Utf8.lexeme, return them as part of the tokens and then combine them if needed. Once the positions reach Menhir, you can’t substring from sedlex’s buffer anymore, and so you enter “hack” territory.

If you really need substrings after the lexer, you could implement your uni_sub over scalars using Uutf.String.fold_utf_8 and Buffer.add_utf_8_uchar. Simple fix in your case, but you’d be decoding the entire input string every single time you substring, so I’d suggest it only as a temporary solution.
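A sketch of what that fold-based version could look like (an untested sketch assuming the uutf library; `uni_sub_scalars` is a name introduced here, and malformed input is replaced with `Uchar.rep`):

```ocaml
(* Substring by scalar positions: decodes the whole string on every call. *)
let uni_sub_scalars (s : string) (pos : int) (len : int) : string =
  let b = Buffer.create (String.length s) in
  let _count : int =
    Uutf.String.fold_utf_8
      (* The folder receives the accumulator, the byte position of the
         decode, and the decoded scalar (or a malformed byte sequence). *)
      (fun p _byte_pos u ->
        (if pos <= p && p < pos + len then
           match u with
           | `Uchar c -> Buffer.add_utf_8_uchar b c
           | `Malformed _ -> Buffer.add_utf_8_uchar b Uchar.rep);
        p + 1)
      0 s
  in
  Buffer.contents b
```

With s = "\r\n32", uni_sub_scalars s 2 2 would select the scalars at positions 2 and 3, i.e. 32.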

Alternatively, you could use String.sub if you computed the (encoding-dependent) byte positions as you tokenise, by incrementing an int ref after every match%sedlex by the length in bytes of the current lexeme. You’d need a function <encoding>_length : Uchar.t -> int to compute the encoded length of a scalar (it’s just an if-else chain), and then sum the lengths for each scalar of the lexeme given by Sedlexing.lexeme_char.
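For UTF-8, that if-else chain could look like the sketch below (the function name is mine; since OCaml 4.14 the stdlib also provides the equivalent Uchar.utf_8_byte_length):

```ocaml
(* Number of bytes [u] occupies when encoded in UTF-8. *)
let utf_8_length (u : Uchar.t) : int =
  let c = Uchar.to_int u in
  if c < 0x80 then 1        (* ASCII *)
  else if c < 0x800 then 2
  else if c < 0x10000 then 3 (* most CJK characters land here *)
  else 4                     (* astral plane, e.g. emoji *)
```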

Then attach these offsets to the tokens or, if you don’t care about Menhir positions matching editor ones, replace the offsets given by Sedlexing.lexing_positions with your own and write the supplier unit -> token * position * position yourself rather than using a helper.

However, it seems that the positions that the lexer sends to Menhir don’t ignore the existence of these skipped characters.

Maybe this is related to the behaviour of $startpos compared to $symbolstartpos?


(* I made the question unsolved, because it was not fully solved *)

I tried to “do a preliminary pass on your input and rewrite every \r\n with \n” as @nojb suggested; it solved a big part of the problems, but not all of them. I decided to go with what @debugnik suggested:

If you really need substrings after the lexer, you could implement your uni_sub over scalars using Uutf.String.fold_utf_8 and Buffer.add_utf_8_uchar. Simple fix in your case, but you’d be decoding the entire input string every single time you substring, so I’d suggest it only as a temporary solution.

However, without understanding well how the system works, I could only write the following rough code; I wanted to make the types work in the first place.

let uni_sub_scalars (s: string) (pos: int) (len: int) = 
  let b: Buffer.t = Buffer.create 42 in
  let rec add (acc: string list) (v: [ `Uchar of Stdlib.Uchar.t | `Await | `End ]) : Uuseg.ret =
    match v with
    | `Uchar u -> 
      Buffer.add_utf_8_uchar b u; 
      add acc `Await
    | `Await | `End -> failwith "don't know what to do"
  in
  let (_, r) = 
    Uuseg_string.fold_utf_8 
      (`Custom (Uuseg.custom ~add:add))
      (fun (p, acc) ch -> if (p >= pos) && (p <= pos+len-1) then (p+1, acc ^ ch) else (p+1, acc))
      (0, "")
      s 
    in 
  r

And the compilation returned an error that I don’t know how to fix:

File "lib/utility.ml", line 45, characters 6-39:
45 |       (`Custom (Uuseg.custom ~add:add))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type
         [> `Custom of
              ?mandatory:(string list -> bool) ->
              name:string ->
              create:(unit -> string list) ->
              copy:(string list -> string list) -> unit -> Uuseg.custom ]
       but an expression was expected of type [< Uuseg.boundary ]
       Types for tag `Custom are incompatible
make: *** [lib/utility.cmo] Error 2

Could anyone help me write this substring function of Unicode strings by scalars?

Thank you

Could you maybe explain to us concisely what you are trying to achieve? I doubt defining your own custom Uuseg segmenter is the right way to go about this.

I want to build a substring function uni_sub_scalars: string -> int -> int -> string

  • for Unicode strings
  • which receives scalar positions, because they are what sedlex and menhir send to the function.

So let s = "\r\nabc", uni_sub_scalars s 2 2 is expected to return ab (rather than bc).

Hope it is clear now.

Not really. It’s not clear what the arguments of your function are supposed to represent: indices (in which unit)? A length?

In any case, you don’t need Unicode segmentation to deal with scalar values; at most Uutf. Unfortunately I’m not familiar with either Menhir or sedlex, but don’t they give you back byte indices into the underlying buffer you are giving them?

I would like to write a length function uni_length: string -> int over Unicode strings.

  • Given the string “这是个好问题”, $startpos.pos_cnum of Menhir returns 0 and $endpos.pos_cnum returns 6, but String.length "这是个好问题" returns 18 (its length in bytes). Since String.length does not match the positions returned by Menhir, I could not use it in my program; I would expect uni_length "这是个好问题" to return 6.

  • Given a string “a\r\nbcd”, $startpos.pos_cnum of Menhir returns 0 and $endpos.pos_cnum returns 6, and String.length "a\r\nbcd" returns 6, which is good. I would expect uni_length "a\r\nbcd" to return 6 as well.

I would like to have a substring function uni_sub: string -> int -> int -> string over Unicode strings. uni_sub s pos len is a string of length len, containing the substring of s that starts at position pos and has length len.

  • Given a string “a\r\nbcd”, $startpos.pos_cnum of Menhir returns 0 and $endpos.pos_cnum returns 6, I would expect uni_sub "a\r\nbcd" 3 2 to return bc (rather than cd).

  • Given a string “a\nbcd”, $startpos.pos_cnum of Menhir returns 0 and $endpos.pos_cnum returns 5, I would expect uni_sub "a\nbcd" 3 2 to return cd.

  • Given the string “这是个好问题”, $startpos.pos_cnum of Menhir returns 0 and $endpos.pos_cnum returns 6. Since String.sub does not match the positions returned by Menhir, I would expect uni_sub "这是个好问题" 0 2 to return 这是.

I don’t know much about Unicode (and its vocabulary). I want these functions to treat a Chinese character as one unit, \r\n as two units, and \n as one unit.

Thank you

You should make sure you at least have a basic understanding of this material. Otherwise you are just going to bang your head on the keyboard :–)

So I just read the Menhir manual about positions and, from what I understand, it defers all the definitions to the lexer.

That leaves us with the documentation of sedlex. And one question: why don’t you simply extract the Unicode code points from the lexeme and re-encode them via Buffer.add_utf_8_uchar?

In fact there are even Sedlexing.sub_lexeme and Sedlexing.Utf8.sub_lexeme, which should do the work directly for you.
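For what it’s worth, a spec this concrete can also be met with the OCaml stdlib alone, assuming OCaml ≥ 4.14 for String.get_utf_8_uchar. The uni_length and uni_sub below are sketches of the functions specified above, not code from the thread; malformed bytes decode to U+FFFD and still count as one scalar.

```ocaml
(* Length of [s] in Unicode scalar values (requires OCaml >= 4.14). *)
let uni_length (s : string) : int =
  let rec loop i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      loop (i + Uchar.utf_decode_length d) (n + 1)
  in
  loop 0 0

(* Substring of [s] starting at scalar position [pos], of [len] scalars. *)
let uni_sub (s : string) (pos : int) (len : int) : string =
  let b = Buffer.create (String.length s) in
  let rec loop i p =
    if i < String.length s && p < pos + len then begin
      let d = String.get_utf_8_uchar s i in
      if p >= pos then Buffer.add_utf_8_uchar b (Uchar.utf_decode_uchar d);
      loop (i + Uchar.utf_decode_length d) (p + 1)
    end
  in
  loop 0 0;
  Buffer.contents b
```

Unlike the Uutf decoder version below, this sketch silently truncates if pos + len runs past the end of the string rather than raising.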

Lots of good advice has been given already, but just to answer the OP’s original question: below is an implementation of uni_sub and uni_length using Uutf:

let uni_sub s start count =
  assert (start >= 0 && count >= 0);
  let d = Uutf.decoder (`String s) in
  (* Decode [idx] more scalars, then return the byte offset reached. *)
  let rec get idx =
    if idx = 0 then
      Uutf.decoder_byte_count d
    else begin
      match Uutf.decode d with
      | `Await -> assert false
      | `End -> failwith "uni_sub"
      | `Malformed _ | `Uchar _ -> get (pred idx)
    end
  in
  let startofs = get start in
  let endofs = get count in (* the decoder continues from [startofs] *)
  String.sub s startofs (endofs - startofs)

let uni_length s =
  let d = Uutf.decoder (`String s) in
  let rec loop i =
    match Uutf.decode d with
    | `Await -> assert false
    | `End -> i
    | `Malformed _ | `Uchar _ -> loop (succ i)
  in
  loop 0

Cheers,
Nicolas


At least for computing the length I would rather use Uutf's folders (also, always count something for Malformed, assuming a Uchar.rep has been added, as it imperatively should be on best-effort decodes):

let utf_8_uchar_len s = Uutf.String.fold_utf_8 (fun len _ _ -> len + 1) 0 s

Good point, I amended the code as suggested.

Cheers,
Nicolas

That is a very useful introduction to Unicode, but could you clarify one point for me? The introduction states: “Be aware though that as far as OCaml’s compiler is concerned [string literals] are just sequences of bytes and you can’t trust these strings to be valid UTF-8 as they depend on how correctly your editor encodes them.”

This suggests that the encoding of a string literal in an OCaml binary will necessarily be the same as the source encoding emitted by the code editor. Is that something which can always be relied on in the case of the OCaml compiler? It is not true of C: C has the concept of a source character set for the compiler’s representation of a string literal in the source code, and an execution character set to represent the literal in the binary, and the two need not be the same. gcc complicates it a little by also having an “input character set” for source files, which is converted into gcc’s notion of the source character set. In the absence of an explicit specification of the input file encoding using the -finput-charset option, this is assumed by gcc to be the locale encoding (thus potentially making source code unportable), or, if there is none or it cannot be determined, it is assumed to be UTF-8. With gcc, the execution character set can be specified using the -fexec-charset option; if none is specified it defaults to UTF-8.

Since the OCaml compiler is partly written in C, does the OCaml compiler pick this up for you? My apologies if this is something dealt with in the manual; my knowledge of the manual is incomplete.

… if that encoding is recognized by the lexer (which means it will at least need to be US-ASCII compatible).

Although it is now deprecated, the compiler allows ISO 8859-1 sources and string literals. I don’t know what the compiler pipeline eventually does to your string literals, but if that claim weren’t true it would definitely count as a bug for the deprecated ISO 8859-1 support (which encodes code points 0–255).