Unicode-aware parser combinators

Interesting, thanks, didn’t know about this part of the specs:

However, a document is still well-formed even if it is not fully normalized. XML processors should provide a user option to verify that the document being processed is in fully normalized form, and report to the application whether it is or not. The option to not verify should be chosen only when the input text is certified, as defined by B Definitions for Character Normalization.

The verification of full normalization must be carried out as if by first verifying that the entity is in include-normalized form as defined by B Definitions for Character Normalization and by then verifying that none of the relevant constructs listed above begins (after character references are expanded) with a composing character as defined by B Definitions for Character Normalization. Non-validating processors must ignore possible denormalizations that would be caused by inclusion of external entities that they do not read.

XML processors must not transform the input to be in fully normalized form. XML applications that create XML 1.1 output from either XML 1.1 or XML 1.0 input should ensure that the output is fully normalized; it is not necessary for internal processing forms to be fully normalized.

The purpose of this section is to strongly encourage XML processors to ensure that the creators of XML documents have properly normalized them, so that XML applications can make tests such as identity comparisons of strings without having to worry about the different possible “spellings” of strings which Unicode allows.
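The “different spellings” problem is easy to demonstrate outside of XML: the precomposed and decomposed forms of é render identically but are distinct byte sequences, so naive string equality tells them apart (plain OCaml, nothing XML-specific):

let nfc = "\u{00E9}"        (* é as the single code point U+00E9 *)
let nfd = "e\u{0301}"       (* é as "e" followed by U+0301 COMBINING ACUTE ACCENT *)
let () = assert (nfc <> nfd)    (* same rendered text, different bytes *)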

Separating parsing from lexing and pushing the Unicode-awareness into the lexer is sensible. But from that perspective, most of the OCaml parser combinator libraries I’ve looked at are problematic: they don’t seem to be designed to work with a lexer, operating directly on streams of characters rather than streams of tokens. Fmlib_parse appears to be an exception (see the sketch below).
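To make the distinction concrete, here is a minimal sketch of what a token-level combinator could look like; the token type and parser representation are hypothetical, not Fmlib_parse’s actual API:

type token = Ident of string | Op of string | Lparen | Rparen

(* the parser consumes tokens, not raw bytes; the lexer owns all Unicode concerns *)
type 'a parser = token Seq.t -> ('a * token Seq.t) option

let any_token : token parser = fun s ->
    match s () with
    | Seq.Nil -> None
    | Seq.Cons (t, rest) -> Some (t, rest)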

Since you’re looking to use a parser combinator library, why not define your own parser for Unicode characters? I’m using Angstrom here, but the approach should apply to any combinator library.

Define parsers for the constituent bytes of UTF-8:

open Angstrom

(* classify a byte by masking its high bits: 0xxxxxxx, 110xxxxx, 1110xxxx
   and 11110xxx lead 1-4 byte sequences; 10xxxxxx is a continuation byte *)
let utf8_byte1_1, utf8_byte1_2, utf8_byte1_3, utf8_byte1_4, utf8_byte_follow =
    let check mask bits x = Char.code x land mask = bits in
    satisfy (check 0b1000_0000 0b0000_0000),
    satisfy (check 0b1110_0000 0b1100_0000),
    satisfy (check 0b1111_0000 0b1110_0000),
    satisfy (check 0b1111_1000 0b1111_0000),
    satisfy (check 0b1100_0000 0b1000_0000)

Combine one to four bytes into the string for a single UTF-8-encoded code point:

let utf8_bytes1, utf8_bytes2, utf8_bytes3, utf8_bytes4 =
    let of_chars cs = List.to_seq cs |> String.of_seq in
    lift (String.make 1) utf8_byte1_1,
    lift2 (fun c1 c2 -> of_chars [c1; c2])
        utf8_byte1_2 utf8_byte_follow,
    lift3 (fun c1 c2 c3 -> of_chars [c1; c2; c3])
        utf8_byte1_3 utf8_byte_follow utf8_byte_follow,
    lift4 (fun c1 c2 c3 c4 -> of_chars [c1; c2; c3; c4])
        utf8_byte1_4 utf8_byte_follow utf8_byte_follow utf8_byte_follow

An encoded UTF-8 character is then 1 to 4 of these bytes:

let any_utf8 =
    choice ~failure_msg:"invalid UTF-8 byte"
        [utf8_bytes1; utf8_bytes2; utf8_bytes3; utf8_bytes4]
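A quick sanity check with Angstrom’s parse_string (the \u escape is just a convenient way to write the two-byte input):

(* "é" (U+00E9) encodes as the two bytes 0xC3 0xA9 *)
let () =
    match parse_string ~consume:Consume.All any_utf8 "\u{00E9}" with
    | Ok s -> assert (String.length s = 2)
    | Error msg -> prerr_endline msg

Note that this only validates the byte structure: overlong encodings and other invalid scalar values (e.g. surrogates) still get through, which is where the stdlib check further down comes in.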

Then, for example, asking the user to define a custom operator of length three could look like:

let utf8_take n = String.concat "" <$> count n any_utf8
let op3 = utf8_take 3
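Each UTF-8 character counts once regardless of its byte length, so an operator built from non-ASCII symbols works as expected (the operator here is made up for illustration):

(* "≤≥≠" is three code points but nine bytes *)
let () =
    match parse_string ~consume:Consume.All op3 "\u{2264}\u{2265}\u{2260}" with
    | Ok op -> assert (String.length op = 9)
    | Error msg -> prerr_endline msg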

Obviously, this means re-defining any of the provided char and string combinators you need, which might be more of a hassle than it’s worth:

(** check that a string contains exactly one valid UTF-8 character
    (uses [String.get_utf_8_uchar], available since OCaml 4.14) *)
let is_valid_utf8_1 s =
    s <> ""
    && let u = String.get_utf_8_uchar s 0 in
       Uchar.utf_decode_is_valid u
       && Uchar.utf_decode_length u = String.length s

(** parse the given UTF-8 character *)
let utf8 c =
    if is_valid_utf8_1 c then
        (any_utf8 >>= fun c' ->
            if c = c' then return c'
            else fail (Printf.sprintf "expected %S" c))
    else failwith "expected one UTF-8 character"
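For example, a parser that accepts exactly one arbitrary non-ASCII character:

(* matches exactly "λ" (U+03BB) and nothing else *)
let lambda = utf8 "\u{03BB}"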

I also can’t imagine it’s particularly efficient.

The Cf_scan combinators in Orsetto, which are the foundation of its Unicode-aware combinators, support staging on token streams produced by a lexical analyzer, e.g. sedlex or ocamllex.


I found myself in the same situation in OCaml and wrote a quick clone of FParsec (F#), which has decent error reporting for a parser combinator library. However, I don’t maintain that anymore; I’m now using the sedlex PPX for a recursive-descent parser, with a custom Sedlexing module implementing the PPX’s expected API to decode UTF-8 strings directly.
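For reference, a minimal sedlex lexer looks something like this (a generic sketch using the stock Sedlexing module, not my custom one):

(* requires the sedlex PPX; Sedlexing.Utf8 reads UTF-8 input *)
let rec token buf =
    match%sedlex buf with
    | Plus (Chars " \t\n") -> token buf   (* skip whitespace *)
    | Plus ('a' .. 'z' | 'A' .. 'Z') -> `Ident (Sedlexing.Utf8.lexeme buf)
    | eof -> `Eof
    | _ -> failwith "unexpected character"

let () =
    match token (Sedlexing.Utf8.from_string "hello") with
    | `Ident s -> assert (s = "hello")
    | _ -> assert false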

Both times I used the UTF decoding functions from the standard library to implement them.
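Those stdlib decoders (OCaml 4.14+) are straightforward to drive by hand; for instance, a fold over the code points of a UTF-8 string (a generic sketch, not code from either project):

(* visit each scalar value of [s]; invalid bytes decode to U+FFFD *)
let fold_utf8 f acc s =
    let rec go acc i =
        if i >= String.length s then acc
        else
            let d = String.get_utf_8_uchar s i in
            go (f acc (Uchar.utf_decode_uchar d)) (i + Uchar.utf_decode_length d)
    in
    go acc 0

let count_chars s = fold_utf8 (fun n _ -> n + 1) 0 s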

I use fmlib_parse in precisely this kind of setup.