Sedlex (why not ocamllex?)

I’ve been using sedlex to write a YAML parser, and it’s worked very well. Pretty much, “does what it says on the can”.

But in the process, I’ve started to wonder why sedlex is different from ocamllex. A long time ago, I needed Unicode parsing support, so I hacked ocamllex. It wasn’t very hard (and others have done the same thing in Haskell, and probably in other languages): you modify ocamllex so that it

  1. normalizes single-character regexps (e.g. deals with union, complement, difference)
  2. then converts each character in the normalized set into its UTF-8 byte sequence
  3. and then, well, just lets ocamllex do its thing.
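
Step 2 above can be sketched roughly as follows. This is only an illustration of the idea, not the original hack (which is lost): the names `utf8_bytes` and `ocamllex_regexp_of_code_point` are made up for this example, and it handles a single code point rather than a whole character set.

```ocaml
(* Hypothetical helper: encode one Unicode scalar value as its UTF-8 bytes. *)
let utf8_bytes (cp : int) : int list =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_bytes: not a Unicode scalar value"
  else if cp < 0x80 then [ cp ]
  else if cp < 0x800 then
    [ 0xC0 lor (cp lsr 6); 0x80 lor (cp land 0x3F) ]
  else if cp < 0x10000 then
    [ 0xE0 lor (cp lsr 12);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]
  else
    [ 0xF0 lor (cp lsr 18);
      0x80 lor ((cp lsr 12) land 0x3F);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]

(* Hypothetical helper: render the bytes as a sequence of ocamllex
   character literals, which a byte-oriented lexer can match directly. *)
let ocamllex_regexp_of_code_point (cp : int) : string =
  utf8_bytes cp
  |> List.map (Printf.sprintf "'\\x%02X'")
  |> String.concat " "
```

With this, a character such as U+00E9 becomes the two-byte sequence `'\xC3' '\xA9'`, and the unmodified automaton construction takes over from there. A character set would become an alternation of such sequences.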

You also have to modify the regular expression for “any char” so that it matches one complete UTF-8-encoded character rather than a single byte, but that’s not tricky. I might have forgotten a step, but I think that was it.
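
For the “any char” part, one plausible reading is that the pattern becomes an alternation over the well-formed UTF-8 byte sequences. The sketch below writes that alternation out as data; the byte ranges are the standard well-formed UTF-8 ranges, while the names `utf8_any_char`, `matches_any_char`, and `to_ocamllex` are invented for this example and are not part of ocamllex or sedlex.

```ocaml
(* Each alternative is a sequence of inclusive byte ranges (lo, hi).
   Together they cover exactly the well-formed UTF-8 encodings of one
   code point (no overlong forms, no surrogates, nothing above U+10FFFF). *)
let utf8_any_char : (int * int) list list =
  let tail = (0x80, 0xBF) in
  [ [ (0x00, 0x7F) ];
    [ (0xC2, 0xDF); tail ];
    [ (0xE0, 0xE0); (0xA0, 0xBF); tail ];
    [ (0xE1, 0xEC); tail; tail ];
    [ (0xED, 0xED); (0x80, 0x9F); tail ];
    [ (0xEE, 0xEF); tail; tail ];
    [ (0xF0, 0xF0); (0x90, 0xBF); tail; tail ];
    [ (0xF1, 0xF3); tail; tail; tail ];
    [ (0xF4, 0xF4); (0x80, 0x8F); tail; tail ] ]

(* True when [bytes] is exactly one well-formed UTF-8 encoded code point. *)
let matches_any_char (bytes : int list) : bool =
  let matches_alt alt =
    List.length alt = List.length bytes
    && List.for_all2 (fun (lo, hi) b -> lo <= b && b <= hi) alt bytes
  in
  List.exists matches_alt utf8_any_char

(* Render one alternative in ocamllex regexp syntax. *)
let to_ocamllex (alt : (int * int) list) : string =
  alt
  |> List.map (fun (lo, hi) ->
         if lo = hi then Printf.sprintf "'\\x%02X'" lo
         else Printf.sprintf "['\\x%02X'-'\\x%02X']" lo hi)
  |> String.concat " "
```

Joining the rendered alternatives with `|` would give a byte-level regexp that stands in for “any char” in a lexer that only knows about bytes.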

I don’t remember any other issues, but obviously sedlex didn’t take this route. I wondered why.

Mostly, I’m just curious. But also, I’m planning on implementing this YAML parser in other programming languages, at least some of which don’t have unicode-capable lexical-analyzer-generators, so I figured I might as well ask the question.

  1. Is your YAML parser available?
  2. Can you share your tricks for handling Unicode more explicitly? Do you have a link to an ocamllex file that does those things?
  1. Still under development [more accurate to say: “it’s raw code, no doc, just starting off”], and currently I’m getting the testsuite to work. Also, this isn’t for YAML, but for what I’m calling (provisionally) “Block Style for JSON” (BS4J). That is to say, it subtracts a lot of what makes YAML hard to parse, as well as the bits that aren’t JSON:
    • anchors
    • tags
    • complex keys
    • special characters aren’t allowed in unquoted strings
    It also adds C++ “raw string literals” for multiline scalars.
  2. I assume you mean “how to handle Unicode when parsing with ocamllex”. It’s been nearly 20 years since I did the hack, and the code’s been lost to the sands of time.

P.S. I think the current YAML spec is a complete mess. Nobody designing a “language” would design it at the level of characters; rather, one would define the lexemes of the language, and then the grammar, as is done for JSON. Y’know, the way we define all other languages.
