Sedlex (why not ocamllex?)

Chet_Murthy · March 24, 2021, 4:49am

I’ve been using sedlex to write a YAML parser, and it’s worked very well. Pretty much, “does what it says on the can”.

But in the process, I’ve started to wonder why sedlex is different from ocamllex. A long time ago, I needed unicode parsing support, so I hacked ocamllex. It wasn’t very hard (and others have done the same thing in Haskell, probably other languages): you modify ocamllex, so that it

- normalizes single-character regexps (e.g. deals with union, complement, difference)
- then converts each single-character glyph into its UTF-8 byte sequence
- and then, well, just lets ocamllex do its thing.

You have to modify the regular expression for “any char” so that it properly represents any Unicode character, but that’s not tricky. I might have forgotten a step, but I think that was it.
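To make the steps above concrete, here is a minimal OCaml sketch of the idea. It is not the original hack and not ocamllex's internal representation: the `re` type and the names `utf8_bytes`, `re_of_charset`, `any_char`, and `matches` are all hypothetical, introduced only for illustration. It assumes the character set has already been normalized into an explicit list of code points, and that “any char” should mean “any well-formed UTF-8 sequence”.

```ocaml
(* Hypothetical byte-level regexp type, for illustration only. *)
type re =
  | Range of int * int  (* one byte in the inclusive range [lo, hi] *)
  | Seq of re list      (* concatenation *)
  | Alt of re list      (* alternation *)

(* Encode one Unicode scalar value as its UTF-8 byte sequence. *)
let utf8_bytes (cp : int) : int list =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_bytes: not a Unicode scalar value"
  else if cp < 0x80 then [ cp ]
  else if cp < 0x800 then
    [ 0xC0 lor (cp lsr 6); 0x80 lor (cp land 0x3F) ]
  else if cp < 0x10000 then
    [ 0xE0 lor (cp lsr 12);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]
  else
    [ 0xF0 lor (cp lsr 18);
      0x80 lor ((cp lsr 12) land 0x3F);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]

(* A normalized character set (assumed here to be a plain list of code
   points) becomes an alternation of byte sequences. *)
let re_of_charset (cps : int list) : re =
  Alt
    (List.map
       (fun cp -> Seq (List.map (fun b -> Range (b, b)) (utf8_bytes cp)))
       cps)

(* "Any char" as any well-formed UTF-8 sequence: no overlong forms,
   no surrogates, nothing above U+10FFFF. *)
let any_char : re =
  let cont = Range (0x80, 0xBF) in
  Alt
    [ Range (0x00, 0x7F);
      Seq [ Range (0xC2, 0xDF); cont ];
      Seq [ Range (0xE0, 0xE0); Range (0xA0, 0xBF); cont ];
      Seq [ Range (0xE1, 0xEC); cont; cont ];
      Seq [ Range (0xED, 0xED); Range (0x80, 0x9F); cont ];
      Seq [ Range (0xEE, 0xEF); cont; cont ];
      Seq [ Range (0xF0, 0xF0); Range (0x90, 0xBF); cont; cont ];
      Seq [ Range (0xF1, 0xF3); cont; cont; cont ];
      Seq [ Range (0xF4, 0xF4); Range (0x80, 0x8F); cont; cont ] ]

(* Naive reference matcher: all possible remainders after matching [r]
   at the start of [input]. Needs OCaml >= 4.10 for List.concat_map. *)
let rec step (r : re) (input : int list) : int list list =
  match r, input with
  | Range (lo, hi), b :: rest when lo <= b && b <= hi -> [ rest ]
  | Range _, _ -> []
  | Seq rs, _ ->
    List.fold_left (fun ins r -> List.concat_map (step r) ins) [ input ] rs
  | Alt rs, _ -> List.concat_map (fun r -> step r input) rs

(* True when [r] matches the whole of [input]. *)
let matches (r : re) (input : int list) : bool = List.mem [] (step r input)
```

A real generator would translate code point ranges into byte ranges rather than enumerating every code point, since enumeration blows up for large classes. The sketch only shows that once the regexps are phrased over bytes, an ordinary byte-oriented lexer generator can take over.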

I don’t remember any other issues, but obviously sedlex didn’t take this route. I wondered why.

Mostly, I’m just curious. But also, I’m planning on implementing this YAML parser in other programming languages, at least some of which don’t have unicode-capable lexical-analyzer-generators, so I figured I might as well ask the question.

aryx · March 26, 2021, 12:15pm

Is your YAML parser available?
Can you share more explicitly your tricks to handle unicode? Do you have a link to an ocamllex file doing those things?

Chet_Murthy · March 26, 2021, 12:27pm

Still under development [more accurate to say: “it’s raw code, no doc, just starting off”], and currently I’m getting the testsuite to work. Also, this isn’t for YAML, but for what I’m calling (provisionally) “Block Style for JSON” (BS4J). That is to say, it subtracts a lot of what makes YAML hard to parse, as well as the bits that aren’t JSON:
- anchors
- tags
- complex keys
- special characters aren’t allowed in unquoted strings
  and it adds C++ “raw string literals” for multiline scalars.

I assume you mean “how to handle unicode when parsing with ocamllex”. It’s been nearly 20yr since I did the hack, and the code’s been lost to the sands of time.

P.S. I think the current YAML spec is a complete mess. Nobody designing a “language” would design it at the level of characters; rather, one would define the lexemes of the language, and then the grammar, as is done for JSON. Y’know, the way we define all other languages.
