I’ve been using sedlex to write a YAML parser, and it’s worked very well. Pretty much, “does what it says on the can”.
But in the process, I’ve started to wonder why sedlex is different from ocamllex. A long time ago, I needed Unicode parsing support, so I hacked ocamllex. It wasn’t very hard (and others have done the same thing in Haskell, and probably in other languages): you modify ocamllex so that it
- normalizes single-character regexps (e.g. deals with union, complement, difference)
- then converts each single character (code point) into its UTF-8 byte sequence
- and then, well, just lets ocamllex do its thing.
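To make the second step concrete, here is a minimal sketch of the conversion. It is not ocamllex's actual code: the `byte_re` type and the `re_of_code_point` and `re_of_char_set` helpers are hypothetical stand-ins for whatever regexp syntax tree the generator uses internally. A real implementation would probably work on code point ranges rather than enumerating a set one character at a time.

```ocaml
(* Encode one Unicode scalar value as its UTF-8 byte sequence. *)
let utf8_bytes (cp : int) : int list =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_bytes: not a Unicode scalar value"
  else if cp < 0x80 then [ cp ]
  else if cp < 0x800 then
    [ 0xC0 lor (cp lsr 6); 0x80 lor (cp land 0x3F) ]
  else if cp < 0x10000 then
    [ 0xE0 lor (cp lsr 12);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]
  else
    [ 0xF0 lor (cp lsr 18);
      0x80 lor ((cp lsr 12) land 0x3F);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]

(* Hypothetical byte-level regexp type, for illustration only. *)
type byte_re =
  | Byte of int          (* a single byte *)
  | Seq of byte_re list  (* concatenation *)
  | Alt of byte_re list  (* alternation *)

(* A single character becomes a concatenation of its UTF-8 bytes. *)
let re_of_code_point (cp : int) : byte_re =
  Seq (List.map (fun b -> Byte b) (utf8_bytes cp))

(* A normalized character set becomes an alternation of such sequences. *)
let re_of_char_set (cps : int list) : byte_re =
  Alt (List.map re_of_code_point cps)
```

For example, `re_of_code_point 0xE9` (the character é) yields `Seq [Byte 0xC3; Byte 0xA9]`, which an unmodified byte-oriented lexer generator can compile as usual.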
You have to modify the regular expression for “any char” so that it matches a whole UTF-8-encoded character rather than a single byte, but that’s not tricky. I might have forgotten a step, but I think that was it.
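As a rough illustration of what a byte-level “any char” could look like, the sketch below writes it as an alternation of byte-range sequences, following the well-formed UTF-8 byte sequences listed in the Unicode Standard. The names `any_char` and `match_any_char` are made up for this example, and the matcher is only there to exercise the table; it is not how ocamllex or sedlex do it.

```ocaml
(* "Any char" as an alternation; each alternative is a sequence of
   inclusive byte ranges (lo, hi). Overlong encodings, surrogates, and
   values above U+10FFFF are excluded by construction. *)
let any_char : (int * int) list list =
  [ [ (0x00, 0x7F) ];
    [ (0xC2, 0xDF); (0x80, 0xBF) ];
    [ (0xE0, 0xE0); (0xA0, 0xBF); (0x80, 0xBF) ];
    [ (0xE1, 0xEC); (0x80, 0xBF); (0x80, 0xBF) ];
    [ (0xED, 0xED); (0x80, 0x9F); (0x80, 0xBF) ];
    [ (0xEE, 0xEF); (0x80, 0xBF); (0x80, 0xBF) ];
    [ (0xF0, 0xF0); (0x90, 0xBF); (0x80, 0xBF); (0x80, 0xBF) ];
    [ (0xF1, 0xF3); (0x80, 0xBF); (0x80, 0xBF); (0x80, 0xBF) ];
    [ (0xF4, 0xF4); (0x80, 0x8F); (0x80, 0xBF); (0x80, 0xBF) ] ]

(* Length in bytes of the well-formed character starting at [pos],
   or None if no alternative matches there. *)
let match_any_char (s : string) (pos : int) : int option =
  let fits alt =
    let n = List.length alt in
    (* The bounds check runs first, so the indexing below is safe. *)
    pos >= 0
    && pos + n <= String.length s
    && List.for_all Fun.id
         (List.mapi
            (fun i (lo, hi) ->
              let b = Char.code s.[pos + i] in
              lo <= b && b <= hi)
            alt)
  in
  Option.map List.length (List.find_opt fits any_char)
```

Since the alternatives have disjoint lead-byte ranges, at most one of them can match at a given position, so a byte-oriented automaton built from this expression stays deterministic without extra work.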
I don’t remember any other issues, but obviously sedlex didn’t take this route. I wondered why.
Mostly, I’m just curious. But I’m also planning to implement this YAML parser in other programming languages, at least some of which don’t have Unicode-capable lexical-analyzer generators, so I figured I might as well ask the question.