Hi I’m very new to OCaml and haven’t written a parser in it before. I’m curious if you could code review this simple ~70 line parser for my PPX. I wrote it because my old regex based implementation could handle
Hello there {name} I am {name2} and {string_of_int 1}
but not
Hi {some_func {a=1; b=2;}} you're \\{great!\\}
It’s tested pretty decently so I know it works, but I want to know if I did it in the optimal way or if there’s a better way to do it whether that means using more advanced parsing techniques or OCaml features I don’t know about.
type tstr =
| TStr of string
| TVar of string
type parse_state =
| Start
| InString
| InCurly of int
type parse_error =
| NO_EXPRESSION
| NO_OPENING
| NO_CLOSING
| EMPTY_STR
exception Parse_error of int * parse_error
let parse_str str =
let len = String.length str in
let buf = Buffer.create 500 in
let add c = Buffer.add_char buf c in
let contents () =
let s = Buffer.contents buf in
Buffer.clear buf;
s
in
let rec parser ~i ~state ~lst str =
if i = len then
match state with
| Start ->
raise (Parse_error (0, EMPTY_STR))
| InString ->
if Buffer.length buf = 0 then
lst
else
(TStr (contents ()) :: lst)
| InCurly _ ->
raise (Parse_error (i, NO_CLOSING))
else
let c = str.[i] in
match state with
| Start ->
if c == '{' then
str |> parser ~i:(i + 1) ~state:(InCurly 0) ~lst
else
str |> parser ~i ~state:InString ~lst
| InString ->
if i = len - 1 then
match c with
| '{' ->
raise (Parse_error (i, NO_CLOSING))
| '}' ->
raise (Parse_error (i, NO_OPENING))
| _ ->
add c;
str |> parser ~i:(i + 1) ~state:InString ~lst
else
let c2 = str.[i + 1] in (
match c, c2 with
| '}', _ ->
raise (Parse_error (i, NO_OPENING))
| '\\', ('{' | '}') ->
add c2;
str |> parser ~i:(i + 2) ~state:InString ~lst
| '{', _ ->
let s = TStr (contents ()) in
str |> parser ~i:(i + 1) ~state:(InCurly 0) ~lst:(s :: lst)
| _ ->
add c;
str |> parser ~i:(i + 1) ~state:InString ~lst
)
| InCurly n ->
match c with
| '{' ->
add c;
str |> parser ~i:(i + 1) ~state:(InCurly (n + 1)) ~lst
| '}' ->
if Buffer.length buf = 0 then
raise (Parse_error (i, NO_EXPRESSION))
else if n = 0 then (
let s = TVar (contents ()) in
str |> parser ~i:(i + 1) ~state:InString ~lst:(s :: lst)
)
else (
add c;
str |> parser ~i:(i + 1) ~state:(InCurly (n - 1)) ~lst
)
| _ ->
add c;
str |> parser ~i:(i + 1) ~state:(InCurly n) ~lst
in
str
|> parser ~i:0 ~state:Start ~lst:[]
|> List.rev
Have you considered implementing the string parser with a scanner generator like Menhir (or OCamlLex)?
Well I don’t have much experience in parsing so I might be wrong here, but lexing seemed like overkill because the language was so simple. Also with regards to Menhir, I had trouble writing a grammar where the outermost set of curly braces matter, but any curly braces inside of those don’t matter.
I was wrong in suggesting Menhir (a parser generator) but using OCamlLex would be an option. I agree that it requires a bit of experience to encode nested constructs and escape mechanisms. The advantage is that you can focus on the interesting cases and don’t have to deal with all the book keeping. I have a small article Recipes for OCamlLex that advocates this approach but of course it’s fine to use a hand-written scanner as well.
I agree that it requires a bit of experience to encode nested constructs and escape mechanisms
How would you define a lexer that deals with the nested constructs? I actually tried using sedlexing and re, and I was unable to do it. As far as I know, you’d need regular expressions with non-greedy matching and negative/positive lookaheads/lookbehinds.
The main scanner calls into a sub scanner for comments which maintains a counter for nesting. Sub scanners are typically used for strings and comments and provide power beyond what simple regular expressions can do.
Here is a first sketch how to implement scanning with OCamlLex. Handling of escapes needs to be improved. The main part are two rules - one for the outer string and one for embedded variables. This can be made also more efficient by recognising longer strings that don’t contain critical characters rather than adding one character at the time.
rule string b acc = parse
| '{' { let s = B.contents b in
let v = variable (B.create 100) 0 lexbuf in
string (B.create 100) (VAR(v)::STR(s)::acc) lexbuf
}
| "\\{" { B.add_string b "{" ; string b acc lexbuf }
| "\\}" { B.add_string b "}" ; string b acc lexbuf }
| "\\" _ { error lexbuf "unrecognized escape: %s" (get lexbuf) }
| _ { B.add_string b (get lexbuf); string b acc lexbuf }
| eof { let s = B.contents b in
let acc = STR(s)::acc in
List.rev acc
}
and variable b n = parse
| '}' { if n = 0 then
let s = B.contents b in s
else begin
B.add_string b (get lexbuf); variable b (n-1) lexbuf
end
}
| '{' { B.add_string b (get lexbuf); variable b (n+1) lexbuf }
| _ { B.add_string b (get lexbuf); variable b n lexbuf }
| eof { error lexbuf "unexpected end of string" }
I start to have doubts about this approach. If the parts inside {...} can contain arbitrary OCaml code, you would need a full OCaml parser to recognise where this ends. The code could contain something like "...}..." or "...\\}..." and the scanner would have to recognise OCaml’s string syntax in order to parse this correctly.
I think you have to allow braces in nested expressions. It’s marginally easier if you don’t allow comments. Then you just need to lex strings, which can’t be nested, so that’s simpler. There are several programs that deal with this sort-of properly in their lexers, including Menhir, ocamllex, ocamlyacc, etc. It’s a little bit messy but not overly so.
It might simply be easier to mandate that the syntax only takes variables however. That’s far simpler still.
BTW, random syntax note: Ruby does #{foo}, and I would suggest doing that instead of {foo} for two reasons: (1) precedent, and (2) #{ is less likely to show up in ordinary strings than {, so there will be fewer times when ugly \\ quoting ends up being needed.
Yeah and JS, which is what inspired me does ${. Perhaps saving one character of typing doesn’t justify making my parser (lexer?) so much more complicated.
You are correct that these problems have been solved before by the tools you mention but I no longer would call this a simple template string parser. The overall goal was to implement a PPX for template strings. Unless these templates are really long and complex, maybe OCaml’s {| ... |} strings are good enough?
let name = "aria"
let str = Printf.sprintf {|
Hello there %s I am %d
|} name 1
vs
let name = "aria"
let name2 = "ppx_str"
let my_str = [%str
{|Hello there {name} I am {name2} and {string_of_int 1}|}
]
Either way, the presence of the two character sequence makes unintended uses rarer, so it’s more convenient that way.
You ignored the other rationale. Braces alone are common. Things like #{ or ${ or whatever aren’t. It’s nicer to do it that way. And as for python, “precedent” is the weakest possible argument, right?
In Ruby, these things tend to be long and complex (like HTML pages being substituted into). It’s true that if you’re dealing with something short there’s truly no point.
I like that idea a lot. In combination with long strings ({|...|}), this would make creating in-line documents a lot easier. I assume {..} would contain expressions and not just names.