Code review: Simple template string parser


#1

Code

Hi I’m very new to OCaml and haven’t written a parser in it before. I’m curious if you could code review this simple ~70 line parser for my PPX. I wrote it because my old regex based implementation could handle

Hello there {name} I am {name2} and {string_of_int 1}

but not

Hi {some_func {a=1; b=2;}} you're \\{great!\\}

It’s tested pretty decently so I know it works, but I want to know if I did it in the optimal way or if there’s a better way to do it whether that means using more advanced parsing techniques or OCaml features I don’t know about.

type tstr =
  | TStr of string
  | TVar of string
type parse_state =
  | Start
  | InString
  | InCurly of int
type parse_error =
  | NO_EXPRESSION
  | NO_OPENING
  | NO_CLOSING
  | EMPTY_STR

exception Parse_error of int * parse_error

let parse_str str =
  let len = String.length str in
  let buf = Buffer.create 500 in

  let add c = Buffer.add_char buf c in
  let contents () =
    let s = Buffer.contents buf in
    Buffer.clear buf;
    s
  in

  let rec parser ~i ~state ~lst str =
    if i = len then
      match state with
      | Start ->
        raise (Parse_error (0, EMPTY_STR))
      | InString ->
          if Buffer.length buf = 0 then
            lst
          else
            (TStr (contents ()) :: lst)
      | InCurly _ ->
        raise (Parse_error (i, NO_CLOSING))
    else
      let c = str.[i] in
      match state with
      | Start ->
        if c == '{' then
          str |> parser ~i:(i + 1) ~state:(InCurly 0) ~lst
        else
          str |> parser ~i ~state:InString ~lst
      | InString ->
          if i = len - 1 then
            match c with
            | '{' ->
              raise (Parse_error (i, NO_CLOSING))
            | '}' ->
              raise (Parse_error (i, NO_OPENING))
            | _ ->
              add c;
              str |> parser ~i:(i + 1) ~state:InString ~lst
          else
            let c2 = str.[i + 1] in (
            match c, c2 with
            | '}', _ ->
              raise (Parse_error (i, NO_OPENING))
            | '\\', ('{' | '}') ->
              add c2;
              str |> parser ~i:(i + 2) ~state:InString ~lst
            | '{', _ ->
              let s = TStr (contents ()) in
              str |> parser ~i:(i + 1) ~state:(InCurly 0) ~lst:(s :: lst)
            | _ ->
              add c;
              str |> parser ~i:(i + 1) ~state:InString ~lst
            )
      | InCurly n ->
        match c with
        | '{' ->
          add c;
          str |> parser ~i:(i + 1) ~state:(InCurly (n + 1)) ~lst
        | '}' ->
          if Buffer.length buf = 0 then
            raise (Parse_error (i, NO_EXPRESSION))
          else if n = 0 then (
            let s = TVar (contents ()) in
            str |> parser ~i:(i + 1) ~state:InString ~lst:(s :: lst)
          )
          else (
            add c;
            str |> parser ~i:(i + 1) ~state:(InCurly (n - 1)) ~lst
          )
        | _ ->
          add c;
          str |> parser ~i:(i + 1) ~state:(InCurly n) ~lst
    in
  str
    |> parser ~i:0 ~state:Start ~lst:[]
    |> List.rev

#2

Have you considered implementing the string parser with a scanner generator like OCamlLex?


#3

Have you considered implementing the string parser with a scanner generator like Menhir (or OCamlLex)?

Well I don’t have much experience in parsing so I might be wrong here, but lexing seemed like overkill because the language was so simple. Also with regards to Menhir, I had trouble writing a grammar where the outermost set of curly braces matter, but any curly braces inside of those don’t matter.


#4

I was wrong in suggesting Menhir (a parser generator) but using OCamlLex would be an option. I agree that it requires a bit of experience to encode nested constructs and escape mechanisms. The advantage is that you can focus on the interesting cases and don’t have to deal with all the book keeping. I have a small article Recipes for OCamlLex that advocates this approach but of course it’s fine to use a hand-written scanner as well.


#5

I agree that it requires a bit of experience to encode nested constructs and escape mechanisms

How would you define a lexer that deals with the nested constructs? I actually tried using sedlexing and re, and I was unable to do it. As far as I know, you’d need regular expressions with non-greedy matching and negative/positive lookaheads/lookbehinds.


#6

As an example, take a look at how OCaml scans comments:

The main scanner calls into a sub scanner for comments which maintains a counter for nesting. Sub scanners are typically used for strings and comments and provide power beyond what simple regular expressions can do.

Here is a first sketch how to implement scanning with OCamlLex. Handling of escapes needs to be improved. The main part are two rules - one for the outer string and one for embedded variables. This can be made also more efficient by recognising longer strings that don’t contain critical characters rather than adding one character at the time.

rule string b acc = parse
| '{'       { let s = B.contents b in 
              let v = variable (B.create 100) 0 lexbuf in
              string (B.create 100) (VAR(v)::STR(s)::acc) lexbuf
            }
| "\\{"     { B.add_string b "{" ; string b acc lexbuf }  
| "\\}"     { B.add_string b "}" ; string b acc lexbuf }  
| "\\" _    { error lexbuf "unrecognized escape: %s" (get lexbuf) }
| _         { B.add_string b (get lexbuf); string b acc lexbuf }
| eof       { let s   = B.contents b in 
              let acc = STR(s)::acc in
              List.rev acc
            }
 
and variable b n = parse
| '}'       { if n = 0 then 
                let s = B.contents b in s 
              else begin
                B.add_string b (get lexbuf); variable b (n-1) lexbuf
              end
            }
| '{'       { B.add_string b (get lexbuf); variable b (n+1) lexbuf }
| _         { B.add_string b (get lexbuf); variable b n lexbuf }
| eof       { error lexbuf "unexpected end of string" }



#7

Here is another example of how to lex nested comments. Should be easy to adapt: https://github.com/smolkaj/ocaml-parsing/blob/master/src/Lexer.cppo.ml


#8

I start to have doubts about this approach. If the parts inside {...} can contain arbitrary OCaml code, you would need a full OCaml parser to recognise where this ends. The code could contain something like "...}..." or "...\\}..." and the scanner would have to recognise OCaml’s string syntax in order to parse this correctly.


#9

You need to lex ocaml strings, characters and comments inside {...}, similar to how ocamllex actions are handled in https://github.com/ocaml/ocaml/blob/f04bbafd05d5060ba3c9839031c4f68fac75e977/lex/lexer.mll#L312-L342.


#10

Well I guess I can add that in, or just ask people not to nest strings/comments with unbalanced curly braces in their strings :sweat_smile:


#11

I think you have to allow braces in nested expressions. It’s marginally easier if you don’t allow comments. Then you just need to lex strings, which can’t be nested, so that’s simpler. There are several programs that deal with this sort-of properly in their lexers, including Menhir, ocamllex, ocamlyacc, etc. It’s a little bit messy but not overly so.

It might simply be easier to mandate that the syntax only takes variables however. That’s far simpler still.

BTW, random syntax note: Ruby does #{foo}, and I would suggest doing that instead of {foo} for two reasons: (1) precedent, and (2) #{ is less likely to show up in ordinary strings than {, so there will be fewer times when ugly \\ quoting ends up being needed.


#12

Yeah and JS, which is what inspired me does ${. Perhaps saving one character of typing doesn’t justify making my parser (lexer?) so much more complicated.


#13

I believe “precedent” is the weakest possible argument, and since there are other ones issued here, it’s negligible. And anyway python does {foo}.


#14

You are correct that these problems have been solved before by the tools you mention but I no longer would call this a simple template string parser. The overall goal was to implement a PPX for template strings. Unless these templates are really long and complex, maybe OCaml’s {| ... |} strings are good enough?

let name  = "aria"

let str = Printf.sprintf {|
  Hello there %s I am %d
  |} name 1

vs

let name = "aria"
let name2 = "ppx_str"


let my_str = [%str
  {|Hello there {name} I am {name2} and {string_of_int 1}|}
]

#15

Either way, the presence of the two character sequence makes unintended uses rarer, so it’s more convenient that way.

You ignored the other rationale. Braces alone are common. Things like #{ or ${ or whatever aren’t. It’s nicer to do it that way. And as for python, “precedent” is the weakest possible argument, right?


#16

In Ruby, these things tend to be long and complex (like HTML pages being substituted into). It’s true that if you’re dealing with something short there’s truly no point.


#17

I would just like to add:

  • I think string templating along these lines is a good idea
  • Allowing expressions (rather than just vars) is a good idea
  • Parsing ocaml expressions may be complicated in the general case, but the simple case (where brackets are nested properly etc) is still very useful

#18

Sometimes I wonder if we could add a templating syntax to the current format parser, something like

"Hello %{user.name}s, your account is %{account_number user}d."

#19

I like that idea a lot. In combination with long strings ({|...|}), this would make creating in-line documents a lot easier. I assume {..} would contain expressions and not just names.


#20

Yes, I reformulated the example to use full expressions.

(Note for format experts: "%{print_user}{user}a"…)