In Angstrom, how to specify pasers to consume the whole input string?

The simplified version of my code is at (gist link).

The snippet is this:

type exp = 
  | Int of int
  | Plus of exp * exp

let integer =
  ws *>
  take_while1 is_digit >>| (fun x -> Int (int_of_string x))

(* this works *)
let int_plus = 
  let* es = (sep_by1 (lchar '+') integer) in
  let e1, er = List.hd es, List.tl es in
  (return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er))

(* this doesn't work as expected *)
let expr : exp t = 
  fix (fun expr ->
      let plus = 
        let* es = (sep_by1 (lchar '+') expr) in
        let e1, er = List.hd es, List.tl es in
        return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
      in
      ws *> 
      (choice [integer; plus ]))

While parsing 1 + 1,
int_plus can parse 1 + 1 to Plus (Int 1, Int 1) as I expected,
expr just returns 1 (then how does it deal with the rest + 1?)

I check several examples like this and this, but they all use look-ahead.

May I ask how to fix my code to let expr parse the whole string and reject the partial result?

1 Like

Since you are are using Angstrom.choice it will run the list of parsers (integer and plus) in your case and return the first one that succeeds. In your scenario integer matches the 1 so its a successfull match.

You would need to update the definition for the integer parser. Your integer parser would probably need to know that its not a success if there are still some content left to consume. Lookahead is a nice to achieve that behavior.

I dou’t quite understand how lookahead could help this. In the official and your examples, lookahead one char is sufficient to make a choice, however, in this example, whether it’s
123 or 123 + 321 can not be determined if we just know the first 1.

Angstrom also provides a peek_string. Although in your example with the integer parser the you’ll have 123 since you use a take_while. You could probably do something with ws *> integer <* ws and then perform a lookahead maybe?

I tried ws *> integer <* ws but fails.

In several examples I read, whitespaces are ignored at the beginning of terms. Is it the suggested style or just coincident?

I think my core problem is I don’t understand how Backtracking by default works.
If I can require the outermost expr to consume the whole string, it should reject the case after it chooses integer.

I think I know what’s happening, but it’d be great id someone more knowledgable can chime in.

let expr : exp t =
  fix (fun expr ->
        let* es = sep_by1 (lchar '+') (parens expr <|> integer) in
        let e1, er = (List.hd es, List.tl es) in
        return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
      )

Changing the expr definition the one above gets things to behave the way you expect. I think the choice [integer; plus ] pretty much guarantees that the parser will stop after parsing the first number.

What we want to express with fix is that we’d like to parse an expression but we need to refer to the same expr parser inside that.

Note that I used the parens since i saw it in your gist and assumed that’s important for your use case as well.

1 Like

Thanks for your answer and kindness.

(parens expr <|> integer) in your solution is the same idea of lookahead one char.
If it’s a ‘(’, then it must be an expr, otherwise it’s an int.

parens are useful in my work but not in this case. I tried to use parens to enforce an outermost wrapping of the whole string like this. But it just returns a failure.

let expr : exp t = 
  parens @@
  fix (fun expr ->
      let plus = 
        let* es = (sep_by1 (lchar '+') expr) in
        let e1, er = List.hd es, List.tl es in
        return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
      in
      ws *> 
      (integer <|> plus)

;;
eval "(1 + 1)";;

I have a menhir lexer and parser for the language and I am trying writing an Angstrom parser for it. I think your point on choice is correct and I am even more wondering on how should I achieve this.

I think without the parens things sort of got stuck in a loop and parens just caused the first parser to fail so we fell back on integer. So in any scenario you’d need at least one peek to decide how to start the parsing run. I think you are on the right track. You probably just need one peek just to start parsing.

Stepping away from fix for a while, if you still want a parser that ensures that the entire string is consumed, you can use the end_of_input, or if you just care about a line, end_of_line and use parser <* end_of_input etc. This will fail if the entire content isn’t consumed.

1 Like

I also used the end_of_input before when the eval is like

let eval (s : string) =
  match parse_string (expr <* end_of_input) s with
  | Ok v      -> v
  | Error msg -> failwith msg

The sad thing is the problem remains.
1 + 1 return failure since the choice chooses the int.

I may read some source code to gain more knowledge. The program’s behavior doesn’t match my understanding on non-deterministic monad / monadic parsing so I should have some mis-concept somewhere.

peek should solve the code problems as every demo including yours have done.

end_of_input would be useful for just the integer parser. So you could define that as integer <* end_of_input. This would match “12” but not “12 + …”. So then you could write choose [integer ; plus].

But that doesn’t really work in a situation where you are dealing with expressions and not just a one off comparison between Int and Plus. peek combined with stripping whitespace should be enough to cover most of the cases in a parser like the one you showed.

If you can think about examples that might help people understand this better, please do open issues against my repo. I’ll be be more than happy to work on adding more examples and accepting contributions to improve the documentation so they are easier to understand.

it’s been a while that I come back to this post.

1 - It’s easy to test how angstrom deal with backtrack

open Angstrom
let expr = 
  fix (fun expr ->
    choice [char '2'; char '1'; char '1' *> (char '2')]
  )

let eval1 s =
  match parse_string ~consume:All expr s with
  | Ok v      -> v
  | Error msg -> failwith msg

let no_fix =
  choice [char '2'; char '1'; char '1' *> (char '2')]

let eval2 s =
  match parse_string ~consume:All expr s with
  | Ok v      -> v
  | Error msg -> failwith msg

;;
eval1 "1";;
(* eval1 "12";; *)

eval2 "1";;
(* eval2 "12";; *)

Either recursive and non-recursive parser cannot parse “12” due to char '1' in the choice succeeds. I think the backtracking here somewhat shallow (in my word). It can test and come back from char '2', but cannot come back from char '1'. I was thinking the backtracking should be universal, no matter how deep we currently progress. I see the source code at a glance which isn’t like a non-deterministic monadic implementation to support a deep backtrack.

Therefore, to parse an expression, which has associativity and precedence issues e.g. 1 + 2 * 3 - 4, we might use the traditional ideas like term factor to split the parser, just like the usage section of angstrom README