In Angstrom, how to specify pasers to consume the whole input string?

arbipher · December 20, 2019, 6:25am

The simplified version of my code is at (gist link).

The snippet is this:

type exp = 
  | Int of int
  | Plus of exp * exp

let integer =
  ws *>
  take_while1 is_digit >>| (fun x -> Int (int_of_string x))

(* this works *)
let int_plus = 
  let* es = (sep_by1 (lchar '+') integer) in
  let e1, er = List.hd es, List.tl es in
  (return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er))

(* this doesn't work as expected *)
let expr : exp t = 
  fix (fun expr ->
      let plus = 
        let* es = (sep_by1 (lchar '+') expr) in
        let e1, er = List.hd es, List.tl es in
        return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
      in
      ws *> 
      (choice [integer; plus ]))

While parsing 1 + 1,
int_plus can parse 1 + 1 to Plus (Int 1, Int 1) as I expected,
expr just returns 1 (then how does it deal with the rest + 1?)

I check several examples like this and this, but they all use look-ahead.

May I ask how to fix my code to let expr parse the whole string and reject the partial result?

anuragsoni · December 20, 2019, 2:04pm

Since you are are using Angstrom.choice it will run the list of parsers (integer and plus) in your case and return the first one that succeeds. In your scenario integer matches the 1 so its a successfull match.

You would need to update the definition for the integer parser. Your integer parser would probably need to know that its not a success if there are still some content left to consume. Lookahead is a nice to achieve that behavior.

arbipher · December 20, 2019, 2:22pm

I dou’t quite understand how lookahead could help this. In the official and your examples, lookahead one char is sufficient to make a choice, however, in this example, whether it’s
123 or 123 + 321 can not be determined if we just know the first 1.

anuragsoni · December 20, 2019, 2:26pm

Angstrom also provides a peek_string. Although in your example with the integer parser the you’ll have 123 since you use a take_while. You could probably do something with ws *> integer <* ws and then perform a lookahead maybe?

arbipher · December 20, 2019, 2:51pm

I tried ws *> integer <* ws but fails.

In several examples I read, whitespaces are ignored at the beginning of terms. Is it the suggested style or just coincident?

I think my core problem is I don’t understand how Backtracking by default works.
If I can require the outermost expr to consume the whole string, it should reject the case after it chooses integer.

anuragsoni · December 20, 2019, 6:20pm

I think I know what’s happening, but it’d be great id someone more knowledgable can chime in.

let expr : exp t =
  fix (fun expr ->
        let* es = sep_by1 (lchar '+') (parens expr <|> integer) in
        let e1, er = (List.hd es, List.tl es) in
        return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
      )

Changing the expr definition the one above gets things to behave the way you expect. I think the choice [integer; plus ] pretty much guarantees that the parser will stop after parsing the first number.

What we want to express with fix is that we’d like to parse an expression but we need to refer to the same expr parser inside that.

Note that I used the parens since i saw it in your gist and assumed that’s important for your use case as well.

arbipher · December 20, 2019, 6:59pm

Thanks for your answer and kindness.

(parens expr <|> integer) in your solution is the same idea of lookahead one char.
If it’s a ‘(’, then it must be an expr, otherwise it’s an int.

parens are useful in my work but not in this case. I tried to use parens to enforce an outermost wrapping of the whole string like this. But it just returns a failure.

let expr : exp t = 
  parens @@
  fix (fun expr ->
      let plus = 
        let* es = (sep_by1 (lchar '+') expr) in
        let e1, er = List.hd es, List.tl es in
        return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
      in
      ws *> 
      (integer <|> plus)

;;
eval "(1 + 1)";;

I have a menhir lexer and parser for the language and I am trying writing an Angstrom parser for it. I think your point on choice is correct and I am even more wondering on how should I achieve this.

anuragsoni · December 20, 2019, 7:21pm

I think without the parens things sort of got stuck in a loop and parens just caused the first parser to fail so we fell back on integer. So in any scenario you’d need at least one peek to decide how to start the parsing run. I think you are on the right track. You probably just need one peek just to start parsing.

Stepping away from fix for a while, if you still want a parser that ensures that the entire string is consumed, you can use the end_of_input, or if you just care about a line, end_of_line and use parser <* end_of_input etc. This will fail if the entire content isn’t consumed.

arbipher · December 20, 2019, 7:37pm

I also used the end_of_input before when the eval is like

let eval (s : string) =
  match parse_string (expr <* end_of_input) s with
  | Ok v      -> v
  | Error msg -> failwith msg

The sad thing is the problem remains.
1 + 1 return failure since the choice chooses the int.

I may read some source code to gain more knowledge. The program’s behavior doesn’t match my understanding on non-deterministic monad / monadic parsing so I should have some mis-concept somewhere.

peek should solve the code problems as every demo including yours have done.

anuragsoni · December 20, 2019, 7:49pm

end_of_input would be useful for just the integer parser. So you could define that as integer <* end_of_input. This would match “12” but not “12 + …”. So then you could write choose [integer ; plus].

But that doesn’t really work in a situation where you are dealing with expressions and not just a one off comparison between Int and Plus. peek combined with stripping whitespace should be enough to cover most of the cases in a parser like the one you showed.

anuragsoni · December 20, 2019, 7:50pm

If you can think about examples that might help people understand this better, please do open issues against my repo. I’ll be be more than happy to work on adding more examples and accepting contributions to improve the documentation so they are easier to understand.

arbipher · May 29, 2020, 4:17pm

it’s been a while that I come back to this post.

1 - It’s easy to test how angstrom deal with backtrack

open Angstrom
let expr = 
  fix (fun expr ->
    choice [char '2'; char '1'; char '1' *> (char '2')]
  )

let eval1 s =
  match parse_string ~consume:All expr s with
  | Ok v      -> v
  | Error msg -> failwith msg

let no_fix =
  choice [char '2'; char '1'; char '1' *> (char '2')]

let eval2 s =
  match parse_string ~consume:All expr s with
  | Ok v      -> v
  | Error msg -> failwith msg

;;
eval1 "1";;
(* eval1 "12";; *)

eval2 "1";;
(* eval2 "12";; *)

Either recursive and non-recursive parser cannot parse “12” due to char '1' in the choice succeeds. I think the backtracking here somewhat shallow (in my word). It can test and come back from char '2', but cannot come back from char '1'. I was thinking the backtracking should be universal, no matter how deep we currently progress. I see the source code at a glance which isn’t like a non-deterministic monadic implementation to support a deep backtrack.

Therefore, to parse an expression, which has associativity and precedence issues e.g. 1 + 2 * 3 - 4, we might use the traditional ideas like term factor to split the parser, just like the usage section of angstrom README

Topic		Replies	Views
Parsing simple recursive expressions with Angstrom Learning angstrom , parsing	4	479	December 29, 2023
Parsing alternative for more than one char with angstrom Learning angstrom	2	604	March 31, 2021
Debugging Angstrom (or parser combinators in general) Learning angstrom , parsing	5	414	December 29, 2023
Parse tree structure with Angstrom Learning angstrom	0	626	October 14, 2021
Angstrom parser optimization Learning angstrom	1	2611	March 23, 2018

In Angstrom, how to specify pasers to consume the whole input string?

Related topics