The simplified version of my code is at (gist link).
The snippet is this:
type exp =
| Int of int
| Plus of exp * exp
let integer =
ws *>
take_while1 is_digit >>| (fun x -> Int (int_of_string x))
(* this works *)
let int_plus =
let* es = (sep_by1 (lchar '+') integer) in
let e1, er = List.hd es, List.tl es in
(return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er))
(* this doesn't work as expected *)
let expr : exp t =
fix (fun expr ->
let plus =
let* es = (sep_by1 (lchar '+') expr) in
let e1, er = List.hd es, List.tl es in
return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
in
ws *>
(choice [integer; plus ]))
While parsing 1 + 1
,
int_plus
can parse 1 + 1
to Plus (Int 1, Int 1)
as I expected,
expr
just returns 1
(then how does it deal with the rest + 1
?)
I check several examples like this and this, but they all use look-ahead.
May I ask how to fix my code to let expr
parse the whole string and reject the partial result?
1 Like
Since you are are using Angstrom.choice
it will run the list of parsers (integer and plus) in your case and return the first one that succeeds. In your scenario integer
matches the 1
so its a successfull match.
You would need to update the definition for the integer
parser. Your integer
parser would probably need to know that its not a success if there are still some content left to consume. Lookahead is a nice to achieve that behavior.
I douât quite understand how lookahead could help this. In the official and your examples, lookahead one char is sufficient to make a choice, however, in this example, whether itâs
123
or 123 + 321
can not be determined if we just know the first 1
.
Angstrom also provides a peek_string
. Although in your example with the integer parser the youâll have 123
since you use a take_while
. You could probably do something with ws *> integer <* ws
and then perform a lookahead maybe?
I tried ws *> integer <* ws
but fails.
In several examples I read, whitespaces are ignored at the beginning of terms. Is it the suggested style or just coincident?
I think my core problem is I donât understand how Backtracking by default works.
If I can require the outermost expr
to consume the whole string, it should reject the case after it chooses integer
.
I think I know whatâs happening, but itâd be great id someone more knowledgable can chime in.
let expr : exp t =
fix (fun expr ->
let* es = sep_by1 (lchar '+') (parens expr <|> integer) in
let e1, er = (List.hd es, List.tl es) in
return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
)
Changing the expr
definition the one above gets things to behave the way you expect. I think the choice [integer; plus ]
pretty much guarantees that the parser will stop after parsing the first number.
What we want to express with fix is that weâd like to parse an expression but we need to refer to the same expr
parser inside that.
Note that I used the parens
since i saw it in your gist and assumed thatâs important for your use case as well.
1 Like
Thanks for your answer and kindness.
(parens expr <|> integer)
in your solution is the same idea of lookahead one char.
If itâs a â(â, then it must be an expr, otherwise itâs an int.
parens
are useful in my work but not in this case. I tried to use parens
to enforce an outermost wrapping of the whole string like this. But it just returns a failure.
let expr : exp t =
parens @@
fix (fun expr ->
let plus =
let* es = (sep_by1 (lchar '+') expr) in
let e1, er = List.hd es, List.tl es in
return (List.fold_left (fun acc e -> Plus (acc, e)) e1 er)
in
ws *>
(integer <|> plus)
;;
eval "(1 + 1)";;
I have a menhir lexer and parser for the language and I am trying writing an Angstrom parser for it. I think your point on choice
is correct and I am even more wondering on how should I achieve this.
I think without the parens things sort of got stuck in a loop and parens just caused the first parser to fail so we fell back on integer
. So in any scenario youâd need at least one peek to decide how to start the parsing run. I think you are on the right track. You probably just need one peek just to start parsing.
Stepping away from fix
for a while, if you still want a parser that ensures that the entire string is consumed, you can use the end_of_input
, or if you just care about a line, end_of_line
and use parser <* end_of_input
etc. This will fail if the entire content isnât consumed.
1 Like
I also used the end_of_input
before when the eval is like
let eval (s : string) =
match parse_string (expr <* end_of_input) s with
| Ok v -> v
| Error msg -> failwith msg
The sad thing is the problem remains.
1 + 1
return failure since the choice chooses the int.
I may read some source code to gain more knowledge. The programâs behavior doesnât match my understanding on non-deterministic monad / monadic parsing so I should have some mis-concept somewhere.
peek
should solve the code problems as every demo including yours have done.
end_of_input would be useful for just the integer parser. So you could define that as integer <* end_of_input
. This would match â12â but not â12 + âŚâ. So then you could write choose [integer ; plus]
.
But that doesnât really work in a situation where you are dealing with expressions and not just a one off comparison between Int
and Plus
. peek
combined with stripping whitespace should be enough to cover most of the cases in a parser like the one you showed.
If you can think about examples that might help people understand this better, please do open issues against my repo. Iâll be be more than happy to work on adding more examples and accepting contributions to improve the documentation so they are easier to understand.
itâs been a while that I come back to this post.
1 - Itâs easy to test how angstrom deal with backtrack
open Angstrom
let expr =
fix (fun expr ->
choice [char '2'; char '1'; char '1' *> (char '2')]
)
let eval1 s =
match parse_string ~consume:All expr s with
| Ok v -> v
| Error msg -> failwith msg
let no_fix =
choice [char '2'; char '1'; char '1' *> (char '2')]
let eval2 s =
match parse_string ~consume:All expr s with
| Ok v -> v
| Error msg -> failwith msg
;;
eval1 "1";;
(* eval1 "12";; *)
eval2 "1";;
(* eval2 "12";; *)
Either recursive and non-recursive parser cannot parse â12â due to char '1'
in the choice
succeeds. I think the backtracking here somewhat shallow (in my word). It can test and come back from char '2'
, but cannot come back from char '1'
. I was thinking the backtracking should be universal, no matter how deep we currently progress. I see the source code at a glance which isnât like a non-deterministic monadic implementation to support a deep backtrack.
Therefore, to parse an expression, which has associativity and precedence issues e.g. 1 + 2 * 3 - 4
, we might use the traditional ideas like term
factor
to split the parser, just like the usage section of angstrom README