Hi all,
I’ve been playing around with what i think is a bug with ocamllex. I am creating a compiler and have noticed some unusual activity with the ‘-’ operator. I have defined my set of tokens using ocamllex
rule read_tok =
parse
…
| “-” { MINUS }
…
and when compiling a program (for example fib(n-1)), i get a syntax error that points to the ‘-’ operator. Though, when i change the token from ‘-’ to anything else that doesn’t start with a ‘-’, it compiles successfully. All other binary operators work, so the parser logic is fine. Here are the helper regexes i’ve defined at the top of the lexer file, which are all pretty standard stuff.
It was not meant at all to be funny or question the author. Rather to indicate a psychological bias that is very common and that I often fall for. But re-reading it I agree that it doesn’t come across like that, so I have deleted that part of my reply. Apologies to @os12345678 it if came across badly.
To debug issues in my lexer, I usually define a token_to_string function and then something along the lines of:
let verbose_lexer lexbuf =
let token = original_lexer lexbuf in
Printf.eprintf ">>%s<<\n" (token_to_string token);
token
and use that as a lexer. If the flow of token is what I expect it to be, then the problem lies in the parser. If I have unexpected then it means that one of my regular expressions (or semantic action in the lexer) is buggy.
In your case, since I’m assuming that you have in your rules:
| "-" { MINUS }
| ...
| int { ... }
and since ocamllex is longest match (and only first-match in the case the two matched prefixes are the same length) then “-1” is always lexed a int, never as MINUS followed by INT.
Does your example work if you write f (n- 1) ?
Yes again to these. It seems that n-1 is being parsed as an identifier followed by a number, but n- 1 is parsed correctly as an identifier, minus, number. what do you suggest doing to fix this?
Upon investigating further, it is being lexed as IDENT INT. Though, since ocaml applies rules; longest match, and match first rule, i am unsure why it is being lexed as this - the minus token is defined before the int token, shown by
I’m just a bit unsure why its happening. any thoughts?
The rule is longest match, and only use first match to discriminate between matches of equal lengths.
Here, the string “-1” could be matched as “-” (followed by stuff to be read later) or as “-1” (via your int regex). In that latter case, the longest match has length 2 (“-1”) and therefore that rule is selected.
How to fix this really depends on what kind of parser you use and of your syntax. For instance, from your example it’s not clear what syntax you use for function calls. If it’s a “traditional” syntax f(x,y,z) then a solution might be to just remove the minus sign from your regex, parse MINUS expr (unary minus) with a higher priority than expr MINUS expr (subtraction). An example of how to do that with OCamlyacc can be found in the OCaml manual (see the %prec directive for the MINUS expr case).
If your syntax for function call is like in OCaml, that is f x y z then you have an ambiguity between f -1, meaning the function call f (-1) or f - 1 meaning the subtraction. In OCaml, it is the latter, and so to be able to write the function call, one has to either use explicit parentheses or use the unary minus symbol ~- as in f ~- 1.
Another solution is to wait until the typing phase (if you have one) and use types to disambiguate. If f has a function type, then transform f - 1 (subtraction) into f (-1) function call.
And if you decide to always have to use parens for the negative sign - (-a) and (- a) meaning -a - and allow Haskell-style curried infix operators, where (+ a) means fun x -> x + a, you end up with the Haskell special case of having to define fun x -> x - a as (subtract a) because (- a) is just -a. (there is now a compiler extension for that: LexicalNegation).
And as I’ve said, this really trips up almost everybody doing his first “calculator”.
My advice would be to never use a lexer and parser generator if you don’t have to (because it’s homework ;). You always end up fighting both in the end when you want readable error messages (and all of them, not just the first) and incremental lexing/parsing. Which you always want to design for from the start.