Bug with Ocamllex

os12345678 · November 13, 2023, 10:11am

Hi all,
I’ve been playing around with what i think is a bug with ocamllex. I am creating a compiler and have noticed some unusual activity with the ‘-’ operator. I have defined my set of tokens using ocamllex

rule read_tok =
parse
…
| “-” { MINUS }
…

and when compiling a program (for example fib(n-1)), i get a syntax error that points to the ‘-’ operator. Though, when i change the token from ‘-’ to anything else that doesn’t start with a ‘-’, it compiles successfully. All other binary operators work, so the parser logic is fine. Here are the helper regexes i’ve defined at the top of the lexer file, which are all pretty standard stuff.

let digit = [‘0’-‘9’]

let alpha = [‘a’-‘z’ ‘A’-‘Z’]

let int = ‘-’? digit+

let id = (alpha) (alpha|digit|‘_’)*

let whitespace = [’ ’ ‘\t’]+

let newline = ‘\r’ | ‘\n’ | “\r\n”

Has anyone else experienced this issue before?

nojb · November 13, 2023, 10:38am

Are you sure the argument is not being lexed as IDENT INT?

Cheers,
Nicolas

Release-Candidate · November 13, 2023, 11:29am

Yes, prefix and infix minus are always a PITA, that’s why OCaml uses ~-

nojb · November 13, 2023, 12:58pm

I think this was directed at my comment:

It was not meant at all to be funny or question the author. Rather to indicate a psychological bias that is very common and that I often fall for. But re-reading it I agree that it doesn’t come across like that, so I have deleted that part of my reply. Apologies to @os12345678 it if came across badly.

Thank you @polytypic for pointing it out.

Cheers,
Nicolas

K_N · November 13, 2023, 4:00pm

To debug issues in my lexer, I usually define a token_to_string function and then something along the lines of:

let verbose_lexer lexbuf = 
   let token = original_lexer lexbuf in
   Printf.eprintf ">>%s<<\n" (token_to_string token);
   token

and use that as a lexer. If the flow of token is what I expect it to be, then the problem lies in the parser. If I have unexpected then it means that one of my regular expressions (or semantic action in the lexer) is buggy.

In your case, since I’m assuming that you have in your rules:

| "-" { MINUS }
| ...
| int { ... }

and since ocamllex is longest match (and only first-match in the case the two matched prefixes are the same length) then “-1” is always lexed a int, never as MINUS followed by INT.
Does your example work if you write f (n- 1) ?

Cheers,

os12345678 · November 13, 2023, 11:04pm

Yes it does

Yes again to these. It seems that n-1 is being parsed as an identifier followed by a number, but n- 1 is parsed correctly as an identifier, minus, number. what do you suggest doing to fix this?

thanks!

os12345678 · November 13, 2023, 11:30pm

Upon investigating further, it is being lexed as IDENT INT. Though, since ocaml applies rules; longest match, and match first rule, i am unsure why it is being lexed as this - the minus token is defined before the int token, shown by

I’m just a bit unsure why its happening. any thoughts?

thanks

K_N · November 13, 2023, 11:54pm

The rule is longest match, and only use first match to discriminate between matches of equal lengths.
Here, the string “-1” could be matched as “-” (followed by stuff to be read later) or as “-1” (via your int regex). In that latter case, the longest match has length 2 (“-1”) and therefore that rule is selected.

How to fix this really depends on what kind of parser you use and of your syntax. For instance, from your example it’s not clear what syntax you use for function calls. If it’s a “traditional” syntax f(x,y,z) then a solution might be to just remove the minus sign from your regex, parse MINUS expr (unary minus) with a higher priority than expr MINUS expr (subtraction). An example of how to do that with OCamlyacc can be found in the OCaml manual (see the %prec directive for the MINUS expr case).

If your syntax for function call is like in OCaml, that is f x y z then you have an ambiguity between
f -1, meaning the function call f (-1) or f - 1 meaning the subtraction. In OCaml, it is the latter, and so to be able to write the function call, one has to either use explicit parentheses or use the unary minus symbol ~- as in f ~- 1.

Another solution is to wait until the typing phase (if you have one) and use types to disambiguate. If f has a function type, then transform f - 1 (subtraction) into f (-1) function call.

Cheers

Release-Candidate · November 14, 2023, 9:22am

And if you decide to always have to use parens for the negative sign - (-a) and (- a) meaning -a - and allow Haskell-style curried infix operators, where (+ a) means fun x -> x + a, you end up with the Haskell special case of having to define fun x -> x - a as (subtract a) because (- a) is just -a. (there is now a compiler extension for that: LexicalNegation).
And as I’ve said, this really trips up almost everybody doing his first “calculator”.

My advice would be to never use a lexer and parser generator if you don’t have to (because it’s homework ;). You always end up fighting both in the end when you want readable error messages (and all of them, not just the first) and incremental lexing/parsing. Which you always want to design for from the start.

Topic		Replies	Views
Float and constructor syntax Learning ocaml	5	572	November 6, 2022
[ANN] OCamlformat 0.15.0 Ecosystem announce , ocamlformat	0	928	August 7, 2020
How does one specify types of functions for lexers in ocamlex? Learning	1	512	November 18, 2019
Reason on OCaml 4.06 Community announce	4	1973	November 30, 2017
How to write lexer rules correctly Learning ocamllex	1	1288	July 18, 2021

Bug with Ocamllex

Related topics