Heisenbug causing stack overflow

I’m getting a stack overflow and, while trying to diagnose it, encountered the surprising behavior shown below. Across the eight test cases, I include or omit each of three independent aspects of the test. All three should be irrelevant to the behavior of any test, yet all three apparently affect the result. The overflow occurs iff I print a named string before evaluating the function, and either the result of the function (which is unit) is ignored, or I print a string after the function is evaluated, or both. There may be more things that affect the result; when I realized that all three of these irrelevancies mattered, I decided it was time to call for help.

Can any of you offer any insight?

let x = "x" in begin print_endline  x ; ignore (parse_spec f); print_endline "y"; end;; (* Stack overflow during evaluation *)
let x = "x" in begin print_endline  x ; ignore (parse_spec f);                    end;; (* Stack overflow during evaluation *)
let x = "x" in begin print_endline  x ;         parse_spec f ; print_endline "y"; end;; (* Stack overflow during evaluation *)
let x = "x" in begin print_endline  x ;         parse_spec f ;                    end;; (* No stack overflow *)
               begin print_endline "x"; ignore (parse_spec f); print_endline "y"; end;; (* No stack overflow *)
               begin print_endline "x"; ignore (parse_spec f);                    end;; (* No stack overflow *)
               begin print_endline "x";         parse_spec f ; print_endline "y"; end;; (* No stack overflow *)
               begin print_endline "x";         parse_spec f ;                    end;; (* No stack overflow *)

It’s unclear from the provided code sample. Still, the only thing that could be causing it is parse_spec f, since nothing else in the given code is capable of overflowing the stack.

It is hard to tell for sure from these snippets. But here is a potential explanation. Let us assume that parse_spec is using almost all the stack, so any additional stack slot will cause a stack overflow. Your first four cases use a stack slot for let x, while the last four do not. That explains why the last four do not cause a stack overflow. But what about case 4 then? It performs a tail call of parse_spec, so the stack slot of let x gets freed before the actual call, hence avoiding the stack overflow. So, the first three cases are the only cases that call parse_spec while an extra stack slot is in use, hence the stack overflow.
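To make the tail-call point concrete, here is a minimal sketch of the three call shapes that appear in the test cases. It is not the original code: loop_tail, loop_ignore, and loop_then are hypothetical names, and the recursion merely stands in for whatever parse_spec does. Only the first shape leaves the call in tail position as written, so only it is expected to run in constant stack space; how deep the other two can go depends on the runtime's stack limit.

```ocaml
(* Shape of case 4: the call is the last thing evaluated, so it is a tail
   call and the caller's frame can be released before the callee runs. *)
let rec loop_tail n =
  if n = 0 then () else loop_tail (n - 1)

(* Shape of cases 1 and 2: wrapping the call in [ignore] means something
   still happens after it returns, so as written it is not a tail call. *)
let rec loop_ignore n =
  if n = 0 then () else ignore (loop_ignore (n - 1))

(* Shape of case 3: another expression follows the call, so the caller's
   frame must stay alive until the call returns. *)
let rec loop_then n =
  if n = 0 then () else begin loop_then (n - 1); () end

let () =
  loop_tail 1_000_000;   (* deep recursion is fine for the tail call *)
  loop_ignore 1_000;     (* kept shallow: each level holds a frame *)
  loop_then 1_000;       (* kept shallow for the same reason *)
  print_endline "done"
```

At small depths all three behave identically. The difference is only in how much stack each call shape holds while the callee runs, which is the kind of difference that matters when the callee is already close to the limit.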


Since we can’t see the code of parse_spec (or run the testcase) here are some general observations:

  1. I assume you’re running on a modern 64-bit machine with a decent amount of memory.
  2. Before doing anything else, I’d halve the hard stack size limit and see how that affects things.
  3. Anything that uses close to the full allowed hardware stack segment is probably broken code.
  4. What I mean is: set these variations aside, get to a testcase that focuses on where the stack overflow is happening, and come back to the variations once you’ve debugged that.
  5. Is the behaviour different with bytecode vs. opt versions?
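One possible way to act on point 4 is to catch the overflow and look at the backtrace. The sketch below is only an illustration under some assumptions: deep is a hypothetical stand-in for parse_spec f, run_guarded is a helper name made up here, the program is assumed to be built with debug information (-g) so the backtrace is useful, and the platform is assumed to be one where the runtime reports the overflow as the Stack_overflow exception rather than crashing.

```ocaml
(* Hypothetical stand-in for the thread's [parse_spec f]: non-tail recursion
   whose stack use grows linearly with [n]. *)
let rec deep n = if n = 0 then 0 else 1 + deep (n - 1)

(* Run [thunk]. If the runtime reports a stack overflow as an exception,
   return the recorded backtrace as text instead of letting it escape. *)
let run_guarded thunk =
  Printexc.record_backtrace true;
  match thunk () with
  | v -> Ok v
  | exception Stack_overflow -> Error (Printexc.get_backtrace ())

let () =
  match run_guarded (fun () -> deep 1_000) with
  | Ok v -> Printf.printf "finished with %d\n" v
  | Error bt -> prerr_string ("stack overflow at:\n" ^ bt)
```

Running the same guarded call after lowering the stack limit (for example with the shell's ulimit -s) would show whether the program was already close to the limit, which is what point 2 is getting at.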

I guess I wasn’t clear what I was asking.

I don’t need help with ‘parse_spec f’. (Since my post, I have diagnosed and fixed that problem, so the stack overflow no longer occurs.) The issue is that there must be a bug in the ocaml environment or it would not display the behavior that this example demonstrated. To see that, ask yourself how you would write a program ‘parse_spec f’ so that it showed this behavior. It should be impossible, since it should have no way of determining whether the preceding ‘print_endline’ was given a named string rather than a string literal; it should have no way of determining whether it’s being executed from within an ‘ignore’, and it should have no way of determining whether it’s followed by a ‘print_endline’.

So I was trying to understand: is the behavior of the ocaml environment simply unpredictable in the presence of a stack overflow? Or is it supposed to be predictable, and this behavior shows that it’s got a bug?

  1. To really understand this, one would need to examine the machine code.
  2. If the bug shows up the same way with bytecode, that would help a lot.
  3. But really, I think @silene might be right in their analysis.
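When comparing the bytecode and native behaviour, it can help to have each run report which backend it is using, so the two sets of results are not mixed up. A small sketch, where backend_name is a hypothetical helper built on the standard Sys.backend_type value:

```ocaml
(* Report which OCaml backend this program is running on. *)
let backend_name () =
  match Sys.backend_type with
  | Sys.Native -> "native"
  | Sys.Bytecode -> "bytecode"
  | Sys.Other name -> name

let () = Printf.printf "running on the %s backend\n" (backend_name ())
```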