Heisenbug causing stack overflow

TimLeonard · November 15, 2021, 11:24pm

I’m getting a stack overflow, and in trying to diagnose it, I encountered the surprising behavior shown below. In the eight test cases, I include or omit each of three independent aspects of the test. All three should be irrelevant to the behavior of any test, yet all three apparently affect the result. The overflow occurs iff I print a named string before evaluating the function and, in addition, I ignore the result of the function (which is unit), print a string after the function is evaluated, or do both. There may be more things that affect the result; when I realized that all three of these irrelevancies mattered, I decided it was time to call for help.

Can any of you offer any insight?

let x = "x" in begin print_endline  x ; ignore (parse_spec f); print_endline "y"; end;; (* Stack overflow during evaluation *)
let x = "x" in begin print_endline  x ; ignore (parse_spec f);                    end;; (* Stack overflow during evaluation *)
let x = "x" in begin print_endline  x ;         parse_spec f ; print_endline "y"; end;; (* Stack overflow during evaluation *)
let x = "x" in begin print_endline  x ;         parse_spec f ;                    end;; (* No stack overflow *)
               begin print_endline "x"; ignore (parse_spec f); print_endline "y"; end;; (* No stack overflow *)
               begin print_endline "x"; ignore (parse_spec f);                    end;; (* No stack overflow *)
               begin print_endline "x";         parse_spec f ; print_endline "y"; end;; (* No stack overflow *)
               begin print_endline "x";         parse_spec f ;                    end;; (* No stack overflow *)

yawaramin · November 16, 2021, 12:33am

It’s unclear from the provided code sample. The only thing that could be causing it is parse_spec f though, as nothing in the other given code is capable of stack overflow.

silene · November 16, 2021, 7:52am

It is hard to tell for sure from these snippets. But here is a potential explanation. Let us assume that parse_spec is using almost all the stack, so any additional stack slot will cause a stack overflow. Your first four cases use a stack slot for let x, while the last four do not. That explains why the last four do not cause a stack overflow. But what about case 4 then? It performs a tail call of parse_spec, so the stack slot of let x gets freed before the actual call, hence avoiding the stack overflow. So, the first three cases are the only cases that call parse_spec while an extra stack slot is in use, hence the stack overflow.
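
To make the tail-call distinction concrete, here is a minimal sketch. The three functions are hypothetical stand-ins (parse_spec itself is not shown in this thread), and each simply recurses n times. Only the first makes its recursive call in tail position; wrapping the call in ignore, or doing anything after it, means the caller's frame (including the slot for x) has to stay alive during the call.

```ocaml
(* Hypothetical stand-ins for the shapes in the eight test cases;
   none of this is the real parse_spec. *)

(* Tail position: nothing remains to be done after the recursive call,
   so the frame holding x can be released first. The [@tailcall]
   attribute asks the compiler to warn if the call is not a tail call. *)
let rec tail_variant n =
  let x = "x" in
  if String.length x = 0 || n = 0 then ()
  else (tail_variant [@tailcall]) (n - 1)

(* Wrapped in ignore: the result must be discarded and () returned
   after the call, so the call is not in tail position. *)
let rec ignore_variant n =
  let x = "x" in
  if String.length x = 0 || n = 0 then ()
  else ignore (ignore_variant (n - 1))

(* Followed by more work: x is still needed after the call returns,
   so its slot stays in use for the whole recursion. *)
let rec trailing_variant n =
  let x = "x" in
  if n = 0 then ()
  else begin
    trailing_variant (n - 1);
    ignore (String.length x)
  end

let () =
  tail_variant 10_000_000;  (* deep recursion, constant stack *)
  ignore_variant 1_000;     (* kept shallow: stack use grows with n *)
  trailing_variant 1_000    (* kept shallow for the same reason *)
```

The non-tail variants are deliberately run at a small depth. How deep they can go before overflowing depends on the stack limit, the OCaml version, and whether the code runs as bytecode or native code, which is consistent with a one-slot difference tipping a nearly full stack over the edge.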

Chet_Murthy · November 16, 2021, 9:59pm

Since we can’t see the code of parse_spec (or run the testcase) here are some general observations:

I assume that you’re running on a modern 64-bit machine with a decent amount of memory.
I’d halve the hard stack size limit and see how that affects things. Before doing anything else, in fact.
Anything that uses close to the full allowed hardware stack segment is probably broken code.
What I mean is: set these variations aside and get to a testcase that focuses on where the stack overflow is happening. Once you’ve debugged that, come back to these variations.
Is the behaviour different with the bytecode vs. native (opt) versions?
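
One way to start narrowing down where the overflow happens is to catch the exception and print the recorded backtrace. The sketch below assumes the overflow surfaces as the Stack_overflow exception, which depends on the platform and on bytecode vs. native compilation. The helper with_overflow_report is a hypothetical name, not something from the thread or the standard library, and the failure is simulated by raising the exception directly so the example behaves the same everywhere.

```ocaml
(* Hypothetical helper: run a thunk and, if it overflows the stack,
   print the recorded backtrace instead of letting the program die. *)
let with_overflow_report (f : unit -> 'a) : 'a option =
  Printexc.record_backtrace true;
  try Some (f ())
  with Stack_overflow ->
    prerr_endline "Stack overflow; backtrace (compile with -g for detail):";
    prerr_string (Printexc.get_backtrace ());
    None

let () =
  (* Simulated overflow: a real test would call the suspect function here. *)
  match with_overflow_report (fun () -> raise Stack_overflow) with
  | None -> print_endline "overflow reported"
  | Some () -> print_endline "completed"
```

In a real session the thunk would be something like fun () -> parse_spec f, and the backtrace would show which recursive function fills the stack.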

TimLeonard · November 17, 2021, 4:43am

I guess I wasn’t clear what I was asking.

I don’t need help with ‘parse_spec f’. (Since my post, I have diagnosed and fixed that problem, so the stack overflow no longer occurs.) The issue is that there must be a bug in the OCaml environment, or it would not display the behavior that this example demonstrated. To see that, ask yourself how you would write a program ‘parse_spec f’ so that it showed this behavior. It should be impossible: it should have no way of determining whether the preceding ‘print_endline’ was given a named string rather than a string literal, no way of determining whether it’s being executed from within an ‘ignore’, and no way of determining whether it’s followed by a ‘print_endline’.

So I was trying to understand: is the behavior of the OCaml environment simply unpredictable in the presence of a stack overflow? Or is it supposed to be predictable, and this behavior shows that it’s got a bug?

Chet_Murthy · November 17, 2021, 6:18am

To really understand, one would need to examine the machine code.
So if the bug shows up the same way with bytecode, that would help a lot.
But really, I think @silene might be right in their analysis.
