Lwt program; stack overflows and SIGSEGVs; heap check assetion fail in freelists.c with debug runtime

Hi. I have a service compiled with OCaml 4.13.1, with Lwt as an async framework. It crashes in production with a stack overflow or segmentation fault error. Initially it looked like an error in my code: in call graphs and memtrace output I see Lwt promises resolving, so my first guess was my program generate too many Lwt promises.

But compiled with debug runtime my service almost immediately crashed with the following message:

<>!<>!<>!<>%<>$### OCaml runtime: heap check ###
file freelist.c; line 950 ### Assertion failed: caml_gc_phase != Phase_sweep || caml_fl_merge == Val_NULL || bf_small_fl[i].merge < &Next_small(caml_fl_merge)

Is it important? What should I check in my code? IIUC it is the following assertion:

What exactly is checked here? My knowledge of OCaml internals is quite poor, I even have no idea what «freelists» are.

The stack trace itself is not very informative (well, as usual in Lwt programs):

Uncaught exception:
  
  Stack overflow

Raised by primitive operation at Lwt.Resolution_loop.run_callbacks_or_defer_them.(fun) in file "src/core/lwt.ml", line 1291, characters 29-79
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_catch.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 2079, characters 12-55
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_bind.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 1919, characters 12-72
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_catch.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 2079, characters 12-55
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103

…

Called from Lwt.Sequential_composition.backtrace_catch.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 2079, characters 12-55
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_bind.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 1919, characters 12-72
Called from Lwt.Resolution_loop.run_callbacks.Fatal error: out of memory

The assertion failure that you are seeing are an internal invariant of the GC.
If you are not using unsafe C code through the OCaml FFI that should essentially never happen.
There is possibility that is a bug in the runtime, would you be able to extract a small reproduction case ?

Another hypothesis to explore is the Fatal error: out of memory exception that appears in the trace. An out-of-memory exception in the Lwt trace is suspicious. It might happen that Lwt is trying to capture a Out_of_memory exception rather than panicking, and nothing good can happen in this scenario.

You should probably also check your program in 4.14.1, I don’t expect that this would change anything, but it would make it easier for you to test the potential bug fix once we have once.

1 Like

With the most recent version of Lwt you can call Exception_filter.(set handle_all_except_runtime). After this call, Lwt will not attempt to catch Out_of_memory (which it does (as part of a catchall) by default otherwise).

2 Likes

Well, after a couple of weeks of debugging, I’m completely stuck.

I cannot reproduce the bug locally (it reproduces only in the cloud), cannot create a minimal example, cannot invent a trick to change the behavior. One small success: I found that moving one recursive function to Lwt.async significantly mitigates (but not eliminates) stack frame leak. In this case stack overflow occures without huge heap allocations. Lwt.pause inside this functions have no effect.

I tried gdb, memtrace, perf. All those tools just gave me highly-fragmented call graphs with long chains of internal Lwt functions.

Are there some Lwt-specific debugging techniques?

Adding the setting suggested by @raphael-proust didn’t change anything?
Did you confirm it was not simply a crash due to OOM?

1 Like

Unfortunately I have dependencies incompatible with Lwt 5.7.0.

But now I have a version that fails with stack overflow with a lot of free memory.

It is difficult to see the names in the PNG (can you post a SVG?), but you can try to create a more meaningful flamegraph by removing recursive calls.
Try:

perf script | stackcollapse-perf.pl | stackcollapse-recursive.pl| flamegraph.pl >|flame.svg

With very deep call trees you may have to increase the stack sampling depth during recording too, try
perf record --call-graph=dwarf,65528 -F 47 $yourcommand

What CPU do you have? It might be possible to record and replay the execution using rr, but it works better on Intel CPUs than AMD: https://rr-project.org/ and GitHub - rr-debugger/rr: Record and Replay Framework (if you have and AMD CPU follow the instructions there). With this recording you can replay and go backwards/forwards in time many times to inspect the heap or test various hypothesis. See here on how to use it with OCaml: Deterministically debugging concurrent GC bugs with rr · KC Sivaramakrishnan

In fact GDB itself has support for recording and reverse execution, but again it is very CPU dependent on whether it works at all: Process Record and Replay (Debugging with GDB) (you can try the btrace bts or btrace pt methods on an Intel CPU).

If you can run arbitrary commands in the cloud environment (e.g. a CI) then you may be able to record an execution trace there and analyse it later.

1 Like

Well. Thank you all for your help. Now I know much more about the debugging of OCaml programs. As usual, the problem was in completely different place.

Our team use custom monad on top of Lwt.t. Sometimes the internal context of this monad starts to grow. It doesn’t matter for my coworkers, because they create classical microservices with short-living values (all the request context is abandoned when the request processing is finished), but it matters for me, because my service has very long tasks. Fine-tuning of the operations with this monad fixed the problem.

But the question «how to restore the original call graph when debugging Lwt program» remains unanswered for me. AFAIR for the old version of OCaml the usage of let%lwt allowed to show original backtrace (instead of scheduler internals), but looks like in modern OCaml it doesn’t work.

If you are interested in backtraces of exceptions (which is a smaller problem than that of restoring the original call graph) then there’s a modern way of doing so: Lwt.reraise. Check the doc for full details but TL;DR: replace your old catch catch all case | exc -> raise exc with the new | exc -> Lwt.reraise exc.

Lwt.reraise is bound to the OCaml’s internal %reraise which does some magic when maintaining the backtraces of exceptions. It essentially allows you to have the same behaviour as omitting a catch-all case when writing a vanilla try-with.

3 Likes

Thank you! Looks like the tool I need.