Hi. I have a service compiled with OCaml 4.13.1, with Lwt as an async framework. It crashes in production with a stack overflow or segmentation fault error. Initially it looked like an error in my code: in call graphs and memtrace output I see Lwt promises resolving, so my first guess was my program generate too many Lwt promises.
But compiled with debug runtime my service almost immediately crashed with the following message:
The stack trace itself is not very informative (well, as usual in Lwt programs):
Uncaught exception:
Stack overflow
Raised by primitive operation at Lwt.Resolution_loop.run_callbacks_or_defer_them.(fun) in file "src/core/lwt.ml", line 1291, characters 29-79
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_catch.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 2079, characters 12-55
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_bind.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 1919, characters 12-72
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_catch.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 2079, characters 12-55
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
…
Called from Lwt.Sequential_composition.backtrace_catch.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 2079, characters 12-55
Called from Lwt.Resolution_loop.run_callbacks.run_regular_callbacks.iter_callback_list in file "src/core/lwt.ml", line 1196, characters 10-18
Called from Lwt.Resolution_loop.run_in_resolution_loop in file "src/core/lwt.ml", line 1262, characters 17-21
Called from Lwt.Resolution_loop.resolve in file "src/core/lwt.ml", line 1298, characters 4-103
Called from Lwt.Sequential_composition.backtrace_bind.create_result_promise_and_callback_if_deferred.callback in file "src/core/lwt.ml", line 1919, characters 12-72
Called from Lwt.Resolution_loop.run_callbacks.Fatal error: out of memory
The assertion failure that you are seeing are an internal invariant of the GC.
If you are not using unsafe C code through the OCaml FFI that should essentially never happen.
There is possibility that is a bug in the runtime, would you be able to extract a small reproduction case ?
Another hypothesis to explore is the Fatal error: out of memory exception that appears in the trace. An out-of-memory exception in the Lwt trace is suspicious. It might happen that Lwt is trying to capture a Out_of_memory exception rather than panicking, and nothing good can happen in this scenario.
You should probably also check your program in 4.14.1, I don’t expect that this would change anything, but it would make it easier for you to test the potential bug fix once we have once.
With the most recent version of Lwt you can call Exception_filter.(set handle_all_except_runtime). After this call, Lwt will not attempt to catch Out_of_memory (which it does (as part of a catchall) by default otherwise).
Well, after a couple of weeks of debugging, I’m completely stuck.
I cannot reproduce the bug locally (it reproduces only in the cloud), cannot create a minimal example, cannot invent a trick to change the behavior. One small success: I found that moving one recursive function to Lwt.async significantly mitigates (but not eliminates) stack frame leak. In this case stack overflow occures without huge heap allocations. Lwt.pause inside this functions have no effect.
I tried gdb, memtrace, perf. All those tools just gave me highly-fragmented call graphs with long chains of internal Lwt functions.
It is difficult to see the names in the PNG (can you post a SVG?), but you can try to create a more meaningful flamegraph by removing recursive calls.
Try:
With very deep call trees you may have to increase the stack sampling depth during recording too, try perf record --call-graph=dwarf,65528 -F 47 $yourcommand
In fact GDB itself has support for recording and reverse execution, but again it is very CPU dependent on whether it works at all: Process Record and Replay (Debugging with GDB) (you can try the btrace bts or btrace pt methods on an Intel CPU).
If you can run arbitrary commands in the cloud environment (e.g. a CI) then you may be able to record an execution trace there and analyse it later.
Well. Thank you all for your help. Now I know much more about the debugging of OCaml programs. As usual, the problem was in completely different place.
Our team use custom monad on top of Lwt.t. Sometimes the internal context of this monad starts to grow. It doesn’t matter for my coworkers, because they create classical microservices with short-living values (all the request context is abandoned when the request processing is finished), but it matters for me, because my service has very long tasks. Fine-tuning of the operations with this monad fixed the problem.
But the question «how to restore the original call graph when debugging Lwt program» remains unanswered for me. AFAIR for the old version of OCaml the usage of let%lwt allowed to show original backtrace (instead of scheduler internals), but looks like in modern OCaml it doesn’t work.
If you are interested in backtraces of exceptions (which is a smaller problem than that of restoring the original call graph) then there’s a modern way of doing so: Lwt.reraise. Check the doc for full details but TL;DR: replace your old catch catch all case | exc -> raise exc with the new | exc -> Lwt.reraise exc.
Lwt.reraise is bound to the OCaml’s internal %reraise which does some magic when maintaining the backtraces of exceptions. It essentially allows you to have the same behaviour as omitting a catch-all case when writing a vanilla try-with.