How to debug a crash "terminated by signal SIGSEGV (Address boundary error)"

I’m writing an app with lwt, caqti, postgres, … and I’ve done some rather big refactorings and upgraded some packages. Just when I thought everything was in a working state again, I started seeing my app crash regularly with: terminated by signal SIGSEGV (Address boundary error), and I have no idea what the source is.

I’ve tried to reproduce it in bytecode mode, but there my app just keeps running, so the issue only happens with the native-code binary.

So, how do I start hunting this issue down?
I’m using OCaml 4.09, dune and esy.

Two thoughts (from back in the day, it’s been a long time since I had to find one of these):
(1) Have you tried valgrind?
(2) Set things up so that when/if the problem occurs, you’ll get a full (not truncated) core dump. You can test whether it’s set up right by sending the SEGV signal with "kill -SEGV ". Once you can load that core into gdb and walk around, wait for a real failure.

Also, limit the amount of memory your program can use – maybe that’ll make it faster to produce the crash?

What native libraries are you using? A look at /proc//maps should tell you, for starters.

ETA: Also, IIRC there are memory errors that no process can catch, so it might be worthwhile looking into what your OS records about such errors. I know on Linux there is some provision for capturing minimal information, controllable by something-or-other via sysctl. Also, turn on GC logging and see if that shows you anything useful (can’t hurt, might tell you something).
ETA2: yet another: how frequent is the failure? The more you can do, to make that failure more frequent, the better. So it’s worth keeping track of what you do to run the program, and how long it takes to fail, so you can try to reduce that time. Low MTBF is your friend, when debugging what appears to be a Heisenbug.
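
For the GC logging and "make it fail sooner" suggestions, here is a minimal sketch that uses only the standard library’s Gc and Printexc modules. The function name enable_crash_hunting_mode and the chosen minor heap size are illustrative assumptions, not something from this thread. The idea is that a very small minor heap makes collections much more frequent, which tends to expose heap corruption caused by C bindings sooner.

```ocaml
(* Sketch under stated assumptions: [enable_crash_hunting_mode] is a
   hypothetical helper, meant to be called once at program start. *)
let enable_crash_hunting_mode ?(gc_verbose = 0x01) () =
  (* Record backtraces for OCaml exceptions. This does not catch SIGSEGV,
     but it helps if the bug first shows up as an exception. *)
  Printexc.record_backtrace true;
  let params = Gc.get () in
  Gc.set
    { params with
      (* Bit 0x01 asks the runtime to print a message on stderr at the
         start and end of each major GC cycle. *)
      Gc.verbose = gc_verbose;
      (* Minor heap size in words. A small value (assumed here: 4096)
         forces very frequent minor collections. *)
      Gc.minor_heap_size = 4096;
    }

let () = enable_crash_hunting_mode ()
```

Roughly the same settings can be applied without recompiling through the OCAMLRUNPARAM environment variable, for example OCAMLRUNPARAM=b,v=0x01,s=4k.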

Thanks for these pointers. To clarify, I have no experience at all with gdb, valgrind or C programming for that matter, so this is completely new to me.

I’ve tried to run my program in gdb and have been able to get a backtrace that at least gives some pointers.

It’s not complete, and I haven’t found out how to fill in the ?? frames. I’ve tried setting the -g flag, but that doesn’t seem to change anything. I’ve also tried -O0 in the dune flags, but then I get an error saying that this is not a valid argument for ocamlc.

Anyway, judging from the backtrace below, it looks like the problem is somewhere in Postgresql or Caqti?

I’m not confident in the code that I wrote for this. I’m trying to use Postgres notifications together with Caqti, which doesn’t support them.
I’ve adapted the code from: https://github.com/mmottl/postgresql-ocaml/blob/ba39bd7be6bd807e5508a9b825c447775b800760/examples/prompt.ml#L58 to use Lwt_preemptive threads, so maybe I made a mistake somewhere.

This is the backtrace:

Thread 1 "server.exe" received signal SIGSEGV, Segmentation fault.
0x0000000000000003 in ?? ()
(gdb) bt
#0  0x0000000000000003 in ?? ()
#1  0x00005555558bb043 in camlPostgresql__protectx_967 ()
#2  0x0000000000000001 in ?? ()
#3  0x00007ffff75c4bf8 in ?? ()
#4  0x00007ffff73ba940 in ?? ()
#5  0x00005555558ad08d in camlCaqti_driver_postgresql__aux_1952 ()
#6  0x00007fffffffdcb0 in ?? ()
#7  0x00005555558ad0b4 in camlCaqti_driver_postgresql__aux_1952 ()
#8  0x00007ffff75d6310 in ?? ()
#9  0x00007ffff75d64d0 in ?? ()
#10 0x0000000000000001 in ?? ()
#11 0x0000555555a4bfcb in camlLwt__callback_1225 ()
#12 0x00007fffffffe080 in ?? ()
#13 0x0000555555a4bfd4 in camlLwt__callback_1225 ()
#14 0x00007ffff75d5928 in ?? ()
#15 0xa87add3fe9485300 in ?? ()
#16 0x0000000000000001 in ?? ()
#17 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#18 0x00007ffff75c4d70 in ?? ()
#19 0x0000000000000001 in ?? ()
#20 0x0000000000000017 in ?? ()
#21 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#22 0x00007ffff75c4db0 in ?? ()
#23 0x0000000000000001 in ?? ()
#24 0x00007ffff75d5a30 in ?? ()
#25 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#26 0x0000555555cd4ea8 in camlLwt__20 ()
#27 0x00007ffff75d5a30 in ?? ()
#28 0x00007fffffffe080 in ?? ()
#29 0x0000555555a4bff5 in camlLwt__callback_1225 ()
#30 0x00007ffff75d5a00 in ?? ()
#31 0x00007ffff75c4dd8 in ?? ()
#32 0x0000000000000001 in ?? ()
#33 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#34 0x00007ffff75c4e18 in ?? ()
#35 0x0000000000000001 in ?? ()
#36 0x00007ffff75c5ad0 in ?? ()
#37 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#38 0x00007ffff75c4e58 in ?? ()
#39 0x0000000000000001 in ?? ()
#40 0x00007ffff75d5b08 in ?? ()
#41 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#42 0x0000555555cd4ea8 in camlLwt__20 ()
#43 0x00007ffff75d5b08 in ?? ()
#44 0x0000000000000002 in ?? ()
#45 0x0000555555a4cc97 in camlLwt__callback_1343 ()
#46 0x00007ffff75c4ee8 in ?? ()
#47 0x0000555555a4acfe in camlLwt__leave_resolution_loop_894 ()
#48 0x0000000000000001 in ?? ()
#49 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#50 0x00007ffff75c4e80 in ?? ()
#51 0x0000000000000001 in ?? ()
#52 0x00007ffff75c5ba8 in ?? ()
#53 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#54 0x00007ffff75c4ec0 in ?? ()
#55 0x0000000000000001 in ?? ()
#56 0x00007ffff75d5be0 in ?? ()
#57 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#58 0x0000555555cd4ea8 in camlLwt__20 ()
#59 0x00007ffff75d5be0 in ?? ()
#60 0x00007fffffffe080 in ?? ()
#61 0x0000555555a4c82f in camlLwt__callback_1304 ()
#62 0x00007ffff75d5bb0 in ?? ()
#63 0x00007ffff75c4ee8 in ?? ()
#64 0x0000000000000001 in ?? ()
#65 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#66 0x00007ffff75c4ef8 in ?? ()
#67 0x0000000000000001 in ?? ()
#68 0x00007ffff75c5c80 in ?? ()
#69 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#70 0x00007ffff75c4f38 in ?? ()
#71 0x0000000000000001 in ?? ()
#72 0x00007ffff75d5d40 in ?? ()
#73 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#74 0x0000555555cd4ea8 in camlLwt__20 ()
#75 0x00007ffff75d5d40 in ?? ()
#76 0x00007ffff75d5ce0 in ?? ()
#77 0x0000555555a4f98b in camlLwt__callback_1848 ()
#78 0x00007ffff75c5030 in ?? ()
#79 0x0000555555a4abf8 in camlLwt__iter_callback_list_848 ()
#80 0x00007ffff75c4f60 in ?? ()
#81 0x0000000000000001 in ?? ()
#82 0x0000000000000001 in ?? ()
#83 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#84 0x00007ffff75c4fa0 in ?? ()
#85 0x0000000000000001 in ?? ()
#86 0x00007ffff75d5e48 in ?? ()
#87 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#88 0x0000555555cd4ea8 in camlLwt__20 ()
#89 0x00007ffff75d5e48 in ?? ()
#90 0x00007ffff75c6168 in ?? ()
#91 0x0000555555a4cc97 in camlLwt__callback_1343 ()
#92 0x00007ffff75c5030 in ?? ()
#93 0x0000555555a4b9b8 in camlLwt__add_task_r_1142 ()
#94 0x00007ffff733d6c0 in ?? ()
#95 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#96 0x00007ffff75c4fc8 in ?? ()
#97 0x0000000000000001 in ?? ()
#98 0x00007ffff73db6a0 in ?? ()
#99 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#100 0x00007ffff75c5008 in ?? ()
#101 0x0000000000000001 in ?? ()
#102 0x00007ffff75d60d8 in ?? ()
#103 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#104 0x0000555555cd4ea8 in camlLwt__20 ()
#105 0x00007ffff75d60d8 in ?? ()
#106 0x00007ffff75d60b0 in ?? ()
#107 0x0000555555a4b120 in camlLwt__wakeup_general_951 ()
#108 0x00007ffff75c5040 in ?? ()

And here is my adapted code:

  let run () =
    let rec dump_notification conn =
      match conn#notifies with
      | Some { Postgresql.Notification.name; pid; extra } ->
          let () =
            Lwt_preemptive.run_in_main (fun () ->
                let () =
                  Logs.debug @@ fun m -> m "Notification from backend %i: [%s] [%s]\n" pid name extra
                in
                Lwt_mvar.put mvar (name, pid, extra))
          in
          dump_notification conn
      | None ->
          ()
    in
    let listener conn =
      try
        while true do
          let socket : Unix.file_descr = Caml.Obj.magic conn#socket in
          let _ = Unix.select [ socket ] [] [] 1. in
          let () = conn#consume_input in
          dump_notification conn
        done
      with
      | e ->
          Logs.err @@ fun m -> m "%s" (Exn.to_string e)
    in
    Lwt.async (fun () ->
        let conninfo = "postgresql://postgres:postgres@localhost:5432/board_games_dev" in
        let conn = new Postgresql.connection ~conninfo () in
        conn#exec ~expect:[ Command_ok ] "listen events" |> ignore ;
        conn#set_notice_processor (fun s -> Fmt.pr "postgresql error [%s]\n" s) ;
        Lwt_preemptive.detach listener conn) ;

If I don’t run this code, I can’t get the code to crash. I suspect that the consume_input is where the crash happens although I’m not sure. Without the dump_notification I can still get the code to crash although it seems that it takes longer to crash in that case.
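
To check the suspicion about consume_input, one low-tech option is to print a marker to stderr right before and right after each suspect call. The last marker printed before the SIGSEGV then shows which call never returned. Below is a minimal sketch using only the standard library. The helper traced is hypothetical and is not part of Lwt, Caqti or postgresql-ocaml.

```ocaml
(* Sketch: [traced] is a hypothetical helper for narrowing down a crash.
   [prerr_endline] flushes stderr, so the markers survive even if the
   process is killed by a signal right afterwards. *)
let traced (label : string) (f : unit -> 'a) : 'a =
  prerr_endline ("enter " ^ label);
  match f () with
  | result ->
      prerr_endline ("leave " ^ label);
      result
  | exception e ->
      (* Report OCaml exceptions too, then let them propagate unchanged. *)
      prerr_endline ("raise " ^ label ^ ": " ^ Printexc.to_string e);
      raise e

(* Example with a harmless computation standing in for the suspect call. *)
let () = ignore (traced "example" (fun () -> String.length "hello"))
```

In the listener above, this would be used as traced "consume_input" (fun () -> conn#consume_input). Keep in mind that with heap corruption the call that crashes is not necessarily the call that caused the damage, so the result is a hint rather than a proof.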

The Caml.Obj.magic cast on conn#socket is the usual suspect for runtime crashes. I don’t know how the postgresql library you use manages its connections, but I’d recommend that you check whether it provides a way to obtain a socket directly from its interface.

Either that, or see if there are Lwt-aware bindings to the postgresql library you use.