Help diagnosing crashes inside libsanitizer?

Hi OCaml community,

I’ve been struggling to run tsan on Semgrep (a large legacy OCaml codebase) and I’d like to ask how I can start diagnosing a crash within the sanitizer that looks thus:

[00.55][INFO]: Executed as: /home/ntaylor/.local/share/virtualenvs/cli-sQcphUvE/lib/python3.10/site-packages/semgrep/bin/semgrep-core-proprietary -json -rules /home/ntaylor/.semgrep/semgrep_rules.json -use_eio -j 8 -targets /home/ntaylor/.semgrep/semgrep_targets.txt -timeout 5 -timeout_threshold 3 -max_memory 5000 -fast -symbol_analysis -pro_inter_file -timeout_for_interfile_analysis 10800 . -debug
[00.55][INFO]: Version: 1.124.0
[00.55][INFO]: Proxy was configured with { Proxy.http_proxy = None;
                                           https_proxy = None;
                                           all_proxy = None; no_proxy = None;
                                           credentials = None }
[00.63][INFO]: Parsing rules in /home/ntaylor/.semgrep/semgrep_rules.json

Program received signal SIGSEGV, Segmentation fault.
-----------------------------------------------------------------------------------------------------------------------[regs]
  RAX: 0x0000600000FFFFF8  RBX: 0x000055555A658D89  RBP: 0x00007B6000000C00  RSP: 0x00007FFFFFFFD028  o d I t s Z a P c
  RDI: 0x000055555AE7D21D  RSI: 0x00007FFFF5DECA00  RDX: 0x00000000000095B0  RCX: 0x200055555AE7D21D  RIP: 0x00007FFFF749D880
  R8 : 0x00007FFFF5DECA00  R9 : 0x00000FFFD78B16A0  R10: 0x00007FFFCBBFF000  R11: 0x00007FFFFFFFD090  R12: 0x00007FFFCBC1E9C8
  R13: 0x00007B6000000CA0  R14: 0x00007FFFFFFFD040  R15: 0x00007FFFE2916EC8
  CS: 0033  DS: 0000  ES: 0000  FS: 0000  GS: 0000  SS: 002B
-----------------------------------------------------------------------------------------------------------------------[code]
=> 0x7ffff749d880 <__tsan_func_entry(void*)+112>:       mov    QWORD PTR [rax],rdi
   0x7ffff749d883 <__tsan_func_entry(void*)+115>:       add    rax,0x8
   0x7ffff749d887 <__tsan_func_entry(void*)+119>:       mov    QWORD PTR [rsi+0xc8],rax
   0x7ffff749d88e <__tsan_func_entry(void*)+126>:       ret
   0x7ffff749d88f <__tsan_func_entry(void*)+127>:       nop
   0x7ffff749d890 <__tsan_func_entry(void*)+128>:       sub    rsp,0x400
   0x7ffff749d897 <__tsan_func_entry(void*)+135>:       call   0x7ffff74a8caf <__tsan_trace_switch_thunk>
   0x7ffff749d89c <__tsan_func_entry(void*)+140>:       add    rsp,0x400
-----------------------------------------------------------------------------------------------------------------------------
0x00007ffff749d880 in __tsan::FuncEntry (pc=0x55555ae7d21d, thr=0x7ffff5deca00) at ../../../../src/libsanitizer/tsan/tsan_rtl.cpp:1039
1039    ../../../../src/libsanitizer/tsan/tsan_rtl.cpp: No such file or directory.
gdb$ bt
#0  0x00007ffff749d880 in __tsan::FuncEntry (pc=0x55555ae7d21d, thr=0x7ffff5deca00) at ../../../../src/libsanitizer/tsan/tsan_rtl.cpp:1039
#1  __tsan_func_entry (pc=0x55555ae7d21d <caml_raise_exception+57>) at ../../../../src/libsanitizer/tsan/tsan_interface_inl.h:104
#2  0x000055555ae7923c in caml_tsan_exit_on_raise (pc=0x55555a658d89, sp=<optimized out>, trapsp=0x7fffcbc1e9c8 "") at runtime/tsan.c:216
#3  0x000055555ae7d21d in caml_raise_exception ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
gdb$

For sure, 0x0000600000FFFFF8 does not feel like an address I should attempt to dereference. It perhaps makes sense that gdb isn’t able to walk the stack if we are unwinding from an exception; but, also, the program counter value seems nonsensical so it’s possible the issue is more fundamental.

I admit I’m not sure how to even begin diagnosing this - the crash is at least deterministic (suggesting perhaps it isn’t owing to a race per se). I’m running on 5.3.0 with the tsan variant on linux, so nothing should be terribly nonstandard here. If you were me, what would your first step be?

Thanks,
Nathan

Hi, this looks like an overflow of TSan’s internal stack of return addresses. The most likely cause is that your program allocates more stack frames than TSan’s hard limit of 64k. So a first step would be to check whether that is the case.

There are various ways to do this; if your installation of libsanitizer is an LLVM one with debug symbols, you should be able to print the value of thr->shadow_stack_pos - thr->shadow_stack from the debugger, and see if this number of bytes is equal to 64k machine words.
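
As a minimal sketch of that check, assuming the two pointers have been read out of the debugger as raw integer addresses, the comparison could look like the Python below. The function names, the 8-byte word size, and the limit constant are illustrative assumptions based on the 64k figure above, not values read from the TSan sources.

```python
WORD_SIZE = 8            # assumption: 64-bit target, one return address per word
LIMIT_WORDS = 64 * 1024  # assumption: the 64k limit mentioned above


def shadow_stack_depth(base_addr: int, pos_addr: int, word_size: int = WORD_SIZE) -> int:
    """Return the number of word-sized entries between two raw addresses."""
    byte_span = pos_addr - base_addr
    if byte_span < 0 or byte_span % word_size != 0:
        raise ValueError("addresses are not a plausible base/position pair")
    return byte_span // word_size


def classify(depth_words: int, limit_words: int = LIMIT_WORDS) -> str:
    """Compare a depth (in words) against the assumed limit."""
    if depth_words < limit_words:
        return "below limit"
    if depth_words == limit_words:
        return "at limit"
    return "beyond limit"
```

Note that gdb follows C semantics when subtracting two typed pointers, so the result of `p thr->shadow_stack_pos - thr->shadow_stack` is already a count of elements and can go straight into `classify`; `shadow_stack_depth` is only needed when starting from raw addresses.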

Hi there! Thanks for the reply!

Bad news: the value of that expression is slightly larger than 64k :wink:

gdb$ p thr->shadow_stack_pos - thr->shadow_stack
$6 = 0x91f30291dfcecee

Safe to say that whatever we are scribbling over also includes our thread state. (Should I perhaps be trying out address sanitizer, if this is true?)

Shortly before crashing, by hand-sampling some stack traces in gdb, I see a stack trace with ~2000 frames, but nothing suggesting we are getting anywhere close to 64k frames. (Just confirming that you meant 64k total frames and not a stack size of 64k?) Comparing the value of $rsp in main against the tops of those stacks, in terms of byte count rather than #frames, and assuming I can trust the register file here, these stacks are just not that big, I don’t think.

>>> stack_bot = 0x7fffffffc518
>>> stack_top = 0x7fffffffd178
>>> stack_top - stack_bot
3168
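
To put that byte count next to the 64k-frame figure, here is a rough sketch that turns the span into an upper bound on frames. It assumes every frame in this address range occupies at least one 8-byte return-address slot; the constant and function names are made up for illustration, and the bound says nothing about frames that live outside this range.

```python
MIN_FRAME_BYTES = 8       # assumption: at least one 8-byte return address per frame
LIMIT_FRAMES = 64 * 1024  # the 64k frame limit discussed above


def max_frames(span_bytes: int, min_frame_bytes: int = MIN_FRAME_BYTES) -> int:
    """Loose upper bound on the number of frames that fit in span_bytes."""
    return span_bytes // min_frame_bytes


# Addresses taken from the two $rsp samples above.
span = 0x7fffffffd178 - 0x7fffffffc518
bound = max_frames(span)
print(span, bound, bound < LIMIT_FRAMES)
```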

One thing that I’m thinking about right now is that the faulting stack is attempting to raise an exn on the other side of some FFI C code:

#0  __tsan::FuncExit (thr=0x7ffff7ebe096) at ../../../../src/libsanitizer/tsan/tsan_rtl.h:779
#1  __tsan_func_exit () at ../../../../src/libsanitizer/tsan/tsan_interface.inc:162
#2  0x0000000004692bdb in caml_tsan_exit_on_raise_c (limit=limit@entry=0x7fffffffd0c0 "") at runtime/tsan.c:285
#3  0x0000000004694efc in caml_raise (v=0x8fe0268) at runtime/fail_nat.c:86
#4  0x000000000465aa91 in caml_raise_not_found () at runtime/fail.c:140
#5  0x0000000004621139 in handle_exec_error (loc=loc@entry=0x86078ac "pcre_exec_stub", ret=<optimized out>) at pcre_stubs.c:539
#6  0x00000000046223b5 in pcre_exec_stub0 (v_opt=0x0, v_rex=<optimized out>, v_pos=<optimized out>, v_subj_start=0x0, v_subj=<optimized out>, v_ovec=<optimized out>, v_maybe_cof=0x1, v_workspace=0x0) at pcre_stubs.c:619
#7  0x00000000046223d3 in pcre_exec_stub (v_opt=<optimized out>, v_rex=<optimized out>, v_pos=<optimized out>, v_subj_start=<optimized out>, v_subj=<optimized out>, v_ovec=<optimized out>, v_maybe_cof=0x1) at pcre_stubs.c:717
#8  <signal handler called>
#9  0x0000000003df5c92 in camlPcre.loop_1303 () at lib/pcre.ml:748

Is there a world where this is a problem? I’ve added -fsanitize=thread to my c_flags stanza when I compile the relevant C library, but perhaps I still need to do something else there?

Thanks again,
Nathan

Hm.

I agree that TSan’s internal state is being corrupted somehow. As for analyzing it with ASan, it might be a sanity check. If ASan detects nothing, it means that it is the TSan instrumentation that causes the memory corruption.

(Just confirming that you meant 64k total frames and not a stack size of 64k?)

Yes.

Raising across C frames should work. Manually adding -fsanitize=thread shouldn’t be required; this should be taken care of for you. You may want to check that all your non-OCaml libraries are instrumented just to be sure (it’s quite easy to spot: all functions contain (possibly dynamically-linked) calls to __tsan_func_entry and __tsan_func_exit). Linking with a library that wasn’t instrumented with TSan can cause crashes when exceptions are raised from or across C.
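
One coarse way to automate that check is to look for the two hook names in the symbol table of each library. The sketch below shells out to `nm` from binutils; the helper names are hypothetical, and a symbol reference only shows that at least one function in the file was instrumented, not that all of them were.

```python
import subprocess

TSAN_HOOKS = ("__tsan_func_entry", "__tsan_func_exit")


def references_tsan_hooks(nm_output: str) -> bool:
    """True if both TSan hook names appear as symbols in `nm` output."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if fields:
            symbols.add(fields[-1])  # the symbol name is the last column
    return all(hook in symbols for hook in TSAN_HOOKS)


def check_library(path: str) -> bool:
    """Run `nm` on an object file, static archive, or shared library."""
    result = subprocess.run(["nm", path], capture_output=True, text=True, check=False)
    return references_tsan_hooks(result.stdout)
```

For a stripped shared library, plain `nm` may report no symbols; in that case `nm -D` (dynamic symbols) or disassembling with `objdump -d` and searching for the calls would be the fallback.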

If I were in your situation, after checking the above, I suppose I would fire up a reverse debugger to determine how the contents of thr become nonsensical.
