Many of you may have already heard of a debugger called rr, which allows you to record and replay programs on Linux. It is extremely useful, for instance, for debugging garbage collection problems or other low-level issues in natively compiled OCaml programs. Once you capture a bug during the record phase, that bug can be reproduced any number of times during replay.
One major limitation of rr is that it requires access to CPU hardware performance counters, which are usually not available in cloud VMs or containers. HW counters can also be unreliable or high latency on some CPUs (e.g. Zen), or it could just be difficult to get them working for your particular configuration.
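As a rough first check of whether hardware counters are even reachable on a given machine, the sketch below reads the standard Linux sysctl file `/proc/sys/kernel/perf_event_paranoid`. It assumes the upstream rr requirement that this value be 1 or lower; the function names are hypothetical and not part of rr. A passing check does not guarantee that counters work, since a VM or container can still hide them.

```python
from pathlib import Path
from typing import Optional

# Standard Linux sysctl file that restricts unprivileged use of perf events.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the perf_event_paranoid level, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (not Linux, restricted container) or unexpected content.
        return None


def hw_counters_may_be_usable(level: Optional[int]) -> bool:
    """Assumption: upstream rr wants a level of 1 or lower.

    This is a necessary condition only; the hypervisor may still not
    expose performance counters to the guest.
    """
    return level is not None and level <= 1


if __name__ == "__main__":
    level = read_paranoid_level()
    if hw_counters_may_be_usable(level):
        print(f"perf_event_paranoid={level}: hardware counters may be usable")
    else:
        print(f"perf_event_paranoid={level}: hardware counters are likely unavailable")
```

If the check fails, or passes but recording still does not work, that is the situation a mode without hardware counters is meant for.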
Software Counters mode rr is a modification of the rr debugger that lifts this limitation: access to CPU hardware performance counters is not required. This means you can run rr in many more configurations.
I’ve been able to successfully record/replay the whole OCaml compiler test suite using Software Counters mode rr (except for a single OCaml test called pr2195, which exhausts the file descriptors).
I’ve also written a blog post about record/replay debugging generally and Software Counters mode rr in particular. Please see here.
This is very cool. I’ve been a heavy rr user. I appreciate the work to get the software mode working. How slow is the software mode compared to the hw mode in practice? Are there other limitations?
Software counters mode is definitely slower due to overheads from dynamic instrumentation. Some of that overhead can be managed by using (optional) static instrumentation via custom C/C++ compiler plugins (which are provided in the repo).
But given that you would probably be using this with binaries produced by the OCaml compiler rather than C/C++ binaries, the dynamic instrumentation overhead is probably unavoidable.
I would love it if you were able to try it out sometime and tell me how it went. Is the performance acceptable for your use case? Is the debugging experience reliable and robust?
As for limitations: in general, dynamic instrumentation can make things more fragile as well as slower.
But when HW counters are not available, Software Counters mode could be better than having no option at all. It might also be worth trying on some CPUs that do have HW counters (on Zen CPUs the HW counters can sometimes be unreliable).
Also, I currently do not instrument JITed code (nevertheless, in most cases debugging should work fine when JITs are present). Given that OCaml does not do JIT, this should not be a big concern.
One additional limitation of Software Counters mode rr is that it currently needs to be used on a mainstream Linux distribution (Fedora / Ubuntu / Debian unstable) because it needs good debuginfod support. See Home · sidkshatriya/rr.soft Wiki · GitHub for a detailed explanation of why this is the case, along with other useful information!
Do you think ARM64 support will be difficult to get working? When I last looked, rr needed Linux to correctly identify the underlying CPU (e.g. Apple Silicon M3 Pro) in order to set up hardware performance counters. At the time, only M1/M2 were supported by Linux.
aarch64 support exists for Software Counters mode rr!
I am, for instance, able to run rr in Linux VMs on Apple Silicon macOS machines, which I think is very cool (if I say so myself!). There are still some rough edges that need to be resolved, but it generally works well.
I just haven’t made the aarch64 code public yet, as I’m still trying to figure out my strategy for that. As a start, I’ve released the x86-64 version to the world.