I’ve been experimenting with AI code assistants lately. They’re surprisingly good at generating new code, but not very good at remembering things: it’s like having a steady stream of capable interns who each spend a few months expanding a project in one direction and then just disappear. They produce a lot of code quickly, but that also means they leave a lot behind: unused definitions, inconsistent style, and half-finished refactorings.
Since I couldn’t keep up with reviewing code at the speed these AI assistants generate it, I ended up writing (with their help) a couple of tools to clean things up.
prune: removes top-level definitions (values, types, constructors, etc.) that are exposed in .mli files but unused elsewhere in the codebase. It relies on merlin occurrences to find unused identifiers and parses compiler warnings to clean up any breakage (see warning.mli for the full list of handled cases).
merlint: runs a set of small static checks using Merlin, ppxlib, and some basic Dune file parsing. It looks for things like code complexity, documentation style, test organisation, etc. (a rough sketch of what the complexity check looks at is below). This is all very specific to my own style of writing OCaml code, but it can be made configurable (a bit). See here for the full list of rules (suggestions to add or change some of them are welcome).
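To give an idea of what the complexity check is about, here is a sketch; the function, the way branches are counted, and any threshold are illustrative rather than merlint’s actual rules:

```ocaml
(* Illustrative only: cyclomatic complexity roughly counts the independent
   paths through a function, i.e. decision points (match arms, if/else,
   boolean short-circuits, ...) plus one. A checker with a threshold flags
   a function once this kind of branching accumulates. *)
let classify ~verbose input =
  match input with
  | [] -> "empty"
  | [ x ] -> if x > 0 then "one positive" else "one non-positive"
  | x :: y :: _ ->
      if x > y then (if verbose then "descending (verbose)" else "descending")
      else if x = y then "plateau"
      else "ascending"
```

Splitting the nested branches into small helpers is the usual way to bring such a count back down.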
I originally wrote these to deal with AI-generated code, but they’ve been useful on more conventional OCaml projects too, especially older ones, ones with many contributors, or ones I wrote a long time ago.
Both are still evolving, but they’ve already made my workflow a bit smoother. Feedback and suggestions are welcome. I’m still undecided whether I should publish them to opam or not.
Since you seem to be partly asking whether there is interest: FWIW I was interested enough in prune to get far enough into building it from source to see that it depends on OCaml 5.3 in a way that is more than a simple stdlib compat shim away. I still mostly need OCaml 4.14 compatibility, but otherwise it looks very useful. (To be clear, I’m not asking you for 4.14 compat; I know that for tools operating on the parsetree that is nontrivial.) So take this as a vote for publishing.
prune relies heavily on merlin occurrences, which itself relies on recent compiler patches that removed the last false positives and made this kind of analysis precise enough to be useful. So I don’t think it could easily be backported to 4.14.
Thanks a lot for these tools. I find this very interesting, and it’s quite fascinating to see the rapid progress in AI-assisted projects.
I’ve been experimenting with assistants too, and this further motivated me to try to be more disciplined about systematic testing and code coverage in my projects, if I am to imagine onboarding AI-generated code at any non-trivial scale one day.
That leads me to a first question about prune: how would you compare it to using bisect_ppx to find unused code? I have been experimenting with a pattern where code is tested mostly through end-to-end functional tests, and bisect_ppx is used to closely monitor what actually ends up being exercised. If something is untested, it’s either an indication that I am missing a test or that I need to delete dead code. Once you have managed (it’s a lot of work) to get to a place where you are close to 100% intended coverage, I found that this potentially subsumes the use case for a tool like prune. What do you think?
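Concretely, my setup is roughly the standard dune/bisect_ppx instrumentation workflow (the library name is made up, and the exact commands may vary with versions):

```
; dune: instrument the library under test with bisect_ppx
(library
 (name mylib)                 ; made-up name
 (instrumentation
  (backend bisect_ppx)))

; then, roughly:
;   dune runtest --instrument-with bisect_ppx
;   bisect-ppx-report summary    ; or: bisect-ppx-report html
```

Anything the end-to-end tests never reach shows up as uncovered, and then I decide whether it needs a test or should be deleted.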
I have been playing with merlint too, and it has been a lot of fun so far. I have some reservations about its built-in aversion to cyclomatic complexity (first time I had heard that term, by the way!), but it gave me material to think about, which is great. Where would be a good place to discuss details? I’m happy to open issues in the project and meet you there. Thanks a lot!
I agree that code coverage is also an important part of the tooling (and I’ll probably need to integrate it more into my workflow!), but I am not sure how you would compare the two directly.
The goal of prune is to remove declarations from mlis: if something is used only internally, it will just remove it from the exported symbols. This in turn can generate compiler warnings (for instance about unused code: it turns out the export was keeping the symbol alive, but no internal code was actually using it). prune then parses those compiler warnings/errors and removes the now-dead code (this also works for unused variant cases, record fields, etc.).
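A minimal sketch of that cascade, with made-up file and identifier names:

```ocaml
(* lib.mli, before: [helper] is exported but no other module uses it *)
val run : unit -> unit
val helper : int -> int

(* lib.ml: [helper] is not used internally either *)
let helper x = x + 1
let run () = print_endline "run"
```

Here prune first drops `val helper` from the .mli (merlin occurrences finds no external use); rebuilding then produces an unused-value warning (warning 32) for `helper` in lib.ml, which prune parses so it can delete `let helper` as well. The same kind of cascade applies to unused constructors, record fields, and so on.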
prune can exclude some files from the analysis (with, say, --exclude tests/) so that tests that use some code do not keep it “alive” by mistake.
On the other hand, I feel like “if something is untested, it’s either an indication that I am missing a test or that I need to delete dead code” is pretty manual. However, if you have managed to automate this, then I agree it would be similar! (A side note: I found that unit tests are actually more efficient for letting the AI assistant fix errors, whereas with integration tests the layers of stuff to unfold are sometimes too large for it.)
I tried prune briefly and found that it was reporting as unused large amounts of code that I know to be used. Is there something I need to do to tell it what my entrypoint/executable is, or is it primarily intended for libraries without some central use-all-the-pieces code?
Thanks for the report; this is surprising. Could you check the output of prune doctor?
The intended use is a project with at least one executable that uses the library code defined in the project; otherwise it will detect that all the library code is unused. But maybe something odd is happening when detecting executables. Do you have a small repro by any chance? Feel free to open an issue on the GitHub tracker with more details.
In the end-to-end full-coverage scenario, you know that there is no dead code in the ml files, but that doesn’t actually say anything about the mlis. You could be exporting every single function and that wouldn’t make any difference.
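A tiny made-up illustration of that blind spot:

```ocaml
(* parser.mli: over-exports an internal helper *)
val parse : string -> int list
val split_words : string -> string list

(* parser.ml *)
let split_words s = String.split_on_char ' ' s
let parse s = List.filter_map int_of_string_opt (split_words s)
```

An end-to-end test that only calls `parse` still exercises `split_words`, so bisect_ppx reports full coverage, yet `val split_words` could be dropped from the .mli; that interface-level question is exactly what coverage doesn’t answer.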
Now you’re saying that with prune, you have a tool that helps you reduce your exported API, based on occurrence analysis, with exclusions, etc.
I understand the part about refining your mlis and how that “can generate compiler warnings”, etc. However, in the bisect_ppx scenario, reducing the mlis would generally not create any unused-code compiler warnings, given that all the code contributes to the end-to-end tests. But this seems useful in general!
I’ll do some more experimenting. Better control and analysis of mlis sounds like a nice and very useful capability to have in the toolbox! And I’m starting to see how it can be complementary to the coverage approach. Thanks for the explanation!