I’ve put together an ocaml-ai-disclosure proposal to allow voluntary disclosure of AI usage in published OCaml code, using opam metadata and extension attributes in source code.
The repository and blog post have more details, some prototype tooling to extract attributes, and a FAQ, but in a nutshell I’m proposing something very similar to a W3C disclosure proposal for HTML.
Package Disclosures
An opam package can declare its disclosure using extension fields:
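For example, a sketch of what this might look like (the x-ai-* field names below are illustrative placeholders for whatever gets standardized; opam does reserve the x- prefix for user-defined extension fields):

```
opam-version: "2.0"
synopsis: "An example package"
# ... the usual fields ...
# Hypothetical extension fields for AI disclosure:
x-ai-disclosure: "assisted"
x-ai-models: ["anthropic-claude-opus-4-6" "some-other-model"]
```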
I couldn’t find prior art from other language ecosystems trying anything similar, so I’d be interested in hearing about any others you all know about. If there’s no interest in the wider ecosystem in doing this, then I’ll just use it myself, but I figured there’s no harm in starting the discussion!
No need to sprinkle annotations across the code; just add it to the opam file and it’ll tag the whole repo as well. It’s also just convenient to know which model etc. was used.
As simple as repeated attributes; the opam-ai-disclosure plugin picks that up and makes a list.
e.g. with a toplevel attribute in the opam-ai-disclosure plugin:
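A sketch (the ai.* attribute names are placeholders; the [@@@ ...] floating-attribute syntax itself is standard OCaml and is ignored by the compiler unless tooling consumes it):

```ocaml
(* Hypothetical floating attributes at the top of a module,
   to be picked up by the disclosure tooling. *)
[@@@ai.disclosure "assisted"]
[@@@ai.model "anthropic-claude-opus-4-6"]
[@@@ai.model "some-other-model"] (* repeated attributes accumulate into a list *)
```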
I’m not sure why we want OCaml-specific machinery for something like this.
Seems like REUSE and its tooling are a better model, especially when codebases can be heterogeneous, with components shifting between languages, and use language-agnostic build tools like Make, Bazel, Buck2, etc. (or several build tools together).
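For reference, REUSE works through per-file SPDX comment headers plus repo-level tooling (`reuse lint` checks coverage); in an OCaml source file such a header looks like the sketch below, and a disclosure tag could in principle ride along in the same place (the names here are made up):

```ocaml
(* SPDX-FileCopyrightText: 2025 Example Author <author@example.com>
   SPDX-License-Identifier: ISC *)
```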
Would love to use something like this, and I think having the ability to use module-level annotations is great.
That being said, I agree with @henrytill that the way packages disclose use of AI should not be dependent on OCaml or dune.
Module-level annotations should help populate a toml/json/whatever file that describes the use of AI within a codebase. (The same way the license field of an opam file doesn’t remove the need to write a LICENSE file that is independent of OCaml.)
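For instance, the extracted annotations could be aggregated into something like the following (a purely hypothetical shape, filename, and schema, for illustration only):

```json
{
  "schema": "ai-disclosure/1",
  "modules": [
    {
      "path": "src/parser.ml",
      "disclosure": "assisted",
      "models": ["anthropic-claude-opus-4-6"]
    }
  ]
}
```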
Since you bring up licensing: I’m wondering how much of this couldn’t simply be covered by the copyright holder and/or license details, since that’s where all the interesting legalities are going to happen in the future.
More precisely, I’d be happier trying to frame this as an SPDX license exception (WITH):
(** SPDX-License-Identifier: ISC WITH x-anthropic-claude-opus-4-6 *)
I like the idea of tying this to licenses. The SPDX entry is per file, which is more realistic than one per project for capturing how this evolves.
That being said, I am skeptical about the overall benefit. The open source movement and its vocal license advocates have little to show for enforcing licenses when AI models have absorbed the intellectual property in a way that was unforeseen. How is that going to change?
While at the moment it’s quite easy to recognize generated code slop (no character, pointless convolutions, longer than needed, etc.), if that manages to evolve I’m more interested in sources being explicitly tagged as radioactive liability material than in having the proper legal argument.
This is already possible within the SPDX license spec (since v3):
ISC WITH AdditionRef-anthropic-claude-opus-4-6
opam supports the full spec and will not raise any warning with this syntax (since opam 2.4, or 2.2 if you build it with the latest version of the spdx_licenses library)
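Concretely, an opam file could then carry this today (a sketch; only the license: line is the point here, relying on the claim above that opam ≥ 2.4 accepts the full SPDX expression syntax):

```
license: "ISC WITH AdditionRef-anthropic-claude-opus-4-6"
```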
Companies are pushing out new models in short periods of time. If multiple models are used and different contributors use different vendors, the list can get long quickly. In-code annotations might not be ideal in the long run.
I am a firm believer that outright refusal should be considered a viable option for dealing with LLM output in the ecosystem, say, for example, in the opam repo.
Although, as far as I understand, there is no precedent for opinionated decisions in the repo, and a repo is quite different from a single project or even a language implementation.
If this relies on voluntary declaration rather than detection, it’s a losing battle: it can’t be enforced, and what counts can’t even be decided. The spectrum of what constitutes LLM output is too wide and the potential for AI in software development too exciting.
This argument, that it is impossible to practically enforce, is often made against LLM-content bans, and you will see versions of it in the discussions I linked. In my opinion, it misses the point. The point isn’t technical, it’s sociological: to set expectations of conduct in the community.
There are sufficient responses to it in those links, but I will attempt my own here in the interest of discussion.
I implore you to consider: voluntary disclosure, regular licensed software contributions, and locking your door when leaving your dwelling unattended all rely on trust boundaries and majority good faith; that is, on the actor not behaving subversively to avoid disclosure, plagiarize differently-licensed code, or bypass your lock with a weirdly bent hairpin, respectively. In much the same way, an LLM-ban policy would set a boundary and assume good faith.
A ban would set playing rules for those interested in playing fair. And that is the majority of our community.
It would go further than disclosure, however, in standing guard against low-effort contributions. Someone trying to subvert the rule would, ironically, have to go into their contribution and put in effort to make it look like the work of a human, the same human submitting it, and to be able to show they understand the output when challenged on it. This is effective enough to stop a large amount of slop from ever having to be dealt with.
It also gives maintainers a stronger liability shield and empowers them to reject thankless work effortlessly. At least output used to somewhat correlate with effort when it was authentic rather than generated.
“the potential for AI in software development too exciting”
Exciting as well was the potential for plastics in the 90s. As Bender would say.
Perhaps if there had been a stronger, more widespread refusal of plastics at the time, more ecologically viable alternatives would’ve been developed much sooner, bypassing today’s ecological crises entirely.
There may be an anxiety about keeping up with “relevance” and “progress” that motivates a looser grip on LLM content, but my impression of the OCaml community has been that we’ve always been steady and forward-looking, valuing high-quality solutions (technical or otherwise) over moving fast and breaking things.
None of these communities try to enforce a full ban on LLM-generated code across the whole ecosystem. The way the opam-repo works is sometimes opinionated, but is it its role to enforce such decisions on everyone?
I did point out it’s unprecedented but worth considering nonetheless.
You could think of the opam-repo as a project, not a registry of the “whole ecosystem” but a (blessed, central, community-maintained) set of packages which participants in the ecosystem can choose to submit their work to and the repo maintainers can choose to reject for any reason.
opam is designed such that this central repo is a convenience, not a requirement (opam pin, opam repo, etc.). That means it is absolutely plausible for opam-repo maintainers to be opinionated and reflect an “official” position. Neutrality shouldn’t be pushed on upstream as a non-negotiable, because the tool itself affords you the choice to expand on upstream trivially.
The closest analogy to opam’s function in “the ecosystem” and the proposed stance would be Gentoo’s position.