A Proposal for Voluntary AI Disclosure in OCaml Code

Thanks, everyone, for the thoughtful and polite feedback on my proposal. I’ve received a lot of private comments as well, from many perspectives, so I’ll attempt to digest them here.

The prevailing concerns seem to revolve around quality, security and (to a lesser extent) legalities. This is not to diminish the debate around ethics, but it is such an active and evolving topic that I can’t pin much down there yet.

Security

This is a growing concern for the opam-repository, and one that I think goes well beyond CVE tracking. As opam repo maintainers, we often rely on a social signal: we “sniff” a packaging PR and browse around the original source to check that it’s reasonable. In many cases, we offer suggestions to the package submitter, and many submitters apply those changes.

Now, however, with LLM-generated content, this social signal is demolished, since every package comes with confidently verbose reams of text. It’s no longer practical to assess code by quickly reading through it, and we’ll need some other measure or automation to help out here. I offer no quick solution, beyond pointing to some emerging type-driven linters that can distinguish “bad vibe coding” from the more curated, agentically boosted approaches.

A major problem here is that backdoors could slip quite easily into this high-volume code, which leads on to the next topic of quality.

Quality

We’ve resisted measuring popularity by the number of downloads in opam, preferring instead to look for more stable metrics such as the number of downstream dependencies on a package. This signal has been pretty good; there are islands of popular maintainers and packages, and the opam repository serves to aggregate them all and sort out incompatibilities at package submission time via constraints. In other words, the opam repo is a collective database that is more than the sum of the individual packages.
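To make the "downstream dependencies" signal concrete, here is a minimal sketch of counting direct dependents. The list-of-pairs representation and the package names are made up for illustration; this is not how opam stores or queries its metadata.

```ocaml
(* Hypothetical model: each entry is (package, its direct dependencies). *)
type deps = (string * string list) list

(* For every package, count how many other packages depend on it directly. *)
let downstream_counts (deps : deps) : (string * int) list =
  List.map
    (fun (name, _) ->
      let dependents =
        List.filter (fun (_, ds) -> List.mem name ds) deps
      in
      (name, List.length dependents))
    deps
```

A real metric would probably want transitive dependents as well, which is a graph traversal over the same data.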

With LLM-generated code, there’s often a desire to ‘throw something over the wall’ and not keep it updated. If we accept these sorts of packages into the opam repository, we’re not improving the health of our collective database, since unmaintained packages could rapidly accrue dependencies without humans behind them.

Therefore, our maintenance intent field might become more important moving forward. I can see us accepting LLM packages (those that clear a minimum quality bar, which we can leave to opam repo maintainer judgement) that are set to a maintenance intent of none. This would, at least, be honest, and a signal that other people are welcome to pick up the baton and iteratively improve that particular effort.

A useful improvement to opam itself may be to avoid packages in the dependency chain that have declared themselves unmaintained.

Legality

This one’s potentially the most serious, especially given the diverse and international nature of our contributors (from individuals, to corporates, to academics). Unfortunately, it’s also the most in flux; the current legal situation is murky, varies by country, and is being actively legislated almost everywhere.

The goal of my proposal above is voluntary disclosure to make future provenance easier to figure out, but I have doubts it’s going to take off: even within my own group, people are reluctant to disclose AI usage for a variety of reasons. Some worry it’s a poor social signal, others have it tightly integrated into their workflows and treat it like a code editor, and yet others are not computing experts and do not distinguish.

However, if you do have strong opinions, then now is the time to feed back to your legislative bodies! @samoht pointed out to me that the EU is seeking feedback on Article 50, so I’ll be submitting a synopsis to that.

So what do we do next?

I have just three concrete suggestions for now:

Make maintenance intent first-class in opam

We could promote the x-maintenance-intent field to be a first-class opam field, and actively ‘solve around’ unmaintained packages. We have this really fancy solver, so why not use it?
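As a rough sketch of what ‘solving around’ could mean, the snippet below prefers maintained candidates for a dependency and only falls back to unmaintained ones when nothing else is available. The types, the function names and the reading of "(none)" as "unmaintained" are all assumptions made for illustration; none of this is opam's solver API.

```ocaml
(* Hypothetical, simplified model; not opam's real data structures. *)
type intent = Maintained | Unmaintained | Unknown

type package = { name : string; version : string; intent : intent }

(* Interpret the raw value of an x-maintenance-intent field.
   Treating "(none)" as unmaintained is an assumption of this sketch. *)
let intent_of_field = function
  | Some "(none)" -> Unmaintained
  | Some _ -> Maintained
  | None -> Unknown

(* Among the candidates that could satisfy a dependency, keep the ones not
   declared unmaintained. If that leaves nothing, return the original list
   so the request stays solvable. *)
let prefer_maintained candidates =
  match List.filter (fun p -> p.intent <> Unmaintained) candidates with
  | [] -> candidates
  | kept -> kept
```

The fallback matters: a hard exclusion could make existing installs unsolvable, whereas a preference only steers the solver when it has a choice.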

Improve tooling for multiple package repositories

opam supports handling multiple simultaneous package repositories just fine. In fact, we’ve got two active ones today: ocaml/opam-repository and ocaml/opam-repository-archive.

What’s missing is the tooling to manipulate, filter and merge multiple opam repositories easily (I pushed repomin for this purpose). Having better tooling here would allow us to (for example) have:

  • an opam repository just for all OCaml compilers. This is extremely useful for the developers and packagers and testers to have just the build rules and patches in one place.
  • an opam repository that’s compatible with Windows, with non-building packages filtered out.
  • an opam repository that’s got just the latest versions of packages (an equivalent of Stack).
  • an opam repository with only a core of curated and maintained packages that’s small and portable.
  • an opam repository that explicitly accepts ‘work in progress’ LLM-generated outputs, for those who want to live on the agentic bleeding edge.
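The kind of manipulation these examples need can be sketched with three small operations: merge, filter and "latest only". This is an illustrative model, not repomin's or opam's actual representation; the `repo` type, the function names and the newest-first ordering of versions are assumptions.

```ocaml
module StringMap = Map.Make (String)

(* Hypothetical model: package name -> available versions, newest first.
   Assuming that ordering avoids needing opam's version comparison here. *)
type repo = string list StringMap.t

let of_list (entries : (string * string list) list) : repo =
  StringMap.of_seq (List.to_seq entries)

(* Merge two repositories. When both define a package, the entry from the
   higher-priority repository wins. *)
let merge (high : repo) (low : repo) : repo =
  StringMap.union (fun _name versions _shadowed -> Some versions) high low

(* Keep only the packages whose name satisfies the predicate, e.g. a list
   of packages known to build on a given platform. *)
let filter_names (keep : string -> bool) (r : repo) : repo =
  StringMap.filter (fun name _ -> keep name) r

(* Keep only the newest version of each package, dropping empty entries. *)
let latest_only (r : repo) : repo =
  StringMap.filter_map
    (fun _name versions ->
      match versions with [] -> None | newest :: _ -> Some [ newest ])
    r
```

Each of the repositories listed above is then some composition of these: for instance, a curated core would be `filter_names` with an allow-list, and a latest-only repository would be `latest_only` applied to the merged result.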

Is it time to consider a reputation system?

@hannes has worked on conex for many years, but it hasn’t been pushed into opam repository due to the significant hassle involved in key management for end users.

Is it now time to bring back a system like this, but with vouching as a first-class feature? The good folk at tangled.org have been building “evidence” into their vouching system, which took me back to the good old days of Advogato (for the real oldies among you!).

As with all such efforts, this will require coordination and contribution from all interested in making such a change happen. I’m very willing to be corrected on anything I’ve raised above!
