Opam-repository: security and data integrity posture

In connection with another thread discussing how Bitbucket’s shutdown of Mercurial support affected the availability of 60+ projects’ published versions, I learned a number of concerning facts about how the opam repository is arranged and managed.

In summary, it seems that opam / opam-repository:

  1. Never retains “published” artifacts, only links to them as provided by library authors.
  2. Allows very weak hashes (even md5).
  3. Allows authors to update artifact URLs and hashes of previously “published” versions (both live in each package’s url stanza; see the example after this list).
  4. Offers scant support for individually signing artifacts or metadata.
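
For reference, each published version’s opam metadata points at an externally hosted tarball via a url stanza like the following (the package, URL, and hash here are invented for illustration); nothing in the format prevents a later PR from rewriting both fields in place:

```
url {
  src: "https://bitbucket.org/someone/somelib/get/v1.2.3.tar.gz"
  checksum: "md5=4c19571d172296874d4e573307e25cf9"
}
```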

All of these are quite dangerous. As a point of reference, the ecosystems I was most familiar with before OCaml (JVM and JavaScript) each had very serious documented failures and exploits (and many, many more quiet ones) until their respective package managers (Maven Central et al., and npm) plugged the above vulnerabilities that opam-repository still suffers from.

To make things concrete, without plugging the above (and especially items 1-3):

  • the availability and integrity of published libraries can be impacted by third-party hosting services changing their offerings or going offline (as in the case of the Bitbucket closure)
  • the integrity of libraries can be impacted by authors non-maliciously publishing updates to already-released versions, affecting functionality, platform compatibility, build reproducibility, or all of the above (anecdotes of which were shared with me when talking about this issue earlier today)
  • the integrity of libraries can be impacted by malicious authors publishing updates to already-released versions
  • the integrity of libraries can be impacted by malicious non-authors changing the contents at tarball URLs to include altered code that could e.g. exfiltrate sensitive data from within the organizations that use those libraries. This is the nuclear nightmare scenario, and opam is unfortunately wide open to it, thanks to artifacts not being retained authoritatively and essential community libraries continuing to use md5 in 2020.

Seeing that this has been well-established policy for years was honestly quite shocking (again, compared to other languages’ package managers, which have had these problems licked for a very long time). I understand that opam and its repository probably have human-decades of work put into them, and that these topics have been discussed here and there (in somewhat piecemeal fashion, AFAICT), so I’m certain I have not found (never mind read) all of the prior art, but I thought it reasonable to open a thread to gauge what the project’s posture is in general.

8 Likes

Hello,

First of all, thanks for your post raising this issue, which is important to me as well.

I’ve been evaluating and working on improving the security of the opam-repository over the years, including getting opam to stop using curl --insecure (i.e. to properly validate TLS certificates). The work-in-progress result is conex, which aims at cryptographically signed community repositories without single points of failure (threshold signatures for delegations, built-in key rollover, …) - feel free to read the blog posts or the OCaml meeting presentations. Unfortunately it does not yet have enough traction to be deployed and made mandatory for the main opam repository.

Without cryptographic signatures (and an established public key infrastructure), the hashes used in opam-repository and opam are checksums (i.e. integrity protection) rather than a security measure. On threat models, I recommend reading section 1.5.2, “goals to protect against specific attacks” - that is what conex is based on and attempts to mitigate. I’ll most likely spend some time improving conex over the next year, and finally deploy it on non-toy repositories.
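
To give a rough idea of the delegation model (a conceptual sketch only, nothing like conex’s actual data representation or wire format):

```ocaml
(* Conceptual sketch of threshold-signed delegations; illustrative only,
   not conex's actual data model. *)
type delegation = {
  paths : string list;     (* package prefixes this role may modify *)
  key_ids : string list;   (* fingerprints of the authorised public keys *)
  threshold : int;         (* valid signatures required to accept an update *)
}

(* An update is acceptable once at least [threshold] distinct authorised
   keys have produced a valid signature over it. *)
let sufficient (d : delegation) ~(signers : string list) =
  let valid = List.filter (fun k -> List.mem k d.key_ids) signers in
  List.length (List.sort_uniq compare valid) >= d.threshold
```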

In the meantime, what you’re mentioning:

  1. “Never retains ‘published’ artifacts” <- this is not true: the opam.ocaml.org host serves as an artifact cache, and is used by opam when you use the default repository. This also means that the checksums and the tarballs are usually taken from the same host -> whoever has access there can change anything arbitrarily for all opam users.
  2. “Weak hashes” <- this is true. I’d appreciate it if (a) opam warned (configurable to error out) when a package uses weak checksum algorithms, and (b) Camelus or some other CI step warned when md5/sha1 are used (a rough sketch of such a check follows this list).
  3. “Authors can modify URLs and hashes” <- sometimes (when a repository is renamed or moved on GitHub) the GitHub auto-generated tarball has a different checksum. I’d prefer that, instead of updating that metadata in the opam-repository, we add new patch versions (1.2.3-1, etc.) with the new URL & hash. There could also be a CI job / Camelus check for what is allowed to be modified in an edit of a package (I think, given the current state of the opam-repository, “adding upper bounds” on dependencies needs to be allowed, but not really anything else).
  4. I’m not sure I understand what you mean - is it about cryptographic signatures and setting up a public key infrastructure?
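
Regarding (b) in point 2, a rough sketch of such a check (the helper below is hypothetical, and a real Camelus/CI step should parse the opam file properly rather than scan its text):

```ocaml
(* Rough sketch: flag md5/sha1 checksum entries by scanning an opam
   file's text. Requires the str library; a real Camelus/CI check
   should use a proper opam-file parser instead. *)
let weak_re = Str.regexp {|\(md5\|sha1\)=|}

let weak_checksum_lines path =
  let ic = open_in path in
  let rec scan n acc =
    match input_line ic with
    | exception End_of_file -> close_in ic; List.rev acc
    | line ->
      let acc =
        match Str.search_forward weak_re line 0 with
        | _ -> (n, line) :: acc
        | exception Not_found -> acc
      in
      scan (n + 1) acc
  in
  scan 1 []

(* e.g. weak_checksum_lines "packages/foo/foo.1.0.0/opam" *)
```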
5 Likes

For a while now we have avoided doing this in the opam-repository: we usually ask submitters to re-upload the tarballs from the opam cache, and when that does not happen for some reason, we manually check that the contents match. For other modifications, we ask for new patched releases instead.

Clearly this does not prevent malicious modifications of the source tarballs after the merge, but it is at least a sort of mitigation.
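
For reference, the core of that manual content check amounts to something like this minimal sketch (using the digestif library; the paths and function names are hypothetical, and in practice one may also diff the extracted archives):

```ocaml
(* Minimal sketch: compare two tarballs (say, the opam-cache copy and a
   freshly fetched upstream copy) by SHA256. Uses the digestif library;
   the paths below are hypothetical. *)
let sha256_hex path =
  let ic = open_in_bin path in
  let data = really_input_string ic (in_channel_length ic) in
  close_in ic;
  Digestif.SHA256.(to_hex (digest_string data))

let same_content a b = String.equal (sha256_hex a) (sha256_hex b)

(* e.g. same_content "cache/foo-1.0.tbz" "fetched/foo-1.0.tbz" *)
```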

This could surely be useful; after all, we are human and prone to error. It should also support making packages unavailable (available: false).
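
For context, that is a one-line addition to the affected package’s opam file:

```
available: false
```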

1 Like

Thanks for sharing, I had completely missed this. Really interesting project; I look forward to seeing how it evolves.

Yes, that is the other edit which came to my mind as well.

What I forgot to mention above is that while you can “secure” the ocaml/opam-repository git repository (and protect against changes to it) by using safeguards (CI/Camelus/…), what the common user of opam uses is not a git clone, but an https download from opam.ocaml.org – so this host is at the moment the weakest link: if you change anything on it, users will take its metadata (and data) as authoritative.

1 Like

The conex project looks very interesting. I need to do more reading on it and its objectives, but you were right in our private communications that its aims appear to go well beyond the specific ills I’m observing at the moment, i.e. not only ensuring artifact integrity but also guarding by construction against compromised upstream repositories.

My experience is with repository systems that do demand trust of the repository(ies) of record, relying on SSL, certificate pinning, cryptographic signing of artifacts (yes, using a PKI), and aggressive downstream caching of always-immutable resources so as to recover from any new tainted releases from an exploited upstream repo. The last of these is mostly just a coping mechanism for the (much harder) problem of protecting against attacks on/via trusted repositories.

Yes, I’m aware that there is an artifact cache, but there are definitely contexts where it is not authoritative; otherwise, the notion of “broken upstreams” simply would not exist. At some point that cache is being flushed (perhaps when an author’s PR updates artifact URLs; maybe there are other cases?). I’m sure there are interesting questions about when, how, and why upstreams might be touched after a library version is first published, but the bottom line is that immutably retaining artifacts as they are when first published eliminates an entire class of potential security exploits and build irreproducibility problems.

As you say elsewhere, without cryptographic signatures, the provided hashes are “just” checksums. Unfortunately, md5 appears to be widely used, and, as we all know, entirely useless. Forget a warning: I cannot contemplate a rationale for why it is still allowed in any way. Much like known-broken ciphers or TLS versions, md5 (and really, sha1 too, for the purpose of a single-blob validating checksum) is “unsafe at any speed”.

1 Like

The problem with any justification for why hashes and/or upstream artifacts might change is that such changes are completely indistinguishable from tampering. For an analogy: how much trust would you put in an operating system .iso whose checksum changes over time?

Just a few minutes of nosing around in recent commits yields changes like this, where artifact URLs and checksums are changed for already-published libraries. Again, such changes are indistinguishable from tampering (malicious or not).

I appreciate that those considering the submitted PRs do what they can to vet these changes, but I think it’s unreasonable to expect such efforts to be successful over the long run. (To tweak the old saying, “They only have to be lucky once; we have to be lucky every time.”)

Mitigation is reasonable when it’s impossible or too expensive in some way to eliminate the vulnerability in question entirely. I’m going to sound like a broken record here, but these problems have been solved in other language communities for a long, long time, and the solutions there actually drive down operational costs.

2 Likes

Since, in the current state of affairs, the hashes are effectively just checksums against unintentional data corruption, it should be noted that, contrary to popular belief, there is absolutely nothing weak or useless about using md5 for that purpose.

1 Like

You’re quite right; I was projecting as to what the hash validation could (should) be used for, were artifacts not allowed to churn. But then, by the same logic, a CRC would provide the same assurances. One of those situations where the whole is less than the sum of its parts.

A closely related issue is How to setup local OPAM mirror, since integrity checks and verification will become even more important if there are multiple mirrors in the future.

1 Like

If I may offer a pragmatic suggestion: has anyone considered contributing the necessary integration to existing solutions such as Sonatype Nexus or JFrog Artifactory? Both are in heavy use within the industry and of high quality. I feel this approach would make a lot of sense, as most companies/institutions already use multiple package types, and having one central location for all of them simplifies management. It is possible that one or both companies (Sonatype/JFrog) would be willing to do the necessary implementation work if asked.

This is true, and these (and other Maven-compatible) package repository implementations do provide certain architectural guarantees that the vulnerabilities in question are simply impossible. That said, I wouldn’t recommend using anything Maven-compatible for a general-purpose OCaml project repository; the model is much less flexible than opam’s in ways that would negatively impact many OCaml projects (e.g. it makes running post-install scripts quite difficult in comparison).

I think if an off-the-shelf alternative is desired, then I’d point to npm, either targeting the “main” npmjs.org instance, or standing up one dedicated to OCaml projects. I have quibbles with npm’s model, too, but they’re fundamentally just that. It’s a very flexible system that’s mostly pleasant to work with, and I suspect ~every OCaml project would be able to transition to it without much difficulty. As it is, https://esy.sh exists, and demonstrates the degree to which even mixed npm/opam-modelled projects can work quite well.

That said, suggestions for structural improvement (especially of the “nuclear” variety, like “use npm instead”) are basically non sequiturs insofar as the opam / OCaml platform team find the current situation and direction acceptable (or even desirable, as it seems with e.g. package authors being encouraged to alter already-published metadata and artifacts). Determining this posture more explicitly was the objective of my opening the thread in the first place; absent statements otherwise, one has to presume that things will remain as they are, and perhaps separately pursue ways of guarding against threats posed by using opam in our dev workflows.

1 Like

I wonder if we couldn’t already switch all of the tarball checksums in opam-repository to SHA256 or something, en masse.
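
As far as I understand, opam 2’s checksum field already accepts a list, so sha256 entries could even be added alongside the existing md5 ones rather than replacing them outright; the package and hash values below are invented:

```
url {
  src: "https://github.com/someone/somelib/archive/v1.2.3.tar.gz"
  checksum: [
    "md5=4c19571d172296874d4e573307e25cf9"
    "sha256=9f2d0c6a1e4b7d3c8a5f0e2b6d4c1a7e3f8b5d092c6e1a4f7b0d3c8e5a2f6b19"
  ]
}
```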

This is true, and these (and other Maven-compatible) package repository implementations do provide certain architectural guarantees that the vulnerabilities in question are simply impossible. That said, I wouldn’t recommend using anything Maven-compatible for a general-purpose OCaml project repository […]

It is not clear to me what “Maven-compatible” means in this context; would you mind clarifying? I’m asking because both products I mention support npm packages, which you explicitly call out as a better alternative model. There is also support for Docker, Go, yum, etc., so Nexus and Artifactory both seem to have a fairly generic model on top of which you can add support for whatever you want.

Excuse me, I misunderstood your suggestion. I am less familiar with Artifactory, but both it and Sonatype’s products have historically been rooted in the Maven ecosystem, so I thought you meant that perhaps opam should adopt Maven’s model, etc.

(I personally would say that, whatever their merits in an enterprise context, at least Sonatype’s tools are fairly unpleasant to work with in a 2009-esque way, whichever artifact type you’re dealing in.)

Insofar as npm’s model is desirable, “just use npm” seems like a reasonable strategy, and any of the existing alternative implementations would work fine as well (GitHub and GitLab each provide npm-compatible package registries, and something like AWS CodeArtifact is a good off-the-shelf option if the OCaml platform aims to control its own infrastructure).