The title is a bit dishonest, as it doesn’t really tell you how to fix them, but I think this talk has the quality of being a great conversation starter.
As we are all aware, opam-repository also suffers from all of these problems, modulo size. While parts of this have been discussed profusely here before, I think it would be nice to have an economics-focused discussion featuring the interested parties (opam-repository maintainers, infrastructure maintainers, the OCaml Software Foundation, companies with available funds).
I’m not currently available to bootstrap this, but hopefully someone reading this is.
Can I say, first, thank you for posting this. Second, wow wow wow, Michael Winser does a great job in that talk. Everybody should watch it. It really is great.
Third, I would very, very much like to read the discussion that might emerge from people watching that talk and commenting here. It’ll be fascinating.
And last, again, wow, that was a great talk to watch, and thank you -again- for posting it.
What is the current cost of the opam infrastructure? Are there any figures around? How are this cost and the underlying metrics (e.g. number of monthly downloads) growing?
What could be decentralized? Could a peer-to-peer protocol replace some parts of the infrastructure, e.g. the download servers? I understand that some parts (build servers, CI infrastructure, the manpower to run all that, …) cannot easily be distributed.
Beyond the economics of distributing packages, what also concerns me is the trust we can place in those packages: when I download a random opam package, could it introduce a security issue into my code?
I feel scalability and security are two sides of the same coin and should probably be tackled together.
It’s worth watching the talk. He discusses download bandwidth, caching, etc. at length, and how that’s just not a problem. It comes up repeatedly in the Q&A, which causes him to really insist on the point, with other maintainers in the audience chiming in. For a little taste: he puts up a leaderboard (haha, borrowed from “Family Feud”) of all the things that these repository maintainers work on, and the priority order in which resources get allocated to that work. And… “security development” is, literally, in last place on the list.
That, I thought, was a great example of the problem he’s trying to get at. To be clear, he has no solutions: he’s diagnosing a problem, and that’s valuable because before you can search for a solution you have to know there’s a problem that needs solving, what its impact is, etc.
I wonder if the OPAM maintainers could comment on that leaderboard.
Can be, but isn’t for the vast majority of non-enterprise users.
Some good news in that regard in the OCaml world:
conex development is active again
discussions are happening with the opam team to focus on some security features
Gabriel is actively using the OCaml Software Foundation to support the opam-repository maintainers, so that we have a robust team of trusted people to review new submissions
at least the opam-repository maintainer team is usually happy to have new volunteers. At the moment there are again 50 open PRs and 140 issues that need triaging, but there are not many incentives to do this work. (If you read this as an opam-repository maintainer, former maintainer, or future maintainer: thank you so much for doing the work! It is highly appreciated.)
I know I’m not one of the interested parties, but I’ve been thinking for a long time that opam-repository seems to have further pressures than other language-specific registries. opam’s is the only one that I know of which:
Manually approves each published version,
Discourages upper bounds upstream, then spends time retroactively adding them as conflicts arise,
Runs CI to find these conflicts,
And feels the need to periodically announce the archival of old versions to reduce its burden.
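To make the second point concrete, here is a hypothetical sketch of what that retroactive work looks like (the package name `somelib` and the version numbers are made up for illustration):

```opam
# Upstream publishes metadata with only a lower bound:
depends: [
  "somelib" {>= "1.2"}
]

# After a new somelib release breaks the build, a maintainer edits the
# already-published metadata in opam-repository to add the upper bound:
depends: [
  "somelib" {>= "1.2" & < "2.0"}
]
```

Each such edit is a pull request against the central repository, which is part of why the maintenance burden scales with the number of published versions.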
This seems from the outside like a lot of maintenance time and money, a consequence of running a repository for OCaml the way you’d run an OS distro package manager. And I understand how opam got there, but I also think it’s inherently untenable for a growing library ecosystem, and way beyond the expectations of new users.
Maybe it’s excessive to apply opam’s workflow to the entirety of OCaml’s ecosystem. Compare with Go, which chose the minimal version selection strategy: manifest files only support lower bounds, and contain enough information to transitively recover a dependency graph, resolve the highest lower bounds, and download them directly from upstream, all without a package index.
So they don’t maintain a repository at all, despite Go having a lot more resources! The remaining features of a registry are performed by cache proxies, and Google’s proxy (there’s the bandwidth) builds the official package list and docs as download requests come, without any expectations about curation or long-term archival.
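The minimal version selection idea above can be sketched in a few lines of OCaml. This is a toy model, not Go’s actual implementation: versions are plain integers, and the “universe” is just a map from `pkg@version` keys to manifests.

```ocaml
(* Sketch of minimal version selection (MVS): each manifest only states
   lower bounds, and the resolver picks, for every package, the highest
   lower bound reachable in the dependency graph. *)

module SMap = Map.Make (String)

type version = int                       (* simplified: a single integer *)
type manifest = (string * version) list  (* (dependency, lower bound) *)

(* [universe] maps "pkg@version" to that version's manifest. *)
let resolve (universe : manifest SMap.t) (root : manifest) : version SMap.t =
  let rec visit selected = function
    | [] -> selected
    | (pkg, lower) :: rest ->
      (match SMap.find_opt pkg selected with
       | Some v when v >= lower ->
         (* already selected a version that satisfies this bound *)
         visit selected rest
       | _ ->
         (* bump the selection and also visit the new version's deps *)
         let selected = SMap.add pkg lower selected in
         let deps =
           match SMap.find_opt (Printf.sprintf "%s@%d" pkg lower) universe with
           | Some m -> m
           | None -> []
         in
         visit selected (deps @ rest))
  in
  visit SMap.empty root
```

The key property is that the result is fully determined by the manifests themselves, which is why no central index or constraint solver is needed.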
I skimmed quickly through the lengthy discussion, so I might have missed something, but in my original post the idea of peer-to-peer / decentralization was to reduce bandwidth needs. In the post you mention, decentralization is more about having several package providers. Like others, I think having a central repository with quality checks on packages is a big bonus… with a cost.