Central OPAM documentation site

Now that we have odoc and odig, it seems like the next step is having a centralized site that automatically builds all documentation for all packages in OPAM. We’ve been having some ad-hoc discussions about this on discord, and I’d like to kickstart a more formal discussion so we can make progress towards this goal.

As one example, we have the repo that builds docs.mirage.io. This uses only public resources such as freely available CI tools and GitHub Pages for hosting.

Is this the right approach for a universal documentation site? Is it a good first step? What do you guys think?

From running docs.mirage.io for some time, the main issue with a centralised site is coming up with a permanent URL scheme that is robust to ongoing changes in the underlying packages. docs.mirage.io URLs are transient, so there’s no notion of a permalink into it.

The biggest hassle factor with actually running the site is ensuring the coinstallability of packages. Our dune-overlays repository has a really useful CI mode that checks that everything in that overlay can coinstall, and only permits a single version to be present per package. I’ll likely shift the docs.mirage.io site to use something along those lines once it’s stable, since we can then use CI to ensure that adding a package doesn’t result in another package being unceremoniously ejected from the docs set.

In the long-term, there is a large coinstallability problem for Mirage and opam since we have several packages that can never coinstall (e.g. some of the xen or arm or esp32-only packages). I’m hoping that doc generation via a dune build @check in the duniverse will fix that, but the dune rules aren’t good enough for that just yet.

I’d be delighted to see others clone the docs.mirage.io approach and improve the scripts for your own documentation clusters, and share your own experiences with using odoc at scale. We’re working hard on odoc itself to remove some scalability issues with large codebases at the moment, so the internals of the tool are getting steadily up to where they could cope with a docs.ocaml.org.

I would reiterate my suggestion from Discord. When we submit a PR to opam-repository, all our packages are already built in a Docker-backed infrastructure, ci.ocaml.org. Therefore, we already have a container with everything built, which we can use to run odoc, extract the docs, and push them to some central repository that should basically look like the Rust Docs. Basically, it should be as easy as adding a new task that builds the following Docker container

FROM <builder>
RUN build-docs && install-docs

We can build/push just the docs for the single package being PR’d or we can push every time a full snapshot of documentation.

Anyway, to implement this, we need to alter the ci.ocaml.org infrastructure and submit a PR to it. I can’t find where the source is hosted, and it is also an open question whether the ci.ocaml.org maintainers would accept this at all.

That is actually what Nixpkgs is for and why many of our users are relying on it. Nixpkgs enables coinstallation of different versions of the same package at the same time. Of course, if you need to link not coinstallable libraries into the same application, it won’t help. But you can definitely have two applications available at the same time, where one is using core v0.11 and another core v0.13, and they may even share some common libraries.

The details contained here are the key missing step. If we did it the Rust way, then the URL scheme would be docs.ocaml.org/package/version/<html>, but the cross-references would not be shared. This would make every package doc a standalone one, unless you had a different scheme in mind?

I’m definitely interested in getting nixpkgs up and running in our infrastructure and would welcome concrete PRs towards this. But to get cross-references working, odoc essentially needs to do a massive link across a lot of cmt[i] files, and so again if they are built with different versions of dependencies the CRCs wouldn’t match.

Before accepting something into the CI infrastructure, it’s definitely important to see what it looks like as an outside service. That’s why I set up docs.mirage.io externally, so it’s easier to experiment with. I’d be happy to see a prototype of the scheme you have in mind (e.g. using the excellent cloud.drone.io) in order to evaluate how to integrate it into the opam CI. You can consider the latter to be a fancy DSL that runs pure functions over input git branches and stores the output in another git branch. So anything you build as a shell script or Docker container can easily be transplanted into that infrastructure later.

Basically, it should be as easy as …

I would suggest that it is much harder than that to build a proper documentation site. I think that what you describe would amount to publishing documentation for a constantly changing subset of the packages.

To do this properly you need to think about:

  • how you deal with conflicting packages
  • how to handle different versions of packages
  • how to version packages when their documentation depends on the versions of their transitive dependencies
  • how to provide a permanent URL scheme
  • etc.

Options that were brought up on discord:

a. Don’t worry about linking packages. Something is better than nothing as a first step.
b. Could we hash package documentation output, and reuse packages that match? We’d essentially prefix each iteration of a package’s documentation seen so far with its hash (or an id for its hash, though that would not be permanent).
c. Could we have odoc/odig always output the OPAM version of a package as part of its URL? While this wouldn’t be a perfect scheme since it would be up to devs to increment the version, it would work most of the time, and allow multiple versions of packages to coexist while keeping URLs readable (and permanent).
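
Option (b) could be sketched roughly as follows. This is only an illustration, assuming the generated HTML for one package sits in a directory tree; the function names and the `<package>/<hash prefix>/` layout are hypothetical and not something odoc or odig provide.

```python
import hashlib
from pathlib import Path


def doc_tree_hash(root: Path) -> str:
    """Return a SHA-256 digest over the relative paths and contents under root."""
    digest = hashlib.sha256()
    # Sort so the digest does not depend on directory traversal order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def hashed_doc_prefix(package: str, root: Path) -> str:
    """Hypothetical URL prefix: <package>/<first 12 hex digits of the hash>/."""
    return f"{package}/{doc_tree_hash(root)[:12]}/"
```

Two builds that produce byte-identical HTML get the same prefix and can share storage; any change in a transitive dependency that alters the output yields a new prefix.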

An ideal solution to any non-trivial problem is always non-trivial :slight_smile: What I’m seeking, for starters, is the same as Rust documentation stream, again, please click on the Rust Docs link. We don’t really need to have versions or anything like this, and all cross-references will work as expected. Yes, there will be some space overhead, which we can eliminate later, if necessary.

To summarize, each successful PR will have an extra job (run only on the latest OCaml version), which will do

odig odoc
publish-html `odig cache path`  docs.ocaml.org $package $version $PR

Where $package is the name of the package that is being built by the current $PR, $version is its version.

We can have only one version per package (e.g., the latest one), then every new PR will just override the old documentation. Or, if space allows, we can have a deeper structure, so that we can have documentation for each released version. I.e., that means that the URL structure would be
docs.ocaml.org/$package/$version or just docs.ocaml.org/$package

Important thing is that each released package will have the full set of cross-linked documentation, i.e., each docs.ocaml.org/$package will have all its dependencies (e.g., core-kernel.html will be repeated for each package that depends on it). Yes, lots of redundancy, but quite robust.

I’m broadly fine with having a per-package docs storage, since we already effectively have this in the dune-release workflow – but in this case, it usually goes to a github gh-pages URL for the package itself. Placing this HTML centrally seems reasonable, until we get the full blown cross referencing working.

However, I don’t like the idea of hooking it into a PR workflow, since we don’t really support “deploys” in the sense you are thinking of (and there are additional complexities around having multiple packages touched per PR, etc). We do have a fair bit of compute resource available on 72-core boxes in the CI infra, so a script that generates a HTML bundle given a (opam-repo-git, package, version) tuple is sufficient, and then the bulk builders can generate the gigabytes of HTML that will result. It would be nice to record a reproducible Dockerfile in the download so that each package build can be reproduced.

Contributions towards this welcome, as well as volunteers to actually keep an eye on it once it’s live. I think it’ll take some consensus building before it goes actually live on docs.ocaml.org, but we can stage it elsewhere first as a beta test. It’s probably best to have a URL scheme like <domain>/1/package/version/... so that we can rev the epoch to have the ultimate cross-referenced version that we’re aiming for with odoc.
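
To make the epoch-prefixed scheme concrete, here is a small sketch of how such URLs could be assembled. The function name, the `https://` prefix, and the `docs.example.org` domain are placeholders; only the `<domain>/<epoch>/package/version/...` shape comes from the post above.

```python
from urllib.parse import quote


def versioned_doc_url(domain: str, epoch: int, package: str, version: str,
                      page: str = "index.html") -> str:
    """Build <domain>/<epoch>/<package>/<version>/<page>, percent-encoding each segment."""
    segments = [str(epoch), package, version] + page.split("/")
    return "https://" + domain + "/" + "/".join(quote(s, safe="") for s in segments)


print(versioned_doc_url("docs.example.org", 1, "lwt", "4.2.1", "Lwt/index.html"))
```

Percent-encoding matters because opam version strings may contain characters such as `+`. Bumping the epoch from 1 to 2 would leave every existing epoch-1 link untouched.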

I see. I don’t really know how ci.ocaml.org is implemented (and btw are there any sources online?). My vision was that since we already have an image built, we can easily reuse it to build the documentation without any extra overhead. Now we have an image with documentation, and the rest is the hardest part: how to deploy docs from it. If it is impossible for some reason to push docs directly from it, we can just strip it (using COPY --from) and push it somewhere (e.g., Docker Hub). Later, a separate script will pull this image with docs and push them to the documentation site.

Yes, lots of redundancy, but quite robust.

My intuition is that this will be infeasibly large: Jane St.'s internal docs site is certainly massive, and taking n^2 of that would be a problem. Maybe the packages on opam aren’t as big though.

I don’t really know how ci.ocaml.org is implemented (and btw are there any sources online?).

It’s currently part of the https://github.com/avsm/mirage-ci/ repository. It should probably be split out and renamed.

The main pipeline is defined here:

Yep, this is expected. It is hard to give a good estimate, but I’m expecting something around 50~500 GB for the whole universe of opam packages. Sounds like a lot, but nothing impossible.

I’m expecting something around 50~500 GB for the whole universe of opam packages. Sounds like a lot, but nothing impossible

A single copy of our internal documentation site is ~50GB, so one copy of that per opam package is ~110TB. Obviously on average a package will depend on only a fraction of the rest of the documentation, but if that fraction is, say, 10% then it will still be 11TB.

As I said, it’s just my intuition, you’d need to try it to see.
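
The arithmetic behind that estimate can be written out as below. The package count of 2200 is not stated in the post; it is simply what the quoted figures imply (110 TB / 50 GB), so treat it as an assumption.

```python
def duplicated_docs_tb(full_site_gb: float, n_packages: int, dep_fraction: float) -> float:
    """Total size in TB if every package ships its own copy of the docs it links to."""
    per_package_gb = full_site_gb * dep_fraction
    return per_package_gb * n_packages / 1000  # decimal units: 1 TB = 1000 GB


# 2200 packages is implied by the figures above, not stated directly.
worst_case = duplicated_docs_tb(50, 2200, 1.0)
ten_percent = duplicated_docs_tb(50, 2200, 0.1)
print(f"worst case: {worst_case:.0f} TB, 10% of the site per package: {ten_percent:.0f} TB")
```

The total grows with the product of the package count and the average per-package dependency closure, which is where the n^2 concern comes from.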

As far as hooking into the CI is concerned, I think a good first step would be to simply exfiltrate the installed cm{i,ti,t} files from the CI build switch libdir, along with the corresponding opam metadata, to another service (which would be in charge of, for example, deduplicating them).

We then get a data set which has for each package version the files that are the roots of the documentation generation and the dependencies that were used when those were built (including the ocaml compiler that was used which is important).

With this data set in hand we can then experiment with and figure out a build and versioned cross-package linking strategy separately from the CI. A first step there would be for odoc to be able to output versioned URIs for the package name, e.g. by specifying it on the CLI at odoc compile time.

It’s a good vision, but perhaps a premature optimisation. We typically have two sorts of CI: the testing and deployment variety. With testing, it’s ok to be “best effort” since the purpose is to gather enough intelligence about regressions for a human maintainer to make a judgement call about whether to click merge or not. With deployment CI, the pipeline has to work every time.

The opam-ci is a testing one, and we often merge PRs with a red x (e.g. because a single revdep failed, or some transient infrastructure failure). This means that it’s not great for having a reliable deployment pipeline.

Hence my suggestion of having something standalone that can be turned into a reliable deployment pipeline, and also rebuilt from scratch if required. But bear in mind that we have terabytes of intermediate images built per bulk build. I’m with @lpw25 that the n^2 HTML docs might end up being really too big, but I’d also be happy to be wrong :slight_smile:

Also regarding a more best-effort, “we can have now” approach. @gasche and others have been working this spring on trying to build as many opam packages as possible and as fast as possible. Using this work and a beefy machine, a larger https://b0-system.github.io/odig/doc/ documentation set could be produced for a reasonably usable docs.ocaml.org by having something like:

  1. Fix a compiler version (say the penultimate one).
  2. Depending on available computing resources fix a frequency on which you checkout the current state of the opam repository.
  3. Using @gasche et al. work try to build a cover that includes each package at its latest version.
  4. Union the resulting libdir's of the cover elements always keeping the result for the latest version of the package if there are conflicts.
  5. odig odoc the resulting unioned libdir.

That “union” prefix will be broken compilation-wise but should remain odiggable, providing a best-effort docset for all the packages at their latest versions in the opam repository. Some inter-package links will be broken or absent (API changes or unresolvable .cmti file digests) but, depending on the actual result, it may be something we shouldn’t shy away from publishing on docs.ocaml.org until we figure out better ways.
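
Step 4 above (the union that keeps the latest version on conflict) might look like the sketch below. The data layout and function names are invented for illustration, and `version_key` is a deliberately naive ordering that only compares the numeric parts; opam's real version ordering is more involved.

```python
import re
from typing import Dict, Iterable, Tuple

# package name -> (version, path of that package's directory in some switch libdir)
Installed = Dict[str, Tuple[str, str]]


def version_key(version: str) -> Tuple[int, ...]:
    """Naive ordering on the numeric parts only. NOT opam's actual version comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def union_libdirs(cover: Iterable[Installed]) -> Installed:
    """Merge the switches of a cover, keeping the highest version of each package."""
    merged: Installed = {}
    for switch in cover:
        for pkg, (version, path) in switch.items():
            if pkg not in merged or version_key(version) > version_key(merged[pkg][0]):
                merged[pkg] = (version, path)
    return merged


cover = [
    {"core": ("v0.11.0", "/switch-a/lib/core"), "lwt": ("4.2.1", "/switch-a/lib/lwt")},
    {"core": ("v0.13.0", "/switch-b/lib/core")},
]
print(union_libdirs(cover))
```

The merged map says which directory to copy into the unioned libdir for each package before running odig odoc on it.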

Yep, I’ve slept on this idea and woke up with the same conclusion, that it is a premature optimization :slight_smile: Basically, we could write a simple documentation generator that will install a package $pkg using opam depext --install $pkg and then run odig odoc and update the main index afterwards. We can then run this generator as a cron job for every new package. And also run it once to initially fill in the package set. Basically, something like this (pardon my bash)

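function build () {
  # $1 is the package to document; IMAGE_NAME must be set by the caller.
  # The heredoc is unquoted so that $1 expands on the host, while \$(...)
  # is escaped so that odig runs inside the container at build time.
  # update-index is a placeholder for the (not yet written) index updater.
  docker build -t "$IMAGE_NAME" - << EOF
FROM ocaml/opam2
RUN opam update \
 && opam depext --install --yes $1 \
 && opam install --yes odig ocaml-manual \
 && odig odoc \
 && update-index $1 \$(odig cache path)
EOF
}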

My estimate is based on the size of the BAP documentation, which includes 136 packages and is about 800 MB. Given that not all packages are as big as BAP and that we have about 2000 packages in the opam universe, I estimated about 100 MB per package, i.e. 100 MB * 2000 = 200 GB in total.
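
Spelled out, the numbers in that estimate look like this. The post does not say how the 100 MB per-package figure was derived; it sits between the average per package in the BAP set and the size of the whole set, so it is treated here as a plain assumption.

```python
bap_docs_mb = 800
bap_packages = 136
# Average share of one package inside the BAP docset (a bit under 6 MB).
observed_avg_mb = bap_docs_mb / bap_packages

# Assumed budget per published package, each carrying its dependencies' docs.
assumed_per_package_mb = 100
opam_packages = 2000
total_gb = assumed_per_package_mb * opam_packages / 1000  # decimal units

print(f"average per BAP package: {observed_avg_mb:.1f} MB, projected total: {total_gb:.0f} GB")
```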

Assuming that there is a lot of sharing, we can compress it, using something similar to hashconsing. We can find all duplicates using the fdupes utility, which partitions all files into sets of duplicates,

fdupes -r odoc
....
3396 bytes each:
odoc/monads/Monads/Std/Monad/Core/Let_syntax/index.html
odoc/monads/Monads/Std/Monad/module-type-Core/Let_syntax/index.html

2232 bytes each:
odoc/monads/Monads/Std/Monad/Core/Monad_infix/index.html
odoc/monads/Monads/Std/Monad/module-type-Core/Monad_infix/index.html

5279 bytes each:
odoc/ocaml/Id_types/UnitId/Compilation_unit/index.html
odoc/ocaml/Id_types/module-type-UnitId/Compilation_unit/index.html

8927 bytes each:
odoc/_odoc-theme/highlight.pack.js
odoc/highlight.pack.js

And then we can reduce each group by removing all files but the first one and making a hard link from each removed file to the first one (i.e., using cp -l). If we run this on the whole universe, it will save a lot of space. And again, it is pretty trivial to implement.

Update 1: there are already utilities that do this, e.g., hardlink, trimtrees.
Update 2: in fact, this might not be necessary if we use Git to store the documentation, since Git will do this sharing for us and will use hardlinks when possible; see also git relink.
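
For reference, the hard-linking idea described above fits in a few lines. This is a rough sketch rather than a replacement for the existing utilities: it groups files by SHA-256 instead of relying on a duplicate finder, reads whole files into memory, and assumes everything under the root lives on one filesystem.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def dedupe_with_hardlinks(root: Path) -> int:
    """Replace duplicate files under root with hard links; return the bytes saved."""
    by_digest = defaultdict(list)
    files = sorted(p for p in root.rglob("*") if p.is_file() and not p.is_symlink())
    for path in files:
        by_digest[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)

    saved = 0
    for paths in by_digest.values():
        keep, *dupes = paths
        for dupe in dupes:
            if os.path.samefile(keep, dupe):
                continue  # already a hard link to the kept file
            size = dupe.stat().st_size
            # Link under a temporary name, then swap it in over the duplicate.
            tmp = dupe.with_name(dupe.name + ".tmp-link")
            os.link(keep, tmp)
            os.replace(tmp, dupe)
            saved += size
    return saved
```

Calling `dedupe_with_hardlinks(Path("odoc"))` on the tree from the listing above would turn each group of identical index.html files into links to a single inode.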

I like both options above. It’s really nice to see a fully running index of all packages. It also highlights how authors are currently under-utilizing the tag system, but that’s beside the point.

As another option raised by @wokalski on discord, for this single case of creating global documentation, we could try and use esy. esy creates a cache of all library versions and then sets up individual sandboxes using environment variables. We could build every library in esy and then have odig iterate through the directories, creating each library’s documentation. Finally, odig (or some other tool) would also have access to the cache itself, allowing for easy indexing of all library versions required for the transitive closure of all latest-version libraries in OPAM.

FYI I’ve started work on the esy option. My planned steps are:

  1. Modify odoc to support version strings as package subdirectories. WIP.
  2. Modify odig to send said version strings to odoc.
  3. Modify odig to read and follow esy's cache.
  4. Create a script that gets the current world from OPAM, puts each library in an esy project’s dependency list (separately), installs using esy, and then reads the resulting cache with odig.
  5. Host on github first, maybe somewhere else if it’s a success.
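
As a rough idea of what step 4 could generate, the sketch below writes one throwaway esy manifest per library. It assumes esy's convention of exposing opam packages under the `@opam/` npm scope; the `docs-` project name is invented, and a real manifest would likely also need an `ocaml` version constraint, which is left out here.

```python
import json


def esy_manifest(package: str, version: str) -> str:
    """Return a minimal package.json that depends on exactly one opam package."""
    manifest = {
        "name": f"docs-{package}",  # hypothetical project name
        "version": "0.0.0",
        "dependencies": {f"@opam/{package}": version},
    }
    return json.dumps(manifest, indent=2)


print(esy_manifest("lwt", "4.2.1"))
```

Installing each such project separately should populate esy's cache with the transitive closure for that library, which odig could then walk.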