Central OPAM documentation site

I don’t really know how ci.ocaml.org is implemented (and btw are there any sources online?).

It’s currently part of the https://github.com/avsm/mirage-ci/ repository. It should probably be split out and renamed.

The main pipeline is defined here:

2 Likes

Yep, this is expected. It is hard to give a good estimate, but I’m expecting something around 50~500 GB for the whole universe of opam packages. Sounds like a lot, but not impossible.

I’m expecting something around 50~500 GB for the whole universe of opam packages. Sounds like a lot, but not impossible.

A single copy of our internal documentation site is ~50 GB, so one copy of that per opam package (~2200 packages) is ~110 TB. Obviously on average packages will depend on only a fraction of the rest of the documentation, but if that fraction is, say, 10%, then it will still be 11 TB.

As I said, it’s just my intuition; you’d need to try it to see.

As far as hooking into the CI is concerned, I think a good first step would be to simply exfiltrate the installed cm{i,ti,t} files from the CI build switch libdir, along with the corresponding opam metadata, to another service (which would be in charge, for example, of deduplicating them).
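
As a rough illustration of that first step, here is a minimal shell sketch, assuming a GNU userland; the archive name and the pairing with opam list output are placeholders, and how the result is shipped to the deduplicating service is left out:

# Collect the cm{i,ti,t} files from the build switch's libdir into an archive,
# together with the opam metadata describing what was installed.
SWITCH_PREFIX=$(opam config var prefix)
find "$SWITCH_PREFIX/lib" \
  \( -name '*.cmi' -o -name '*.cmti' -o -name '*.cmt' \) -print0 \
  | tar --null -czf cm-artifacts.tar.gz -T -
opam list --installed > installed-packages.txt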

We then get a data set which has, for each package version, the files that are the roots of the documentation generation and the dependencies that were used when those were built (including the OCaml compiler that was used, which is important).

With this data set in hand we can then play with and figure out a build and versioned cross-package linking strategy separately from the CI (w.r.t. which a first step would be for odoc to be able to output versioned URIs for the package name, e.g. by specifying it on the command line at odoc compile time).

It’s a good vision, but perhaps a premature optimisation. We typically have two sorts of CI: the testing and deployment variety. With testing, it’s ok to be “best effort” since the purpose is to gather enough intelligence about regressions for a human maintainer to make a judgement call about whether to click merge or not. With deployment CI, the pipeline has to work every time.

The opam-ci is a testing one, and we often merge PRs with a red x (e.g. because a single revdep failed, or some transient infrastructure failure). This means that it’s not great for having a reliable deployment pipeline.

Hence my suggestion of having something standalone that can be turned into a reliable deployment pipeline, and also rebuilt from scratch if required. But bear in mind that we have terabytes of intermediate images built per bulk build. I’m with @lpw25 that the n^2 HTML docs might end up being really too big, but I’d also be happy to be wrong :slight_smile:

Also, regarding a more best-effort, “we can have now” approach: @gasche and others have been working this spring on trying to build as many opam packages as possible, as fast as possible. Using this work and a beefy machine, a larger documentation set in the style of https://b0-system.github.io/odig/doc/ could be produced for a reasonably usable docs.ocaml.org by doing something like the following:

  1. Fix a compiler version (say the penultimate one).
  2. Depending on available computing resources, fix a frequency at which you check out the current state of the opam repository.
  3. Using @gasche et al.'s work, try to build a cover that includes each package at its latest version.
  4. Union the resulting libdirs of the cover elements, always keeping the result for the latest version of a package if there are conflicts.
  5. odig odoc the resulting unioned libdir.

That “union” prefix will be broken compilation-wise but should remain odiggable, providing a best-effort docset for all the packages at their latest versions in the opam repository. Some inter-package links will be broken or absent (API changes or unresolvable .cmti file digests) but, depending on the actual result, it may be something we shouldn’t shy away from publishing on docs.ocaml.org until we figure out better ways.
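
As a hedged sketch of steps 4 and 5, assuming each cover element was built in its own opam switch and that a hypothetical covers.txt lists those switch prefixes, ordered so that entries containing the latest package versions come first:

# Union the lib/ directories of the cover switches. With cp -n the first
# writer wins, so the ordering of covers.txt implements "keep the latest
# version on conflict".
mkdir -p union/lib
while read -r prefix; do
  cp -rn "$prefix/lib/." union/lib/
done < covers.txt
# Make union/lib the lib directory of a switch (or copy it over one) and run:
odig odoc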

4 Likes

Yep, I’ve slept on this idea and woke up with the same conclusion, that it is a premature optimization :slight_smile: Basically, we could write a simple documentation generator that installs a package $pkg using opam depext --install $pkg, then runs odig odoc and updates the main index afterwards. We can then run this generator as a cron job for every new package, and also run it once to initially fill in the package set. Something like this (pardon my bash):

function build () {
  # $1 is the name of the opam package to document.
  # IMAGE_NAME is assumed to be set; update-index is a placeholder helper.
  docker build -t "$IMAGE_NAME" - <<EOF
FROM ocaml/opam2
RUN opam update \
 && opam depext --install --yes $1 \
 && opam install --yes odig ocaml-manual \
 && odig odoc \
 && update-index $1 \$(odig cache path)
EOF
}

My estimate is based on the size of the BAP documentation, which covers 136 packages and weighs about 800 MB. Given that not all packages are as big as BAP, and that we have about 2000 packages in the opam universe, I estimated the total at roughly 100 MB per package × 2000 packages ≈ 200 GB.

Assuming that there is a lot of sharing, we can compress it using something similar to hash-consing. We can find all duplicates using the fdupes utility, which partitions the files into sets of duplicates:

fdupes -r odoc
....
3396 bytes each:
odoc/monads/Monads/Std/Monad/Core/Let_syntax/index.html
odoc/monads/Monads/Std/Monad/module-type-Core/Let_syntax/index.html

2232 bytes each:
odoc/monads/Monads/Std/Monad/Core/Monad_infix/index.html
odoc/monads/Monads/Std/Monad/module-type-Core/Monad_infix/index.html

5279 bytes each:
odoc/ocaml/Id_types/UnitId/Compilation_unit/index.html
odoc/ocaml/Id_types/module-type-UnitId/Compilation_unit/index.html

8927 bytes each:
odoc/_odoc-theme/highlight.pack.js
odoc/highlight.pack.js

We can then reduce each group by keeping the first file and replacing every other file in the group with a hard link to it (i.e., using cp -l). If we run this on the whole universe it will save a lot of space, and again it is pretty trivial to implement.
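
A minimal sketch of that reduction, assuming GNU coreutils and fdupes, and ignoring file names containing spaces; ln -f plays the role of the cp -l mentioned above:

# fdupes -1 prints each set of duplicates on a single line; replace every
# duplicate with a hard link to the first file of its set.
fdupes -r -1 odoc | while read -r first rest; do
  for dup in $rest; do
    ln -f "$first" "$dup"
  done
done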

Update 1: there are already utilities that do this, e.g., hardlink and trimtrees.
Update 2: in fact, this might not be necessary if we use Git to store the documentation, since Git will do this sharing for us and will use hardlinks when possible; see also git relink.

3 Likes

I like both options above. It’s really nice to see a fully running index of all packages. It also highlights how authors are currently under-utilizing the tag system, but that’s beside the point.

As another option, raised by @wokalski on Discord: for this single case of creating global documentation, we could try to use esy. esy creates a cache of all library versions and then sets up individual sandboxes using environment variables. We could build every library with esy and then have odig iterate through the directories, creating each library’s documentation. Finally, odig (or some other tool) would also have access to the cache itself, allowing for easy indexing of all the library versions required for the transitive closure of all latest-version libraries in OPAM.

3 Likes

FYI I’ve started work on the esy option. My planned steps are:

  1. Modify odoc to support version strings as package subdirectories. WIP.
  2. Modify odig to send said version strings to odoc.
  3. Modify odig to read and follow esy's cache.
  4. Create a script that gets the current world from OPAM, puts each library in an esy project’s dependency list (separately), installs using esy, and then reads the resulting cache with odig.
  5. Host on GitHub first, maybe somewhere else if it’s a success.

3 Likes

This sounds good, thanks for putting the time in! Please do update here with a prototype when you get something running, even on a small set of packages. We could, for example, use the small package list from github.com/mirage/docs to try this out before throwing a lot of compute resources at a more comprehensive index.

I’m wondering: why do we need these versions? Tools like hardlink will automatically unify files, so that when we generate N universes for N packages, all duplicated files will still be hardlinked, i.e., the same version of core, if used by m packages, will still occur only once. Moreover, git itself enables this kind of unification at a lower level of granularity and will deduplicate content that occurs inside files which differ overall. Therefore, implementing this kind of unification at the semantic level would duplicate work that is already done.

The deduplication is already done by esy, and it’s done at the most efficient level, i.e., compilation doesn’t happen unless it’s needed.

What I want to do is just take esy’s cache, parse it, and map it to produce that same deduplication at the doc level. This also gives us the benefit of being able to index multiple versions of packages where needed and separate them out logically: if B depends on A v1 and C depends on A v2, odig could present both versions of A in its global index, clearly laid out.

1 Like

That’s neat, great idea :slight_smile: It’s a little bit harder to implement (requires some work, compared to hardlink’s poor man’s solution), but it sounds more interesting. It would also be nice to have global documentation for both universes, Reason and OCaml. Besides, are you aware of any endeavours to build all opam projects with esy?

opam projects that don’t randomly access the filesystem should be buildable with esy. There is a manually maintained repository for packages that don’t build with esy out of the box.

3 Likes

Just chiming in, but I think there may be simpler possibilities directly for opam:

  • we have some ways already to install all packages (or just their latest versions) in as few iterations as possible (Marracheck, or the older greedy prototype here)

  • the cleanest way to extract cmt(i)s would probably be through a post-install hook: you can specify any command to be run after every package installation, and you have access to the package name and version, the list of files it installed, and its build dir (see here for an example).

  • of course, you also have the simpler option of specifying --keep-build-dir, running all the installs, then scanning everything in SWITCH/.opam-switch/build/PKG.VERSION for artifacts (see the sketch after this list)

  • then I believe we could run odig/odoc on this pile of artifacts to generate as-complete-as-possible documentation?
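
For the --keep-build-dir option, a rough shell sketch, assuming the installs have already been run in the current switch; the artifacts/ output layout is a placeholder:

# Scan the kept build directories of the switch for compiled interface
# artifacts, attributing them to PKG.VERSION by directory name.
SWITCH_PREFIX=$(opam config var prefix)
for builddir in "$SWITCH_PREFIX"/.opam-switch/build/*; do
  pkgver=$(basename "$builddir")            # e.g. foo.1.2.3
  mkdir -p "artifacts/$pkgver"
  find "$builddir" \( -name '*.cmt' -o -name '*.cmti' -o -name '*.cmi' \) \
    -exec cp -t "artifacts/$pkgver" {} +
done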

3 Likes

@AltGr I don’t think you need to play with --keep-build-dir or post-install hooks. Just let packages install: most of them nowadays do install their cmti files. Package installs are what odig naturally consumes, see my message here. Point 3 is Marracheck.

I think it would be good to have that version first before we try to change all of the tools to support versioning and try to make them consume unconventional install structures.

Assuming Marracheck works this should simply be a matter of running programs at that point.

1 Like

Indeed; what the post-install hooks would give you is attribution of the files to the opam packages and versions, and aggregation of the libdirs across different opam install commands. But odig can already process everything without the need for that :slight_smile:

1 Like

If you run esy env you’ll get your package’s environment, and OCAMLPATH can be used to get the transitive dependency paths from the global build cache in a way that represents your current project root, with sharing among all other projects you’ve built on the system. That might be enough to prototype something to see if it’s even the best direction to begin with.

I’m not really sure what it would mean to build a centralization of all docs outside of a particular project root though. Curious to hear your thoughts on that.

@dbuenzli, @gasche: where is the code to get your best-effort listing, as seen above? I agree that given the fact that it’s already running, we should be able to get something up in minimal time using that approach.

My thought was to use esy’s dedup by (a) listing all OPAM packages and (b) creating a dummy project per package, listing only that one package as a dependency. esy would then create its DAG of dependencies under .esy, with the links in JSON files and the packages in their appropriate locations. This information can then be hoovered up by odig directly, as long as it knows how to support versions and how to read esy’s files.
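
A hedged sketch of step (b), assuming esy’s @opam/ namespace for opam packages; the project layout, the "ocaml" constraint, and the example package name are placeholders:

# Generate and build a one-dependency esy project for $pkg, so that esy
# populates its shared cache with $pkg and everything it depends on.
pkg=foo                                    # example package name
mkdir -p "docs-universe/$pkg" && cd "docs-universe/$pkg"
cat > package.json <<EOF
{ "name": "docs-$pkg",
  "dependencies": { "ocaml": "*", "@opam/$pkg": "*" } }
EOF
esy install && esy build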

The one downside of this is that it uses esy’s metadata directly. It would be nicer if there were a way to query esy for all of this information: give me the location and version of a package, followed by its dependencies’ locations and versions. We then wouldn’t be dependent on internals that could change later.