Analyzing contributions to the OCaml compiler and all opam packages

Hi there,

I recently learned of fornalder, a tool that creates nice visualizations of contributions to open-source projects by analyzing commits to their git repositories (the author used it to analyze GNOME contributions). I decided to use it to study contributions to the OCaml implementation and to OCaml open-source packages; the results are below.

The OCaml compiler distribution

This graph shows the “contributor cohorts” for the OCaml compiler over time. For example, the big dark-red bar that shows up in 2015 represents the “2015 cohort”: the long-term contributors to the OCaml compiler who made their first contribution in 2015. The dark-red bar in the same position in each following year represents the contributors from the 2015 cohort who are still active in that year. The bar shrinks over time, as some members of this cohort stop contributing. Short-term contributors (all their contributions fall within a 90-day period) are shown as the “Brief” bars at the top.
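To make the cohort definition concrete, here is a minimal Python sketch of how such a classification could be computed. This is my reading of the description above, not fornalder's actual implementation: the input format (a list of author/date pairs), the function names, and the choice to treat a span of exactly 90 days as “Brief” are all assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta

# Threshold taken from the description above; the inclusive comparison is an assumption.
BRIEF_WINDOW = timedelta(days=90)

def assign_cohorts(commits):
    """Map each author to "Brief" or to the year of their first commit.

    `commits` is an iterable of (author, date) pairs (hypothetical input format).
    """
    first, last = {}, {}
    for author, day in commits:
        first[author] = min(first.get(author, day), day)
        last[author] = max(last.get(author, day), day)
    cohorts = {}
    for author in first:
        if last[author] - first[author] <= BRIEF_WINDOW:
            cohorts[author] = "Brief"
        else:
            cohorts[author] = first[author].year
    return cohorts

def active_per_year(commits, cohorts):
    """Count, for each (year, cohort) pair, the authors with a commit that year."""
    seen = defaultdict(set)
    for author, day in commits:
        seen[(day.year, cohorts[author])].add(author)
    return {key: len(authors) for key, authors in seen.items()}

# Made-up example data, for illustration only.
commits = [
    ("alice", date(2015, 3, 1)), ("alice", date(2016, 5, 2)),
    ("bob", date(2015, 6, 1)), ("bob", date(2015, 7, 15)),
    ("carol", date(2015, 9, 9)), ("carol", date(2017, 1, 4)),
]
cohorts = assign_cohorts(commits)
counts = active_per_year(commits, cohorts)
```

In this toy example, alice and carol form the 2015 cohort, bob is “Brief”, and the 2015 cohort bar shrinks from two active contributors in 2015 to one in 2016.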

The main thing we see on this graph is that moving compiler development to GitHub in 2015 sharply increased the number of contributors, which has remained relatively stable since (there is an “expert pool” that is stable in size), with a large fraction of occasional contributors each year.

(Note: stability of contributor numbers is fine for the compiler, which is not meant to keep growing in size and complexity. We hope most contributors go to other parts of the OCaml ecosystem.)

This graph shows the number of commits from the contributors of each cohort. We see for example that the 1995 contributor, namely Xavier, has remained relatively active throughout the compiler development, with a marked uptick in 2020 (possibly related to the Multicore upstreaming effort). Today most of the commit volume seems to come from community members who started contributing right after the GitHub transition, after 2015-2016.

It’s interesting to compare these two charts: we see that the 2015 cohort has shrunk in size in 2020 (by half), but that they contributed much more in 2020 than in 2015: over time, the remaining contributors from this cohort grew in confidence/expertise/interest and are now contributing more (several of them became core maintainers, for example).

All OCaml software on opam

I then ran the same visualization tool on all OCaml git repositories listed in the public opam-repository. This is a very large subset of all open-source software implemented in OCaml. But it does not represent well the “industrial” codebases that some industrial OCaml users are working on: even when the code is open-source, it may be packaged and distributed separately.

This graph shows the number of contributors, in yearly cohorts. We can see that the number of contributors has been growing each year, plateauing in 2018.

Note: there is a measurement artefact that makes the last column smaller than the previous ones: some of the “short-term” contributors in 2020 will later become longer-term contributors by contributing again in 2021, so they will be added to the long-term cohort of 2020. This artefact may suffice to explain the small decrease in long-term contributors in 2020.
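The artefact can be illustrated with a small Python sketch. The classification rule below follows the 90-day definition given earlier, but the function name, the cutoff parameter and the example dates are made up for illustration; this is not how fornalder itself is written.

```python
from datetime import date, timedelta

BRIEF_WINDOW = timedelta(days=90)  # same threshold as in the text

def classify(dates, cutoff):
    """Classify one contributor using only the commits made on or before `cutoff`.

    Returns "Brief", the cohort year, or None if no commit is visible yet.
    """
    visible = [d for d in dates if d <= cutoff]
    if not visible:
        return None
    span = max(visible) - min(visible)
    return "Brief" if span <= BRIEF_WINDOW else min(visible).year

# Hypothetical contributor: two commits in late 2020, one more in spring 2021.
dates = [date(2020, 10, 5), date(2020, 11, 20), date(2021, 4, 12)]
as_of_2020 = classify(dates, date(2020, 12, 31))  # data collected at the end of 2020
as_of_2021 = classify(dates, date(2021, 12, 31))  # same person, one year later
```

With data collected at the end of 2020 this person counts as “Brief”; once the 2021 commit is visible, they move into the long-term 2020 cohort, which is why the last column is underestimated.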

This graph shows the volume of commits. Here we don’t see a plateau; there is in fact a small decrease in 2018, and further growth in 2019 and 2020. Another aspect I find striking is the stability of commit volume in each cohort. For example, the 2014 cohort seems to have contributed roughly as many commits in each of the years 2016-2020. Given the reduction in the number of contributors in this cohort, this is again explained by fewer contributors gradually increasing their contribution volume.

Disclaimer

Some industrial OCaml codebases are included in the public opam repository, but a large part is not.

This visualization aggregates project data assuming that the projects follow “standard” git development practices. The data is imperfect; it may be skewed by tool-generated commits. For example, some of the Jane Street software packaged on opam uses git repository mirrors that are updated automatically, usually by a single committer, in a way that does not reflect their true development activity. (Thanks to @yminsky for catching that.)

Another threat to validity is that some authors commit in different projects using different names, so they may be counted as separate contributors instead. (Inside a project, one may use a .mailmap file to merge contributor identities, but afaik there is no support in git or fornalder for overlaying an extra .mailmap file that would work across repositories.)
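For anyone replaying the analysis, one possible workaround is to merge identities in a post-processing step. The Python sketch below is only an illustration of the idea: the alias table, the tuple format and the function names are hypothetical, and nothing here is a feature of git or fornalder.

```python
def canonical_identity(email, aliases):
    """Return one identity per person, falling back to the lowercased email.

    `aliases` maps a lowercased email to the identity it should be merged into
    (a hand-written, cross-repository stand-in for a .mailmap file).
    """
    email = email.strip().lower()
    return aliases.get(email, email)

def distinct_contributors(commits, aliases=None):
    """Set of identities over (repo, name, email) tuples (hypothetical format)."""
    aliases = aliases or {}
    return {canonical_identity(email, aliases) for _repo, _name, email in commits}

# Made-up data: the same person commits to two repositories under two identities.
commits = [
    ("repo-a", "Jane Doe", "jane@example.org"),
    ("repo-b", "jdoe", "Jane.Doe@work.example.com"),
    ("repo-b", "Sam Roe", "sam@example.org"),
]
aliases = {"jane.doe@work.example.com": "jane@example.org"}

naive = distinct_contributors(commits)
merged = distinct_contributors(commits, aliases)
```

Without the alias table the example counts three contributors; with it, two. The hard part in practice is building the alias table, which this sketch does not address.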

If you wish to study the dataset to see if the overall conclusions are endangered by such anomalies, please feel free to replay the data-collection steps. You can either manually inspect git repositories, or play with the SQLite database generated by fornalder.

My take away

I found this analysis interesting. Here are my conclusions so far:

  • The OCaml community gets a regular influx of new contributors.

  • Some of our contributors stay for a long period, and they contribute more and more over time.

  • We observe a plateauing number of new contributors over the years 2018-2020 (and the pandemic is probably not going to improve the figure for 2021), but the volume of commits keeps growing.

It is difficult to draw definitive conclusions from these visualizations, especially as we don’t have them for many other communities to compare to. Compared to the GNOME trends shown in the original blog post (“On the Graying of GNOME”, on the blog Et tu, Cthulhu), I would say that we are doing “better” than the GNOME ecosystem (in terms of attracting new contributors).

My personal view for now is that OCaml remains a more niche language than “mainstream” contenders (we don’t see an exponential growth here that would change the status), but that its contributor flow is healthy.

Reproduction information

You can find a curated log of my analysis process in logs.md; this should contain enough information for you to reproduce the result, and it could easily be adapted to other software communities.

I uploaded all the small-enough data of my run in this repository, in particular the list of URLs I tried to clone (some of them failed). Not included: the cloned git repositories, and the databases built by fornalder to store its analysis data.

This is an extremely cool analysis, thanks for posting it @gasche! I’m trying to think of any systemic reasons for the plateauing of new contributors in 2018/2019, but the only thing I can come up with is that there are more private industrial codebases employing OCaml developers. Anecdotally, the number of jobs across OCaml/Reason seems to be on the up in the past few years.

I’ll have a go at reproducing your methodology after the academic term here finishes. One thing we’d be very happy to take PRs for in the opam-repository are improvements to metadata to assist with this sort of research. For instance, filtering out dev-repo entries for non-OCaml projects seems like an immediate win and would simplify the data collection.

One hypothesis I considered is that some contributions have moved away from the opam-repository and are happening directly in npm, thanks to esy. I ran a similar analysis (logs) on all npm packages tagged ocaml, but the results are inconclusive (I may be missing OCaml packages on npm that are not tagged).

npm “ocaml” contributors

npm “ocaml” commits

(If you are wondering about the long trail between 2003 and 2009: this is the development of bs-sedlex, which goes back to an old OCaml-only prototype by Alain Frisch in 2003. There is also a version of OCaml packaged on npm, but I removed it from the analysis as it was adding noise and was mostly not-esy-specific contributions.)

Note that we are talking about ~4K commits here, which remains fairly small compared to the ~120K commits for opam-repository packages over the last year. When I tried to merge both sets together, this didn’t make much of a difference compared to the opam-only numbers.

Maybe someone should redo the analysis with “reason/rescript” tags in addition, to measure the contribution volume there. I stuck with packages that self-identify as “ocaml” for now.

Thanks @avsm for (the kind words and) the suggestion to add a sort of “non-ocaml” tag on the opam side, this is a nice idea. (I could also consider doing some fixing of the dev-repo URLs, if there is an agreement on what the expected format is there.) I will open an issue on opam-repository to discuss it further with the maintainers.

It might be fun to see this for the batteries git repository.

Batteries

Here are graphs for batteries-included.

Cohorts, per number of contributors:

Cohorts, per volume of commits:

What we see, I think, is that Batteries has been fairly quiet since 2015, which probably corresponds to entering some kind of “maintenance mode”. There is still a reasonable diversity of contributors, with many one-shot contributors (which I assume corresponds to users who are mostly happy silently using the library, and come to add a function or fix a bug once in a while).

Looking at the volume of commits: the strong decrease of the gray bar in 2018 corresponds, I think, to when I stopped contributing actively, and you took over as a contributor. It looks like I was effectively the last of the early-day contributors still active. The purple “2013” cohort is interesting, and I went to look at the data: it’s you (François Berenger) and Simon @c-cube Cruanes. Simon contributed a lot over a short period, and then went off to create the very nice Containers library that would move faster. You stuck around and are now the most active contributor (and maintainer).

Containers

Contributors:

Commits:

Containers is mostly a one-person library with Simon doing most of the work. There were many new contributors in 2017 and 2018 (most of them brief), and the strong showing of the purple 2018 cohort in today’s commit volume is mostly due to the enigmatic Fardale.

Disclaimer

I think that fornalder is more useful to study large repositories (or sets of repositories) that have been going for many years. For a single project, especially if it is relatively small or young, git shortlog -n -s (over the whole log or --since 2018, etc.) tells you mostly the same thing.
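If you want the shortlog numbers in a form you can process further, here is a small Python sketch that runs the command and parses its output. The function names are my own, and the sketch assumes that each output line is a commit count followed by a tab and the author name; passing HEAD explicitly is there because git shortlog otherwise reads from standard input when it is not attached to a terminal.

```python
import subprocess

def parse_shortlog(text):
    """Parse `git shortlog -n -s` output into (author, commit_count) pairs.

    Assumes each non-empty line looks like "<count>\t<author>".
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        count, _, author = line.strip().partition("\t")
        rows.append((author, int(count)))
    return rows

def shortlog(repo, since=None):
    """Run git shortlog in the repository at `repo` (a path) and parse the result."""
    cmd = ["git", "-C", repo, "shortlog", "-n", "-s", "HEAD"]
    if since:
        # Insert the date filter before the revision argument.
        cmd.insert(-1, f"--since={since}")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_shortlog(result.stdout)

# Parsing a made-up sample, without needing a repository at hand.
sample = "    12\tAlice Smith\n     3\tBob\n"
rows = parse_shortlog(sample)
```

For example, shortlog("path/to/repo", since="2018") would give the per-author commit counts since 2018, sorted by decreasing count.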
