Hi there,
I recently learned of fornalder, a tool that creates nice visualizations of contributions to open-source projects by analyzing commits to their git repositories (the author used it to analyze GNOME contributions). I decided to use it to study contributions to the OCaml implementation and to OCaml open-source packages; the results are below.
The OCaml compiler distribution
This graph shows the “contributor cohorts” for the OCaml compiler over time. For example, the big dark-red bar that shows up in 2015 represents the “2015 cohort”: the number of long-term contributors to the OCaml compiler who made their first contribution in 2015. The dark-red bar in a similar position in each following year represents the contributors from the 2015 cohort who are still active in that year. The bar shrinks over time, as some members of this cohort stop contributing. Short-term contributors (all their contributions fall within a 90-day period) are shown as the “Brief” bars at the top.
The main thing we see on this graph is that moving compiler development to GitHub in 2015 sharply increased the number of contributors, which has remained relatively stable since (there is an “expert pool” that is stable in size), with a large fraction of occasional contributors each year.
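To make the cohort definition concrete, here is a minimal Python sketch of how contributors could be classified from a list of commits. It is only an illustration of the definition given above, not fornalder's actual implementation: the `(author, date)` input format, the `classify` function name, and the choice to treat a span of exactly 90 days as still “Brief” are all assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta

# 90-day threshold taken from the description above; the inclusive
# comparison below is an assumption.
BRIEF_WINDOW = timedelta(days=90)

def classify(commits):
    """Map each author to (cohort label, sorted list of active years).

    `commits` is an iterable of (author, date) pairs (hypothetical format).
    """
    dates_by_author = defaultdict(list)
    for author, day in commits:
        dates_by_author[author].append(day)

    result = {}
    for author, days in dates_by_author.items():
        first, last = min(days), max(days)
        # "Brief" if all contributions fall within the window,
        # otherwise the cohort is the year of the first contribution.
        label = "Brief" if last - first <= BRIEF_WINDOW else str(first.year)
        result[author] = (label, sorted({d.year for d in days}))
    return result

# Made-up example data.
commits = [
    ("alice", date(2015, 3, 1)),
    ("alice", date(2017, 6, 10)),
    ("bob", date(2015, 4, 2)),
    ("bob", date(2015, 5, 20)),
]
cohorts = classify(commits)
```

With this made-up data, `alice` lands in the 2015 cohort and is counted as active in 2015 and 2017, while `bob` is counted in the “Brief” bar for 2015.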
(Note: stability of contributor numbers is fine for the compiler, which is not meant to keep growing in size and complexity. We hope most contributors go to other parts of the OCaml ecosystem.)
This graph shows the number of commits from the contributors of each cohort. We see for example that the 1995 contributor, namely Xavier, has remained relatively active throughout the compiler development, with a marked uptick in 2020 (possibly related to the Multicore upstreaming effort). Today most of the commit volume seems to come from community members who started contributing right after the GitHub transition, after 2015-2016.
It’s interesting to compare these two charts: we see that the 2015 cohort had shrunk by half by 2020, but that its members contributed much more in 2020 than in 2015. Over time, the remaining contributors from this cohort grew in confidence/expertise/interest and are now contributing more (several of them became core maintainers, for example).
All OCaml software on opam
I then ran the same visualization tool on all OCaml git repositories listed in the public opam-repository. This is a very large subset of all open-source software implemented in OCaml. However, it does not represent well the “industrial” codebases that some industrial OCaml users are working on: even when the code is open source, it may be packaged and distributed separately.
This graph shows the number of contributors, in yearly cohorts. We can see that the number of contributors has been growing each year, plateauing in 2018.
Note: there is a measurement artefact that makes the last column smaller than the previous ones: some of the “short-term” contributors in 2020 will later become longer-term contributors by contributing again in 2021, so they will be added to the long-term cohort of 2020. This artefact may suffice to explain the small decrease in long-term contributors in 2020.
This graph shows the volume of commits. Here we don’t see a plateau; there is in fact a small decrease in 2018, and further growth in 2019 and 2020. Another aspect I find striking is the stability of commit volume in each cohort. For example, the 2014 cohort seems to have contributed roughly as many commits in each of the years 2016-2020. Given the reduction in the number of contributors in this cohort, this is again explained by fewer contributors gradually increasing their contribution volume.
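The artefact can be illustrated with a small Python sketch. The 90-day threshold comes from the “Brief” definition earlier in this post; the `is_brief` helper and the commit dates are made up for illustration and are not taken from fornalder or from the dataset.

```python
from datetime import date, timedelta

# Threshold from the "Brief" definition; inclusive comparison is an assumption.
BRIEF_WINDOW = timedelta(days=90)

def is_brief(commit_dates):
    """True if all commits fall within the brief window (hypothetical helper)."""
    return max(commit_dates) - min(commit_dates) <= BRIEF_WINDOW

# A contributor who starts in late 2020 (made-up dates).
history = [date(2020, 10, 5), date(2020, 11, 12)]
brief_at_end_of_2020 = is_brief(history)

# The same contributor commits again in 2021: their span now exceeds
# 90 days, so they would be moved into the long-term 2020 cohort.
history_later = history + [date(2021, 4, 1)]
brief_after_2021_commit = is_brief(history_later)
```

Here the contributor is counted as “Brief” when the data stops at the end of 2020, and as a long-term member of the 2020 cohort once the 2021 commit is known.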
Disclaimer
Some industrial OCaml codebases are included in the public opam repository, but a large part is not.
This visualization aggregates project data assuming that they follow “standard” git development practices. The data is imperfect, and it may be skewed by tool-generated commits. For example, some of the Jane Street software packaged on opam uses git repository mirrors that are updated automatically, usually by a single committer, in a way that does not reflect their true development activity. (Thanks to @yminsky for catching that.)
Another threat to validity is that some authors commit in different projects using different names, so they may be counted as several separate contributors. (Inside a project, one may use a .mailmap file to merge contributor identities, but afaik there is no support in git or fornalder for overlaying an extra .mailmap file that would work across repositories.)
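As a rough idea of what a cross-repository identity overlay could look like, here is a small Python sketch that merges aliases before counting contributors. Everything in it is hypothetical: the `ALIASES` table, the `(repo, name, email)` record format, and the names themselves are invented for illustration, and this is not something git or fornalder provides.

```python
# Hypothetical alias table applied across all repositories:
# raw (name, email) -> canonical contributor name.
ALIASES = {
    ("A. Hacker", "ah@example.org"): "Alice Hacker",
    ("alice", "alice@example.com"): "Alice Hacker",
}

def canonical(name, email):
    """Return the canonical identity, falling back to the raw name."""
    return ALIASES.get((name, email), name)

def count_contributors(commits):
    """Count distinct contributors after merging known aliases."""
    return len({canonical(name, email) for _repo, name, email in commits})

# Made-up commit records: (repository, author name, author email).
commits = [
    ("repo-a", "A. Hacker", "ah@example.org"),
    ("repo-b", "alice", "alice@example.com"),
    ("repo-b", "Bob Builder", "bob@example.net"),
]
raw_count = len({(name, email) for _repo, name, email in commits})
merged_count = count_contributors(commits)
```

In this toy example the raw data shows three identities, but only two contributors remain once the aliases are merged.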
If you wish to study the dataset to see if the overall conclusions are endangered by such anomalies, please feel free to replay the data-collection steps. You can either manually inspect git repositories, or play with the SQLite database generated by fornalder.
My takeaway
I found this analysis interesting. Here are my conclusions so far:
- The OCaml community gets a regular influx of new contributors.
- Some of our contributors stay for a long period, and they contribute more and more over time.
- We observe a plateau in the number of new contributors over the years 2018-2020 (and the pandemic is probably not going to improve the figure for 2021), but the volume of commits keeps growing.
It is difficult to draw definitive conclusions from these visualizations, especially as we don’t have them for many other communities to compare to. Compared to the GNOME trends shown in the original blog post (On the Graying of GNOME | Et tu, Cthulhu), I would say that we are doing “better” than the GNOME ecosystem (in terms of attracting new contributors).
My personal view for now is that OCaml remains a more niche language than “mainstream” contenders (we don’t see an exponential growth here that would change the status), but that its contributor flow is healthy.
Reproduction information
You can find a curated log of my analysis process in logs.md; this should contain enough information for you to reproduce the result, and it could easily be adapted to other software communities.
I uploaded all the small-enough data from my run in this repository, in particular the list of URLs I tried to clone (some of them failed). Not included: the cloned git repositories, and the databases built by fornalder to store its analysis data.