[ANN] Sandmark Nightly - Benchmarking as a Service

Tarides is happy to announce Sandmark Nightly benchmarking as a service. tl;dr OCaml compiler developers can now point development branches at the service and get sequential and parallel benchmark results at https://sandmark.tarides.com.

Sandmark is a collection of sequential and parallel OCaml benchmarks, their dependencies, and the scripts to run the benchmarks and collect the results. Sandmark was developed for the Multicore OCaml project in order to ensure that (a) OCaml 5 (with multicore support) does not introduce regressions for sequential programs compared to sequential OCaml 4, and (b) OCaml 5 programs scale well with multiple cores. In order to reduce noise and get actionable results, Sandmark is typically run on tuned machines. This makes it harder for OCaml developers who may not have tuned machines with a large number of cores to use Sandmark during development.

To address this, we introduce the Sandmark Nightly service, which runs the sequential and parallel benchmarks for a set of compiler variants (branch/commit/PR + compiler & runtime options) on two tuned machines:

  • Turing (28 cores, Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 64 GB RAM)
  • Navajo (128 cores, AMD EPYC 7551 32-Core Processor, 504 GB RAM)

OCaml developers can request that their development branches be added to the nightly runs by adding them to sandmark-nightly-config. The results will appear the following day at https://sandmark.tarides.com.

Here is an illustration of sequential benchmark results from the service:

You should first specify the number of variants that you want to compare, and then select either the navajo or turing hostname. The dates for which benchmark results are available are then listed in the date column. If there is more than one result on a given day, the specific variant name, SHA1 commit, and date are displayed together for selection. You need to choose one of the variants as the baseline for comparison. In the following graph, the 5.1.0+trunk+sequential_20220712_920fb8e build on the navajo server has been chosen as the baseline, and you can see the normalized time (seconds) comparison across the Sandmark benchmarks for both the 5.1.0+trunk+sequential_20220713_c759890 and 5.1.0+trunk+sequential_20220714_606abe8 variants. We observe that the matrix_multiplication and soli benchmarks have become 5% slower compared to the July 12, 2022 nightly run.
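For readers who want to redo such a comparison offline, the normalization itself is just a per-benchmark ratio against the baseline. Here is a minimal Python sketch; the timings below are illustrative, not actual service output:

    # "Normalized time" = variant time / baseline time for the same benchmark,
    # so 1.00 means "same speed" and 1.05 means "5% slower than the baseline".
    # Illustrative numbers only; real data comes from the service's results.
    baseline = {"matrix_multiplication": 5.20, "soli": 0.0100}
    variant  = {"matrix_multiplication": 5.46, "soli": 0.0105}

    normalized = {
        bench: variant[bench] / baseline[bench]
        for bench in baseline
        if bench in variant
    }
    for bench, ratio in sorted(normalized.items()):
        print(f"{bench}: {ratio:.2f}x")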

Similarly, the normalized MaxRSS (KB) graph for the same baseline and variants chosen for comparison is illustrated below:

The mandelbrot6 and fannkuchredux benchmarks show a 3% increase in MaxRSS (KB) compared to the baseline variant, whereas the metric has improved significantly for the lexifi-g2pp and sequence_cps benchmarks.

The parallel benchmark speedup results are also available from the Sandmark nightly runs.

We observe from the speedup graph that there is not much difference between the 5.1.0+trunk+parallel_20220714_606abe8 and 5.1.0+trunk+decouple_20220706_eb7a38d developer branch results. The x-axis in the graph represents the number of domains, while the y-axis corresponds to the speedup. The number in parentheses next to each benchmark is the running time of the corresponding sequential benchmark. These comparison results are useful for spotting performance regressions over time. It is recommended to use the Turing machine results for the parallel benchmarks, as it is tuned.
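For reference, the speedup plotted here is the sequential running time divided by the parallel running time at each domain count; perfect scaling would put every point on the diagonal. A minimal sketch with made-up timings:

    # Speedup at d domains = sequential time / parallel time with d domains.
    # The sequential time is the number shown in parentheses next to each
    # benchmark in the graph. All timings below are made up for illustration.
    sequential_time = 42.0  # seconds
    parallel_times = {1: 43.1, 2: 22.0, 4: 11.5, 8: 6.3}

    for domains, t in sorted(parallel_times.items()):
        print(f"{domains} domains: speedup {sequential_time / t:.2f}")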

If you would like to use Sandmark nightly for OCaml compiler development, please do ping us for access to the sandmark-nightly-config repository so that you may add your own compiler variants.


This is great!

Ping.

Some questions:

  • Have you considered documenting this feature in ocaml/HACKING.adoc?
  • What’s the expected way to “ping to get sandmark-nightly-config access”? (send an email to whom? post here? create an issue there?) What’s the expected result of the ping? Do I get push access to the repository?
  • I’m not fully sure what the process is for using sandmark-nightly-config once I have “pinged people” and gotten access. Am I expected to push directly? (What if I break stuff?) Can we just submit PRs to that repository? (Do we need to request access first if we are happy with PRs?)
  • Could this “how to actually get access and use the sandmark-nightly-config repository” information be documented somewhere, for example in sandmark-nightly-config itself? (It already has useful information about the data format, but not about the contribution/request process.)

Have you considered documenting this feature in ocaml/HACKING.adoc?

I will document the feature in ocaml/HACKING.adoc and create a PR for review.

  • What’s the expected way to “ping to get sandmark-nightly-config access”? (send an email to whom? post here? create an issue there?).

If you have feature requests, please feel free to file a GitHub issue for them.

Am I expected to push directly? (What if I break stuff?)

We do have a couple of sanity checks in place that verify the JSON parses correctly and that valid URLs have been provided; please see the checks in the repository.
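For illustration only, such a check can be as small as the following sketch. This is not the repository's actual script; the entry keys are taken from the example entry shown later in this thread, and the array layout of the config files is an assumption:

    # Hypothetical sketch of a config sanity check, not the repository's
    # actual CI script. Assumes each config file holds a JSON array of
    # entries with "url", "name" and "expiry" keys.
    import json
    import sys
    import urllib.request

    REQUIRED_KEYS = {"url", "name", "expiry"}

    def check_config(path):
        with open(path) as f:
            entries = json.load(f)  # fails loudly on malformed JSON
        for entry in entries:
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                sys.exit(f"{path}: entry missing keys {missing}")
            # A HEAD request is enough to confirm the archive URL resolves.
            req = urllib.request.Request(entry["url"], method="HEAD")
            urllib.request.urlopen(req)

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            check_config(path)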

Can we just submit PRs to that repository? (Do we need to request access first if we are happy with PRs?)

Yes, please submit PRs to add your developer branches to the JSON files at sandmark-nightly-config/config/. If the PR or branch needs to be run on both machines, then both custom_navajo.json and custom_turing.json need to be updated with your entries.

Could this “how to actually get access and use the sandmark-nightly-config repository” information be documented somewhere, for example in sandmark-nightly-config itself? (It already has useful information about the data format, but not about the contribution/request process.)

Sure. I shall update the README with the above information. Thanks for your questions and suggestions!

I think this is great and will greatly help track the performance impact of OCaml PRs. We currently have reports that OCaml 4.14 is about 8% slower than 4.07 on some Coq benchmarks (despite having the prefetching patch). I am trying to track down where this comes from, and some continuous monitoring could help. I also noticed great variability in latency when redoing old latency-focused micro-benchmarks.

Here are a few things which may be worth opening PRs for (?):

  • Are there plans to benchmark for GC-induced latency? (And more generally GC performance)
  • Are there plans to add some benchmarks for the Coq proof assistant? The easiest would be to measure the CPU time for installing various sets of Coq libraries with opam (which are known to stress different parts of Coq). Do you already have a benchmark in this area (e.g. symbolic computing)?
  • Having some way to find the branch without knowing the date? Currently, IIUC, I have to find out when the benchmark was run to locate it in the UI.
  • There always seems to be some variability in the results (e.g. +/- 10% at the extrema). Can you give us a rule of thumb to recognize what is considered a non-significant difference?

Which model do you prefer if I want to benchmark my own potential PRs:

  1. one long-lived entry for a personal branch where I push whatever I want whenever I want to benchmark a change, or
  2. many short-lived entries for different branches (each time sending you a PR to the above repo)?

We currently have reports that OCaml 4.14 is about 8% slower than 4.07 on some Coq benchmarks

We do have Sandmark benchmarks building for 4.14.0, but they have not been added to the nightly runs yet. There is an open request for this at Support more than one development variant for nightly runs. We will add it shortly.

Are there plans to benchmark for GC-induced latency? (And more generally GC performance)

There is an ongoing effort to:

(a) bring GC latency analysis support to Sandmark using olly (Issue #362); latency analysis support is something that we used to have, and

(b) bring perf stat analysis (instructions retired, context switches, page faults, etc.) to Sandmark Nightly as well. This is tracked in sandmark-nightly Issue #66.

Are there plans to add some benchmarks for the Coq proof assistant?

We used to run Coq benchmarks in Sandmark, but they are currently disabled, waiting on Coq: Resurrect CI job for OCaml trunk and on a few patches that need to be merged.

Having some way to find the branch without knowing the date?

Sorry, can you elaborate with an example?

Do you already have a benchmark in this area (e.g. symbolic computing)?

The present Coq benchmarks are at sandmark/benchmarks/coq and fraplib. If you need a specific benchmark to be included, please feel free to file an issue at Sandmark GitHub issues.

There always seems to be some variability in the results (e.g. +/- 10% at the extrema). Can you give us a rule of thumb to recognize what is considered a non-significant difference?

You may observe up to 10% variance due to micro-architectural effects, so differences within that band are generally not significant.
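Applied mechanically, that rule of thumb looks like the following sketch; the 10% band comes from the answer above, and the helper itself is just illustrative (the 1.64x example uses the soli timings quoted later in this thread):

    NOISE = 0.10  # up to ~10% variance from micro-architectural effects

    def significant(baseline_s, variant_s, noise=NOISE):
        # Flag a difference only if it falls outside the noise band.
        return abs(variant_s / baseline_s - 1.0) > noise

    print(significant(0.0072, 0.0118))  # 1.64x slower -> True, investigate
    print(significant(5.20, 5.46))      # 1.05x -> False, likely noise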

Which model do you prefer if I want to benchmark my own potential PRs:

  1. one long-lived entry for a personal branch where I push whatever I want whenever I want to benchmark a change, or
  2. many short-lived entries for different branches (each time sending you a PR to the above repo)?

Many short-lived ones are better, as long-lived ones will be run over and over again.

For instance, it would be nice to find benchmark results by their branch name (e.g. 4.12.0+stock) or their commit (e.g. 14856602). Currently it seems one has to first enter the date manually.

Many short-lived ones are better, as long-lived ones will be run over and over again.

I see. Would it be possible to check if the commit_id has already been benchmarked and abort in this case? I’ll send short-lived ones in the meantime, but this might suffer from the same issue (unless you are ready to accept one PR for each benchmark run).

it would be nice to find benchmark results by their branch name (e.g. 4.12.0+stock) or their commit (e.g. 14856602).

I have created a GitHub issue to track this UI feature request: sandmark-nightly issue #67.

Would it be possible to check if the commit_id has already been benchmarked and abort in this case?

The nightly run pipeline should be updated to perform this check. A new issue has been created: sandmark-nightly-config issue #6.
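Until that lands, the check itself is simple. Here is a hypothetical sketch; the results directory and the convention that result names embed the short commit SHA (as the variant names on sandmark.tarides.com do) are both assumptions, not the actual pipeline code:

    # Hypothetical sketch: skip the nightly run when results for this
    # commit already exist. Assumes result file names embed the short SHA
    # (as in 5.1.0+trunk+sequential_20220712_920fb8e).
    import pathlib
    import sys

    def already_benchmarked(results_dir, short_sha):
        return any(short_sha in p.name
                   for p in pathlib.Path(results_dir).iterdir())

    if already_benchmarked("/path/to/results", "920fb8e"):
        sys.exit("commit already benchmarked; aborting this run")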

Thanks for these useful feature requests!

Here is a pull request that I’d love to see Sandmark-ed against the current trunk: Speed up register allocation by permanently spilling registers by stedolan · Pull Request #11102 · ocaml/ocaml. Can someone help? Thanks!

Sure! We will run Sandmark with the current trunk and with PR #11102, and we will share the results shortly.

The results from the July 20, 2022 run for 5.1.0+trunk+f1223b6 and 5.1.0+trunk+stedolan+pr11102 are available at https://sandmark.tarides.com. The following selection can be made on the website:


The normalized time and MaxRSS comparison graphs are provided for reference:


More graphs can be viewed on the web page. For the parallel benchmarks, the Turing machine results are shown below:

The following entry has been added to the sandmark-nightly-config/config files to trigger the nightly benchmark runs:

{
  "url": "https://github.com/stedolan/ocaml/archive/refs/heads/spill-permanently.zip",
  "name": "5.1.0+trunk+stedolan+pr11102",
  "expiry": "2022-07-26"
}

In the future, please feel free to open PRs adding your developer branches to the sandmark-nightly-config repository for the nightly runs. We are currently tracking 5.1.0+trunk. The README has examples of adding a PR, a specific commit, or a developer branch. If you have any specific questions, please do let us know. Thank you!

Thanks a lot for the benchmark results. For those who wonder how to read the “normalized time” graph: yes, the “soli” benchmark runs 1.6 times slower with the PR (0.0118s instead of 0.0072s). This can be measurement noise (with such a small running time), or something significant. Will have a look.