About Multicore

kayceesrk · May 15, 2017, 2:08pm

A topic for discussing the Multicore OCaml project. Multicore OCaml adds native support for concurrency (through effect handlers) and parallelism. The Multicore OCaml project wiki has information on installation, papers, articles, talks and examples.

If you are looking for the current status of multicore and the list of things to do, check out the projects page.

avsm · May 15, 2017, 2:31pm

An immediate update that it would be great to get feedback on is a recent paper draft from a few weeks ago on “Concurrent Systems Programming with Effect Handlers”

http://kcsrk.info/papers/system_effects_may_17.pdf

This proposes a monad-free model for doing direct-style IO, but with custom schedulers written in OCaml.
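
A minimal sketch of what a custom scheduler written in OCaml can look like. It assumes the `Effect` module from the OCaml 5 standard library, which postdates this thread, rather than the 2017 multicore syntax. The names `Fork`, `Yield`, `run` and `trace` are illustrative and not taken from the paper.

```ocaml
open Effect.Deep

(* Illustrative effects: start a new fiber, or give up the processor. *)
type _ Effect.t +=
  | Fork : (unit -> unit) -> unit Effect.t
  | Yield : unit Effect.t

let fork f = Effect.perform (Fork f)
let yield () = Effect.perform Yield

(* A round-robin scheduler: suspended fibers wait in a FIFO run queue. *)
let run main =
  let run_q : (unit -> unit) Queue.t = Queue.create () in
  let enqueue task = Queue.push task run_q in
  let dequeue () =
    if Queue.is_empty run_q then () else (Queue.pop run_q) ()
  in
  let rec spawn f =
    match_with f ()
      { retc = (fun () -> dequeue ());   (* fiber finished: run the next one *)
        exnc = raise;
        effc = fun (type a) (e : a Effect.t) ->
          match e with
          | Yield -> Some (fun (k : (a, unit) continuation) ->
              enqueue (fun () -> continue k ());
              dequeue ())
          | Fork f' -> Some (fun (k : (a, unit) continuation) ->
              enqueue (fun () -> continue k ());
              spawn f')
          | _ -> None }
  in
  spawn main

(* Direct-style fibers: no monadic bind, just ordinary sequencing. *)
let trace = Buffer.create 32
let log s = Buffer.add_string trace s

let () =
  run (fun () ->
    fork (fun () -> log "a1 "; yield (); log "a2 ");
    fork (fun () -> log "b1 "; yield (); log "b2 ");
    log "main ")
```

The fibers are written as plain sequential code; the interleaving policy lives entirely in `run`, so a different scheduler could be swapped in without touching them.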

bluddy · May 15, 2017, 5:01pm

Woah – any idea why Go is kicking OCaml’s behind, even on one core?

rgrinberg · May 15, 2017, 7:28pm

Where are the Lwt benchmarks? I’m pretty sure Httpaf supports Lwt as well.

kayceesrk · May 15, 2017, 8:15pm

The httpaf fork I’m using doesn’t yet support Lwt. But it might be in the works or in a branch I don’t have access to.

We’re investigating why Go is faster. One explanation is that accept loops are getting starved in the OCaml case. This paper explains accept loop starvation in event-based servers: http://www.read.seas.harvard.edu/~kohler/class/04s-readable/acceptable.pdf. We’re currently investigating ways to prevent accept loop starvation. The idea would be to accept as many connections as possible, quickly, once the client makes connection requests.
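
The "accept as many as possible" idea can be sketched as a batched drain of the pending connections before going back to request processing. This is only an illustration: `try_accept` is a hypothetical stand-in for a non-blocking `accept` that returns `None` when nothing is pending, and `max_batch` is an assumed cap so that request handling is not starved in turn. Neither comes from the httpaf fork discussed here.

```ocaml
(* [try_accept] is hypothetical: it models a non-blocking accept that
   returns [Some conn] when a connection is pending and [None] otherwise. *)
let drain_accepts ~try_accept ~max_batch =
  let rec loop acc n =
    if n >= max_batch then List.rev acc
    else
      match try_accept () with
      | Some conn -> loop (conn :: acc) (n + 1)
      | None -> List.rev acc
  in
  loop [] 0

(* Simulated backlog of three pending connections. *)
let pending = Queue.of_seq (List.to_seq [ "c1"; "c2"; "c3" ])

let batch =
  drain_accepts ~try_accept:(fun () -> Queue.take_opt pending) ~max_batch:2
```

With a batch cap of 2, the first two simulated connections are accepted in one go and the third stays queued for the next round.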

avsm · May 16, 2017, 1:39am

It’s also just very early days in terms of constructing useful IO benchmarks – as @kayceesrk notes, we are still distinguishing between compiler/runtime bottlenecks (GC, allocation, contention) vs API-level bottlenecks (accept, listen, and backpressure). Go has been expertly tuned on both counts, so our early benchmarks are very encouraging, as we have a lot of optimisation potential left in the bag.

orbitz · May 16, 2017, 5:27pm

Is there a tl;dr for how multicore OCaml will change the memory model? Will I now have to worry about mutexes and atomic operations and all?

vramana · May 16, 2017, 5:59pm

You can read about multicore memory model here

kayceesrk · May 16, 2017, 8:38pm

If you are building stuff using the multicore compiler, there are nicer alternatives available such as Reagents. If you are building stuff for the multicore compiler (such as lock-free libraries, high-performance web server engines, etc.), then you’d have to worry about mutexes and atomic operations. In fact, as a developer using multicore OCaml, you shouldn’t ever have to use atomic operations, mutexes and condition variables.
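
For a sense of what the low-level layer looks like, here is a small sketch of a shared counter updated from several domains. It assumes the `Atomic` and `Domain` modules of the OCaml 5 standard library, which postdate this thread; the 2017 multicore branch may have exposed these operations differently. The function name `parallel_count` is illustrative.

```ocaml
(* Each domain increments a shared atomic counter [per_domain] times.
   Atomic.incr is a single indivisible update, so no mutex is needed. *)
let parallel_count ~domains ~per_domain =
  let counter = Atomic.make 0 in
  let worker () =
    for _ = 1 to per_domain do
      Atomic.incr counter
    done
  in
  let handles = List.init domains (fun _ -> Domain.spawn worker) in
  List.iter Domain.join handles;
  Atomic.get counter

let total = parallel_count ~domains:4 ~per_domain:10_000
```

Replacing the atomic with a plain `ref` would make the increments race, which is exactly the kind of concern that higher-level libraries are meant to hide from application code.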

gasche · May 17, 2017, 4:24am

Sorry for the trite comment, but I find the citation and bibliography style of the paper hard to read. Would it be possible to have citations in the form “(Fullname1, Fullname2, …, pubyear)”? This can be done easily with \usepackage{natbib} and then using \citet*{foo} (for citations as text) and \citep*{foo} (for citation as parentheses). If the LaTeX of the paper is available somewhere, I would be happy to fix it for you.

kayceesrk · May 17, 2017, 2:01pm

Hi @gasche. Thanks for the suggestion. Updated now to use \citet*: http://kcsrk.info/papers/system_effects_may_17.pdf. Would be great to hear your thoughts on the design.

bluddy · May 17, 2017, 3:55pm

But don’t Reagents imply using arrows?

gasche · May 17, 2017, 3:59pm

Thanks!

I liked the paper, but I’m far from an expert in systems programming so I am not a qualified person to give domain-level feedback. (On the other hand, I would be curious to have Xavier Leroy’s opinion; if you haven’t sent it to him yet, you should consider it.) As a non-expert, what I retain from the asynchronous part of the story is “it’s still tricky, but it is more elegantly expressed.”

It’s not specific to this work but I like the extension of the match syntax with effect clauses in addition to exception clauses. I was also interested in your design for default handlers, which I don’t remember having seen or reflected on before; I wonder how it compares to default handlers in other systems (Eff?); is it a substantially different (novel) approach, or rather a tweak on the default handler presentations that does a good design job of remaining close to the original language and allowing for efficient implementation?

I seem to remember from previous discussion that copying a continuation didn’t necessarily work as well as one would expect (other continuations referenced from the continuation would not get copied, and thus break when the copied continuation is invoked). You briefly mention continuation copying in this paper, did you find a way to fix these issues, or a programming style that avoids them?

One minor question on the OCaml/Go comparison: I am very ignorant of latency measurements, but I would expect the “pain zone” where the system lags behind in answering requests and long latencies appear to start at different places for different systems. Could the Go results be explained by the fact that you are not in Go’s “pain zone” yet, and a test at even higher load show qualitatively different results because both the OCaml and the Go implementations are in trouble?

Finally, I’m not sure what to conclude of the Async vs. Effects benchmark. The benchmarks that are shown do not show conclusive performance difference between them (maybe I’m looking at them wrong and the “medium contention” bench actually shows a noticeable improvement of Effects over Async?). Is the story that “all user-level implementations of schedulers, in direct or indirect style, have roughly the same performance?”. Is the story that having the runtime support for efficient delimited continuation capture allows a relatively naive Effect-using scheduler to bridge the performance gap with a finely-tuned Async scheduler?

bluddy · May 17, 2017, 5:34pm

That’s what I took away from it. Plus the fact that you’ll have support for multicore, which will further improve performance.

kayceesrk · May 18, 2017, 8:12am

Thanks for the comments!

default handlers

Default handlers are inspired by Eff, but there are some differences. Eff has the notion of effect instances, and default handlers are associated with effect instances. Multicore OCaml does not have the concept of effect instances, and default handlers are associated with effects (operations in Eff parlance). In Eff, if the default handler evaluates to an operation, then a runtime error is reported. In Multicore, if the default handler performs an effect, we look for the default handler of that performed effect. If that effect does not have a default handler, then, unlike Eff, which reports a runtime error, we discontinue the continuation of the original perform with an exception.

effect E : unit 
effect F with function _ -> perform E

try perform F with Unhandled -> 
  print_string "Raised unhandled since E doesn't have a default handler"

prints the error message. With default handlers, we can have the same Unix or Sys module signature, which would behave like vanilla OCaml without a handler, and be asynchronous/fiber-safe with an appropriate handler.
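
The "same signature, different behaviour with or without a handler" idea can be approximated in released OCaml 5, which has no default handlers but raises `Effect.Unhandled` when no handler is installed. The sketch below is an emulation under that assumption, not the design described above; `Log`, `log`, `with_scheduler` and the two buffers are hypothetical names.

```ocaml
open Effect.Deep

type _ Effect.t += Log : string -> unit Effect.t

let direct = Buffer.create 16     (* stand-in for the plain blocking path *)
let scheduled = Buffer.create 16  (* stand-in for a scheduler-aware path *)

(* Emulated default handler: fall back when nobody handles the effect. *)
let log msg =
  try Effect.perform (Log msg)
  with Effect.Unhandled _ -> Buffer.add_string direct msg

(* An installed handler takes over, as a fiber-aware scheduler would. *)
let with_scheduler f =
  try_with f ()
    { effc = fun (type a) (e : a Effect.t) ->
        match e with
        | Log msg -> Some (fun (k : (a, unit) continuation) ->
            Buffer.add_string scheduled msg;
            continue k ())
        | _ -> None }

let () =
  log "plain;";
  with_scheduler (fun () -> log "fiber;")
```

The caller uses `log` the same way in both cases; only the presence of a handler decides which path runs.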

You briefly mention continuation copying in this paper, did you find a way to fix these issues, or a programming style that avoids them?

The current solution is to relegate them to the Obj module. It is unclear whether there is a clean fix for this issue, especially with resources. I would like to discourage the use of multi-shot continuations for typical use. But they indeed are useful in certain domain-specific instances – backtracking search, memoization, etc.
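
As an illustration of the one-shot restriction, the sketch below resumes the same continuation twice. It assumes the `Effect` module of the OCaml 5 standard library (released well after this thread), where a second resumption raises `Effect.Continuation_already_resumed`; the effect `Choose` and the function `resumed_twice` are hypothetical names.

```ocaml
open Effect.Deep

type _ Effect.t += Choose : bool Effect.t

(* A backtracking-style handler would like to try both answers.
   With one-shot continuations, only the first resumption succeeds. *)
let resumed_twice () =
  try_with (fun () -> if Effect.perform Choose then 1 else 0) ()
    { effc = fun (type a) (e : a Effect.t) ->
        match e with
        | Choose -> Some (fun (k : (a, int) continuation) ->
            let first = continue k true in
            (try first + continue k false
             with Effect.Continuation_already_resumed -> first))
        | _ -> None }

let result = resumed_twice ()
```

Here the second `continue` fails, so only the `true` branch contributes to the result; a multi-shot design would need an explicit copy of `k` before the first resumption.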

Could the Go results be explained by the fact that you are not in Go’s “pain zone” yet, and a test at even higher load show qualitatively different results because both the OCaml and the Go implementations are in trouble?

This is quite possible. As I am discovering, a high-performance web server is a piece of finely tuned engineering (not unlike the GC). There is work to be done on the multicore GC tuning, but even more so on configuring the right socket options. It also seems like there is accept loop starvation on the OCaml side; OCaml servers don’t seem to accept connections as fast as the Go server, as the main accept loop is preempted in favour of request processing. Still tinkering with this benchmark.

Finally, I’m not sure what to conclude of the Async vs. Effects benchmark.

There is a misconception that direct-style is slower than indirect-style (cf. the threads vs. events debate of the early 2000s). Some language runtimes have moved away from lightweight/green threads (Rust, various JVM implementations) for various reasons, shifting from M:N to 1:1 threading on multicore. Go is an exception and seems to have done rather well. The story here is that direct-style implementations need not be slower just because they offer a better abstraction. A prototype effect-based asynchronous I/O library can compete with the well-engineered Async library, while offering the advantages of direct-style programming (easier comprehension, backtraces and stack-based profiling…).

stedolan · May 18, 2017, 2:18pm

That’s about right. Asynchronous operations are traditionally tricky because of the inherent difficulty of responding to interrupts that may occur at almost any point in the program, and because of poor interfaces based on mutable global state representing “the current interrupt handler”. Effects fix only the latter, but I’m quite happy with that.

avsm · May 22, 2017, 9:39am

Stack traces now work in the latest multicore branch, thanks to @kayceesrk adding DWARF backtrace info! See https://github.com/ocamllabs/ocaml-multicore/pull/134

kayceesrk · May 22, 2017, 2:12pm

On top of that, you can now profile multicore OCaml programs using gperftools/cpuprofiler. For this, compile your program with -g flag to enable debugging information. Assuming you’ve installed gperftools in the usual location, run the program as:

LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=prof.out OCAMLRUNPARAM="w=512" <myprogram>.native

cpuprofiler is a statistical profiler that works by periodically interrupting the program with a signal handler and recording the program information. The OCAMLRUNPARAM option w=512 allocates an extra 512 words at the bottom of the stack for running the signal handler used by the profiler. Then you can analyse the profile with pprof to produce pretty graphs.

bluddy · May 22, 2017, 6:32pm

In my experience, a flame graph is far more understandable and scalable, and it can be made fairly easily.

kayceesrk · May 23, 2017, 2:41pm

which pprof supports
