About Multicore

Tags: multicore, compiler, faq

#21

Is there an API for programming multicore using standard multithreading constructs (thread spawning, semaphores, message queues)? I’m not interested in the overhead of Reagents at this point, nor do I want to deal with arrows. Is there an example somewhere of using multicore-ocaml for multicore programming?

I think @stedolan had an example of a work-stealing queue in one of his talks, but I couldn’t find the code for it.


#22

I have a very down-to-earth question about multicore OCaml. Does this mean that libraries like Async or Lwt will need to be ported to the new primitives to take advantage of the new system? Will they then be somehow obsoleted by these new features?

Last, will the new features be available in js_of_ocaml and maybe bucklescript?


#23

Does this mean that libraries like Async or Lwt will need to be ported to the new primitives to take advantage of the new system?

It is worth splitting this question into two parts. Multicore OCaml really has two separate but related aspects: parallelism and concurrency. The parallelism is provided by the Domains module whilst the concurrency is provided by algebraic effects. So one question is whether Async and Lwt should use the parallelism features, and a separate question is whether they should use the concurrency features.

I would say that the core of Lwt and Async should not use the parallelism features. All existing code for these libraries assumes that a single “thread” is running at a time and breaking that would be very painful. They could add additional features to take advantage of parallelism – for example you could imagine running one Async scheduler per Domain and providing some mechanisms for communication between “threads” on these different schedulers.

Now Lwt and Async might use the concurrency features. It is possible they could be redesigned to use algebraic effects internally. However, the main advantage of algebraic effects is that you code in direct style (with a corresponding improvement in performance), and without that it is less clear what you gain. There is also the fact that Lwt and Async are both unusual monads (certainly not the “standard” concurrency monad that you have in, say, Haskell) which might make them more difficult to reimplement in terms of algebraic effects.
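For concreteness, here is a minimal sketch of what "direct style" concurrency over effect handlers looks like, using the `Effect` module that eventually shipped in OCaml 5 (the names and API here postdate this discussion; the original multicore effects were untyped but similar in spirit):

```ocaml
(* A toy round-robin scheduler over OCaml 5 effect handlers.
   Fibers are plain unit -> unit functions written in direct style;
   [yield] suspends the current fiber and lets the next one run. *)
open Effect
open Effect.Deep

type _ Effect.t += Yield : unit Effect.t

let yield () = perform Yield

let run (tasks : (unit -> unit) list) : unit =
  let q : (unit -> unit) Queue.t = Queue.create () in
  let schedule () =
    match Queue.take_opt q with Some f -> f () | None -> ()
  in
  let handler =
    { retc = (fun () -> schedule ());   (* fiber finished: run the next one *)
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Yield ->
          Some (fun (k : (a, unit) continuation) ->
            (* park the suspended fiber at the back of the run queue *)
            Queue.push (fun () -> continue k ()) q;
            schedule ())
        | _ -> None) }
  in
  List.iter (fun f -> Queue.push (fun () -> match_with f () handler) q) tasks;
  schedule ()

(* Two fibers interleave without any monadic bind in sight. *)
let trace =
  let log = ref [] in
  let say s = log := s :: !log in
  run
    [ (fun () -> say "a1"; yield (); say "a2");
      (fun () -> say "b1"; yield (); say "b2") ];
  List.rev !log
```

Compare this with Lwt or Async, where the same interleaving would be expressed through `bind`/`>>=`: here the scheduler is ordinary OCaml and the fibers read like sequential code.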

Will they then be somehow obsoleted by these new features?

I’m obviously biased, but I think that the typed version of algebraic effects (see my talk here) would indeed render these libraries somewhat obsolete. Although it is worth pointing out that using typed algebraic effects for concurrency still requires you to write a scheduler – and the internals of Lwt or Async would be a great starting point for such a thing.

Last, will the new features be available in js_of_ocaml and maybe bucklescript?

KC would know more about the current status of this, but there is an implementation of algebraic effects for js_of_ocaml. It follows similar lines to the work described here, although I’m not sure whether a version using typed algebraic effects exists yet.


#24

Libraries like Lwt and Async will be somewhat obsoleted by multicore in their current form. At least for Lwt, the plan is to port it and mutate the API, in a backwards-compatible fashion, so that it takes the best advantage possible of multicore – including by allowing Lwt-based code to use less Lwt (at least less of what Lwt is today).

In particular, a goal is to ensure that code written against Lwt today and in the past performs well on multicore, has obvious optimization paths if any new ones arise, and plays well with whatever new APIs are developed in the future, including any new APIs in Lwt.


#25

Indeed. I really only meant that the monadic aspects of these libraries' interfaces would be somewhat obsolete. Not the libraries themselves.


#26

As Leo mentioned, we have a prototype js_of_ocaml compiler extended with support for algebraic effect handlers. This uses untyped algebraic effects, and has an associated performance hit. I am leaning towards using typed algebraic effects for the translation as it produces much better code; you only pay an extra cost for the code that uses effects. This doesn’t exist yet.


#27

We currently have a way to create (kernel) threads through Domain.spawn, as well as message queues (see https://github.com/kayceesrk/lockfree and work in progress at https://github.com/fondation451/lockfree). What is missing are the blocking primitives (locks, condition variables, semaphores). These are currently in development and should be ready in the next few weeks.
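For reference, here is roughly what plain shared-memory threading looks like with the primitives that eventually landed in the OCaml 5 stdlib (`Domain`, `Mutex`); at the time of this post the blocking primitives were still being developed, so treat this as a sketch:

```ocaml
(* Four domains increment a shared counter under a mutex, then the
   main domain joins them all. Without the lock, the unsynchronized
   increments would race and the final count would be unreliable. *)
let final_count =
  let m = Mutex.create () in
  let counter = ref 0 in
  let work () =
    for _ = 1 to 100_000 do
      Mutex.lock m;
      incr counter;            (* critical section *)
      Mutex.unlock m
    done
  in
  let domains = List.init 4 (fun _ -> Domain.spawn work) in
  List.iter Domain.join domains;
  !counter
```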


#28

Having spent a considerable amount of time on implicit parallelization techniques, I wonder how much performance gain we can actually expect from going multicore with OCaml. My impression is that many overly optimistic programmers expect close to linear performance improvements when you increase the number of CPU cores (N) whereas the reality is that “linear” typically only holds for very small N :roll_eyes:

In practice many applications will quickly suffer from threads contending for shared resources. Especially memory bandwidth is a difficult bottleneck to optimize for. Unless an application can be written such that all CPU cores only touch independent and at most L2-cache sized memory regions many times over, performance will likely disappoint. Modern CPUs are crazy fast and, even for small numbers of cores and expensive floating point operations, can easily saturate the memory bus if each value is only touched once.

Only a few algorithms scale really well with the number of cores, e.g. expensive random number generators, because they perform an inordinate number of computations relative to the size of their small state. Or matrix multiplication, which can easily be chunked into independent blocks that require a superlinear number of operations relative to the size of the data.

And we haven’t even touched on issues of cache coherence, or how much user code can actually be parallelized within an application. E.g. if just receiving and handing off a query to a worker thread takes 10% of the total processing time, no number of cores and no amount of runtime tweaking will ever make your application more than 10x faster (in practice far less). It’s also kind of ironic that one of the main arguments for exploiting threads as opposed to processes + message passing is that you can operate in the same memory space - even though that’s exactly what you should try to avoid for scalable parallelism!

To be fair, I’m excited that OCaml may get a multicore runtime, because it would significantly improve performance for some things while possibly also simplifying them. E.g. avoiding the quite expensive runtime lock with system calls and external libraries. But it’s certainly not going to be a panacea for scalable parallelism.


#29

No one working on multicore OCaml claims that it’s going to be a panacea for automatic parallelism. What it gives us is another scheduling tool for applications to interface with the underlying hardware more effectively.

Crucially, the algebraic effects model lets us build explicit scheduling that can exploit application-specific knowledge to schedule better onto heterogeneous hardware.

One thing we learnt back in 2012 (and things have only gotten worse since) is that there doesn’t exist a single good scheduling strategy for modern hardware, given the interaction of virtualisation, caches, sockets, hardware rings and NUMA memory buses. http://anil.recoil.org/talks/fosdem-io-2012.pdf has the graphs from back then, and the results database at https://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-bench/ has a startlingly variable result set for seemingly “obvious” benchmarks like two processes communicating on the same host.

I hope that we can build a series of scheduling libraries that can understand particular hardware layouts (e.g. Intel QPI vs AMD MC) and specialise scheduling and thread layout for that purpose. I’m not aware of any other language where this is a practical proposition without altering the runtime, and would be glad to hear about any such examples out there that others know about.


#30

Some of your concerns are well founded, but I find they apply to shared-memory parallelism in general. :slight_smile: Firstly, we are not interested in implicit parallelism with multicore OCaml. Implicit parallelism is a difficult problem where the cost of parallelisation often outweighs the benefit. However, the Manticore project from U. Chicago looks at implicit parallelism on a similar multicore runtime.

We are not looking for linear scaling, and it is well understood that linear scaling is rarely achievable with shared-memory multicore parallelism due to the cost of communication (caches) and coordination. What multicore does offer is a way to exploit the memory hierarchy so that mostly-local workloads can fit in local caches and NUMA domains. Domain-local minor heaps explicitly optimize for the memory hierarchy found on modern multicore hardware.

While this is true, such workloads do form an important class of problems. For example, FB’s program analysis tools (Hack, Flow, Infer) have a very large number of independent pieces of work but involve structured communication between multiple phases. Current solutions include an off-heap hash table of serialized values, which would naturally benefit from shared-memory multicore parallelism.

What multicore offers is zero-copy messaging. The application may still communicate over a message-passing channel abstraction such as those found in Go, CML, CSP, or GHC’s MVars, but the messages live in the same memory space, avoiding the need to serialize and copy them between processors.
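As an illustration of what the shared memory space buys you, here is a hypothetical blocking channel built from a mutex and a condition variable using the OCaml 5 stdlib (real libraries such as domainslib provide proper channels); sending hands the value over by reference, so nothing is serialized or copied:

```ocaml
(* A minimal blocking channel between domains. Sending pushes the
   value itself (a pointer into the shared heap), not a serialized
   copy of it. A sketch, not a production channel. *)
type 'a chan = { m : Mutex.t; c : Condition.t; q : 'a Queue.t }

let make_chan () =
  { m = Mutex.create (); c = Condition.create (); q = Queue.create () }

let send ch v =
  Mutex.lock ch.m;
  Queue.push v ch.q;
  Condition.signal ch.c;
  Mutex.unlock ch.m

let recv ch =
  Mutex.lock ch.m;
  while Queue.is_empty ch.q do Condition.wait ch.c ch.m done;
  let v = Queue.pop ch.q in
  Mutex.unlock ch.m;
  v

(* A consumer domain receives a large array; only a pointer crosses
   the channel, never a copy of the million floats. *)
let received_len =
  let ch = make_chan () in
  let consumer = Domain.spawn (fun () -> Array.length (recv ch)) in
  send ch (Array.make 1_000_000 1.0);
  Domain.join consumer
```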


#31

Two thoughts:

For the FB style case, I’m not sure that having a fully shared heap is the right solution. Another approach is to use a shmem region and build a custom data structure with explicitly managed memory for that region. This approach can lead to very fast code indeed, and has the advantage of explicitly cordoning off the shared piece, which makes the remainder of the program easier to reason about.

As to zero-copy messaging, I thought new messages can’t be fully zero-copy, since to send something to a different domain, it needs to be promoted to the major heap. Is that wrong?


#32

I’m personally hoping there will be a method to allocate directly on the shared heap, perhaps using an annotation. Also, even without such a mechanism, it’d allow us to broadcast to multiple workers at the cost of one copy.


#33

I was certainly not implying that the OCaml multicore developers were making this claim. There are sometimes voices from both inside and outside the OCaml community that present the lack of multicore support in the runtime as a major reason for the slow adoption of OCaml. Besides believing that the two are almost entirely unrelated, I do find that many people’s enthusiasm for multicore (or parallelism in general) does not match reality. I’m just trying to instill some realistic expectations :slightly_smiling_face:

Support for explicit scheduling is surely a necessity for achieving good performance on heterogeneous hardware, and a well-taken design decision. Inferring the operational behavior from code automatically is pretty much hopeless, especially considering that hardware parameters like the number of cores, cache hierarchies and sizes, synchronization overhead, etc., would all have to be factored in for efficient scheduling.

That said, the vast majority of programmers would be hopelessly overwhelmed by doing that manually. In fact, most programmers are probably already overwhelmed by correctness considerations when dealing with parallelism, never mind performance. Explicit scheduling will be a boon for a small group of people with highly specific and not overly complicated needs.

In a particular problem I am dealing with right now, explicit scheduling would likely not help much even though there are typically ample parallelization opportunities: the execution of the user program is traced, followed by a reinterpretation that is operationally quite different, e.g. reads in the original (and possibly even purely functional) program become writes in the transformed one, which may introduce cache-coherence issues that did not exist in the user program. The whole point of the automatic program transformation is that it is infeasible for a human to do, and hence so would explicit scheduling be.

Other big issues are portability and different work loads: even if I wrote the perfect scheduler for my platform and working set size, if I send the application to other people, upgrade my machine or change the size of the problem, it will likely be ill-tuned. Some amount of (semi-)automatic parallelism as in “tune the scheduler and user code for this platform and problem size” seems inevitable. Merely tuning cache usage without any parallelism can already be fairly involved (e.g. see the ATLAS linear algebra library).


#34

You are right. The GC needs to promote the messages if they were in the minor heap. But this promotion is work that you would have done anyway as part of the minor collection, assuming the object survives it.

I am a bit hesitant to provide this mechanism. Firstly, I assume most of the work produced by a domain is consumed by that same domain. It would be unnecessary to promote this work eagerly if it is going to be consumed by the same domain. Since it is unclear when a different domain is idle, I believe promotion on demand to be the correct default.

Secondly, it is unclear what an API for allocating in the shared heap that doesn’t lend itself to false promotions should look like. There is already a mechanism to promote an object early: Gc.promote_to. Moreover, minor allocations are almost free. My guess is that the cost of allocating on the minor heap + promotion + collecting from the major heap ~= the cost of allocating/collecting on the major heap.


#35

Personally, since I had to find a reason to be enthusiastic about multicore (not effects, which I am genuinely craving), I did eventually find one: mere convenience.

I have had to multiprocess a program two or three times in my life, and frankly, using the bare POSIX APIs for this is so painful and error-prone that if I can get mostly the same benefits by staying in OCaml and be rid of Unix process management and communication issues, I’m quite happy about that.


#36

Implicit parallelism being “too costly” is a matter of the cost function. Obviously, there is a tradeoff between performance and complexity here. I believe it’s fair to say that most programmers prefer low complexity most of the time.

I haven’t had the opportunity to test drive multicore OCaml yet, but looking at presentation slides the overall design definitely looks like an interesting approach. Keeping workloads within local caches is always a great idea for performance, irrespective of parallelism. It’s even more important if you know that multiple cores will compete for access to main memory otherwise.

Though the software may give the illusion that a value was passed without copying, a thread running on a different core may still trigger extra access to main memory if it has never touched that region or if it has to be refreshed after a write by another core. Not having to explicitly copy data is admittedly much more convenient from a software engineering perspective, but potentially expensive message passing still exists, just handled behind the scenes by hardware.

It’s not just the working set size, the structure of the computational graph of your program plays a pretty big role in what will work well, too. E.g. frequent, sequential use of unary operations is great, because the input and output are both going to be in the cache of the core performing the computation. But with binary operators each argument may have been the result of a computation by a different core. Now one of their caches needs to be synchronized for the binary operation, which may greatly diminish the gains of computing both arguments in parallel. That’s especially true for computationally “cheap” (high data throughput) functions, e.g. if you are adding lots of float vectors. Memory bandwidth can become a pretty important factor then.


#37

+1. For me, being able to replace fork-based parallelism (parmap) with something that just works™ within OCaml would be a big win.
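A parmap-style interface on top of domains could be sketched like this (`par_map` is a hypothetical helper, chunking the input and spawning one domain per chunk):

```ocaml
(* A naive parallel map using domains instead of fork(): split the
   input into [domains] contiguous chunks, map each chunk in its own
   domain, then join and concatenate the results in order. *)
let par_map ?(domains = 4) f l =
  let arr = Array.of_list l in
  let n = Array.length arr in
  let chunk d =
    let lo = d * n / domains and hi = (d + 1) * n / domains in
    Domain.spawn (fun () -> Array.init (hi - lo) (fun i -> f arr.(lo + i)))
  in
  List.init domains chunk
  |> List.concat_map (fun d -> Array.to_list (Domain.join d))

let squares = par_map (fun x -> x * x) [1; 2; 3; 4; 5]
```

Unlike a fork-based parmap, the input array is shared by all domains rather than copied into child processes, and the same code would run on Windows.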


#38

Don’t forget the benefit multicore will have on the community. Think of all the times one has to explain to someone why they shouldn’t reject OCaml because their imaginary, hypothetical code would not scale well. Now they can just say “yeah… sure…”.


#39

ocamlnet already allows you to do that:
http://projects.camlcity.org/projects/dl/ocamlnet-3.6.6/doc/html-main/Netmcore_heap.html
Sometimes I wonder whether the people involved in multicore OCaml have looked at what was done in ocamlnet.
I think it has some nice ideas for parallel programming in OCaml.


#40

I just love parmap. :slight_smile: The only problem is that Windows doesn’t have an efficient fork() call. Too bad for them.