Language abstractions and scheduling techniques for efficient execution of parallel algorithms on multicore hardware

The Multicore OCaml team has made significant progress in recent years. There now seems to be interest in working on high-level parallelism constructs. Such constructs are also tightly connected to the problem of controlling the granularity of parallel tasks.

I worked on parallel constructs and granularity control from 2011 to 2019, together with Umut Acar and Mike Rainey. We published a number of papers, each of them coming with theoretical bounds, an implementation, and an evaluation on state-of-the-art benchmarks of parallel algorithms.

While we mainly focused on C++ code, I speculate that nearly all of our ideas could be easily applied to Multicore OCaml. Porting these ideas would deliver what seems to be currently missing in Multicore OCaml for efficiently implementing a large class of parallel algorithms.

Gabriel Scherer and François Pottier recently suggested to me that it appears timely to share these results with the OCaml community. I’ll thus try to give an easily accessible, OCaml-oriented introduction to the results that we have produced. Note, however, that most of the ideas presented would apply essentially unchanged to any other programming language that aims to support nested parallelism.

I plan to cover the semantics of high-level parallelism constructs, to describe and argue for work-stealing scheduling, to present a number of tricks that are critical for efficiency, and to advocate for our modular, provably efficient approach to granularity control. I’ll post these parts one after the other, as I write them.

Other parts will be published in the coming weeks or months.


@charguer Great to have you thinking about abstractions for parallel algorithms in Multicore OCaml. I am yet to read the report in detail but noticed a few minor things:

  • The use of the terms fibers and lightweight threads – when introducing fibers in this report, it is perhaps useful to distinguish these fibers from the fibers in the Multicore OCaml runtime, i.e., stack segments that provide support for delimited continuations. IINM, the proposal here is orthogonal to effect handlers / delimited continuations (though I’m interested in figuring out sensible (failure) paths for programs mixing interactive and throughput-oriented computations).
  • Reference to the memory model should refer to the PLDI paper https://kcsrk.info/papers/pldi18-memory.pdf. The one on the wiki has been superseded by the paper.
  • Gabriel Scherer instigated me :smiley: to provide better support for writing parallel programs in Multicore OCaml. We have begun the attempt at building one under Domainslib.Tasks. I’d love to see fruits of this ongoing effort land in Domainslib.

it is perhaps useful to distinguish these fibers from the fibers in Multicore OCaml runtime i.e, stack segments that provide support for delimited continuations.

It would be useful to figure out the exact difference in terms of implementation, indeed. A node from the DAG calculus needs to store a “suspended piece of computation”, and a pointer on the stack that this computation could or should execute in. I’ll write more about the reuse of stacks in structured parallel computations in part 3, and also about the dynamic splitting of stacks in part 4.

Reference to the memory model should refer to the PLDI

I thought I had fixed that already, sorry. I see that you updated the wiki, great.

I’d love to see fruits of this ongoing effort land in Domainslib

Likewise. I’m lacking expertise for implementing the runtime features myself, but I’m actively trying to contribute to the specifications.

Thanks

It would be useful to figure out the exact difference in terms of implementation, indeed. A node from the DAG calculus needs to store a “suspended piece of computation”, and a pointer on the stack that this computation could or should execute in. I’ll write more about the reuse of stacks in structured parallel computations in part 3, and also about the dynamic splitting of stacks in part 4.

I’ve browsed through the report. I’m curious about the implementation of the DAG calculus. The only trickiness is yield; the rest can be implemented efficiently. Of course, the following questions may be too early to ask, and I should wait for the further parts of the story, but I thought I would record them here anyway.

It appears that, while the description of yield says it “captures the current continuation”, unlike continuation-capture mechanisms such as call/cc and call1/cc, the API doesn’t reify the continuation as a first-class entity; it just hands control over to the scheduler. The semantics appears to be that yield may return once all the dependencies on the current node have been resolved (no incoming edges into this node).

The current Domainslib.Task API does not have a yield, but it does provide strict futures (through async/await). Other APIs such as parallel_for have been implemented using async/await. This is a conscious decision to leave out a yield-like primitive. Implementing yield would necessitate some ability to capture the current continuation, irrespective of whether it is exposed as a first-class value or not. Currently, we aim to first upstream domains-only multicore support without support for effect handlers (the ability to capture the continuation of arbitrary OCaml computations). I’d like Domainslib to be usable from the day domains-only support lands and, if possible, not tied to effect handlers. While effect handler support is forthcoming, effect handlers are really suited to the case where blocking/suspension is necessary (think I/O). Effect handlers make continuation capture cheap by paying the cost upfront (in terms of allocating a separate stack segment for the handler computation, or the computation delimited by reset if you are familiar with shift/reset-style delimited continuations).

Is it necessary that implementations of the DAG calculus support yield? Is it necessary that the DAG calculus itself supports yield? Rather than yield, the way Domainslib.Task interacts with the scheduler is that the await call is implemented as a loop: poll to see if the promise is filled and return the value; otherwise, steal a task from the scheduler and start executing it. Assuming that the program contains only one scheduler, this implementation works well. Curious what your thoughts are on this.
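To make that “poll, else steal” loop concrete, here is a single-threaded sketch. The names (promise, task_queue, async, await) are illustrative, not Domainslib’s actual internals; the point is only that await can make progress without capturing any continuation, as long as the single scheduler’s queue holds the task that will fill the promise.

```ocaml
(* Single-threaded sketch of implementing await as "poll, else steal".
   Illustrative names only, not Domainslib's actual internals. *)

type 'a promise = 'a option ref

let task_queue : (unit -> unit) Queue.t = Queue.create ()

(* async: register a task that will fill the promise when executed. *)
let async (f : unit -> 'a) : 'a promise =
  let p = ref None in
  Queue.push (fun () -> p := Some (f ())) task_queue;
  p

(* await: poll the promise; while it is unfilled, take a pending task
   from the scheduler's queue and execute it. With a single scheduler,
   the awaited task is guaranteed to be somewhere in the queue, so this
   makes progress without capturing any continuation. *)
let rec await (p : 'a promise) : 'a =
  match !p with
  | Some v -> v
  | None ->
      match Queue.take_opt task_queue with
      | Some task -> task (); await p
      | None -> failwith "unfilled promise and no pending task"

let () =
  let p = async (fun () -> 21 * 2) in
  assert (await p = 42)
```

As the thread discusses below, this scheme works under the one-scheduler assumption, but the stack of the stolen task is nested inside the stack of the awaiting task, which is where the composition and backtrace issues come from.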

I’ll have a closer look at DAG calculus.

Thanks for sharing this very interesting Part 1 paper.
I haven’t yet read in details its numerous references.

I understand that you don’t criticize/disapprove of the primitives offered by Multicore OCaml, but rather propose to add the notion of futures and additional constructs, and to expose all of that via an interface that would make programming with them easier (parallelism always evokes heaven and hell).
The programmer would not even need to deeply understand the notion of futures or to handle lower-level constructs.
Do I understand your paper correctly, or could you please correct my understanding?

From an “ordinary” OCaml programmer’s standpoint, the term multicore evokes more intensive/massive computation thanks to parallelism. And also hard problems when programming (and I don’t even talk about GPU programming).
But at the same time, one can naively imagine using several cores and use message passing between the corresponding processes, and even between CPUs of different machines (with the cost of communication).
This is all about distributed computation that should help “doing more with fixed memory and time resources”.

The concept of dynamically constructing a computation DAG by adding and consuming nodes and edges (where nodes are computation and edges are dependencies) looks great. It lets me think about a “perfect army or team” that self organizes to solve a problem with its (current, limited) resources.

With simple words, can you explain to ordinary OCaml programmers:

  • first, the benefits from using vanilla multicore OCaml?
    When do we really need it versus, say, message passing between processes.
    Can you share some metrics for some classes of computation? (+50% +100% + 200% multicore faster vs monocore?)
  • and give a few real-world examples from the aforementioned class of algorithms that the so-called DAG calculus (designed to overcome limitations of futures) would make accessible?

Thanks.

I understand that you don’t criticize/disapprove of the primitives offered by Multicore OCaml, but rather propose to add the notion of futures and additional constructs, and to expose all of that via an interface that would make programming with them easier.

The high-level constructs that I am advertising are fairly standard constructs whose benefits have been demonstrated in other languages for the purpose of programming multicore algorithms.

The DAG-calculus provides a unified framework for explaining the semantics and the cost semantics of all these constructs. On the one hand, it helps implementers factor the implementation effort. On the other hand, it lets advanced programmers express more complex dependency patterns if they ever feel the need to.

The programmer would not even need to deeply understand the notion of future and to handle lower level constructs.

That’s one thing that fork-join and async-finish give you. Besides, these constructs are associated with nice theoretical bounds, explaining under which hypotheses these constructs are likely to deliver good speedups.
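For readers who haven’t met the construct, fork-join can be specified by a sequential reference implementation; a parallel scheduler is allowed to run the two branches on different cores, but the observable result is the same. The names below are illustrative, not an actual library API.

```ocaml
(* Sequential reference semantics for fork-join: a parallel scheduler
   may evaluate the two branches on different cores, but the result is
   always the pair of both branch results. *)
let fork_join (f : unit -> 'a) (g : unit -> 'b) : 'a * 'b =
  let a = f () in
  let b = g () in
  (a, b)

(* The classic fork-join example: naive parallel Fibonacci, where each
   call forks its two recursive calls and joins on their results. *)
let rec fib n =
  if n < 2 then n
  else
    let a, b = fork_join (fun () -> fib (n - 1)) (fun () -> fib (n - 2)) in
    a + b

let () = assert (fib 10 = 55)
```

The theoretical bounds mentioned above are stated in terms of the work and span of the DAG that such programs unfold.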

But at the same time, one can naively imagine using several cores and use message passing between the corresponding processes, and even between CPUs of different machines (with the cost of communication). […] When do we really need it versus, say, message passing between processes?

Parallel programs written using high-level constructs can be:

  • easier to program, because you don’t have to deal with the passing of messages;
  • more efficient because able to synchronize efficiently through shared memory, using e.g., a giant concurrent hashtable;
  • more modular, because you can write a library without having to worry about the context or the instantiation of the functions taken as argument;
  • I suspect that it is also easier to control the granularity of subtasks when the amount of work depends not on the size of the input data, but on the values in the input.

I won’t try to make a list of everything you can program better than with message passing. Let me just point out one fascinating piece of work on stream processing of graph updates mixed with graph queries, in which functional representations of the various versions of the graph are stored in shared memory, with sharing of common subtrees across different versions.

A 72-core multicore machine with 1TB of RAM can handle graphs with over 200 billion edges. This kind of hardware costs in the range 10k€-30k€, and can be rented at Amazon EC2 for less than $7/hour, if you just want to play around.

Can you share some metrics for some classes of computation? (+50% +100% + 200% multicore faster vs monocore?)

For OCaml, it remains to be seen. In C++, working with 40 cores, we could typically achieve speedups in the range of 12x to 39x. For memory-intensive programs, the 12x appeared in most cases as a ceiling on performance, associated with peak memory bandwidth. Check out Blelloch’s papers for example speedup figures for a few classic algorithmic problems (in Table 2).

and give a few real-world examples from the aforementioned class of algorithms that the so-called DAG calculus (designed to overcome limitations of futures) would make accessible?

Here is a more theoretical read which might answer your question about the motivation for programming using fork-join-style computations. And here is the exhaustive list of Blelloch’s papers if you want to look around for other practical applications of that programming model.


Indeed, maybe you could implement many operations without the full power of yield.
The only thing that really matters is that, when you execute a fork-join, the stack of the caller must be reused for executing the continuation task (aka the “join task”) after the two branches are completed. If the right branch was executed by another core, and if it terminates second, then that other core is the one executing the continuation task, not the core that originally executed the fork-join. Thus, somehow, there is a transfer of stacks between cores. If you can handle that, you should be fine, presumably.

Thanks a lot for all these interesting details and references.

I’m not used to multicore, but the ratio “looks good” (for C++).

Regarding vanilla multicore OCaml speedup/cores number ratio, there are already some metrics there: Multicore OCaml: March 2020 update :

Speedup (sequential_time/multi_threaded_time) versus number of cores for Multicore (Concurrent Minor Collector), Parmap and Parany is quite significant:
approx 20x for 40 cores.

This ratio is certainly strongly application-dependent.
The benefits of multicore seem to be: “You can compute more if you add cores and RAM, and that comes with a speedup-per-core ratio in the range of 20–100%” (e.g., 10x cores gives 2x to 10x speedup).
It’s a little bit like a car that can double its speed with a more powerful engine, but at increased cost: the cost of the car, and of its fuel consumption: 2x speed (say, 2x100 km/h, or 2x60 mph) requires approximately 4x to 6x more fuel.

When you say “more efficient because able to synchronize efficiently through shared memory, using e.g., a giant concurrent hashtable”, does it mean that you expect an increased speedup-per-core ratio from Multicore OCaml + the DAG-calculus constructs and unified framework, compared to vanilla Multicore OCaml?
In C++, what increase in the speedup-per-core ratio do these constructs (fork-join, async-finish, futures) bring?

(Naive) side question:
We are discussing about this DAG calculus proposal, but it won’t be available next week.
Now coming back to daily programming, I understand that multicore OCaml is the available option to speed up computation (for the first time, I’ve just installed it in a fresh opam switch. It was as straightforward as with a monocore OCaml compiler. Good. Now, it’s time to study examples! ).

As multicore OCaml is available (I don’t know yet its possible constraints and limitations), should an “ordinary monocore OCaml programmer” simply use it, or are there affordable in-between techniques or libraries that could help them significantly speed up their programs with ordinary libraries?

Thanks.

I’m a little surprised that you’re trying to avoid using the runtime parts of the effects work in domainslib. My understanding is that even the first version of multicore will include the runtime support, but only expose it in Obj.

Using that to implement futures and yield seems preferable to the ad-hoc version you mention. (I’m sure you know this, but: the ad-hoc version doesn’t compose, gives very confusing backtraces, and maybe confusing exception behaviour, although you might be able to fix that.)

I don’t think using the effects runtime would present large forward compatibility issues. When typed effects land you can re-implement it using those and the interface would only change by adding some effect types on the various functions to stop them being run without a scheduler.

I agree that it is a problem to implement the interface as @charguer proposes it:

val yield : unit -> unit

   let force (t, r) =
     newEdge t (self ());
     yield ();
     unsome !r

But maybe we could change the API to ask the user to explicitly delimit the continuation themselves? (This is a question for @charguer):

val yield : (unit -> unit) -> unit

   let force (t, r) =
     newEdge t (self ());
     yield @@ fun () ->
     unsome !r

That could make it possible to implement without control operators in the runtime.
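To spell out the intuition with a single-threaded sketch (names illustrative): with `val yield : (unit -> unit) -> unit`, the closure passed to yield *is* the remainder of the node’s work, so a scheduler can store it as an ordinary thunk and run it later, with no need to capture a stack.

```ocaml
(* Single-threaded sketch: the "continuation" handed to yield is an
   ordinary closure, so the scheduler just queues it for later. In a
   real implementation the thunk would be attached to a DAG node and
   released once its incoming edges are gone; here we merely defer it. *)
let ready : (unit -> unit) Queue.t = Queue.create ()

let yield (k : unit -> unit) : unit = Queue.push k ready

let run () =
  while not (Queue.is_empty ready) do (Queue.pop ready) () done

let () =
  let trace = ref [] in
  yield (fun () -> trace := "continuation" :: !trace);
  trace := "before scheduler" :: !trace;
  run ();
  assert (List.rev !trace = ["before scheduler"; "continuation"])
```

The cost of this design is syntactic: everything after the yield point must be written inside the closure, which is exactly the manual delimiting @charguer discusses below.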

Note: I just read an interesting blog post Go statement considered harmful, that I think could be of interest to many people who read the present thread with interest. It’s a bit slow to start (it’s for a general audience, you don’t need to be very familiar with concurrency; good!). The main point, I think, is that it is important to be able to block on all the subtasks created before continuing, in particular for error handling and resource safety.


Multicore hardware has been mainstream for more than a decade. So I am not sure that “week” is the right timescale here :slight_smile:

No, beyond trivial examples, it won’t work. I think that you really need to capture the current continuation.

I understand that it can be tricky to implement for many reasons, but I don’t know of another way to expose parallel constructs that can be smoothly introduced in the middle of conventional sequential code, i.e., without requiring the programmer to manually CPS-transform the entire program.

I think it makes sense to have a way to test the domains only, to simplify the moving parts. Personally I am curious to see how much we can do in terms of benchmarking parallel programs without more fine-grained concurrency mechanisms.

This being said, it would also be interesting to implement with fibers. Does one of you know, if someone wanted to write small benchmarks using the delimited-control operators (… for example an implementation of the DAG calculus, but probably first something simpler), where we should start looking? Is there a “Domainslib” equivalent in that space? (Are there already benchmark collections?)

As I’d mentioned to @lpw25 in a separate conversation, the reason for not using runtime parts of effect handlers is to fit with the multicore upstreaming plan. The multicore upstreaming plan consists of 3 major pieces:

  1. domains-only multicore support without effect handlers in any form
  2. effect handlers with all the runtime system support, but no syntax extension (exposed through Obj module)
  3. typed effects

Currently, the multicore team is focussed on item 1. Item 2 involves non-trivial changes to the stack layout, which for efficiency must be implemented in assembly. Unfortunately, this means that we’ll have to replicate this effort for each of the different architectures. Some of this code is quite subtle, will likely necessitate a good amount of time to properly review, and might land a release or two after domains-only multicore. Given this plan, the idea of Domainslib is to not have any dependencies on effect handlers / delimited continuation support, so that we have a usable library the day item 1 lands.

That said, nothing prevents us from implementing Domainslib.Tasks with effect handlers and utilizing the full power of delimited continuations. Once item 2 lands, we can promote Domainslib to use effect handlers (hopefully transparently, but likely with some breaking changes for better semantics).

We’ve already ported the parallel benchmarks in Sandmark to use Domainslib. If a version of Domainslib is implemented with effects, then we can start benchmarking the overheads today. @gasche, regarding benchmarks for delimited-control operators, effects-examples should be useful. Several recent papers on effect handlers use similar examples for benchmarking. For our TFP paper, we built an asynchronous I/O library using effect handlers whose performance was competitive with Async on a realistic webserver workload.


Thanks for taking the time to write this down.

One thing I do not share with you is the idea that cancellation can easily be tackled on the user side.

I think cancellation is extremely important – it constantly happens when you build user interfaces – and you want to get it right w.r.t. resource cleanup.

From a compositional perspective it’s very difficult and annoying to handle in user space. Having to manually thread labels and check them places too much burden (and possibility of error) on the programmer. Besides, you are busted if one of the libraries you use doesn’t poll for your labels.

I think this aspect and its relationship to resource cleanup should be given much more thought in the multicore API and runtime design (see e.g. the F# cancellation tokens described in this paper).


Thanks for your feedback. After a couple of private discussions, I am now convinced that proper support for exceptions is needed. I plan to write up, after the weekend, the description of a flexible API that would cover most use cases, and still work with a scheduler implementation that does not provide full support for task cancellation.

It probably did not come out very clearly in my “part 1” write-up: I am not against the idea of supporting “cancellable tasks”. I am just warning that this can be tricky to implement efficiently in the general case (and even tricky to specify in the case of unstructured futures), and that there exists a workaround (manual polling) which certainly isn’t perfect or fully modular, but can be used while waiting for better support to be available.


This is a wonderful blog post, @gasche. Since it’s from 2 years ago, did nurseries live up to the hype? The concept certainly seems to ‘make sense’. Should we use this same concept in OCaml?

As far as I can tell, nurseries are very similar to async-finish (and its labelled variant) as described/presented in the document presented in this thread. (The author also has a blog post on timeout and cancellation, which I haven’t read yet.)

I have been convinced by arguments that proper support for exceptions would be needed for the first implementation of the DAG-calculus. I sketch below what could be an extension of the DAG-calculus with customizable support for exception handling.

Unlike the rest of the material, I am here describing “ideas of the week-end”, as opposed to fully worked-out, peer-reviewed, research results. Inevitably, there will remain a few rough edges.

Quick recap on the DAG-calculus without exceptions

In the DAG-calculus, a node describes a piece of computation. Initially, the DAG consists of a single node, which describes the entry point of the program. During the execution, new nodes are created dynamically. The original entry point then becomes the “sink” node, at the bottom of the DAG, and describes the final exit point of the program. Typically, a fork-join operation introduces two fresh nodes, and updates the current node to represent the continuation, i.e. the computation that comes after the fork-join.

An edge from A to B indicates that node A must complete before node B is allowed to start. When a node completes, it is removed from the DAG, together with all its associated edges. After the removal of these edges, other nodes that previously had dependencies (incoming edges) may become “ready” (zero incoming edges).

Every node features an “instrategy”, which controls the representation of the incoming edges, and an “outstrategy”, which controls the representation of the outgoing edges. When a node A completes, its outstrategy takes care of notifying the targets of the outgoing edges of the removal of these edges. For example, for an edge from node A to node B, the instrategy of node B is notified. Note that, depending on the in- and out-arity of the nodes, the methods associated with in- and out-strategies may need to handle queries concurrently.
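One plausible OCaml rendering of an instrategy, inferred from the description above (the signatures are hypothetical; the actual DAG-calculus interface differs in details, notably concurrent queries), is a simple in-degree counter that reports when a node becomes ready:

```ocaml
(* Hypothetical signature for instrategies, inferred from the recap
   above; not the DAG-calculus paper's actual interface. *)
module type INSTRATEGY = sig
  type t
  val create : unit -> t
  val add_edge : t -> unit      (* record one more incoming edge *)
  val remove_edge : t -> bool   (* a predecessor completed; true if the
                                   node now has zero incoming edges *)
end

(* The simplest instance: an in-degree counter. A real implementation
   would use an atomic counter, since several predecessors may complete
   concurrently. *)
module Counter_in : INSTRATEGY = struct
  type t = int ref
  let create () = ref 0
  let add_edge t = incr t
  let remove_edge t = decr t; !t = 0
end

let () =
  let n = Counter_in.create () in
  Counter_in.add_edge n;
  Counter_in.add_edge n;
  assert (Counter_in.remove_edge n = false);  (* one edge left *)
  assert (Counter_in.remove_edge n = true)    (* node is now ready *)
```

The extension with exceptions, described next, enriches remove_edge so that it also receives the predecessor’s result.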

Extension of the DAG-calculus with exceptions, first without futures

In the original presentation of the DAG-calculus, the only information carried by the edges is whether a node has terminated. The result value produced by a computation can be transferred via shared memory. In a language like ML, furthermore extended with exceptions, it makes sense for edges to carry the result produced by a node: either a value, or an exception.

The API for in- and out-strategies (which I haven’t detailed so far) could be refined to handle return values and exceptions. More precisely, when an edge from A to B is removed, the instrategy of node B receives the result of node A. If this result is an exception, the instrategy can follow different policies for how to handle the exception. This policy may be chosen on a per-node basis.

I see 3 useful patterns:

  1. The policy that combines exceptions: if one or more edges carry an exception, then, when all incoming edges are removed (i.e. all branches have completed), propagate an exception that consists of the list of the exceptions gathered.
  2. The policy that respects sequential execution order: collect results and exceptions from the incoming edges; if the resulting tuple is of the form (v1, ..., vi, exn1, _, _, _), where the underscores indicate either a value, an exception, or no result obtained yet, then propagate the exception exn1.
  3. The policy that allows for eager propagation of exceptions: as soon as one edge carries an exception, propagate this exception. This policy is non-deterministic, but allows for debugging without waiting. If non-determinism is an issue for debugging, it is always possible for the programmer to switch to another policy, presumably by setting the right flag in their code.

Note that, with policy 3, if a node raises an exception and this exception is not caught, then the program would immediately terminate on that exception, without noticeable delay (just the time required to walk down a chain of edges).
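The three policies can be phrased as functions over the list of results gathered so far from a node’s incoming edges. The types and names below are illustrative sketches, not part of the DAG-calculus as published:

```ocaml
(* Illustrative type: the result carried by each incoming edge. *)
type 'a branch_result =
  | Value of 'a
  | Exn of exn
  | Pending   (* the branch has not completed yet *)

exception Combined of exn list

(* Policy 1: once all branches have completed, gather the exceptions. *)
let combining results =
  let exns = List.filter_map (function Exn e -> Some e | _ -> None) results in
  if exns = [] then None else Some (Combined exns)

(* Policy 2: propagate the leftmost exception, but only once every
   branch to its left has produced a value (sequential order). *)
let rec sequential = function
  | [] -> None
  | Value _ :: rest -> sequential rest
  | Exn e :: _ -> Some e
  | Pending :: _ -> None

(* Policy 3: propagate any exception as soon as one shows up. *)
let eager results =
  List.find_map (function Exn e -> Some e | _ -> None) results

let () =
  assert (sequential [Value 1; Exn Not_found; Pending] = Some Not_found);
  assert (sequential [Value 1; Pending; Exn Not_found] = None);
  assert (eager [Pending; Exn Not_found] = Some Not_found);
  assert (combining [Value 1; Value 2] = None)
```

In the actual design, these functions would live inside the instrategy, which decides after each remove_edge notification whether to propagate.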

Let me first clarify what it means to “propagate an exception”. If the instrategy of a node decides to propagate an exception e, it essentially means that the contents of the computation associated with that node will be patched with a leading raise e. In other words, the exception will be raised in the stack associated with the computation of that node. This allows semantically-enclosing exception handlers to catch the exception.

It now remains to explain what happens to the branches that are cancelled, i.e. that have not yet terminated when an exception is propagated from another branch (for policies 2 and 3).

Cancelling the execution of disconnected nodes

If the instrategy associated with a node decides to propagate an exception before waiting for the completion of all incoming edges (branches), its action on the DAG consists of effectively removing the remaining incoming edges into the node. Thereby, a sub-DAG gets disconnected from the rest of the DAG; that is, its nodes no longer have a path reaching the sink. The results associated with the disconnected nodes are no longer needed.

If computations were pure, all these nodes could be discarded at once, and those currently running could be interrupted abruptly. However, in a world of impure computations, cancelling computations at arbitrary points is certainly not a good idea. The post on Trio makes that point too.

For cancelling computations, we need to somehow notify the nodes that they are no longer needed, and let them handle this notification on their own.

  • If a node has not yet started, the scheduler may test, before starting its execution, whether the node is marked as cancelled. If so, it may invoke a custom finalize method that the programmer may have registered. For finalize methods to be executed in the right order, it seems necessary to follow the DAG order, that is, to invoke finalize only on ready tasks.
  • If a node is currently running, it seems safe to raise the Cancel exception only at checkpoints that the programmer has marked explicitly in their code, via the checkpoint instruction.
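A minimal sketch of the checkpoint mechanism follows. A single mutable flag stands in for the per-node cancellation mark that a real scheduler would maintain (under Multicore OCaml it would need to be an atomic location written by the scheduler and read by the running node); the names are illustrative.

```ocaml
exception Cancel

(* Per-node cancellation mark; a single mutable flag in this sketch. *)
let cancelled = ref false

(* Raise Cancel only at explicitly marked program points. *)
let checkpoint () = if !cancelled then raise Cancel

(* A long sequential loop made cancellable: poll every 1024 iterations
   to amortize the cost of the check. *)
let sum_with_checkpoints a =
  let s = ref 0 in
  Array.iteri (fun i x ->
    if i land 1023 = 0 then checkpoint ();
    s := !s + x) a;
  !s

let () =
  assert (sum_with_checkpoints (Array.make 5000 1) = 5000);
  cancelled := true;
  assert (try ignore (sum_with_checkpoints (Array.make 5000 1)); false
          with Cancel -> true)
```

Because Cancel is raised by ordinary control flow at a known program point, the node’s own exception handlers and resource cleanup code run normally, which is exactly what arbitrary asynchronous interruption would break.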

For many parallel algorithms, the nodes of the DAG are set up in such a way that their execution time is never too long before they complete or invoke yield. For such nodes, it is perfectly acceptable to not attempt cancelling the execution of the node while it is executing. In other words, it suffices for cancellation to be checked before starting a node, and the checkpoint instructions do not need to appear anywhere explicitly in the code.

On the contrary, for programs that may feature nodes involving long sequential execution, placing the checkpoint instructions can be quite tricky, especially if the code calls into existing sequential libraries. Dealing with this situation is certainly going to be challenging. A number of libraries could be identified as “safe to interrupt at any time”; in that case, the polling on cancellation could be handled by the runtime system. Not all libraries, however, would satisfy this property.

Support for futures

The problem that I discussed with futures is that it is not always clear where to propagate the exceptions that they may raise. Another important problem, at least for languages without a garbage collector, is that it is not clear at which point a future is no longer needed, and thus can be removed from the DAG (in the model) and deallocated (in the implementation).

We can address both of these problems through the introduction of “shadow edges”. A shadow edge relates a node A that describes a future to a node B, typically a node associated with a “finish” block (or a “nursery” in Trio’s vocabulary). The interpretation of such an edge is twofold:

  1. If the execution of the future raises an exception, then this exception is delivered to the instrategy of node B. This instrategy should have a policy able to handle such exceptions, and to prioritize their propagation with respect to that of other exceptions within the same scope.
  2. If all proper incoming edges to node B have completed (i.e. if only incoming shadow edges remain on B), then the shadow edges incoming into B are removed. At this point, node B becomes ready to execute. The future A gets disconnected from the DAG: it will never be forced, so it can be removed from the DAG. If it is running, its execution may be cancelled as explained above.

Note that there are a number of well-scoping rules (to be made precise) that the code should satisfy for things to work out smoothly. In particular, a future must not be forced outside of its scope.

Overall, the extension of the DAG-calculus that I’m sketching maintains the key property that the semantics can be explained independently of the scheduler. In particular, it is not specified how fast nodes are cancelled. This leaves room for many possible scheduler implementations.

Implementation of cancellation

For implementing the DAG-calculus without support for exceptions, it was sufficient to store, in out-strategies, pointers to instrategies. To support cancellable nodes, it seems to me that an efficient implementation would need backward pointers, following the edges backwards. Cancellation of branches would then be implemented by a reverse DFS traversal of the graph. An implementation faces a number of complications due to potential races between the forward traversal of outgoing edges after nodes complete, and the backward traversal of edges for cancellation. Interesting research work ahead!

Tempting but tricky: exceptions for early algorithmic termination

Consider the function array_any_index_of x a, which searches for the index of any occurrence of x in the array a. The code is implemented using the parallel_for construct, with a function that raises an exception as soon as one occurrence is found. The annotation exn:eager specifies the “eager propagation of exceptions” (policy 3, above).

   exception Occurrence of int

   let array_any_index_of x a =
     let n = Array.length a in
     try
       parallel_for ~exn:eager 0 (n - 1) (fun i ->
         if a.(i) = x then raise (Occurrence i));
       None
     with Occurrence i -> Some i

There are three important aspects that the programmer must be aware of:

  1. Unlike sequential code, the occurrence reported isn’t necessarily the first one. For that, one would need another policy for handling exceptions, e.g., writing ~exn:sequential. But doing so means that, in many cases, a large fraction of the array needs to be traversed even when an occurrence is found early.
  2. Unlike sequential code, there is no guarantee that the function terminates faster than O(n), even if the array contains x in a cell near the front. The randomness in the scheduler may very well lead to this segment of the array being processed last. In fact, depending on the scheduler, even if an occurrence is found early, the delay associated with task cancellation may result in the entire array being traversed nevertheless.
  3. Last, and most importantly, introducing “speculative parallelism” as in this example is generally counterproductive for performance. Of course, if all the cores are free, then the above parallel_for is the only means of providing these cores with some work. However, assume this code is executed in a context that has already spawned many parallel subtasks, and assume that the array contains an occurrence, say at index n/2 - 1. In a work-stealing execution, if another core acquires some of the work involved, it is likely that all the cells of the array will be traversed. Yet, a sequential execution would have traversed only half of the array. Thus, the parallel execution will be crippled by the fact that it performs unnecessary work.

What the third point suggests is that speculative parallelism is not always a win. One would need some means of expressing priority: exploit parallelism from this specific parallel-for construct only when the scheduler has no other ready node to work on. Yet, this kind of rule is generally hard to exploit, for at least two reasons:

  • It is not easy for decentralized schedulers to handle this kind of priorities, because they do not have a global view of what nodes remain to be executed.
  • Any policy based on tests such as “if there is no other ready node available at that point in time” can be defeated when another running task suddenly starts spawning a large number of parallel nodes (with higher priority).