It is obvious how important this project is for almost any application these days: you can hardly find a single-core computer anymore, stationary or mobile, x86 or ARM. And at the same time OCaml has lacked proper parallelism for years, unlike many other comparable languages. Wouldn't it be beneficial for Facebook and other heavy OCaml users to dedicate one full-time developer to this project of utmost importance?
I once dreamed about an OCaml funding project, something like FreeBSD's fundraising, where everybody could donate money to the OCaml project or to particular subprojects (e.g. multicore).
Did the OCaml people ever think about a centralized donation platform/funding system? I suppose it could be quite tricky in legal and technical terms, so the costs might surpass the benefits.
I've also never heard of PL-related fundraising, but GNOME, Haiku, FreeBSD, and OpenBSD all do it.
That would be the OCaml Foundation (its existence is relatively recent, which explains why it might not be widely known yet).
Have you considered adding a Stripe option? It would allow you to accept all major payment providers, even Chinese ones.
I'd never heard of this before. Is it referenced anywhere on ocaml.org?
I doubt either of them feels any pressing need for it, and the arguments in favor of local data race protection haven't sold everybody yet. Moreover, paying real money to underwrite a major refactoring of an open-source software platform not under their control isn't the typical mode for for-profit enterprises. I'm expecting them to fork the language instead.
I’d like to take the opportunity to clarify a few details here.
Jane Street has been instrumental in this area by funding fundamental research in Multicore OCaml and the OCaml compiler via a research grant for the last 5+ years through the OCaml Labs initiative at the University of Cambridge. Beyond the dollar amount, it is the willingness of the Tools and Compilers (T&C) team at Jane Street to actively engage with the Multicore OCaml developers to co-create and guide development that has been most fruitful. Fundamental research in this direction includes papers on effect handlers, and the Multicore OCaml memory model. Valuable software development that has come out of this effort includes not only improvements to the compiler, but also benchmarking infrastructure (bench.ocamllabs.io and bench2.ocamllabs.io), which aids progress of Multicore OCaml, but also adds enormous value to the entire OCaml ecosystem.
Suffice it to say that the Multicore OCaml project wouldn't be where it is without the support from Jane Street.
At OCaml Labs, we have a number of industrial partners, and we are always looking for collaborations that echo the similarly productive structure of co-creation and engagement we have with Jane Street.
As someone who spent over a decade cleaning up dumpster fires on the JVM, and who did the first scalability work (flat locks) on the JVM, I can attest to this. For myself, I think most programmers are incapable of dealing with real SMP parallelism and the memory-model issues that come with it.
I know that I am [as was demonstrated when I tried to write a lock-free-reader java.util.Hashtable back in 1998]. It’s a lot harder than it looks to deal with real parallelism.
It is still better to have the ability to run into all of these parallelism problems than to have no chance of encountering them at all.
I cannot agree, and it's my contention that the reasoning behind this judgment is badly flawed. There are many other fine programming languages that offer shared-memory symmetric multicore without local data race protection, and some of them are frequently touted for providing safety as a principal advantage over more traditional languages like C++ and Java. They are anything but safe without local data race protection, yet that doesn't stop people, many of them in large for-profit enterprises, from flocking to them like lemmings.
TL;DR Allowing any but the best programmers to write SMP parallel code seems innocuous, and every programmer thinks they are “the best”. But Kernighan tells us that “when you use all your ingenuity and skill to write a piece of code, by definition you have already made a grave error”. Making SMP parallelism easily available to most programmers will be a poison fruit.
I’m pretty dour on this idea. There are reasons we don’t allow worker-bee programmers to write code that does explicit memory-management (even though in the 90s it was standard): we’ve learned that they cannot do this safely, and they will write code that fails in wild and unpredictable ways. And those failures will be expensive. The minute garbage-collection became feasible, everybody ran to it like towards the only life-raft on the Titanic. In the same sense, I believe that concurrency is unsafe for programmers. Two anecdotes:
(1) Prior to circa 2007, the JVM's JDBC DriverManager had a couple of places with locking, and the lock ordering in those places was such that the Oracle OCI JDBC driver and the Sybase pure-Java driver, together with this DriverManager bug, produced a deadlock in production in a WebSphere app at a major custodial bank.
This happened twice that I know of, though who knows how many other times somebody just restarted the app.
The key lesson I drew from this incident: since the entire application is not assembled until deployment time at each customer, and since (due to the many, many differing versions and fixpacks) each customer has a different collection of code, it is pretty much impossible for programmers to perform manual deadlock detection or avoidance (how can they know what another component's lock wait-for graph will look like?), and of course nobody does it at application boot time.
And notice that this was code from three different, commercially-rivalrous sources. So there would be no way for any one of these sources to be able to compute that wait-for graph, even if it were technically feasible to do so.
You cannot write deadlock-free code in complex systems assembled from many parts at boot-time.
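For what it's worth, the standard local defense against this particular failure mode is a lock-ranking discipline. Here is a minimal sketch in OCaml (assuming OCaml 5, where `Mutex` and `Fun.protect` live in the stdlib; all names are illustrative):

```ocaml
(* Lock-ranking sketch: every lock carries a rank, and multi-lock
   acquisition always proceeds in rank order, so no wait-for cycle can
   form. Ranks must be distinct per lock. *)
type ranked_lock = { rank : int; m : Mutex.t }

let make_lock rank = { rank; m = Mutex.create () }

(* Acquire two distinct locks in global rank order, run [f],
   then release them in reverse order. *)
let with_both a b f =
  let first, second = if a.rank <= b.rank then (a, b) else (b, a) in
  Mutex.lock first.m;
  Mutex.lock second.m;
  Fun.protect f
    ~finally:(fun () -> Mutex.unlock second.m; Mutex.unlock first.m)

let () =
  let driver = make_lock 1 and manager = make_lock 2 in
  (* Both call sites lock [driver] before [manager], regardless of
     argument order, so they cannot deadlock against each other. *)
  assert (with_both manager driver (fun () -> 42) = 42);
  assert (with_both driver manager (fun () -> 7) = 7)
```

The point of the anecdote stands, though: this only helps when every component that shares the locks agrees on the ranking, and code assembled at deployment time from rival vendors never will.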
I was sent to work with this customer because they were suffering a cumulative load of a couple of hangs a day, every day, on their main customer-facing application. It was a nightmare for them, and during my time fixing things (6-10 different bugs) three senior managers lost their jobs (one basically the day before I arrived, one halfway through fixing all the bugs, and one at the end of the process). Much of this nightmare was due to concurrency issues. The rest was due to a different problem, which I will discuss below.
(2) I tried to write a lock-free-reader variation of java.util.Hashtable (at the time, every method was synchronized). I thought I knew what I was doing; after all, I’d just finished working with David Bacon to put “flat locks” into the IBM JVM on Power/x86/Z. When I got “done”, I showed the code to David, who pointed out several errors I’d made. I then went back and tried to fix these errors, and again showed David, who pointed out (again) several errors. He then fixed the code, and I came to the conclusion that real SMP parallel code was unsafe for me to write.
Stipulate that we all know how to write safe parallel code at the small-scale. Even a few steps up in size, and I was unable to deal with the complexity. But hey, maybe most programmers are smarter than me.
(3) A natural result of allowing SMP parallelism within an address space is that programmers will start stuffing more and more into single address spaces (because there is no visible short-term cost). These processes then get slower to start up, which means they cannot be allowed to "fail fast". This process had reached its flesh-eating end in the JVM by around 2000, and has never gotten better: I recently had a conversation with a guy who works for a major streaming-video company, who told me how they run JVMs with 1000+ threads, and that he didn't think it was problematic to keep a JVM running after it had started throwing faults, because they took a while to start up, and gosh, it would be wasteful to have multiple JVMs around just for fault tolerance when you could just keep running the one you've got.
Whereas if you don't allow SMP parallelism, the short-term cost is that you can't take advantage of multiple cores, so programmers will try to use multiple processes instead.
That's entirely understandable, and I genuinely agree that writing SMP code is hard. However, 90% of the time it brings advantages by improving performance, so if the language cannot provide that ability, most developers will switch languages. That is exactly what happened with the GPG key-server story (their reasoning was bad, and OCaml wasn't the problem in that case, but the line of reasoning was logical). Working with BAP (Binary Analysis Platform), I often run into the problem that I have a 32+ core machine, yet the program analysis of a 100 MB file uses just a single core. It is very depressing.
Don’t be depressed. Be glad. It could be so much worse.
A story and a question:
(1) Are you aware of the work of Anastasia Ailamaki and her student Ippokratis Pandis? This work is noteworthy because Ailamaki is perhaps the last of the hard-as-nails, serious-as-death big-SMP-database researchers. She's done things like increase log-append speeds with interesting lock-free techniques. Nobody can pretend she's not a big-ass-database person. And yet she and her student found that to really get significant throughput improvements in SMP databases, it pays to treat them as if they were a cluster: to program them as if you were programming a Bigtable cluster, in short.
(2) Why aren’t you able to run multiple processes to achieve this?
If the answer is “I shouldn’t have to”, two thoughts:
(a) even at places like Google, typically one does not use multiple threads for unrelated activities (and partitionable workloads fit this category)
(b) partitioning in this manner allows fail-fast fault-tolerance, a 100% good thing. On failure, restart the (subsidiary) task, not the (entire) job
It would be interesting if your analysis task was not partitionable in a meaningful sense.
ETA: I read a little about BAP. It seems like there might be a natural partitioning of work into that which is done per-function, and that which uses interprocedural information? And the former ought to be partitionable? Have you thought about maybe importing map/reduce as a structuring technique? (it is, after all, merely hash-join, hence universal). It might even be interesting to consider putting an SQL-like interface into BAP; at that point, compiling queries down to map/reduce would be even more natural.
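To make the "map/reduce as a structuring technique" suggestion concrete, here is a minimal OCaml skeleton; `analyze_fn` stands in for a hypothetical per-function pass and is not a real BAP API:

```ocaml
(* Minimal map/reduce skeleton for a partitionable analysis.
   [analyze_fn] is a hypothetical per-function pass (the "map" step,
   independently parallelizable); [merge] combines partial results
   (the "reduce" step). *)
let map_reduce ~analyze_fn ~merge ~init functions =
  functions
  |> List.map analyze_fn
  |> List.fold_left merge init

(* Toy example: each "function" carries an instruction count, and the
   whole-program result is their sum. *)
let () =
  let fns = [ ("f", 10); ("g", 25); ("h", 7) ] in
  let total = map_reduce ~analyze_fn:snd ~merge:( + ) ~init:0 fns in
  assert (total = 42)
```

The `List.map` step is where the per-function work could be farmed out to worker processes, one partition per core, with only the merged summaries crossing process boundaries.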
Not all problems can be solved with processes.
Take multimedia pipelines as an example: I have a process that takes buffers from multiple UDP sources, parses the mpeg/whatever packets stored within the UDP payloads, decodes video, merges frames from various streams, adds effects, etc.
Each step of this pipeline can run in a parallel thread without any problems. And avoiding deadlocks is quite trivial thanks to the very straightforward architecture: stages simply pass objects to one another, one way or another. That's how gstreamer/directshow work.
Split this pipeline into processes and you are either doomed to mmap hell, or your performance will be dismal due to copying.
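For the thread-based version being described, here is a minimal OCaml sketch of the stage-to-stage handoff, assuming OCaml 5 (`Domain`, `Mutex` and `Condition` in the stdlib). Each stage owns a blocking queue, and buffers are passed downstream rather than shared:

```ocaml
(* One stage's inbox: a blocking FIFO protected by a mutex. Upstream
   pushes buffers, downstream pops them; a buffer is owned by exactly
   one stage at a time, so stages never mutate it concurrently. *)
type 'a queue = {
  mutable items : 'a list;            (* newest first; reversed on pop *)
  mutex : Mutex.t;
  nonempty : Condition.t;
}

let make () =
  { items = []; mutex = Mutex.create (); nonempty = Condition.create () }

let push q x =
  Mutex.lock q.mutex;
  q.items <- x :: q.items;
  Condition.signal q.nonempty;
  Mutex.unlock q.mutex

let rec pop q =
  Mutex.lock q.mutex;
  match q.items with
  | [] ->
    Condition.wait q.nonempty q.mutex;  (* releases mutex while waiting *)
    Mutex.unlock q.mutex;
    pop q                               (* re-check: wakeups can be spurious *)
  | items ->
    (match List.rev items with
     | oldest :: rest ->
       q.items <- List.rev rest;
       Mutex.unlock q.mutex;
       oldest
     | [] -> assert false)

let () =
  let decoded = make () in
  (* a "decoder" stage produces frames; the main domain plays "effects" *)
  let decoder = Domain.spawn (fun () -> List.iter (push decoded) [1; 2; 3]) in
  let a = pop decoded in
  let b = pop decoded in
  let c = pop decoded in
  Domain.join decoder;
  assert ([a; b; c] = [1; 2; 3])
```

No lock ordering to reason about: each queue's mutex is held only briefly inside `push`/`pop`, and stages interact with nothing else.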
There are two separate advantages of multicore from my perspective:
Using processes for parallelism is a pain. Processes are awkward: using them involves OS calls that differ between operating systems (for example, fork is cheap on Unix but isn't on Windows); it involves heavy use of system calls for inter-process communication (pipes and sockets are relatively expensive); it's not type safe (or particularly secure), as everything needs to be serialized between processes; and keeping track of process liveness is also painful.
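As a concrete illustration of the serialization point, here is a minimal sketch using the Unix library's `fork` and `Marshal` over a pipe. This is exactly the kind of code that doesn't port to Windows, and note that the type of the unmarshalled value is taken on faith, not checked:

```ocaml
(* Process-based parallelism in miniature: fork a worker and marshal
   the result back over a pipe. Everything crossing the process
   boundary must be serialized. Unix-only. *)
let run_in_child (f : int -> int) (x : int) : int =
  let r, w = Unix.pipe () in
  match Unix.fork () with
  | 0 ->
    (* child: compute, marshal the result, exit *)
    Unix.close r;
    let oc = Unix.out_channel_of_descr w in
    Marshal.to_channel oc (f x) [];
    close_out oc;
    exit 0
  | pid ->
    (* parent: read the child's result, then reap it *)
    Unix.close w;
    let ic = Unix.in_channel_of_descr r in
    let result : int = Marshal.from_channel ic in
    close_in ic;
    ignore (Unix.waitpid [] pid);
    result

let () = assert (run_in_child (fun n -> n * n) 7 = 49)
```

The `: int` annotation on the unmarshalled value is a promise the compiler cannot verify; lie about it and the program may crash at runtime, which is precisely the type-safety gap mentioned above.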
On the flip side, transitioning from a process-based model to a distributed one is less difficult, assuming we handwave some details away.
Multicore gives you a way to manage all of that from one OCaml application, and that’s huge. It makes OCaml a far better candidate to be used in low-level applications where it currently has a tough time competing, unless you’re a huge company with the resources and dedicated manpower to set up process-based parallelism just the right way.
While sharding works in many instances, there are applications where you want to share a lot of memory between cores. Doing this via serialization between processes over pipes is not practical, and process-based memory sharing is an even worse experience in OCaml. This is also where the danger of multicore comes in: until you share memory, there's no problem. But using a shared memory area in OCaml is very limited due to the tracing GC (unlike, say, Python), unless (once again) you're a huge company like Jane Street or Facebook and therefore have the ability to configure your shared memory area to store well-defined structured data for a specific purpose using a ctypes-like FFI.
Here, the correct approach will be to avoid sharing of mutable memory between threads whenever possible. Sharing immutable (or near-immutable) memory, however, will provide a huge benefit, as one core can write the data and other cores can read it, amortizing the cost of immutable data structures. The option to share mutable data exists, but it should be avoided, with minor exceptions for people who know what they’re getting into.
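A minimal sketch of that pattern, assuming OCaml 5 domains: the array is filled once before any domain is spawned and only read afterwards, so the reading domains need no synchronization on the payload itself:

```ocaml
(* Immutable-sharing sketch: [data] is written exactly once, then
   shared read-only across domains, each of which sums a disjoint
   half in parallel. *)
let parallel_sum (data : int array) =
  let n = Array.length data in
  let sum lo hi =
    let s = ref 0 in
    for i = lo to hi - 1 do
      s := !s + data.(i)
    done;
    !s
  in
  let mid = n / 2 in
  let right = Domain.spawn (fun () -> sum mid n) in  (* second half, in parallel *)
  let left = sum 0 mid in                            (* first half, here *)
  left + Domain.join right

let () =
  let squares = Array.init 1_000 (fun i -> i * i) in
  assert (parallel_sum squares = Array.fold_left ( + ) 0 squares)
```

The array is technically mutable, so the "write once, then only read" discipline is a convention here, not something the type system enforces; that is exactly the kind of thing that should be kept to a few well-audited spots.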
All in all, I’m really looking forward to multicore, and I think it’ll provide yet another push for OCaml’s adoption.
Um, why? I remember decades ago, the X-windows client lib had a “shm” extension (to speed up comms between client and server when on the same machine). It was implemented by having each end format messages into buffers in the SHM segment, and send pointer/length to the other end, which would -return- that same pointer/length when it was done. So, the -ownership- of the message-regions was managed by sending them via TCP. You could think of the SHM message-regions as just a different way of marshaling messages. I’ve done a similar thing myself for the same reasons.
So one could create a big-ass memory-region, and each worker process could mmap it; then they could communicate with each other in the normal way, using TCP/RPC, and send “ownership tokens” to transfer ownership of blocks of the mmaped region. In your example, there is a “producer” (source of frames) and a “consumer” (sink of frames) and a bunch of stuff in-between. So the consumer would need to “send” frames to the producer, but otherwise, this seems pretty straightforward.
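To make the protocol concrete, here is a tiny OCaml sketch of the message types involved; the single-process queue stands in for the real TCP/RPC channel, and all names are illustrative:

```ocaml
(* The ownership-token idea in types: the shared region itself is never
   sent anywhere; processes exchange (offset, length) tokens over an
   ordinary channel, and whoever holds a token is the sole reader and
   writer of that slice. *)
type token = { offset : int; length : int }

type message =
  | Grant of token     (* "you now own this slice of the mmaped region" *)
  | Release of token   (* "I'm done; the space can be reused" *)

(* Single-process stand-in for the two ends of the channel: the
   producer grants a slice, the consumer hands ownership back. *)
let round_trip () =
  let channel = Queue.create () in
  Queue.add (Grant { offset = 0; length = 4096 }) channel;
  let reply =
    match Queue.pop channel with
    | Grant t -> Release t
    | Release _ -> assert false
  in
  reply = Release { offset = 0; length = 4096 }

let () = assert (round_trip ())
```

The safety argument is the same as in the X-SHM case: at any instant each slice has exactly one owner, so there is shared memory but no concurrent access to any given byte of it.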
Because shared memory is hell? Using processes with shared memory instead of threads is simply dead wrong. If you want shared memory, you want threads (in most cases). Sharing memory between a bunch of processes complicates the issue beyond any reasonable limit.
and send “ownership tokens” to transfer ownership of blocks
What if the process needs more space? You need an orchestrating process which manages memory allocation and notifies processes that there is a new page available.
And you need all that extreme complexity for what reason, again? To solve a problem long since solved with fibers/threads? That doesn't make any sense.
I think it’s good that we have such vigorous discussions! grin And I’m sure that I’m not going to convince you; just writing down the counter-arguments.
Um, some responses:
(1) opinions differ on whether it is shared memory, or concurrent access to shared memory that is the problem. The singular attribute that makes the X-SHM protocol so tractable, is that each message-region is “owned” by a single process, and other processes neither read nor write that message. Also, keeping this distinction between “memory that is sharable by other threads” and “memory that is private to this thread” makes it much simpler to write code.
Also, I’ll note threading on UNIX came after processes and shared-memory. Long after, as a matter of fact.
(2) Since these ownership tokens are for decent-sized blocks, it’s straightforward to include in them shm-ids (along with probably “suggested addresses to which to map them”). Then there’s no need for an orchestrator process; any process can create new (should be large) segments (not at the granularity of pages) and pass chunks of 'em around. The only thing that’s needed, is for there to be an “eventual parent” that accumulates all the shm-ids, so it can delete them all after the computation is complete.
(3) In any memory-intensive application, it’s necessary to explicitly manage the largest class of memory, and outside the heap. This is imperative for performance. And while GC “researchers” have been telling us (and me) that the next great advance in GC will render such explicit (“memory pooling”) management superfluous for … 30+ years, it is as necessary today as it was in … 1989. In Java we know these things as “ByteBuffers”, IIRC. They’re outside the heap, and if you’re going to do high-performance work, you learn to use 'em.
Wow, I go away for a few days and return to this forum to find I’ve entered a time portal back to 2012
@XVilka, you’ve been around here for long enough that you should know that the title of this post is simply incorrect, and I dislike answering loaded questions. As @gemmag notes, multicore OCaml wouldn’t exist without Jane Street’s sponsorship over many years of hard work. Please edit the title of this post to correct it for the record.
You've also posted in another thread just 15 hours ago, which shows you are aware of the active multicore PRs on the OCaml issue tracker, so I'm confused by the implication that there is no one working on multicore OCaml. Are you simply disappointed that it isn't finished yet and not shipping overnight?
I understand your desire to just solve your problem and have parallelism for performance. Some thoughts:
There are active multicore PRs that are complex, affect all architectures and distributions, and require extensive testing and feedback. You can for example look at ocaml/ocaml#8713 and help verify that it doesn’t regress on your codebases.
You’ve posted about performance problems you’re seeing but not really followed up on that with any constructive feedback. It could well be the case that multicore will help with parallel access to some large shared memory structure. It’s a pretty good time to profile your application and to see if it’s a good candidate for implementation within the multicore OCaml branches.
Multicore OCaml is making steady forward progress, but requires painstaking benchmarking and careful design to ensure we don’t mess up the lovely single core experience that has served us so well for the past few decades. The reality of the work is that we spend most of our days poring over benchmark results at the moment to understand the multivariate effects of even the smallest changes in the runtime. Take a look at https://github.com/ocaml-bench/sandmark – well-explained macrobenchmark contributions are welcome here.
This thread is so far full of rather well-trodden discussions that we've seen many times over the years. I'd encourage you all to look at the exciting PRs flowing into ocaml/ocaml at the moment and get involved with testing them and providing concrete feedback to help make multicore ship instead!
Generally as a contribution rule, if you see a PR that has been lingering for a while and want to help get it merged, do not just post a “ping” comment on that issue. Instead, take a few moments to clone the PR and build it, check its status against the current master, and see if you can post even a short update of your results along with your query. This will contribute to the PR – even a little more new information is often useful. I’m looking forward to seeing more testing feedback on our various GitHub trackers!