I was certainly not implying that the OCaml multicore developers were making this claim. Voices both inside and outside the OCaml community sometimes present the lack of multicore support in the runtime as a major reason for the slow adoption of OCaml. Besides believing that the two are almost entirely unrelated, I find that many people’s enthusiasm for multicore (or parallelism in general) does not match reality. I’m just trying to instill some realistic expectations.
Support for explicit scheduling is surely a necessity for achieving good performance on heterogeneous hardware, and a well-taken design decision. Inferring the operational behavior from code automatically is pretty much hopeless, especially considering that hardware parameters like the number of cores, cache hierarchies and sizes, synchronization overhead, etc., would all have to be factored in for efficient scheduling.
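To make "explicit scheduling" concrete, here is a minimal toy sketch of my own (not anything from the multicore design docs), using OCaml 5's `Domain` module. The programmer explicitly decides the split point and which domain computes which half; the even split is an arbitrary choice that a real scheduler would have to tune:

```ocaml
(* Explicitly scheduled parallel sum: one spawned domain takes the
   upper half of the array, the current domain takes the lower half.
   The 50/50 split hardcodes assumptions about core speed and cache
   behavior that may not hold on other machines. *)
let par_sum (a : int array) : int =
  let n = Array.length a in
  let mid = n / 2 in
  let d =
    Domain.spawn (fun () ->
        let s = ref 0 in
        for i = mid to n - 1 do s := !s + a.(i) done;
        !s)
  in
  let s = ref 0 in
  for i = 0 to mid - 1 do s := !s + a.(i) done;
  !s + Domain.join d
```

Even in this trivial example, the scheduling decision (two domains, even split) is baked into the code, which is exactly the kind of choice that depends on the hardware parameters mentioned above.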
That said, the vast majority of programmers would be hopelessly overwhelmed by doing that manually. In fact, most programmers are probably already overwhelmed by correctness considerations when dealing with parallelism, never mind performance. Explicit scheduling will be a boon for a small group of people with highly specific and not overly complicated needs.
In a particular problem I am dealing with right now, explicit scheduling would likely not help much even though there are typically ample parallelization opportunities: the execution of the user program is traced, followed by a reinterpretation that is operationally quite different. For example, reads in the original (and possibly even purely functional) program become writes in the transformed one, which may introduce cache coherence issues that did not exist in the user program. The whole point of the automatic program transformation is that it is infeasible for a human to perform, and the same would hold for explicit scheduling.
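A toy sketch of my own (not the actual transformation I am working on) of how reads can become writes: in a tape-based reverse pass, as used in reverse-mode automatic differentiation, every place that merely read a value during the forward computation turns into a write accumulating into the corresponding adjoint. Several forward reads of the same value become several writes to the same mutable field, which is where the new coherence traffic comes from:

```ocaml
(* A node records its value, its adjoint (written only during the
   backward pass), and its parents with local derivatives. *)
type node = {
  value : float;
  mutable adj : float;
  parents : (float * node) list;  (* (local derivative, parent) *)
}

let leaf v = { value = v; adj = 0.; parents = [] }

(* Forward pass: only READS a.value and b.value. *)
let mul a b =
  { value = a.value *. b.value; adj = 0.;
    parents = [ (b.value, a); (a.value, b) ] }

let add a b =
  { value = a.value +. b.value; adj = 0.;
    parents = [ (1., a); (1., b) ] }

(* Backward pass: each forward read becomes a WRITE to the parent's
   adjoint, possibly from several sites.  (This naive recursion is
   only correct when interior nodes are not shared; a real
   implementation would process a topologically ordered tape.) *)
let rec backprop n =
  List.iter
    (fun (d, p) ->
       p.adj <- p.adj +. (d *. n.adj);
       backprop p)
    n.parents
```

For instance, differentiating `y = x * x + x` at `x = 3` writes to `x`'s adjoint three times (twice from the product, once from the sum), yielding the derivative `2x + 1 = 7`, even though the source expression never mutated anything.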
Other big issues are portability and differing workloads: even if I wrote the perfect scheduler for my platform and working-set size, the application will likely be ill-tuned once I send it to other people, upgrade my machine, or change the size of the problem. Some amount of (semi-)automatic parallelism, as in “tune the scheduler and user code for this platform and problem size”, seems inevitable. Merely tuning cache usage, without any parallelism at all, can already be fairly involved (e.g. see the ATLAS linear algebra library).
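The most basic form of such tuning is simply querying the running machine instead of hardcoding parameters. A minimal sketch, assuming OCaml 5 (`chunk_size` is a hypothetical helper of mine, and a real system would measure far more than the core count):

```ocaml
(* Derive the degree of parallelism and a work-chunk size from the
   machine and the problem size at run time, rather than baking in
   numbers that only fit the developer's machine. *)
let workers = Domain.recommended_domain_count ()

let chunk_size ~problem_size = max 1 (problem_size / workers)
```

This addresses only the crudest portability problem; cache sizes, synchronization costs, and the shape of the workload would still require the kind of search-based autotuning that ATLAS performs.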