Naive multicore question: pmap

multicore

#1

In Clojure (the functional language I know best), there is a primitive pmap that works exactly like map over list-like structures (e.g. like List.map in OCaml) except that pmap splits the input sequence into chunks and then applies the function argument in different threads on different cores. The number of threads depends on available cores. The returned value is the sequence that map would return.

Whether pmap improves speed depends on how much work the mapped function is doing, but sometime you can arrange your code so that a lot of the work is in one function that is mapped. It’s then easy to check whether pmap improves speed by adding and removing “p” before “map”. :slight_smile: Obviously this is only useful in some situations. (Clojure has other, more flexible primitives for parallelism as well.)

Is it likely that multicore will allow similar functionality in OCaml. i.e. would it be easy to add something like pmap to List, Array, LazyList, etc. (perhaps in Batteries or Jane Street’s Core)?


#2

There are some solutions which allow mapping over multiple cores:

(From ocaml-awesome, I selected those where I could find a map quickly.)


#3

Thanks Leonidas. I had forgotten that forking processes was an option, and those libraries look like they make that very easy. Great.

I would imagine that forking processes means that the mapped functions have to do more work to justify the cost than if threads were run on different cores.

I still my original question about the OCaml Multicore project: Is it likely that it will be easy to create pmap-style functions that will make it easy to split mapping between different cores, but without the cost of a fork?


#4

It’s actually a little more subtle than that. If the original process or the mapped functions (in their own processes) do too much work, they’ll create enough garbage for a full major GC. This involves marking the entire heap, which will trigger the copy-on-write unix fork mechanism to copy almost the whole heap from the original process. So there’s a very narrow window where this is actually efficient, and it’s not efficient at all on Windows, which doesn’t have copy-on-write forks.

Absolutely. This is a very simple use-case of multicore: it’s a simple master-workers pattern, and it’s deterministic.


#5

Thanks much @bluddy.
Very interesting about forking.


#6

After a little bit of refactoring that didn’t change my code’s speed, I made a small change in it to use Parmap.array_float_parmapi. A computation that took 25-27 seconds in native code on my 4-core MacBook Pro now takes between 0.018 and 0.0211 seconds to produce the same result.

Such a large speed improvement is simply unacceptable. I expected a 3X or 4X improvement, but not 1000X to 1500X ! When performing an expensive computation, you must feel it. You should know that you’re doing important work because it is taking so long to finish.

It is OK to speed things up, but this is just wrong. And Parmap is to blame.

(And yes, I did test to make sure that the results were the same. Amazing. No doubt the specialized float array processing provided by array_float_parmapi is part of the explanation.)


#7

Wait… not sure this is all correct. More investigation needed…

I was using Sys.time to measure function execution time, but this seems to be inaccurate with Parmap. Unix.getttimeofday is more accurate.


#8

Maybe just talking to myself … but in case anyone’s interested in my experience with Parmap, I think I’ve now tested pretty much correctly, using gettimeofday and /usr/bin/time–and I will verify it using a better measure.

So, no, there was not a miraculous 1500X speed-up due to Parmap. On a later run I saw what looked like a 30K speedup. That really made me suspicious! (If anyone here has ever read Greg Egan’s novel Permutation City, … I was starting to wonder whether Parmap was doing Permutation-City-style computation.)

However, I can now report that I really did see an 18X speedup, which is pretty impressive for a 4-core machine, and makes the work I wanted to do a lot more feasible–thanks to Parmap's authors, Marco Danelutto and Roberto Di Cosmo.


#9

I’d be interested to know how exactly you got even 18x speedup. Your 4-core MBP has hyperthreading, so it’s effectively almost 8 cores, but there’s still a factor of 2 there that doesn’t make sense. It’s possible that there’s some GC overhead in the single-core version when dealing with a lot of data, that doesn’t exist when you ship the data off to separate forks.


#10

@bluddy, well I think I messed up again. I ran the experiment today, and had a 14X speedup. But I don’t think that’s right.

I’m still using gettimeofday, and also using Sys.time in a separate calculation. Comparing the Sys.time number for Parmap seems meaningless. That number is so small that I think it’s just the amount of CPU used to fork the processes and whatever happens before and after that in the same function. So I’m not using that number any more.

When I give the 18X or 14X numbers, what I’m doing is I’m dividing the gettimeofday duration of the Parmap version by the same duration for a non-Parmap version.

However, this particular test runs for a very long time–7 minutes for the Parmap version, but an hour and a half or two hours for the other version. And I realized that on the single-process version, there is a huge difference between the gettimeofday duration and the Sys.time duration. To me this suggests that there are other processes on my machine that are eating up the CPU time. Yesterday I might have had an automated backup running. Maybe the virus checker was scanning today. And, if I divide the single-process CPU time by the Parmap-version wall time, the ratio is a bit more than 4X. That seems more reasonable. (I am confused about hyperthreading. My understanding was that in terms of processing power, I still only have four cores, even though the OS thinks there are eight.)

It seems as if in order to get an accurate measure of the difference between the Parmap and non-Parmap versions, I need to suspend anything that might be doing significant work in the background, or run the tests in single-user mode, or set up a more complicated measuring scheme that will report from the subprocesses. Not sure any of that is worth the trouble to me at this point. I’m happy about the additional speed that I’m getting. (Maybe Core_bench will be able to measure more accurately?)

(Btw I accept @dbuenzli’s point that getttimeofday can be messed up by changes to the system clock, and will switch my testing at some point, but I don’t think this is likely to make much a difference in long-running processes.)


#11

Yeah – benchmarking can be tough.

One thing to consider is that if you have enough data to require pmap, it’s quite likely that performance is dominated by hard drive/memory cache issues rather than any CPU-related optimization. This could explain the unreasonable-seeming speedup you’re getting.

And btw hyperthreading is like 2 cpu cores that share most of their innards. Under ideal situations, they can act like 2 separate cores, but in the worst case scenario (maximum contention), they’ll act like one core.


#12

Ah, thanks for explaining about hyperthreading. That makes sense of everything I’ve heard. It’s like 8 CPUs or like 4 CPUs, depending on the situation, so neither is exactly right in general. I might try playing with the number of processes parameter in Parmap. By default it seems to use seven on my four-core machine, and (I think) two on my two-core machine.

Ah, hmm. OK. The test constructs two 1000x1000 matrix from four matrices of the same size. I use Owl's matrix arithmetic for parts of this, but most of it has to be done by hand, so there’s a small but significant bit of calculation needed to create each element in the new matrices. It might be interesting to check whether there’s swapping happening. I don’t think so, but I haven’t watched carefully, and I certainly haven’t messed around with any OCaml memory configuration parameters. I’ll leave the operating system and hardward to deal with caching … :slight_smile: I’m not going for every little bit of optimization.