Interesting! I wasn’t aware of these instructions. On x86, it seems that only Skylake and newer have good performance, and only for gathers (see here). AMD Ryzen apparently implements them really badly, which is a problem since Ryzen is currently the most promising architecture. So this may eventually become viable and is worth looking into. It would indeed make a functional list, for example, perform closer to a vector in some instances (cache issues aside).
I couldn’t really find in-depth information on ARM’s implementation of these instructions.
It’s probably the indirect access patterns; most of them are fairly complex. It’s even harder in OCaml, where you call unknown functions (map, fold, etc.) rather than writing loops. Haskell cheats using its rewrite rules.
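To make the indirection concrete, here is a minimal OCaml sketch (the names are just for illustration):

    (* List.map calls a closure it knows nothing about for every element,
       and every cons cell is a separate heap block, so both the control
       flow and the data accesses are indirect. *)
    let double_list xs = List.map (fun x -> 2 * x) xs

    (* A plain loop over an int array is contiguous and fully known to
       the compiler, so it is much easier to optimize (or vectorize). *)
    let double_array a =
      for i = 0 to Array.length a - 1 do
        a.(i) <- 2 * a.(i)
      done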
Agreed, but I think that’s a problem. The default data structure in a language that aims at performance shouldn’t be a linked list.
I’m not sure that users really enjoy writing their code in a vectorized style. I know that I don’t. But it’s the only way to get performance out of libraries like numpy and owl, which rely on stringing together kernels implemented in a low-level language. One of the selling points of Julia, for example, is that you get to write regular old loops and still have fast code.
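As a rough illustration in plain OCaml (no particular library assumed), the two styles look like this:

    (* "Vectorized" style: each whole-array operation is a separate pass
       (in a numpy/Owl-like library, a separate kernel call) and builds a
       temporary array. *)
    let axpy_vectorized a x y =
      Array.map2 ( +. ) (Array.map (fun xi -> a *. xi) x) y

    (* Loop style: a single fused pass with no temporaries -- the kind of
       code Julia lets you write directly without losing performance. *)
    let axpy_loop a x y =
      Array.init (Array.length x) (fun i -> a *. x.(i) +. y.(i))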
You can definitely do more with vector instructions than crunch numbers. For example, I think parsing is usually considered a good fit for functional programming. This paper (https://arxiv.org/pdf/1902.08318.pdf) describes how to use SIMD to parse JSON at GB/s speeds, which I think is pretty neat.
Reading the beginning of the paper, I got the impression that this is a very specific use case which may not scale to non-JSON languages, where the grammar has more alternatives. Frankly speaking, most of the SIMD benefit in this paper goes to the lexing phase, not the parsing itself.
By using tuples you destroy performance, as each tuple is allocated. I would suggest the following interface,
val cmpeq_i8 : src1:t -> pos1:int -> src2:t -> pos2:int -> dst:t -> pos:int -> unit
or just without labels.
With noalloc, a call to C is just a jump, so it is pretty good: it is as cheap as you can get. But the C code is not available to the OCaml compiler, so the latter cannot perform global optimization across the OCaml/C boundary. You may, however, try Duplo, which does cross-language optimization, and see if it makes things better.
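For reference, a minimal sketch of what such a declaration could look like (the definition of t and the C stub names here are just placeholders):

    (* Assume the data lives in a Bigarray, so the C stub can read and
       write it in place, without copying or allocating on the OCaml heap. *)
    type t =
      (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

    (* [@@noalloc] asserts that the stub neither allocates on the OCaml
       heap nor raises, so the compiler emits a direct call with no GC
       bookkeeping around it. Two stub names are given because the
       external takes more than five arguments (bytecode vs native
       calling convention). *)
    external cmpeq_i8 :
      src1:t -> pos1:int -> src2:t -> pos2:int -> dst:t -> pos:int -> unit
      = "caml_cmpeq_i8_bytecode" "caml_cmpeq_i8" [@@noalloc]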
@ivg Ah! Thank you for that insight. That is indeed a good optimization! Yes, Duplo does look promising. Being able to inline the call would mean that even the overhead of the jump is eliminated!
Right, I should have been clearer. Gather loads came first, because scatter stores are a pain for chip designers, since they have to handle the case of conflicting store addresses. As for speed, it is not that poor. Sure, a gather load is much much slower than a regular vector load. But one has to keep in mind that several cache lines are accessed at once, rather than just one.
But that is not even what matters. The bottleneck today hardly ever lies in the execution units of a processor; it lies in the instruction decoding unit. A superscalar processor will usually be able to decode about 3 scalar instructions per cycle, or 1 vector instruction. Now, even if a predicated 4-way gather load is slow and takes 12 cycles, this has to be compared with the time it would take to decode all the other instructions needed to perform the exact same task. More importantly, since the processor spent so little time decoding a gather load, it can now focus on things that actually matter for performance: prefetching, prediction, speculation, and so on.
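To put rough numbers on that (purely illustrative, using the figures above): emulating a predicated 4-way gather in scalar code needs something like 4 address computations, 4 loads, and a few inserts to rebuild the vector, say a dozen instructions. At about 3 decoded per cycle, that is already around 4 cycles of decode bandwidth that cannot be spent on anything else, versus a single decode slot for the gather.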
There is not much to say. Since ARM chips are customizable, your compiler does not know a priori how wide the vector registers are. You could even end up with a processor where vector instructions have just 1 way, so they would be the same as scalar ones (except for their encoding). Let us ignore this degenerate case and assume registers actually contain several cells. Even so, you can still assume that gather loads are internally broken into scalar loads and thus not faster (as above, because of cache lines). But again, that is not what matters. As explained, the pressure on instruction cache and decoding decreases. So, if nothing else, you will at least get a reduction in power consumption.
To summarize, SIMD instruction sets were not introduced in mainstream processors because vector execution units are faster than scalar ones, but because vector instructions are comparatively much cheaper to decode than scalar ones.
I followed your last post up to this point, where you lost me. Sure, decoding is one (minor) advantage, but the point of SIMD is that operating over batches of data is far more efficient bandwidth-wise. That’s the starting position for any SIMD processing. Scatter/gather is an enhancement that costs bandwidth (depending on how many vector instructions you can string together) while hopefully performing the task of scattering/gathering cheaper than doing it normally. If all you’re doing is saving a few instructions in a loop, the effort of supporting vector operations is near meaningless IMO.
You’re right that due to cache locality issues, once you’ve broken the concept of tight batches of data, you’re likely to pay cache fetch/write costs with gather/scatter that completely negate the advantage of vector operations. This is why the concept still fits best in the context of data that is tightly packed into arrays.
These sorts of things really need numbers. Note though that even if everything is in icache, there is a limit on how many instructions can be fetched per cycle, and how many instructions the CPU can decode per cycle, and how many have the decoded form cached, etc. So it still makes a difference to bandwidth.
There might be a misconception. For example, a modern x86 processor has no trouble retiring four (fused) micro-operations per cycle. And since four of its eight ports can execute ALU operations, performing four additions per cycle is perfectly feasible (or any other ALU operation). In other words, you do not need a SIMD instruction to perform four additions per cycle; scalar additions are just fine. But if you are using scalar additions, the decoding unit will struggle to decode enough instructions to issue four micro-operations per cycle [*]. That is why the SIMD instruction suddenly becomes relevant: by using SIMD instructions, the decoding unit can keep up with the execution units.
So, you are right that it is a matter of bandwidth. But it is not just a matter of data bandwidth; it is also, and more prominently, a matter of instruction bandwidth. One should not underestimate the cost of decoding an instruction. As a rule of thumb, you can consider that the decoding unit of an x86 processor has more than half as many transistors as all its execution units combined. So, it would be extremely cheap to add new execution units to the processor, except that they would be left unused because the decoding unit cannot keep up anyway.
[*] If your additions happen to be packed in a very short loop (less than 28 micro-operations), then the so-called Loop Stream Detector will kick in. As a consequence, the decoding unit is no longer used, so you get optimal throughput even with scalar instructions.
Just to add to the discussion: not only speed matters, but also power dissipation and the physical constraints of the chip. Instruction decoding, and instruction stream delivery in general, is responsible for 30-40% of the power consumption. Therefore, a denser encoding may keep throughput the same while leading to lower power consumption and smaller decoders. These savings could be used to deliver cheaper/more efficient chips, or spent on additional execution units, i.e., to get more performance.