How to speed up this function?

I only suggested using the ReScript compiler to see how to rewrite a tail-recursive function as a while loop, and then compiling it with ocamlopt, which produces native code. That doesn't mean you should conclude that OCaml is no better than JS.

Choosing OCaml for its performance is a good choice.
It's quite often not far from C.
And the C FFI is simple and easy to use in OCaml, so you can write just a few functions in C where it matters, which is still better than having to write your whole program in C.
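For concreteness, here is a minimal sketch of what such a binding looks like; `fast_sum` and the stub name `caml_fast_sum` are invented for illustration:

```ocaml
(* hypothetical example: sum a float array in C.
   The C stub (compiled and linked separately) would look like:

     #include <caml/mlvalues.h>
     #include <caml/alloc.h>
     CAMLprim value caml_fast_sum(value arr)
     {
       double acc = 0.0;
       mlsize_t n = Wosize_val(arr) / Double_wosize;  /* flat float array */
       for (mlsize_t i = 0; i < n; i++) acc += Double_field(arr, i);
       return caml_copy_double(acc);
     }
*)
external fast_sum : float array -> float = "caml_fast_sum"

let () = Printf.printf "%f\n" (fast_sum [| 1.0; 2.0; 3.0 |])
```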

2 Likes

Am I confused, or should NaN-boxing be able to support around 48-bit unboxed integers, not just 32? From a symbolic computation perspective, using an unboxed integer as a “practically unoverflowable” counter is common, and 48 bits would be much better than 32.

Just as one data point: I would not want to reduce the unboxed integer width from 63 to 32 in exchange for unboxed floats (for which I have almost no use). Reducing from 63 to 48 could probably be survived.

1 Like

I think this is a great solution for high-performance optimization in specific instances, such as wanting a specific record to be unboxed, but I’m not sure it is good enough for floating-point computation. The biggest issue is that manually unboxed types cannot be handled polymorphically.

I did want to try to convert the OCaml runtime to NaN-boxing at some point. It does seem to be a rather messy solution though.

Other options:

  • Making the runtime representation fatter, by which I mean doing what Haskell does. If we add a word to each allocated object, we can record the number of pointer values in a data type (assuming all pointer values are mapped to the beginning of the data structure behind the scenes). A similar thing is then done at stack frame boundaries (using bits instead). This solution would allow for unboxed ints, floats and any other primitive types.

  • Similar to .NET, Java, etc.: creating a per-type representation in memory that contains functions for GC, comparison, etc. A single pointer would need to be added per allocated value.

  • Specializing functions for types such as floats (what @alainfrisch was doing in his PR); the biggest issue is possible combinatorial explosion. In .NET, this can be done only as needed via the magic of JIT (reified generics). A hand-written sketch of this kind of specialization follows this list.
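Roughly, hand-specialization looks like this (a sketch; `sum_generic` and `sum_floats` are invented names):

```ocaml
(* the polymorphic version: add is an arbitrary closure, so every call
   goes through the generic boxed representation *)
let sum_generic (add : 'a -> 'a -> 'a) (zero : 'a) (xs : 'a list) : 'a =
  List.fold_left add zero xs

(* a float-specialized copy: (+.) now compiles to a raw float addition
   instead of a call through a closure. Generating such a copy for every
   instantiation is where the combinatorial explosion comes from. *)
let sum_floats (xs : float list) : float =
  List.fold_left (+.) 0.0 xs
```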

1 Like

In general I try not to rely on “ah but with flambda” arguments for performance. Not that these people are not doing good work, but most of the time I’m not interested in tweaking compilation flags, installing a different version of the compiler, or increasing my compilation times too much.

What I’m interested in is a good performance/compilation-time tradeoff in the baseline compiler and a clear model of what to expect when I write my code (for which I’m still relying on this 20-year-old piece I often link to; I wish someone in the know would update it and integrate it into the manual).

Personally I could put up with floats getting boxed at function boundaries in general (large functions can still be split and assembled using [@inline] functions, though I should likely now also check the assembly to see how that really interacts with local unboxing), but I’m a bit less happy about having to rewrite code like this:

  let centroid : P2.t list -> P2.t = fun ps -> 
    let rec loop n cx cy = function
    | [] -> let n = float n in P2.v (cx /. n) (cy /. n)
    | p :: ps  -> (loop[@tailcall]) (n + 1) (cx +. P2.x p) (cy +. P2.y p) ps
    in
    loop 0 0. 0. ps

as a while loop with references to avoid the boxing.
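For concreteness, the reference-based rewrite looks roughly like this (a sketch, assuming the same P2 interface; the usual expectation is that ocamlopt turns non-escaping refs into mutable locals and keeps the floats unboxed):

```ocaml
let centroid : P2.t list -> P2.t = fun ps ->
  (* local float refs that don't escape can stay unboxed in the loop *)
  let n = ref 0 and cx = ref 0. and cy = ref 0. in
  let l = ref ps in
  let continue = ref true in
  while !continue do
    match !l with
    | [] -> continue := false
    | p :: tl ->
      incr n;
      cx := !cx +. P2.x p;
      cy := !cy +. P2.y p;
      l := tl
  done;
  let n = float !n in
  P2.v (!cx /. n) (!cy /. n)
```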

2 Likes

Alas, I think flambda will never be able to fully erase Seq.

For me, if I write performance-critical code in OCaml, I’ll sometimes use iter for iteration (type 'a iter = ('a -> unit) -> unit, which is easier to optimize), but for the most part, it boils down to:

  • use imperative code with dynamic arrays
  • use loops rather than recursion, with raise_notrace for local control flow (shed one tear for the absence of break/continue and the like); see the sketch after this list
  • one day, we might have value types. That’ll come in handy I’m sure.
  • profile a lot
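A minimal sketch of the raise_notrace pattern from the second bullet (Break and the function are invented for illustration):

```ocaml
(* Break carries no data; raise_notrace skips backtrace collection,
   making it cheap enough to use as a local "break" *)
exception Break

(* find the index of the first negative element, if any *)
let find_first_negative (a : float array) : int option =
  let found = ref None in
  (try
     for i = 0 to Array.length a - 1 do
       if a.(i) < 0. then begin found := Some i; raise_notrace Break end
     done
   with Break -> ());
  !found
```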

But if you really need perf, just don’t use OCaml: use C++ or Rust and write bindings.

2 Likes

NaN-boxing offers you as many unboxed values as there are non-NaN floating-point numbers (i.e., 2^64 - 2^53 + 3; increase the 3 if you need more sentinel values for specific purposes). So, it is not just 48-bit integers that you get; you can actually have 63-bit integers if you so wish. This is especially true of OCaml, since static typing means that you do not need to distinguish between integers and floating-point numbers dynamically. (JavaScript interpreters only care about 32-bit integers because that is what bit-level operators, e.g., <<, operate on. In particular, this is the most optimized datatype for asm.js.)

For example, the simplest NaN-boxing scheme I can think of for OCaml would encode boxed values by setting all of bits 51-62. Most (all?) processors disregard these bits anyway, so you would not even need to explicitly decode boxed values before using them. Floating-point values would be stored as is, but after NaN normalization. Thus, this is still more costly than plain floating-point operations in C, but cheaper than allocating. Integer values would be stored with their bit 62 reset, which means that their bit 63 would have to be replicated when decoding. (Note that, as with the current OCaml scheme, you do not need to decode integers when doing addition, subtraction, etc., so the cost of this decoding might be negligible.)
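A hedged sketch of that encoding in terms of Int64 bit operations (the masks and the choice of normalized NaN are illustrative assumptions):

```ocaml
(* bits 51-62 set: the tag range reserved for boxed (pointer) values *)
let boxed_mask = 0x7FF8_0000_0000_0000L

(* a pointer (assumed to fit in the low 48 bits) is tagged by setting
   bits 51-62; testing those bits tells the GC whether a word is boxed *)
let encode_pointer p = Int64.logor p boxed_mask
let is_boxed v = Int64.logand v boxed_mask = boxed_mask
let decode_pointer v = Int64.logand v 0x0000_FFFF_FFFF_FFFFL

(* floats are stored as their IEEE 754 bits; any NaN is first normalized
   to one representative whose bit 51 is clear, so it cannot fall in the
   boxed tag range (this particular bit pattern is just one choice) *)
let encode_float f =
  if f <> f then 0x7FF0_0000_0000_0001L else Int64.bits_of_float f

(* integers are stored with bit 62 cleared (so they never look boxed);
   decoding replicates bit 63 back into bit 62, i.e. sign-extends a
   63-bit integer. As noted above, addition and subtraction can work on
   the encoded form directly. *)
let encode_int i = Int64.logand i 0xBFFF_FFFF_FFFF_FFFFL
let decode_int v =
  Int64.logor v Int64.(logand (shift_right v 1) 0x4000_0000_0000_0000L)
```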

5 Likes

I found this LexiFi blog post an informative dive into unboxing floats in OCaml.

I also wanted to emphasize that bigarrays and all-float records already have specialized unboxed representations. In practice I use bigarrays for any serious numerical computation and thus typically avoid the boxing penalty.
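For example, a minimal sketch of the pattern (the names are mine):

```ocaml
open Bigarray

(* the elements of a float64 bigarray are stored unboxed, so the inner
   loop reads raw doubles rather than boxed values *)
let sum (a : (float, float64_elt, c_layout) Array1.t) : float =
  let acc = ref 0.0 in
  for i = 0 to Array1.dim a - 1 do
    acc := !acc +. Array1.unsafe_get a i
  done;
  !acc

let () =
  let a = Array1.create Float64 C_layout 1_000 in
  Array1.fill a 1.0;
  Printf.printf "%g\n" (sum a)
```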

2 Likes

This is fascinating and sounds like a very promising avenue.

This is true (bigarrays can benefit from SIMD instructions using Owl, for example), but it means you’re going to have less typing and semantic control over your data. For example, suppose you’re writing an accounting application with totals for many different kinds of sums. If you use a bigarray, the efficient way to do this is to lump all the sums together into one blob (otherwise you’re still boxing everything). The result is that every kind of sum becomes an offset into a bigarray, which isn’t a particularly safe way of doing things.
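To make the concern concrete, the offset-based pattern looks roughly like this (every name is invented):

```ocaml
open Bigarray

(* one blob holding every running total *)
let totals = Array1.create Float64 C_layout 3

(* "typed" accessors are really just integer offsets *)
let gross_sales = 0
and tax_collected = 1
and refunds = 2

let add_to slot amount = totals.{slot} <- totals.{slot} +. amount

let () =
  add_to gross_sales 250.0;
  add_to tax_collected 20.0;
  add_to refunds 100.0;  (* swap any two slots and it still type-checks *)
  Printf.printf "refunds total: %g\n" totals.{refunds}
```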

3 Likes

From our perspective at Jane Street, unboxed types are a very high priority. A large slice of the team is thinking about it, and Chris Casinghino and Richard Eisenberg have joined recently and have it as their primary focus, along with Antal.

In terms of when it makes it upstream, that’s less clear. We’re working hard on getting some initial versions done, and we plan on iterating internally, where it’s easier for us to try things out and change them as we go. Once we have a design that we really believe in, we intend to propose it upstream, but how quickly that goes (and whether it’s successful at all!) depends on whether upstream maintainers and the larger community find the improvements compelling.

In any case, I find this conversation encouraging, since it suggests there’s some real hunger for improvements in this space.

I expect ICFP in particular to be a good opportunity for people to learn more about the work we’re doing both here, and also on type-safe stack allocation. (For what it’s worth, the latter is already in production internally and looks very promising.)

If you’ll be at ICFP:


25 Likes

This is super exciting to hear about. Jane Street has really managed to expand our language advancement bandwidth, so that while the Core team is busy sorting through the details of multicore, we can still move the language forward in orthogonal directions and become more competitive. Kudos!

5 Likes

But if you really need perf, just don’t use OCaml: use C++ or Rust and write bindings.

I know this thread has derailed too much already, but on that note I’d really appreciate if OCaml, or at least its C API, could pin block addresses temporarily like the CLR and JVM can. I’d write a lot more (and more efficient) bindings if I didn’t have to worry about copying strings/bytes back and forth between allocations; using bigstrings often feels like a domain-specific hack rather than a solution.

1 Like

Could you explain why? BigArrays are now fully integrated into the language.

In short, I often have efficient OCaml code using (not necessarily big) regular strings, but would require copying to access those from a C binding. But using bigstrings would make the code messier and probably still need copying at library boundaries, so that simply moves the problem around.

I guess by fully integrated you mean the accessor syntax, but I don’t think that’s enough to be on par with regular strings:

  • Most of the ecosystem works with strings, except for a few domains in which slicing or C bindings are common, which is why I say it’s domain-specific.
    • For example, there aren’t In_channel.input_all_bigstring or Bigarray.Array1.get_utf_8_uchar (stdlib is just a well-known example, I’m not picking on it here).
  • Converting between string and bigstring involves a copy, when avoiding extra copies is precisely why I’d like to have pinning (see the sketch after this list).
  • Bigstrings involve two allocations, one of which is the more expensive malloc.
  • There’s no way to write bigstring literals that are allocated statically the way string literals are (because there aren’t read-only bigstrings), which is a pattern I sometimes use to reduce GC pressure for well-known inputs. The equivalent thus requires manually staging allocations and preventing mutable access.
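For instance, the conversion from the second bullet necessarily looks something like this (a sketch; the type alias is mine):

```ocaml
open Bigarray

type bigstring = (char, int8_unsigned_elt, c_layout) Array1.t

(* converting always copies byte by byte (or via memcpy in optimized
   implementations); there is no way to share the underlying storage *)
let bigstring_of_string (s : string) : bigstring =
  let b = Array1.create Char C_layout (String.length s) in
  String.iteri (fun i c -> b.{i} <- c) s;
  b
```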

Edit: I just realized that by fully integrated you probably meant they aren’t implemented as C bindings on ocamlopt anymore.

1 Like