Actual Performance Costs of OOP Objects

Well I’m waiting for your PR doing just that.

Many aspects of the OCaml implementation reflect the fact that it’s the product of a small team of people – it’s regularly in the zone where it’s getting 80% of the benefits with 20% of the work. It’s not realistic to expect the same language-engineering approach as what you get from, say, Oracle and its full teams of people extracting extra percentage points of performance from the JVM. (Or Google and v8, etc.) In particular, the more dynamic the programming-language feature, the harder it is (for everyone) to get reliably good performance. I think you have to accept this idea, and in particular there is no need to get vaguely insulting about it.

This is not to say that people shouldn’t work on making OCaml programs faster (I’m also working on this!), including the object-oriented layer if they feel like it. More power to them! But the particular approach to (1) reasoning about performance and (2) programming-language implementation expectations in this thread is just unhealthy.

3 Likes

Unless I’m mistaken, this is an overhead of at best 2.5x. The conditions here are optimal for the object code, and suboptimal for the non-object code (no inlining). Try the same example with flambda and you will get a much wider gap.
This doesn’t mean we should discourage people from using objects, but I don’t see any reason to encourage them either.

Oh, and a small remark:

That’s the main complaint I have about objects in OCaml: very few compiler maintainers have worked on objects. Jacques Garrigue is, as far as I know, the only active maintainer to have worked on the code when it was introduced, and I don’t think I’ve seen any meaningful patches on this part of the compiler in years. On one hand, it means that it’s working fairly well; on the other hand, even if someone wanted to spend time improving things in this part of the compiler, the contribution would likely be rejected as nobody would be available to review it.

Find someone willing to review such a PR, and I’ll consider submitting one.

1 Like

The conditions here are optimal for the object code, and suboptimal for the non-object code (no inlining).

I would indeed expect to observe larger slowdowns in some scenarios (my guess above was “at least 10x”).

On one hand, it means that it’s working fairly well; on the other hand, even if someone wanted to spend time improving things in this part of the compiler, the contribution would likely be rejected as nobody would be available to review it.

We’ve worked on parts of the compiler that had not seen structural changes in years, and got things reviewed and merged. Examples include pattern-matching (checking and compilation), the typing of classes, the parser… the major GC, etc. When there is a bus-factor problem with the existing implementation, and/or when the code is arcane (the object runtime probably checks both boxes) it’s more work, but it’s certainly doable.

Find someone willing to review such a PR, and I’ll consider submitting one.

You can do as you wish with your time, but I wouldn’t think about it this way. Is it an important improvement to make? (Is it a useful use of your time, more beneficial than other things you would work on otherwise?) If you believe the answer is “yes”, and you can explain why, then of course we should get the discussion started. (I think right now the answer is likely to be “let’s get Multicore ready first”, but we’re not measuring things in couple-months, are we?)

2 Likes

Note that there is also a mistake in the record version. Contrary to the object version, it does not take y as an argument.

Also, performance is highly dependent on the architecture of your processor and how well it deals with branch prediction and the like. On my laptop, I get the following times:

  • plain: 1.33s
  • record: 1.96s
  • object: 2.64s

So, the overhead is not that bad.
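For readers who have not seen the benchmark upthread, here is a rough, hypothetical reconstruction of the three variants being timed ("plain", "record", "object"). The names, the loop count, and the exact shapes are my guesses, not the original code; the point is only to show the three dispatch styles being compared:

```ocaml
(* Hypothetical sketch of the micro-benchmark. The real benchmark
   reportedly loops ~10^9 times; a smaller count is used here. *)
let iterations = 1_000_000

(* "plain": update a mutable ref directly, no dispatch *)
let plain () =
  let x = ref 0 in
  for _ = 1 to iterations do x := !x + 1 done;
  !x

(* "record": dispatch through a record of closures *)
type ops = { f : int -> unit; get : unit -> int }

let record () =
  let x = ref 0 in
  let r = { f = (fun y -> x := !x + y); get = (fun () -> !x) } in
  for _ = 1 to iterations do r.f 1 done;
  r.get ()

(* "object": dispatch through a method call *)
let obj_version () =
  let o =
    object
      val mutable x = 0
      method f y = x <- x + y
      method get = x
    end
  in
  for _ = 1 to iterations do o#f 1 done;
  o#get
```

All three compute the same result; the difference under measurement is purely the cost of the dispatch mechanism (direct access vs. closure call vs. method lookup).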

2 Likes

Yes, I consider this split to be desirable anyway. If I’m writing an HTTP request parser, I want to accept many different sources of data (e.g. TCP socket, TLS decoder), but I don’t want to support multiple different user-provided buffering implementations.

If your workload is one integer addition, then yes you should avoid objects.

2 Likes

I didn’t think it would make an impact, and indeed it doesn’t (on my machine at least).

This is fascinating! May I ask what architecture your laptop has?

If your source is a buffer (e.g., a byte array), you don’t necessarily want another buffer on top of that. Leaving it up to the implementation makes sense – in an ideal OOP hierarchy, a user could choose which elements they want. But in this case, if the user chooses the direct OOP API (rather than buffering), they would presumably take a performance hit. This is not very reasonable IMO. Having to add buffering to work around language features is… problematic, to say the least.
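To make the split being discussed concrete, here is a minimal, hypothetical sketch of the design: a small object type for raw byte sources, with a single user-provided buffering module layered on top. All names (`source`, `of_string`, `Buffered`) are illustrative, not from any actual library:

```ocaml
(* A raw byte source: the only thing implementations must provide. *)
class type source = object
  (* read into [buf] at offset [off], up to [len] bytes;
     return the number of bytes actually read (0 means end of input) *)
  method read : bytes -> int -> int -> int
end

(* One possible source: an in-memory string. *)
let of_string s : source =
  let pos = ref 0 in
  object
    method read buf off len =
      let n = min len (String.length s - !pos) in
      Bytes.blit_string s !pos buf off n;
      pos := !pos + n;
      n
  end

(* Buffering lives in one place, above the object layer. *)
module Buffered = struct
  type t = { src : source; buf : bytes; mutable len : int; mutable pos : int }

  let create ?(size = 4096) src =
    { src; buf = Bytes.create size; len = 0; pos = 0 }

  let read_char t =
    if t.pos >= t.len then begin
      (* refill the buffer from the underlying source *)
      t.len <- t.src#read t.buf 0 (Bytes.length t.buf);
      t.pos <- 0
    end;
    if t.pos >= t.len then None
    else begin
      let c = Bytes.get t.buf t.pos in
      t.pos <- t.pos + 1;
      Some c
    end
end
```

With this shape, a TCP socket or TLS decoder only implements `read`, while the buffering policy stays in a single module – which is the trade-off discussed above: for an already in-memory source, the extra copy through `Buffered` is pure overhead.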

On mine, the difference is huge:

  • record +1: 1.56s
  • record +y: 1.96s

Intel i7-8650U, so a Skylake architecture.

I have a different result (Linux, Ryzen 5950X) where the overhead is bigger:

  • ref: <0.5s
  • record: 1.08s
  • object: 3.5s

That looks like the kind of results I would expect with flambda. Can you confirm whether it is enabled or not?
Comparisons using flambda are not very useful here, as this test presents a lot of optimisation opportunities that wouldn’t occur in a real program.

There’s a long thread of discussion (and a little acrimony, which seems out-of-place) about whether objects are an afterthought or not. I thought I’d weigh in, since I spent a number of years writing a large caml-light codebase, and then went off to spend well over a decade in the depths of Java, JVMs, and Java-based commercial products.

There’s a saying that a good language makes the perceived cost of using a feature commensurate with the overall undesirability of having people use it. And a language that doesn’t make those two things line up, is promising its users a world of trouble. For instance, Java makes concurrency trivial to use (“new Thread()” and “synchronized”, wheeee!) and yeah, users get into a world of trouble. I think the way that OCaml has made objects available, but not too attractive, is about the right balance.

Your argument that “objects are an afterthought in OCaml, mostly there for a research paper” might be correct. But I’ll note that in the O-O world, the use of subtyping/inheritance (== “O-O”) has decreased pretty monotonically over time: in C++ with the rise of templates and large template libraries like STL and Boost, most programmers rarely need to construct subclass hierarchies, and more and more, O-O is a tool for building those templates.

I remember when O-O arrived in OCaml; I even used it for a couple of moderate-sized projects (a big one, a modular packet-sniffer/stream-reassembler/performance-analyzer in 2001), before deciding that it was more trouble than it was worth. And that was when I was well-within my decade of commercial Java systems hacking.

What am I trying to say? It isn’t obvious that spending a massive amount of energy on improving O-O would be the best bang-for-the-buck for OCaml overall. I’m not even sure it’s wise to encourage programmers to use O-O when other paradigms suffice.

7 Likes

Just used dune with the default OCaml 4.13.

Here’s one aspect which hasn’t been mentioned in this thread yet:

This came up in “the real world” for me, where I help maintain a fairly large binary that had lots of classes linked into it.

1 Like

I know @bluddy started this discussion about the performance of OOP, but I followed the discussion leading up to it, which was about a more expressive I/O layer in OCaml. OOP is a possible answer to that, but not necessarily the only solution.

To focus on the I/O problem it seems to me:

  1. If we used OOP for this, in many cases, the overhead of a method call would be completely dwarfed by the cost of performing I/O.
  2. Most common uses of I/O pull out, or put in, fairly large chunks of data from the channel to avoid performing too many calls to that layer, even if buffered and even if the underlying structure is an in-memory string, so the cost of a method call, again, seems like it is unlikely to be a large overall cost in the program.
  3. There are various programmer sentiments that, rightly or wrongly, will impact one’s reaction to the I/O layer of OCaml being objects. Personally, I have a possibly irrational distaste for objects (in general I don’t like sub-typing, as I believe it makes programs harder to understand), so a module-based implementation “feels” better to me even if it’s using objects underneath. I have my own I/O library that I use for async code that is implemented similarly to what @talex5 has done, except I use a record of methods underneath a module handling the buffering.
  4. If the primary counter to doing an OO low-level layer and a Buffered module/other modules above that is in the case of an in-memory representation we end up paying a double-copy cost, that might be a reasonable cost to pay for an interface people might like a bit more.
  5. I’m not sure how much any of this matters given that a lot of OCaml code lives in some async framework, and those frameworks have their own I/O primitives which, I don’t think, would be workable with this interface given the types would be different.

One final thought: I don’t know the cost for looking up a function with first-class modules. But this strikes me as maybe a good compromise in that with first-class modules, you always have an escape hatch if it’s important for performance. That is, imagine you had a Buffered_stream module that was backed by some kind of I/O layer built on a dispatch table and it turns out you’re mostly doing I/O in an in-memory representation and that is too slow. Well, for that performance sensitive code you could pass in a module with your reduced-copy implementation that matches Buffered_stream for those specific cases and pass in the standard Buffered_stream implementation for non-in-memory situations. Yes, it’s more verbose, but at the same time the amount of code that needs that is probably quite low so maybe that is a fair balance to be made.
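The escape-hatch idea above can be sketched roughly as follows. Everything here is hypothetical (the `BUFFERED_STREAM` signature and both implementations are made up for illustration); the point is that a consumer written against a first-class module can be handed either a generic dispatch-based implementation or a specialised in-memory one:

```ocaml
(* A hypothetical buffered-stream interface. *)
module type BUFFERED_STREAM = sig
  type t
  val read_byte : t -> int option
end

(* Generic implementation: dispatch through a closure, standing in
   for "backed by some kind of I/O layer built on a dispatch table". *)
module Dispatch : BUFFERED_STREAM with type t = unit -> int option = struct
  type t = unit -> int option
  let read_byte next = next ()
end

(* Reduced-copy implementation for data already in memory. *)
module In_memory : BUFFERED_STREAM with type t = string * int ref = struct
  type t = string * int ref
  let read_byte (s, pos) =
    if !pos >= String.length s then None
    else begin
      let c = Char.code s.[!pos] in
      incr pos;
      Some c
    end
end

(* A consumer that works with any implementation, chosen at the call site. *)
let count_bytes (type a) (module S : BUFFERED_STREAM with type t = a) (t : a) =
  let n = ref 0 in
  let rec loop () =
    match S.read_byte t with
    | Some _ -> incr n; loop ()
    | None -> ()
  in
  loop ();
  !n
```

The performance-sensitive call site passes `(module In_memory)` with its zero-copy state, while everything else passes the generic implementation – more verbose, but the specialised code stays confined to the few places that need it.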

1 Like

There is nothing magical about first-class modules. OCaml needs to create an adapter module on the fly if there is a discrepancy between the interface of the original module and the interface of the expected one. For example, the micro-benchmark above can be adapted as follows. And, as expected (one allocation per call), it is much, much slower than the object-oriented version (at least on my laptop).

module type A = sig
  val f: int -> unit
  val get: unit -> int
end

module B = struct
  let x = ref 0
  let get () = !x
  let f y = x := !x + y
end

let call (module M : A) y =
  M.f y

let () =
  let module M = B in
  for i = 0 to 1000000000 do
    call (module M) 1
  done;
  Printf.printf "result = %d\n" (B.get ())

3 Likes

I don’t have my home laptop with me to run this. Could you add the comparison times?

  • plain: 1.33s
  • record: 1.96s
  • object: 2.64s
  • module: 4.27s

You should get slightly better performance if you define call as:

let call (m : (module A)) y =
  let module M = (val m) in
  M.f y

or

let call y (module M) =
  M.f y

The reason is that (module M) in a function argument prevents un-currying the rest of the arguments, leading to extra closure allocations and function calls. These extra operations can be optimised away by Flambda with -O2 or -O3, but the non-flambda compiler will be slower.
In fact, you would probably have a fairer comparison if you had written your code as:

for i = 0 to 1000000000 do
  let module M_A : A = M in
  M_A.f 1
done;

The call function introduces overhead that is not present in the other versions.

I agree. But unfortunately, OCaml is sufficiently smart to inline the call in that case. That is why I used the call proxy.

Indeed, the performance is better. And since OCaml inlines the call to call (but not the one to M.f), this better reflects the actual cost of first-class modules.

  • plain: 1.33s
  • record: 1.96s
  • object: 2.64s
  • module: 3.05s