Actual Performance Costs of OOP Objects

I’d like to start by mentioning the one point where objects outperform everything else: subtyping is free. So if you have code that by design relies on some form of subtyping, then objects could (in theory) outperform versions based on manual casting or first-class modules.
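For example, an upcast is a compile-time coercion with no runtime cost (a minimal sketch; the class type and function names here are made up):

class type readable = object
  method read : int
end

let use_readable (r : readable) = r#read

let with_writer (rw : < read : int; write : int -> unit >) =
  (* the coercion below compiles to the identity: no wrapper, no allocation *)
  use_readable (rw :> readable)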

But objects are very expensive to allocate (much more than the amount of memory they take would imply), and method calls are strictly more expensive than a field load and a function call.
And their worst property, in my opinion, is that they’re completely opaque to the optimiser. Something as simple as (object method m = 42 end)#m will be compiled to a bunch of CamlinternalOO function calls, followed by an indirect call to a function that eventually returns 42. This may look like a silly example, but this kind of optimisation is very important when your code uses lots of small, single-purpose functions that get inlined by the compiler. Add the fact that method calls can never be inlined anyway, and this style (lots of small functions) becomes very expensive in an object-based project compared to all the alternatives.
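To make the contrast concrete, here is a record-based equivalent of that silly example (a sketch; the point is that the record version is transparent to the optimiser, so with flambda it can be reduced to the constant 42, while the object version cannot):

type t = { m : unit -> int }

let via_record =
  let r = { m = (fun () -> 42) } in
  r.m ()   (* transparent: can be inlined down to 42 *)

let via_object =
  (object method m = 42 end)#m   (* opaque: CamlinternalOO calls, then an indirect call *)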

2 Likes

The main historical bottleneck in the compiler is the type-checker, which indeed wouldn’t be badly impacted by using objects for traversal. But many other parts of the compiler are basically a single pass over the whole term, doing very simple operations (typically, all the passes in Simplif), so I think that if we used objects for traversal of all our terms we would get a very noticeable slowdown. My random guess, based on my work on Selectgen, is that we would get at least a 50% slowdown on those parts, which, taking the whole compilation process into account, would mean somewhere between a 10% and 25% slowdown overall. It’s just a guess, not a benchmark, though.

1 Like

I can’t stress this enough. I think that focusing on the fact that objects are slower than records and functions completely misses the point. Even if they were 10x or 20x slower, it may not matter at all for many workloads.

If objects are the most natural way to express your code (i.e. using subtyping, open recursion, inheritance, etc.), I think you should not hesitate to use them. Only later, if their use is identified as a performance bottleneck, should alternatives be considered (as with any other code).

There are some good reasons not to abuse objects (e.g. their typing theory is more complex than the rest of the core language), but performance is not one of them, in my view.

Cheers,
Nicolas

1 Like

So the takeaway I get here is that if you have a critical path in your code that uses objects, you’re better off recreating them with records containing functions and passing in the record as this. That’s quite unfortunate, and it makes me wish objects were implemented the standard way rather than the way they were in OCaml.

Regarding the makeshift solution, would it make sense to have the functions in the record hold this in their closure via partial application?
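Concretely, the two encodings I have in mind look something like this (just a sketch; the counter types and the make functions are invented for illustration):

(* Variant A: pass the record explicitly as "this" at every call site. *)
type counter_a = { mutable count : int; add : counter_a -> int -> unit }

let make_a () = { count = 0; add = (fun this y -> this.count <- this.count + y) }

(* Variant B: the closures hold "this" themselves, so call sites don't
   pass the record; here the state is simply captured in the closure. *)
type counter_b = { add_b : int -> unit; get_b : unit -> int }

let make_b () =
  let count = ref 0 in
  { add_b = (fun y -> count := !count + y); get_b = (fun () -> !count) }

let () =
  let a = make_a () in
  a.add a 1;                 (* explicit "this" *)
  let b = make_b () in
  b.add_b 1                  (* "this" lives in the closure *)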

(And yes, I get that the performance impact may not be that big in some cases, but I’m talking about cases where we know for certain that we’re in the critical path).

1 Like

Another question: I understand that there’s some caching that happens when you call a method, so that next time when calling from that call site, the method can be resolved faster. How can this work in code that possibly calls different object types? Does the call site compare the object id first before using the cache?

Slower than what? More space than what? Are we comparing using objects to calling functions in a module known at compile-time, to using a record with one field for each method, or to using a GADT, for example?

The link says:

The first field of the block refers to the object’s class and associated method suite, in a format that cannot easily be exploited from C. The second field contains a unique object ID, used for comparisons. The remaining fields of the object contain the values of the instance variables of the object.

This is exactly what you would expect (except possibly the ID). The benchmark I gave earlier compared against a GADT:

type source = Source : (module SOURCE with type t = 'a) * 'a -> source

So here, the first field is the method table (module) and the second is a pointer to the record with the instance variables. We’ve lost the ID field, but gained an extra block and an extra indirection.
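For reference, that encoding looks roughly like this (a sketch; I’m assuming SOURCE has more or less this shape, and read_source is an invented helper):

module type SOURCE = sig
  type t
  val read : t -> bytes -> int   (* assumed signature, for illustration *)
end

type source = Source : (module SOURCE with type t = 'a) * 'a -> source

let read_source (src : source) (buf : bytes) : int =
  match src with
  | Source ((module S), state) ->
    (* one indirection to fetch the module, then an ordinary function call *)
    S.read state buf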

But method lookup is slower for objects (a search keyed on a pre-computed hash, rather than a fixed record index).

There are some other benchmarks at https://github.com/talex5/flow-tests, and there are also some notes in eio/doc/rationale.md (https://github.com/ocaml-multicore/eio/blob/main/doc/rationale.md).

My thoughts are:

  • Avoiding objects is a benefit for speed-critical code, especially if you know the target statically.
  • If you need sub-types, objects will be a good choice.
  • For most uses, the overhead won’t matter compared to whatever other processing you’re doing.
4 Likes

Based on your rationale.md file, it looks like we got to this issue dealing with the same problem: OCaml lacks a good way to deal with streams of data. I’m currently dealing with this on a personal project, and it’s quite an annoyance. C++ has streams, which are subclassed into FileStreams, OutputStreams, InputStreams, StringStreams and what not (this isn’t meant to be 100% accurate, but you get the idea). OCaml completely lacks a notion of a data stream (I guess you call it Flow) and a way of abstracting over it. The OOP model is a perfect fit here, as we need an easily extensible type.

So am I to understand that you used OOP Flows in eio?

Each object comes with its own map from method labels to closures. The map is implemented as an array of two-word cells, one word for the key and one word for the value. Keys are ordered, so looking up a given method from its label is a dichotomic search.
For each method call in the generated code, a global cache entry is generated that holds the last index found by dichotomic search at this particular call site. The next dichotomic search can then be skipped if the index from the cache points to the correct label. This happens fairly often if you don’t have too many different classes/raw objects implementing the same signature. But that’s also all the cache saves you: a few iterations of a short loop.
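A rough model of that dispatch logic, for intuition (this is not the actual runtime code, which lives in CamlinternalOO and the C runtime):

(* a method table: (label, closure) cells sorted by label *)
type 'a table = (int * 'a) array

let rec search (t : 'a table) label lo hi =
  (* dichotomic search over the sorted cells *)
  if hi - lo <= 1 then lo
  else
    let mid = (lo + hi) / 2 in
    if fst t.(mid) <= label then search t label mid hi
    else search t label lo mid

let lookup (t : 'a table) label (cache : int ref) =
  let i = !cache in                          (* assumes !cache is a valid index *)
  if fst t.(i) = label then snd t.(i)        (* cache hit: search skipped *)
  else begin
    let j = search t label 0 (Array.length t) in
    cache := j;                              (* remember for this call site *)
    snd t.(j)
  end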

2 Likes

Yes, flows are objects. But typically you then wrap a flow in a buffered reader, which isn’t an object. So reading a character or line doesn’t involve a method call, unless the buffer is empty and it needs to refill it (see Buffering and Parsing).
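Schematically, the two-layer design looks like this (not the actual Eio API, just a sketch of the idea):

(* the flow is an object, but the hot path reads from a plain record;
   a method call happens only when the buffer needs refilling *)
type buf_reader = {
  flow : < read_into : bytes -> int >;
  buf : bytes;
  mutable pos : int;
  mutable len : int;
}

let read_char r =
  if r.pos >= r.len then begin
    r.len <- r.flow#read_into r.buf;   (* the only method call; EOF handling omitted *)
    r.pos <- 0
  end;
  let c = Bytes.get r.buf r.pos in
  r.pos <- r.pos + 1;
  c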

Also, we sometimes optimise away the method calls. For example, you can ask to copy from any source flow to any sink, but if the source and sink are both wrapping Unix file descriptors then it will automatically extract them and do the bulk of the copy directly.

1 Like

I see. So the buffering (which would normally be optional, and part of the flow) is externalized, and serves as a way to avoid many method calls. I totally understand why you did it, but it’s kind of sad that we have to step around a half-baked language feature. Presumably there’s a lot of optimization OCaml could have done here, including adding a strict-object option to the language that would only allow objects created from a class, thus enabling proper compilation techniques.

Did you by any chance measure the performance impact without buffering? In theory, repeated calls to the same method shouldn’t be bad, right?

As a side comment, it seems like this Flow business should be spun out to another library.

Thank you! I’ve been seeking a proper explanation for a long time. However, I don’t think the cost saved here is minor, considering that this short loop is an integral part of method dispatch. We’re getting to Python-level costs here.

1 Like

Just to test things out myself, I compared 3 versions of a loop:

(* plain *)
let x = ref 0;;

for i = 0 to 1000000000 do
  x := !x + 1
done;;
Printf.printf "result = %d\n" !x 

(* OOP *)
let x =
  object (self)
    val mutable x = 0

    method add y =
      x <- y + 1

    method print =
      Printf.printf "result = %d\n" x
  end;;

for i = 0 to 1000000000 do
  x#add 1
done;;
x#print;;

(* record-based *)
type r = {
  mutable x: int;
  f: r -> unit;
};;

let v = {
  x = 0;
  f = (fun r -> r.x <- r.x + 1);
};;

for i = 0 to 1000000000 do
  v.f v
done;;

Printf.printf "result = %d\n" v.x

All files were compiled with ocamlopt, no flambda.

Times:
plain version: 1.95s
record version: 2.2s
oop version: 5.5s

So yes, it definitely seems like it’s worth emulating objects with records in high performance regions of the code. And no, method caching doesn’t really help.

That’s a depressingly low bar for a compiled language…

You’ll probably want to use (Sys.opaque_identity v).f v if you want it closer to the performance you’ll get in real code. Otherwise the function might get inlined.
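That is, the loop body becomes something like:

for i = 0 to 1000000000 do
  (* Sys.opaque_identity hides v from the optimiser, so f can't be inlined *)
  (Sys.opaque_identity v).f v
done;;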

3 Likes

I doubt it does without flambda…? But I’ll try.

I just checked and there’s no impact without flambda.

I just realized I had a ‘bug’ in my object version as I wasn’t actually incrementing self.x. If I change the mistaken line to x <- x + y, we go up to 6.3s for the object-based version!!!

I never worked on objects directly myself but I still find it a bit vexing to see them described as a “half-baked language feature”.

You assume that the design here is related to the cost of method calls, but we don’t know that. From a distance it looks like @talex5 chose a two-level design where the two layers solve different problems, and objects seem to be useful for one layer (some convenience around two_way, I guess; though I would also guess that flows could be made without object types if one wanted) but not the other.

So yes, it definitely seems like it’s worth emulating objects with records in high performance regions of the code. And no, method caching doesn’t really help.

Wait, your benchmark shows that even in a super-tight loop the overhead of method dispatch is at worst 2.5x, and you react as if it’s a huge cost? This is not a huge difference! It means that as soon as your code does something 100x slower than incrementing an integer (which is: most other things), adding objects in this way becomes a 2.5% slowdown, which is not noticeable in most sections of the code, only in those, again, that are performance-critical.
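(For scale, using the numbers above: (5.5 s - 2.2 s) / 10^9 iterations is roughly 3.3 ns of extra dispatch cost per call, on the order of ten cycles on a modern CPU.)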
(Note: you claim that caching doesn’t help here, but in fact it probably does.)

(But I’m not sure that your baseline is correct; it would probably be noticeably faster with a local reference instead of a global variable.)

What’s the next feature to be criticized because it’s slower than not doing anything? We could just as well show you micro-benchmarks where exceptions incur a 2.5x overhead; should we get rid of exceptions? Effects have lots of overhead, so out they go? Functors should never be used! int option has Python-level costs compared to encoding invalid values as -1!

6 Likes

That’s true. That’s how I read @talex5’s explanation here.

Yes, I’ve been focusing on performance-critical areas. I said as much.

The question is what the alternatives are. If objects have a record-based encoding that works almost as well and is 2.5x faster (and I agree it’s just a microbenchmark, but it reflects a basic issue), then yes, I’ll just use that, thank you very much. If effects have a serious performance overhead compared to the alternative, which is monadic code (and I don’t believe they do; quite the contrary, in fact), then yes, we should generally avoid them. Functors allow for language features that cannot be expressed differently, but I would still expect their performance cost to be minimized, or they aren’t worth using in production code. The same applies to option: obviously it provides type safety for which there is no alternative, but it’s also true that in many cases the extra allocation is painful, and the language should try to minimize the cost of using this essential feature.

Bottom line: OCaml is a compiled language, and it competes in a space where performance matters. We’re not Python. Our competitors are constantly improving their performance, and we need to do the same.

AFAICT the object model is half-baked, in the sense that performance issues were not sufficiently considered. To me, it seems like a feature that was put into the language in order to publish a research paper, rather than the obvious move of implementing a high-performance OOP system like those available in other compiled languages. And the evidence is that the ecosystem avoids this feature like the plague.

Well I’m waiting for your PR doing just that.

Many aspects of the OCaml implementation reflect the fact that it’s the product of a small team of people; it’s regularly in the zone where it gets 80% of the benefits with 20% of the work. It’s not realistic to expect the same language-engineering approach as what you get from, say, Oracle and its full teams of people extracting extra percentage points of performance from the JVM (or Google and V8, etc.). In particular, the more dynamic the programming-language feature, the harder it is (for everyone) to get reliably good performance. I think you have to accept this idea, and in particular there is no need to get vaguely insulting about it.

This is not to say that people shouldn’t work on making OCaml programs faster (I’m also working on this!), including the object-oriented layer if they feel like it. More power to them! But the particular approach to (1) reasoning about performance and (2) programming-language implementation expectations in this thread is just unhealthy.

3 Likes

Unless I’m mistaken, this is an overhead of at best 2.5x. The conditions here are optimal for the object code, and suboptimal for the non-object code (no inlining). Try the same example with flambda and you will get a much wider gap.
This doesn’t mean we should discourage people from using objects, but I don’t see any reason to encourage them either.

Oh, and a small remark:

That’s the main reproach I have about objects in OCaml: very few compiler maintainers have ever worked on them. Jacques Garrigue is, as far as I know, the only active maintainer to have worked on the code when it was introduced, and I don’t think I’ve seen any meaningful patches on this part of the compiler in years. On one hand, it means that it’s working fairly well; on the other hand, even if someone wanted to spend time improving things in this part of the compiler, the contribution would likely be rejected as nobody would be available to review it.

Find someone willing to review such a PR, and I’ll consider submitting one.

1 Like

The conditions here are optimal for the object code, and suboptimal for the non-object code (no inlining).

I would indeed expect to observe larger slowdowns in some scenarios (my guess above was “at least 10x”).

On one hand, it means that it’s working fairly well; on the other hand, even if someone wanted to spend time improving things in this part of the compiler, the contribution would likely be rejected as nobody would be available to review it.

We’ve worked on parts of the compiler that had not seen structural changes in years, and got things reviewed and merged. Examples include pattern-matching (checking and compilation), the typing of classes, the parser… the major GC, etc. When there is a bus-factor problem with the existing implementation, and/or when the code is arcane (the object runtime probably checks both boxes), it’s more work, but it’s certainly doable.

Find someone willing to review such a PR, and I’ll consider submitting one.

You can do as you wish with your time, but I wouldn’t think about it this way. Is it an important improvement to make? (Is it a useful use of your time, more beneficial than other things you would work on otherwise?) If you believe the answer is “yes”, and you can explain why, then of course we should get the discussion started. (I think right now the answer is likely to be “let’s get Multicore ready first”, but we’re not measuring things in couples of months, are we?)

2 Likes