OCaml is a language that prides itself (for good reason, IMO) on taking backward compatibility very seriously. And as far as I can tell, that claim is pretty solid.
That being said, OCaml started a long time ago (no offense), and I guess some initial choices which made sense at the time are… let's say, less obvious with today's perspective.
One of these choices is, IMO, the internal representation of string and bytes, whose direct consequences are APIs that copy and allocate quite a lot. To give a concrete example, the cost of String.sub feels alien to me, in a world where Rust and Go have proven(*) that slices are a better abstraction for this.
We had a friendly chat on Bluesky with Alice and came to the realization that, at least as far as OCaml is concerned, switching the implementation of string to a slice could probably be transparent. And I'm wondering, now: is that a change we would like to try to push upstream? Or is the cost too high, so that this will likely never be considered by the OCaml maintainers?
Yes please, I'm tired of reinventing string * int * int. However, I think changing sub would break too much code: the fact that it reallocates has deep implications for the correctness, speed, and memory usage of existing code, especially when Bytes.unsafe_* are involved.
Instead, I'd propose a new function, let's say slice, that returns the same backing buffer with different bounds. This would immediately enable existing functions which only accept string to work on slices without implementing bounded variants. For example, it's quite tricky to use String.get_utf_8_uchar on the end of a slice while handling truncated sequences correctly.
My only concern would be the double indirection on the new representation, due to OCaml lacking unboxed types.
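For reference, a strawman of the kind of type and slice function being discussed; building one of these records is exactly the extra block and indirection mentioned above (the record layout and names are illustrative, not a proposed API):

```ocaml
(* Strawman slice: a boxed record sharing the backing string.
   Field and function names are illustrative only. *)
module Slice = struct
  type t = { base : string; off : int; len : int }

  let of_string s = { base = s; off = 0; len = String.length s }

  (* Re-slicing only adjusts bounds; no bytes are copied. *)
  let slice t ~pos ~len =
    if pos < 0 || len < 0 || pos + len > t.len then invalid_arg "Slice.slice";
    { t with off = t.off + pos; len }

  let get t i =
    if i < 0 || i >= t.len then invalid_arg "Slice.get";
    String.unsafe_get t.base (t.off + i)

  (* The one place a copy is still required: materializing a real string,
     e.g. for an API that relies on null termination. *)
  let to_string t = String.sub t.base t.off t.len
end
```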
(1) Why? [not asking tendentiously: actually serious] What's the gain (other than fast substring operations)?
(2) Is it obvious that the C FFI wouldn't need to change? I'm assuming that a sliced string would be something like: type slstring = { it : String.t ; ofs : int ; len : int }
and it isn't obvious how you push that through to C so that it's going to be a char *, is it? B/c the C FFI isn't type-aware (or maybe better put: the C FFI works fine under type-erasure just like almost all of ML does generally).
(3) Is there a reason why one couldn't just implement a module SlString with a type like that above, and wrap all the functions from String? Why wouldn't that be enough?
ETA: it seems to me that if slices are such a big win, then implementing #3 in such a way that people can use it as a drop-in replacement for String is the fastest way to prove it.
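To make (2) and (3) concrete, here is a rough sketch of two wrappers over that slstring record; the bodies are illustrative only, and a real SlString would need the full String signature:

```ocaml
module SlString = struct
  (* The record from (2). *)
  type t = { it : string; ofs : int; len : int }

  let length t = t.len

  (* equal cannot simply delegate to String.equal without copying both
     slices first; it has to compare within the bounds. *)
  let equal a b =
    a.len = b.len
    && (try
          for i = 0 to a.len - 1 do
            if a.it.[a.ofs + i] <> b.it.[b.ofs + i] then raise Exit
          done;
          true
        with Exit -> false)

  (* index_from can delegate to the stdlib, but positions need re-basing,
     and a hit past the end of the slice is a miss. *)
  let index_from s i c =
    if i < 0 || i > s.len then invalid_arg "SlString.index_from";
    match String.index_from_opt s.it (s.ofs + i) c with
    | Some j when j < s.ofs + s.len -> j - s.ofs
    | _ -> raise Not_found
end
```

With enough of the String signature covered, a local module String = SlString comes close to the drop-in experiment the ETA suggests (string literals aside).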
Having a slice datatype with already validated bounds is both helpful for library interoperability and allows the compiler to elide redundant bound checks. Currently we all implement our own type slice = string * int * int or equivalent and pay the costs involved.
It enables all existing code using string or bytes to work on slices, including the existing Stdlib. Some String functions that I've had to reimplement or adapt for slices have been equal, index_from and get_utf_8_uchar; the latter is trickier than it looks. (Edit: Also pattern matching!)
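For illustration, a sketch of the get_utf_8_uchar adaptation over the hypothetical slice record used elsewhere in the thread; clamping the truncated decode like this is one possible policy, not necessarily the right one:

```ocaml
(* Hypothetical slice record, as elsewhere in the thread. *)
type slstring = { it : string; ofs : int; len : int }

(* Decode UTF-8 at position i of the slice. The stdlib decoder sees the
   whole backing string, so a sequence that is complete there may still
   run past the end of the slice; report that case as a truncated
   (invalid) decode of the remaining length. *)
let get_utf_8_uchar s i =
  if i < 0 || i >= s.len then invalid_arg "get_utf_8_uchar";
  let dec = String.get_utf_8_uchar s.it (s.ofs + i) in
  let remaining = s.len - i in
  if Uchar.utf_decode_length dec <= remaining then dec
  else Uchar.utf_decode_invalid remaining
```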
Half[1] the performance optimizations I've been seeing on runtimes from the same era as OCaml come from eliding indirections and needless copying/buffering, as well as packing.[2] The latency of memory accesses today is simply much higher and more dependent on cache utilization than it was in the 2000s, compared to CPU speeds: touching less memory is faster.
The API itself wouldn't need to change: String_val would simply return a pointer to the start of the slice, and caml_string_length the slice length. However, I just remembered that strings are guaranteed to be null-terminated and slices wouldn't be. Oof, that's too much of a breaking change.
Now I'm curious: what would your slice datatype's representation be? B/c at least from Golang, it's a two-level thing, hence you don't get to elide indirections; indeed, as compared to straight strings, there are -more- indirections.
ETA: perhaps you mean that you're going to unbox the char-array so it sits in the same block with the offsets and such? (like the hack in C: struct t { int len; char buf[0]; }, perhaps?) In which case, I don't see how you save on copying, and really, don't see how it's different from the String we have today.
Is the cost really that heavy? I feel like most applications are not really manipulating huge strings, and the OCaml GC is really quite fantastic at allocating and freeing memory quickly? If a use case requires manipulating huge strings, wouldn't the bigstring type really be more suited for that anyway?
Since Perceus is garbage free, the Perceus backend uses less memory than the OCaml garbage collector on all benchmarks, with a 40%+ reduction in the cfold and deriv programs. However, for the other three benchmarks the working set is surprisingly close, which is a testament to how well the OCaml GC is able to manage its memory.
The cost of a slice should be 0 allocations, as is the case in Rust. Allocating isn't free, even in OCaml, and that holds even for bigstring slices.
Sadly I don't think the GC is compatible with having proper byte and array slices. Slices (at least, the *u8 + usize kind) would require value types and inner pointers, neither of which are supported right now. So I've mostly lost hope.
I'm concerned as well about the additional indirection to the backing buffer if the slice record were boxed; I think I mentioned that already. The net benefit would depend on which optimizations are achievable and whether the workloads benefit from slicing over copying.
Still, I think it's worth exploring. Many existing functions would automatically become compatible with slices, some could become copy-free (e.g. String.trim), and IO libraries like bytesrw and iostream[1] wouldn't need to roll their own slice types. Slicing large IO buffers without additional copies is the big appeal, I'd say.
Edit:
Go's string is unboxed, so there's only one pointer involved. Same for Rust's str/slices and C#'s Span and Memory.
I don't know about C#, but Golang and Rust documentation -both- state that string/str are read-only slices. That is to say, their representation is just like other slices. Which would imply boxed.
Do you have evidence that they're unboxed? For instance, a program that constructed strings of increasing size, then -repeatedly- performed substring operations (let's say, substring of the middle-half of the string), and showing that the time for the program was a function of the size of the strings (which would imply unboxed) rather than independent of the size of the strings (which would imply boxed).
Or blog posts that state they're unboxed? I'm asking b/c it seems pretty difficult-to-imagine how you'd implement "unboxed slices" in any way that made sense as "slices".
We may be talking past each other here. The slice record, which contains the pointer to the backing buffer, is what is unboxed in those languages; the backing buffer itself is boxed, of course. OCaml, by contrast, would need to allocate at least two blocks on the heap, one for the slice record and one for the backing buffer, because the slice record is boxed here.
So OCaml slices would be pointers to slice blocks, which point to buffer blocks. Whereas Go/Rust/C# keep the slice block stored "inline" (on the stack or as part of the parent block), thus involving zero allocations to slice and no additional indirection over a bare pointer to the buffer.
Ah, ok, thank you for clarifying. Sure, when OCaml gets unboxed fixed-sized types, slices would also get that. The backing buffer remains boxed, just as in other languages.
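Spelling that out in current OCaml, with the hoped-for unboxed version only described in a comment since nothing like it exists upstream yet:

```ocaml
(* Today's only choice: a boxed record. Making a slice allocates a
   3-field block, and reading a byte goes through two pointers
   (slice block -> backing buffer). *)
type slice = { buf : bytes; off : int; len : int }

(* With (hypothetical, future) unboxed fixed-size records, those three
   fields could instead live inline in the parent value or on the stack,
   as Go/Rust/C# do: no allocation per slice, and a single indirection
   to reach the bytes. *)
```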
I really think it might be worthwhile looking at Mark Hayden's thesis, or this paper:
which is the journal paper version of his thesis. Specifically, he describes how he was able to achieve startling performance numbers in 1997, by using zero-copy techniques for both marshalling and demarshalling. Also RC-GC techniques in an OCaml runtime. I think I should mention also that this issue of minimizing copying has been known for a long time: back in 1995 I was able to speed up JVM web-apps by multiples by simply rewriting code to do zero-copy, and reuse buffers. None of these techniques were novel: just novel to Java applications.
The reason this is all relevant, is that to achieve real performance, you need to do something specialized for the application, and it is exceedingly rare that some pervasive solution will work.
As another example of this, you might look at Apache 2 memory-pools: again, a solution that is adapted for the particular problem, rather than trying to impose a way of doing things language-wide.
Last thought: if you intend to do this for strings, how will this work with memory-buffer reuse? Strings are supposed to be immutable: if the backing buffer is not immutable, then the string won't be either. If the backing buffer is immutable, how can we reuse it? Reusing memory buffers (sometimes called "buffer pooling" in Java apps, and "memory pools" in Apache 2) is a critical technique for achieving performance in data-intensive applications.
Having a slice type (for bytes and arrays in general, imho) would still be very useful in many situations, even if there are various application-specific ways to reuse the buffers themselves. Buffer pooling is definitely good!
Most IO in OCaml would use bytes, right? string is just the end product in some cases. Byte slices are useful for IOs, string slices are useful for String.trim, finding substring occurrences, etc.
In an I/O-intensive program, those string operations you describe are what one would call "demarshalling". So for peak performance, you'll want to do them on the actual byte-buffer, not on a string that you created by copying out of that byte-buffer. So if you're serious about this slice stuff, you'd want to have a read-only byte slice type, that didn't guarantee immutability, but did lack any operation to mutate. And that would be the type on which you'd want all the string operations.
The problem is that that type would not be what you'd want for generic string manipulation. There, you'd want … immutable strings.
This is the problem: if you want a single universal type of string, it's going to be ill-suited for many kinds of uses.
I would assume that a bytes slice is mutable, and a string slice isn't, and there would be the same kind of 0-cost conversion between them as we currently enjoy with bytes and string.
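A sketch of that pairing under the boxed-record representation discussed above; the conversions are meant to mirror Bytes.unsafe_to_string / Bytes.unsafe_of_string and carry the same caveats:

```ocaml
(* Hypothetical paired slice types: mutable over bytes, immutable over
   string. Field names are made up. *)
type bytes_slice  = { bbuf : bytes;  boff : int; blen : int }
type string_slice = { sbuf : string; soff : int; slen : int }

(* Zero-cost (copy-free) conversions: sound only if the underlying buffer
   is no longer mutated through the bytes view afterwards, which is the
   same contract as today's unsafe conversions. *)
let unsafe_to_string_slice { bbuf; boff; blen } =
  { sbuf = Bytes.unsafe_to_string bbuf; soff = boff; slen = blen }

let unsafe_of_string_slice { sbuf; soff; slen } =
  { bbuf = Bytes.unsafe_of_string sbuf; boff = soff; blen = slen }
```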
A concrete example: say I read a CBOR value into a buffer (not Buffer.t, because it doesn't expose a slice into its current content… ahem), and I want to deserialize it into a tree. Or maybe iterate over its structure SAX-style, to avoid allocating the tree.
The point of slices in this context is that, when encountering a CBOR text or bytes value, the iterator could just return a slice into the buffer, instead of making a copy.
Currently the only way to emulate that is to return bytes * int * int (which allocates, and is pretty cumbersome to use because few OCaml functions actually take this as input).
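For illustration, roughly the shape being described; the event type and constructor names are made up rather than taken from any existing CBOR library:

```ocaml
(* Hypothetical SAX-style events for a CBOR reader over one input buffer. *)
type slice = { buf : bytes; off : int; len : int }

type event =
  | Int of int64
  | Text of slice          (* view into the input buffer, no copy *)
  | Byte_string of slice   (* likewise *)
  | Array_start of int
  | Map_start of int
  | End

(* Only a consumer that needs an owned value pays for a copy: *)
let text_to_string { buf; off; len } = Bytes.sub_string buf off len
```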
Sure, but a slice will be implemented as bytes * int * int in the end, so the allocation will be there. And if someone were to implement a full set of String operations on that bytes-slice, with the same signature as String, then one could use that in all your CBOR demarshalling with just a single-line change at the top of each file.
After this long thread, I came up with a summary of why I think switching the base type of string to a slice is not a good idea:
Languages that make their base types something complex are doing something wrong. They should make their base types simple, and use libraries on top for complex things. The argument that if we replace that simple base type with a complex thing, then all the existing packages will benefit, to me fails b/c if it were so much of a win, then one could just send PRs to each package-maintainer, switching to the new complex type, and it'd be a no-brainer.
If string->string-slice is such a great idea, it should be possible to demonstrate it without making a change to the underlying OCaml language and library.
Having string slices can enable functions like String.trim, String.take and String.drop (the latter two are not present in the OCaml stdlib tho) to work in constant time which is pretty neat.
However, in GC'ed languages, there's a common problem with slices where a tiny slice (even 1 byte) of a huge string/bytes array prevents the GC from collecting the entire huge array. Therefore, the slices abstraction unfortunately leaks and requires careful thinking about memory usage in each case.
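A sketch of how those become O(1) over a hypothetical slice record (take/drop/trim below are illustrative), including the retention hazard just described:

```ocaml
(* Hypothetical slice record; take/drop/trim only move the bounds. *)
type slice = { s : string; off : int; len : int }

let take n t = { t with len = max 0 (min n t.len) }

let drop n t =
  let n = max 0 (min n t.len) in
  { t with off = t.off + n; len = t.len - n }

(* O(1), no copy of the characters. The flip side is exactly the hazard
   above: the whole backing string stays live for as long as the trimmed
   slice does. *)
let trim t =
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' || c = '\012' in
  let lo = ref 0 and hi = ref t.len in
  while !lo < !hi && is_space t.s.[t.off + !lo] do incr lo done;
  while !hi > !lo && is_space t.s.[t.off + !hi - 1] do decr hi done;
  { t with off = t.off + !lo; len = !hi - !lo }
```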
I think that this raises an important question: before implementing slices in the language we should take a hard look at performance benchmarks. Otherwise, this is optimization driven only by vague intuition.
Typically, it's worth remembering that the GC keeps a chunk of memory available and that allocations in the minor heap usually consist of a simple pointer bump, amounting to de-facto memory pooling for short-lived memory allocations.
This is true for -very- short-lived memory allocations, but it turns out, for many I/O-intensive tasks, the main data-structures (memory buffers for messages, or for blocks of text, depending on what the I/O is for) are not short-lived enough. And it turns out that pooling them for reuse is an -incredibly- effective technique for speeding up code. I used to call it "allocation avoidance".[1]
[1] I once hacked the Javasoft JVM profiler so that instead of using the system clock for "timing", it used a counter of all bytes allocated by the GC/allocator. Effectively, you got a profile with bytes-allocated serving the same function as "elapsed time". It was a very, very effective way of finding code to be optimized: removing allocations turned out to be a big CPU-time speedup. Now, that was the JVM, and it didn't have a nursery (unlike OCaml), but read on …
If you look at Mark Hayden's PhD thesis, you'll see that he didn't remove all allocations: just the ones for the most important data-structures of the application, which in his case were the network message buffers of the Ensemble distributed communication system. I've done similar things to HTTP web-app servers, with similar dramatic speedups.
I do acknowledge and know that as long as allocations stay in the minor heap and are not promoted, they're going to be no biggie, but you'd be surprised at how small that heap is. And as your program grows in complexity, you can find that bits of functionality that used to fit in the minor heap no longer do.
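For concreteness, a minimal sketch of the kind of buffer pooling being described; the module is hypothetical and deliberately simplistic (unbounded, not thread-safe):

```ocaml
(* Minimal "allocation avoidance": reuse fixed-size byte buffers instead
   of allocating a fresh one per message. A real pool would cap its size
   and guard concurrent access. *)
module Buffer_pool = struct
  type t = { size : int; mutable free : Bytes.t list }

  let create ~size = { size; free = [] }

  let acquire p =
    match p.free with
    | b :: rest -> p.free <- rest; b
    | [] -> Bytes.create p.size

  let release p b =
    (* Only take back buffers of the expected size. *)
    if Bytes.length b = p.size then p.free <- b :: p.free
end
```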
ETA: I would make the higher-level observation that if a program isn't memory-intensive, then none of these contemplated changes really matter: might as well keep using the strings/bytes that we're using now. If it -is- memory-intensive, then you're going to need to do special things to get real performance, and typically the sorts of stuff that people in other languages have done … isn't enough.
Frankly and bluntly, I noticed that NONE of the GCed languages that get used for network runtimes have anything -like- the performance that Hayden got with Ensemble nearly 30 years ago, on vastly inferior hardware. That's because in all those languages, they're barely even -trying- to do the job well. That doesn't mean that the Ensemble mechanisms are dead: in 2013 I modified Ensemble and replaced the UDP at the bottom with Infiniband (ibverbs) and got quite excellent performance over (I think it was) FDR Infiniband (it was a decade ago, I forget which fabric it was).
My memory is that Hayden got 85 microsecond round-trip time on Myrinet hardware with 75 usec intrinsic hardware RTT. So 10 usec overhead, for full virtual synchrony (which is a whole bunch more than just TCP). In 1997, on then-current SUN workstations. These days, in Java and other GCed languages people are happy to get RTT in the milliseconds, and think themselves geniuses for doing so. It's all pretty embarrassing.
ETA2: Upon further reflection, I think the numbers above were for one-way latency, not RTT. So double them for round-trip.