Could we move string and bytes to sliced types?

OCaml is a language that prides itself (for good reason IMO) on taking backward compatibility very seriously. And as far as I can tell, that claim is pretty solid.

That being said, OCaml started a long time ago (no offense :sweat_smile:), and I guess some initial choices which made sense at the time are… let’s say, less obvious with today’s perspective.

One of these choices is, IMO, the internal representation of string and bytes, whose direct consequences are APIs that copy and allocate quite a lot. To give a concrete example, the cost of String.sub feels alien to me, in a world where Rust and Go have proven(*) that slices are a better abstraction for this.

We had a friendly chat on Bluesky with Alice and came to the realization that, at least as far as OCaml is concerned, switching the implementation of string to a slice could probably be transparent. And I’m wondering, now: is that a change we would like to try to push upstream? Or is the cost so high that it will likely never be considered by the OCaml maintainers?

(*) At least, have convinced me

4 Likes

Yes please, I’m tired of reinventing string * int * int. However, I think changing sub would break too much code: the fact that it reallocates has deep implications for the correctness, speed, and memory usage of existing code, especially when Bytes.unsafe_* are involved.

Instead, I’d propose a new function, let’s say slice, that returns the same backing buffer with different bounds. This would immediately enable existing functions which only accept string to work on slices without implementing bounded variants. For example, it’s quite tricky to use String.get_utf_8_uchar on the end of a slice while handling truncated sequences correctly.

My only concern would be the double indirection on the new representation, due to OCaml lacking unboxed types.
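To make the proposal concrete, here is a hedged sketch of what such a slice could look like as a plain library type today. The module name, function names, and bounds-checking policy are all my own invention, not an actual stdlib API:

```ocaml
(* Hypothetical sketch of the proposed slice: a view into an existing
   string that shares its backing buffer instead of copying. *)
module Slice = struct
  type t = { base : string; ofs : int; len : int }

  let make ?(ofs = 0) ?len base =
    let len =
      match len with Some l -> l | None -> String.length base - ofs
    in
    if ofs < 0 || len < 0 || ofs + len > String.length base then
      invalid_arg "Slice.make";
    { base; ofs; len }

  (* O(1): narrowing allocates only the small record, never the bytes. *)
  let sub t ~ofs ~len =
    if ofs < 0 || len < 0 || ofs + len > t.len then invalid_arg "Slice.sub";
    { t with ofs = t.ofs + ofs; len }

  let length t = t.len

  let get t i =
    if i < 0 || i >= t.len then invalid_arg "Slice.get";
    t.base.[t.ofs + i]

  (* Copying escape hatch for when a real string is required. *)
  let to_string t = String.sub t.base t.ofs t.len
end
```

With something like this, Slice.sub is constant-time where String.sub copies; the open question in this thread is whether that record could avoid being a second heap block.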

Three thoughts:

(1) why ? [not asking tendentiously: actually serious] What’s the gain (other than fast substring operation)

(2) Is it obvious that the C FFI wouldn’t need to change? I’m assuming that a sliced string would be something like:
type slstring = { it : String.t ; ofs : int ; len : int }
and it isn’t obvious how you push that through to C so that it’s going to be a char *, is it? B/c the C FFI isn’t type-aware (or maybe better put: the C FFI works fine under type-erasure, just like almost all of ML does generally).

(3) Is there a reason why one couldn’t just implement a module SlString with a type like the one above, and wrap all the functions from String? Why wouldn’t that be enough?

ETA: it seems to me that if slices are such a big win, then implementing #3 in such a way that people can use it as a drop-in replacement for String is the fastest way to prove it.

2 Likes
  1. Having a slice datatype with already validated bounds is both helpful for library interoperability and allows the compiler to elide redundant bound checks. Currently we all implement our own type slice = string * int * int or equivalent and pay the costs involved.
  2. It enables all existing code using string or bytes to work on slices, including the existing Stdlib. Some String functions that I’ve had to reimplement or adapt for slices have been equal, index_from and get_utf_8_uchar; the latter is trickier than it looks. (Edit: Also pattern matching!)
  3. Half[1] the performance optimizations I’ve been seeing on runtimes from the same era as OCaml come from eliding indirections and needless copying/buffering, as well as packing.[2] The latency of memory accesses today is simply much higher and more dependent on cache utilization than it was in the 2000s, compared to CPU speeds: Touching less memory is faster.
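The get_utf_8_uchar wrinkle mentioned above can be made concrete. A hedged sketch (requires OCaml ≥ 4.14; the helper name and the (string, ofs, len) convention are mine, not stdlib):

```ocaml
(* Why String.get_utf_8_uchar is tricky on a (string, ofs, len) view:
   near the end of the view it can read bytes past ofs + len, happily
   decoding a UTF-8 sequence that straddles the boundary. One
   workaround: re-decode from a copy truncated at the bound, so the
   decoder cannot see past it and reports the truncation. *)
let get_utf_8_uchar_in_view s ~ofs ~len i =
  let d = String.get_utf_8_uchar s (ofs + i) in
  if i + Uchar.utf_decode_length d <= len then d
  else
    (* Sequence crosses the view's end: decode the truncated tail,
       which yields an invalid (replacement) decode, as it should. *)
    String.get_utf_8_uchar (String.sub s (ofs + i) (len - i)) 0
```

The subtle part is exactly this boundary case: a naive bounded wrapper that forwards to String.get_utf_8_uchar silently decodes bytes outside the slice.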

The API itself wouldn’t need to change: String_val would simply return a pointer to the start of the slice, and caml_string_length the slice length. However, I just remembered that strings are guaranteed to be null-terminated and slices wouldn’t be. Oof, that’s too much of a breaking change.


  1. The other half is SIMD. ↩︎

  2. That’s also why I insist OCaml needs unboxed types. ↩︎

3 Likes

Now I’m curious what your slice datatype’s representation would be? B/c at least in Golang, it’s a two-level thing, hence you don’t get to elide indirections; indeed, as compared to straight strings, there are -more- indirections.

ETA: perhaps you mean that you’re going to unbox the char-array so it sits in the same block with the offsets and such? (like the hack in C: struct t { int len; char buf[]; } perhaps?) In which case, I don’t see how you save on copying, and really, don’t see how it’s different from the String we have today.

1 Like

Is the cost really that heavy? I feel like most applications are not really manipulating huge strings, and the OCaml GC is really quite fantastic at allocating and freeing memory quickly? If a use case requires manipulating huge strings, wouldn’t the bigstring type really be more suited for that anyway?

To see an example of how good the OCaml GC is, see https://www.microsoft.com/en-us/research/uploads/prod/2023/09/ocamlrc.pdf

Since Perceus is garbage free, the Perceus backend uses less memory than the OCaml garbage collector on all benchmarks, with a 40%+ reduction in the cfold and deriv programs. However, for the other three benchmarks the working set is surprisingly close, which is a testament to how well the OCaml GC is able to manage its memory.

2 Likes

The cost of a slice should be 0 allocations, as is the case in Rust. Allocating isn’t free, even in OCaml, and that holds even for bigstring slices.

Sadly I don’t think the GC is compatible with having proper byte and array slices. Slices (at least, the *u8 + usize kind) would require value types and inner pointers, neither of which are supported right now. So I’ve mostly lost hope.

2 Likes

I’m concerned as well about the additional indirection to the backing buffer if the slice record were boxed; I think I mentioned that already. The net benefit would depend on which optimizations are achievable and whether the workloads benefit from slicing over copying.

Still, I think it’s worth exploring. Many existing functions would automatically become compatible with slices, some could become copy-free (e.g. String.trim), and IO libraries like bytesrw and iostream[1] wouldn’t need to roll their own slice types. Slicing large IO buffers without additional copies is the big appeal, I’d say.

Edit:

Go’s string is unboxed, so there’s only one pointer involved. Same for Rust’s str/slices and C#'s Span and Memory.


  1. And my own internal one. ↩︎

1 Like

I don’t know about C#, but Golang and Rust documentation -both- state that string/str are read-only slices. That is to say, their representation is just like other slices. Which would imply boxed.

Do you have evidence that they’re unboxed? For instance, a program that constructed strings of increasing size, then -repeatedly- performed substring operations (let’s say, substring of the middle-half of the string), and showing that the time for the program was a function of the size of the strings (which would imply unboxed) rather than independent of the size of the strings (which would imply boxed).

Or blog posts that state they’re unboxed ? I’m asking b/c it seems pretty difficult-to-imagine how you’d implement “unboxed slices” in any way that made sense as “slices”.

1 Like

We may be talking past each other here. The slice record, which contains the pointer to the backing buffer, is unboxed; but the backing buffer is boxed, of course. OCaml here needs to allocate at least two blocks on the heap, one for each slice record and one for the backing buffer, because the slice record is boxed.

So OCaml slices would be pointers to slice blocks, which point to buffer blocks. Whereas Go/Rust/C# keep the slice block stored “inline” (on the stack or as part of the parent block), thus involving zero allocations to slice and no additional indirection over a bare pointer to the buffer.

1 Like

Ah, ok, thank you for clarifying. Sure, when OCaml gets unboxed fixed-sized types, slices would also get that. The backing buffer remains boxed, just as in other languages.

I really think it might be worthwhile looking at Mark Hayden’s thesis, or this paper:

https://ecommons.cornell.edu/server/api/core/bitstreams/f69605de-210b-4210-bbd1-f4b3bcd557c4/content

which is the journal paper version of his thesis. Specifically, he describes how he was able to achieve startling performance numbers in 1997, by using zero-copy techniques for both marshalling and demarshalling, as well as RC-GC techniques in an OCaml runtime. I think I should mention also that this issue of minimizing copying has been known for a long time: back in 1995 I was able to speed up JVM web-apps by multiples simply by rewriting code to do zero-copy and reuse buffers. None of these techniques were novel: just novel to Java applications.

The reason this is all relevant is that to achieve real performance, you need to do something specialized for the application, and it is exceedingly rare that some pervasive solution will work.

As another example of this, you might look at Apache 2 memory-pools: again, a solution that is adapted for the particular problem, rather than trying to impose a way of doing things language-wide.

Last thought: if you intend to do this for strings, how will this work with memory-buffer reuse? Strings are supposed to be immutable: if the backing buffer is not immutable, then the string won’t be either. If the backing buffer is immutable, how can we reuse it? Reusing memory buffers (sometimes called “buffer pooling” in Java apps, and “memory pools” in Apache 2) is a critical technique for achieving performance in data-intensive applications.
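For readers unfamiliar with the technique, buffer pooling can be sketched in a few lines. This is a hypothetical minimal version; a production pool would also cap the free list, clear sensitive data, and perhaps be thread-safe:

```ocaml
(* Minimal sketch of buffer pooling ("allocation avoidance"): recycle
   fixed-size bytes buffers instead of allocating one per message. *)
module Pool = struct
  type t = { size : int; mutable free : bytes list }

  let create ~size = { size; free = [] }

  let acquire p =
    match p.free with
    | b :: rest -> p.free <- rest; b   (* reuse: no allocation *)
    | [] -> Bytes.create p.size        (* pool empty: allocate *)

  let release p b =
    (* Only recycle buffers of the pool's size. *)
    if Bytes.length b = p.size then p.free <- b :: p.free
end
```

Note this works precisely because bytes is mutable and reusable, which is the tension with immutable string slices raised above.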

2 Likes

The reason this is all relevant is that to achieve real performance, you need to do something specialized for the application, and it is exceedingly rare that some pervasive solution will work.

Having a slice type (for bytes and arrays in general, imho) would still be very useful in many situations, even if there are various application-specific ways to reuse the buffers themselves. Buffer pooling is definitely good!

As another example of this, you might look at Apache 2 memory-pools: again, a solution that is adapted for the particular problem, rather than trying to impose a way of doing things language-wide.

Last thought: if you intend to do this for strings, how will this work with memory-buffer reuse? Strings are supposed to be immutable: if the backing buffer is not immutable, then the string won’t be either. If the backing buffer is immutable, how can we reuse it? Reusing memory buffers (sometimes called “buffer pooling” in Java apps, and “memory pools” in Apache 2) is a critical technique for achieving performance in data-intensive applications.

Most IO in OCaml would use bytes, right? string is just the end product in some cases. Byte slices are useful for IO, string slices are useful for String.trim, finding substring occurrences, etc.

1 Like

In an I/O-intensive program, those string operations you describe are what one would call “demarshalling”. So for peak performance, you’ll want to do them on the actual byte-buffer, not on a string that you created by copying out of that byte-buffer. So if you’re serious about this slice stuff, you’d want to have a read-only byte slice type, that didn’t guarantee immutability, but did lack any operation to mutate. And that would be the type on which you’d want all the string operations.

The problem is that that type would not be what you’d want for generic string manipulation. There, you’d want … immutable strings.

This is the problem: if you want a single universal type of string, it’s going to be ill-suited for many kinds of uses.

1 Like

I would assume that a bytes slice is mutable, and a string slice isn’t, and there would be the same kind of 0-cost conversion between them as we currently enjoy with bytes and string.

A concrete example: say I read a CBOR value into a buffer (not Buffer.t because it doesn’t expose a slice into its current content… ahem), and I want to deserialize it into a tree. Or maybe iterate over its structure SAX-style, to avoid allocating the tree.

The point of slices in this context is that, when encountering a CBOR text or bytes value, the iterator could just return a slice into the buffer, instead of making a copy. Currently the only way to emulate that is to return bytes * int * int (which allocates, and is pretty cumbersome to use because few OCaml functions actually take this as input).
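The zero-copy idea above can be sketched on a deliberately simplified, made-up framing (one length byte followed by the payload, which is not real CBOR encoding; all names here are hypothetical):

```ocaml
(* A reader that yields views into its buffer rather than copies. *)
type view = { buf : bytes; ofs : int; len : int }

(* Read the text item starting at [ofs]: one length byte, then that
   many payload bytes. The payload is never copied. *)
let read_text buf ofs =
  let len = Char.code (Bytes.get buf ofs) in
  if ofs + 1 + len > Bytes.length buf then failwith "truncated item";
  { buf; ofs = ofs + 1; len }

(* Materialize a string only at the edge, if the caller keeps it. *)
let view_to_string v = Bytes.sub_string v.buf v.ofs v.len
```

The SAX-style iterator would hand out view values, and only consumers that retain a value beyond the buffer’s lifetime would pay for view_to_string.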

1 Like

Sure, but a slice will be implemented as bytes * int * int in the end, so the allocation will be there. And if someone were to implement a full set of String operations on that bytes-slice, with the same signature as String, then one could use that in all your CBOR demarshalling with just a single-line change at the top of each file.

After this long thread, I came up with a summary of why I think switching the base type of string to a slice is not a good idea:

Languages that make their base types something complex are doing something wrong. They should make their base types simple, and use libraries on top for complex things. The argument that if we replace that simple base type with a complex thing, then all the existing packages will benefit, to me fails b/c if it were so much of a win, then one could just send PRs to each package maintainer, switching to the new complex type, and it’d be a no-brainer.

If string->string-slice is such a great idea, it should be possible to demonstrate it without making a change to the underlying OCaml language and library.

2 Likes

Having string slices can enable functions like String.trim, String.take and String.drop (the latter two are not present in the OCaml stdlib, though) to work in constant time, which is pretty neat.

However, in GC’ed languages, there’s a common problem with slices where a tiny slice (even 1 byte) to a huge string/bytes array prevents GC from collecting the entire huge string/bytes array. Therefore, the slices abstraction unfortunately leaks and requires careful thinking about memory usage in each case.
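The retention hazard, and the usual mitigation, can be sketched as follows. The record layout, the detach name and the threshold are illustrative choices of mine, not an established API:

```ocaml
(* The hazard: a 1-byte slice pins its entire backing buffer. A common
   mitigation is to copy a slice out when it covers only a small
   fraction of its base, so the base can be collected. *)
type slice = { base : string; ofs : int; len : int }

let detach ?(threshold = 4) s =
  if s.len * threshold < String.length s.base then
    (* Tiny slice of a big base: make a compact copy. *)
    { base = String.sub s.base s.ofs s.len; ofs = 0; len = s.len }
  else s
```

This is exactly the careful per-case thinking the post above describes: the library cannot know whether a slice will outlive its buffer, so the caller has to decide when to detach.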

3 Likes

Not OCaml strings: they are just a char array, and you need to check caml_string_length for the length.

Thanks for the examples.

I think that this raises an important question: before implementing slices in the language we should take a hard look at performance benchmarks. Otherwise, this is optimization driven only by vague intuition.

Typically, it’s worth remembering that the GC keeps a chunk of memory available, and that allocations in the minor heap usually consist of a simple pointer bump, amounting to de facto memory pooling for short-lived allocations.
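One way to ground such benchmark discussions: Gc.allocated_bytes counts all bytes the program has ever allocated, so the delta around a thunk measures its allocation cost. This is an allocation counter, not a benchmark, and the helper name is mine:

```ocaml
(* Measure how many bytes a thunk allocates. Sys.opaque_identity
   keeps the compiler from optimizing the result away. *)
let bytes_allocated f =
  let before = Gc.allocated_bytes () in
  ignore (Sys.opaque_identity (f ()));
  Gc.allocated_bytes () -. before
```

Comparing a slice-style (string * int * int) triple against a copying String.sub on a large string with this helper makes the "slicing allocates a few words, copying allocates the whole substring" claim directly observable.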

1 Like

This is true for -very- short-lived memory allocations, but it turns out, for many I/O-intensive tasks, the main data-structures (memory buffers for messages, or for blocks of text, depending on what the I/O is for) are not short-lived enough. And it turns out that pooling them for reuse is an -incredibly- effective technique for speeding up code. I used to call it “allocation avoidance”.[1]

[1] I once hacked the Javasoft JVM profiler so that instead of using the system clock for “timing”, it used a counter of all bytes allocated by the GC/allocator. Effectively, you got a profile with bytes-allocated serving the same function as “elapsed time”. It was a very, very effective way of finding code to be optimized – removing allocations turned out to be a big CPU time speedup. Now, that was the JVM, and it didn’t have a nursery (unlike OCaml) but read on …

If you look at Mark Hayden’s PhD thesis, you’ll see that he didn’t remove all allocations: just the ones for the most important data-structures of the application, which in his case were the network message buffers of the Ensemble distributed communication system. I’ve done similar things to HTTP web-app servers, with similar dramatic speedups.

I do acknowledge and know that as long as allocations stay in the minor heap and are not promoted, they’re going to be no biggie, but you’d be surprised at how small that heap is. And as your program grows in complexity, you can find that bits of functionality that used to fit in the minor heap, no longer do.

ETA: I would make the higher-level observation that if a program isn’t memory-intensive, then none of these contemplated changes really matter: might as well keep using the strings/bytes that we’re using now. If it -is- memory-intensive, then you’re going to need to do special things to get real performance, and typically the sorts of stuff that people in other languages have done … isn’t enough.

Frankly and bluntly, I noticed that NONE of the GCed languages that get used for network runtimes have anything -like- the performance that Hayden got with Ensemble nearly 30 years ago, on vastly inferior hardware. That’s because in all those languages, they’re barely even -trying- to do the job well. That doesn’t mean that the Ensemble mechanisms are dead: in 2013 I modified Ensemble and replaced the UDP at the bottom with Infiniband (ibverbs) and got quite excellent performance over (I think it was) FDR Infiniband (it was a decade ago, I forget which fabric it was).

My memory is that Hayden got 85 microsecond round-trip time on Myrinet hardware with 75 usec intrinsic hardware RTT. So 10usec overhead, for full virtual synchrony (which is a whole bunch more than just TCP). In 1997, on then-current SUN workstations. These days, in Java and other GCed languages people are happy to get RTT in the milliseconds, and think themselves geniuses for doing so. It’s all pretty embarrassing.

ETA2: Upon further reflection, I think the numbers above were for one-way latency, not RTT. So double them for round-trip.

1 Like