Best practice for performant string operation?

I recently received this comment (thanks!) on a package I published. It is honestly something I had never thought of before.

str_1 ^ str_2 ^ str_3 ^ str_4 ^ str_5

This code seems innocent and clean. But if you look at ^'s underlying implementation, you will see that it is inefficient. ^ only handles two strings at a time, so 4 new String.t/Bytes.t objects are created during the process and the first string is copied 4 times. In contrast, String.concat "" can do everything in one pass.
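For reference, here is a minimal sketch of the two approaches side by side (the string values are placeholders of my own, not from the comment):

```ocaml
let str_1 = "a" and str_2 = "b" and str_3 = "c" and str_4 = "d" and str_5 = "e"

(* Chained ^ builds a fresh intermediate string at every step. *)
let slow = str_1 ^ str_2 ^ str_3 ^ str_4 ^ str_5

(* String.concat computes the total length up front, allocates once,
   and copies each input exactly once. *)
let fast = String.concat "" [str_1; str_2; str_3; str_4; str_5]

let () = assert (slow = fast)
```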

This comment led me to re-evaluate all my string operations and reimplement some of them. For example, I originally had

let align_left w c s =
  let len = String.length s in
  if len >= w then
    s
  else
    s ^ String.make (w - len) c

A re-implementation was

let align_left w c s =
  let len = String.length s in
  if len >= w then
    s
  else
    let b = Bytes.create w in
    Bytes.blit_string s 0 b 0 len;
    Bytes.fill b len (w - len) c;
    Bytes.to_string b

To my surprise, the re-implementation was slightly slower than the old one. So I rewrote the function using lower-level APIs.

external bytes_unsafe_fill : bytes -> int -> int -> char -> unit
  = "caml_fill_bytes" [@@noalloc]
external bytes_unsafe_blit_string : string -> int -> bytes -> int -> int -> unit
  = "caml_blit_string" [@@noalloc]
external bytes_unsafe_to_string : bytes -> string = "%bytes_to_string"

let align_left w c s =
  let len = String.length s in
  if len >= w then
    s
  else
    let b = Bytes.create w in
    bytes_unsafe_blit_string s 0 b 0 len;
    bytes_unsafe_fill b len (w - len) c;
    bytes_unsafe_to_string b

This code is up to 100% faster than the original one (according to my limited and unscientific tests)!
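For the curious, a micro-benchmark along these lines can be sketched as follows (the iteration count and the Sys.time-based timing are my own arbitrary choices, not the original test setup):

```ocaml
(* Version 1: plain ^ concatenation. *)
let align_concat w c s =
  let len = String.length s in
  if len >= w then s else s ^ String.make (w - len) c

(* Version 2: explicit Bytes buffer with bounds-checked blit/fill. *)
let align_bytes w c s =
  let len = String.length s in
  if len >= w then s
  else begin
    let b = Bytes.create w in
    Bytes.blit_string s 0 b 0 len;
    Bytes.fill b len (w - len) c;
    Bytes.to_string b
  end

(* Crude timing harness: run f a million times and report elapsed CPU time. *)
let time label f =
  let t0 = Sys.time () in
  for _ = 1 to 1_000_000 do ignore (f ()) done;
  Printf.printf "%s: %.3fs\n" label (Sys.time () -. t0)

let () =
  time "(^)  " (fun () -> align_concat 20 ' ' "hello");
  time "bytes" (fun () -> align_bytes 20 ' ' "hello")
```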

The questions are

  • It seems chaining ^ is a trap a lot of people will fall into. Is it possible to have a compiler-level optimization for this? When the compiler sees a chain of ^ operations, it would automatically compile it to String.concat or something similar. Is there anything preventing us from doing this?

  • Why does the second implementation have worse performance than the first one?

  • Is it a good idea to use all the external functions as in the third implementation?

  • And… what is the best practice for performant string operations in general?

Thanks. Happy hacking!

This would be a really bad thing to do, I think. There are many equations between different expressions, and starting to implement this sort of stuff is just a Pandora’s box. Now, I can understand why you might have gotten the impression that a ^ b ^ c ^ d was innocent: in Java, it gets converted into (1) create a string buffer, (2) append a, b, etc, (3) extract the final string.

The thing is: OCaml (and ML generally) has a history of being a language where you can look at what you wrote and know what it's going to do. That means that when you write a ^ b ^ c ^ d you're expecting it to be three binary concats, left-to-right. The compiler doesn't second-guess you on that. At least, that's been the history.

Let’s examine this code:

let align_left w c s =
  let len = String.length s in
  if len >= w then
    s
  else
    s ^ String.make (w - len) c (* 1: 2 allocations + 2 copies *)

The line marked 1 has 2 allocations: one for the new string, and one for the concatenation, and 2 copies: again, one for the String.make and one for the concatenation.

Now let’s look at the second version:

let align_left w c s =
  let len = String.length s in
  if len >= w then
    s
  else
    let b = Bytes.create w in (* 1: allocation *)
    Bytes.blit_string s 0 b 0 len; (* 2: copy + bounds check *)
    Bytes.fill b len (w - len) c; (* 3: copy + bounds check *)
    Bytes.to_string b (* 4: allocation + copy *)

In line 1 we have an allocation. In line 2 we have a copy + a bounds check to make sure you're not getting a buffer overflow. The same is true for line 3. And then in line 4, we have another allocation + copy. We have to allocate because a string is read-only, whereas a bytes value could be modified by its holder.

So you see, the second version is actually slightly worse than the first one: we have an extra copy + 2 bounds checks. The bounds checks are needed because the copy in these cases is directed by the programmer rather than the runtime, and could therefore cause a buffer overflow if the indices were wrong.

Now let’s examine the 3rd version:

let align_left w c s =
  let len = String.length s in
  if len >= w then
    s
  else
    let b = Bytes.create w in (* 1: allocation *)
    bytes_unsafe_blit_string s 0 b 0 len; (* 2: copy *)
    bytes_unsafe_fill b len (w - len) c; (* 3: copy *)
    bytes_unsafe_to_string b (* 4: no-op *)

First of all, it’s worth mentioning that all these unsafe functions are already present in the stdlib within the Bytes module. For some reason they don’t show up in the API documentation, but Bytes has unsafe_blit_string, unsafe_blit and unsafe_fill. But it’s important to understand why they’re faster. The unsafe copy functions skip bounds checking, which is usually a bad idea. The Bytes.unsafe_to_string function doesn’t do anything at runtime: it assumes you know what you’re doing and reuses the same underlying bytes object as a string, despite potential read/write issues.
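In other words, the third version can be written with no external declarations at all, going through the stdlib's own unsafe functions (a sketch; it should behave the same as the version with externals):

```ocaml
let align_left w c s =
  let len = String.length s in
  if len >= w then s
  else begin
    let b = Bytes.create w in
    (* Same operations as the hand-rolled externals, via Stdlib.Bytes;
       these skip bounds checks, so len and w must be trusted. *)
    Bytes.unsafe_blit_string s 0 b 0 len;
    Bytes.unsafe_fill b len (w - len) c;
    (* Reuses b's storage as a string without copying. *)
    Bytes.unsafe_to_string b
  end
```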

So what’s the bottom line?

You especially need to think about your allocations. If you’re going to be augmenting a string with chars, as you are here, then avoiding ^ is a good idea. A simple ^ concatenation is best when you don’t care about performance (which is a lot of the time), or when you’re only doing one concatenation of 2 existing strings. It’s also worth considering the Buffer module, which was created for exactly these kinds of use cases. However, for this particular function, the monitored reuse of bytes as a string, though not safe in the general case, is probably optimal. Also, dropping bounds checks is very dangerous and should very rarely be done.
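For comparison, a Buffer-based version of the same function might look like this (my own sketch; note that Buffer.contents still performs a final allocation + copy):

```ocaml
let align_left w c s =
  let len = String.length s in
  if len >= w then s
  else begin
    let b = Buffer.create w in
    Buffer.add_string b s;
    (* Pad with copies of c up to width w. *)
    for _ = 1 to w - len do Buffer.add_char b c done;
    Buffer.contents b  (* final allocation + copy *)
  end
```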

You defeated your point as soon as you made it ;-). Indeed, the (^) operator is right-associative, so the expression a ^ b ^ c ^ d is actually a ^ (b ^ (c ^ d)).

Tangentially related: have a look also at this thread for a discussion and benchmarks on the subject.

Buffer will force an additional allocation+copy on you at the end, because it doesn’t give you access to its internal bytes. So if you’re counting allocations, just stay away from it.

Buffer is still useful when you want to build a string gradually but don’t want to compute how big it will eventually be (which is to say, most of the time). It is unfortunate that we don’t have access to the underlying data - after all, Buffer is a Vector version of bytes - but it’s still more useful (and safer) than trying to predict how big your final bytes buffer should be.
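That trade-off in a nutshell (a toy sketch of my own, not from the thread): when the final size is unknown up front, Buffer grows as needed and only the final Buffer.contents pays the extra copy.

```ocaml
(* Build a comma-separated line from values whose total length
   we don't know in advance. *)
let join_ints xs =
  let b = Buffer.create 16 in  (* initial guess; grows as needed *)
  List.iteri
    (fun i x ->
      if i > 0 then Buffer.add_char b ',';
      Buffer.add_string b (string_of_int x))
    xs;
  Buffer.contents b  (* one final allocation + copy *)
```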

That’s true, that’s the use case. If you’re going to allocate a string eventually it’s fine to use Buffer.

Thanks! That is an excellent explanation!

I totally overlooked the extra allocation and copy caused by Bytes.to_string in the second implementation.

It is also good to know that all the unsafe functions are actually available through Stdlib.Bytes.

I compared the performance of the function with bounds checks vs. without bounds checks (using the unsafe functions). Although the one with bounds checks is significantly faster than the string implementation, it is still 20%~30% slower than the unsafe one. For this specific function, I am 100% sure the input is within the bounds, so I decided to take the risk.
