Could we move string and bytes to sliced types?

They’re blocks with a stored length, yes, but the C API docs state (emphasis mine):

String_val(v) returns a pointer to the first byte of the string v, with type const char *. This pointer is a valid C string: there is a null byte after the last byte in the string. However, OCaml strings can contain embedded null bytes, which will confuse the usual C functions over strings.

I’m sure lots of FFI code depends on being able to pass String_val(v) as-is to C functions expecting null-terminated strings. And turning string into a slice would admittedly break this.
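
To make that concrete, here is a minimal sketch of such a stub, assuming a hypothetical binding named caml_print_string; it is only correct because the runtime places a null byte after the last byte of every string:

    /* Hypothetical OCaml C stub: String_val(v) is handed directly to puts(),
       which expects a NUL-terminated C string. */
    #include <stdio.h>
    #include <caml/mlvalues.h>
    #include <caml/memory.h>

    CAMLprim value caml_print_string(value v)
    {
      CAMLparam1(v);
      puts(String_val(v));   /* safe only thanks to the guaranteed trailing NUL */
      CAMLreturn(Val_unit);
    }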

I even remember depending on this once for a wrapper to sqlite3_prepare*, which, despite taking a length argument, has the odd remark:

If the caller knows that the supplied string is nul-terminated, then there is a small performance advantage to passing an nByte parameter that is the number of bytes in the input string including the nul-terminator.
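
For illustration, a sketch of what the core of such a wrapper might look like (the helper name is made up; String_val and caml_string_length come from the OCaml C API):

    /* Hypothetical helper: prepare a statement from an SQL query held in an
       OCaml string. Since the runtime stores a NUL after the last byte,
       nByte may include it, which is the small speed-up the sqlite3 docs
       mention. */
    #include <sqlite3.h>
    #include <caml/mlvalues.h>

    static int prepare_from_ocaml_string(sqlite3 *db, value sql,
                                         sqlite3_stmt **stmt)
    {
      return sqlite3_prepare_v2(db,
                                String_val(sql),
                                (int) caml_string_length(sql) + 1,
                                stmt,
                                NULL);
    }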

1 Like

Ha. Yes, I guess I forgot about this one.

Do slices in other languages guarantee null-termination on the slice? I have a hard time seeing how you could achieve that without doing any memory copy…

Essentially, the OCaml doc says that any char array coming from an OCaml string is eventually null-terminated, guaranteeing that you can walk it until you reach a null byte without segfaulting.

Barring anything I missed about other languages’ slices, this seems like pretty much the same guarantee.

I guess the question is which default you prefer: copy by default, or opt-in copy.
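
As a rough illustration of what “copy by default” would mean at the FFI boundary, here is a pure-C sketch assuming strings became a pointer plus a length with no guaranteed trailing null byte: every call into a C API that expects a NUL-terminated string then needs an explicit copy (strndup is POSIX.1-2008).

    #define _POSIX_C_SOURCE 200809L   /* for strndup */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* With only (pointer, length), passing a slice to a function that expects
       a NUL-terminated string requires copying it first. */
    static void print_slice(const char *data, size_t len)
    {
      char *copy = strndup(data, len);   /* copies len bytes and appends a NUL */
      if (copy == NULL) return;
      puts(copy);
      free(copy);
    }

    int main(void)
    {
      const char whole[] = "hello, world";
      print_slice(whole + 7, 5);         /* prints "world"; only the slice is copied */
      return 0;
    }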

That being said, I checked[1] and it looks like, while slices share their underlying data even on reslicing (which can lead to strange results IMO, considering slices aren’t immutable), reslicing a string (e.g., foo[:]) duplicates the data.

Have to say, I am confused.


  1. Quickly. Should be double checked.

Java stopped using slices for String.substring in 1.7.0_06. Most of the analysis seems applicable to OCaml, so IMHO it seems likely that slices would reduce performance (that is, the performance impact would need to be measured). The author’s reasoning is on Reddit:


I’m the author of the substring() change, though in total disclosure the work and analysis on this began long before I took on the task. As has been suggested in the analysis here, there were two motivations for the change:

  • reduce the size of String instances. Strings are typically 20-40% of common apps’ footprint. Any change which increases the size of String instances would dramatically increase memory pressure. This change to String came in at the same time as the alternative String hash code and we needed another field to cache the additional hash code. The offset/count removal afforded us the space we needed for the added hash code cache. This was the trigger.
  • avoid memory leakage caused by retained substrings holding the entire character array. This was a longstanding problem with many apps and was quite significant in many cases. Over the years many libraries and parsers have specifically avoided returning substring results to avoid creating leaked Strings.

So how did we convince ourselves that this was a reasonable change? The initial analysis came out of the GC group in 2007 and was focused on the leaking aspect. It had been observed that the footprint of an app (glassfish in this case) could be reduced by serializing all of its data then restoring it in a new context. One original suggestion was to replace character arrays on the fly with truncated versions. This direction was not ultimately pursued.

Part of the reason for deciding not to have the GC do “magic” replacement of char arrays was the observation that most substring instances were short-lived and non-escaping. They lived in a single method on a single thread and were generally allocated (unless really large) in the TLAB. The comments about the substring operation becoming O(n) assume that the substring result is allocated in the general heap. This is not commonly the case and allocation in the TLAB is very much like alloca(): allocation merely bumps a pointer.

Internally the Oracle performance team maintains a set of representative and important apps and benchmarks which they use to evaluate performance changes. This set of apps was crucial in evaluating the change to substring. We looked closely at both changes in performance and change in footprint. Inevitably, as is the case with any significant change, there were regressions in some apps as well as gains in others. We investigated the regressions to see if performance was still acceptable and correctness was maintained. The most significant performance drop turned out to be in an obsolete benchmark which did hundreds of random substrings on a 1MB string and put the substrings into a map. It then later compared the map contents to verify correctness. We concluded that this case was not representative of common usage. Most other applications saw positive footprint and performance improvements or no significant change at all. A few apps, generally older parsers, had minor footprint growth.

Post-ship, the feedback we have received has been mostly positive for this change. We have certainly heard, since the release of this change, of apps where performance or memory usage regressed. There have been specific developer-reported regressions and a very small number of customer escalations about performance regressions. In all the regression cases thus far it’s been possible to fairly easily remediate the encountered performance problems. Interestingly, in the cases we’ve encountered, the performance fixes we’ve applied have been ones that would have had a positive benefit for either the pre-7u6 or the current substring behaviour. We continue to believe that the change was of general benefit to most applications.

Please don’t try to pick apart what I’ve said here too much. My reply is not intended to be exhaustive but is a very brief summary of what was almost six months of dedicated work. This change certainly had the highest ratio of impact measurement and analysis relative to dev effort of any Java core libraries change in recent memory.

7 Likes

That’s a pretty strong piece of evidence, and an 11-year-old one at that.

Now I wonder if we could have a proper slice type in the standard library, to give users who need it access to a well-supported, potentially well-optimized type.

I also wonder what the cost (in terms of time and effort) of hacking up a PoC of sliced strings would be.

Note that the bytesrw library makes precisely such a proposal, albeit for bytes and tailored to byte-oriented IO, where I think it’s very much worth sharing buffers among byte-crunching processors until you decode to higher-level data structures.

Regarding strings, despite having designed a whole module for substrings with a nice (IMHO :-) graphical guide, I’m not convinced by their usefulness and usability. As many people already pointed out, they tend to bring problems you wish you didn’t have to care about.

2 Likes

I have to admit, I’m a little surprised there is such a strong distinction made between bytes and string here :sweat_smile:.

That being said, this is a little out of scope for the discussion, but is there a strong rationale behind preventing the creation of empty slices of bytes? :o

I’m not sure I entirely got your comment, but perhaps the point here is the notion of slice validity, which, while applicable in other contexts, mostly makes sense in the context of byte streams (and perhaps one day could be enforced via things like modes).

Since these slices bound their lifespan one way or another, you don’t run into the problems @chshersh mentioned.

In the context of byte streams, yes. See the second part of this comment.

1 Like