String.sub memory usage

The manual says:

val sub : string -> int -> int -> string
String.sub s start len returns a fresh string of length len, containing the substring of s that starts at position start and has length len.

I’m wondering about the adjective “fresh”. Since Strings are immutable, I assume that it does not really copy the substring in memory, but instead just creates a pointer to the relevant part of the initial string. Is this correct?

1 Like

String.sub is implemented using the Bytes module. My reading is that it allocates fresh memory:

let sub s ofs len =
  if ofs < 0 || len < 0 || ofs > length s - len
  then invalid_arg "String.sub / Bytes.sub"
  else begin
    let r = create len in
    unsafe_blit s ofs r 0 len;
    r
  end

If you need substrings a lot, take a look at Astring

version       0.8.3
repository    default
url.src:      "http://erratique.ch/software/astring/releases/astring-0.8.3.tbz"
url.checksum: "md5=c5bf6352b9ac27fbeab342740f4fa870"
homepage:     "http://erratique.ch/software/astring"
bug-reports:  "https://github.com/dbuenzli/astring/issues"
dev-repo:     "git+http://erratique.ch/repos/astring.git"
authors:      "Daniel Bünzli <daniel.buenzl i@erratique.ch>"
maintainer:   "Daniel Bünzli <daniel.buenzl i@erratique.ch>"
license:      "ISC"
tags:         "string" "org:erratique"
depends:      "ocaml" {>= "4.01.0"}
              "ocamlfind" {build}
              "ocamlbuild" {build}
              "topkg" {build}
              "base-bytes"
synopsis      Alternative String module for OCaml
description
          Astring exposes an alternative `String` module for OCaml. This module
          tries to balance minimality and expressiveness for basic, index-free,
          string processing and provides types and functions for substrings,
          string sets and string maps.
          Remaining compatible with the OCaml `String` module is a non-goal.
          The
          `String` module exposed by Astring has exception safe functions,
          removes deprecated and rarely used functions, alters some signatures
          and names, adds a few missing functions and fully exploits OCaml's
          newfound string immutability.
          Astring depends only on the OCaml standard library. It is distributed
          under the ISC license.

thanks for the answer. That’s too bad; then. Do you know the reason for this?
To me, it sounds a bit inconsistent with immutability. For instance, String.copy was deprecated for this reason:

val copy : string -> string
Deprecated.Because strings are immutable, it doesn’t make much sense to make identical copies of them.

Thanks also for Astring. Astring.String.Sub seems to do exactly what I wanted.

I believe the reason is historic because strings became immutable only recently. A string contains length information and is more than just a pointer into an existing string. For this reason substrings cannot be easily shared without changing the string representation.

https://v1.realworldocaml.org/v1/en/html/memory-representation-of-values.html

Strings are standard OCaml blocks with the header size defining the size of the string in
machine words. The String_tag (252) is higher than the No_scan_tag, indicating that the
contents of the block are opaque to the collector. The block contents are the contents of the
string, with padding bytes to align the block on a word boundary.

+---------------+----------------+--------+-----------+
| header        | 'a' 'b' 'c' 'd' 'e' 'f' | '\O' '\1' |
+---------------+----------------+--------+-----------+
                L data                    L padding

I don’t know whether the compiler implements a potentially easier optimisation: the same string constant would need only one representation in the object file. In our own code many short string constants like "true" appear multiple times and since they are immutable, they could be all shared. Maybe it’s still difficult to implement this across modules.

1 Like

The current representation of strings does not make it possible to take a substring in constant time. Changing the representation to add an extra indirection would allow this, but it would incur a small additional performance cost to all existing string operations; it is not clear that it is worth it.

Some libraries propose a “substring” type that is designed for this (quite simply, a substring is a tuple of a string, an offset and a length), that you can use in the part of your algorithm that requires constant-time String.sub. For example Batteries has a BatSubstring module.

Going in that direction, you may want to consider weirder representations, in particular “ropes”, that improve the algorithm cost of more operations (concatenation in O(log n)), at the cost of making random access non-constant-time (typically O(log n)). ocaml-rope is one implementation of ropes. Again, the extra complexity in the structure incurs an extra overhead on all operations (typically higher than for “just” substring), so this is only worth it if your code is really sensitive to the complexity of those operations.

In many cases using just strings, and Buffer for repeated-appending operations, is the fastest option. In some cases it is helpful to keep a tree of strings around, to be concatenated at the end.

4 Likes

Even though constant-time String.sub could be implemented in theory, I guess that this would break a lot of programs that make use of Bytes.unsafe_to_string.

Idea: an assumption-stdlib-string-sub-is-fresh opam package that is empty but for one test that fails if String.sub introduces sharing. You can depend on that package in your own packages to inherit all the constraints it introduces on OCaml versions.

Idea: a collection of assumption-* packages.

2 Likes