val sub : string -> int -> int -> string
String.sub s start len returns a fresh string of length len, containing the substring of s that starts at position start and has length len.
I’m wondering about the adjective “fresh”. Since strings are immutable, I assumed that it does not really copy the substring in memory, but instead just creates a pointer to the relevant part of the initial string. Is this correct?
let sub s ofs len =
  if ofs < 0 || len < 0 || ofs > length s - len
  then invalid_arg "String.sub / Bytes.sub"
  else begin
    let r = create len in
    unsafe_blit s ofs r 0 len;
    r
  end
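To see the behaviour of the code above from the outside, here is a small sketch using only the standard library. The helper raises_invalid_arg is a name I made up for illustration. Note that the bounds check is written as ofs > length s - len rather than ofs + len > length s, which presumably avoids integer overflow in the addition.

```ocaml
(* Hypothetical helper: true if [f ()] raises Invalid_argument. *)
let raises_invalid_arg f =
  try ignore (f ()); false with Invalid_argument _ -> true

let () =
  let s = "hello world" in
  (* 5 characters starting at index 6, copied into a new string. *)
  assert (String.sub s 6 5 = "world");
  (* An empty substring at the very end is allowed. *)
  assert (String.sub s 11 0 = "");
  (* Out-of-range requests raise instead of being truncated. *)
  assert (raises_invalid_arg (fun () -> String.sub s 8 10));
  assert (raises_invalid_arg (fun () -> String.sub s (-1) 2))
```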
If you need substrings a lot, take a look at Astring:
version 0.8.3
repository default
url.src: "http://erratique.ch/software/astring/releases/astring-0.8.3.tbz"
url.checksum: "md5=c5bf6352b9ac27fbeab342740f4fa870"
homepage: "http://erratique.ch/software/astring"
bug-reports: "https://github.com/dbuenzli/astring/issues"
dev-repo: "git+http://erratique.ch/repos/astring.git"
authors: "Daniel Bünzli <daniel.buenzli@erratique.ch>"
maintainer: "Daniel Bünzli <daniel.buenzli@erratique.ch>"
license: "ISC"
tags: "string" "org:erratique"
depends: "ocaml" {>= "4.01.0"}
"ocamlfind" {build}
"ocamlbuild" {build}
"topkg" {build}
"base-bytes"
synopsis Alternative String module for OCaml
description
Astring exposes an alternative `String` module for OCaml. This module
tries to balance minimality and expressiveness for basic, index-free,
string processing and provides types and functions for substrings,
string sets and string maps.
Remaining compatible with the OCaml `String` module is a non-goal.
The `String` module exposed by Astring has exception safe functions,
removes deprecated and rarely used functions, alters some signatures
and names, adds a few missing functions and fully exploits OCaml's
newfound string immutability.
Astring depends only on the OCaml standard library. It is distributed
under the ISC license.
Thanks for the answer. That’s too bad, then. Do you know the reason for this?
To me, it sounds a bit inconsistent with immutability. For instance, String.copy was deprecated for this reason:
val copy : string -> string
Deprecated. Because strings are immutable, it doesn’t make much sense to make identical copies of them.
Thanks also for Astring. Astring.String.Sub seems to do exactly what I wanted.
I believe the reason is historical: strings became immutable only recently. A string value carries its own length information and is more than just a pointer into an existing string. For this reason substrings cannot easily be shared without changing the string representation.
Strings are standard OCaml blocks with the header size defining the size of the string in
machine words. The String_tag (252) is higher than the No_scan_tag, indicating that the
contents of the block are opaque to the collector. The block contents are the contents of the
string, with padding bytes to align the block on a word boundary.
+---------------+-------------------------+-----------+
| header        | 'a' 'b' 'c' 'd' 'e' 'f' | '\0' '\1' |
+---------------+-------------------------+-----------+
                  L data                    L padding
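The layout described above can be checked from OCaml itself with the Obj module. This is only a sketch for the native and bytecode runtimes; bytes_per_word and words_for_length are names I introduced, and the formula assumes there is always at least one padding byte, as in the diagram.

```ocaml
(* Bytes per machine word: 8 on 64-bit platforms, 4 on 32-bit ones. *)
let bytes_per_word = Sys.word_size / 8

(* Words used by the data plus padding of a string of length [n].
   The header word is not counted. *)
let words_for_length n = n / bytes_per_word + 1

let () =
  (* String.make gives us an ordinary heap-allocated string. *)
  let s = String.make 6 'x' in
  (* The block carries the string tag (252)... *)
  assert (Obj.tag (Obj.repr s) = Obj.string_tag);
  (* ...and its size in words matches data + padding. *)
  assert (Obj.size (Obj.repr s) = words_for_length (String.length s));
  Printf.printf "%d bytes -> %d word(s)\n"
    (String.length s) (Obj.size (Obj.repr s))
```

Because the length is derived from the header and the last padding byte, a pointer into the middle of this block would not be a valid string on its own.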
I don’t know whether the compiler implements a potentially easier optimisation: the same string constant would need only one representation in the object file. In our own code many short string constants like "true" appear multiple times, and since they are immutable, they could all be shared. Maybe it’s still difficult to implement this across modules.
The current representation of strings does not make it possible to take a substring in constant time. Changing the representation to add an extra indirection would allow this, but it would incur a small additional performance cost to all existing string operations; it is not clear that it is worth it.
Some libraries provide a “substring” type that is designed for this (quite simply, a substring is a tuple of a string, an offset and a length), which you can use in the part of your algorithm that requires constant-time String.sub. For example, Batteries has a BatSubstring module.
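To make the “string, offset, length” idea concrete, here is a minimal sketch of such a view type. It is not the API of BatSubstring or Astring; every name below (sub, of_string, get, to_string) is made up for illustration.

```ocaml
(* A hypothetical substring view: no bytes are copied until [to_string]. *)
type sub = { base : string; ofs : int; len : int }

let of_string s = { base = s; ofs = 0; len = String.length s }

(* Constant-time sub-view; [ofs] and [len] are relative to the view [v]. *)
let sub v ofs len =
  if ofs < 0 || len < 0 || ofs > v.len - len
  then invalid_arg "sub"
  else { base = v.base; ofs = v.ofs + ofs; len }

(* Bounds-checked access relative to the view. *)
let get v i =
  if i < 0 || i >= v.len then invalid_arg "get"
  else String.get v.base (v.ofs + i)

(* The only function here that allocates and copies. *)
let to_string v = String.sub v.base v.ofs v.len

let () =
  let v = sub (of_string "hello world") 6 5 in
  assert (get v 0 = 'w');
  assert (to_string (sub v 1 3) = "orl")
```

One thing to keep in mind with this design is that a small view keeps the whole base string alive for the garbage collector.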
Going in that direction, you may want to consider weirder representations, in particular “ropes”, which improve the algorithmic cost of more operations (concatenation in O(log n)), at the cost of making random access non-constant-time (typically O(log n)). ocaml-rope is one implementation of ropes. Again, the extra complexity in the structure incurs an extra overhead on all operations (typically higher than for “just” substring), so this is only worth it if your code is really sensitive to the complexity of those operations.
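For readers who have not met ropes, here is a deliberately simplified sketch. It is not how ocaml-rope is implemented: this version does no rebalancing, so concat is O(1) but get is only proportional to the depth of the tree, which can degrade to linear. A real rope keeps the tree balanced to get the logarithmic bounds mentioned above. All names are illustrative.

```ocaml
(* A minimal, unbalanced rope. *)
type rope =
  | Leaf of string
  | Node of rope * rope * int  (* left, right, cached total length *)

let length = function
  | Leaf s -> String.length s
  | Node (_, _, n) -> n

(* No copying: just build a new node on top of the two ropes. *)
let concat l r = Node (l, r, length l + length r)

(* Walk down the tree, choosing a side by comparing with the left length. *)
let rec get r i =
  match r with
  | Leaf s -> s.[i]
  | Node (l, r', _) ->
    let ll = length l in
    if i < ll then get l i else get r' (i - ll)

(* Flatten once at the end, with a single Buffer sized up front. *)
let to_string r =
  let b = Buffer.create (length r) in
  let rec go = function
    | Leaf s -> Buffer.add_string b s
    | Node (l, r, _) -> go l; go r
  in
  go r;
  Buffer.contents b
```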
In many cases using just strings, and Buffer for repeated-appending operations, is the fastest option. In some cases it is helpful to keep a tree of strings around, to be concatenated at the end.
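As a small illustration of the Buffer approach, here is a sketch of a join function (join_with_buffer is a made-up name; in real code String.concat already does this job). The point is that the pieces are appended into one growing buffer instead of building a new intermediate string at each step with ^.

```ocaml
(* Join [parts] with [sep] using a single Buffer. *)
let join_with_buffer sep parts =
  let b = Buffer.create 64 in
  List.iteri
    (fun i p ->
       (* Separator goes before every element except the first. *)
       if i > 0 then Buffer.add_string b sep;
       Buffer.add_string b p)
    parts;
  Buffer.contents b

let () =
  let parts = ["alpha"; "beta"; "gamma"] in
  assert (join_with_buffer ", " parts = String.concat ", " parts)
```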
Even though constant-time String.sub could be implemented in theory, I guess that this would break a lot of programs that make use of Bytes.unsafe_to_string.
Idea: an assumption-stdlib-string-sub-is-fresh opam package that is empty but for one test that fails if String.sub introduces sharing. You can depend on that package in your own packages to inherit all the constraints it introduces on OCaml versions.
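A rough sketch of what the single test in such a package might look like, assuming nothing beyond the standard library. The function name sub_is_fresh is hypothetical. The test deliberately breaks immutability through Bytes.unsafe_of_string, which is exactly the kind of code that would be affected by sharing, so it should never be copied into normal programs.

```ocaml
(* True if writing through the result of String.sub leaves the
   original string untouched, i.e. no memory is shared. *)
let sub_is_fresh () =
  (* String.make allocates a new heap string, so no literal is mutated. *)
  let s = String.make 8 'a' in
  let t = String.sub s 2 4 in
  (* Deliberate violation of immutability, for this test only. *)
  Bytes.fill (Bytes.unsafe_of_string t) 0 4 'z';
  s = String.make 8 'a'

(* A full-length substring should also be a physically distinct value. *)
let full_sub_is_distinct () =
  let s = String.make 8 'a' in
  String.sub s 0 (String.length s) != s

let () =
  assert (sub_is_fresh ());
  assert (full_sub_is_distinct ())
```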