Reliability of String physical equality for interning

giltho · July 1, 2025, 11:08am

Hello,

Existing library for hashconsing (notably hashcons and hc) add a tag field which is a unique integer id that is used to decide equality between two objects: after hash-consing, two objects are equal iff they have the same tag.

Is there a reason why physical equality isn’t used here? The original hashcons paper mentions that two objects are equal “iff they are physically equal and iff they have the same tag”, but not why a tag is required in this case.

The [documentation] is kinda vague here:

e1 == e2 tests for physical equality of e1 and e2. On mutable types such as references, arrays, byte sequences, records with mutable fields and objects with mutable instance variables, e1 == e2 is true if and only if physical modification of e1 also affects e2. On non-mutable types, the behavior of ( == ) is implementation-dependent; however, it is guaranteed that e1 == e2 implies compare e1 e2 = 0.

Does that mean that I’m only guaranteed that physical equality will work if the data I’m hashconsing is mutable?

nojb · July 1, 2025, 11:29am

Note that hash consing should not be applied to mutable structures in general: the hash value of a mutable structure can change under your feet, which violates the basic invariant of hash consing.

Coming back to your initial question: apart from making it possible not to depend on the exact behaviour of ==, using a tag also provides for free a total ordering on your hash-consed terms, which is useful (eg for forming sets or maps on those terms).

Cheers,
Nicolas

jjb · July 1, 2025, 11:52am

Structural equality indeed does not always imply physical equality:

# "foobar" == "foo" ^ "bar";;
- : bool = false

However, it is possible to arrange a hash-consing implementation so that two values that are known to have been hash-consed are equal if and only if physically equal. To do so, you will rely on the hash-consing implementation’s “unique”/“intern” function to always return the same pointer value for an allocation that is under its control. This means that physical equality only needs to ensure to return true if given two copies of a pointer from a known single allocation site. An example of this sort can be found here.

jonathandoyle · July 1, 2025, 11:53am

Also, it is quite probable that comparing tags gives you better performance than making deep comparisons of nested structures.

nojb · July 1, 2025, 12:14pm

That’s the whole point of hash consing, but that’s not what the question is about. What the OP is asking about is why use tags instead of == on the hash-consed terms.

Cheers,
Nicolas

vlaviron · July 1, 2025, 12:45pm

It’s very hypothetical, but for immutable structures the compiler is allowed to unbox records and re-box on use. Example:

type 'a r = {
  id : int;
  v : 'a;
}

let mk v =
  let id = create_id v in
  { id; v }

let check_id x y = Int.equal x.id y.id
let check_phys x y = x == y

let test_id v =
  let r = mk v in
  check_id r r

let test_phys v =
  let r = mk v in
  check_phys r r

After inlining and optimisation, test_id can be compiled to:

let test_id v =
  let id = create_id v in
  Int.equal id id (* This can be optimised away too, but let's pretend it isn't *)

But for test_phys, a naive unboxing strategy might generate the following:

let test_phys v =
  let id = create_id v in
  { id; v } == { id; v }

This doesn’t occur currently for records, but you can see that this strategy is currently used for boxed numbers:

let[@inline] mk i = Float.of_int i
let[@inline] check_content x y =
  (* Ignoring weird things like Nan or negative zero *)
  Float.equal x y
let[@inline] check_phys x y = x == y
let test i =
  let f : float = mk i in
  (check_content f f, check_phys f f)
let () =
  Random.self_init ();
  let bits_eq, phys_eq = test (Random.full_int max_int) in
  Format.printf "bits_eq: %b@.phys_eq: %b@."
    bits_eq phys_eq

The above program will print true twice with the bytecode compiler, but true and false with the native compiler.

So comparing using the id is safer than using physical equality, but only in the sense of guarding against potential future transformations.

giltho · July 1, 2025, 10:38pm

I can find a use case for that, where you want to cache e.g. intermediate computation:

type tree_with_height = {
   tree: tree;
   height: mutable int option
}

let height t =
  match t.height with
  | None ->
      let h = Tree.height t.tree in
      t.height <- Some h;
      h
  | Some v -> v

let equal v1 v2 = Tree.equal v1.tree v2.tree
let hash v = Tree.hash v

Another reason is, I might add a mutable bool field to make my physical equality stable if that’s needed, but at this point I might as well carry an id (though possibly physical equality should be cheaper than id equality, since there’s no heap lookup?)

Interesting. So yes, I can’t rely on physical equality “naturally” holding for non-mutable stuff

Topic		Replies	Views
Physical equality safe to use for hash-consed variant type? Learning	6	968	May 29, 2019
(Physically) identifying immutable data structures Learning	6	223	February 26, 2026
Physical equality of functions Learning	3	903	December 10, 2021
Mnemonic for <> versus != Learning	11	577	October 11, 2025
Using `=` on cyclic data structure causes out-of-memory error Learning	3	368	July 6, 2023

Reliability of String physical equality for interning

Related topics