Existing library for hashconsing (notably hashcons and hc) add a tag field which is a unique integer id that is used to decide equality between two objects: after hash-consing, two objects are equal iff they have the same tag.
Is there a reason why physical equality isn’t used here? The original hashcons paper mentions that two objects are equal “iff they are physically equal and iff they have the same tag”, but not why a tag is required in this case.
The [documentation] is kinda vague here:
e1 == e2 tests for physical equality of e1 and e2. On mutable types such as references, arrays, byte sequences, records with mutable fields and objects with mutable instance variables, e1 == e2 is true if and only if physical modification of e1 also affects e2. On non-mutable types, the behavior of ( == ) is implementation-dependent; however, it is guaranteed that e1 == e2 implies compare e1 e2 = 0.
Does that mean that I’m only guaranteed that physical equality will work if the data I’m hashconsing is mutable?
Note that hash consing should not be applied to mutable structures in general: the hash value of a mutable structure can change under your feet, which violates the basic invariant of hash consing.
Coming back to your initial question: apart from making it possible not to depend on the exact behaviour of ==, using a tag also provides for free a total ordering on your hash-consed terms, which is useful (eg for forming sets or maps on those terms).
Structural equality indeed does not always imply physical equality:
# "foobar" == "foo" ^ "bar";;
- : bool = false
However, it is possible to arrange a hash-consing implementation so that two values that are known to have been hash-consed are equal if and only if physically equal. To do so, you will rely on the hash-consing implementation’s “unique”/“intern” function to always return the same pointer value for an allocation that is under its control. This means that physical equality only needs to ensure to return true if given two copies of a pointer from a known single allocation site. An example of this sort can be found here.
That’s the whole point of hash consing, but that’s not what the question is about. What the OP is asking about is why use tags instead of == on the hash-consed terms.
It’s very hypothetical, but for immutable structures the compiler is allowed to unbox records and re-box on use. Example:
type 'a r = {
id : int;
v : 'a;
}
let mk v =
let id = create_id v in
{ id; v }
let check_id x y = Int.equal x.id y.id
let check_phys x y = x == y
let test_id v =
let r = mk v in
check_id r r
let test_phys v =
let r = mk v in
check_phys r r
After inlining and optimisation, test_id can be compiled to:
let test_id v =
let id = create_id v in
Int.equal id id (* This can be optimised away too, but let's pretend it isn't *)
But for test_phys, a naive unboxing strategy might generate the following:
let test_phys v =
let id = create_id v in
{ id; v } == { id; v }
This doesn’t occur currently for records, but you can see that this strategy is currently used for boxed numbers:
let[@inline] mk i = Float.of_int i
let[@inline] check_content x y =
(* Ignoring weird things like Nan or negative zero *)
Float.equal x y
let[@inline] check_phys x y = x == y
let test i =
let f : float = mk i in
(check_content f f, check_phys f f)
let () =
Random.self_init ();
let bits_eq, phys_eq = test (Random.full_int max_int) in
Format.printf "bits_eq: %b@.phys_eq: %b@."
bits_eq phys_eq
The above program will print true twice with the bytecode compiler, but true and false with the native compiler.
So comparing using the id is safer than using physical equality, but only in the sense of guarding against potential future transformations.
I can find a use case for that, where you want to cache e.g. intermediate computation:
type tree_with_height = {
tree: tree;
height: mutable int option
}
let height t =
match t.height with
| None ->
let h = Tree.height t.tree in
t.height <- Some h;
h
| Some v -> v
let equal v1 v2 = Tree.equal v1.tree v2.tree
let hash v = Tree.hash v
Another reason is, I might add a mutable bool field to make my physical equality stable if that’s needed, but at this point I might as well carry an id (though possibly physical equality should be cheaper than id equality, since there’s no heap lookup?)
Interesting. So yes, I can’t rely on physical equality “naturally” holding for non-mutable stuff