(Physically) identifying immutable data structures

Hi, I would like to create a unique identifier for an immutable data structure.

On non-mutable types, physical equality is implementation dependent.

I wonder what’s the best approach:

  • Create an int Atomic.t and use Atomic.fetch_and_add to obtain a unique ID (downside: on 32 bit systems, this may overflow).
  • Use a unit ref, possibly as an extra field for the struct like in the following:
type identifiableImmutable = { x : float; id : unit ref };;
assert ({ x = 4.7; id = ref () } != { x = 4.7; id = ref () });;

Is that weird or okay to do? I would think that using ref () will create a unique identifier (physically distinct from any other created unique identifier) while being subject to garbage collection when the identifier is no longer needed, hence avoiding overflow.

I mostly ask out of curiosity. In practice, I might just stick to an atomic because don’t expect to run on 32 bit, but I’m curious anyway.

But perhaps there’s some general advice or another elegant solution that I didn’t consider.

Both the integer and reference solutions seem reasonable to me.
You could also use an open type if you wanted to:

type unique_id = ..
type identifiableImmutable = { x : float; id : unique_id }
let create x =
  let open struct type unique_id += Id end in
  (* Alternative syntax, available since OCaml 5.5 *)
  (* let type unique_id += Id in *)
  { x; id = Id }

It has the downsides of both the atomic integer and reference solutions (it allocates a block dynamically and increments a global integer counter), but if you end up needing more features than just a unique value it might prove useful.

1 Like

Using ref () is an idiomatic way of generating unique identifiers modulo physical equality. However, using an integer gets you a bit more as you get a total order for free (which is useful to build other data structures over your type).

Cheers,
Nicolas

3 Likes

Thanks for your replies. Both the open type is interesting (though might not be what I need in my scenario) as well as the note that with an atomic I get a total order for free.

I had another idea:

type identifiableImmutable = { x : float; mutable phantom : unit };;
assert ({ x = 4.7; phantom = () } != { x = 4.7; phantom = () });;

Wouldn’t that save me some memory (assuming the record gets allocated anyway)?


Afterall, unit ref is equivalent to { mutable contents : unit } (see definition in stdlib). So I guess if I want to make a data structure identifiable, it would introduce unnecessary overhead to include a unit ref field when I can instead do the same what unit ref does by adding a mutable phantom : unit myself. This should save me one allocation, right?

Is adding a mutable phantom : unit field to a custom data structure an idiom seen elsewhere in idiomatic code?


P.S.: Since I want to build a (possibly cyclic) graph, I just noticed I’ll need some sort of mutability anyway, so this solution is unlikely what I’ll use in the end, but maybe it has some useful application. Also, checking for equality isn’t sufficient to create a Set to memorize visited nodes, because a Set requires a total order. I might end up with an atomic.

Or is there a way to deduce a total order of values using their address in memory? I did not find anything like that.

Adding a mutable field to your record is a very good way to make it suitable for physical equality. You could lose a little bit of performance in some corner cases, as the later passes of the compiler tend to track mutability per-block and not per-field, but I would not consider this a problem.
I think that people usually make one of the existing fields mutable instead of adding a new one, as it saves a word of memory per value, but you may prefer ensuring that the regular fields can never be mutated.

No, in particular because the address in memory can change during garbage collection.
You can, however, create hash tables using addresses, as you only need a hash function and an equality function for that. A ('a, unit) Hashtbl.t can be used to implement the part of the Set.S API that doesn’t rely on ordering.

1 Like

Wouldn’t a mutable unit consume 0 bytes? Note it’s a unit not unit ref.

Can you explain?

But don’t I need to find a way to obtain the address? And if it changes during garbage collection (as you said above), doesn’t that cause problems? I know how to test for physical equality (using == or Repr.phys_equal), but I don’t know how to actually obtain a physical address. Also, the polymorphic hash function Hashtbl.hash is based on structural equality, not physical.


Additional question about the manual regarding Hashtbl.HashedType.hash:

val hash : t -> int

A hashing function on keys. […] Examples: suitable (equal, hash) pairs for arbitrary key types include

  • […]
  • ((==), Hashtbl.HashedType.hash) for comparing objects by physical equality (e.g. for mutable or cyclic objects).

Shouldn’t that read ((==), Hashtbl.hash) instead of ((==), Hashtbl.HashedType.hash)? As Hashtbl.HashedType is just a signature. And as I said above: The polymorphic hash function Hashtbl.hash is based on structural equality, not physical. It should still work as two different objects may have the same hash, but could downgrade performance if I have a lot of structurally equal types that are physically distinct, right?

All fields consume exactly one machine word, whatever their type.
If you want a hint at how the unit constructor () is represented, you can run this in the toplevel:

# let exception E of unit in raise (E ());;
Exception: E 0.

This little piece of code tricks the toplevel into printing a value for which it has forgotten the type, so it will look at the runtime value and guess. Here () is represented in the same way as the integer 0 so the toplevel chooses to print 0.

The garbage collector moves memory around from time to time. The main occurrence is when promoting values from the minor heap to the major heap, and compaction can also move blocks around. In both cases, the GC will ensure that any location containing a pointer to one of the old memory regions gets its contents updated to the new location.

The value you’re manipulating from OCaml is the address. Physical equality is about checking that two values are aliases to each other, or in other terms that they are pointers to the same address.

You don’t need a perfect hash function. The only constraint you need is that two values which compare equal must have the same hash; this is a bit tricky if you use the default hash and you compare mutable structures (for example let r = ref 0 in let h1 = hash r in incr r; let h2 = hash r in h1 = h2 will return false), but if your record contains any (deeply) immutable field you can restrict the hash to just that. If this ends up with too many collisions it will not be as efficient as it should but it should give correct results.

I think that the intent was to refer to Hashtbl.hash but the documentation generator picked the wrong item to link to.

Yes, exactly. The default hash function isn’t perfect either, you can get hash collisions for objects that are structurally different too, although the function was chosen so that collisions are uncommon for reasonable inputs. In the worst case (all elements in your hash table have the same hash) the hash table behaves like a regular list, with linear-time for most operations.

1 Like