How to enforce valid db IDs with phantom types?

I have been using phantom types for some time now, and it’s been a fantastic experience. My main use case is to create types that only my modules know how to build, to enforce consistency and correctness. However, there is one edge case that I haven’t found a good way to handle, and I couldn’t find any article on the topic. It concerns types that have no particular validation rules other than coming from a trusted source. My main example is IDs: I want to avoid mixing IDs of one entity with IDs of another. These IDs have nothing special about them; they are all UUIDs, and the only way to know they are correct is to read them from the DB or to do a round trip to the DB to check that they exist. This all works fine in the core of your app, where you just parse the data from the database, but things get tricky when you need to take those IDs from other sources: an incoming REST request, CLI input, etc. One solution that comes to mind is to create special serialization formats for those IDs and force clients to send them back exactly as I return them, but this may only work for clients that I directly control, and it’s probably not going to be very nice to deal with.
How do you approach such scenarios?
Regards

If I read you correctly, what you want is to ensure that any id you hold actually exists in your database? This seems hard to enforce to me, as it is a very dynamic property. What if the object is deleted from the database, or was part of a transaction that is rolled back?

IMO, your type system should only ensure that the ids are valid “potential” ids. If you retrieve an id from an arbitrary source, you can then decode it to check it’s a valid id, but you won’t know if that id actually exists - and its existence is subject to change.

If your aim is to not mix up ids of different entities, you can always prefix them with a tag, e.g. user:9189ed3a-cedb-48a5-9516-9e9d6814f9f0 and event:78f94f53-44f1-4415-88af-7a0c75b7fe99. This way you can ensure a serialized id for some object type is never interpreted as an id of another object type - this may be what you suggested with “special serialization formats”? I think this is a good approach, as you can provide a safe string -> (id, string) Result.t function to force the user to check the validity of the input, be it because the id is malformed (invalid uuid) or not for the right object type.

Internally we end up with signatures looking like this:

module Id : sig
  type 'a t

  val to_string : _ t -> string
end

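module User_id : sig
  (* No include of Id's signature here: its 'a t would clash with t below.
     Id.to_string already accepts a User_id.t, since t is just an alias. *)
  type t = User.t Id.t

  val of_string : string -> (t, string) Result.t
end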
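A minimal, dependency-free sketch of how such signatures could be implemented with the tagged format described above. The functor name `For`, the `to_tagged` function, the `user`/`pet` marker types and the hand-written syntactic UUID check are all assumptions made for illustration; a real project would more likely rely on a UUID library.

```ocaml
module Id : sig
  type 'a t

  val to_string : _ t -> string (* the bare UUID, without the tag *)

  module For (K : sig type entity val prefix : string end) : sig
    val of_string : string -> (K.entity t, string) result
    val to_tagged : K.entity t -> string
  end
end = struct
  type 'a t = string (* 'a is a phantom parameter: it only exists in the types *)

  let to_string id = id

  let is_hex = function
    | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true
    | _ -> false

  (* Syntactic check only: 8-4-4-4-12 hex digits separated by hyphens. *)
  let is_uuid s =
    let ok = ref (String.length s = 36) in
    String.iteri
      (fun i c ->
        let hyphen = i = 8 || i = 13 || i = 18 || i = 23 in
        if hyphen <> (c = '-') || not (hyphen || is_hex c) then ok := false)
      s;
    !ok

  module For (K : sig type entity val prefix : string end) = struct
    let to_tagged id = K.prefix ^ ":" ^ id

    let of_string s =
      match String.index_opt s ':' with
      | None -> Error "missing type prefix"
      | Some i ->
        let tag = String.sub s 0 i in
        let raw = String.sub s (i + 1) (String.length s - i - 1) in
        if tag <> K.prefix then Error ("expected " ^ K.prefix ^ " id, got " ^ tag)
        else if not (is_uuid raw) then Error "malformed uuid"
        else Ok raw
  end
end

(* Hypothetical entity markers; in a real code base these would be User.t, Pet.t, ... *)
type user
type pet

module User_id = Id.For (struct type entity = user let prefix = "user" end)
module Pet_id = Id.For (struct type entity = pet let prefix = "pet" end)
```

With this shape, `Pet_id.of_string "user:9189ed3a-cedb-48a5-9516-9e9d6814f9f0"` returns an `Error`, and a `user Id.t` cannot be passed where a `pet Id.t` is expected.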

I should have explained myself better. Yes, the main purpose is to avoid sending a user ID in place of a pet ID just because both are typed as strings. And yes, by special serialization I meant adding some kind of prefix, along the lines of what you proposed.
My main worry was about clients being able to arbitrarily generate IDs out of nowhere. Obviously I don’t expect the type system (or even the client-side app) to validate whether an ID exists; that’s backend work.

But data consistency still worries me when more parties are involved. Imagine the client decides to save the data for later in IndexedDB. Then the only way I have to validate that the IDs come from valid sources is to add some metadata to the serialization and validate it on parse.
Obviously, if you have the JSON serializers and deserializers you can always get around this, but that is just plain misuse and not something I worry about.

You could use an idea similar to CSRF tokens, i.e. sign your IDs with a secret known only to your module, then you can easily validate that the ID was created by your module just by checking the signature without having to do a DB lookup. And you can hide all that generation/validation inside a module and return a phantom type when you’ve got a valid signature.
You can use an HMAC or Ed25519 signature concatenated with the UUID, and if that is too long you can always decide to truncate the HMAC signature, as a tradeoff between id length and how many false lookups you have to make because a client was able to generate a collision on the truncated signature.
(see also SYN cookies - quite effective at what they do with just a 24-bit truncated hash)
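A rough sketch of the signed-id idea, under loud assumptions: the OCaml standard library has no HMAC, so this builds HMAC-MD5 on top of `Digest` purely to stay dependency-free. A real implementation would more likely use a maintained crypto library with HMAC-SHA-256 and a constant-time comparison. The secret, the 8-character truncation, the `kind:uuid:mac` wire format and all module and function names are hypothetical.

```ocaml
(* HMAC over the stdlib MD5 digest (64-byte block size), for illustration only. *)
let hmac_md5 ~key msg =
  let block = 64 in
  let key = if String.length key > block then Digest.string key else key in
  let pad byte =
    String.init block (fun i ->
      let k = if i < String.length key then Char.code key.[i] else 0 in
      Char.chr (k lxor byte))
  in
  Digest.string (pad 0x5c ^ Digest.string (pad 0x36 ^ msg))

module Signed_user_id : sig
  type t

  val of_db : string -> t (* trusted path: the UUID was read from the DB *)
  val to_wire : t -> string
  val of_wire : string -> (t, string) result
end = struct
  type t = string

  let secret = "hypothetical-secret-known-only-to-this-module"
  let kind = "user"
  let mac_len = 8 (* truncated MAC, in hex characters *)

  (* The kind is part of the signed message, so a MAC minted for another
     entity type does not validate here. *)
  let mac uuid =
    String.sub (Digest.to_hex (hmac_md5 ~key:secret (kind ^ ":" ^ uuid))) 0 mac_len

  let of_db uuid = uuid
  let to_wire uuid = Printf.sprintf "%s:%s:%s" kind uuid (mac uuid)

  let of_wire s =
    match String.split_on_char ':' s with
    | [ k; uuid; m ] when k = kind && String.equal m (mac uuid) -> Ok uuid
    | _ -> Error "unsigned, tampered, or wrongly typed id"
end
```

Only values that went through `to_wire` parse back successfully (up to collisions on the truncated MAC), and no DB lookup is needed to establish that.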

However, this can’t protect you against IDs that got deleted in the meantime; checking that will require some form of round trip to the DB (although you could optimistically assume the object exists and raise an exception when you discover it does not?)

If the ids you send still need to be in UUID format, then perhaps you can use the new UUIDv8 to define an application-specific UUID format that holds both the DB id and a truncated hash. If you want to avoid exposing the DB id directly, then you can also encrypt the result with a block cipher (although that would consume some space for storing the nonce you used when encrypting).

If you can both 1) generate tokens out of thin air and 2) read them from an external string source, then an external source can always feed you wrong tokens. Even if you sign them as suggested by @edwin, it’s all too easy to take the valid id of an object and swap it with the id of an object of a different type in a serialized format.

My opinion on this is that the type system is there to protect you against programming errors, not malice. Users of your library will enjoy static type checking that keeps them from accidentally mixing up identifiers. The issue you’re afraid of, I think, is that something is serialized to cold storage, deserialized later, and misinterpreted because of some format change. To me, that’s a problem with input data: serialized data must be versioned and backward compatible, or this use case must be unsupported. It’s always better to sanitize the input, and if your library rejects invalid ids in the serialized data, or even better rejects ids for the wrong object type in the wrong place, that’s a nice additional safeguard, but it’s best effort. In the end, GIGO. The type checker guarantees that ids are valid within the running program because it can prove they are never mixed in an unsound manner, but it cannot do any such thing for externally serialized data that is entirely outside of its control.

If we’re trying to address malicious attempts at submitting altered data, then it’s another story altogether and you should probably sign the whole serialized block, not worry about individual ids.

Cheers,

As always very good answers.
My objective was never security, just avoid programming mistakes.
My original problem is almost always client-side code having to provide IDs in isolation, or even worse, intermediate representations that require the ID to be a string (such as when building URLs). In many of those cases all I was getting were strings, so imposing strict requirements on ID types made it impossible for some pieces of code to communicate with each other without cast functions from string to the target ID type, which partially defeats the purpose and is why I opened the issue to begin with.
That said, I think this question will benefit most from concrete examples, so I will try to gather some and come back with them.

Regards

I think the problem has two layers:

  1. You could structure the IDs you hand out to encode a type and maybe a checksum that lets you verify syntactically that you receive an ID of the proper type and overall structure. That check is dynamic, at runtime, unless you use some form of typed protocol that ensures IDs are of the correct type. You could abuse UUIDs by reserving some bits to encode a type, or invent a new string format.

  2. As others have said, if an ID refers to an object in a database you can’t avoid checking whether that object still exists.
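As a sketch of the "reserve some bits" variant of point 1, the following packs a one-byte kind tag into a 16-byte UUID laid out as UUIDv8 (version nibble 8 in byte 6, variant bits `10` in byte 8). Putting the tag in byte 0 and the function names are assumptions for illustration; any of the custom bits could be used instead.

```ocaml
(* Overwrites byte 0 with the kind tag and stamps the version/variant fields.
   [raw] is the 16-byte binary form of the id; it is copied, not mutated. *)
let tag_uuid ~kind (raw : Bytes.t) =
  if Bytes.length raw <> 16 then invalid_arg "tag_uuid: expected 16 bytes";
  if kind < 0 || kind > 0xff then invalid_arg "tag_uuid: kind must fit in a byte";
  let b = Bytes.copy raw in
  Bytes.set_uint8 b 0 kind;
  Bytes.set_uint8 b 6 ((Bytes.get_uint8 b 6 land 0x0f) lor 0x80); (* version 8 *)
  Bytes.set_uint8 b 8 ((Bytes.get_uint8 b 8 land 0x3f) lor 0x80); (* variant 10 *)
  b

(* Returns the kind tag only if the version and variant fields look right. *)
let kind_of_uuid (b : Bytes.t) =
  if Bytes.length b = 16
     && (Bytes.get_uint8 b 6) lsr 4 = 8
     && (Bytes.get_uint8 b 8) lsr 6 = 2
  then Some (Bytes.get_uint8 b 0)
  else None
```

The check stays purely syntactic: it says the id has the right shape and kind, not that the object exists.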

A trick that only works within an OCaml program is to create a token as in let tok = ref () and compare it using physical equality ==. The address of this token is unique - so it could be used to denote a certain type or identity.
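A small sketch of that token trick; the `kind` type, the record and the names are made up for illustration. Note that structural equality cannot tell the tokens apart, only physical equality can.

```ocaml
(* Each [ref ()] allocates a fresh block, so its address acts as a unique tag. *)
type kind = unit ref

let user_kind : kind = ref ()
let pet_kind : kind = ref ()

type id = { kind : kind; uuid : string }

(* Physical equality (==) compares addresses, not contents. *)
let same_kind a b = a.kind == b.kind
```

Here `user_kind = pet_kind` is true (both contain `()`), while `user_kind == pet_kind` is false, which is why the comparison has to use `==`.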

If security is not a concern then the problem should be pretty easy, as others have suggested just tag the UUID with its type e.g. user:abcd, and parse it in the code to a valid ID object that you can use internally.

EDIT: if you want to be a little future-proof, use ASCII code 31 as the separator instead of : (unless of course these IDs will ever be part of URLs, in which case best to stick with :).

Do private types help you here? For database usage, they can fit quite nicely because they can be coerced to their underlying type (string, int, etc.) but they can only be created/converted by a function in your module.

e.g.

module User : sig
  type t = private string
  val of_string : string -> t
end = struct
  type t = string
  let of_string s =
    (* Validation code goes here *)
    s
end

so you get a certain level of safety that values of type User.t must be created via a validation function, but they can then be communicated to anything which just needs the string value without use of any further function calls.
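For instance, a usage sketch along those lines (the module name, the length check standing in for real validation, and the URL shape are placeholders):

```ocaml
module User_id : sig
  type t = private string

  val of_string : string -> (t, string) result
end = struct
  type t = string

  (* Placeholder validation: a real check would parse the UUID properly. *)
  let of_string s =
    if String.length s = 36 then Ok s else Error "not a 36-character id"
end

(* Going back to string is a no-op coercion, no accessor function needed. *)
let url_of (id : User_id.t) = "/users/" ^ (id :> string)

(* The reverse direction is rejected by the type checker:
   let bad : User_id.t = "anything"   (* does not compile *) *)
```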

I’m not sure if I use them wrong but each time I tried to use private types I found them totally useless.

Since you have to write coercions even to pattern match on them, I don’t see any interest in revealing your representation. At that point you are better off having an abstract type and a coerce function.

Did anyone find a useful way of using private types?

Just a nitpick: I kind of agree on private aliases
(like here, type t = private int) being inconvenient, but not on
private type definitions (type t = private A | B of string) which are plain
awesome.

My personal experience:

private type definitions are amazing and I use them for many sum
types (along with “smart constructors” which enforce invariants at
construction time). You can match on them freely, you just can’t build
them without going through the smart constructors.

private type aliases are sometimes useful to help the compiler, too, I
think. type t = private int basically means that the compiler knows
the type (and can specialize equality, etc.) on it, but you still can’t
pass the wrong kind of int by mistake. It can be useful either for
packing units along with the type (one could imagine a nanosecond vs
microsecond timestamp types using that), or for unique identifiers that
don’t mix. I really only use the latter though.
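To make both cases concrete, here is a hedged sketch; the `Email`, `User_id` and `Pet_id` modules and their rules are invented for illustration.

```ocaml
(* Private type definition: matching is free, construction goes through
   the smart constructors. *)
module Email : sig
  type t = private Unverified of string | Verified of string

  val unverified : string -> t option
  val verify : t -> t
end = struct
  type t = Unverified of string | Verified of string

  let unverified s = if String.contains s '@' then Some (Unverified s) else None
  let verify = function Unverified s | Verified s -> Verified s
end

let describe = function
  | Email.Unverified s -> s ^ " (unverified)"
  | Email.Verified s -> s

(* Private aliases: both are known to be int, but they do not mix. *)
module User_id : sig type t = private int val of_int : int -> t end = struct
  type t = int
  let of_int n = n
end

module Pet_id : sig type t = private int val of_int : int -> t end = struct
  type t = int
  let of_int n = n
end

let same_user (a : User_id.t) (b : User_id.t) = a = b
(* same_user (User_id.of_int 1) (Pet_id.of_int 1) is rejected at compile time. *)
```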

As @c-cube mentioned, you don’t need coercion for private type definitions, so you can prevent the user from building e.g. a record type while still allowing pattern matching, field access, etc., getting IMO the best of both worlds.

For instance Timmy.Daytime exposes its nature as a record, but prevents arbitrary construction, guaranteeing that any Daytime.t is valid while still permitting fun { hours; minutes; _ } -> Printf.printf "it's %02i:%02i o'clock" hours minutes
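A simplified stand-in for that pattern (this is not Timmy’s actual API; the field set, the `make` signature and the error strings are assumptions):

```ocaml
module Daytime : sig
  type t = private { hours : int; minutes : int; seconds : int }

  val make : hours:int -> minutes:int -> seconds:int -> (t, string) result
end = struct
  type t = { hours : int; minutes : int; seconds : int }

  let make ~hours ~minutes ~seconds =
    if hours < 0 || hours > 23 then Error "invalid hours"
    else if minutes < 0 || minutes > 59 then Error "invalid minutes"
    else if seconds < 0 || seconds > 59 then Error "invalid seconds"
    else Ok { hours; minutes; seconds }
end

(* Destructuring works without any coercion or accessor... *)
let to_string { Daytime.hours; minutes; _ } = Printf.sprintf "%02i:%02i" hours minutes

(* ...but building a record literal outside the module does not type-check:
   let bogus = { Daytime.hours = 42; minutes = 0; seconds = 0 } *)
```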

Can’t you pattern-match on aliases if you provide a(n unboxed) constructor for them? e.g.

module ...
  type t = private { get : int } [@@unboxed]
end

I was attempting to use private types to implement sort of a Validated.t type while keeping an efficient representation for integers (e.g. if I try to impose some more constraints on top of them, such as a range constraint).
However I found that in certain situations you can bypass any validation done by such a generic Validated.t functor module if your type has any mutable fields (or if it contains types that have mutable fields anywhere, or e.g. if the type itself is mutable such as a Queue.t or Hashtbl.t).
So even though you cannot construct arbitrary values, you can take one, and (accidentally or not) modify a field deep inside of it, breaking any invariants that Validated.t might’ve tried to ensure.
To build a truly generic Validated.t one would need some assurances from the compiler that the type passed in the functor is not mutable, but there is no type constraint or annotation that I know of to express that.
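A small illustration of that loophole, using a hypothetical `Sorted` module whose invariant is "the array is sorted":

```ocaml
module Sorted : sig
  type t = private int array

  val of_array : int array -> t option
end = struct
  type t = int array

  let is_sorted a =
    let ok = ref true in
    for i = 1 to Array.length a - 1 do
      if a.(i - 1) > a.(i) then ok := false
    done;
    !ok

  (* Copy so the caller's own array cannot be used to alter the value later. *)
  let of_array a = if is_sorted a then Some (Array.copy a) else None
end

(* The coercion is a no-op, so it exposes the very same mutable array. *)
let break_invariant (s : Sorted.t) =
  let raw = (s :> int array) in
  if Array.length raw > 0 then raw.(0) <- max_int
```

After `break_invariant s`, the value still has type `Sorted.t`, but it is no longer sorted (for arrays of length two or more holding smaller values).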

In fact that problem is present even if you don’t use private types, if you have a no-op identity function as a val get_raw: t -> raw and raw happens to have some mutable field inside of it that is accessible externally the invariants could be broken. Still hiding the type like that and providing some invariants is useful, even if it is not 100% bugproof (a comment in the functor that the type needs to be immutable including all of its fields should be enough in practice, we’re not trying to guard against malicious usage here, just try to prevent programmer error).

I see private types as a tool for optimization where needed: e.g. if the type is int then it is possible to compare it more efficiently than with the generic Stdlib.compare, similarly the compiler already knows to optimize compare when both arguments are statically known to be string to caml_string_compare rather than caml_compare, etc.

Also a ValidatedInt module where you can see the type is guaranteed to be int is useful because then you really can’t bypass that if you stay within the OCaml language (short of using Obj or other low-level trickery), and you know the invariant has been checked once at construction time, so you can avoid checking it every time the value is used (but again that is mostly useful as a performance optimization: you could perform those checks every time your integer is used, it is just that you wouldn’t have a convenient way of handling any errors at that point).
You can also implement safe non-overflowing arithmetic with such a module (that detects and raises exceptions rather than permitting overflow), and although you can bypass that with to_int/doing arith on it/of_int if you follow certain conventions you can avoid having to worry about subtle integer wraparound bugs.
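A minimal sketch of such overflow-checked arithmetic; the `Checked_int` name, the `Overflow` exception and the choice to cover only addition are assumptions.

```ocaml
module Checked_int : sig
  type t = private int

  exception Overflow

  val of_int : int -> t
  val add : t -> t -> t
end = struct
  type t = int

  exception Overflow

  let of_int n = n

  let add a b =
    let s = a + b in
    (* Wraparound happened iff both operands share a sign that the sum lacks. *)
    if (a >= 0) = (b >= 0) && (s >= 0) <> (a >= 0) then raise Overflow else s
end
```

Going through `(x :> int)`, doing plain arithmetic, and calling `of_int` again still bypasses the check, as noted above; the module only helps as long as that convention is followed.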

No, they are useful, but they don’t solve my problem.
I mean, they are useful for the other side of the problem: Once you have the type, coming from the trusted source, you can easily use it to interact with other parts of the code, like concatenating strings.
But my problem was about creating types that are coming from untrusted (or not guaranteed to be correct) sources and ensure they are of the right type without having any validation to perform on them.

Thanks all for your input, but I have to say that so far, except maybe for private variant types, nothing mentioned here seems to be a big advantage over an abstract type with (inlined) coercions/accessors.

Note that to get the exact same power as a private type, you also need to expose a coercion functor:

module Conv: functor (X: sig type +'a t end) -> sig
  val conv: private_type X.t -> base_type X.t
end

otherwise you lose the no-op conversion for covariant containers:

type t = private int
let noop_map l = (l : t list :> int list)