Avoiding extra matching with extensible GADTs

It is very inspiring that the extensible GADTs (the GADT types that could be extended like extensible variant types, cf. manual as well as this post) are now available in OCaml. They do bring a promise of typed data annotations that could be extended through modules, incrementally, by adding cases to the data description tag type. Unfortunately, there’s one snag with their practical application. There’s no way to eliminate the catch-all case in pattern-matching these types. Consider the following attempt

module Data_tags = struct

  type ('a , 'b) t = ..
  
  type ('a , 'b) t +=
    | Int   : ([> `Int], int) t
    | Float : ([> `Float], float) t

end


let inspect (type t) (tag : ([`Int | `Float] , t) Data_tags.t) =
  match tag with
  | Data_tags.Int -> "Int"
  | Data_tags.Float -> "Float"

Now, I could always add assert false to the bottom, but doing so seems a bit ugly and does not inspire confidence in reliability of such setup.

I have a number of questions:

  1. How may the problem of extending data representation be solved? In my context, I need to be able to supply a basic set of ints and floats as configs to a game, but also sometimes (in certain isolated contexts) I need to pass sprites, sounds, etc… I do not wish to copy the whole data type just to accomplish that. Instead, I would like to use extended variants of types, that important cases from modules dealing with sprites, sounds, etc.
  2. Is it possible to somehow, through clever mechanisms of the language, to avoid having that assert false case in the code?
  3. Is there a guarantee that this assert false case would never be invoked?
  4. How may I approach extending the processing functions? I don’t mind calling the function that could return a correct value for cases not accounted for, but somehow I need to actually dispatch it. Even then, I’m concerned that the assert false would be necessary within the pattern-match… or not?

I don’t think so. The whole point of extensible types is that they may be extended, so there is always the possibility that more cases may exist, even added at runtime (using dynlink).

If instead of representing the “tag” of the data by itself, you represent the tag together with the data, then polymorphic variants allow you to extend sets of cases without copying the definition and without needing a catch-all case.

module Data_tags: sig
  type t = [ `Int of int | `Float of float ]
end

module Other_module: sig
  type t = [ Data_tags.t | `Sprite of sprite ]
end

let inspect (tag : Data_tags.t) =
  match tag with
  | `Int _ -> ...
  | `String _ -> ...

let other_inspect (tag : Other_module.t) =
  match tag with
  | `Int _ -> ...
  | `Float _ -> ...
  | `Sprite _ -> ...

(* You can even dispatch on the subtype: *)

let inspect2 (tag : Other_module.t) =
  match tag with
  | #Data_tags.t as x -> inspect x
  | `Sprite _ -> ...

Cheers,
Nicolas

I’m using the tagged data approach to actually store the data. The use of a single tag type (which probably could be made nebulous through some annotation, as shown by use of @@unboxed at https://gist.github.com/polytypic/fc8d71ae29fac9c770b3b91d81bf5f4f with the GADT annotations (which are simple type-level mappings really) is there to allow myself to avoid packing items in places where they need not be packed (like inside the class attributes).

The code I’ve come up with looks more along the lines

      let tagged_data = begin
        match data_tag with
        | Game_data.Tags.Int ->
          (`Int data : [> `Int of t])
        | Game_data.Tags.Float ->
          (`Float data : [> `Float of t])
          (* etc. etc. *)
      end

While I could avoid global game data type declarations through the use of polymorphic variants, I still have a problem that the functional tags that I use still need to be fully enumerated. I guess, I could try to enumerate all tags for the entire game, and then subset. However, this is a bad approach, since we’re not always dealing with ints, floats, strings, etc., but also with module-specific enumeration types, or semi-opaque types like sprites, sounds, etc. etc…

How may I approach the problem of setting up a unified representation of various types in a project in a manner that would allow extension within the project? The source of my concern is the desire not to entangle Game_data and Game_sprite modules for instance.

So, from my understanding, extensible GADTs are a bad fit here, since as @nojb said, one can always extend the type from anywhere, in a manner you cannot predict and therefore your pattern matching has to account for that unknown case. For instance one could create type (_, _) Data_tags.t += Bad : ([>`Int], int) t and pass Bad to your inspect function, which would trigger the assert false.

But you don’t want full extensibility, but rather something limited to the scope of your project, to avoid creating dependencies between modules. Wouldn’t it be “simpler” (conceptually) to split your files between Game_data_typ and Game_sprite_typ (for instance) containing only type definitions, then create a central Data_tags module containing all the cases, then have Game_data and Game_sprite depend these with actual function definitions. If your Data_tags.t is not extensible anymore but finite, you may be able to remove completely unneeded cases in pattern matching. Of course this is not perfect and may not scale well if you have several such modules.

As an aside (and somewhat unrelated) the way to extend the behavior of functions such as inspect in the presence of polymorphic variants (or extensible variants) would be for it to take as extra argument the default case handler:

let inspect (type a t) ?(default=(fun _ -> assert false)) (tag : (a, t) Data_tags.t)  =
    match tag with
    | Data_tags.Int -> "Int"
    | Data_tags.Float -> "Float"
    | x -> default x

This means you can extend the behavior of inspect after its definition (by passing a suitable function to handle the cases that are not covered, but of course, the default behavior is still assert false and is still reachable.
[Edit: typos]

1 Like

Somewhat unrelated but just wanted to flag that new cases can be added at runtime without something as complex as dynlink. One common use case is type IDs, which is done in the stdlib here: ocaml/stdlib/type.ml at 00d89eafcc83f497aa380e2c1a6e52b567a269e6 · ocaml/ocaml · GitHub

1 Like

Is there a way to construct variants like `Int of int from variants like `Int. I guess some functionality in Sexp modules of Core might help, but I do not have a concept of how it may work specifically. I could convert to S-expression, and do some changes to it. However, how should the result translate to a new type? It’s one thing to translate simple variant type tags like from `Int to `Int_Special, and quite another to translate like from `Int to `Int of int.

@K_N We cannot really put `Int * int in the same collection with `Float * float. We need `Int of int and `Float of float instead. Also, when we form a term, we need to convey to the function that forms it what type is of the input data. How may we approach this without the GADTs, whence a type of a type variable could be controlled?

It seems that to define basic functions to tag or untag we need GADTs. Unless we start naming functions for each particular type, like tag_int, tag_sprite, etc…

Agreed.

Once again, agreed.

Sure. Especially considering that these types already exist, and the issue that I have come across is more about tagging the types rather than organization of the game data itself.

This is what I am trying to conceptually avoid as much as possible. Ideally, all the types related to sprites should be editable through sprite-related modules, and the generic modules should be aloof about their presence.

There’s a certain difference between freely tagging the data, like e.g. `Sprite of Game_sprite.t and defining a type case like type t += Sprite of Game_sprite.t. The use of polymorphic variants is in principle acceptable. However, my problem is not entirely solved, since I still need to find a scheme to tag the data, and thus to translate from `Tag into `Tag of data_type within a function that does the tagging. The attempted use of GADTs that allow simple type-mapping seemed like a nice trick, but perhaps I need to consider an entirely different approach.

The solution would probably involve partial matching through the use of GADTs and phantom labels. For me it’s more of a last-resort solution, since it breaks conceptual modularity (by relying on the centralized all-encompassing data-type that contains all relevant tags), and it seems that we’re missing some opportunity that may already have been in existence.

The conceptual problem for me with this approach is that it creates a landmine of sort. Someone could deliberately feed an input to my program that would “crash” it – which is the effect of assert false statement. I seek to avoid that. In the future, I may be the one inadvertently triggering that assert false case. It’s a fate that I wish to avoid as a software developer responsible for the codebase.

I also feel that this breaks abstraction, where we have slighly different modi (pl. of modus) of encoding type values in the different parts of the game, and it is appropriate for them to use their own data-types, albeit very similar. The aim I try to achieve is to eliminate code duplication involved in handling the cases that are common to all these types.

How would you approach the following. Let’s say we have an open GADT type. What we need in our code is to form from an unknown Some_tag tagging an unknown_type data value in this GADT another tag of the form `Some_tag of unknown_data. Is there a way to accomplish this without involving Obj?

Now, it’s possible to decide not to encode a tag of which I have now knowledge at the site where the function is defined. I wonder if such semantics could be acceptable? I would like to find a way to prevent accidental silent failures.

Now, since the open type in question would be tied to some modules, the perhaps we may add one more type variable into the GADT and parameterize it … well, with module identity (?!).

@zbaylin I appreciate the insight you shared about the use of private modules in the stdlib implementation. Thank you. May I inquire, what is your perspective on the possibility of silently skipping encoding unknown terms as a default semantics, and guarding against it through the introduction of an extra type variable?

I attempted to summarize the approaches we have available with data representation problem. I think that probably in the end, we may need to use a code generation solution rather than try to find a solution in terms of language constructs. The code generation would allow us to construct a type of tags, over which we may match efficiently, possibly using phantom types. I guess, it’s not too bad.

I wonder, if any ppx or other libraries exist that would allow me to generate type definitions (and also code) that extend upon existing variant types (and pattern-matches)? Basically, something along the lines

  type _ tag =
    | Int : [> `Int] t
    | Float : [> `Float] t


  module Item_params = struct
    (* item params data ... *)
    type _ tag =
      [%extend: _ tag]
      | Sprite : [> `Sprite]
      | Sound : [> `Sound]
  end

There’s no need for any type equality relationships between the instances inside the Item_params and outside, only creating a new instance of a type based upon template is sufficient.

This is tricky if you have to extend both the set of types (int, float, sprite, …) and the set of operations (inspect, …) in a modular way. It’s easy to support extension in only one direction: use records or modules or objects to allow extending the types, and variants or polymorphic variants to allow extending the operations.

Is there a reason these have to be in the same type? Why not compose the GADTs if you don’t need to extend the types at runtime? Something like:

type _ tag =
  | Int : [> `Int] tag
  | Float : [> `Float] tag


module Item_params = struct
  type 'a super_tag = 'a tag
      
  type _ tag =
    | Sprite : [> `Sprite] tag
    | Sound : [> `Sound] tag
    | Super : 'a super_tag -> 'a tag
end

let _ = Item_params.Sprite
let _ = Item_params.Super (Int)

Conceptually, the structure of the type is singular. The use of _ super_tag type runs counter to the conceptual notion of what the type should represent. This is not good, since the elegance is important in code for pragmatic reasons, and it really pays to make the representations from concepts to language costructs — types in particular — to be as direct as possible.

An alternative that I have thought about is the use of ppx infrastructure to effectively copy-paste the relevant type definitions and also the relevant expressions involved in pattern-matching these variants. After all, a pattern match is a function from a known variant type into another.

Another approach involves a construction of a central unified type, based on the code blocks available from other modules.

Note that in both cases we need a functionality largely equivalent to templating or preprocessing, of the sort that could be found in C or C++ case, but much more relatively tame than a direct text substitution. I wonder, if any ppx extensions exist that would allow me to construct new larger types from variants provided in another module?

What do you think is a better approach in terms of performance for constructing the processing functions? To have code blocks working on the separate types, or to have a large type and a larger code block working on it?

Conceptually, it’s a good thing to be sure about the resultant semantics. An attempt at extension could introduce background implicit logic that would linger and may create confusion and errors. On that basis, I would certainly agree that the types should be produced on a generative basis, with explicit type generators from explicit input types.

As I have been thinking about this case, I wondered (here and here) about whether the solution to this problem might be to introduce some basic templating functionality, which would amount to copying definitions from other modules and pasting them at the site of definition being perfromed.

Consider the following picture of how such an approach might look like:

module Scalar_data = struct
  type ('a , 'b) t =
    | Int : (int , [`Int]) t
    | Float : (int , [`Float]) t
  [@extensible name=Scalar]
end

(* ... *)

type Sprite_data = struct
    Item = struct
       type t = { gfx = (* ... *) }
    end
    type ('a , 'b) t =
      | Sprite : (Item.t , [`Sprite]) t
    [@extensible name=Sprite]
end

type Sound_data = struct
    Item = struct
       type t = { snd = (* ... *) }
    end
    type ('a , 'b) t =
      | Sound : (Item.t , [`Sound]) t
    [@extensible name=Sound]
end

module Tactical_item_data = struct
    type t =
    [@extensible.include Scalar, Sprite]
end

module Aircraft_data = struct
    type t =
    [@extensible.include Scalar, Sprite, Sound]
end

I would appreciate your insight and thoughts on this.

I suppose, we might have some other explicit operators to construct types, but ppx-like solution that amounts to some copy-pasting effectively seems like a solution in an exactly right niche.

What I mean is: whether modular extensibility is hard depends on whether you’re trying to extend in both dimensions, or just in one. I’m not referring to particular solutions like code generation.

It’s not clear from your description whether you need to add both new operations and new types, or just new types.

If you have a fixed set of operations (like inspect), and only need to add new types, then extension is straightforward, and there’s no need for fancy solutions like extensible GADTs or ppx code generators. You can define tag as a record of operations

type 'a tag = {
    inspect: string;
    size: int;
  }

and define functions like inspect to project out the corresponding fields:

let inspect : 'a. 'a tag -> string =
  fun {inspect} -> inspect

and then add new tags in different modules, like this:

module Data_tags =
struct
  let int : int tag = { inspect = "Int"; primitive = true }
  let float : float tag = { inspect = "Float"; primitive = true }
end

module Sprite_tags =
struct
  type sprite (* = ... *)
  let sprite : sprite tag = { inspect = "Sprite"; primitive = false }
end    

With this approach there’s no danger of incomplete pattern matches, assertion failures, etc.

However, if the set of operations is not fixed then this approach won’t work, and there isn’t currently a satisfactory approach that will.

2 Likes

What is the run-time cost of using a record here instead of module or an object? Would a module, or a first-class module be more efficient (assume compilation with flambda and -O3)?

I agree that the method you are suggesting is a good approach, and it would work every time where a generic operation on a single type is required.

The problem that appears in practice is related to the need to form property lists from relatively arbitrary subsets of types in a program. In case of a game, we may have battlescape weapons, aircrafts, munitions, loot, dumb items, artefacts, etc… Each may employ some of the game data types, but not others. The corresponding property lists — lists that contain tagged items, like e.g. [(`Ignores_pain , Bool true) ; (`Sprite , Sprite <abstract>); (* ... *)] — thus must be typed somewhat differently (sometimes we may need sprites, sometimes not, sometimes we need sound, …, and sometimes we need sprites with different handling, … etc.). In fact, with tags on the left, we actually need to distinguish each case: for some weapons we have combination power, range, for others we have power, powerDecay, etc… In general, we need arbitrary subsets of tags expressible in a type.

What is the best approach here? To attempt techniques like those described in this thread (notably e.g. here)? Note that even if we found the best type to express sets and subsets of properties — corresponding to possible tagged item values, — we are still facing a dilemma that the multiple tagged item types should be represented by a single type value in order to be usefully processed. Do you see how this case is different from the application of type-specific operations on a single type?

What we end up here once again is a case of a scheme with a global data tags type with phantom row types (whether objects or polymorphic variants) used for sub-setting the admissible tags.

The problem with this approach is that we are once again facing the need to extend the global type whenever we introduce a new game data type. For example, I may introduce an animation. Doing so would require me to manually extend the global type that represents the sum of all tags with a tag representing an animation.

The only solution that I see here is to generate textually the centralized type representation from the local definition of cases and operations on them. An alternative may be to use OO features and classes.

I wonder, if we need to pay the cost of polymorphic variants every time we need to put items in a collection?

With new operations, I think that I’ll need to change the interface that enumerates all the operations regardless. In that case, all that remains is to edit all instances to implement additional operations.

I think that extending the types themselves is a more difficult case here, particularly when we need to define operations on tagged variants, in cases where tagging is genuinely required. The solution you have suggested above works by eliding tagging altogether. However, we must ask a question:

:question: Could tagging be always elided?

While the seeming obvious answer appears to be «No», an interesting thought emerges. What if the cases where tags need to be added in a mechanical manner are only the simple cases, when we just need to enumerate types? Regardless, I still don’t see how we could generically operate on collections of properties without having those properties unified into a single type by tagging, and then put into some collection (e.g. a simple list).

I suppose, in the foreground I might have an interface that puts the items into an opaque type of my own in front of a generic collection. Then, I could have my own set of tags local to the internals of that opaque type. Something along the lines of:

  type t
  val add_attr : t -> (Tag.t , 'a) -> t

And now, the internal representation of types used as input to internal converters to external property list data might as well just rely on the internally-specified set of tags, and basic subsetting with phantom types…

Hmm… and we could actually have the attributes type to carry an extra phantom type to denote what set of attributes it carries.

The only interesting question remaining in this case is how may we automatically generate the desired local variant type representing internally the property list?

Tangential: Since the expression problem was mentioned in its defining content but not its name, let me chime in. The Expression Problem (code).

2 Likes

Thank you for sharing this! I glanced earlier at this lecture, and would take some time to work it out. I also noticed that there’s a bit more related information available at this section of your website.

Do you have some scholarly articles to recommend on this issue?

Also, may I inquire, what is your impression about approaching the issues of extensibility and code reuse in OCaml? There exists an opinion that objects are undesirable, even though in another post in the same thread an important feature is mentioned — the “open recursion” [1]. I do find objects very useful in cases when I need to define a family of slightly different records — in game development context that could be different categories of a weapon, which may be a rifle, a cannon, an auto-cannon, a crossbow, etc. with the set of relevant attributes largely intersecting, but not quite being the same. While, this issue may be tackled through a form of extensible records — and object types are more than sufficient here — the classes emerge as a natural solution to any basic code reuse problems. Why is there such an objection to the object systems in OCaml supported by prominent members of the community? Is there a good replacement, particularly the problem domain I consider — whence there are many distinct entities that have many closely related attributes?

[1] It seems to me that the “open recursion” problem is more of a problem of defining mutually recursive modules, solvable by the use of functors and recursive instantiation.

OCaml’s OO classes are great for implementing behaviors where there’s a default that can be gradually refined in different directions. I’d point to this example: Visitors by François Pottier. The reason that e.g. Jane Street fights against this style of programming is: when part of the system changes, the compiler no longer “interrogates” the programmer to adapt other parts that the changed part interacts with, because it will just assume some-or-other default behavior was intended.

Edit: One could coin a slogan “security via non-extensibility”. Sometimes it is desirable that when you add an extension, for the compiler forces you to go through all places that interact with the extended type – less likely to overlook a spot where the extension is mishandled.

3 Likes

Thank you for sharing this scholarly article. The concept is very interesting, and it appears to save us some functor applications for anything sophisticated.

To me, the objects feel almost like a family of implicit functors, that could be picked and named at the time the need for extension arises. That is, it feels almost like a mechanism for auto-generating functors.

Interesting! This sounds very much like the (essentially textual) template-based extensibility, alluded here and that popped up in a few other discussions on this forum (some even as far as 2018), but only done manually. It seems that the larger software companies tend to like to use more manual labor if that gains an edge.

Thank you for sharing this. I think, I’ll keep using the class types in my code then, certainly in the interfaces where the ability to construct definitions incrementally is desired.

In principle, we could have OCaml play a role of meta-language for writing interpreters that deal with processing data. The caveat is that for a good number of problem domains, it’s just easier to model things directly in the host language. The types of representations I have spoken about in the context of gamedev, for example, are really useful both for scripting the game and also developing its main logic, as well as its extensions.

I suppose, that I only wish to see a possibility of an efficient implementation of the method invocations — after all, often enough, at the call site it is clear which method is being called at the compile time.