To create or not to create a new type

jonathandoyle · June 7, 2021, 3:30pm

Recently, I was working on some Ocaml project, and after completion I experimented with some refactoring, which made me realize with some surprise that the code become shorter and simpler when two different types (type atom=Atom of int and molecule=Molecule of int) became coalesced in one single type (type atom_or_molecule = AoM of int).
In the new code, there are many places with comments like here we are dealing with an atom ... and here we are dealing with a molecule ... and initially I thought it was better to be “explicit” and “transparent” about this, but experience has convinced me of the opposite.

The new version has more “polymorphic” functions (involving atom_or_molecule arguments or return values) and what they do is less immediately clear than in the old version, but the overall structure of the program is simpler and clearer.

This makes me ask to the OCaml programmers out-there : do you have rules of thumb for when to create or not to create a type representing a “notion” appearing in the problem you’re dealing with ?

gasche · June 7, 2021, 4:07pm

Would you have a link to the commit(s) that perform the change, so that one can look at the diff out of curiosity?

jonathandoyle · June 7, 2021, 4:26pm

Not currently (the first version was made all in one go in an afternoon without using github, and a few days later the second version was made similarly. If I put anything on github it will be a polished form of the second version).

Chet_Murthy · June 7, 2021, 4:48pm

~~Hmm … I would have tried~~

~~type molecule = Molecule of atom~~

~~Then every operation taking atom as an argument, could be “lifted” to an operation on molecules.~~

As for rules of thumb: the real question is whether it is often the case that one might mistake a molecule for being an atom; if so, then it’s important to have different types, even at the cost of verbosity. I presume that the int is an index into some table, hence (Atom 45) is definitely not the same as (Molecule 45), yes? In which case, I’d opt to keep them separate types.

ETA: now that I think about it, the first part is silly. If the int is an index, then the first bit doesn’t make sense.

hyphenrf · June 7, 2021, 6:27pm

the fact that this refactoring was possible implies that while they are distinct concepts in real world, in your code they weren’t.
My personal approach (not even an advice) to this is that the types usually come from thinking about the data going through the program first (without assumptions), then from thinking about transformations on that data – as to not end up overstructuring.

Thanks to the advanced exhaustiveness checker we have, there’s no problem starting small and adding types as needed. The checker will walk you through all the areas where the new variation needs to be addressed.

jonathandoyle · June 7, 2021, 7:00pm

It is an index into some table indeed ; on the other hand, there is a unique table for both atoms and molecules and at any given time, any registered index is an “atom” or “molecule” state but not both.
So another possiblity would be using an “enum” type such as type state = Atom |Molecule and then type atom_or_molecule_index = state * int but it would be more or less equivalent to my initial solution.

jonathandoyle · June 7, 2021, 7:11pm

You hit it on the nail. Initially I noticed that a final, high-level function expand_molecule was a complicated recursive mixture of expand_atom, so I thought, “I must separate concerns and make two separate modules Atom and Molecule with the latter depending on the former” but this blinded me to the fact that notwithstanding this, the main record type holding all the intermediairy computations results (and the “methods” on it) becomes much simpler if one allows for atom/molecule polymorphism - for example, in the first version it had several atom*atom, molecule*atom, molecule*molecule fields, while the second version only has one int * int field.

Computation trumps functional/conceptual analysis, you might say.

Chet_Murthy · June 7, 2021, 7:57pm

If there is no significant sense in which you can mistake an atom for a molecule and vice versa, then I agree that it makes no sense to have different types for them. That is to say, in your code, if you use type a_or_m as in your OP, and you aren’t commonly checking in that table to see if the thing you’ve got in your hand is an atom, or a molecule, then sure, I agree with that definition.

ETA: I thought I’d write a little more about your larger question, which might say is “when do you represent structurally identical types with not-typechecker-equal types” ? Different people have different answers to this question, but to me, the answer is:

when there is a danger of confusing values of the two types, and they need to be treated differently for correctness. So we use the type system to -separate- them, forcing us to know which type of value we have.

I don’t worry much about verbosity: using

modules with a long name (Atom/Molecule) and also a short name (A/M), and short function-names, I think we can control that.
OCaml can already be so terse, that really, verbosity isn’t as much of a concern as in some other languages.

Others have different reasoning processes for making this decision, and I certainly cannot argue that the rule above is a universal.

UnixJunkie · June 8, 2021, 5:25am

When in doubt, use more types.

PS: this one is from me.

cjr · June 8, 2021, 6:36am

It’s not clear from your description that separate types is actually worse than a single type. It kind of sounds like you’re just rearranging where the complexity lies–either the individual functions are more clear or the overall structure of the program is more clear. If that’s the case, I’d think about the future you who has to do some debugging in 6 months.

OCaml’s type system makes it so easy and lightweight to create new types. But that doesn’t mean that you should always do it. It’s like the programmer who discovers OOP and creates a plethora of classes. When I first started working in OCaml, I was creating tons of types. What I’ve found is that it’s better to first think in terms of functions. What is it that I want to do? I’m starting with X and I want to end with Y, how do I get there? The type signatures follow from there.

The type “atom_or_molecule” is a code smell to me. If two concepts can be unioned together, they must represent a higher-level concept. It’s important to identify and name that. Are they really distinct? Can an atom be conceived of as a special type of molecule or vice-versa (as @Chet_Murthy suggested in his redacted comment)? The fact that they’re usually treated the same way suggests either that they are instantiations of the same thing or that one is a variant of the other. I’d try to model that.

thierry-martinez · June 8, 2021, 1:15pm

The two types type atom=Atom of int and molecule=Molecule of int do not allow polymorphic functions to be defined over atom and molecule, but there exist other ways to distinguish types that could allow polymorphism.

For instance, atom_or_molecule could be a type constructor with a phantom type variable to distinguish between atom and molecule when appropriate, while letting polymorphic functions to be defined over 'a atom_or_molecule.

module Atom_or_molecule : sig
  type 'a t
  type atom = `atom t
  type molecule = `molecule t
  val of_int : int -> 'a atom_or_molecule
  val to_int : 'a atom_or_molecule -> int
end = struct
  type 'a t = int
  type atom = `atom t
  type molecule = `molecule t
  let of_int x = x
  let to_int x = x
end

UnixJunkie · June 9, 2021, 4:27am

I am curious, what were you doing with molecules in OCaml?
I thought I am the only person on earth dealing with molecules in OCaml !

jonathandoyle · June 9, 2021, 5:52am

It’s just a fancy name, and probably rather different from what you do … I have items, some of which (“molecules”) are decomposable and union of others. Those who are not decomposable are called atoms.

Here goes a simplified version of my interface :

module type SEED : sig

type data
type selector
val test : selector -> data -> bool 

end 

module type REPEATEDLY_PARTITIONABLE = functor (Seed:SEED) ->
sig

type computation_state
val add_new_atom : computation_state ref -> Seed.data list -> unit$
val count : computation_state ref -> int
val decomposition_at_index : computation_state ref -> int -> int list 
val force_partition_at_index : computation_state ref -> int -> Seed.selector -> unit
val get_index_opt : computation_state ref -> Seed.data list -> int option
val set_of_atoms : computation_state ref -> int list


end

ivg · June 9, 2021, 3:39pm

Shorter and simpler doesn’t always mean better. It is important how many bugs your code contain, how long it takes to develop it and to debug it, and how easy it is to test and maintain it. Roughly, the quality of code is the spent effort related to the provided functionality.

Speaking in terms of lines of code, types always add extra cost, so we might think that it is better to use languages without types, like Forth, or with dynamic types like Python. But types make it easier to understand programs, which reduces cognition burden and makes us more efficient, enables static analysis that detects common programmer errors before runtime that reduces the debugging time (which commonly contributes to 80% of the overall lifecycle of a program), and make our programs faster, which makes our programs more competitive and allows us to spend less time on optimization. Therefore, while types make programs more complex and larger they overall contribute positively to the final revenue figure. Therefore, my personal moto is,

Going back to your example. Whenever I run into code like

type t = Foo of x | Bar of x

I immediately see a refactoring opportunity. I try to factor out the common type x as it is clear that we can write generic functions that ignore whether t is Foo or Bar but depend only on x. A common transformation¹ is,

type kind = Foo | Bar
type t = { kind : kind; data : x}

or even,

type foo
type bar
type 'a t = {kind : 'a; data : x} 
(* where 'a ranges over foo, bar *)

In case when we don’t need to dispatch over the kind at all, we can eschew it altogether, e.g.,

type foo
type bar
type 'a t = x

The trick type 'a t = x is called phantom typing. To make the trick work we need to rely on the module system. Here is a nearly complete example, where we decided to represent both molecule and atom as integers. But at the same time we made everything to prevent confusion between molecule and atoms. We also declared a polymorhic type 'a element for operations common between molecules and atoms. And we breached our abstraction a bit by declaring its implementation as int.

module Element : sig 
  type 'a t = private int
  val create : 'a -> int -> 'a t
  (* the rest of polymorphic element interface *)
end = struct 
  type 'a t = int
  let create _ x = x
end

type 'a element = 'a Element.t

module Atom : sig 
   type cls
   val create : string -> cls element
   (* the rest of the atom-specific interface *)
end = struct 
  type cls = string
  let registry = Hashtbl.create ()
  let create name = 
    let id = position registry name in
    Element.create name id
 ...
end

type atom = Atom.cls

module Molecule : sig 
  val create : atom element list -> molecule element
end = struct 
   type cls = atom element
   let registry = Hashtbl.create ()
   let create atoms = 
     let id = position registry atoms in 
     Element.create atoms id
end

type molecule = Molecule.cls

¹⁾ recall that algebraic types form algebra with sum and product types and compare it with the identity

(ax + bx) = (a + b)x

UnixJunkie · June 10, 2021, 2:05am

Surprisingly, wikipedia has no page for “phantom typing” !

It would be nice if phantom types are syntactically easier to use in OCaml.
I am such a lazy programmer, that very often when it comes to “you
should parametrize the thing by a module”; i.e. write a functor, I will
very likely not do this kind of refactoring…

atavener · June 10, 2021, 4:38am

Yeah, phantom types seem to not be very well documented, even though the idea appears across a few languages, and can be quite useful.

You don’t need to involve a lot of module work. The core idea is just that you have some abstract, otherwise empty, types… which are used to parameterize the actual shared type.

I’ve used it for colorspaces, where the common implementation type is a vector of three floats: type 'a color = private V3.t and a bunch of colorspace types which are empty: xyz, rgb, lab, lch, … then some conversion functions might go from rgb color -> lab color. Everything uses the V3.t internally, but the signatures can be more specific where necessary, so someone (me) doesn’t double-convert something that’s already in sRGB, for example. I didn’t use modules for that beyond the implementation being in a file which is a module by default… and having a corresponding .mli which exposed the right “phantom” signatures for functions, hiding the general implementation.

UnixJunkie · June 10, 2021, 4:52am

What is colorspaces? Some ocaml library?
Do you have some code to show your usage example?
IIRC, in Haskell, it was syntactically way easier to use phantom types.

atavener · June 10, 2021, 10:26am

Ah, sorry – colorspaces are relevant to image processing and graphics. In this case a Color module, but I don’t have anything shared. It’s just a case where there is a uniform underlying representation: three floats can represent red,green,blue… or other “tristimulus values” like hue,saturation,lightness… each different representation can have uses/advantages, and they can be converted between. Even though the internal implementation is mostly common/shared, lacking user-visible type distinction between them could lead to erroneous code with hard-to-identify misbehavior (colors might be subtly wrong).

Well… carving out a sample of this color module doesn’t look as trivial as it should…

module Color : sig

  type 'a t = private V3.t

  (* Colorspaces; eg. Color.(rgb t) *)
  type rgb
  type xyz
  type lab
  type lch

  val make_rgb : float -> float -> float -> rgb t
  val make_lab : float -> float -> float -> lab t

  (** Colorspace conversions *)
  val lab_of_lch : lch t -> lab t
  val rgb_of_xyz : ?profile:profile -> xyz t -> rgb t
  val rgb_of_lab : ?profile:profile -> lab t -> rgb t

  (** Linear blend between two colors (in the same colorspace) *)
  val blend : 'a t -> 'a t -> float -> 'a t

end = struct

  type 'a t = V3.t
  type rgb
  type xyz
  type lab
  type lch

  let make x y z = { x; y; z }
  let make_rgb = make
  let make_lab = make

  let lab_of_lch lch =
    let a = Calc.to_radians lch.z in
    { x = lch.x;
      y = lch.y *. cos a;
      z = lch.y *. sin a }

  let rgb_of_xyz ?(profile= !curr_profile) xyz =
    profile.gamma (M33.v_by_m xyz profile.m_xyz)

  let rgb_of_lab ?(profile= !curr_profile) lab =
    rgb_of_xyz ~profile (xyz_of_lab lab)

  let blend a b t = V3.lerp a b t

end

But really, I autogenerated the mli, then doctored up the signatures. The phantom-type machinery here is trivial. I included some examples of functions with restricted types (conversions), and then blend as an example of enforcing all color inputs and output have the same colorspace, though it can be any.

cvine · June 10, 2021, 1:57pm

Some things in ocaml seem quite verbose (modules and functors for example), but phantom types seem to me to be extremely simple. I not sure how they could be made syntactically easier and still remain intelligible.

EmileTrotignon · June 10, 2021, 6:54pm

It think it would be nice to release that. I did not really know how to use external libs at the time, but I remember trying to get nice colors on a mandelbrot project, and I had to implement colorspace conversions myself, which was not particularly fun.

Topic		Replies	Views
When should types (and associated functions) form their own modules? Learning module , style	4	1042	December 3, 2018
New Draft Tutorial on Polymorphic Variants Learning announce , ocamlorg , polymorphic-variants , learn-ocaml , tutorial	24	1505	November 30, 2023
Feedback on first library before release Learning	9	1310	May 22, 2021
Haskell's type-classes (Was: Why did you learn OCaml?) Community	6	2563	August 25, 2020
Applied numerical algebra, and type systems Ecosystem	18	1091	August 7, 2022

To create or not to create a new type

Related topics