[ANN] BAP 2.0 Release

The Carnegie Mellon University Binary Analysis Platform (CMU BAP) is a suite of utilities and libraries that enables analysis of programs in their machine representation. BAP is written in OCaml, relies on dynamically loaded plugins for extensibility, and is widely used for security analysis, program verification, and reverse engineering.

This is a major update that includes many new features, libraries, and tools, as well as improvements and bug fixes to the existing code. In this announcement we would like to focus on two very important features of BAP 2.0:

The Knowledge Base

The Knowledge Representation and Reasoning System, or the Knowledge Base (KB) for short, is the central part of our new architecture. The KB is a step forward from the conventional approach of staging multiple analyses, in which dependent analyses (aka passes) are ordered topologically based on their dependencies. The KB is inspired by logic programming and enables optimal and seamless staging of mutually dependent analyses. Mutually dependent analyses are also present in source-based program analysis, but they are much more evident in binary analysis and reverse engineering, where even such basic artifacts as the control-flow graph and the call graph are not available as ground truth (and in general are not even computable).

Object properties in the KB are represented with directed-complete partially ordered sets. The KB also imposes a monotonicity restriction, which requires that all updates to a property are monotonic, i.e., each successive value of the same property is a refinement of the previous one. These restrictions enable the KB to compute the least fixed point of any property. A property representation can optionally be refined into a complete lattice, which gives KB users extra control over how properties are computed.

By storing all information in an external location, the KB addresses the scalability issue that is so relevant to binary analysis and reverse engineering. In the future, we plan to implement distributed storage for the Knowledge Base, as well as experiment with other inference engines. Soon, it should also be possible to work with the knowledge base from non-OCaml programs, including our BAP Lisp dialect, which practically turns the knowledge base into a common runtime for binary analysis. In the current version of BAP, the Knowledge Base state is fully serializable and portable between different versions of BAP and OCaml, and even between native and bytecode runtimes. The Knowledge Base state can be imported into an application and is integrated with the BAP caching system.

New Program Representation

Employing the tagless-final embedding together with our new Knowledge Base, we were able to achieve our main goal: to switch to an extensible program representation without compromising any existing code that uses the current, non-extensible BAP Intermediate Language (BIL). The new representation allows us to add new language features (like floating-point operations or superscalar pipeline manipulations) without breaking (or even recompiling) the existing analyses. It also facilitates the creation of new intermediate languages, which can all coexist with each other, making it easier to write formally verified analyses.
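
As a toy illustration of this kind of extensibility (schematic signatures, not BAP’s actual interfaces): a new language feature can arrive as a new signature that extends an existing one, so analyses written against the old signature keep compiling and linking untouched.

module type BASIC = sig
  type 'a repr
  val int : int -> int repr
  val add : int repr -> int repr -> int repr
end

module type WITH_FLOAT = sig
  include BASIC
  val fadd : float repr -> float repr -> float repr
end

(* an analysis written against BASIC never mentions fadd, so adding
   WITH_FLOAT requires neither changing nor recompiling it *)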


11 Likes

The knowledge representation/query feature sounds exciting – thank you for the announcement.

I had a very quick look at the BAP documentation “Knowledge representation and reasoning system” – the authors have spent a lot of time precisely documenting all the interfaces!

However, there are a lot of very specific technical terms, and those not steeped in the BAP way of doing things can get lost after reading a paragraph or two. The definitions are overwhelming.
In general, a lot of us learn by the “example method”: a small example communicates the essential features, and after that it becomes easier to read the odoc documentation.

What would be awesome is a small example of the new knowledge representation and reasoning system. How can I add some knowledge to the KB and how can I query it?

Also, how are you going to deal with the scalability side of things? Usually a LOT of data can be generated for non-trivial programs. How do you deal with the usual problems of large query times? Is there a storage backend for the KB?

3 Likes

Yes, I suggested creating something like the Semmle QL (or Infer AL) query language, probably based on Datalog.

1 Like

That’s a very fair point. We need examples indeed. The main challenge here is that the Knowledge Library is so generic and operates with such abstract objects that it is hard to give a good example without unnecessarily limiting the scope of the library, at least in the mind of the reader (on the other hand, not giving any initial scope is probably worse).

Additionally, I haven’t found an impressive example that would showcase the power of the KB while still being easy to understand. I need something that involves partial knowledge, mutual recursion, and conflicting opinions :slight_smile: If you have any ideas, please share. I hope I will find something. Besides, we also have a blog, and we have even started a series of blog posts about the knowledge base, which is less formal and more concrete. We will continue it soon.

In general, the usage is very simple: you define a class of objects, e.g.,

open Bap_knowledge
type student
let student = Knowledge.Class.declare "Student"

You can define properties, which objects of the class share, e.g.,

let name = Knowledge.Class.property student "name" Knowledge.Domain.string

The name value has type (student,string) Knowledge.slot and can be used to access the name property of any student.

To set the value of a property we use the provide operator, but we need a student to give it a name, so we first create a new student,

let new_student given =
  Knowledge.Object.create student >>= fun s ->
  Knowledge.provide name s given >>| fun () ->
  s

We can access the property value of a student using the collect operator, e.g.,

let student_name s = Knowledge.collect name s
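
For completeness, the pieces above can be glued into one computation. This is a sketch: it assumes the monadic operators (>>=, >>|) come from the Knowledge monad’s syntax module.

open Knowledge.Syntax

(* create a student, give it a name, and read the name back *)
let greeting given =
  new_student given >>= fun s ->
  student_name s >>| fun name ->
  "Hello, " ^ name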

And so on. Basically, the Knowledge Base provides a runtime for a language with objects. It can also be seen as a heterogeneous dictionary on steroids (besides, it is quite optimized under the hood and is faster and more memory-efficient than Jane Street’s or the standard OCaml maps, or even hashtables - underneath it is a rather convoluted binary tree).

In fact, the Knowledge library is indeed a runtime: it is a runtime for the BAP Lisp language, which we will release a little bit later. The main difference from a common runtime is that we do not really provide real mutability. What we have is a very limited form of mutability, in which you can only add more information to a value, e.g., you can go from None -> Some 42, but you can’t change it from Some 42 to Some 43. This limited form of mutability prevents the common problems of mutable state while remaining useful. Due to the monotonicity restriction (no information can be retracted) and the internal fixed point, the program always faces a consistent state at any point in time.
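
To make that restriction concrete, here is a toy model of a monotonic update for an 'a option property, where None stands for the absence of knowledge (an illustration only, not the KB’s actual merge operator):

let refine ~old ~new_ = match old, new_ with
  | None, x -> Ok x                      (* None -> Some 42 is allowed     *)
  | x, None -> Ok x                      (* adding nothing changes nothing *)
  | Some x, Some y when x = y -> Ok old  (* restating a fact is fine       *)
  | Some _, Some _ -> Error `Conflict    (* Some 42 -> Some 43 is rejected *)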

This brings us to the scalability question. First of all, we have already introduced an extra layer of indirection by externalizing the state of our computation from the OCaml runtime to our own runtime. This lets us stop the computation at any moment, migrate to other processes, and resume later. We can also pass knowledge between different processes and employ parallelization where possible (e.g., in binary analysis we may analyze independent libraries in parallel). Right now we have a custom implementation of the runtime in OCaml, but we can later move to other backends, like Redis, for example (at least as the storage).

But eventually (and very soon) the knowledge base will grow beyond an acceptable size, mostly because it also produces a lot of intermediate knowledge. Indeed, since we wrote our own runtime, we probably need to implement a GC for it. And this is the crucial idea: since our knowledge is monotonic, instead of implementing a conservative garbage collector, which would negatively affect performance, we can just delete arbitrary objects, even if they are still referenced. Because monotonicity is enforced by the knowledge base, we can eliminate any knowledge from the context without compromising correctness. It is like forgetting something, but not worrying about it, because you can always repeat the process. In that sense, the knowledge base is more like a cache, and the garbage collection problem turns into the cache invalidation problem. That is also not a trivial problem, but in our case it only affects performance, not correctness (in the worst-case scenario, if we have evicted all knowledge, we just need to repeat the whole cycle of computations from the beginning). But this is all yet to be implemented; right now we are not cleaning our knowledge base and instead rely heavily on temporary scoped objects, so we can analyze rather big programs and still fit into memory.

5 Likes

Thank you for your reply. Indeed, the KB is quite abstract in scope, so I understand the difficulty of explaining it via concrete examples.

The wording of your response (and words like monotonicity) reminds me of Prolog or Datalog. Is the KB intended to be a kind of Datalog? My understanding/guess so far is that while doing binary analysis you add facts about various aspects of the program to the KB. The facts you add can only be refinements of the facts you added previously, so your results can only get more specific as more facts are added (hence the monotonicity).

Also, do you have some sort of internal indexing? Otherwise, running queries against the KB might get slow.

OTOH you talk about objects and then I start wondering if there are aspects of graph databases here…

I guess what would be helpful in the absence of code (or even better than code, because you can be more tentative in your language) would be something like: the KB can be/is intended to be used for the following kinds of applications: (a) … (b) …

1 Like

Yes, it is heavily inspired by our five-year-long work on implementing a Datalog engine for binary analysis. Thus the KB is heavily influenced by logic programming.

Totally correct.

The Knowledge Base is not a deductive database, which is what usually backs Datalog implementations. The concept of a knowledge base is derived from expert systems, like CLIPS, for example. We do not store facts as a sequence of tuples; instead, we store object properties, and objects form ontologies (in other words, we do not have a flat table of all objects; each class of objects is stored in its own table or, following the runtime metaphor, each class is stored in its own heap). An object is a key in our database and is a scalar value (an int63 in OCaml parlance). The value of an object is represented as a binary tree (a WAVL tree), which is optimized for storing a small number of fields compactly.

The rules themselves are OCaml functions and are not stored in the knowledge base; they are registered when the system is loaded. A rule is triggered only once, and only if an object property doesn’t yet have a value. If more than one rule is registered for a property, their results are merged using the domain merge operator induced by the associated partial ordering relation (or provided by the user). If a rule is recursive, or references another rule that in turn involves the current rule, then we compute a fixed point, building the dependency graph on the fly, inspired by François Pottier’s lazy fixed point paper. Once the least fixed point is computed there is no need to recompute it, since the stored rules can’t produce any more information.
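
In terms of the toy student example above, registering such a rule could look roughly like this (a sketch that assumes the Knowledge.promise interface; the rule body is a trivial placeholder):

(* a rule is an ordinary OCaml function registered for a slot; it is
   triggered at most once per object, when the property is requested
   and has no value yet *)
let () = Knowledge.promise name @@ fun _student ->
  Knowledge.return "anonymous"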

Another important feature is that we employ temporary objects a lot when we are doing a query. For example, a typical query looks like: “Let x be an object of class program that has the binary representation s in the armv7 encoding with LittleEndian byte order; what is the set of destinations reachable from this program?” The query then triggers disassemblers, lifters, and so on, to compute the set of destinations, and then the object x is deleted. Thus there is no trace of x in the knowledge base and therefore no space is lost. Of course, every time we need to compute the set of destinations for a structurally identical object, we will repeat all the computations. But in our case it is faster to trigger the whole chain again than to store the semantics of each byte of the binary in the knowledge base. However, once we finish the binary reconstruction (which is also a fixed point computation), we store the semantics of those instruction objects that we have proved to be reachable and executable. To summarize, using scoped (automatic) objects and some amount of domain knowledge, we can find a balance between scalability and performance. Of course, ideally this should be done by the knowledge base itself, but the knowledge base can’t do it for free.
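
In the student example, a scoped object would look something like the sketch below (assuming the Knowledge.Object.scoped function): the object exists only for the duration of the enclosed computation and leaves no trace in the knowledge base afterwards.

let with_temporary_student given =
  Knowledge.Object.scoped student @@ fun s ->
  Knowledge.provide name s given >>= fun () ->
  Knowledge.collect name s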

Yet another feature of our representation is that not all properties are persistent. In fact, persistence is optional, and we even store first-class modules in our knowledge base. It is again at the user’s discretion which properties should be stored when the knowledge base is serialized and persist across different runs of the program, and which should be more ephemeral and exist only during the current run. This essentially enables a sort of tree-shaking optimization, where the knowledge base is stored and all unnecessary or easy-to-recreate facts are removed. And, as always, truth maintenance guarantees that we won’t lose correctness (though we may lose time in the future).
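
For illustration, declaring a property’s persistence could look like the sketch below; the ~persistent argument and Knowledge.Persistent.string are assumptions about the interface rather than verified API names. A property declared without such a descriptor stays ephemeral and lives only during the current run.

(* the persistence descriptor below is an assumption for illustration *)
let nickname =
  Knowledge.Class.property student "nickname" Knowledge.Domain.string
    ~persistent:Knowledge.Persistent.string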

3 Likes

Hi @ivg
I read the aforementioned documentation about your use of the tagless-final style, but I wasn’t able to grasp everything, as some parts depend on BAP features (e.g. the knowledge base) or are there to fit BAP’s objectives (binary code analysis). I am currently contemplating rewriting some code to rely on the same style, so I wondered whether you could sum up the salient points of your approach, perhaps illustrating it on some basic problem like a small language of arithmetic expressions…? Some topics I can think of are, for instance:

  • the tagless-final style seems more daunting than a classical “initial” style when performing context-dependent transformations. I gather that you use some 'a knowledge data to address this, but can you explain what it means in practice?
  • how do you handle variable bindings and, possibly, (maximal) sharing of subterms? Do you use de Bruijn or HOAS? At which time do you handle sharing: when lifting/parsing? Or later?
  • you say that your signatures don’t come with type declarations but it’s not really clear to me which alternative you use then.
  • I also see you talk about adding an 'a parameter to types, as in val add : 'a bitv -> 'a bitv -> 'a bitv. What is the intent? Is it to allow for any future, possibly unforeseen yet, “payload” in bitv?
  • I was also wondering what amount of well-formedness (for terms) you are able to achieve with types. More specifically, how would you evaluate the effort and leverage w.r.t, e.g., using GADTs?
  • Also, I’m not sure I understand the >>-> operator. Can you elaborate?
  • I gather the “theory” stuff is about having languages and transformations as components with required and provided services? Can you, here also, elaborate a bit more?
  • Finally, I’m not sure I understand the stuff about universal values to handle the fact that, in OCaml, we must pass an instance to a functor…
1 Like

Some years ago I wrote an article (in French) where I explain the tagless-final style on a small language of arithmetic expressions, then extend it with booleans and functions. I also did some comparisons with the initial style using ADTs or GADTs.

2 Likes

Thanks for the pointer - a good read, more succinct than Oleg’s papers. I have actually read Oleg’s tutorials, so my questions were rather about topics that seem a bit different in BAP, or topics addressed by Oleg in the context of Haskell “purity” (like handling maximal sharing, for which we may have other, easier ways in OCaml), or that correspond to a real-life example rather than tutorial material.

Let me first try to address your questions inline, and then I will give a walkthrough that showcases how we can start with the approach proposed by Oleg and gradually improve it until we reach our final representation, which is theoretically equivalent but suits our needs better.

The Knowledge library is not specific to BAP, nor to binary or program analysis. It is a generic library for developing knowledge reasoning systems. In short, it provides two features: efficient universal values and a dictionary of universal values hidden behind a monadic interface. Under the hood it is an error monad spliced with a state monad. We narrowed the interface a little bit to enable future extensions; we’re thinking about adding the continuation monad to enable retraction, but so far we’re fine without it.

Yes, it is daunting, and although it is possible to write many analyses in the final style (and for some analyses it is even easier), we decided that forfeiting the classical direct style is not an option for us. One of our design constraints is that existing analyses shall work without any changes. In fact, we had a few constraints that shaped our design; however, they do not come from the field of program analysis and are not domain-specific, so they should be applicable to other projects as well. Our constraints came from the software design universe: we need even more extensibility than what tagless final provides.

But back to the good old initial style. If you carefully examine the modular and composable optimizations work, you will notice that every optimization builds a small DSL of the form

type 'a term = 
 | Domain : int term
 | Specific : string term
 | Constructors : unit term
 | Rest : 'a Other.term -> 'a term

and all analyses are still written in the convenient classical style with pattern matching. Analyses that require state or context are implemented using the corresponding monad, e.g., type 'a repr = ctxt -> 'a term or type 'a repr = ctxt -> 'a term * ctxt.

In BAP we managed to reify this design pattern into an OCaml representation. With monads it is simple: we just provide a common monad that satisfies everyone’s needs - the knowledge monad. For the common representation we use universal values, but more on them later.

To summarize, in BAP you can write a fold-style analysis in the final form. If you wish, you may pick a representation and write an analysis in the initial style. If your representation is initial, then your analysis will be as generic as the final one. Otherwise, you may lose some information.

We use HOAS, but still allow users to introduce variables with arbitrary names. We handle it at lifting/parsing, i.e., when we reflect the concrete representation into abstract terms. And of course, it is possible to introduce indexed variables later; there is no limitation on that. We also distinguish between read-only pure variables, which are introduced using val scoped : 'a Value.sort -> ('a var -> 'b pure) -> 'b pure, and mutable fresh variables. There are also global variables, which are not HOASed and are denoted by their names.
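
A generic HOAS sketch (deliberately not BAP’s interface) may make this concrete: a binder receives an OCaml function, so scoping and substitution are delegated to the host language.

module type LET = sig
  type 'a repr
  val int  : int -> int repr
  val add  : int repr -> int repr -> int repr
  (* the bound variable is the argument of the OCaml function,
     i.e., higher-order abstract syntax *)
  val let_ : int repr -> (int repr -> 'a repr) -> 'a repr
end

(* let x = 1 + 2 in x + x *)
module Ex(L : LET) = struct
  open L
  let res = let_ (add (int 1) (int 2)) (fun x -> add x x)
end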

We represent abstract types with existentials, aka universals. Theoretically, it is the same. In practice, it lets us express our intent to the type checker in a much clearer way. In OCaml, abstract data types are too unrestricted, which limits their applicability.

Here 'a is the type index that denotes the modulus of bitvector arithmetic, so that we can express that add is defined only for bitvectors of the same width, and that add : u8 bitv -> u8 bitv -> u8 bitv is not the same as add : s32 bitv -> s32 bitv -> s32 bitv. Besides the modulus, we also encode the signedness of the bitvector. This is all part of the knowledge framework, which lets us build hierarchies of classes, each partitioned into sorts, and make our constructors well-typed yet substantially polymorphic (so we do not need add8, add32, and so on up to an infinite number of constructors).
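
A stripped-down sketch of such sort indexing (u8 and s32 are phantom indices used purely for illustration, not BAP’s sort definitions):

module type BITV = sig
  type 'a bitv   (* 'a is a phantom index denoting the sort *)
  type u8
  type s32
  val add : 'a bitv -> 'a bitv -> 'a bitv
end

(* add accepts two u8 bitv values or two s32 bitv values, but the type
   checker rejects any attempt to mix the two sorts *)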

In general, we can express arbitrary properties of terms, which would in general require dependent types. The only caveat is that these properties can be introduced arbitrarily by the module that defines the property. This implies that for each property there is a trusted kernel, which has an abstract interface:

module type Prop = sig 
     type prop 
     val prove : 'a term -> prop sort term option
end

where prove is the only function that can construct values of type prop sort; it performs a check (which could be unsound) of whether the property holds for the given term. This is the only weak point.

Concerning GADTs: first of all, it is possible to implement the prove function using a GADT representation for a term (since the final form allows for any representation, it is also possible to represent a term as a GADT). Then, if we can prove the property, the type checker will be able to verify our proof. Of course, only a limited set of properties can be proved this way.

Another issue with GADTs is that we can’t really serialize the proved property, as it usually doesn’t have any physical representation. Thus, when we unparse/load a stored program, we have to repeat all proofs from scratch.

In BAP we use nominal witnesses to store term properties, so we can store and reconstruct them back.

It is just a shorthand for “evaluate this term, extract its sort, and call a function on the sort and the term’s evaluation”. It is common for the semantics of an operation to depend on the sort of the term, e.g., the signedness of the representation, the address space of the memory, the size of the storage, etc.
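
Its shape is roughly the following (a plausible reconstruction from the description above, with a hypothetical sort_of accessor; not BAP’s exact definition):

(* run the computation, then pass both the sort and the evaluated term
   to the continuation *)
let (>>->) x f = x >>= fun v -> f (sort_of v) v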

No, it is about manipulating theories (aka algebras, aka structures, aka analyses) as first-class values.

Now we’re ready for the long part of the answer :slight_smile:

The long road towards the final style

Let’s start with the classical representation of the tagless-final style using functors. Given the signature

    module type SYM = sig
      type 'a repr
      val int: int -> int repr
      val add: int repr -> int repr -> int repr
    end

we can build terms in that signature using a functor,

    module Ex2(I:SYM) = struct
      open I
      let res = add (add (int 1) (int 2))
                    (add (int 3) (int 4))
    end

But functors are syntactically heavy. They are also not first-class values, and one of our constraints is that the set of analyses is not known statically. We employ a plugin system, and we want a plugin to be able to register an analysis pass in our pipeline without any need for recompilation or relinking.

This is another constraint that we have - we need first-class algebras. OK, we have first-class modules, and indeed the syntactically heavy functor Ex2 could be represented as a function parametrized by a packed module, e.g.,

let ex2 (module I : SYM) = 
  let open I in 
  add (add (int 1) (int 2))
      (add (int 3) (int 4))

This won’t type-check, as the type I.repr escapes its scope. There are no closures for types in OCaml. No problem - we know that we can extend the scope by quantifying over it; usually we do something like,

let ex2 (type t) (module I : SYM with type repr = t) : t = 
   let open I in 
   add (add (int 1) (int 2))
       (add (int 3) (int 4))

Wait, repr is not a ground type in SYM but a parameterized type; in other words, it is a type scheme that denotes a whole family of types. We cannot quantify 'a repr = t for all 'a, and this is a fundamental problem, which highlights the difference between functors and functions parametrized by modules.

We could probably turn 'a repr into a monotype, but we want our algebras to be many-sorted and, in general, to rely on a typed representation rather than coercing everything to a single type. So this is not an option for us.

So is this problem really that fundamental, or do we just not know the right syntax to persuade OCaml to accept our definition? Indeed, given the current signature, it is not immediately obvious what can go wrong, as we can construct only int repr terms¹. Let’s make our example a little bit more illustrative and add string repr to our signature

module type SYM2 = sig
  type 'a repr
  val int: int -> int repr
  val add: int repr -> int repr -> int repr
  val out : string repr -> unit repr
  val str : int repr -> string repr
end

Now, we have an obvious problem,

let ex3 (type t) (module I : SYM2 with type 'a repr = t) : t =
   let open I in 
   out (str (add (int 1) (int 2)))

We have our type constructor t unified with unit repr, string repr, and int repr at the same time. So, is it a dead end? Do we have to choose another host language?

Not really; probably we chose the wrong representation, as indeed we’re asking for too much polymorphism in the ex2 function. So, let’s revisit the functor representation

    module Ex2(I:SYM) = struct
      open I
      let res = add (add (int 1) (int 2))
                    (add (int 3) (int 4))
    end

What it fundamentally says is that res is parametrized by a Sigma-algebra with the many-sorted carrier 'a repr and the operations add and int. To represent this we need bounded polymorphism and a way to specify those bounds to OCaml; otherwise, we will fall into the same pit: OCaml will either overgeneralize our types (i.e., quantify them over types that we do not need) or undergeneralize them (i.e., lose polymorphism). Bounded polymorphism brings existential quantification to mind, and we immediately recall that there are a few other representations of existentials in OCaml. So, maybe instead of keeping 'a repr abstract and bounding it with a signature, we should turn it inward and represent repr as a universal (well, indeed, we all know that abstract types have existential type).

type 'a sort = string (* for demo purposes *)
type 'a repr = Term : {
    sort : 'a sort;
    repr : 'r;
  } -> 'a repr

With this representation, we explicitly stated that only the representation 'r is existential, and now we can define a signature without 'a repr,

module type SYMR = sig
  val int: int -> int repr
  val add: int repr -> int repr -> int repr
end

and, finally, we can type our example

let ex2 (module I : SYMR) =
   let open I in
   add (add (int 1) (int 2))
     (add (int 3) (int 4))

and even our extended example,

module type SYMR2 = sig
  include SYMR
  val str : int repr -> string repr
  val out : string repr -> unit repr
end


let ex3 (module I : SYMR2) =
   let open I in
   out (str (add (int 1) (int 2)))

The only problem is that we made our representation too abstract, so it is now impossible to write even a metacircular interpreter, e.g.,

module Ru : SYMR = struct
  let int x = Term {sort = "int"; repr = x}
  let add (Term {repr=x}) (Term {repr=y}) = Term {
      sort = "int";
      repr = x + y; 
      (*     ^^^^^ *)
     (* `x` and `y` are existential and of different types *)
    }
end

So we need to add a type witness parameter to our representation which will allow us to prove that representations are of the same type,

type 'a witness = ..
type ('a,'b) eq = Equal : ('a,'a) eq

type 'a repr = Term : {
    sort : 'a sort;
    tkey : 'r witness;
    repr : 'r;
  } -> 'a repr

module Ru : SYMR = struct
  type _ witness += Int : int witness
  let tkey = Int
  let int x = Term {sort = "int"; repr = x; tkey}
  let add (Term {repr=x; tkey=Int}) (Term {repr=y; tkey=Int}) =
    int (x+y)
end

And now we can finally write our metacircular interpreter. With one caveat: our add is no longer total. Indeed, since we moved 'a repr to the global scope, the Ru.add function can be applied to the representation of any other interpreter. We will resolve this issue later - hang on, there are other serious issues that we have to resolve before that.

Let’s go back to the original modular optimization framework in the final style. It is defined as

module type Trans = sig
  type 'a from
  type 'a term
  val fwd : 'a from -> 'a term  (* reflection *)
  val bwd : 'a term -> 'a from  (* reification *)
end

module SYMT(X:Trans)(F:SYMOBS with type 'a repr = 'a X.from) = struct
  open X
  type 'a repr = 'a term
  type 'a obs  = 'a F.obs

  let int x   = fwd (F.int x)
  let add x y = fwd (F.add (bwd x) (bwd y))

  let observe x = F.observe (bwd x)
end

Again, we see the good old bounded polymorphism - nothing that we can’t handle. But there is also a problem that contradicts our design requirements - actually, two of them. In this framework, we have to explicitly and statically specify the transformation pipeline, i.e., we need to know in which order the transformations should be applied; moreover, we need to know at compile time which analyses we want to use.

This breaks our modularity, as we want to be able to specify an analysis in a plugin and then load it dynamically. Most importantly, analyses in program analysis (sorry for the tautology) do not really form a pipeline. They do not even fit into the DAG abstraction. In general, and in binary analysis specifically, they form a graph with cycles. And this graph is built dynamically, so any analysis may depend on any other analysis, and our framework shall detect this graph automatically and run the analyses in the correct (and optimal) order to obtain the fixed point solution as fast as possible.

So we naturally (maybe not that naturally, but I’ve been writing this post the whole day, so I have to fast-forward) come to our solution. Our universal values are bounded to be domains (optionally upgradeable to lattices). Moreover, we run all our analyses (or all analyses applicable to the context) in parallel, so our universal value is, underneath the hood, a key-value dictionary, or a universal record, where each entry denotes a particular representation (or semantics, or denotation) of a term.
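
A toy model of such a universal record (purely illustrative, not BAP’s actual representation): each slot is a domain with a known empty element, and joining two values merges the slots pointwise, reporting a conflict when two non-empty denotations disagree.

type denotation = {
  bil  : string option;  (* e.g. a textual IR denotation, if computed *)
  size : int option;     (* e.g. the instruction length, if known     *)
}

let empty = {bil = None; size = None}

(* the flat-domain join of a single slot *)
let join_slot a b = match a, b with
  | None, x | x, None -> Ok x
  | Some x, Some y when x = y -> Ok a
  | _ -> Error `Conflict

let join a b = match join_slot a.bil b.bil, join_slot a.size b.size with
  | Ok bil, Ok size -> Ok {bil; size}
  | _ -> Error `Conflict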

The theories themselves are also domains, so we can store theories as values and manipulate them, join them, and even compare their partial orders. In addition, not only are theories first-class values, but the contexts which define theories are also reified, so we can manipulate them as well and run analyses in different contexts (here, by context, I mean the context of theory application, in the Haskell sense of the word, i.e., the context that defines which type class instance, or, in our case, which structure, should be applied to the given terms).

I’m sorry that it is a little bit messy; I just didn’t have enough time to write it in a more concise and straightforward way :slight_smile: Feel free to ask questions so that we can refine this explanation.


1) Actually, this is not true: there is no evidence that 'a I.repr can only be created with the constructors from I. It is only possible to provide such evidence using a (G)ADT, e.g., with the signature

module type S = sig
  type 'a repr = 
    | Int : int -> int repr 
    | Add : int repr -> int repr -> int repr
end

we can at least persuade the type checker that values of type _ I.repr can only be formed with the I.Int and I.Add constructors, but it won’t help us much and will hamper extensibility (and this is our main design constraint: to be able to add new language forms).

5 Likes

Your solution to work around the lack of higher-kinded polymorphism is interesting, but I have two questions.

How do you solve this issue?

And why don’t you use a solution such as lightweight higher-kinded polymorphism?

1 Like

Thank you very much for the detailed answer. Lots of food for thought. I’m pretty sure I haven’t taken the full measure of what you wrote, but I already have a bunch of questions indeed, since you ask :slight_smile:

A few questions/remarks then:

  • (you declared the classic equal type but did not use it; was it just the result of a copy/paste, or did you intend to add something that you forgot?)
  • As far as I can see, using a record with existentially-typed fields is really essential if you want to have a polymorphic repr, for instance in order to allow the addition of arbitrary new constructs. OTOH, my current work addresses a language with a small, closed set of sorts, so in my experiments I have just used as many monomorphic repr-like types as there are sorts. Granted, it’s a bit cumbersome, syntactically, to be forced to add as many with type ... constraints as there are sorts when needed, but it’s doable and it seems sufficient in a closed world.
  • Also, I fail to see whether and how you can avoid having a catch-all clause when pattern matching on the _ witness open sum type?
1 Like

We’re using universal values from Knowledge as our representation. Roughly, it is an ordered tuple of representations, where each representation is a domain, so it has a predefined empty value. This makes the Value.get function total, as we always have a concrete representation for the absence of knowledge for each representation.

First of all, because it won’t help us type first-class modules :slight_smile: We still can’t unpack a module with a parametric type. If you look into the Modular Optimizations course, you will notice that higher was used there extensively for the Trans module. But the Trans module suffers from the same problem: we can’t unpack it. And even if we remove the first-class requirement, typing ex3 would still be painful and wouldn’t scale. Unlike the examples in the paper, this function is not really universally polymorphic: we have three concrete types, unit repr, string repr, and int repr, so we need to establish three arrows and then bind them to the corresponding type variables. It would be really messy and in general won’t scale, especially for us, as we have infinite families of sorts, like 's bitv where 's is a type index denoting the vector size.

Besides, this library really needs an update. There is no need to use Obj.magic nowadays to implement it. So, let’s reimplement it without magic in pure OCaml,

type ('p, 'f) app = ..

module Newtype1 (T : sig type 'a t end) () = struct
  type 'a s = 'a T.t
  type t
  type (_,_) app += App : 'a s -> ('a, t) app
  let inj v = App v
  let prj (App v) = v
end

module Newtype2 (T : sig type ('a,'b) t end) () = struct
  type ('a,'b) s = ('a,'b) T.t
  type t
  type (_,_) app += App : ('a,'b) s -> ('a, ('b,t) app) app
  let inj v = App v
  let prj (App v) = v
end

And now we can easily see the parallels between higher and the type witnesses that we used in the existentials. In essence, the universal value is the final representation for us, since for each algebra we can build a homomorphism from its carrier set to the universal value. Therefore, instead of creating N*N arrows between types, we can create one arrow per type, with well-defined reflection and reification operations.

3 Likes

I just simplified the code, so it became unused. But, at the end of the day, eq is very useful, because it is a reification of type equality, which you can pass between functions, or just instantiate in the scope and forget about it.

Well, the level of extensibility that we have achieved is that you can add a new language form and/or a new analysis, and all existing analyses will still work without recompilation or relinking, and will still be able to cooperate with new analyses that leverage the new language forms.

If you have a case where you can (and need to) specify the language transformation pipeline statically, then you can indeed use abstract types, and you don’t even need to use sharing constraints or stick to monomorphic types; your repr type can be a type scheme.

I did not :slight_smile: The pattern match in add is non-exhaustive; that’s why I said that add is no longer total.