Hashconsing an AST via PPX

[up-front (so nobody gets the wrong idea): I’m not pushing Camlp5.
Rather, I’m just noting that this sort of thing is really easy to do,
and I encourage someone to do something similar using the PPX
infrastructure.]

I didn’t want to derail the “Future of PPX” thread, so I thought I’d
post separately to answer ivg@’s issue about hashconsing of ASTs
using PPX. It’s actually [uh, I think] really, really easy to
implement hashconsing of ADTs, using a PPX extension. On a lark, I
decided to do it today, and while the code I’ve got isn’t sufficient
to use, I think it’s not very far away, and I have the perfect
use-case already in-mind. It took me two hours to implement the
rewriter and the testcase, on top of the other infrastructure, which
has no support for hashconsing of any sort.

Here are some examples of data-types and functions that are
automatically hash-consed. The idea is that in the pattern-match the
pattern is annotated with a variable (in this example, “z”); the
expression that is supposed to be hash-consed against that pattern is
annotated with that same variable. [The code that descends to the
expression is a little weak right now, but I think that’s easily
fixable.] The algorithm goes as follows:

(1) “decorate” the pattern with “as z_” variables everywhere
in constructors. This allows us to refer to parts of the original
value.

(2) then find each expression that is marked with that same variable.
Structurally descend the pattern and the expression in parallel and
generate code to compare sub-structure and hashcons where appropriate.

And that’s really it. I’m sure this can be implemented using the PPX
tools.

Some comments: (1) What’s nice is that we can just take already-written
code like List.map and annotate it; that generates a hash-consed
version. And since the generated code never uses deep structural
equality (only pointer-equality) it should be only marginally slower
than the original implementation.

(2) The variable in the annotation (“z”) is used as the base for
generating a whole slew of fresh variables, and I don’t bother (yet)
to check for clashes; this (again) is straightforward, but hey, I
started two hours ago.
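For comparison, here is roughly what a sharing-preserving List.map looks like when written out by hand (the names smap and orig are mine, not the rewriter’s; the annotated HCList.map below is meant to generate something equivalent):

```ocaml
(* A hand-written sharing-preserving map: if [f] returns each element
   physically unchanged and the mapped tail is physically unchanged,
   the original cons cell is returned instead of a fresh one. *)
let rec smap f = function
  | [] -> []
  | (a :: tl) as orig ->
      let a' = f a in
      let tl' = smap f tl in
      if a' == a && tl' == tl then orig else a' :: tl'
```

With this, smap (fun x -> x) l is physically equal to l, whereas ordinary List.map always reallocates the spine — which is where the “only marginally slower” claim comes from: the extra work per cell is two pointer comparisons.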

type t = Leaf of int | Node of t * int * t

module HCList = struct

let rec map f = function
    [][@hashrecons z] -> [][@hashrecons z]
  | (a::l)[@hashrecons z] -> let r = f a in ((r :: map f l)[@hashrecons z])

end

let deep =
let rec deep = (function
  Leaf n[@hashrecons z] -> Leaf n[@hashrecons z]
| Node (l, n, r) [@hashrecons z] -> 
  Node (deep l, n, deep r) [@hashrecons z]
  )
[@@ocaml.warning "-26"]
in deep


type sexp = 
  | Atom of string 
  | List of sexp list

let sexp_deep =
  let rec deep = function
      Atom s[@hashrecons z] -> Atom s[@hashrecons z]
    | List l[@hashrecons z] -> List (HCList.map deep l)[@hashrecons z]
  in deep
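To make the algorithm concrete, here is a hand-expanded approximation (mine, not the rewriter’s actual output) of what the [@hashrecons z] annotations on the tree-type deep function above are meant to elaborate to:

```ocaml
(* The "as z" binding keeps a handle on the original value; after
   rewriting the children we compare them with pointer equality (==)
   and reuse the original node when nothing changed. *)
type t = Leaf of int | Node of t * int * t

let rec deep = function
  | Leaf _ as z -> z                        (* no fields changed: reuse *)
  | Node (l, n, r) as z ->
      let l' = deep l and r' = deep r in
      if l' == l && r' == r then z          (* children unchanged: reuse *)
      else Node (l', n, r')
```

On an unchanged tree this returns the original value itself (deep x == x), so an identity pass allocates nothing.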

Links: first, pinned to the commit, so they won’t change:

the testcase file: https://github.com/chetmurthy/pa_ppx/commit/5dd6b2ef3ca3677e11a0ad696074200101bd661f#diff-e6dffe78fc6c27bdffa41970c4a7f1ca

the “ppx rewriter”: https://github.com/chetmurthy/pa_ppx/commit/5dd6b2ef3ca3677e11a0ad696074200101bd661f#diff-24aeaf51366017948f5735727f001c85

Second, the files with human-readable names, etc.:

the testcase: https://github.com/chetmurthy/pa_ppx/blob/master/tests/test_hashrecons.ml

the “ppx rewriter”: https://github.com/chetmurthy/pa_ppx/blob/master/pa_hashrecons/pa_hashrecons.ml

The projects:

chetmurthy/pa_ppx: A reimplementation of ppx_deriving, all its plugins, ppx_import, and a
few others.

chetmurthy/camlp5: Camlp5, version pre-8.00, on which the above is based. This is on the branch 26.attempt-pa-deriving.


I experimented with this some time ago for ML workshop.
The idea was to provide a function t -> htbl -> htbl * t that rewrites a value of type t by removing equal subtrees. Essentially it is just a fold over the data type.

If you wanna use a hashtable (and, I presume, Obj.magic) you can write a single function that does the trick for all immutable data-types, right? But if the goal is merely to re-share unchanged data after a single pass over an AST, you don’t need to go that far.
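A type-specific version of that idea is easy to sketch without Obj.magic. Here is one possible shape, for the sexp type from earlier in the thread (the function names are mine, and a real implementation would likely want a better hash than the polymorphic default):

```ocaml
(* Bottom-up hash-consing through a table: structurally equal subtrees
   are collapsed to a single, physically shared representative. *)
type sexp = Atom of string | List of sexp list

let hashcons () =
  let tbl : (sexp, sexp) Hashtbl.t = Hashtbl.create 251 in
  let rec go s =
    (* canonicalize the children first, then look up the rebuilt node *)
    let rebuilt = match s with
      | Atom _ -> s
      | List l -> List (List.map go l)
    in
    match Hashtbl.find_opt tbl rebuilt with
    | Some canonical -> canonical            (* seen before: share it *)
    | None -> Hashtbl.add tbl rebuilt rebuilt; rebuilt
  in
  go
```

After such a pass, structural equality of canonicalized values coincides with pointer equality, which is what makes all subsequent comparisons cheap.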

Yeah, probably you are right.
But the idea was to reuse an existing framework where you can override the transformation behaviour per constructor.

Yes, we have some magic @mbouaziz code in Infer that does this to create as much sharing as possible as values are Marshaled out.


Could you provide a minimal end-user example in the README.md of how to use these extensions, e.g. with pa_deriving.enum and a Dune-based “hello world”?

Hi, I just pushed the code here: chetmurthy/pa_ppx.

There’s a README that explains:

  1. how to build and use the code. Right now, it requires a pre-release version of camlp5, and I’m not sufficiently well-versed in the opam repository package-lifecycle to know how I could make this version of camlp5 available as an opam package without disrupting the already-released version.
  2. how to use the package with Make.
  3. how to use the package with dune.

For the latter, I include a patch for the yara-ocaml package’s dune file which suffices.

I’d be very interested in any feedback, howsoever elementary. Things that seem obvious to me aren’t obvious at all to others, so even just build/install problems are worth reporting.
