Generating a map at compile time?

Yes, it works and fits the intended use: the ability to save on disk “your own computation or cache”.

2 Likes

Sounds too easy. I guess that if I modify my program after marshalling to disk it may fail when loading it again.

Since it will already be another executable.

That was one of the failure modes of Unison when it used the Marshal module with the Closures flag to exchange data: it was necessary to always use exactly the same version of Unison, compiled with the same compiler and the same libraries.

Good to know. I “love” debugging failures caused by upgrading libraries !

This is essentially what led to my failure above: I was generating the marshal string with one version of the program, and editing the program to include that very string produced ‘another executable’, which then failed to read back the output from the original.

I realize now that this behavior is documented in the docstring for Marshal.to_channel, but not in the docstring for the Marshal.extern_flags type or the other to_* functions (which simply say to go read the one above them), so it’s easy to miss if you’re going quickly.

Note that the docs of to_channel include:

In this case, the output of marshaling can only be read back in processes that run exactly the same program, with exactly the same compiled code. (This is checked at un-marshaling time, using an MD5 digest of the code transmitted along with the code position.)

So at least these situations (almost always) lead to predictable error messages rather than crashes.

On the other hand, without closures enabled, if you change the type definition of the values that are marshaled and then unmarshal data produced by an old version, no builtin check catches the issue and you are welcomed to segfault city.
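To make the danger concrete, here is a minimal sketch (type and values are illustrative): `Marshal.from_string` performs no type check, so the annotation on the read side is an unchecked cast. If the type definition changed shape between writer and reader, the read would still appear to succeed and the crash would come later, at use time.

```ocaml
(* Marshal round-trip: the reader's type annotation is trusted blindly.
   [record] is a hypothetical example type. *)
type record = { name : string; count : int }

let () =
  let s = Marshal.to_string { name = "x"; count = 3 } [] in
  (* Unchecked cast: nothing verifies that [s] really encodes a [record]. *)
  let (r : record) = Marshal.from_string s 0 in
  Printf.printf "%s %d\n" r.name r.count
```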

2 Likes

I have been tempted to write a ppx that would generate a digest from the relevant type definition to be included in the marshaled value and checked when unmarshaling, but have not gotten around to it. It is good practice to do something similar manually, but it is easy to forget to bump the version numbers…
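A minimal sketch of the manual practice described above: pair the payload with a hand-maintained version string (the proposed ppx would instead derive a digest from the type definition) and check it before trusting the data. All names here are illustrative.

```ocaml
(* Hand-bumped format version; must change whenever the marshaled type does. *)
let version = "cache-format-1"

(* Wrap the payload in a (version, payload) pair before marshaling. *)
let save v = Marshal.to_string (version, v) []

(* Check the version tag before handing the payload back. *)
let load s =
  let ((tag : string), v) = Marshal.from_string s 0 in
  if tag <> version then failwith "stale cache: version mismatch";
  v
```

Usage would look like `let (l : int list) = load (save [1; 2; 3])`; reading data written under a different version string fails with a predictable exception instead of a possible segfault.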

I wasted some time this morning trying to write a PPX that I could use in the generator program to help print the text of the generated module, so that the types would necessarily agree between the two. My current feeling is that this is more work than it’s worth, considering how rarely I expect things to change and how immediately such a segfault would show up in testing (basically instantly). It would also not be that nice to use, since you need to write out the fully qualified name yourself for the generated module to work (or make sure the generated module opens the same things).

It definitely feels like ppx should be able to help with some of these pain points, but it might just be out of reach for my skill level there

If you can use the regular Stdlib Hashtbl, then its definition is quite simple:

type ('a, 'b) t =
  { mutable size: int;                        (* number of entries *)
    mutable data: ('a, 'b) bucketlist array;  (* the buckets *)
    mutable seed: int;                        (* for randomization *)
    mutable initial_size: int;                (* initial array size *)
  }

and ('a, 'b) bucketlist =
    Empty
  | Cons of { mutable key: 'a;
              mutable data: 'b;
              mutable next: ('a, 'b) bucketlist }

You could generate (à la Printf) a module with this exact definition followed by a literal definition of your final hashtable:

let the_big_one : (what, ever) t =
  { size = 123123123 ;
    data = [| Cons { key = "glop" ; data = "pas glop" ; next = Cons ... } ;
              Cons { ... } |] ;
    ... etc ... }

Then throw this module to ocamlopt and see how it manages.
Then at runtime use Obj.magic to make it a real Hashtbl.t …

[at the bottom, maybe I’m disbelieving that it’s OK to Marshal Hashtbl.t.]

I’m not a big user of Jane Street’s libraries, so maybe I’m missing something, but … this seems like the crux of the biscuit. One of the nice things about (at least the core of) ML-family languages is that -data- is separate from code. And so it’s natural that you can Marshal most things and not worry about closures being in the way.

Heck, it’s sort of implied by the fact that OCaml’s built-in equality doesn’t like it when you try to compare functional values:

# let f = (fun x -> x);;
val f : 'a -> 'a = <fun>
# (f = f) ;;
Exception: Invalid_argument "compare: functional value".

[this last part is my own opinion, and hey, everybody’s got one, and I’m sure not gonna push it
on anybody else.]
I feel like if you’re going to make library abstractions that are meant to replace the standard ones, you should ensure that there’s a seamless way to treat them as “just data”. But hey, I guess whatevs.

ETA: But now that I think about this, I remember that there’s that rule from other languages (most famously Golang) that one should not count on the hash-function being consistent across runs of a program. Maybe that’s going too far, but certainly counting on a hash-function being consistent across versions of a program, even when the types are unchanged, seems … less-than-ideal (though my memory is that OCaml explicitly guarantees that).

And now that I think about it some more, I have some vague memory that polymorphic variants aren’t guaranteed to be identical values between different builds of a program? Is that right? I forget.

So maybe I should retract my suggestion/belief that Marshal-ing a Hashtbl.t is OK.

This is true in Rust’s stdlib, too. Though part of the Hashtbl.t posted above is the seed, which presumably makes things stable.

But this does seem like a reason to keep doing the round-tripping through an associative list, at least if I’m using Marshal. Converting that back to core’s hashmap seems to be plenty fast anyway.
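The assoc-list round trip can be sketched like this (function names are illustrative): marshal a plain list, which has a stable, closure-free representation, and rebuild the hash table on load.

```ocaml
(* Flatten the table to an association list before marshaling. *)
let save_table tbl =
  Marshal.to_string
    (Hashtbl.fold (fun k v acc -> (k, v) :: acc) tbl []) []

(* Rebuild a fresh Hashtbl from the unmarshaled pairs. *)
let load_table s =
  let pairs = Marshal.from_string s 0 in
  let tbl = Hashtbl.create (List.length pairs) in
  List.iter (fun (k, v) -> Hashtbl.replace tbl k v) pairs;
  tbl
```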

This feels a little like I’m wasting (your) brain cycles (as they supposedly call it at Akamai) black-holing on this, but … maybe a compromise might be to store a Map – a sorted AVL tree (or whatever ordered data-structure it is) ?

And make your in-memory data-structure be a hashtable that stands in front of the AVL tree, and caches lookups? Two refinements:

(1) if you rarely get failures, then the hashtbl can just cache successes. So the hashtbl value type is the same as the value type of the Map. A failure to find in the hashtbl causes a lookup in the Map.

(2) if you get failures often enough, then you can make the value type in the hashtbl be the Map.value_type option, so you can cache lookup-failures in the hashtbl too.
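Refinement (2) can be sketched as follows, with string keys for concreteness (names are illustrative): the Hashtbl stores `'v option`, so lookup failures in the Map are cached as `None` alongside successes.

```ocaml
module M = Map.Make (String)

(* A memoized lookup: the Hashtbl caches both hits and misses from the Map. *)
let make_cached_find (map : 'v M.t) =
  let cache : (string, 'v option) Hashtbl.t = Hashtbl.create 16 in
  fun key ->
    match Hashtbl.find_opt cache key with
    | Some result -> result            (* cached hit or cached miss *)
    | None ->
        let result = M.find_opt key map in
        Hashtbl.add cache key result;  (* cache even a [None] result *)
        result
```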

I did this to speed up AVL tree lookups decades ago, b/c an AVL tree can be “frozen” with a single pointer-save, but is slower than a hashtbl.

But this is probably going too far down that rabbit-hole.

If I understand correctly, you are trying to cache the map as a performance optimization… so, does it really matter too much if there is a cache miss, i.e. unmarshalling fails due to a different compiled version? You can just build a new cache in that case. This should happen rarely enough that it should still be a performance win when amortized over multiple application starts, no?

Hashtbl.t explicitly supports being marshalled to persistent storage. And historically, the OCaml developers have gone to some lengths to support this property across changes in the runtime. See Remove pre-4.00 generic hash function by xavierleroy · Pull Request #9763 · ocaml/ocaml · GitHub and the comments therein.

Cheers,
Nicolas

I think that’s wrong; polymorphic variants are allowed to be marshalled, so their representation must be consistent across compiler versions.

Constructors of extensible types like exn get refreshed when unmarshalled. So, for instance, in

try raise Marshal.(from_string (to_string Not_found []) 0) with Not_found -> ()

an exception called Not_found, but different from Stdlib.Not_found, will be raised and will not be caught.

1 Like

Not quite: the map is currently generated in full when the module loads and then effectively frozen. My hope is that moving the generation to build time lets me stop shipping the generating code to users entirely, so a “miss” due to some representational problem would manifest as a bug.

Not quite your suggestion, but I did try having the code generate the literal definition of the associative list rather than a marshal string (using ppx_deriving.show). The generated .ml file is 154,629 lines long, and ocamlopt manages it, but not particularly well (dune basically hangs with ~10 jobs remaining, which I am assuming is this file and its dependencies). I assume the result would be similar if I went all the way to generating an instance of the stdlib’s Hashtbl.

It does seem slightly faster at runtime, but it also adds nearly 1.5 MB to the final executable (down to ~0.5 MB extra after stripping), which I find remarkable. Is it really possible that the marshal format is that much more compact than object code for the same list?

Unrelated, but: you can pass --trace-file=output_file and then view output_file at https://ui.perfetto.dev/ to see exactly how dune is building stuff (order of invocations across parallel units, timing, and other stats, etc.). I’ve had to do some build engineering in the past and this has been extremely helpful insight into what’s going on. Your assumption is certainly correct, but this is an underrated feature of dune that I felt was worth sharing.

2 Likes

In general, upstream thinks the compilers are made to ingest reasonable, human-crafted source files. I think their stance is not unreasonable, but the problem is that it’s not always exactly clear what reasonable is.

The Unicode libraries I’m maintaining (e.g. uucp and uunf), which rely a lot on source-code generation of Unicode data, have been notoriously breaking (out of memory) in all sorts of constrained environments (e.g. on RPis).

It’s a bit difficult to help you because you have been quite mysterious about the nature of your data (which I was waiting for you to disclose before chiming in :–). I still don’t know, but AFAIR here are a few things you could try:

  1. Unless things have changed, I don’t think the compiler performs string interning (or array interning, if you use arrays immutably), so let-defining these common strings and arrays may help (see for example this uucp patch by @pqwy)
  2. IIRC this discussion has a few tips about what is good to do or not when generating large sequences of values
  3. In the future keep an eye on this.
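Tip 1 can be sketched like this (values are illustrative): in generated code, bind each frequently repeated string once and reference the binding everywhere, so all occurrences share a single allocation instead of one literal per use site.

```ocaml
(* Shared bindings for strings the generated data repeats many times. *)
let s_glop = "glop"
let s_pas = "pas glop"

(* Every occurrence references the binding, not a fresh literal. *)
let table = [| (s_glop, s_pas); (s_pas, s_glop); (s_glop, s_glop) |]

(* All uses of [s_glop] are physically equal (one allocation): *)
let () = assert (fst table.(0) == fst table.(2))
```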

HTH

1 Like