Marshal determinism and stability

Hello all,
We are considering the possibility of using marshal to serialize/deserialize data; however we have two questions:

  1. Is Marshal deterministic? If its output has dependencies on OS or machine architecture etc., can we enumerate those dependencies?
  2. How much does marshal’s output vary from version to version of OCaml? Does it break every release, once every 10 years, never?

Thanks!
Daniel Hines

Most of these questions are answered in Marshal’s documentation.

In particular:

The format for the byte sequences is compatible across all machines for a given version of OCaml.

So you only get a format guarantee per OCaml version. Basically, except for a program’s cache or for communication between processes that you tightly control, I wouldn’t recommend it.

Also bear in mind that Marshal is not type safe and a bit flip in the data can make your program segfault. So if you still want to go that route, it’s a good idea to prefix your marshalled data with a magic number and/or a CRC of the data and check these before inputting the value and casting it to the type you expect.
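A minimal sketch of that framing, under stated assumptions: the `magic` tag, the `frame`/`unframe` names, and the layout (magic, then digest, then payload) are hypothetical, and an MD5 digest from the stdlib `Digest` module stands in for a CRC. It only catches accidental corruption or a wrong file. It does not make Marshal type safe, so the type annotation at the call site still has to be right.

```ocaml
(* Hypothetical layout: magic (8 bytes) ^ MD5 digest (16 bytes) ^ Marshal payload. *)
let magic = "MYAPP001" (* hypothetical tag; change it whenever the stored type changes *)

let frame (v : 'a) : string =
  let payload = Marshal.to_string v [] in
  magic ^ Digest.string payload ^ payload

let unframe (s : string) : ('a, string) result =
  let mlen = String.length magic in
  let hlen = mlen + 16 in
  if String.length s < hlen + Marshal.header_size then Error "truncated"
  else if String.sub s 0 mlen <> magic then Error "bad magic number"
  else
    let digest = String.sub s mlen 16 in
    let payload = String.sub s hlen (String.length s - hlen) in
    (* Only unmarshal once the checksum matches. *)
    if Digest.string payload <> digest then Error "checksum mismatch"
    else Ok (Marshal.from_string payload 0)

let () =
  match (unframe (frame [1; 2; 3]) : (int list, string) result) with
  | Ok l -> assert (l = [1; 2; 3])
  | Error e -> failwith e
```

A flipped bit in the payload then shows up as `Error "checksum mismatch"` instead of reaching `Marshal.from_string`.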

The docstring of Marshal.to_channel has the discussion about 32 and 64 bits compatibility.

Basically, don’t use Marshal. Use something else, there are plenty of options, from JSON to Protobuf.

This. So much this. Also, this. And finally, this.

For data that will be “at rest” for some period of time, it’s important that you have a PL-independent (at least in principle) description that can be used to write a demarshaller in some other PL. The longer that data might stay at rest, the more important this requirement becomes.

A lot of people have gotten screwed by forgetting this rule. Heck, I remember a well-known OCaml/PL figure telling me about how he took a genealogy database written by a family member decades ago in Turbo Pascal, and had to reverse-engineer it by more-or-less writing a demarshaller and running it, observing where data came out comprehensible and where it was garbage, lather/rinse/repeat, until he got the entire database to be comprehensible.

That’ll work when it’s an ancestry database for your family; for something bigger, or something where you can’t eyeball the demarshaller output to see if it’s right, you’re screwed without a language-neutral IDL description.

Awesome term. For the long term, I ultimately came to insist on plaintext-like storage (xml/s-expr/…).

I disagree.

Until we have something fast (Marshal is at least one order of magnitude faster than all the other proposals) that preserves sharing and handles cycles gracefully, Marshal will have a place.
That’s why it’s the format used for client/server communication in Eliom, and it has worked very well for this.
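To make "preserves sharing and handles cycles" concrete, here is a small sketch using only the stdlib. The `node` type and the `roundtrip` helper are made up for illustration.

```ocaml
(* Hypothetical type, just to build a cyclic structure. *)
type node = { id : int; mutable next : node option }

(* Marshal to a string and read it straight back. *)
let roundtrip (v : 'a) : 'a = Marshal.from_string (Marshal.to_string v []) 0

let () =
  let a = { id = 1; next = None } in
  let b = { id = 2; next = Some a } in
  a.next <- Some b;                        (* a -> b -> a: a cycle *)
  let a1, a2 = roundtrip (a, a) in         (* both components are the same block *)
  assert (a1 == a2);                       (* sharing survives the round trip *)
  (match a1.next with
   | Some b1 ->
     (match b1.next with
      | Some a' -> assert (a' == a1)       (* and so does the cycle *)
      | None -> assert false)
   | None -> assert false);
  assert (a1 != a);                        (* the result is a fresh copy *)
  (* With No_sharing, shared substructures are duplicated instead. *)
  let r = ref 0 in
  let (x, y) : int ref * int ref =
    Marshal.from_string (Marshal.to_string (r, r) [Marshal.No_sharing]) 0 in
  assert (x != y)
```

Note that `Marshal.No_sharing` must not be used on cyclic data, since marshalling would then not terminate.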

One thing I think we agree on is that Marshal is for data exchange, not storage. If you are wondering about stability in time, it’s already not the right solution for you.

Interesting. I didn’t know ocaml stdlib Marshal was OCaml version dependent. I wonder if janestreet bin_prot is also OCaml version dependent. @bcc32 ??

This used to cause a hard-to-work-around problem whereby the same Unison version coming from two distinct distributions would fail to speak to each other for no reason obvious to the user.

And if you have ever used Unison to synchronize files, you know the pain of trying to synchronize with a server that happens to have a different version of the OCaml compiler.

I haven’t used Unison, but knowing about ocaml stdlib Marshal now I imagine it being a maintenance hassle. How did they solve it?

So given that ocaml stdlib Marshal is not very stable from one OCaml version to the next, what are the binary serialization alternatives to it? Another one that I have used a little is janestreet bin_prot, but I am not sure whether it has the same OCaml version issues as Marshal.

A quick look gives me these results, none of which I have used yet.

  1. biniou
  2. Binary module of data-encoding (Nomadic Labs / data-encoding, GitLab)
    … ??

B.

It is not. There is also a notion of a bin-digest, which is basically a hash representing the “shape” of a particular type’s serialization, and which you can test to make sure you don’t accidentally change a protocol when refactoring, for example.

It has been a while since I last used Unison, but I do not think they had solved it at the time.

It might be worth asking the maintainers whether it is feasible to strengthen the stability guarantees, to make Marshal more useful for what it can be good at (exchange). There is a wide range of intermediate guarantees between “this can break at every minor release” and “this will never change”.

I don’t agree there either, actually. Case in point: see messages in this thread talking about Unison.

There is an open issue on GH with, I believe, some development in progress. IIRC it involves replacing Marshal with a safer serialization alternative.

Tons. Protobuf, Msgpack, Thrift, Cap’n’Proto, JSON+zip, those are just the language-neutral serialization formats off the top of my head. If you want to restrict yourself to OCaml-specific formats there’s atd and friends. If you want a storage format you could use SQLite or even Irmin.

Very nice. Thanks for confirming and of course as expected of janestreet libs. :+1:

This should be relatively straightforward to detect, no? If the initial session-setup for Unison sent information about distro/arch/ocaml-version, and refused to work (with a “--force” switch for overriding) unless they were identical at both ends, wouldn’t that be enough? [asking b/c I don’t know the details of the problem …]

The problem isn’t at that level. More details are in the Marshal module doc: programs compiled with different OCaml versions can’t understand each other’s Marshal-serialized data.

Right, so what I’m suggesting is, during session-setup, use messages that are not serialized using Marshal. And at a minimum, one might send the hash of a canonicalized string containing OS version, arch, OCaml version, etc. Both sides check that the hash matches their own (or better, send the string) and if not, reject and stop.

Then and only then, start the actual protocol, and there, sure, use Marshal.
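A rough sketch of such a handshake, to show how little is needed. Everything here is an assumption for illustration (the `fingerprint` fields, the one-line text format, the function names); it is not what Unison does. This variant sends the plain string rather than a hash, so a mismatch can be reported readably.

```ocaml
(* Hypothetical build fingerprint; the fields are chosen for illustration only. *)
let fingerprint ~app_version =
  Printf.sprintf "app=%s;ocaml=%s;os=%s;word=%d"
    app_version Sys.ocaml_version Sys.os_type Sys.word_size

(* The hello message is plain text, so it can be read without Marshal. *)
let send_hello ~app_version oc =
  output_string oc (fingerprint ~app_version);
  output_char oc '\n';
  flush oc

(* Compare the peer's fingerprint with ours before any Marshal traffic. *)
let check_hello ~app_version ic =
  match input_line ic with
  | exception End_of_file -> Error "no hello received"
  | theirs ->
    let ours = fingerprint ~app_version in
    if String.equal theirs ours then Ok ()
    else Error (Printf.sprintf "peer is %S, we are %S" theirs ours)
```

Only when `check_hello` returns `Ok ()` would the two ends go on to `Marshal.to_channel` / `Marshal.from_channel` on the same (binary-mode) channels.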

An analogous thing is done in some RPC systems – session-negotiation involves sending “callstream version” information, to ensure that the RPC runtimes on both ends are compatible.

That still leaves the issue I mentioned. Think about it: you have Unison on a Windows laptop and another Unison on a Linux desktop. Unless they’re the same version and compiled with the same OCaml version, they can’t exchange files. This is still a problem even if it has a nice hash system that does a version check. And it doesn’t matter if the version is ‘better’: it won’t work unless it’s exactly the same.

OK, yes, this “version” of the algorithm requires a perfect version-match. We -can- do better, and that better involves a design of a version-string specification and contract for up/down-compatibility.[1] For example, the OCaml development team makes guarantees about up/down compatibility of AST versions – minor releases will not modify the AST type. Similar assumptions/guarantees could be made/used. Of course, this involves work, whereas what I suggested didn’t. For instance, if each end sends a hash + human-readable version-string, then the ends can compare hashes, and when they don’t match, print out the human-readable version-strings (which don’t need to be canonicalized, hence can be more human-readable) for the invoker to decide if they want to override.

I was making a -cheap- suggestion for how to improve Unison: a suggestion that didn’t involve a lot of design, but would catch many of the obvious error-cases. Specifically, when the two ends reject, it gives the human invoker a chance to have a look and decide if it’s OK to proceed. Which automatically means that the human is paying attention when/if things go awry.

[1] every step we make towards a structured version-string increases complexity and pushes towards just going with some RPC system/IDL-compiler that already has this sort of thing built-in, or is OCaml-version-independent. But that’s a much bigger cost than just exchanging/comparing version-hashes.

If one were going to do this, one would (of course) want to prepend a “protocol version” to the beginning of the stream, so that one could decide later to change to a different scheme. IIRC Thrift has something like this – you can version types, but you can also version the protocol framing. Might be worth looking at that to see how they did it.
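For what it’s worth, a sketch of the "protocol version first" idea. The one-byte version, the `framing` type and the function names are all hypothetical, and this is not modelled on how Thrift actually does it.

```ocaml
(* Hypothetical framing: one version byte, then the payload for that version. *)
type framing = Marshal_v1 | Unsupported of int

let current_version = 1 (* hypothetical; bump when switching to another scheme *)

let write_message oc v =
  output_byte oc current_version;
  Marshal.to_channel oc v [];
  flush oc

(* Read the version byte and decide how the rest of the stream is encoded. *)
let read_framing ic =
  match input_byte ic with
  | 1 -> Marshal_v1
  | n -> Unsupported n

let read_message ic =
  match read_framing ic with
  | Marshal_v1 -> Ok (Marshal.from_channel ic)
  | Unsupported n -> Error (Printf.sprintf "unsupported protocol version %d" n)
```

A later release could add a second constructor for a non-Marshal encoding and keep reading version 1 streams from older peers.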

It would be impossible for the user to manually override and go ahead with the exchange: the Marshal module simply doesn’t understand serialized data across different versions of OCaml, as mentioned earlier. Anyway, this is a moot point; as I said earlier, there is already development work done to solve this issue in a general way going forward. See “unison wire protocol depends on ocaml version”, Issue #375, bcpierce00/unison on GitHub.