Filename extension convention for marshal'ed data?

#1

Is there any convention about how to name files containing data created with Marshal.to_bytes or Marshal.to_string?


#2

No one has replied. Someone might still reply, but for the time being I’m going to take the answer to be “No–there is no convention.”

If anyone wants a convention for marshalled OCaml data files, I propose:

.mldata

Yeah, it’s long, but it’s only one character longer than R’s data file extension, .RData.
(No one’s using MS-DOS.)

Alternatives:

.mldat
.mld
.odata
.odat
.dat (* uninformative--could be any kind of data *)

or:

(* read the docs *)

(If there’s been no convention for decades, I don’t expect to successfully start one now. :slight_smile: )


#3

Camomile uses the .mar extension, but I found that it is also used by MS Access…


#4

The .mld extension is nowadays being used by odoc for ocamldoc fragments.


#5

I think marshalled data tends to be an internal format for a single program, not typically exposed to other programs. Having a convention for the file extension might give a false impression of a shared format.

If your program does expose these files for consumption by other programs, then maybe you should use a more distinctive extension.


#6

Those are reasonable points, although I still like the idea of a default convention that’s likely to be understood by outsiders. .txt files and .csv files can contain different kinds of data, but anyone reading your code or seeing the files knows something about them immediately. In some cases the same program might want different extensions for different marshalled data files.

If odoc is using .mld, that’s a good precedent for me. The “ml” conveys that it’s something OCaml-specific. .mar makes me wonder what it means, I think. (Also, given my name, it might sound, uh, me-specific. They’re my data files! :slight_smile: )


#7

@mars0i I’m not sure what you are chasing is worth it.

As @orbifx mentions, marshalled data should in general not be used except for internal program data. It is very brittle and unsafe: a single bit flip in the data can make your program segfault. If you do use it, protect it at least with a header that identifies your program and the OCaml version.
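
A minimal sketch of such a header, using only the standard library. The magic string MYSIM-DATA-1 and the function names save and load are hypothetical, not an established convention. The header here is assumed to be a single text line holding the magic string and Sys.ocaml_version, followed by the marshalled value.

```ocaml
(* Hypothetical magic string identifying the program; pick your own. *)
let magic = "MYSIM-DATA-1"

(* Write [v] to [path], preceded by a header line that holds the magic
   string and the version of the compiler that produced the file. *)
let save path v =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () ->
      output_string oc (magic ^ " " ^ Sys.ocaml_version ^ "\n");
      Marshal.to_channel oc v [])

(* Read a value back, refusing files whose header does not match.
   Marshal does not check types, so the caller must still annotate
   the type it expects. *)
let load path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () ->
      let header = input_line ic in
      if header <> magic ^ " " ^ Sys.ocaml_version then
        failwith ("unexpected header: " ^ header);
      Marshal.from_channel ic)
```

The check only rejects files written by a different program or compiler version. It does not detect corruption of the payload, and the expected type still has to be stated at the call site, for example: let (ms : float array array list) = load "run1.mldata".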


#8

Thanks @dbuenzli

Well, I don’t think the filename issue is very important, but I thought I’d ask.

I think that in a sense my application may be very different from many things people do in OCaml, e.g. most of the things listed in the “What are you hacking on this week” discussions. (I should post my case just to be an outlier.) I’m not creating a general-purpose tool for others, although at some point others might be interested. I’m doing scientific modeling. A lot of data is generated in one simulation experiment, and I want to be able to go back and examine parts of the data later, after I look at overview results. utop is a great data exploration tool. REPLs in general are good for that. I wouldn’t necessarily want to do all of the exploration in one utop session, though. Writing data out in an OCaml format is also a way to work around the difficulty in getting ocamlnet working with the libraries I need. I can run a native executable that generates data files, and then pull the data into utop to examine it.

Or I might want to write the data out to a generic format like csv later for use in another program, but there’s no need to generate large and inflexible csv files from the start. I’d rather read in the marshalled data and then write out csv’s as needed. I’d really rather not convert lists of matrices in and out of csv files or sexps or something else text-ey in cases where all I need is to store lists of matrices for later examination.
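
A rough sketch of that workflow, assuming the matrices are plain float array array values (with a matrix library the representation would differ). The names write_csv and export_nth are made up for illustration.

```ocaml
(* Write one matrix as CSV, one row per line. "%.17g" prints enough
   digits to round-trip a double. *)
let write_csv path (m : float array array) =
  let oc = open_out path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () ->
      Array.iter
        (fun row ->
          let cells = Array.to_list (Array.map (Printf.sprintf "%.17g") row) in
          output_string oc (String.concat "," cells);
          output_char oc '\n')
        m)

(* Load a marshalled list of matrices and export only the [n]-th one.
   The type annotation is a promise to the compiler: Marshal cannot
   verify that the file really contains this type. *)
let export_nth ~marshalled ~csv n =
  let ic = open_in_bin marshalled in
  let (ms : float array array list) =
    Fun.protect
      ~finally:(fun () -> close_in ic)
      (fun () -> Marshal.from_channel ic)
  in
  write_csv csv (List.nth ms n)
```

Only the matrix that is actually needed gets converted to text; the full data set stays in the marshalled file.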

(I guess I’ve said similar things before, but it seemed relevant.)

I really love that it’s so easy to marshal data into files. This is a great feature. I understand that I have to be careful about type signatures when I read it in. I’m still exploring this idea, though. Maybe it’s not so easy. The point about small changes segfaulting seems worth worrying about, but at present it’s not clear to me that this could be an issue for me, as long as I haven’t changed the type I expect when I read the data back in. It might be that as I play with marshalling more, I’ll realize that there’s more danger than I think.
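
One way to guard against accidental corruption while staying within the standard library is to store a checksum next to the payload and verify it before unmarshalling. This is only a sketch: the names save_checked and load_checked are hypothetical, the layout (a 16-byte MD5 digest followed by the marshalled bytes) is an assumption, and a digest catches accidental damage, not deliberate tampering or a wrong type annotation.

```ocaml
(* Serialize [v] and write it to [path], prefixed with the MD5 digest
   of the marshalled bytes (Digest.string returns 16 bytes). *)
let save_checked path v =
  let payload = Marshal.to_string v [] in
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () ->
      output_string oc (Digest.string payload);
      output_string oc payload)

(* Read the file back and unmarshal it only if the stored digest
   matches the digest of the payload actually read. *)
let load_checked path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () ->
      let stored = really_input_string ic 16 in
      let payload = really_input_string ic (in_channel_length ic - 16) in
      if Digest.string payload <> stored then
        failwith "checksum mismatch: file is corrupted";
      Marshal.from_string payload 0)
```

With this in place a flipped bit should surface as a Failure exception instead of reaching Marshal.from_string. Reading the data back with the wrong type is still unchecked.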


#9

i have a similar use case but so far i have always converted to csv, which is quite some extra hassle – the benefit is that i can read it in from python and use the rich and familiar visualization options there.

besides Marshal, there seem to be some other serialization options which could be a good compromise between ease of use and a bit more safety, but i’ve never taken the time to explore. what would be the minimal step from Marshal to something a bit safer?


#10

One option is Msgpck.


#11

The main issue with using Marshal is that you have no guarantee you will be able to reproduce the environment that can read your files, if you by accident update your ocaml compiler, or perhaps just related tooling.

You could use csv, but that also has a lot of issues (no standardized way of conveying type information, no standardized specification).

The standard for scientific computing appears to be HDF5, there’s an ocaml module here that appears to offer a reasonably easy to use interface for that: https://github.com/vbrankov/hdf5-ocaml


#12

I didn’t realize that.

“If you by accident update your ocaml compiler, or perhaps just related tooling.”

I will definitely do that at some point. Most likely after I’ve forgotten that it can cause a problem.

“The standard for scientific computing appears to be HDF5, there’s an ocaml module here that appears to offer a reasonably easy to use interface for that: https://github.com/vbrankov/hdf5-ocaml”

I didn’t know about HDF5. Thanks. Sounds good.

It looks like using HDF5 isn’t as trivially easy as Marshal, which makes sense, but I would guess that HDF5 isn’t difficult after learning its model.


#13

There’s a variety of other serialization libraries you could also have a look at (here are some):

  • yojson for json;
  • biniou;
  • sexplib (s-expressions);
  • capnproto (not sure how finished that implementation is)

See also ppx_deriving which makes it easy to integrate with your OCaml types.