Changing OCaml's allocation model when creating new values and accessing data


#1

Is it possible to change OCaml’s value allocation model so that everything is stored in secondary memory instead of primary memory and the program’s heap? The use case is dealing with big data while keeping all the existing code as it is (backward compatibility and reuse). When I write an OCaml function, I don’t want to worry about the amount of data I’m dealing with (e.g. data streamed from an external source, without loading all of it into primary memory), and any returned value should be automatically stored in secondary memory, preferably in a key/value store or a relational database.


#2

The term “secondary memory” isn’t really a standard one, so I have to guess at what you mean. If you mean disk or the like, the answer is no: you’re asking for a complete change in the way the compiler works, and one that would slow the system down by many orders of magnitude. I know of no language that does this. On the other hand…

…there are plenty of database and key/value store libraries for OCaml.


#3

Well, if you use memory mapping in C, that is very similar: the file is mapped as if it were in primary memory and can be accessed via pointers, although this is only a virtualization of the file representation. I know that this approach is slow, but in the big data scenario I cannot possibly load all my data into primary memory, so I have to load it partially in chunks and persist each chunk once it is updated. Memory mapping exploits the operating system’s cache to mitigate such access problems.

I know, but in my use case the access to the database must be opaque to the programmer, who should not need to know where the values are stored or how to access them. If I want to store discriminated union values in a relational database, I must constantly marshal and unmarshal their representation. Moreover, I know of no OCaml library that can stream data from a relational database without allocating a list, and even that approach is very costly for primary memory. In other languages such as C or Java you have pointers to scan the database results, but I was expecting something more “functional” than that.


#4

mmap is possible in OCaml, too, but that’s not really what you were describing.


#5

Yes, I know that mmap is also available in OCaml. The point is that in C I can simply serialize any struct from primary memory to secondary memory, and then read it back as if it were a struct in primary memory. In OCaml, on the other hand, I cannot have the same 1-1 mapping for discriminated union values, as I would first need some marshalling/unmarshalling. In other words, is it possible in OCaml to represent a memory-mapped file as a stream of discriminated union values of the same type?
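For illustration, here is a minimal sketch of one way to get a "file as a stream of variant values" using only the standard library. It does not use mmap: it assumes the values are written one after another with `Marshal` and read back lazily as a `Seq.t`, so only the value currently being inspected lives on the heap. The type `term` and the functions `write_all` and `read_seq` are hypothetical names made up for this example.

```ocaml
(* Hypothetical variant type, used only for illustration. *)
type term =
  | Atom of string
  | Not of term
  | And of term * term

(* Write the values to [path], one marshalled record after another. *)
let write_all (path : string) (values : term list) : unit =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> List.iter (fun v -> Marshal.to_channel oc v []) values)

(* Read the values back lazily, one per forced element.
   The file is opened when the first element is forced and closed at
   end of file. The sequence is ephemeral: each traversal should start
   again from [read_seq path]. Marshal is not type-safe, so the
   annotation below is trusted, not checked. *)
let read_seq (path : string) : term Seq.t =
 fun () ->
  let ic = open_in_bin path in
  let rec next () =
    match (Marshal.from_channel ic : term) with
    | v -> Seq.Cons (v, next)
    | exception End_of_file ->
        close_in ic;
        Seq.Nil
  in
  next ()
```

Pattern matching then works on each element as usual, e.g. `Seq.filter (function Not _ -> true | _ -> false) (read_seq path)`. The marshalling cost is still paid once per element, so this addresses the memory footprint rather than the 1-1 mapping.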


#6

Perhaps you should use C then. Good luck!


#7

Doing that in C makes pattern matching very troublesome… I think I would have to solve more problems, including implementing a pattern matching algorithm myself. Still,


#8

You do know your OS provides virtual memory already, yes?


#9

Yes, I think it was clear from the beginning that my problem is implementing my solution in OCaml for the reasons I stated above :slight_smile:


#10

You haven’t actually explained what you need to do, but I have a suspicion you’re designing a solution in advance for a problem you haven’t yet encountered. This is usually a mistake. I recommend designing your solution when you actually find you need it. Good luck!


#11

That is true only if you are accustomed to solving more practical problems. On the other hand, I did not provide additional detail on the problem I wanted to solve.

I’m implementing a tableaux algorithm for finite models represented by an instance of relationships stored in a relational database. The FOL formulae are in primary memory, as they are not heavy to load, just a bunch of constraints. In my use case, generating all the possible models built up from the relational database takes all the primary memory.


#12

Maybe what you are looking for is the ancient library in opam:
“Allows to use in-memory data structures which are larger than available memory and so are kept in swap”
Or maybe look at the Bigarray module and its mmap feature.
I wonder whether ancient has been superseded by something newer recently.


#13

Also, if you just want to store key-values on disk, this library might be useful:


There are others also in opam (dbm for example).


#14

I’ve just noticed that there is a library providing bindings for RocksDB, which is good, given that this secondary memory data store is quite efficient.

The problem is that Bigarray only supports numeric element types (integers, floats, chars), not union types. I’ll try to find something more recent than “ancient”.
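As a rough sketch of what working around that limitation could look like, a flat, non-recursive variant can be encoded by hand into fixed-size slots of an integer Bigarray (a tag followed by payload fields). The type `cell` and the helpers `create`, `set` and `get` are hypothetical names for this example, and the fixed three-int layout is an assumption; recursive variants would need indices in place of pointers, which this sketch does not attempt.

```ocaml
(* Hypothetical flat variant, used only for illustration. *)
type cell =
  | Empty
  | Int of int
  | Pair of int * int

(* Assumed layout: each value takes three ints: tag, field 1, field 2. *)
let slot = 3

type store = (int, Bigarray.int_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Allocate room for [n] values, all initialised to [Empty] (tag 0). *)
let create (n : int) : store =
  let a = Bigarray.Array1.create Bigarray.int Bigarray.c_layout (n * slot) in
  Bigarray.Array1.fill a 0;
  a

(* Encode [v] into slot [i]. *)
let set (a : store) (i : int) (v : cell) : unit =
  let base = i * slot in
  let tag, x, y =
    match v with
    | Empty -> (0, 0, 0)
    | Int n -> (1, n, 0)
    | Pair (p, q) -> (2, p, q)
  in
  a.{base} <- tag;
  a.{base + 1} <- x;
  a.{base + 2} <- y

(* Decode slot [i] back into a variant value. *)
let get (a : store) (i : int) : cell =
  let base = i * slot in
  match a.{base} with
  | 0 -> Empty
  | 1 -> Int a.{base + 1}
  | 2 -> Pair (a.{base + 1}, a.{base + 2})
  | t -> invalid_arg (Printf.sprintf "get: unknown tag %d" t)
```

Here the store is an ordinary in-memory Bigarray. A file-backed one with the same element type could presumably be obtained from `Unix.map_file` followed by `Bigarray.array1_of_genarray`, at which point `get` and `set` would read and write the mapped file.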

On the other hand, I continued the quest, and I think that this library, which uses object-relational mapping, meets my requirements, given that the result is not allocated in a list but uses the same concept as Java’s ResultSet: http://www.cs.bham.ac.uk/~aps/research/projects/qanat/examples/pgdriver/tutorial.html. On the other hand, the approach is not functional at all. Therefore, I’ll try to mimic the functional behaviour by creating a stream wrapper.
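A stream wrapper of that kind might look roughly like the sketch below. The `cursor` record is a hypothetical stand-in for a ResultSet-like handle (it is not the API of the library linked above or of any real driver), and `cursor_of_range` is a fake in-memory cursor so that the example runs without a database.

```ocaml
(* Hypothetical imperative cursor, standing in for a ResultSet-like handle. *)
type 'row cursor = {
  next : unit -> 'row option;  (* None once the results are exhausted *)
  close : unit -> unit;        (* release the underlying resources *)
}

(* Expose the cursor as a lazy sequence: a row is fetched only when the
   consumer forces the next element, and the cursor is closed when the
   results run out. The sequence is ephemeral, so traverse it once. *)
let seq_of_cursor (c : 'row cursor) : 'row Seq.t =
  let rec go () =
    match c.next () with
    | Some row -> Seq.Cons (row, go)
    | None ->
        c.close ();
        Seq.Nil
  in
  go

(* Fake cursor over the integers lo..hi, for demonstration only. *)
let cursor_of_range (lo : int) (hi : int) : int cursor =
  let i = ref lo in
  {
    next =
      (fun () ->
        if !i > hi then None
        else begin
          let v = !i in
          incr i;
          Some v
        end);
    close = (fun () -> ());
  }

(* Rows are filtered and folded one at a time; no list is built. *)
let sum_even : int =
  seq_of_cursor (cursor_of_range 1 10)
  |> Seq.filter (fun x -> x mod 2 = 0)
  |> Seq.fold_left ( + ) 0
```

One caveat with this design: if the consumer stops before the end, `close` is never called, so a real wrapper would probably also need an explicit way to release the cursor.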


#15

Plus, can you do parallel read/writes to a PG database?
It looks like ancient was exactly developed to match your use case.


#16

Yes! If I use the parallel version of OCaml, I might also benefit from the Postgres locking system:

https://www.postgresql.org/docs/9.1/mvcc.html

Therefore, the multithreaded approach must be preserved by the client, and therefore: