I am asking this question with regards of specific OCaml implementation and I will prefer a solution which is not strictly the best in terms of efficiency, but one where libraries already exist and it will be optimal for me to implement fast.
I have data which is dispersed in three dimensions:
By date
By administrative unit: Country, Province, County, Village
By type of event like births, deaths, road accidents etc.
I need to be able to do queries of type “All road accidents older than 2015-05-01 in Province X”. Or the same “for year 2015” etc.
Here are the constraints/requirements for my software:
Every time the software starts, it will read all the data, build the data structure which will change very slightly over time (or not at all for the first version). So I can help with feeding sorted data upfront if needed etc.
Data will fit in memory and compact memory representation is not extremely important, though it would be nice if I don’t go overboard.
I will need to maintain (or build ad-hoc) sums per administrative unit which are cumulative by time. Eg, road accidents per year (by administrative unit) that also accumulate over time, so I have 3 accidents for the year 2015 and a total of, say 56 since the beginning (till the end of 2015)
Data is dispersed mainly by the time dimension. Administrative units are several hundred in total and event types are no more than 10-20 per run. I probably can store sums for the events on every layer of the tree (update them on write) and this will be efficient enough as memory goes.
I prefer not to use a database, as probably the overhead of creating/destroying sqlite files or creating/truncating databases in a database server will be too big and I also want to be able to heavily rely on OCaml type safety and not have to constantly serialize/deserialize data.
Everything you are describing in terms of your query flexibility makes this sound like a database problem. You don’t want to use a database though. (There are also good OCaml database libraries.)
What sort of performance needs do you have that force you to avoid one? Is this system really going to need every possible ounce of performance?
Mostly in terms of scalability with regards of “total amount of data”. I need to be able to run this many times on many, many datasets (source files) and if I created a database file (or even worse, a db on a server) for every dataset, it will become an infrastructure nightmare very fast.
Isn’t having to re-implement SQL in some ad hoc way worse than creating some sqlite files on the fly and throwing them away? You don’t need a database server or anything like that. But it’s your project and you probably know it best.
It’s not that easy as I also want it working in the browser. I feel it will be much better in terms of type safety and “cross-platform safety” if I don’t leave OCaml’s boundaries.
Even if I have a flat array of entries, and having to do all the calculations on every query by simple folding, this still feels “better” to me right now than having to do all the query calculations on the backend or having to write different FFI for client and server.
Owl is an option, but what it provides are n-dimensional arrays, so you only get numerical data, and you would have to translate queries into array operations. You wouldn’t get the sort of type safety you want, since a number is a number–and not a number representing particular sorts of info.
There has been discussion of adding dataframes with labeled columns, etc.
I’m not sure how much of Owl can run in browsers at this point, but it’s one goal of the project.
Also: I’m pretty sure I remember reading about an in-memory SQL library for OCaml a while ago. If that does exist, type safe queries are possible, too: What is a good and maintained SQL toolkit for ocaml?
(Maybe I’m thinking of a library for another language, though.)