For the next few months I will be working on Irmin, focusing on improving general usability. The goal of this effort is to make Irmin more accessible to potential users and clean up the rough edges for existing users.
One of the biggest problems I see right now is that the documentation is out of sync with the current implementation. I’ve just started refreshing the documentation and tutorials; however, here are a few more projects that @samoht and I have discussed:
Better RPC API:
We need a better way to access an Irmin database remotely. The REST API is currently available, but not ideal.
Cap’n Proto? This would be ideal for binary data.
GraphQL? It has already been started but will need some further attention.
Language bindings
Fix the JavaScript bindings and add bindings for other languages.
This, in conjunction with the point above, will allow non-OCaml applications to access data from Irmin in a well-defined manner.
Clustering
Planning for this is still in the very early stages, so any ideas in this area are more than welcome!
Allow JSON datatype on the command line
We still need to understand how to make this convenient for the user.
Import/export interoperability with other key/value stores
Redis
LMDB
RocksDB
I would love to hear your thoughts on the above list, but I am also interested in hearing your own suggestions, either from experience or just based on what you’d want/expect from a project like Irmin.
Glad to have you hacking on Irmin, @zshipko! For my own use cases involving the opam-ci, it would be useful to have a JSON CLI to add/remove metadata to a remote git repository. It would also be amazing to have a Redis “cache” of the database for quick’n’dirty queries from a web interface.
A working copy of the OCaml Build Infra logs is at https://github.com/avsm/obi-logs (in various branches that are force pushed to), for your reference.
A big thing that really holds Irmin back from its potential is its sync primitives. Not that they are inherently bad: there are plenty of git-like replication protocols out there, and it doesn’t stop Couch. The thing is that it’s built in an intentionally minimal manner, with the expectation that you will build your own xxx with the library, and most people don’t want to understand how git works well enough to scale it in ways other than setting it up in a tree topology, because git is old and taken for granted.
I think what is needed is an anti-entropy protocol and some sort of peer service. The only thing I would actually include in the default RPC server is an API endpoint to configure the replica topology: whether to push after watches are triggered, or who to pull from. Adding UTP as a transport for ocaml-git would also help (Irmin would be incredibly useful for p2p networks).
Agreed. Handling replication/clustering in a more complete manner is definitely something I’m interested in working on. I like the idea of keeping it customizable, but providing higher level synchronization tools.
@avsm pointed me to this thread. We are using Irmin as the storage solution in our project (https://github.com/me-box). In terms of usability, ezirmin provides a decent level of abstraction over Irmin for us. It would be great if it could be updated/maintained to support all the great work going on.
Honestly, I think a good way to do replication that scales well, fits Irmin’s model, and is already common in systems that use CRDTs is an epidemic anti-entropy protocol. It would be implemented as a functor that takes a module for discovery and peer selection, plus the fanout F. It starts a listener and a periodic task that pushes the local data to F receivers; each receiver merges the data it was sent, then sends its own data back to the sender, which merges it in turn.
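To make the shape of that functor a bit more concrete, here is a minimal sketch using only the OCaml standard library. None of the names (`STORE`, `PEERS`, `Gossip`, `select`) come from Irmin; they are hypothetical. The listener and the periodic task are replaced by an in-memory array of replicas and an explicit `round` function, the store is a plain string set whose merge is set union, and peer selection is a deterministic ring rather than the random choice a real epidemic protocol would use.

```ocaml
(* Hypothetical signatures for illustration; not part of Irmin. *)
module type STORE = sig
  type t
  (* Assumed commutative, associative and idempotent, as with CRDTs. *)
  val merge : t -> t -> t
end

module type PEERS = sig
  val fanout : int
  (* Pick the peers that replica [self] pushes to this round. *)
  val select : self:int -> n:int -> int list
end

module Gossip (S : STORE) (P : PEERS) = struct
  (* One push-pull exchange: the receiver merges what it was sent,
     then replies with its own state so the sender can merge too. *)
  let exchange replicas ~sender ~receiver =
    let pushed = replicas.(sender) in
    replicas.(receiver) <- S.merge replicas.(receiver) pushed;
    let reply = replicas.(receiver) in
    replicas.(sender) <- S.merge replicas.(sender) reply

  (* Stand-in for the periodic task: every replica gossips once. *)
  let round replicas =
    let n = Array.length replicas in
    for self = 0 to n - 1 do
      List.iter
        (fun receiver -> exchange replicas ~sender:self ~receiver)
        (P.select ~self ~n)
    done
end

(* Toy instantiation: a grow-only set of keys, merged by union. *)
module StringSet = Set.Make (String)

module SetStore = struct
  type t = StringSet.t
  let merge = StringSet.union
end

(* Deterministic ring selection, chosen only to keep the example
   reproducible; a real deployment would select peers at random. *)
module Ring = struct
  let fanout = 2
  let select ~self ~n =
    List.init (min fanout (n - 1)) (fun i -> (self + i + 1) mod n)
end

module G = Gossip (SetStore) (Ring)

(* Five replicas, each starting with one distinct key. *)
let replicas =
  Array.init 5 (fun i -> StringSet.singleton (Printf.sprintf "key%d" i))

let () = G.round replicas
```

In a real Irmin-based version, `S.merge` would presumably be backed by the store's own merge function, and `exchange` would go over the network instead of mutating an array.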
One possible optimization is for each peer to maintain a table with the reference hash of the last point of sync with every other peer, and diff against it, so it can skip the initial procedure that a git pull goes through to calculate the commits you need. Another is to use a reliable UDP-based transport like UDT or UTP, since we want something that deals well with WANs: the performance of UDP without the unreliability or the message size cap, and with a permissive congestion control algorithm.
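The last-sync table could look roughly like the sketch below. It is a simplification under loud assumptions: history is a linear list of commit hashes (newest first) rather than the DAG that git and Irmin actually use, hashes are plain strings, and `last_sync`, `missing_for` and `record_sync` are hypothetical names, not Irmin functions.

```ocaml
type hash = string

(* Peer name -> hash of the last commit both sides are known to share. *)
let last_sync : (string, hash) Hashtbl.t = Hashtbl.create 16

(* Commits the peer is missing: everything newer than the recorded sync
   point. [history] is assumed linear, newest commit first. If the peer
   is unknown, or the recorded hash is not in our history, fall back to
   sending the full history. *)
let missing_for ~peer history =
  match Hashtbl.find_opt last_sync peer with
  | None -> history
  | Some h ->
    let rec take acc = function
      | [] -> history
      | c :: _ when c = h -> List.rev acc
      | c :: rest -> take (c :: acc) rest
    in
    take [] history

(* After a successful exchange, remember the head we synced at. *)
let record_sync ~peer history =
  match history with
  | [] -> ()
  | head :: _ -> Hashtbl.replace last_sync peer head

let () =
  (* We last synced with "bob" when our head was c2. *)
  record_sync ~peer:"bob" [ "c2"; "c1" ];
  let to_send = missing_for ~peer:"bob" [ "c4"; "c3"; "c2"; "c1" ] in
  List.iter print_endline to_send
```

With a DAG, the lookup would have to walk the ancestors of the current head until it reaches the recorded hash, but the table itself would stay the same.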