Do we have a pure OCaml, high-performance, persistent key-value store in 2023?

UnixJunkie · January 10, 2023, 12:41am

Any recommendation?

Storing 1M entries should be no problem.
Storing 1B entries should not crash, bonus points if it is still fast at doing that.

Chet_Murthy · January 10, 2023, 3:52am

I have to ask: is there a reason you wouldn’t just wrapper RocksDB and call it a day?

ETA: I mean, if I were programming in Golang or Rust, I wouldn’t balk at doing the same.

UnixJunkie · January 10, 2023, 4:59am

I suspect pure OCaml code is less brittle accross ocaml versions than some C code with an OCaml interface.

orbitz · January 10, 2023, 7:06am

What do the Ocaml versions have to do with it? The FFI is quite stable across versions, otherwise the community would have to rewrite their FFI every version. RocksDB or even SQLite are made for these types of things, it probably makes sense to use them.

bluddy · January 10, 2023, 7:14am

This kind of application requires a lot of contributors and expertise, and the field is extremely competitive. OCaml just got multicore, so you’re not going to find a competitive OCaml solution (and it’s not clear OCaml could be competitive with a C++ or Rust-based solution).

UnixJunkie · January 10, 2023, 7:58am

There is this one on github:

So, the question is not so crazy: people have done it.

There is also this one, from the same author:

Chet_Murthy · January 10, 2023, 8:18am

I dug deep into Leveldb for a Fast Paxos-based distributed system about ten years ago. Writing such a thing is not for the faint-hearted, and to write one that is both fault-tolerant and high-performance is a significant undertaking. It requires specialized knowledge that is absolutely non-overlapping with the skill-sets of most OCaml hackers and experienced OCaml developers. Why? B/c when you’re writing systems with that kind of performance, that need that kind of fault-tolerance, you don’t tend to want them to be slow, and so you’re going to write them in C++. It comes with the territory.

Maybe someday Rust will supplant C++ for this sort of work. But even that’s not certain, b/c Rust … well, let me put it this way: I haven’t seen an AVL tree implementation in Rust that didn’t have to jump into unsafe mode for … a significant bit of code. So I have my doubts that Rust is ready for building leveldb from scratch.

Also, if OCaml’s FFI is that unstable that one can’t properly use wrappered C code with multiple OCaml versions, that in itself is a problem that needs to be addressed. I’ve wrappered lots of C and C++ code (C++ code is actually easier, b/c you can use template metaprogramming on the C++ side to generate most of the wrapper) and haven’t had any problems.

ETA: an example of the complexity: even just the record-log (that is used to hold update records – the write-ahead log) is seriously complicated. It’s organized as series of 32k blocks. Each block may contain one or more updates; an update can be split across more than one block. blocks start and end with checksums and eyecatchers, so that if data corruption occurs it can be detected and resynchronized, even at a loss of some updates. That’s what I remember from 10yr ago, but I’m sure there was more. It’s complex stuff, and you need to get it all right.

Oh, and then when writing, you need to avoid flushing after every update. So you need to queue up transactions that have written their changes to the log, but for which the log hasn’t flushed. Then as the log flushes, you can release those trans. Now, the problem is, what if the process crashes during flushing? Then you might have records 1,2,3 fully-written, then 4 partially-written, then 5 missing, then 6, 7 fully-written. So during recovery you have to detect this situation and truncate at record 3. Etc.

The complexity goes on and on. And this is just the WAL: we haven’t gotten to the highly-concurrent skip-list for caching tuples in memory, or the SSTable implementation (b/c today, nobody uses BTrees – they’re bad for write performance even on SSDs – they use some form of log-structure).

EricGT · January 10, 2023, 9:51am

My 2 cents.

For large K-V stores of billions of records one almost certainly needs more space than main memory can hold (think disk space). The problem is that if disk space is used, doing a key retrieval can be expensive especially for collisions. One of the beauties of K-V stores like RocksDB and LevelDB is that they use a Bloom filter or similar which can dramatically cut down the need to do a search (think disk access).

Ulugbek · January 10, 2023, 10:40am

Maybe you find this interesting: GitHub - fpottier/stores: An OCaml library that offers several implementations of (in-memory) stores.

EricGT · January 10, 2023, 10:47am

It would help if you noted if the Key-Value store can fit in memory, needs to be in memory or even needs to be shared among multiple processes, i.e. your other question Is it safe to use a global hash table in an ocaml-5 parallel program?

Those factors are essential for choosing possible solutions.

dinosaure · January 10, 2023, 10:54am

You can take a look on rowex which is an implementation of an persistent adaptative radix tree. It follows several papers and specially this one: RECIPE - Converting Concurrent DRAM Indexes to Persistent-Memory Indexes. Note that the project has already been designed for multicore (read and write access processes are in different fork()).

c-cube · January 10, 2023, 3:13pm

Maybe someday Rust will supplant C++ for this sort of work. But even that’s not certain, b/c Rust … well, let me put it this way: I haven’t seen an AVL tree implementation in Rust that didn’t have to jump into unsafe mode for … a significant bit of code. So I have my doubts that Rust is ready for building leveldb from scratch.

Unsafe rust is not worse than C++. And there are DBs in rust already: look at https://sled.rs for example.

The complexity goes on and on. And this is just the WAL: we haven’t gotten to the highly-concurrent skip-list for caching tuples in memory, or the SSTable implementation (b/c today, nobody uses BTrees – they’re bad for write performance even on SSDs – they use some form of log-structure).

For modern KV stores, that’s true, I’m sure. But sqlite uses B-trees and
is the most widely deployed database in the world, so “nobody” is a bit
of a stretch

Chet_Murthy · January 10, 2023, 3:35pm

Sorry, but no. An analogy: in flying aircraft, there’s a well-known problem that when autopilot can do most tasks, it means that pilots get less and less actual practice of piloting. And so, when they’re called to perform, esp. under stress, they fail. William Langewiesche had a great article about the effects of this trend on that Air France crash off Brazil many years ago. And we see it happening with Tesla and “Full Self ~~Deception~~ Driving” today. Partially-assisted autopilot/self-driving is worse than manual flying/driving, b/c it produces inattentive and inexperienced pilots/drivers.

When you write C++, you are always thinking about memory-ownership and avoiding memory-errors. When writing Rust, even for someone experienced with C++, you relax your guard and write code with much more lack of concern, b/c “the borrow checker will make sure I don’t screw up”. And so when you come to use unsafe mode, you’re doing something unfamiliar, and can make mistakes. But also, since there’s no checks at all that what you’re doing abides by any of the memory-ownership rules of the surrounding code, it’s difficult to know if what you’re doing is unsafe-but-correct, or unsafe-and-incorrect.

Chet_Murthy · January 10, 2023, 3:37pm

So does mysql, berkeley db, and … DB2. All legacy DBs, all predate SSDs and “disk is the new tape”.

ETA: oh whups! I forgot that Sqlite4 is log-structured – so, like leveldb.

c-cube · January 10, 2023, 3:59pm

Sorry, but no. An analogy: in flying aircraft, there’s a well-known problem that when autopilot can do most tasks, it means that pilots get less and less actual practice of piloting. And so, when they’re called to perform, esp. under stress, they fail. William Langewiesche had a great article about the effects of this trend on that Air France crash off Brazil many years ago. And we see it happening with Tesla and “Full Self ~~Deception~~ Driving” today. Partially-assisted autopilot/self-driving is worse than manual flying/driving, b/c it produces inattentive and inexperienced pilots/drivers.

Programming is not driving/piloting :-). It doesn’t progressively become
automatic, nor is it real-time (unless you do live music with it I
guess).

When you write C++, you are always thinking about memory-ownership and avoiding memory-errors. When writing Rust, even for someone experienced with C++, you relax your guard and write code with much more lack of concern, b/c “the borrow checker will make sure I don’t screw up”. And so when you come to use unsafe mode, you’re doing something unfamiliar, and can make mistakes. But also, since there’s no checks at all that what you’re doing abides by any of the memory-ownership rules of the surrounding code, it’s difficult to know if what you’re doing is unsafe-but-correct, or unsafe-and-incorrect.

I think (modern) C++ has the same issue, you don’t have to think much
about lower level aspects until you do (e.g. you rarely have raw
pointers and non-RAII managed objects in “normal” code). In rust, when
you enter an unsafe block, you know you have to pay more attention
(and it’s encouraged to write a comment about why the code is actually
sound, with like // SAFETY: <…>). The dangerous parts are literally
delimited syntactically and are generally as narrowly scoped as possible.

Now I agree that unsafe rust is a bit of a different beast, more
difficult than safe rust, and it has its own learning curve. But again,
so does C++, and you don’t know where you need to pay extra attention.
In both cases, you have raw pointers, UB to think of, exception safety,
etc. to consider.

UnixJunkie · January 11, 2023, 1:07am

My current requirements:

does not need to fit in memory (so that we can store as many things as the disk can handle)
does not need to be shared between multiple processes

thomas_ridge · January 11, 2023, 5:15am

I would consider both kv-hash and tjr_btree as unfinished experiments unfortunately. Neither is ready for production use.

tjr_btree was intended to be part of ImpFS (GitHub - tomjridge/imp_fs: ImpFS, a new filesystem.), which was planned to be implemented in Isabelle. The core of tjr_btree is a B-tree written in Isabelle (GitHub - tomjridge/isa_btree). The codebase really needs a complete overhaul at this stage - there is a lot of work to do to get it to a fit state.

kv-hash was intended as a replacement for mirage/index (GitHub - mirage/index: A platform-agnostic multi-level index) as used in Irmin (GitHub - mirage/irmin: Irmin is a distributed database that follows the same design principles as Git). It is a mix of ideas from B-trees and persistent hash tables. After a few months of development of kv-hash, Irmin independently changed to use “minimal indexing” which meant that the issues in mirage/index that kv-hash was supposed to address were no longer present. At this point work on kv-hash ceased.

I would also echo bluddy’s comments: “This kind of application requires a lot of contributors and expertise, and the field is extremely competitive.” A basic B-tree gives you a lot, but modern implementations of key-value stores have a lot of additional technology to improve performance further.

EricGT · January 11, 2023, 9:25am

Sorry for more questions,

Which Operating Systems (OS) are needed to run the code?
Which build system is needed?
Which API can be used? [C, C++]
What type of software license is needed?

I ask because I (EricGT) too have a very similar need (related SWI-Prolog work) and RocksDB checks most of the boxes but getting it to compile natively for Windows is still a work in progress (vcpkg might be the answer and MSYS2 is also of value) and I hope that for Mac OS it is close enough to Linux to not be a problem, I have never programmed for Mac OS so can only guess for Mac OS.

As choosing a build system for multiple OS and/or external libraries it seems the trend is to use CMake and for any other build system you have to do most of the leg work and may find yourself in a group of one for certain subtasks.

If you want to learn more about RocskDB from a programmer perspective see the RocksDB blog entries, there are lots of unofficial blogs and info on RocksDB but many of those can be misleading, gloss over a needed detail, be out of date, be specific to a particular problem, etc.

While RocksDB does have a C API, the full complement of functionality can not be done through that API, to get the full complement of functionality the C++ API must be used.

I also agree with others that you should not build such a K-V store from scratch on your own, I would not, it would be worse than shooting yourself in the foot, use time tested and production quality code used by many others.

EDIT

For real world numbers using a large real world data set (SemMedDB) with SWI-Prolog and RocksDB as the Key-Value store (ref)

Cumulative writes: 
   339M writes, 
   339M keys, 
   339M commit groups,
   1.0 writes per commit group,
ingest:
   34.14 GB,
   0.70 MB/s

If you are interpreting that 34.14 GB of data was loaded then you are reading it correctly.

mndrix · January 11, 2023, 10:08am

I’ve often found LMDB easier to embed and work with than LevelDB or RocksDB. It’s also simple enough that reimplementing it from scratch in OCaml is doable. For example, BoltDB and bbolt are reimplementations in Go.

UnixJunkie · January 12, 2023, 1:08am

Which Operating Systems (OS) are needed to run the code?

Linux and Mac OS, so unix-like.

Which build system is needed?

I don’t care.

Which API can be used? [C, C++]

The API must be OCaml.

What type of software license is needed?

I don’t care.

Topic		Replies	Views
Advantages of OCaml over Rust Community	76	32050	September 4, 2019
Changing OCaml's allocation model when creating new values and accessing data Learning	15	1409	December 10, 2018
Interesting OCaml Articles Community web , blog , video , podcast , social-media	109	19933	October 18, 2023
[ANN] BAP 2.1.0 Release Community announce , bap	26	4580	September 22, 2020
[ANN] reed-solomon-erasure 1.0.1 Ecosystem announce	13	1795	August 1, 2018

Do we have a pure OCaml, high-performance, persistent key-value store in 2023?

Related topics