B Trees in Ocaml via Fmlib_std 0.3.0

I am pleased to announce the release (0.3.0) of fmlib, a group of functional
libraries with managed effects.

The main new feature of release 0.3.0 are B trees in the package fmlib_std. B trees can be used to implement finite sets and finite maps. Fmlib_std’s B trees have functionality similar to the modules Set and Map of the standard library.

The modules Set and Map of the standard library are based on AVL trees. B
trees offer the same functionality but have on modern processors a better cache
performance and have better data locality.

The current B tree implementation in Fmlib_std implements B trees by using arrays
which are guaranteed to fit into a cache line. The design of B trees is
described here. The API can be found here.

The group of libraries Fmlib has four main components:

  • Fmlib_std Standard Datatypes: This component offers some modules from Stdlib
    with additional functionality. E.g. Fmlib_std.Array offers functions to
    insert elements into arrays, remove elements from an array and binary search
    in a sorted array. It has the modules Result and Option which can be used
    to avoid exceptions and use exceptions in a more structured way. The modules
    Result and Option in Fmlib offer a complete monadic interface and offer
    the let* operator to write well readable monadic code.

  • Fmlib_pretty Pretty Printing: Print tree like structures in a nice way and use
    the library completely functional. The library does not assume a specific IO
    method. The pretty printer generates a lazy stream of characteres which can be
    written by all io functions.

  • Fmlib_parse Combinator Parsing: Easily parse textual input by the use of
    combinators. The library supports indentation sensitivity and can therefore be
    used to parse yaml files, haskell, python, etc. Furthermore no input method is
    assumed. The generated parsers are sinks of tokens (or characters). You can
    choose any input method and push the tokens/characters into the parsers. The
    generated parsers are fully incremental. Parser can be stored at any position
    of the input stream and in case of interactive editing, parsing can be resumed
    from any point of the stream.

  • Fmlib_js Interface to Javascript: This components contains primitives to
    interface to javascript via js_of_ocaml.

Fmlib can be installed via opam:

opam update
opam install fmlib
opam install fmlib_std
opam install fmlib_pretty
opam install fmlib_parse
opam install fmlib_js

The source code of the library is located at github

5 Likes

(I made this comment on reddit but I’m not sure which one is best for discussing, so replaying it here.)

Fmlib seems to be growing in a monolithic library that does many unrelated things. I like parsing, and pretty-printing, I think javascript interop is fine, and B-trees are interesting, but they really have no business getting mixed together! Have you considered splitting the project into specific projects each with a scope that is narrower and more clearly delimited?

1 Like

(the links do not seem to be working for me)

Thanks for the comment, but the library is not monolithic. The packages are completely separated and can be used separately. The four libraries Fmlib_std, Fmlib_pretty, Fmlib_parse and Fmlib_js are separate opam packages. The only common thing is that they are stored in one repo and the libraries Fmlib_pretty, Fmlib_parse and Fmlib_js use Fmlib_std.

There is no other dependency.

Sorry. I hope the links do work now.

As you said, the core idea of B-trees is to fit as many keys as possible into a single cache line, so that finding one is fast. So, I was expecting to find the type Key.t array in the implementation (with Key.t = int hopefully). But the implementation contains (Key.t * 'a) array, which seems to completely defeat the point of B-trees, as keys are thus stored in separate cache lines.

I don’t want to sound like I’m criticizing the work in Fmlib. It sounds like nice work, although I have not looked at the details for now. But it definitely feels like a monolithic library to me, for example:

  • it has a single name, Fmlib, that is used to name all components
  • you talk of it as a single library, and announce all its subcomponents together

The fact that you keep the packaging dependencies separated is nice (you avoid the problem of people avoiding your library because it links in things that they don’t want/need). But I think there are many other aspects of “being a monolithic library”, that are maybe more psychological, that are less positive:

  • Each module has a weak identity/branding. If I’m looking for a parser combinator library, I will remember to look at a couple libraries that were advertized to me as “parser combinator libraries” (for example angstrom), and I’m much less likely to remember and consider fmlib_parse. (Because I just remember Fmlib and I forgot half of what’s in there.)

  • Relatedly, it makes it harder for me to get a sense of which parts are moving fast, which parts are in maintenance mode, which parts are very good and which parts are okay but less pleasant to use in practice than other alternatives.

I tried to think of whether I have concrete suggestions on what I think would make things “feel less monolithic”. A couple early/naive/dumb proposals:

  • I would drop the fmlib prefix, which to me suggests one (monolithic) library. If you like to have a common umbrella for your software, maybe just fm- ? It may be irrational but fm-parse and fm-btrees sound less monolithic than fmlib_parse and fmlib_btrees. (fm- reads like the name of a group, while fmlib is one library).
  • Announcing the libraries separately.
2 Likes

@silene
Your point is valid in some cases. If Key.t = int then it would be better to work with two arrays Key.t array and 'a array. This is, because ocaml stores ints in one machine word without boxing. If e.g. Key.t = string then the advantage is questionable, because a string is a pointer to a string object i.e. it is boxed.

Boxing is the normal case for ocaml values. There are only some exceptions like ints, chars, tag only variant types etc.

The truth can only be found out be making benchmark measurements. The change of the implementation in Fmlib from (Key.t * 'a) array to Key.t array * 'a array to store key value pairs would be straightforward and easy.

But even with (Key.t * 'a) array the cache efficiency and data locality is better than with AVL and redblack trees, because there is much less indirection. There are always between 16 and 32 values which you can address with one offset. In the AVL or redblack trees you have to navigate through the nodes.

However I would be happy to change the implementation if there is strong evidence for a more efficient design.

1 Like

I agree, benchmarks are the way.
You are welcome to add fmlib to

where there’s already a bunch of libraries being compared :-).

1 Like

I would also add that it’s more effective from the community’s perspective to use one of the existing standard libraries and enhance them. For example, Containers could use a B-Tree implementation, which would then be easily found by all Containers users and would then also be tested thoroughly in real-world settings.

Similarly, I’m not sure we need yet another set of enhancements to the standard library modules. We have Containers and Batteries for that. That doesn’t mean your efforts are unwanted, but simply that I recommend thinking about how to make sure your contribution reaches the maximum number of users in the community. How many users are likely to use Fmlib_std rather than one of the other standard library enhancements? Probably not too many. How many users would benefit from your contributions to Containers? A lot more.

That is not true. (Key.t * 'a) array adds an extra indirection, which makes it just as costly as any binary tree.

@silene: What is your proposal to make it more efficient? The current definition of the datatype is

    type 'a pairs = (Key.t * 'a) array

    type 'a t =
        | Leaf of 'a pairs
        | Node of 'a pairs * 'a t array


Is your proposal


    type ‘a pairs = Key.t array * ‘a array

?

Regards
Helmut

Yes, but I would even inline the pairs type, otherwise you are still suffering from a superfluous indirection. It only happens when entering a new node, but that is still two extra indirections with respect to binary trees. By inlining, you are down to only one extra indirection per node, which should be more than alleviated by the fact that lots of keys (or their addresses for non-integer keys) are now guaranteed to be stored into consecutive cache lines, contrarily to binary trees.

type 'a t =
   | Leaf of Key.t array * 'a array
   | Node of Key.t array * 'a array * 'a t array
1 Like

Guillaume is right, but the very first thing to do is to create solid
performance measurements (benchmarks) to be able to assess the
performance and see if changes actually accelerate code.

2 Likes

Main-memory databases are remarkably tricky to make cache-friendly. Throw in NUMA, and it gets even worse. Two thoughts come to mind:

(1) maybe plan on storing keys/values in serialized form
(2) maybe think about using a C++ implementation ?

But above all, yeah, good benchmarks are really important.

@gasche: Thanks for the comment. With respect to the naming I absolutely agree with your point. I am unhappy as well with fmlib_parse, fmlib_pretty and would prefer as you suggested fm_parse, fm_pretty.

Unfortunately I don’t know how to change the names without having obsolete packages in opam and having new packages in opam which have the same functionality and just a different name.

Let me explain why this unfortunate naming exists in the first place. My main project is a compiler project. Since a functional and monadic style is good practice in compiler writing, I have separated some general functions which have nothing to do with my specific compiler into a directory called fmlib (standing for functional monadic library).

Since the fmlib functions are generic, I decided to extract them into a standalone library, write documentation for them and release them in the hope that they might be useful not only for me.

The original fmlib had several modules like Std, Parse, Pretty etc. These modules are pretty unrelated. Therefore, also triggered but some comments like your comment, I decided to make several independent packages from the independent modules. Unfortunately I have used the prefix fmlib instead of the better fm to avoid namespace pollution in the namespace of the opam packages.

But as you said, I don’t want fmlib to be viewed as a library (therefore the lib is unfortunate), but more as a group of libraries which share some common programming paradigm, namely functional monadic programming, but have separate use cases.

This is another case where opam scopes (namespaces) would solve the problem perfectly: Proposal: opam namespace (scope) for self-published packages

@silene: Maybe you are right. This seems to be the flattest structure.

If I just want the B-tree data structure, should I install just the fmlib opam package?
After trying, no:
Fmlib_std.Btree.Set and
Fmlib_std.Btree.Map.
So, the library to advertise as providing a B-tree is fmlib_std, not fmlib…

1 Like

@UnixJunkie: That’s right. Sorry for the inconvenience. As already discussed, the naming is somewhat unfortunate. fmlib is just the umbrella for fmlib_std, fmlib_pretty, fmlib_parse and fmlib_js. fmlib_std contains standard datatypes like Btree.Map and Btree.Set.