I am pleased to announce the release (0.3.0) of fmlib, a group of functional
libraries with managed effects.
The main new feature of release 0.3.0 are B trees in the package
fmlib_std. B trees can be used to implement finite sets and finite maps. Fmlib_std’s B trees have functionality similar to the modules
Map of the standard library.
Map of the standard library are based on AVL trees. B
trees offer the same functionality but have on modern processors a better cache
performance and have better data locality.
The current B tree implementation in
Fmlib_std implements B trees by using arrays
which are guaranteed to fit into a cache line. The design of B trees is
described here. The API can be found here.
The group of libraries
Fmlib has four main components:
Fmlib_std Standard Datatypes: This component offers some modules from
with additional functionality. E.g.
Fmlib_std.Array offers functions to
insert elements into arrays, remove elements from an array and binary search
in a sorted array. It has the modules
Option which can be used
to avoid exceptions and use exceptions in a more structured way. The modules
Fmlib offer a complete monadic interface and offer
let* operator to write well readable monadic code.
Fmlib_pretty Pretty Printing: Print tree like structures in a nice way and use
the library completely functional. The library does not assume a specific IO
method. The pretty printer generates a lazy stream of characteres which can be
written by all io functions.
Fmlib_parse Combinator Parsing: Easily parse textual input by the use of
combinators. The library supports indentation sensitivity and can therefore be
used to parse yaml files, haskell, python, etc. Furthermore no input method is
assumed. The generated parsers are sinks of tokens (or characters). You can
choose any input method and push the tokens/characters into the parsers. The
generated parsers are fully incremental. Parser can be stored at any position
of the input stream and in case of interactive editing, parsing can be resumed
from any point of the stream.
Fmlib can be installed via opam:
opam install fmlib
opam install fmlib_std
opam install fmlib_pretty
opam install fmlib_parse
opam install fmlib_js
The source code of the library is located at github
(I made this comment on reddit but I’m not sure which one is best for discussing, so replaying it here.)
(the links do not seem to be working for me)
Thanks for the comment, but the library is not monolithic. The packages are completely separated and can be used separately. The four libraries
Fmlib_js are separate opam packages. The only common thing is that they are stored in one repo and the libraries
There is no other dependency.
Sorry. I hope the links do work now.
As you said, the core idea of B-trees is to fit as many keys as possible into a single cache line, so that finding one is fast. So, I was expecting to find the type
Key.t array in the implementation (with
Key.t = int hopefully). But the implementation contains
(Key.t * 'a) array, which seems to completely defeat the point of B-trees, as keys are thus stored in separate cache lines.
I don’t want to sound like I’m criticizing the work in
Fmlib. It sounds like nice work, although I have not looked at the details for now. But it definitely feels like a monolithic library to me, for example:
- it has a single name,
Fmlib, that is used to name all components
- you talk of it as a single library, and announce all its subcomponents together
The fact that you keep the packaging dependencies separated is nice (you avoid the problem of people avoiding your library because it links in things that they don’t want/need). But I think there are many other aspects of “being a monolithic library”, that are maybe more psychological, that are less positive:
Each module has a weak identity/branding. If I’m looking for a parser combinator library, I will remember to look at a couple libraries that were advertized to me as “parser combinator libraries” (for example angstrom), and I’m much less likely to remember and consider fmlib_parse. (Because I just remember
Fmlib and I forgot half of what’s in there.)
Relatedly, it makes it harder for me to get a sense of which parts are moving fast, which parts are in maintenance mode, which parts are very good and which parts are okay but less pleasant to use in practice than other alternatives.
I tried to think of whether I have concrete suggestions on what I think would make things “feel less monolithic”. A couple early/naive/dumb proposals:
- I would drop the
fmlib prefix, which to me suggests one (monolithic) library. If you like to have a common umbrella for your software, maybe just
fm- ? It may be irrational but
fm-btrees sound less monolithic than
fm- reads like the name of a group, while
fmlib is one library).
- Announcing the libraries separately.
Your point is valid in some cases. If
Key.t = int then it would be better to work with two arrays
Key.t array and
'a array. This is, because ocaml stores
ints in one machine word without boxing. If e.g.
Key.t = string then the advantage is questionable, because a
string is a pointer to a string object i.e. it is boxed.
Boxing is the normal case for ocaml values. There are only some exceptions like
chars, tag only variant types etc.
The truth can only be found out be making benchmark measurements. The change of the implementation in
(Key.t * 'a) array to
Key.t array * 'a array to store key value pairs would be straightforward and easy.
But even with
(Key.t * 'a) array the cache efficiency and data locality is better than with AVL and redblack trees, because there is much less indirection. There are always between 16 and 32 values which you can address with one offset. In the AVL or redblack trees you have to navigate through the nodes.
However I would be happy to change the implementation if there is strong evidence for a more efficient design.
I agree, benchmarks are the way.
You are welcome to add fmlib to
where there’s already a bunch of libraries being compared :-).
I would also add that it’s more effective from the community’s perspective to use one of the existing standard libraries and enhance them. For example, Containers could use a B-Tree implementation, which would then be easily found by all Containers users and would then also be tested thoroughly in real-world settings.
Similarly, I’m not sure we need yet another set of enhancements to the standard library modules. We have Containers and Batteries for that. That doesn’t mean your efforts are unwanted, but simply that I recommend thinking about how to make sure your contribution reaches the maximum number of users in the community. How many users are likely to use
Fmlib_std rather than one of the other standard library enhancements? Probably not too many. How many users would benefit from your contributions to Containers? A lot more.
That is not true.
(Key.t * 'a) array adds an extra indirection, which makes it just as costly as any binary tree.
@silene: What is your proposal to make it more efficient? The current definition of the datatype is
type 'a pairs = (Key.t * 'a) array
type 'a t =
| Leaf of 'a pairs
| Node of 'a pairs * 'a t array
Is your proposal
type ‘a pairs = Key.t array * ‘a array
Yes, but I would even inline the
pairs type, otherwise you are still suffering from a superfluous indirection. It only happens when entering a new node, but that is still two extra indirections with respect to binary trees. By inlining, you are down to only one extra indirection per node, which should be more than alleviated by the fact that lots of keys (or their addresses for non-integer keys) are now guaranteed to be stored into consecutive cache lines, contrarily to binary trees.
type 'a t =
| Leaf of Key.t array * 'a array
| Node of Key.t array * 'a array * 'a t array
Guillaume is right, but the very first thing to do is to create solid
performance measurements (benchmarks) to be able to assess the
performance and see if changes actually accelerate code.
Main-memory databases are remarkably tricky to make cache-friendly. Throw in NUMA, and it gets even worse. Two thoughts come to mind:
(1) maybe plan on storing keys/values in serialized form
(2) maybe think about using a C++ implementation ?
But above all, yeah, good benchmarks are really important.
@gasche: Thanks for the comment. With respect to the naming I absolutely agree with your point. I am unhappy as well with
fmlib_pretty and would prefer as you suggested
Unfortunately I don’t know how to change the names without having obsolete packages in opam and having new packages in opam which have the same functionality and just a different name.
Let me explain why this unfortunate naming exists in the first place. My main project is a compiler project. Since a functional and monadic style is good practice in compiler writing, I have separated some general functions which have nothing to do with my specific compiler into a directory called
fmlib (standing for functional monadic library).
fmlib functions are generic, I decided to extract them into a standalone library, write documentation for them and release them in the hope that they might be useful not only for me.
fmlib had several
Pretty etc. These modules are pretty unrelated. Therefore, also triggered but some comments like your comment, I decided to make several independent packages from the independent modules. Unfortunately I have used the prefix
fmlib instead of the better
fm to avoid namespace pollution in the namespace of the opam packages.
But as you said, I don’t want
fmlib to be viewed as a library (therefore the
lib is unfortunate), but more as a group of libraries which share some common programming paradigm, namely functional monadic programming, but have separate use cases.
This is another case where opam scopes (namespaces) would solve the problem perfectly: Proposal: opam namespace (scope) for self-published packages
@silene: Maybe you are right. This seems to be the flattest structure.
If I just want the B-tree data structure, should I install just the fmlib opam package?
After trying, no:
So, the library to advertise as providing a B-tree is fmlib_std, not fmlib…
@UnixJunkie: That’s right. Sorry for the inconvenience. As already discussed, the naming is somewhat unfortunate.
fmlib is just the umbrella for
fmlib_std contains standard datatypes like