Document management

I would like to do some document management in an OCaml program.
It can be basic, or become progressively complex, mainly because of the structure of the repository.

The requirements are to be able to:

  • define a document repository (name, size, subjects, mime type, and various key words).
  • save a document with required values (document_name, original_file_name, author, document_responsible, upload_date, source, comment, version, etc.), given or computed.
  • delete a document.
  • search for a document
    1/ with criteria (keyword, size, mime type) that do not depend on the file content
    2/ AND with keywords related to the file content: this requirement implies indexing capabilities (a plugin is needed to read the file and store key/values, or something more useful).
  • edit a document, with its relevant editor, with versioning capability.
  • log each event on documents.
  • not rely on a document management software but only on required libraries.
  • handle performance requirements from the beginning, given a large number of documents (possibly of large size) that may be edited, searched and served over the internet.
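
To make the metadata and the content-independent search (item 1/ above) concrete, here is a minimal sketch in plain OCaml. The record fields, the `criterion` type and the `search` function are names made up for illustration, and the documents live in an in-memory list rather than in any real store.

```ocaml
(* Hypothetical metadata record; field names loosely follow the requirement list. *)
type document = {
  document_name : string;
  original_file_name : string;
  author : string;
  mime_type : string;
  size : int; (* bytes *)
  keywords : string list;
  version : int;
}

(* Search criteria that do not depend on the file content. *)
type criterion =
  | Has_keyword of string
  | Mime_is of string
  | Size_at_most of int

let matches doc = function
  | Has_keyword k -> List.mem k doc.keywords
  | Mime_is m -> String.equal m doc.mime_type
  | Size_at_most n -> doc.size <= n

(* A document is returned only if it satisfies every criterion. *)
let search criteria docs =
  List.filter (fun d -> List.for_all (matches d) criteria) docs

let sample = [
  { document_name = "spec"; original_file_name = "spec.pdf"; author = "alice";
    mime_type = "application/pdf"; size = 120_000;
    keywords = [ "design"; "v1" ]; version = 1 };
  { document_name = "notes"; original_file_name = "notes.txt"; author = "bob";
    mime_type = "text/plain"; size = 2_000;
    keywords = [ "design" ]; version = 3 };
]

let () =
  search [ Has_keyword "design"; Mime_is "text/plain" ] sample
  |> List.iter (fun d -> print_endline d.document_name)
```

A linear scan like this obviously does not meet the performance requirement; it only pins down what a query means before choosing a store that can index it.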

Solutions?
Sure, I can save the documents as files in some directories.
But I would like to avoid using a rigid directory naming/structure because the repository structure will evolve.

A database could also be a solution. Storing (possibly large) binary objects (pdf, etc.) seems possible. However, there are many discussions in various forums where opinions are opposed:
no you should not because it will use too much CPU/RAM,
no you should just store path to files in a dedicated file system,
no you should not because a network file system will be more efficient,
yes you can, as I have been doing for years with millions of docs,
yes you can and documents can be streamed without using too much RAM,
yes you really should do it because your DB transactional mechanisms will ensure integrity,
etc.

Indexing file content requires a plugin for each kind of file, plus an index store (key/value or a better approach).
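
As a rough sketch of that idea, assuming one extractor plugin per MIME type and a simple in-memory inverted index: the `EXTRACTOR` signature, `Plain_text`, `index_document` and `lookup` are all hypothetical names, and the tokenizer only splits on spaces.

```ocaml
(* Hypothetical plugin interface: one extractor per MIME type turns the raw
   file content into index terms. *)
module type EXTRACTOR = sig
  val mime_type : string
  val terms : string -> string list
end

module Plain_text : EXTRACTOR = struct
  let mime_type = "text/plain"

  (* Deliberately naive tokenizer: lowercase, split on spaces only. *)
  let terms content =
    String.split_on_char ' ' (String.lowercase_ascii content)
    |> List.filter (fun w -> w <> "")
end

let extractors : (module EXTRACTOR) list = [ (module Plain_text) ]

(* Inverted index: term -> ids of the documents containing it. *)
let index : (string, string) Hashtbl.t = Hashtbl.create 64

(* Returns false when no plugin is registered for the MIME type. *)
let index_document ~id ~mime ~content =
  match
    List.find_opt (fun (module E : EXTRACTOR) -> E.mime_type = mime) extractors
  with
  | None -> false
  | Some (module E : EXTRACTOR) ->
    List.iter
      (fun term -> Hashtbl.add index term id)
      (List.sort_uniq compare (E.terms content));
    true

let lookup term =
  List.sort_uniq compare (Hashtbl.find_all index (String.lowercase_ascii term))

let () =
  ignore (index_document ~id:"d1" ~mime:"text/plain" ~content:"Irmin stores trees");
  ignore (index_document ~id:"d2" ~mime:"text/plain" ~content:"PostgreSQL stores rows");
  List.iter print_endline (lookup "stores")
```

A PDF or office-document plugin would implement the same signature on top of whatever parsing library is available; the index itself could later move to a database table or a dedicated key/value store.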

Based on these requirements and candidate solutions, what would be the best way to do that in OCaml?

Has anyone successfully used a DB to store files/binary data? Especially PostgreSQL? (I see bytea and oid, which may be usable for that.)

Can recent packages such as irmin do the job? (I am thinking of its native git-like versioning capabilities.)

What are the difficulties I may not have seen?

Thanks.


Have you checked out Irmin? Sounds like it might be what you’re looking for: https://irmin.org/

It’s a NoSQL DB built around the principles of git commits and branches, so might be good for document management. You can probably get diffs and merges for free.

(Written as an OCaml library btw)


Oops sorry you already mentioned it :joy:

Yes, but not enough to make a decision. It looks like you need to create yourself the indexes that you usually get from an RDB. And probably other things I’m not clearly aware of.

A question was asked in May on irmin github about “using irmin Vs RDB” (in general, not document management related), but it got no answer.

Maybe the irmin guys can enlighten us a little bit?


From reading through your application requirements, Irmin may indeed be a good fit – that depends on how high-level you require the API to be.

What Irmin will provide you

  • A simple API for modifying tree-like objects in the Lwt monad, where all modifications are inherently tracked as a history of "commit"s with associated metadata.

  • Features for forking the store into independent versions (sharing the same heap of objects) that can later be reconciled via merging.

  • A variety of backend implementations and ways to access the store (GraphQL server, HTTP server etc.), with differing performance characteristics. I suspect that you would want to make use of the Git backend for Irmin, which does the obvious encoding of OCaml trees onto a file-system and exactly replicates git's format for version control. That is: I can cd into an irmin-git store, run git log, and literally see all of the changes to the store as a graph of Git commits with appropriate timestamps and messages.

    Personally, I like the idea of having my database exist in such an easily-readable format, but whether this backend is suitable for your use-case will depend on how large your “possibly large” files will be; Git is not designed to handle very large files efficiently and so we have written other backends intended to meet those use-cases.

What Irmin will not provide you

  • Indices over your data, or first-class ways to construct efficient “views” of that data. These would need to be implemented as part of a higher-level API.

  • First-class support for storing multiple types of content data, or directly encoding content “metadata” within the store (keywords, MIME type etc.). This would need to be built on top of the existing homogeneous tree API.

  • A pre-provided tree structure that works efficiently for your application semantics. Using Irmin requires thinking about how you want to encode the structure of individual snapshots as a tree, exactly as one would have to do with a directory hierarchy.
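
To illustrate the last two points, here is a small model in plain OCaml. It does not use the Irmin API: the store is just a standard-library map from paths (lists of strings) to string contents, and the `docs/<id>/...` layout, `save` and `ids_with_meta` are hypothetical names chosen for the example.

```ocaml
(* Stand-in for a tree-shaped store: a map from paths to string contents.
   This is NOT the Irmin API, only a model of a homogeneous path -> content tree. *)
module Path_map = Map.Make (struct
  type t = string list
  let compare = compare
end)

(* Hypothetical layout: everything about a document lives under docs/<id>/. *)
let content_path id = [ "docs"; id; "content" ]
let meta_path id field = [ "docs"; id; "meta"; field ]

(* Metadata has to be encoded as ordinary leaves next to the content. *)
let save ~id ~content ~meta store =
  let store = Path_map.add (content_path id) content store in
  List.fold_left
    (fun s (field, value) -> Path_map.add (meta_path id field) value s)
    store meta

(* Without an index, a metadata query has to walk the whole tree. *)
let ids_with_meta field value store =
  Path_map.fold
    (fun path v acc ->
      match path with
      | [ "docs"; id; "meta"; f ] when f = field && v = value -> id :: acc
      | _ -> acc)
    store []
  |> List.sort compare

let store =
  Path_map.empty
  |> save ~id:"42" ~content:"pdf bytes"
       ~meta:[ ("mime", "application/pdf"); ("author", "alice") ]
  |> save ~id:"43" ~content:"hello"
       ~meta:[ ("mime", "text/plain"); ("author", "alice") ]

let () = List.iter print_endline (ids_with_meta "author" "alice" store)
```

The full walk in `ids_with_meta` is the cost of having no index; avoiding it means maintaining a second subtree (for example keyed by author) and keeping it consistent on every write, which is the higher-level API mentioned above.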

I hope that helps clarify the situation somewhat :slightly_smiling_face:


The 3 features mentioned are fine.

I would guess something like a 100 GB maximum store size and a 100 MB maximum file size, with acceptable performance. But it should be OK to handle larger files from time to time with reduced performance, as long as it stays safe.

From your experience, can you indicate the max number of files we can handle with file system/irmin-git, and especially what is a reasonable maximum file size, and a reasonable max store size?

These missing features are key features.
They look more or less like the main features offered by an RDB.

Do you have an estimate of the work required to do that on top of irmin?

Can you point to some implementation examples of what these features could look like? (The irmin tutorial now looks nice but sticks to basic features.)
Thanks.

People regularly use relational databases to solve this problem. Unless your files are going to be in the gigabyte range, an RDB should be just fine for size. And for read-caching, you can always cache files in the filesystem and serve from there, to short-circuit around the RDB.
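
A minimal read-through cache along those lines could look like the sketch below. The database is faked with a `Hashtbl` plus a read counter so the example stays self-contained; `fetch_from_db`, `cache_dir`, `get` and `invalidate` are illustrative names, and document ids are assumed to be safe to use as file names.

```ocaml
(* Stand-in for the database: a table of blobs plus a counter of reads. *)
let db : (string, string) Hashtbl.t = Hashtbl.create 16
let db_reads = ref 0

let fetch_from_db id =
  incr db_reads;
  Hashtbl.find_opt db id

(* Assumption: ids contain no path separators, so they can be file names. *)
let cache_dir =
  Filename.concat (Filename.get_temp_dir_name ()) "doc_cache_sketch"

let cache_path id = Filename.concat cache_dir id

let read_file path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

let write_file path data =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> output_string oc data)

(* Must be called whenever a document is edited or deleted in the database. *)
let invalidate id =
  let path = cache_path id in
  if Sys.file_exists path then Sys.remove path

(* Serve from the filesystem when possible, otherwise read the database once
   and populate the cache. *)
let get id =
  if not (Sys.file_exists cache_dir) then Sys.mkdir cache_dir 0o755;
  let path = cache_path id in
  if Sys.file_exists path then Some (read_file path)
  else
    match fetch_from_db id with
    | None -> None
    | Some data ->
      write_file path data;
      Some data

let () =
  Hashtbl.replace db "report" "large binary payload";
  invalidate "report";
  ignore (get "report"); (* first read goes to the database *)
  ignore (get "report"); (* second read is served from the cache file *)
  Printf.printf "database reads: %d\n" !db_reads
```

The database stays the source of truth, so the cache directory can be wiped at any time; the price is that every write path has to remember to call `invalidate`.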

But the real reason to use an RDB is this: unless you have a baked-in-stone idea of what your document metadata model is, and the queries you’re going to need to run, an RDB will allow you to make changes much more easily than if you use a hierarchical (== “a single index”) data-store.

It also helps that you can port from one RDB to another much more easily than from one low-level store to another.

Also, if you suffer from a disease of success (ha!), an RDB is going to be easier to scale with things like replication to readonly replicas, etc.

P.S. Last: IIUC, PostgreSQL has good support for JSON and at least some kind of binary JSON; you may find this useful too.

I see no need to store files in the GB range. It should be an exception. But I expect around 10^6 ordinary files in the 1-100 MB range, with a resulting store size of 1-100 TB. It seems very large for a “single application” but very small compared with what search engines handle.

Do you mean that an RDB is fine for those sizes?
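
For reference, the sizing arithmetic behind those figures, as a tiny OCaml check. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and a 64-bit OCaml, where `int` is wide enough for these byte counts.

```ocaml
(* Decimal units; assumes 64-bit OCaml ints. *)
let mb = 1_000_000
let tb = 1_000_000_000_000

let files = 1_000_000

(* Total store size in TB if every file has the given size in MB. *)
let store_size_tb ~file_size_mb = files * file_size_mb * mb / tb

let low_tb = store_size_tb ~file_size_mb:1
let high_tb = store_size_tb ~file_size_mb:100

let () = Printf.printf "store size between %d TB and %d TB\n" low_tb high_tb
```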

As I mentioned, the metamodel defining the structure of the repository will evolve, which will allow additional views.

That’s a funny expression! Yes, that’s a kind of disease I suffer from, which drives my interest in established RDBs that offer replication and so on, and which makes me ask this question.
But I’m interested in knowing which use cases are a better fit for irmin and which for an RDB such as PostgreSQL.

Agreed. If these features are key to you, I advise using a relational database and not Irmin – which caters to a different use-case. It doesn’t seem worth trying to implement an RDB feature set on top of Irmin if you have no real need of the specific advantages that it offers.

Can you present some use cases where irmin perfectly fits the storage requirements of an OCaml program and appears to be more interesting than an RDB?

Irmin is especially well-suited to applications that care about:

  • Content-addressability, either for its performance characteristics (i.e. heap sharing) or for its application semantics (i.e. “proof”-like objects and heap-exchange mechanisms). e.g. a distributed cache of build artefacts or a consensus mechanism.

  • Divergent eventually-consistent state, rather than the more strict consistency models offered by traditional DBs. e.g. a highly-decentralised application protocol or a fault-tolerant real-time web application.

  • Platform-agnosticism, in order to be able to deploy infrastructure on a wide range of hardware (including unikernels via MirageOS).
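
As a toy illustration of the first point (and not of how Irmin itself is implemented), a content-addressed heap can be sketched with the standard library alone. Objects are keyed by a hash of their bytes, so identical content is stored once and a key can be checked against the data it names. `Digest` is MD5, used here only because it ships with OCaml; `put` and `get` are illustrative names.

```ocaml
(* Content-addressed heap: objects are keyed by a hash of their bytes. *)
let heap : (string, string) Hashtbl.t = Hashtbl.create 16

(* Storing the same bytes twice yields the same key and a single entry. *)
let put content =
  let key = Digest.to_hex (Digest.string content) in
  Hashtbl.replace heap key content;
  key

let get key = Hashtbl.find_opt heap key

(* Two "versions" of a document that share one unchanged chapter. *)
let v1 = [ put "chapter 1"; put "chapter 2" ]
let v2 = [ put "chapter 1"; put "chapter 2, revised" ]

let () =
  Printf.printf "objects referenced: %d, objects stored: %d\n"
    (List.length v1 + List.length v2)
    (Hashtbl.length heap)
```

The sharing between `v1` and `v2` is what makes keeping many versions cheap, and because a key is derived from the content alone, two machines can exchange objects and verify them independently.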

A good example of an application that fits all of the above criteria is Tezos.

In short: Irmin is suitable for applications that want a distributed, versioned heap and RDBs are suitable for applications that want a centralised store of schematic data with auto-derived indices and a transaction system.

These seem to me to be fairly mutually-exclusive goals, and so I don’t find the comparison to be particularly enlightening in and of itself; I would expect a choice of high-level DB architecture to be more or less forced by the requirements of the application. In your case, I would look into a document-oriented solution that meets your key requirements w.r.t. metadata, indexing and versioning. While it’s certainly possible to implement these in terms of Irmin, I don’t encourage it as a good use of your time :slightly_smiling_face:
